FKLearn Tutorial:

  • It was created to scale machine learning across the company by standardizing model development and providing a simple interface that helps all users follow machine learning best practices
  • Currently powering more than 30 models in production
  • FKLearn was created around 4 principles that guided its development

Input Analysis

Imports

[1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import matplotlib
sns.set_style("whitegrid")
sns.set_palette("husl")

import warnings
warnings.filterwarnings('ignore')

Input Dataset

  • This dataset was created with simulated data about users' credit card spending behavior
  • The model target is the average spend over the next 2 months, and several features related to the target were created (a sketch of how such a target could be derived is shown below)
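For intuition, a target like this can be derived from the monthly spend with a grouped shift. The following is a hypothetical sketch, assuming one row per (id, month); the actual dataset comes from the companion notebook:

# Hypothetical sketch: building a "next 2 months average" target,
# assuming one row per (id, month). Not the companion notebook's code.
df = df.sort_values(["id", "month"])
spend_by_user = df.groupby("id")["monthly_spend"]
df["target"] = (spend_by_user.shift(-1) + spend_by_user.shift(-2)) / 2

Under this construction the last two months of each customer have no target, which matches the null counts seen below.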
[3]:
# Generate this dataset using the FKLearn Tutorial Dataset.ipynb notebook
df = pd.read_csv("fklearn-tutorial-input-dataset.csv")
[4]:
df['month_date'] = pd.to_datetime(df.month_date)
[5]:
df.head()
[5]:
id month income created_at phone_type bureau_score spend_desire random_noise monthly_spend month_date avg_last_2_months_spend target
0 0 17 5015.685403 2018-05-14 samsung NaN 759.448286 929.006227 3452.226271 2018-06-12 NaN 3516.757404
1 0 18 5015.685403 2018-05-14 samsung 522.001029 759.448286 1103.297890 3880.094787 2018-07-13 3666.160529 3228.706602
2 0 19 5015.685403 2018-05-14 samsung 381.931691 759.448286 970.967715 3153.420020 2018-08-13 3516.757404 3357.249359
3 0 20 5015.685403 2018-05-14 samsung 316.660045 759.448286 960.317678 3303.993183 2018-09-13 3228.706602 3358.220025
4 0 21 5015.685403 2018-05-14 samsung 319.226632 759.448286 940.807237 3410.505535 2018-10-14 3357.249359 3235.362465
[6]:
df.describe().T
[6]:
count mean std min 25% 50% 75% max
id 121471.0 5020.204798 2.882829e+03 0.000000 2532.000000 5012.000000 7531.000000 9.999000e+03
month 121471.0 15.629920 5.541509e+00 1.000000 12.000000 17.000000 20.000000 2.300000e+01
income 121471.0 505248.655911 2.179283e+06 302.532705 3758.661316 5179.311664 6627.112964 9.999999e+06
bureau_score 109433.0 295.955427 1.346843e+02 0.005788 198.558112 293.211936 388.348340 9.665136e+02
spend_desire 121471.0 497.221982 2.024356e+02 -356.866864 360.737656 497.799260 635.393654 1.260928e+03
random_noise 121471.0 999.910829 1.000080e+02 514.929020 932.140569 999.841128 1067.019196 1.428127e+03
monthly_spend 121471.0 2643.700884 6.920308e+02 170.815303 2192.303084 2554.895551 2961.684065 6.537560e+03
avg_last_2_months_spend 111471.0 2643.832147 5.860818e+02 188.307429 2238.787433 2585.316456 2984.386894 5.782090e+03
target 101871.0 2639.039170 5.853034e+02 412.076860 2235.043343 2580.730663 2978.910581 5.782090e+03
[7]:
features = ["income", "bureau_score", "spend_desire", "random_noise", "monthly_spend", "avg_last_2_months_spend"]
[8]:
df.isna().sum()
[8]:
id                             0
month                          0
income                         0
created_at                     0
phone_type                     0
bureau_score               12038
spend_desire                   0
random_noise                   0
monthly_spend                  0
month_date                     0
avg_last_2_months_spend    10000
target                     19600
dtype: int64

Features and Target:

$\text{Month } (M) \sim [1, 24]$

$\text{Income } (I) \sim \mathcal{N}(5000, 2000),\ I \in [300, 20000]$

$\text{Phone Type } (P) \sim \{\text{samsung}, \text{motorola}, \text{iphone}, \text{lg}\}$

$\text{Bureau Score } (B) \sim \mathcal{N}(500 / M^{0.1}, 200),\ B \in [0, 1000]$

$\text{Spend Desire } (W) \sim \mathcal{N}(500, 200)$

$\text{Random Noise } (R) \sim \mathcal{N}(1000, 100)$

$\text{Monthly Spend} \sim \max(0,\ (S \cdot I + I^2 + P \cdot W^2 + P \cdot B + \mathcal{N}(1, 0.3)) \cdot \mathcal{N}(2000, 1000))$

$\text{Avg Last 2 Months Spend} \sim (\text{Spend}(m) + \text{Spend}(m-1)) / 2$

$\text{Target} \sim (\text{Spend}(m+1) + \text{Spend}(m+2)) / 2$
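As a rough illustration of the definitions above, the clipped normals could be drawn as follows. This is a hedged sketch, not the companion notebook's generation code:

import numpy as np

rng = np.random.default_rng(42)
n = 1000
month = rng.integers(1, 25, n)                                        # M ~ [1, 24]
income = np.clip(rng.normal(5000, 2000, n), 300, 20000)               # I, clipped to [300, 20000]
bureau_score = np.clip(rng.normal(500 / month ** 0.1, 200), 0, 1000)  # B, mean decays with month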

[9]:
plt.plot(sorted(df.month.unique()), df.groupby("month").agg({"id": "count"}))
plt.title("Amount of customers by month")
[9]:
Text(0.5, 1.0, 'Amount of customers by month')
../_images/examples_fklearn_overview_12_1.png
[10]:
fig, axes = plt.subplots(2, 2, figsize=(8,8))
axes[0, 0].hist(df.bureau_score, range(0, 1000, 50))
axes[1, 0].hist(df.income, range(300, 20000, 500))
axes[0, 1].hist(df.spend_desire)
axes[1, 1].hist(df.monthly_spend, range(0, 10000, 500))

titles = ["bureau_histogram", "income_histogram", "spend_desire_histogram", "monthly_spend_histogram"]
axes[0, 0].set_title(titles[0])
axes[1, 0].set_title(titles[1])
axes[0, 1].set_title(titles[2])
axes[1, 1].set_title(titles[3])
plt.show()
../_images/examples_fklearn_overview_13_0.png
[11]:
fig, axes = plt.subplots(2, 2, figsize=(10,10))
sns.pointplot(x="month", y="bureau_score", data=df, ax=axes[0, 0])
sns.pointplot(x="month", y="income", data=df, ax=axes[1, 0])
sns.pointplot(x="month", y="spend_desire", data=df, ax=axes[0, 1])
sns.pointplot(x="month", y="monthly_spend", data=df, ax=axes[1, 1])
plt.show()
../_images/examples_fklearn_overview_14_0.png

Target Analysis

[12]:
pd.DataFrame(df.groupby("month_date").apply(lambda x: x.target.isna().sum()), columns=["null_count_by_month"])
[12]:
null_count_by_month
month_date
2017-02-01 0
2017-03-04 0
2017-04-04 0
2017-05-05 0
2017-06-05 0
2017-07-06 0
2017-08-06 0
2017-09-06 0
2017-10-07 0
2017-11-07 0
2017-12-08 0
2018-01-08 0
2018-02-08 0
2018-03-11 0
2018-04-11 0
2018-05-12 0
2018-06-12 0
2018-07-13 0
2018-08-13 0
2018-09-13 0
2018-10-14 0
2018-11-14 9600
2018-12-15 10000
[13]:
sns.pointplot(x="month", y="target", data=df)
[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x10bce9748>
../_images/examples_fklearn_overview_17_1.png

Getting started with fklearn: Creating a simple model

  • FKLearn is focused on creating production-ready models, but it is flexible enough for research and learning tasks
  • Traditional steps of training a model, such as train-test splitting, training, and validating, are all implemented in fklearn

Splitting the dataset into train and holdout

  • On real problems we want to validate our model both within the same period and on a later period, to guarantee it remains a good predictor in the future
  • In this case of customer spend, we would also like to validate the model's performance on a different set of customers, to check for overfitting to the training population (a quick sanity check of the split is sketched below)
[14]:
from fklearn.preprocessing.splitting import space_time_split_dataset

train_set, intime_outspace_hdout, outime_inspace_hdout, outime_outspace_hdout = \
                                        space_time_split_dataset(df,
                                                 train_start_date="2017-02-01",
                                                 train_end_date="2018-07-01",
                                                 holdout_end_date="2018-10-01",
                                                 split_seed=42,
                                                 space_holdout_percentage=0.2,
                                                 space_column="id",
                                                 time_column="month_date")
[15]:
(train_set.shape,
 intime_outspace_hdout.shape,
 outime_inspace_hdout.shape,
 outime_outspace_hdout.shape)
[15]:
((54206, 12), (13421, 12), (25055, 12), (4488, 12))
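A quick sanity check of the split semantics, written as a plain pandas sketch (not an fklearn API): the out-of-space holdouts should share no customers with the training set, and the out-of-time holdouts should start at or after train_end_date.

# Sketch: verify the space/time split semantics on this data.
assert set(train_set["id"]).isdisjoint(set(intime_outspace_hdout["id"]))
assert set(train_set["id"]).isdisjoint(set(outime_outspace_hdout["id"]))
assert outime_inspace_hdout["month_date"].min() >= pd.Timestamp("2018-07-01")
assert train_set["month_date"].max() < pd.Timestamp("2018-07-01")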

Training:

In the training process we want to:

  • Cap features that have unexpected values
  • Encode categorical features
  • Fill missing values
  • Train our model
  • Apply these same transformations when scoring, similar to sklearn's fit and transform concept (the learner contract that makes this possible is sketched after the pipeline below)

[16]:
from fklearn.training.transformation import capper, prediction_ranger, label_categorizer
from fklearn.training.imputation import imputer
from fklearn.training.regression import lgbm_regression_learner
capper_fn = capper(columns_to_cap=["income"], precomputed_caps={"income": 20000.0})
ranger_fn = prediction_ranger(prediction_min=0.0, prediction_max=100000.0, prediction_column="prediction")
label_fn = label_categorizer(columns_to_categorize=["phone_type"])
imputer_fn = imputer(columns_to_impute=["bureau_score"], impute_strategy="median")
regression_fn = lgbm_regression_learner(features=features, target="target", learning_rate=0.1, num_estimators=200)
[17]:
from fklearn.training.pipeline import build_pipeline
train_fn = build_pipeline(label_fn, capper_fn, imputer_fn, regression_fn, ranger_fn)
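
build_pipeline relies on every fklearn learner following the same contract: once configured (curried with its parameters), a learner is a function from a DataFrame to a triple. A minimal sketch of that contract, using the imputer defined above:

# The fklearn learner contract (sketch): applying a configured learner
# to training data returns three things.
predict_fn_step, transformed_df, log = imputer_fn(train_set)
# predict_fn_step -> applies the fitted transformation to new data
# transformed_df  -> the training data after the transformation
# log             -> metadata about what was fitted (e.g. the learned median)

build_pipeline chains the learners so that each step is fit on the output of the previous one, and the returned predict_fn replays all steps in order.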
[18]:
predict_fn, scored_train_set, train_logs = train_fn(train_set)
/Users/henriquelopes/miniconda3/lib/python3.6/site-packages/sklearn/utils/deprecation.py:66: DeprecationWarning: Class Imputer is deprecated; Imputer was deprecated in version 0.20 and will be removed in 0.22. Import impute.SimpleImputer from sklearn instead.
  warnings.warn(msg, category=DeprecationWarning)
[19]:
scored_train_set.head()
[19]:
id month income created_at phone_type bureau_score spend_desire random_noise monthly_spend month_date avg_last_2_months_spend target prediction
9 2 3 2424.105189 2017-03-28 0 611.650913 641.98234 858.263250 3059.915274 2017-04-04 NaN 3492.075105 2886.165513
10 2 4 20000.000000 2017-03-28 0 310.829209 641.98234 984.614824 4447.520879 2017-05-05 3753.718077 2425.698944 2788.816923
11 2 5 2424.105189 2017-03-28 0 310.829209 641.98234 1004.881140 2536.629332 2017-06-05 3492.075105 2256.332465 2783.500567
12 2 6 2424.105189 2017-03-28 0 156.248181 641.98234 960.071795 2314.768557 2017-07-06 2425.698944 3476.051462 2758.309663
13 2 7 2424.105189 2017-03-28 0 167.091261 641.98234 991.301503 2197.896373 2017-08-06 2256.332465 3648.279899 2789.522977
[20]:
pd.options.display.max_rows = 120
pd.DataFrame(train_logs['lgbm_regression_learner']['feature_importance'],
             index=["importance"]).T.sort_values("importance", ascending=False)
[20]:
importance
avg_last_2_months_spend 1149
spend_desire 1129
bureau_score 987
income 967
monthly_spend 960
random_noise 808
[21]:
scored_outime_outspace_hdout = predict_fn(outime_outspace_hdout)
scored_intime_outspace_hdout = predict_fn(intime_outspace_hdout)
scored_outime_inspace_hdout = predict_fn(outime_inspace_hdout)
[22]:
scored_outime_outspace_hdout[["id", "month_date", "income", "bureau_score", "target", "prediction"]].head()
[22]:
id month_date income bureau_score target prediction
1 0 2018-07-13 5015.685403 522.001029 3228.706602 3451.870684
2 0 2018-08-13 5015.685403 381.931691 3357.249359 3210.763017
3 0 2018-09-13 5015.685403 316.660045 3358.220025 3385.630512
148 13 2018-07-13 20000.000000 330.107289 2107.578901 2398.327320
149 13 2018-08-13 1731.884844 260.790931 2091.307307 2328.283377

Evaluation:

  • We want to evaluate the performance of our model using multiple metrics for each dataframe (out_of_time, out_of_space, out_of_time_and_space)

[23]:
from fklearn.validation.evaluators import r2_evaluator, spearman_evaluator, combined_evaluators
[24]:
r2_eval_fn = r2_evaluator(prediction_column="prediction", target_column="target")
spearman_eval_fn = spearman_evaluator(prediction_column="prediction", target_column="target")
eval_fn = combined_evaluators(evaluators=[r2_eval_fn, spearman_eval_fn])
[25]:
outime_outspace_hdout_logs = eval_fn(scored_outime_outspace_hdout)
intime_outspace_hdout_logs = eval_fn(scored_intime_outspace_hdout)
outime_inspace_hdout_logs = eval_fn(scored_outime_inspace_hdout)
[26]:
{"out_of_time": outime_outspace_hdout_logs,
 "in_time": intime_outspace_hdout_logs,
  "out_of_time_and_space": outime_inspace_hdout_logs}
[26]:
{'out_of_time': {'r2_evaluator__target': 0.5956009555286967,
  'spearman_evaluator__target': 0.7892796752342339},
 'in_time': {'r2_evaluator__target': 0.588576925917941,
  'spearman_evaluator__target': 0.787194941139121},
 'out_of_time_and_space': {'r2_evaluator__target': 0.5749628505861282,
  'spearman_evaluator__target': 0.7790056568693072}}

Extractors

  • We want to transform these logs into dataframes that are easier to visualize

[27]:
from fklearn.metrics.pd_extractors import evaluator_extractor, combined_evaluator_extractor, extract
r2_extractor = evaluator_extractor(evaluator_name="r2_evaluator__target")
spearman_extractor = evaluator_extractor(evaluator_name="spearman_evaluator__target")
full_extractor = combined_evaluator_extractor(base_extractors=[r2_extractor, spearman_extractor])
[28]:
pd.concat(
    [full_extractor(outime_outspace_hdout_logs).assign(part="out_of_time"),
    full_extractor(intime_outspace_hdout_logs).assign(part="in_time_out_of_space"),
    full_extractor(outime_inspace_hdout_logs).assign(part="out_of_time_and_space")])
[28]:
r2_evaluator__target spearman_evaluator__target part
0 0.595601 0.789280 out_of_time
0 0.588577 0.787195 in_time_out_of_space
0 0.574963 0.779006 out_of_time_and_space

Learning Curves:

We can reuse our previously defined training function and evaluators!

Could we improve if we had more data? AKA the Spatial Learning Curve

  • If we had more data we could use it for training and see how much the model improves
  • To estimate that, we can train on subsamples of the full training set and check how performance changes

[29]:
from fklearn.validation.splitters import spatial_learning_curve_splitter

split_fn = spatial_learning_curve_splitter(space_column="id",
                                           time_column="month_date",
                                           training_limit="2018-04-01",
                                           train_percentages=[0.1, 0.2, 0.4, 0.6, 0.8, 1.0])
[30]:
from fklearn.validation.validator import parallel_validator

spatial_learning_curve_logs = parallel_validator(train_set, split_fn, train_fn, eval_fn, n_jobs=8)
/Users/henriquelopes/miniconda3/lib/python3.6/site-packages/sklearn/utils/deprecation.py:66: DeprecationWarning: Class Imputer is deprecated; Imputer was deprecated in version 0.20 and will be removed in 0.22. Import impute.SimpleImputer from sklearn instead.
  warnings.warn(msg, category=DeprecationWarning)
[31]:
spatial_learning_curve_logs
[31]:
{'train_log': [{'label_categorizer': {'transformed_column': ['phone_type'],
    'replace_unseen': nan,
    'running_time': '0.022 s'},
   'capper': {'caps': {'income': 20000.0},
    'transformed_column': ['income'],
    'precomputed_caps': {'income': 20000.0},
    'running_time': '0.076 s'},
   'imputer': {'impute_strategy': 'median',
    'columns_to_impute': ['bureau_score'],
    'training_proportion_of_nulls': {'bureau_score': 0.09924550203134068},
    'statistics': array([325.16258753]),
    'running_time': '0.038 s'},
   'lgbm_regression_learner': {'features': ['income',
     'bureau_score',
     'spend_desire',
     'random_noise',
     'monthly_spend',
     'avg_last_2_months_spend'],
    'target': 'target',
    'prediction_column': 'prediction',
    'package': 'lightgbm',
    'package_version': '2.2.3',
    'parameters': {'eta': 0.1,
     'objective': 'regression',
     'num_estimators': 200},
    'feature_importance': {'income': 979,
     'bureau_score': 1008,
     'spend_desire': 1048,
     'random_noise': 958,
     'monthly_spend': 844,
     'avg_last_2_months_spend': 1163},
    'training_samples': 5169,
    'running_time': '7.715 s'},
   'object': <lightgbm.basic.Booster at 0x1a2263c5f8>,
   'prediction_ranger': {'prediction_min': 0.0,
    'prediction_max': 100000.0,
    'transformed_column': ['prediction']},
   '__fkml__': {'pipeline': ['label_categorizer',
     'capper',
     'imputer',
     'lgbm_regression_learner',
     'prediction_ranger'],
    'output_columns': ['id',
     'month',
     'income',
     'created_at',
     'phone_type',
     'bureau_score',
     'spend_desire',
     'random_noise',
     'monthly_spend',
     'month_date',
     'avg_last_2_months_spend',
     'target',
     'prediction'],
    'features': ['id',
     'month',
     'income',
     'created_at',
     'phone_type',
     'bureau_score',
     'spend_desire',
     'random_noise',
     'monthly_spend',
     'month_date',
     'avg_last_2_months_spend',
     'target'],
    'learners': {'label_categorizer': {'fn': <function fklearn.training.transformation.label_categorizer.<locals>.p(new_df:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'label_categorizer': {'transformed_column': ['phone_type'],
        'replace_unseen': nan,
        'running_time': '0.022 s'}}},
     'capper': {'fn': <function fklearn.training.transformation.capper.<locals>.p(new_data_set:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'capper': {'caps': {'income': 20000.0},
        'transformed_column': ['income'],
        'precomputed_caps': {'income': 20000.0},
        'running_time': '0.076 s'}}},
     'imputer': {'fn': <function fklearn.training.imputation.imputer.<locals>.p(new_data_set:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'imputer': {'impute_strategy': 'median',
        'columns_to_impute': ['bureau_score'],
        'training_proportion_of_nulls': {'bureau_score': 0.09924550203134068},
        'statistics': array([325.16258753]),
        'running_time': '0.038 s'}}},
     'lgbm_regression_learner': {'fn': <function fklearn.training.regression.lgbm_regression_learner.<locals>.p(new_df:pandas.core.frame.DataFrame, apply_shap:bool=False) -> pandas.core.frame.DataFrame>,
      'log': {'lgbm_regression_learner': {'features': ['income',
         'bureau_score',
         'spend_desire',
         'random_noise',
         'monthly_spend',
         'avg_last_2_months_spend'],
        'target': 'target',
        'prediction_column': 'prediction',
        'package': 'lightgbm',
        'package_version': '2.2.3',
        'parameters': {'eta': 0.1,
         'objective': 'regression',
         'num_estimators': 200},
        'feature_importance': {'income': 979,
         'bureau_score': 1008,
         'spend_desire': 1048,
         'random_noise': 958,
         'monthly_spend': 844,
         'avg_last_2_months_spend': 1163},
        'training_samples': 5169,
        'running_time': '7.715 s'},
       'object': <lightgbm.basic.Booster at 0x1a2263c5f8>}},
     'prediction_ranger': {'fn': <function fklearn.training.transformation.prediction_ranger.<locals>.p(new_df:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'prediction_ranger': {'prediction_min': 0.0,
        'prediction_max': 100000.0,
        'transformed_column': ['prediction']}}}}}},
  {'label_categorizer': {'transformed_column': ['phone_type'],
    'replace_unseen': nan,
    'running_time': '0.052 s'},
   'capper': {'caps': {'income': 20000.0},
    'transformed_column': ['income'],
    'precomputed_caps': {'income': 20000.0},
    'running_time': '0.049 s'},
   'imputer': {'impute_strategy': 'median',
    'columns_to_impute': ['bureau_score'],
    'training_proportion_of_nulls': {'bureau_score': 0.10143635463100545},
    'statistics': array([326.11326198]),
    'running_time': '0.032 s'},
   'lgbm_regression_learner': {'features': ['income',
     'bureau_score',
     'spend_desire',
     'random_noise',
     'monthly_spend',
     'avg_last_2_months_spend'],
    'target': 'target',
    'prediction_column': 'prediction',
    'package': 'lightgbm',
    'package_version': '2.2.3',
    'parameters': {'eta': 0.1,
     'objective': 'regression',
     'num_estimators': 200},
    'feature_importance': {'income': 997,
     'bureau_score': 965,
     'spend_desire': 1028,
     'random_noise': 907,
     'monthly_spend': 895,
     'avg_last_2_months_spend': 1208},
    'training_samples': 10095,
    'running_time': '7.943 s'},
   'object': <lightgbm.basic.Booster at 0x1a2263c828>,
   'prediction_ranger': {'prediction_min': 0.0,
    'prediction_max': 100000.0,
    'transformed_column': ['prediction']},
   '__fkml__': {'pipeline': ['label_categorizer',
     'capper',
     'imputer',
     'lgbm_regression_learner',
     'prediction_ranger'],
    'output_columns': ['id',
     'month',
     'income',
     'created_at',
     'phone_type',
     'bureau_score',
     'spend_desire',
     'random_noise',
     'monthly_spend',
     'month_date',
     'avg_last_2_months_spend',
     'target',
     'prediction'],
    'features': ['id',
     'month',
     'income',
     'created_at',
     'phone_type',
     'bureau_score',
     'spend_desire',
     'random_noise',
     'monthly_spend',
     'month_date',
     'avg_last_2_months_spend',
     'target'],
    'learners': {'label_categorizer': {'fn': <function fklearn.training.transformation.label_categorizer.<locals>.p(new_df:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'label_categorizer': {'transformed_column': ['phone_type'],
        'replace_unseen': nan,
        'running_time': '0.052 s'}}},
     'capper': {'fn': <function fklearn.training.transformation.capper.<locals>.p(new_data_set:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'capper': {'caps': {'income': 20000.0},
        'transformed_column': ['income'],
        'precomputed_caps': {'income': 20000.0},
        'running_time': '0.049 s'}}},
     'imputer': {'fn': <function fklearn.training.imputation.imputer.<locals>.p(new_data_set:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'imputer': {'impute_strategy': 'median',
        'columns_to_impute': ['bureau_score'],
        'training_proportion_of_nulls': {'bureau_score': 0.10143635463100545},
        'statistics': array([326.11326198]),
        'running_time': '0.032 s'}}},
     'lgbm_regression_learner': {'fn': <function fklearn.training.regression.lgbm_regression_learner.<locals>.p(new_df:pandas.core.frame.DataFrame, apply_shap:bool=False) -> pandas.core.frame.DataFrame>,
      'log': {'lgbm_regression_learner': {'features': ['income',
         'bureau_score',
         'spend_desire',
         'random_noise',
         'monthly_spend',
         'avg_last_2_months_spend'],
        'target': 'target',
        'prediction_column': 'prediction',
        'package': 'lightgbm',
        'package_version': '2.2.3',
        'parameters': {'eta': 0.1,
         'objective': 'regression',
         'num_estimators': 200},
        'feature_importance': {'income': 997,
         'bureau_score': 965,
         'spend_desire': 1028,
         'random_noise': 907,
         'monthly_spend': 895,
         'avg_last_2_months_spend': 1208},
        'training_samples': 10095,
        'running_time': '7.943 s'},
       'object': <lightgbm.basic.Booster at 0x1a2263c828>}},
     'prediction_ranger': {'fn': <function fklearn.training.transformation.prediction_ranger.<locals>.p(new_df:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'prediction_ranger': {'prediction_min': 0.0,
        'prediction_max': 100000.0,
        'transformed_column': ['prediction']}}}}}},
  {'label_categorizer': {'transformed_column': ['phone_type'],
    'replace_unseen': nan,
    'running_time': '0.091 s'},
   'capper': {'caps': {'income': 20000.0},
    'transformed_column': ['income'],
    'precomputed_caps': {'income': 20000.0},
    'running_time': '0.025 s'},
   'imputer': {'impute_strategy': 'median',
    'columns_to_impute': ['bureau_score'],
    'training_proportion_of_nulls': {'bureau_score': 0.09938140148058006},
    'statistics': array([326.0731499]),
    'running_time': '0.040 s'},
   'lgbm_regression_learner': {'features': ['income',
     'bureau_score',
     'spend_desire',
     'random_noise',
     'monthly_spend',
     'avg_last_2_months_spend'],
    'target': 'target',
    'prediction_column': 'prediction',
    'package': 'lightgbm',
    'package_version': '2.2.3',
    'parameters': {'eta': 0.1,
     'objective': 'regression',
     'num_estimators': 200},
    'feature_importance': {'income': 1025,
     'bureau_score': 957,
     'spend_desire': 1028,
     'random_noise': 912,
     'monthly_spend': 917,
     'avg_last_2_months_spend': 1161},
    'training_samples': 19722,
    'running_time': '8.104 s'},
   'object': <lightgbm.basic.Booster at 0x1a22647048>,
   'prediction_ranger': {'prediction_min': 0.0,
    'prediction_max': 100000.0,
    'transformed_column': ['prediction']},
   '__fkml__': {'pipeline': ['label_categorizer',
     'capper',
     'imputer',
     'lgbm_regression_learner',
     'prediction_ranger'],
    'output_columns': ['id',
     'month',
     'income',
     'created_at',
     'phone_type',
     'bureau_score',
     'spend_desire',
     'random_noise',
     'monthly_spend',
     'month_date',
     'avg_last_2_months_spend',
     'target',
     'prediction'],
    'features': ['id',
     'month',
     'income',
     'created_at',
     'phone_type',
     'bureau_score',
     'spend_desire',
     'random_noise',
     'monthly_spend',
     'month_date',
     'avg_last_2_months_spend',
     'target'],
    'learners': {'label_categorizer': {'fn': <function fklearn.training.transformation.label_categorizer.<locals>.p(new_df:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'label_categorizer': {'transformed_column': ['phone_type'],
        'replace_unseen': nan,
        'running_time': '0.091 s'}}},
     'capper': {'fn': <function fklearn.training.transformation.capper.<locals>.p(new_data_set:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'capper': {'caps': {'income': 20000.0},
        'transformed_column': ['income'],
        'precomputed_caps': {'income': 20000.0},
        'running_time': '0.025 s'}}},
     'imputer': {'fn': <function fklearn.training.imputation.imputer.<locals>.p(new_data_set:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'imputer': {'impute_strategy': 'median',
        'columns_to_impute': ['bureau_score'],
        'training_proportion_of_nulls': {'bureau_score': 0.09938140148058006},
        'statistics': array([326.0731499]),
        'running_time': '0.040 s'}}},
     'lgbm_regression_learner': {'fn': <function fklearn.training.regression.lgbm_regression_learner.<locals>.p(new_df:pandas.core.frame.DataFrame, apply_shap:bool=False) -> pandas.core.frame.DataFrame>,
      'log': {'lgbm_regression_learner': {'features': ['income',
         'bureau_score',
         'spend_desire',
         'random_noise',
         'monthly_spend',
         'avg_last_2_months_spend'],
        'target': 'target',
        'prediction_column': 'prediction',
        'package': 'lightgbm',
        'package_version': '2.2.3',
        'parameters': {'eta': 0.1,
         'objective': 'regression',
         'num_estimators': 200},
        'feature_importance': {'income': 1025,
         'bureau_score': 957,
         'spend_desire': 1028,
         'random_noise': 912,
         'monthly_spend': 917,
         'avg_last_2_months_spend': 1161},
        'training_samples': 19722,
        'running_time': '8.104 s'},
       'object': <lightgbm.basic.Booster at 0x1a22647048>}},
     'prediction_ranger': {'fn': <function fklearn.training.transformation.prediction_ranger.<locals>.p(new_df:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'prediction_ranger': {'prediction_min': 0.0,
        'prediction_max': 100000.0,
        'transformed_column': ['prediction']}}}}}},
  {'label_categorizer': {'transformed_column': ['phone_type'],
    'replace_unseen': nan,
    'running_time': '0.077 s'},
   'capper': {'caps': {'income': 20000.0},
    'transformed_column': ['income'],
    'precomputed_caps': {'income': 20000.0},
    'running_time': '0.023 s'},
   'imputer': {'impute_strategy': 'median',
    'columns_to_impute': ['bureau_score'],
    'training_proportion_of_nulls': {'bureau_score': 0.09925883896592078},
    'statistics': array([324.7631039]),
    'running_time': '0.060 s'},
   'lgbm_regression_learner': {'features': ['income',
     'bureau_score',
     'spend_desire',
     'random_noise',
     'monthly_spend',
     'avg_last_2_months_spend'],
    'target': 'target',
    'prediction_column': 'prediction',
    'package': 'lightgbm',
    'package_version': '2.2.3',
    'parameters': {'eta': 0.1,
     'objective': 'regression',
     'num_estimators': 200},
    'feature_importance': {'income': 1031,
     'bureau_score': 999,
     'spend_desire': 1074,
     'random_noise': 844,
     'monthly_spend': 903,
     'avg_last_2_months_spend': 1149},
    'training_samples': 28199,
    'running_time': '8.213 s'},
   'object': <lightgbm.basic.Booster at 0x1a226472b0>,
   'prediction_ranger': {'prediction_min': 0.0,
    'prediction_max': 100000.0,
    'transformed_column': ['prediction']},
   '__fkml__': {'pipeline': ['label_categorizer',
     'capper',
     'imputer',
     'lgbm_regression_learner',
     'prediction_ranger'],
    'output_columns': ['id',
     'month',
     'income',
     'created_at',
     'phone_type',
     'bureau_score',
     'spend_desire',
     'random_noise',
     'monthly_spend',
     'month_date',
     'avg_last_2_months_spend',
     'target',
     'prediction'],
    'features': ['id',
     'month',
     'income',
     'created_at',
     'phone_type',
     'bureau_score',
     'spend_desire',
     'random_noise',
     'monthly_spend',
     'month_date',
     'avg_last_2_months_spend',
     'target'],
    'learners': {'label_categorizer': {'fn': <function fklearn.training.transformation.label_categorizer.<locals>.p(new_df:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'label_categorizer': {'transformed_column': ['phone_type'],
        'replace_unseen': nan,
        'running_time': '0.077 s'}}},
     'capper': {'fn': <function fklearn.training.transformation.capper.<locals>.p(new_data_set:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'capper': {'caps': {'income': 20000.0},
        'transformed_column': ['income'],
        'precomputed_caps': {'income': 20000.0},
        'running_time': '0.023 s'}}},
     'imputer': {'fn': <function fklearn.training.imputation.imputer.<locals>.p(new_data_set:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'imputer': {'impute_strategy': 'median',
        'columns_to_impute': ['bureau_score'],
        'training_proportion_of_nulls': {'bureau_score': 0.09925883896592078},
        'statistics': array([324.7631039]),
        'running_time': '0.060 s'}}},
     'lgbm_regression_learner': {'fn': <function fklearn.training.regression.lgbm_regression_learner.<locals>.p(new_df:pandas.core.frame.DataFrame, apply_shap:bool=False) -> pandas.core.frame.DataFrame>,
      'log': {'lgbm_regression_learner': {'features': ['income',
         'bureau_score',
         'spend_desire',
         'random_noise',
         'monthly_spend',
         'avg_last_2_months_spend'],
        'target': 'target',
        'prediction_column': 'prediction',
        'package': 'lightgbm',
        'package_version': '2.2.3',
        'parameters': {'eta': 0.1,
         'objective': 'regression',
         'num_estimators': 200},
        'feature_importance': {'income': 1031,
         'bureau_score': 999,
         'spend_desire': 1074,
         'random_noise': 844,
         'monthly_spend': 903,
         'avg_last_2_months_spend': 1149},
        'training_samples': 28199,
        'running_time': '8.213 s'},
       'object': <lightgbm.basic.Booster at 0x1a226472b0>}},
     'prediction_ranger': {'fn': <function fklearn.training.transformation.prediction_ranger.<locals>.p(new_df:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'prediction_ranger': {'prediction_min': 0.0,
        'prediction_max': 100000.0,
        'transformed_column': ['prediction']}}}}}},
  {'label_categorizer': {'transformed_column': ['phone_type'],
    'replace_unseen': nan,
    'running_time': '0.102 s'},
   'capper': {'caps': {'income': 20000.0},
    'transformed_column': ['income'],
    'precomputed_caps': {'income': 20000.0},
    'running_time': '0.026 s'},
   'imputer': {'impute_strategy': 'median',
    'columns_to_impute': ['bureau_score'],
    'training_proportion_of_nulls': {'bureau_score': 0.09956821366274914},
    'statistics': array([323.60346949]),
    'running_time': '0.052 s'},
   'lgbm_regression_learner': {'features': ['income',
     'bureau_score',
     'spend_desire',
     'random_noise',
     'monthly_spend',
     'avg_last_2_months_spend'],
    'target': 'target',
    'prediction_column': 'prediction',
    'package': 'lightgbm',
    'package_version': '2.2.3',
    'parameters': {'eta': 0.1,
     'objective': 'regression',
     'num_estimators': 200},
    'feature_importance': {'income': 1031,
     'bureau_score': 984,
     'spend_desire': 1092,
     'random_noise': 831,
     'monthly_spend': 910,
     'avg_last_2_months_spend': 1152},
    'training_samples': 34971,
    'running_time': '8.448 s'},
   'object': <lightgbm.basic.Booster at 0x1a226473c8>,
   'prediction_ranger': {'prediction_min': 0.0,
    'prediction_max': 100000.0,
    'transformed_column': ['prediction']},
   '__fkml__': {'pipeline': ['label_categorizer',
     'capper',
     'imputer',
     'lgbm_regression_learner',
     'prediction_ranger'],
    'output_columns': ['id',
     'month',
     'income',
     'created_at',
     'phone_type',
     'bureau_score',
     'spend_desire',
     'random_noise',
     'monthly_spend',
     'month_date',
     'avg_last_2_months_spend',
     'target',
     'prediction'],
    'features': ['id',
     'month',
     'income',
     'created_at',
     'phone_type',
     'bureau_score',
     'spend_desire',
     'random_noise',
     'monthly_spend',
     'month_date',
     'avg_last_2_months_spend',
     'target'],
    'learners': {'label_categorizer': {'fn': <function fklearn.training.transformation.label_categorizer.<locals>.p(new_df:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'label_categorizer': {'transformed_column': ['phone_type'],
        'replace_unseen': nan,
        'running_time': '0.102 s'}}},
     'capper': {'fn': <function fklearn.training.transformation.capper.<locals>.p(new_data_set:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'capper': {'caps': {'income': 20000.0},
        'transformed_column': ['income'],
        'precomputed_caps': {'income': 20000.0},
        'running_time': '0.026 s'}}},
     'imputer': {'fn': <function fklearn.training.imputation.imputer.<locals>.p(new_data_set:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'imputer': {'impute_strategy': 'median',
        'columns_to_impute': ['bureau_score'],
        'training_proportion_of_nulls': {'bureau_score': 0.09956821366274914},
        'statistics': array([323.60346949]),
        'running_time': '0.052 s'}}},
     'lgbm_regression_learner': {'fn': <function fklearn.training.regression.lgbm_regression_learner.<locals>.p(new_df:pandas.core.frame.DataFrame, apply_shap:bool=False) -> pandas.core.frame.DataFrame>,
      'log': {'lgbm_regression_learner': {'features': ['income',
         'bureau_score',
         'spend_desire',
         'random_noise',
         'monthly_spend',
         'avg_last_2_months_spend'],
        'target': 'target',
        'prediction_column': 'prediction',
        'package': 'lightgbm',
        'package_version': '2.2.3',
        'parameters': {'eta': 0.1,
         'objective': 'regression',
         'num_estimators': 200},
        'feature_importance': {'income': 1031,
         'bureau_score': 984,
         'spend_desire': 1092,
         'random_noise': 831,
         'monthly_spend': 910,
         'avg_last_2_months_spend': 1152},
        'training_samples': 34971,
        'running_time': '8.448 s'},
       'object': <lightgbm.basic.Booster at 0x1a226473c8>}},
     'prediction_ranger': {'fn': <function fklearn.training.transformation.prediction_ranger.<locals>.p(new_df:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'prediction_ranger': {'prediction_min': 0.0,
        'prediction_max': 100000.0,
        'transformed_column': ['prediction']}}}}}},
  {'label_categorizer': {'transformed_column': ['phone_type'],
    'replace_unseen': nan,
    'running_time': '0.074 s'},
   'capper': {'caps': {'income': 20000.0},
    'transformed_column': ['income'],
    'precomputed_caps': {'income': 20000.0},
    'running_time': '0.023 s'},
   'imputer': {'impute_strategy': 'median',
    'columns_to_impute': ['bureau_score'],
    'training_proportion_of_nulls': {'bureau_score': 0.10037035049111695},
    'statistics': array([322.60691645]),
    'running_time': '0.063 s'},
   'lgbm_regression_learner': {'features': ['income',
     'bureau_score',
     'spend_desire',
     'random_noise',
     'monthly_spend',
     'avg_last_2_months_spend'],
    'target': 'target',
    'prediction_column': 'prediction',
    'package': 'lightgbm',
    'package_version': '2.2.3',
    'parameters': {'eta': 0.1,
     'objective': 'regression',
     'num_estimators': 200},
    'feature_importance': {'income': 1032,
     'bureau_score': 973,
     'spend_desire': 1118,
     'random_noise': 814,
     'monthly_spend': 947,
     'avg_last_2_months_spend': 1116},
    'training_samples': 37262,
    'running_time': '8.429 s'},
   'object': <lightgbm.basic.Booster at 0x1a2263cc18>,
   'prediction_ranger': {'prediction_min': 0.0,
    'prediction_max': 100000.0,
    'transformed_column': ['prediction']},
   '__fkml__': {'pipeline': ['label_categorizer',
     'capper',
     'imputer',
     'lgbm_regression_learner',
     'prediction_ranger'],
    'output_columns': ['id',
     'month',
     'income',
     'created_at',
     'phone_type',
     'bureau_score',
     'spend_desire',
     'random_noise',
     'monthly_spend',
     'month_date',
     'avg_last_2_months_spend',
     'target',
     'prediction'],
    'features': ['id',
     'month',
     'income',
     'created_at',
     'phone_type',
     'bureau_score',
     'spend_desire',
     'random_noise',
     'monthly_spend',
     'month_date',
     'avg_last_2_months_spend',
     'target'],
    'learners': {'label_categorizer': {'fn': <function fklearn.training.transformation.label_categorizer.<locals>.p(new_df:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'label_categorizer': {'transformed_column': ['phone_type'],
        'replace_unseen': nan,
        'running_time': '0.074 s'}}},
     'capper': {'fn': <function fklearn.training.transformation.capper.<locals>.p(new_data_set:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'capper': {'caps': {'income': 20000.0},
        'transformed_column': ['income'],
        'precomputed_caps': {'income': 20000.0},
        'running_time': '0.023 s'}}},
     'imputer': {'fn': <function fklearn.training.imputation.imputer.<locals>.p(new_data_set:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'imputer': {'impute_strategy': 'median',
        'columns_to_impute': ['bureau_score'],
        'training_proportion_of_nulls': {'bureau_score': 0.10037035049111695},
        'statistics': array([322.60691645]),
        'running_time': '0.063 s'}}},
     'lgbm_regression_learner': {'fn': <function fklearn.training.regression.lgbm_regression_learner.<locals>.p(new_df:pandas.core.frame.DataFrame, apply_shap:bool=False) -> pandas.core.frame.DataFrame>,
      'log': {'lgbm_regression_learner': {'features': ['income',
         'bureau_score',
         'spend_desire',
         'random_noise',
         'monthly_spend',
         'avg_last_2_months_spend'],
        'target': 'target',
        'prediction_column': 'prediction',
        'package': 'lightgbm',
        'package_version': '2.2.3',
        'parameters': {'eta': 0.1,
         'objective': 'regression',
         'num_estimators': 200},
        'feature_importance': {'income': 1032,
         'bureau_score': 973,
         'spend_desire': 1118,
         'random_noise': 814,
         'monthly_spend': 947,
         'avg_last_2_months_spend': 1116},
        'training_samples': 37262,
        'running_time': '8.429 s'},
       'object': <lightgbm.basic.Booster at 0x1a2263cc18>}},
     'prediction_ranger': {'fn': <function fklearn.training.transformation.prediction_ranger.<locals>.p(new_df:pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame>,
      'log': {'prediction_ranger': {'prediction_min': 0.0,
        'prediction_max': 100000.0,
        'transformed_column': ['prediction']}}}}}}],
 'validator_log': [{'fold_num': 0,
   'eval_results': [{'r2_evaluator__target': 0.5086378476205488,
     'spearman_evaluator__target': 0.7386593691328156}],
   'split_log': {'train_start': Timestamp('2017-02-01 00:00:00'),
    'train_end': Timestamp('2018-03-11 00:00:00'),
    'train_size': 5169,
    'test_start': Timestamp('2018-04-11 00:00:00'),
    'test_end': Timestamp('2018-06-12 00:00:00'),
    'test_size': 16944,
    'percentage': 0.1}},
  {'fold_num': 1,
   'eval_results': [{'r2_evaluator__target': 0.528636837654197,
     'spearman_evaluator__target': 0.7559395783320386}],
   'split_log': {'train_start': Timestamp('2017-02-01 00:00:00'),
    'train_end': Timestamp('2018-03-11 00:00:00'),
    'train_size': 10095,
    'test_start': Timestamp('2018-04-11 00:00:00'),
    'test_end': Timestamp('2018-06-12 00:00:00'),
    'test_size': 16944,
    'percentage': 0.2}},
  {'fold_num': 2,
   'eval_results': [{'r2_evaluator__target': 0.5526850769787538,
     'spearman_evaluator__target': 0.7708684906608395}],
   'split_log': {'train_start': Timestamp('2017-02-01 00:00:00'),
    'train_end': Timestamp('2018-03-11 00:00:00'),
    'train_size': 19722,
    'test_start': Timestamp('2018-04-11 00:00:00'),
    'test_end': Timestamp('2018-06-12 00:00:00'),
    'test_size': 16944,
    'percentage': 0.4}},
  {'fold_num': 3,
   'eval_results': [{'r2_evaluator__target': 0.5590326161307089,
     'spearman_evaluator__target': 0.7746842469162849}],
   'split_log': {'train_start': Timestamp('2017-02-01 00:00:00'),
    'train_end': Timestamp('2018-03-11 00:00:00'),
    'train_size': 28199,
    'test_start': Timestamp('2018-04-11 00:00:00'),
    'test_end': Timestamp('2018-06-12 00:00:00'),
    'test_size': 16944,
    'percentage': 0.6}},
  {'fold_num': 4,
   'eval_results': [{'r2_evaluator__target': 0.5610652544945294,
     'spearman_evaluator__target': 0.7753973663995514}],
   'split_log': {'train_start': Timestamp('2017-02-01 00:00:00'),
    'train_end': Timestamp('2018-03-11 00:00:00'),
    'train_size': 34971,
    'test_start': Timestamp('2018-04-11 00:00:00'),
    'test_end': Timestamp('2018-06-12 00:00:00'),
    'test_size': 16944,
    'percentage': 0.8}},
  {'fold_num': 5,
   'eval_results': [{'r2_evaluator__target': 0.5623244757695537,
     'spearman_evaluator__target': 0.7760550988045266}],
   'split_log': {'train_start': Timestamp('2017-02-01 00:00:00'),
    'train_end': Timestamp('2018-03-11 00:00:00'),
    'train_size': 37262,
    'test_start': Timestamp('2018-04-11 00:00:00'),
    'test_end': Timestamp('2018-06-12 00:00:00'),
    'test_size': 16944,
    'percentage': 1.0}}]}
[32]:
data = extract(spatial_learning_curve_logs['validator_log'], full_extractor)
[33]:
data
[33]:
r2_evaluator__target spearman_evaluator__target fold_num train_start train_end train_size test_start test_end test_size percentage
0 0.508638 0.738659 0 2017-02-01 2018-03-11 5169 2018-04-11 2018-06-12 16944 0.1
0 0.528637 0.755940 1 2017-02-01 2018-03-11 10095 2018-04-11 2018-06-12 16944 0.2
0 0.552685 0.770868 2 2017-02-01 2018-03-11 19722 2018-04-11 2018-06-12 16944 0.4
0 0.559033 0.774684 3 2017-02-01 2018-03-11 28199 2018-04-11 2018-06-12 16944 0.6
0 0.561065 0.775397 4 2017-02-01 2018-03-11 34971 2018-04-11 2018-06-12 16944 0.8
0 0.562324 0.776055 5 2017-02-01 2018-03-11 37262 2018-04-11 2018-06-12 16944 1.0
[34]:
fig, ax = plt.subplots(figsize=(10, 8))
sns.pointplot(data=data, ax=ax, x="percentage", y="r2_evaluator__target")
plt.title("Spatial learning curve (trained from %s to %s)" % (data.train_start.dt.date.min(), data.train_end.dt.date.max()));
../_images/examples_fklearn_overview_49_0.png

Performance over time

  • This can be thought of as splitting the dataset month by month and computing the desired metrics on each split (a conceptual pandas sketch follows the setup below)
  • We can do that with ease using our split evaluators
[35]:
from fklearn.validation.evaluators import split_evaluator
from fklearn.metrics.pd_extractors import split_evaluator_extractor
out_of_space_holdout = pd.concat([scored_intime_outspace_hdout, scored_outime_outspace_hdout])

monthly_eval_fn = split_evaluator(eval_fn=eval_fn,
                                  split_col="month",
                                  split_values=list(range(0, 25)))

monthly_extractor = split_evaluator_extractor(base_extractor=full_extractor,
                                              split_col="month",
                                              split_values=list(range(0, 25)))
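For intuition, the split evaluator is conceptually similar to a plain pandas groupby over the split column. This sketch is not fklearn's implementation:

# Sketch: roughly what the monthly split evaluator computes for R².
from sklearn.metrics import r2_score

(out_of_space_holdout
 .dropna(subset=["target"])
 .groupby("month")
 .apply(lambda d: r2_score(d["target"], d["prediction"])))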
[36]:
out_of_space_logs = monthly_eval_fn(out_of_space_holdout)
[37]:
monthly_performance = monthly_extractor(out_of_space_logs)
[38]:
monthly_performance
[38]:
r2_evaluator__target spearman_evaluator__target split_evaluator__month
0 NaN NaN 0
0 0.421870 0.744979 1
0 0.491403 0.701534 2
0 0.478469 0.729370 3
0 0.534005 0.736146 4
0 0.538004 0.752896 5
0 0.528558 0.749593 6
0 0.599890 0.799352 7
0 0.603244 0.796775 8
0 0.627977 0.808087 9
0 0.598558 0.779260 10
0 0.604571 0.788846 11
0 0.604477 0.793243 12
0 0.575609 0.775225 13
0 0.582989 0.789979 14
0 0.578106 0.779769 15
0 0.576511 0.784794 16
0 0.611587 0.806708 17
0 0.616746 0.800946 18
0 0.581834 0.780355 19
0 0.576350 0.783299 20
0 NaN NaN 21
0 NaN NaN 22
0 NaN NaN 23
0 NaN NaN 24
[39]:
fig, ax = plt.subplots(figsize=(10, 8))
sns.pointplot(data=monthly_performance, ax=ax, x="split_evaluator__month", y="r2_evaluator__target")
[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c240b51d0>
../_images/examples_fklearn_overview_55_1.png

Impact of More Data on Monthly Performance?

  • We can combine our previous way of splitting the dataset with the evaluator we just created
  • We can use both to create a new spatial learning curve!
[40]:
monthly_spatial_learning_curve_logs = parallel_validator(train_set,
                                                         split_fn,
                                                         train_fn,
                                                         monthly_eval_fn,
                                                         n_jobs=8)
[41]:
monthly_data = extract(monthly_spatial_learning_curve_logs['validator_log'], monthly_extractor).loc[lambda df: df.r2_evaluator__target.notna()]
[42]:
monthly_data
[42]:
r2_evaluator__target spearman_evaluator__target split_evaluator__month fold_num train_start train_end train_size test_start test_end test_size percentage
0 0.540541 0.758164 15 0 2017-02-01 2018-03-11 5094 2018-04-11 2018-06-12 16711 0.1
0 0.514314 0.743269 16 0 2017-02-01 2018-03-11 5094 2018-04-11 2018-06-12 16711 0.1
0 0.507848 0.743218 17 0 2017-02-01 2018-03-11 5094 2018-04-11 2018-06-12 16711 0.1
0 0.566848 0.774746 15 1 2017-02-01 2018-03-11 9987 2018-04-11 2018-06-12 16711 0.2
0 0.543697 0.761993 16 1 2017-02-01 2018-03-11 9987 2018-04-11 2018-06-12 16711 0.2
0 0.539739 0.762033 17 1 2017-02-01 2018-03-11 9987 2018-04-11 2018-06-12 16711 0.2
0 0.586378 0.786420 15 2 2017-02-01 2018-03-11 19440 2018-04-11 2018-06-12 16711 0.4
0 0.565958 0.775002 16 2 2017-02-01 2018-03-11 19440 2018-04-11 2018-06-12 16711 0.4
0 0.562612 0.773501 17 2 2017-02-01 2018-03-11 19440 2018-04-11 2018-06-12 16711 0.4
0 0.593950 0.789789 15 3 2017-02-01 2018-03-11 28068 2018-04-11 2018-06-12 16711 0.6
0 0.572617 0.777124 16 3 2017-02-01 2018-03-11 28068 2018-04-11 2018-06-12 16711 0.6
0 0.567864 0.775599 17 3 2017-02-01 2018-03-11 28068 2018-04-11 2018-06-12 16711 0.6
0 0.597177 0.790955 15 4 2017-02-01 2018-03-11 34682 2018-04-11 2018-06-12 16711 0.8
0 0.577139 0.780944 16 4 2017-02-01 2018-03-11 34682 2018-04-11 2018-06-12 16711 0.8
0 0.570256 0.776765 17 4 2017-02-01 2018-03-11 34682 2018-04-11 2018-06-12 16711 0.8
0 0.597451 0.791407 15 5 2017-02-01 2018-03-11 36681 2018-04-11 2018-06-12 16711 1.0
0 0.575909 0.779593 16 5 2017-02-01 2018-03-11 36681 2018-04-11 2018-06-12 16711 1.0
0 0.570274 0.777338 17 5 2017-02-01 2018-03-11 36681 2018-04-11 2018-06-12 16711 1.0
[43]:
fig, ax = plt.subplots(figsize=(10, 8))
sns.pointplot(data=monthly_data, ax=ax, x="percentage", y="r2_evaluator__target", hue="split_evaluator__month")
plt.title("Spatial Learning Curve By Month")
plt.xticks(rotation=45);
../_images/examples_fklearn_overview_60_0.png

What else does fklearn provide me?

  • Several other learning curves that can be used depending on what you want to evaluate
  • Learners for several other model algorithms
  • Other tools with a similar interface for feature selection and parameter tuning
  • All these methods are integrated with similar signatures, making it easy to reuse training, splitting, and evaluation functions (see the sketch below)
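
For example, because the signatures match, swapping the LightGBM learner for a linear model changes a single line while reusing everything else. This is a hedged sketch assuming fklearn's linear_regression_learner:

# Sketch: reuse the same pipeline and evaluators with another algorithm.
from fklearn.training.regression import linear_regression_learner

linear_fn = linear_regression_learner(features=features, target="target")
linear_train_fn = build_pipeline(label_fn, capper_fn, imputer_fn, linear_fn, ranger_fn)
linear_predict_fn, linear_scored_train, linear_logs = linear_train_fn(train_set)
eval_fn(linear_predict_fn(outime_outspace_hdout))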

Learn more:

  1. Medium blog post: https://medium.com/building-nubank/
  2. Documentation: https://fklearn.readthedocs.io/