API

This is a list of all relevant fklearn functions. The docstrings should provide enough information to understand each individual function.

Preprocessing

Rebalancing (fklearn.preprocessing.rebalancing)

rebalance_by_categorical Resample dataset so that the result contains the same number of lines per category in categ_column.
rebalance_by_continuous Resample dataset so that the result contains the same number of lines per bucket in a continuous column.

Splitting (fklearn.preprocessing.splitting)

space_time_split_dataset Splits panel data using both ID and Time columns, resulting in four datasets.
time_split_dataset Splits temporal data into training and testing datasets such that all training data comes before the testing data.

Training

Calibration (fklearn.training.calibration)

isotonic_calibration_learner Fits a single feature isotonic regression to the dataset.

Classification (fklearn.training.classification)

lgbm_classification_learner Fits an LGBM classifier to the dataset.
logistic_classification_learner Fits a logistic regression classifier to the dataset.
nlp_logistic_classification_learner Fits a text vectorizer (TfidfVectorizer) followed by a logistic regression (LogisticRegression).
xgb_classification_learner Fits an XGBoost classifier to the dataset.

Ensemble (fklearn.training.ensemble)

xgb_octopus_classification_learner Octopus ensemble allows you to inject domain-specific knowledge to force a split on an initial feature, instead of assuming the tree model will do that intelligent split on its own.

Imputation (fklearn.training.imputation)

imputer Fits a missing value imputer to the dataset.
placeholder_imputer Fills missing values with a fixed value.

Pipeline (fklearn.training.pipeline)

build_pipeline(*learners) Builds a pipeline of chained learners functions with the possibility of using keyword arguments in the predict functions of the pipeline.

Regression (fklearn.training.regression)

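catboost_regressor_learner Fits a CatBoost regressor to the dataset.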
gp_regression_learner Fits a Gaussian process regressor to the dataset.
lgbm_regression_learner Fits an LGBM regressor to the dataset.
linear_regression_learner Fits a linear regression model to the dataset.
xgb_regression_learner Fits an XGBoost regressor to the dataset.

Transformation (fklearn.training.transformation)

apply_replacements(df, columns, vec, …) Base function to apply the replacement values found in the “vec” dictionaries to the df DataFrame.
capper Learns the maximum value for each of the columns_to_cap and uses that as the cap for those columns.
count_categorizer Replaces categorical variables by count.
custom_transformer Applies a custom function to the desired columns.
discrete_ecdfer Learns an Empirical Cumulative Distribution Function from the specified column in the input DataFrame.
ecdfer Learns an Empirical Cumulative Distribution Function from the specified column in the input DataFrame.
floorer Learns the minimum value for each of the columns_to_floor and uses that as the floor for those columns.
label_categorizer Replaces categorical variables with a numeric identifier.
missing_warner Creates a new column to warn about rows in which columns that had no missing values in the training set have missing values at scoring time.
null_injector Injects null values into the specified columns.
onehot_categorizer Onehot encoding on categorical columns.
prediction_ranger Caps and floors the specified prediction column to a set range.
quantile_biner Discretizes continuous numerical columns into quantiles.
rank_categorical Rank categorical features by their frequency in the train set.
selector Filters a DataFrame by selecting only the desired columns.
standard_scaler Fits a standard scaler to the dataset.
truncate_categorical Truncates infrequent categories and replaces them by a single one.
value_mapper Map values in selected columns in the DataFrame according to dictionaries of replacements.

Unsupervised (fklearn.training.unsupervised)

isolation_forest_learner Fits an anomaly detection algorithm (Isolation Forest) to the dataset.

Tuning

Model Agnostic Feature Choice (fklearn.tuning.model_agnostic_fc)

correlation_feature_selection Feature selection based on correlation
variance_feature_selection Feature selection based on variance

Parameter Tuning (fklearn.tuning.parameter_tuners)

grid_search_cv Runs several training functions with each run taken from the parameter space
random_search_tuner Runs several training functions with each run taken from the parameter space
seed([seed]) Seed the generator.

Samplers (fklearn.tuning.samplers)

remove_by_feature_importance Performs feature selection based on feature importance
remove_by_feature_shuffling Performs feature selection by comparing the evaluation on the test set with the evaluation on the test set with randomly shuffled features
remove_features_subsets Performs feature selection based on the best performing model out of several trained models

Selectors (fklearn.tuning.selectors)

backward_subset_feature_selection(…) Performs train-evaluation iterations while testing the subsets of features to compute statistics about the importance of each feature category
feature_importance_backward_selection(…) Performs train-evaluation iterations while subsampling the used features to compute statistics about feature relevance
poor_man_boruta_selection(train_data, …) Performs train-evaluation iterations while shuffling the used features to compute statistics about feature relevance

Stoppers (fklearn.tuning.stoppers)

aggregate_stop_funcs(*stop_funcs) Aggregate stop functions
stop_by_iter_num Checks for logs to see if feature selection should stop
stop_by_no_improvement Checks for logs to see if feature selection should stop
stop_by_no_improvement_parallel Checks for logs to see if feature selection should stop
stop_by_num_features Checks for logs to see if feature selection should stop
stop_by_num_features_parallel Selects the best log out of a list to see if feature selection should stop

Validation

Evaluators (fklearn.validation.evaluators)

auc_evaluator Computes the ROC AUC score, given true label and prediction scores.
brier_score_evaluator Computes the Brier score, given true label and prediction scores.
combined_evaluators Combines partially applied evaluation functions.
correlation_evaluator Computes the Pearson correlation between prediction and target.
expected_calibration_error_evaluator Computes the expected calibration error (ECE), given true label and prediction scores.
fbeta_score_evaluator Computes the F-beta score, given true label and prediction scores.
generic_sklearn_evaluator(name_prefix, …) Returns an evaluator built from a metric in sklearn.metrics
hash_evaluator Computes the hash of a pandas dataframe, filtered by hash columns.
logloss_evaluator Computes the logloss score, given true label and prediction scores.
mean_prediction_evaluator Computes mean for the specified column.
mse_evaluator Computes the Mean Squared Error, given true label and predictions.
permutation_evaluator Permutation importance evaluator.
precision_evaluator Computes the precision score, given true label and prediction scores.
r2_evaluator Computes the R2 score, given true label and predictions.
recall_evaluator Computes the recall score, given true label and prediction scores.
spearman_evaluator Computes the Spearman correlation between prediction and target.
split_evaluator Splits the dataset into the categories in split_col and evaluates model performance in each split.
temporal_split_evaluator Splits the dataset into temporal categories by time_col and evaluates model performance in each split.

Splitters (fklearn.validation.splitters)

forward_stability_curve_time_splitter Splits the data into temporal buckets, with the training and testing folds both moving forward.
k_fold_splitter Makes K random train/test split folds for cross validation.
out_of_time_and_space_splitter Makes K grouped train/test split folds for cross validation.
reverse_time_learning_curve_splitter Splits the data into temporal buckets given by the specified frequency.
spatial_learning_curve_splitter Splits the data for a spatial learning curve.
stability_curve_time_in_space_splitter Splits the data into temporal buckets given by the specified frequency.
stability_curve_time_space_splitter Splits the data into temporal buckets given by the specified frequency.
stability_curve_time_splitter Splits the data into temporal buckets given by the specified frequency.
time_and_space_learning_curve_splitter Splits the data into temporal buckets given by the specified frequency.
time_learning_curve_splitter Splits the data into temporal buckets given by the specified frequency.

Validator (fklearn.validation.validator)

parallel_validator Splits the training data into folds given by the split function and performs a train-evaluation sequence on each fold.
validator Splits the training data into folds given by the split function and performs a train-evaluation sequence on each fold by calling validator_iteration.
validator_iteration(data, train_index, …) Perform an iteration of train test split, training and evaluation.

Definitions

fklearn.data.datasets.make_confounded_data(n: int) → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, pandas.core.frame.DataFrame][source]

Generates fake data for counterfactual experimentation. The covariates are sex, age and severity; the treatment is a binary variable, medication; and the response is days until recovery.

Parameters:n (int) – The number of samples to generate
Returns:
  • df_rnd (pd.DataFrame) – A dataframe where the treatment is randomly assigned.
  • df_obs (pd.DataFrame) – A dataframe with confounding.
  • df_ctf (pd.DataFrame) – A counterfactual dataframe with confounding. Same as df_obs, but with the treatment flipped.
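
A minimal usage sketch (the value of n is arbitrary):

    from fklearn.data.datasets import make_confounded_data

    # Randomized, observational (confounded) and counterfactual versions of the same data
    df_rnd, df_obs, df_ctf = make_confounded_data(n=10000)
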
fklearn.data.datasets.make_tutorial_data(n: int) → pandas.core.frame.DataFrame[source]

Generates fake data for a tutorial. There are 3 numerical features (“num1”, “num2” and “num3”) and two categorical features (“cat1” and “cat2”).

Parameters:n (int) – The number of samples to generate
Returns:df – A tutorial dataset
Return type:pd.DataFrame
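
A minimal usage sketch:

    from fklearn.data.datasets import make_tutorial_data

    # A small fake dataset with numerical and categorical features
    df = make_tutorial_data(n=100)
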
fklearn.preprocessing.rebalancing.rebalance_by_categorical[source]

Resample dataset so that the result contains the same number of lines per category in categ_column.

Parameters:
  • dataset (pandas.DataFrame) – A Pandas’ DataFrame with a categ_column column
  • categ_column (str) – The name of the categorical column
  • max_lines_by_categ (int (default None)) – The maximum number of lines by category. If None it will be set to the number of lines for the smallest category
  • seed (int (default 1)) – Random state for consistency.
Returns:

rebalanced_dataset – A dataset with fewer lines than dataset, but with the same number of lines per category in categ_column

Return type:

pandas.DataFrame
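
A usage sketch (the column name “cat1” is hypothetical):

    from fklearn.preprocessing.rebalancing import rebalance_by_categorical

    # Downsample every category to the size of the smallest one
    balanced = rebalance_by_categorical(dataset=df, categ_column="cat1", seed=1)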

fklearn.preprocessing.rebalancing.rebalance_by_continuous[source]

Resample dataset so that the result contains the same number of lines per bucket in a continuous column.

Parameters:
  • dataset (pandas.DataFrame) – A Pandas’ DataFrame with a continuous_column column
  • continuous_column (str) – The name of the continuous column
  • buckets (int) – The number of buckets to split the continuous column into
  • max_lines_by_categ (int (default None)) – The maximum number of lines by category. If None it will be set to the number of lines for the smallest category
  • by_quantile (bool (default False)) – If True, uses pd.qcut instead of pd.cut to get the buckets from the continuous column
  • seed (int (default 1)) – Random state for consistency.
Returns:

rebalanced_dataset – A dataset with fewer lines than dataset, but with the same number of lines per bucket of the continuous column

Return type:

pandas.DataFrame

fklearn.preprocessing.splitting.space_time_split_dataset[source]

Splits panel data using both ID and Time columns, resulting in four datasets:

  1. A training set;
  2. An in-training-time, but out-of-sample-ID holdout dataset;
  3. An out-of-training-time, but in-sample-ID holdout dataset;
  4. An out-of-training-time and out-of-sample-ID holdout dataset.
Parameters:
  • dataset (pandas.DataFrame) – A Pandas’ DataFrame with an Identifier Column and a Date Column. The model will be trained to predict the target column from the features.
  • train_start_date (str) – A date string representing the starting time of the training data. It should be in the same format as the Date Column in dataset.
  • train_end_date (str) – A date string representing the ending time of the training data. This will also be used as the start date of the holdout period. It should be in the same format as the Date Column in dataset.
  • holdout_end_date (str) – A date string representing the ending time of the holdout data. It should be in the same format as the Date Column in dataset.
  • split_seed (int) – A seed used by the random number generator.
  • space_holdout_percentage (float) – The out of id holdout size as a proportion of the in id training size.
  • space_column (str) – The name of the Identifier column of dataset.
  • time_column (str) – The name of the Date column of dataset.
  • holdout_space (np.array) – An array containing the hold out IDs. If not specified, a random subset of IDs will be selected for holdout.
Returns:

  • train_set (pandas.DataFrame) – The in-ID and in-time training set.
  • intime_outspace_hdout (pandas.DataFrame) – The in-time, but out-of-ID hold out set.
  • outime_inspace_hdout (pandas.DataFrame) – The out-of-time, but in-ID hold out set.
  • outime_outspace_hdout (pandas.DataFrame) – The out-of-time and out-of-ID hold out set.
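
A sketch of the four-way split; the column names (“id”, “date”) and the dates are hypothetical:

    from fklearn.preprocessing.splitting import space_time_split_dataset

    train_set, intime_outspace, outtime_inspace, outtime_outspace = \
        space_time_split_dataset(dataset=df,
                                 train_start_date="2019-01-01",
                                 train_end_date="2019-07-01",
                                 holdout_end_date="2020-01-01",
                                 split_seed=42,
                                 space_holdout_percentage=0.2,
                                 space_column="id",
                                 time_column="date")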

fklearn.preprocessing.splitting.time_split_dataset[source]

Splits temporal data into training and testing datasets such that all training data comes before the testing data.

Parameters:
  • dataset (pandas.DataFrame) – A Pandas’ DataFrame with an Identifier Column and a Date Column. The model will be trained to predict the target column from the features.
  • train_start_date (str) – A date string representing the starting time of the training data. It should be in the same format as the Date Column in dataset.
  • train_end_date (str) – A date string representing the ending time of the training data. This will also be used as the start date of the holdout period. It should be in the same format as the Date Column in dataset.
  • holdout_end_date (str) – A date string representing the ending time of the holdout data. It should be in the same format as the Date Column in dataset.
  • time_column (str) – The name of the Date column of dataset.
Returns:

  • train_set (pandas.DataFrame) – The training set, containing data from train_start_date up to train_end_date.
  • test_set (pandas.DataFrame) – The holdout set, containing data from train_end_date up to holdout_end_date.
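
A usage sketch (the column name and the dates are hypothetical):

    from fklearn.preprocessing.splitting import time_split_dataset

    # All training rows fall before train_end_date; the test set covers
    # train_end_date up to holdout_end_date
    train_set, test_set = time_split_dataset(dataset=df,
                                             train_start_date="2019-01-01",
                                             train_end_date="2019-07-01",
                                             holdout_end_date="2020-01-01",
                                             time_column="date")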

fklearn.training.calibration.isotonic_calibration_learner[source]

Fits a single feature isotonic regression to the dataset.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame with features and target columns. The model will be trained to predict the target column from the features.
  • target_column (str) – The name of the column in df that should be used as target for the model. This column should be binary, since this is a classification model.
  • prediction_column (str) – The name of the column with the uncalibrated predictions from the model.
  • output_column (str) – The name of the column with the calibrated predictions from the model.
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the Isotonic Calibration model.
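
Like every fklearn learner, it returns the (p, new_df, log) triple; a sketch with hypothetical column names:

    from fklearn.training.calibration import isotonic_calibration_learner

    # Calibrate raw model scores against the observed binary target
    calibrate_fn, calibrated_df, log = isotonic_calibration_learner(
        scored_df, target_column="target", prediction_column="prediction")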

fklearn.training.classification.lgbm_classification_learner[source]

Fits an LGBM classifier to the dataset.

It first generates a Dataset with the specified features and labels from df. Then, it fits an LGBM model to this Dataset. Returns the predict function for the model and the predictions for the input dataset.

Parameters:
  • df (pandas.DataFrame) – A pandas DataFrame with features and target columns. The model will be trained to predict the target column from the features.
  • features (list of str) – A list of column names that are used as features for the model. All these names should be in df.
  • target (str) – The name of the column in df that should be used as target for the model. This column should be binary, since this is a classification model.
  • learning_rate (float) – Float in the range (0, 1]. Step size shrinkage used in update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta actually shrinks the feature weights to make the boosting process more conservative. See the learning_rate hyper-parameter in: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst
  • num_estimators (int) – Int in the range (0, inf) Number of boosted trees to fit. See the num_iterations hyper-parameter in: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst
  • extra_params (dict, optional) – Dictionary in the format {“hyperparameter_name” : hyperparameter_value}. Other parameters for the LGBM model. See the list in: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst If not passed, the default will be used.
  • prediction_column (str) – The name of the column with the predictions from the model.
  • weight_column (str, optional) – The name of the column with scores to weight the data.
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the LGBM Classifier model.
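
fklearn learners support partial application (see build_pipeline below), so you can bind every argument except the DataFrame; a sketch with hypothetical feature and target names:

    from fklearn.training.classification import lgbm_classification_learner

    learner = lgbm_classification_learner(features=["x1", "x2"],
                                          target="y",
                                          learning_rate=0.1,
                                          num_estimators=200)
    predict_fn, scored_df, log = learner(train_df)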

fklearn.training.classification.logistic_classification_learner[source]

Fits a logistic regression classifier to the dataset. Returns the predict function for the model and the predictions for the input dataset.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame with features and target columns. The model will be trained to predict the target column from the features.
  • features (list of str) – A list of column names that are used as features for the model. All these names should be in df.
  • target (str) – The name of the column in df that should be used as target for the model. This column should be binary, since this is a classification model.
  • params (dict) – The LogisticRegression parameters in the format {“par_name”: param}. See: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
  • prediction_column (str) – The name of the column with the predictions from the model. If a multiclass problem, additional prediction_column_i columns will be added for i in range(0,n_classes).
  • weight_column (str, optional) – The name of the column with scores to weight the data.
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the Logistic Regression model.

fklearn.training.classification.nlp_logistic_classification_learner[source]

Fits a text vectorizer (TfidfVectorizer) followed by a logistic regression (LogisticRegression).

Parameters:
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the NLP Logistic Regression model.

fklearn.training.classification.xgb_classification_learner[source]

Fits an XGBoost classifier to the dataset. It first generates a DMatrix with the specified features and labels from df. Then, it fits an XGBoost model to this DMatrix. Returns the predict function for the model and the predictions for the input dataset.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame with features and target columns. The model will be trained to predict the target column from the features.
  • features (list of str) – A list of column names that are used as features for the model. All these names should be in df.
  • target (str) – The name of the column in df that should be used as target for the model. This column should be binary, since this is a classification model.
  • learning_rate (float) – Float in the range (0, 1]. Step size shrinkage used in update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta actually shrinks the feature weights to make the boosting process more conservative. See the eta hyper-parameter in: http://xgboost.readthedocs.io/en/latest/parameter.html
  • num_estimators (int) – Int in the range (0, inf) Number of boosted trees to fit. See the n_estimators hyper-parameter in: http://xgboost.readthedocs.io/en/latest/python/python_api.html
  • extra_params (dict, optional) – Dictionary in the format {“hyperparameter_name” : hyperparameter_value}. Other parameters for the XGBoost model. See the list in: http://xgboost.readthedocs.io/en/latest/parameter.html If not passed, the default will be used.
  • prediction_column (str) – The name of the column with the predictions from the model. If a multiclass problem, additional prediction_column_i columns will be added for i in range(0,n_classes).
  • weight_column (str, optional) – The name of the column with scores to weight the data.
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the XGBoost Classifier model.

fklearn.training.ensemble.xgb_octopus_classification_learner[source]

Octopus ensemble allows you to inject domain-specific knowledge to force a split on an initial feature, instead of assuming the tree model will do that intelligent split on its own. It works by first defining a split on your dataset and then training one individual model on each separate dataset.

Parameters:
  • train_set (pd.DataFrame) – A Pandas’ DataFrame with features, target columns and a splitting column that must be categorical.
  • learning_rate_by_bin (dict) –

    A dictionary of learning rates for the XGBoost model to use in each model split. Ex: if you want to split your training by tenure and you have a tenure column with integer values [1,2,3,…,12], you have to specify a learning rate for each split:

    {
        1: 0.08,
        2: 0.08,
        ...
        12: 0.1
    }
    
  • num_estimators_by_bin (dict) –

    A dictionary of the number of tree estimators for the XGBoost model to use in each model split. Ex: if you want to split your training by tenure and you have a tenure column with integer values [1,2,3,…,12], you have to specify the number of estimators for each split:

    {
        1: 300,
        2: 250,
        ...
        12: 300
    }
    
  • extra_params_by_bin (dict) –

    A dictionary of extra-parameter dictionaries for the XGBoost model to use in each model split. Ex: if you want to split your training by tenure and you have a tenure column with integer values [1,2,3,…,12], you have to specify the extra parameters for each split:

    {
        1: {
            'reg_alpha': 0.0,
            'colsample_bytree': 0.4,
            ...
            'colsample_bylevel': 0.8
            },
        2: {
            'reg_alpha': 0.1,
            'colsample_bytree': 0.6,
            ...
            'colsample_bylevel': 0.4
            },
        ...
        12: {
            'reg_alpha': 0.0,
            'colsample_bytree': 0.7,
            ...
            'colsample_bylevel': 1.0
            }
    }
    
  • features_by_bin (dict) –

    A dictionary of features to use in each model split. Ex: if you want to split your training by tenure and you have a tenure column with integer values [1,2,3,…,12], you have to specify a list of features for each split:

    {
        1: [feature-1, feature-2, feature-3, ...],
        2: [feature-1, feature-3, feature-5, ...],
        ...
        12: [feature-2, feature-4, feature-8, ...]
    }
    
  • train_split_col (str) – The name of the categorical column where the model will make the splits. Ex: if you want to split your training by tenure, you can have a categorical column called “tenure”.
  • train_split_bins (list) – A list with the actual values of the categories from the train_split_col. Ex: if you want to split your training by tenure and you have a tenure column with integer values [1,2,3,…,12] you can pass this list and you will split your training into 12 different models.
  • nthread (int) – Number of threads for the XGBoost learners.
  • target_column (str) – The name of the target column.
  • prediction_column (str) – The name of the column with the predictions from the model.
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the Octopus XGB Classifier model.
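
Putting the *_by_bin dictionaries together, a sketch for a 12-bin tenure split (all column and feature names are hypothetical):

    from fklearn.training.ensemble import xgb_octopus_classification_learner

    bins = list(range(1, 13))
    predict_fn, scored_df, log = xgb_octopus_classification_learner(
        train_set=train_df,
        learning_rate_by_bin={b: 0.1 for b in bins},
        num_estimators_by_bin={b: 100 for b in bins},
        extra_params_by_bin={b: {"max_depth": 3} for b in bins},
        features_by_bin={b: ["feature-1", "feature-2"] for b in bins},
        train_split_col="tenure",
        train_split_bins=bins,
        nthread=4,
        target_column="target",
        prediction_column="prediction")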

fklearn.training.imputation.imputer[source]

Fits a missing value imputer to the dataset.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame with columns to impute missing values. It must contain all columns listed in columns_to_impute
  • columns_to_impute (List of strings) – A list of names of the columns for missing value imputation.
  • impute_strategy (String, (default="median")) – The imputation strategy. - If “mean”, then replace missing values using the mean along the axis. - If “median”, then replace missing values using the median along the axis. - If “most_frequent”, then replace missing using the most frequent value along the axis.
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the SimpleImputer model.
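
The returned function p can be reused to impute new data with the values learned during training; a sketch with hypothetical column names:

    from fklearn.training.imputation import imputer

    impute_fn, imputed_df, log = imputer(train_df,
                                         columns_to_impute=["x1", "x2"],
                                         impute_strategy="median")

    # Reuses the medians learned on train_df
    new_imputed_df = impute_fn(new_df)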

fklearn.training.imputation.placeholder_imputer[source]

Fills missing values with a fixed value.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame with columns to fill missing values. It must contain all columns listed in columns_to_impute
  • columns_to_impute (List of strings) – A list of names of the columns for filling missing value.
  • placeholder_value (Any, (default=-999)) – The value used to fill in missing values.
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the Placeholder SimpleImputer model.

fklearn.training.pipeline.build_pipeline(*learners) → Callable[pandas.core.frame.DataFrame, Tuple[Callable[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame], pandas.core.frame.DataFrame, Dict[str, Dict[str, Any]]]][source]

Builds a pipeline of chained learners functions with the possibility of using keyword arguments in the predict functions of the pipeline.

Say you have two learners; you create a pipeline with pipeline = build_pipeline(learner1, learner2). Those learners must be functions with just one unfilled argument (the dataset itself).

Then, you train the pipeline with predict_fn, transformed_df, logs = pipeline(df), which will be like applying the learners in the following order: learner2(learner1(df)).

Finally, you predict on different datasets with pred_df = predict_fn(new_df), with optional kwargs. For example, if you have XGBoost or LightGBM, you can get SHAP values with predict_fn(new_df, apply_shap=True).

Parameters:learners (partially-applied learner functions.) –
Returns:
  • p (function pandas.DataFrame, **kwargs -> pandas.DataFrame) – A function that when applied to a DataFrame will apply all learner functions in sequence, with optional kwargs.
  • new_df (pandas.DataFrame) – A DataFrame that is the result of applying all learner functions in sequence.
  • log (dict) – A log-like Dict that stores information of all learner functions.
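
A sketch of a two-step pipeline, with hypothetical column names:

    from fklearn.training.pipeline import build_pipeline
    from fklearn.training.imputation import imputer
    from fklearn.training.regression import linear_regression_learner

    pipeline = build_pipeline(
        imputer(columns_to_impute=["x1", "x2"], impute_strategy="median"),
        linear_regression_learner(features=["x1", "x2"], target="y"))

    # Training fits each step in sequence on train_df
    predict_fn, transformed_df, logs = pipeline(train_df)

    # predict_fn applies the whole chain to new data
    pred_df = predict_fn(new_df)
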
fklearn.training.regression.catboost_regressor_learner[source]

Fits a CatBoost regressor to the dataset. It first generates a Pool with the specified features and labels from df. Then it fits a CatBoost model to this Pool. Returns the predict function for the model and the predictions for the input dataset.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame with features and target columns. The model will be trained to predict the target column from the features.
  • features (list of str) – A list of column names that are used as features for the model. All these names should be in df.
  • target (str) – The name of the column in df that should be used as target for the model. This column should be numerical and continuous, since this is a regression model.
  • learning_rate (float) – Float in range [0,1]. Step size shrinkage used in update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta actually shrinks the feature weights to make the boosting process more conservative. See the eta hyper-parameter in: https://catboost.ai/docs/concepts/python-reference_parameters-list.html
  • num_estimators (int) – Int in range [0, inf] Number of boosted trees to fit. See the n_estimators hyper-parameter in: https://catboost.ai/docs/concepts/python-reference_parameters-list.html
  • extra_params (dict, optional) – Dictionary in the format {“hyperparameter_name” : hyperparameter_value}. Other parameters for the CatBoost model. See the list in: https://catboost.ai/docs/concepts/python-reference_catboostregressor.html If not passed, the default will be used.
  • prediction_column (str) – The name of the column with the predictions from the model.
  • weight_column (str, optional) – The name of the column with scores to weight the data.
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the CatBoostRegressor model.

fklearn.training.regression.gp_regression_learner[source]

Fits a Gaussian process regressor to the dataset.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame with features and target columns. The model will be trained to predict the target column from the features.
  • features (list of str) – A list of column names that are used as features for the model. All these names should be in df.
  • target (str) – The name of the column in df that should be used as target for the model. This column should be numerical and continuous, since this is a regression model.
  • kernel (sklearn.gaussian_process.kernels) – The kernel specifying the covariance function of the GP. If None is passed, the kernel “1.0 * RBF(1.0)” is used as default. Note that the kernel’s hyperparameters are optimized during fitting.
  • alpha (float) – Value added to the diagonal of the kernel matrix during fitting. Larger values correspond to increased noise level in the observations. This can also prevent a potential numerical issue during fitting, by ensuring that the calculated values form a positive definite matrix.
  • extra_variance (float) – The amount of extra variance to scale to the predictions in standard deviations. If left as the default “fit”, uses the standard deviation of the target.
  • return_std (bool) – If True, the standard-deviation of the predictive distribution at the query points is returned along with the mean.
  • extra_params (dict {"hyperparameter_name" : hyperparameter_value}, optional) – Other parameters for the GaussianProcessRegressor model. See the list in: http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessRegressor.html If not passed, the default will be used.
  • prediction_column (str) – The name of the column with the predictions from the model.
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the Gaussian Process Regressor model.

fklearn.training.regression.lgbm_regression_learner[source]

Fits an LGBM regressor to the dataset.

It first generates a Dataset with the specified features and labels from df. Then, it fits an LGBM model to this Dataset. Returns the predict function for the model and the predictions for the input dataset.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame with features and target columns. The model will be trained to predict the target column from the features.
  • features (list of str) – A list of column names that are used as features for the model. All these names should be in df.
  • target (str) – The name of the column in df that should be used as target for the model. This column should be numerical and continuous, since this is a regression model.
  • learning_rate (float) – Float in the range (0, 1]. Step size shrinkage used in update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta actually shrinks the feature weights to make the boosting process more conservative. See the learning_rate hyper-parameter in: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst
  • num_estimators (int) – Int in the range (0, inf) Number of boosted trees to fit. See the num_iterations hyper-parameter in: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst
  • extra_params (dict, optional) – Dictionary in the format {“hyperparameter_name” : hyperparameter_value}. Other parameters for the LGBM model. See the list in: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst If not passed, the default will be used.
  • prediction_column (str) – The name of the column with the predictions from the model.
  • weight_column (str, optional) – The name of the column with scores to weight the data.
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the LGBM Regressor model.

fklearn.training.regression.linear_regression_learner[source]

Fits a linear regression model to the dataset. Returns the predict function for the model and the predictions for the input dataset.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame with features and target columns. The model will be trained to predict the target column from the features.
  • features (list of str) – A list of column names that are used as features for the model. All these names should be in df.
  • target (str) – The name of the column in df that should be used as target for the model. This column should be continuous, since this is a regression model.
  • params (dict) – The LinearRegression parameters in the format {“par_name”: param}. See: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
  • prediction_column (str) – The name of the column with the predictions from the model.
  • weight_column (str, optional) – The name of the column with scores to weight the data.
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the Linear Regression model.

fklearn.training.regression.xgb_regression_learner[source]

Fits an XGBoost regressor to the dataset. It first generates a DMatrix with the specified features and labels from df. Then it fits an XGBoost model to this DMatrix. Returns the predict function for the model and the predictions for the input dataset.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame with features and target columns. The model will be trained to predict the target column from the features.
  • features (list of str) – A list of column names that are used as features for the model. All these names should be in df.
  • target (str) – The name of the column in df that should be used as target for the model. This column should be numerical and continuous, since this is a regression model.
  • learning_rate (float) – Float in range [0,1]. Step size shrinkage used in update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta actually shrinks the feature weights to make the boosting process more conservative. See the eta hyper-parameter in: http://xgboost.readthedocs.io/en/latest/parameter.html
  • num_estimators (int) – Int in range [0, inf]. Number of boosted trees to fit. See the n_estimators hyper-parameter in: http://xgboost.readthedocs.io/en/latest/python/python_api.html
  • extra_params (dict, optional) – Dictionary in the format {“hyperparameter_name” : hyperparameter_value}. Other parameters for the XGBoost model. See the list in: http://xgboost.readthedocs.io/en/latest/parameter.html If not passed, the default will be used.
  • prediction_column (str) – The name of the column with the predictions from the model.
  • weight_column (str, optional) – The name of the column with scores to weight the data.
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the XGBoost Regressor model.
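
A sketch with hypothetical names, passing extra XGBoost parameters through extra_params:

    from fklearn.training.regression import xgb_regression_learner

    predict_fn, scored_df, log = xgb_regression_learner(
        train_df,
        features=["x1", "x2"],
        target="y",
        learning_rate=0.1,
        num_estimators=200,
        extra_params={"max_depth": 4, "seed": 42})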

fklearn.training.transformation.apply_replacements(df: pandas.core.frame.DataFrame, columns: List[str], vec: Dict[str, Dict], replace_unseen: Any) → pandas.core.frame.DataFrame[source]

Base function to apply the replacement values found in the “vec” dictionaries to the df DataFrame.

Parameters:
  • df (pandas.DataFrame) – A Pandas DataFrame containing the data to be replaced.
  • columns (list of str) – The df columns names to perform the replacements.
  • vec (dict) – A dict mapping a col to dict mapping a value to its replacement. For example: vec = {“feature1”: {1: 2, 3: 5, 6: 8}}
  • replace_unseen (Any) – Default value to replace when original value is not present in the vec dict for the feature
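
A sketch using the example vec from above:

    from fklearn.training.transformation import apply_replacements

    # Replace 1->2, 3->5 and 6->8 in feature1; any other value becomes -1
    new_df = apply_replacements(df,
                                columns=["feature1"],
                                vec={"feature1": {1: 2, 3: 5, 6: 8}},
                                replace_unseen=-1)
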
fklearn.training.transformation.capper[source]

Learns the maximum value for each of the columns_to_cap and uses that as the cap for those columns. If precomputed caps are passed, the function uses those as the cap values instead of computing the maximum.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame that must contain columns_to_cap columns.
  • columns_to_cap (list of str) – A list of column names that should be capped.
  • precomputed_caps (dict) – A dictionary in the format {“column_name” : cap_value} that maps column names to precomputed cap values
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the Capper model.

fklearn.training.transformation.count_categorizer[source]

Replaces categorical variables by count.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame that must contain columns_to_categorize columns.
  • columns_to_categorize (list of str) – A list of categorical column names.
  • replace_unseen (int) – The value to impute unseen categories.
  • store_mapping (bool (default: False)) – Whether to store the feature value -> integer dictionary in the log
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the Count Categorizer model.

fklearn.training.transformation.custom_transformer[source]

Applies a custom function to the desired columns.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame that must contain columns
  • columns_to_transform (list of str) – A list of the column names that will be transformed
  • transformation_function (function(pandas.DataFrame) -> pandas.DataFrame) – A function that receives a DataFrame as input, performs a transformation on its columns and returns another DataFrame.
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the Custom Transformer model.

fklearn.training.transformation.discrete_ecdfer[source]

Learns an Empirical Cumulative Distribution Function from the specified column in the input DataFrame. It is usually used on the prediction column to convert a predicted probability into a score from 0 to 1000.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame that must contain a prediction_column column.
  • ascending (bool) – Whether to compute an ascending ECDF or a descending one.
  • prediction_column (str) – The name of the column in df to learn the ECDF from.
  • ecdf_column (str) – The name of the new ECDF column added by this function.
  • max_range (int) – The maximum value for the ECDF. It will go from 0 to max_range.
  • round_method (Callable) – A function to perform the rounding of the transformed values, e.g. int, ceil, floor or round.
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the Discrete ECDFer model.

fklearn.training.transformation.ecdfer[source]

Learns an Empirical Cumulative Distribution Function from the specified column in the input DataFrame. It is usually used on the prediction column to convert a predicted probability into a score from 0 to 1000.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame that must contain a prediction_column column.
  • ascending (bool) – Whether to compute an ascending ECDF or a descending one.
  • prediction_column (str) – The name of the column in df to learn the ECDF from.
  • ecdf_column (str) – The name of the new ECDF column added by this function.
  • max_range (int) – The maximum value for the ECDF. It will go from 0 to max_range.
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the ECDFer model.
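
A sketch converting predicted probabilities into a 0-1000 score (the input name is hypothetical):

    from fklearn.training.transformation import ecdfer

    ecdf_fn, scored_df, log = ecdfer(train_scored_df,
                                     ascending=True,
                                     prediction_column="prediction",
                                     ecdf_column="prediction_ecdf",
                                     max_range=1000)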

fklearn.training.transformation.floorer[source]

Learns the minimum value for each of the columns_to_floor and uses that as the floor for those columns. If precomputed floors are passed, the function uses those as the floor values instead of computing the minimum.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame that must contain columns_to_floor columns.
  • columns_to_floor (list of str) – A list of column names that should be floored.
  • precomputed_floors (dict) – A dictionary in the format {“column_name” : floor_value} that maps column names to precomputed floor values
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the Floorer model.

fklearn.training.transformation.label_categorizer[source]

Replaces categorical variables with a numeric identifier.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame that must contain columns_to_categorize columns.
  • columns_to_categorize (list of str) – A list of categorical column names.
  • replace_unseen (int, str, float, or nan) – The value to impute unseen categories.
  • store_mapping (bool (default: False)) – Whether to store the feature value -> integer dictionary in the log
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the Label Categorizer model.

fklearn.training.transformation.missing_warner[source]

Creates a new column to warn about rows in which columns that had no missing values in the training set have missing values at scoring time

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame.
  • cols_list (list of str) – List of columns to consider when evaluating missingness
  • new_column_name (str) – Name of the column created to alert the existence of missing values
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the Missing Alerter model.

fklearn.training.transformation.null_injector[source]

Injects null values into the specified columns.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame that must contain columns_to_inject as columns
  • columns_to_inject (list of str) – A list of features into which nulls will be injected. If groups is not None, this argument is ignored.
  • proportion (float) – Proportion of nulls to inject in the columns.
  • groups (list of list of str (default = None)) – A list of group of features. If not None, feature in the same group will be set to NaN together.
  • seed (int) – Random seed for consistency.
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the Null Injector model.

fklearn.training.transformation.onehot_categorizer[source]

Onehot encoding on categorical columns.

Parameters:
  • df (pd.DataFrame) – A Pandas’ DataFrame that must contain columns_to_categorize columns.
  • columns_to_categorize (list of str) – A list of categorical column names. Must be non-empty.
  • hardcode_nans (bool) – Hardcodes an extra column with: 1 if nan or unseen else 0.
  • drop_first_column (bool) – Drops the first column to create (k-1)-sized one-hot arrays for the k categories of each categorical column. Can be used to avoid collinearity.
  • store_mapping (bool (default: False)) – Whether to store the feature value -> integer dictionary in the log
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the Onehot Categorizer model.
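
A sketch with a hypothetical categorical column:

    from fklearn.training.transformation import onehot_categorizer

    encode_fn, encoded_df, log = onehot_categorizer(train_df,
                                                    columns_to_categorize=["cat1"],
                                                    hardcode_nans=True,
                                                    drop_first_column=False,
                                                    store_mapping=True)
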
fklearn.training.transformation.prediction_ranger[source]

Caps and floors the specified prediction column to a set range.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame that must contain a prediction_column columns.
  • prediction_min (float) – The floor for the prediction.
  • prediction_max (float) – The cap for the prediction.
  • prediction_column (str) – The name of the column in df to cap and floor
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the Prediction Ranger model.

fklearn.training.transformation.quantile_biner[source]

Discretizes continuous numerical columns into quantiles. Uses pandas.qcut to find the bins and then numpy.digitize to fit the columns into bins.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame that must contain the columns_to_bin columns.
  • columns_to_bin (list of str) – A list of numerical column names.
  • q (int) – Number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles. See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html
  • right (bool) – Indicating whether the intervals include the right or the left bin edge. Default behavior is (right==False) indicating that the interval does not include the right edge. The left bin end is open in this case, i.e., bins[i-1] <= x < bins[i] is the default behavior for monotonically increasing bins. See https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.digitize.html
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the Quantile Biner model.

fklearn.training.transformation.rank_categorical[source]

Rank categorical features by their frequency in the train set.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame that must contain the columns_to_rank columns.
  • columns_to_rank (list of str) – The df columns names to perform the rank.
  • replace_unseen (int, str, float, or nan) – The value to impute unseen categories.
  • store_mapping (bool (default: False)) – Whether to store the feature value -> integer dictionary in the log
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the Rank Categorical model.

fklearn.training.transformation.selector[source]

Filters a DataFrame by selecting only the desired columns.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame that must contain columns
  • training_columns (list of str) – A list of column names that will remain in the dataframe during training time (fit)
  • predict_columns (list of str) – A list of column names that will remain in the dataframe during prediction time (transform) If None, it defaults to training_columns.
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the Selector model.

fklearn.training.transformation.standard_scaler[source]

Fits a standard scaler to the dataset.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame with columns to scale. It must contain all columns listed in columns_to_scale.
  • columns_to_scale (list of str) – A list of names of the columns for standard scaling.
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the Standard Scaler model.

fklearn.training.transformation.truncate_categorical[source]

Truncates infrequent categories and replaces them by a single one. You can think of it as an “others” category.

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame that must contain the columns_to_truncate columns.
  • columns_to_truncate (list of str) – The df columns names to perform the truncation.
  • percentile (float) – Categories less frequent than this percentile will be replaced by a single one.
  • replacement (int, str, float or nan) – The value to use when a category is less frequent than the percentile variable.
  • replace_unseen (int, str, float, or nan) – The value to impute unseen categories.
  • store_mapping (bool (default: False)) – Whether to store the feature value -> integer dictionary in the log.
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with the infrequent categories in columns_to_truncate replaced by replacement.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df, with the truncated categorical columns.
  • log (dict) – A log-like Dict that stores information of the Truncate Categorical model.
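
For illustration, a minimal sketch where rare categories are collapsed into a single placeholder:

    import pandas as pd
    from fklearn.training.transformation import truncate_categorical

    df = pd.DataFrame({"city": ["a"] * 98 + ["b", "c"]})

    # "b" and "c" are each rarer than the 5% cutoff, so both become -9999
    p, truncated_df, log = truncate_categorical(df,
                                                columns_to_truncate=["city"],
                                                percentile=0.05,
                                                replacement=-9999)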

fklearn.training.transformation.value_mapper[source]

Map values in selected columns in the DataFrame according to dictionaries of replacements. Learner wrapper for apply_replacements.

Parameters:
  • df (pandas.DataFrame) – A Pandas DataFrame containing the data to be replaced.
  • value_maps (dict of dicts) – A dict mapping a col to dict mapping a value to its replacement. For example: value_maps = {“feature1”: {1: 2, 3: 5, 6: 8}}
  • ignore_unseen (bool) – If True, values not explicitly declared in value_maps will be left as is. If False, these will be replaced by replace_unseen_to.
  • replace_unseen_to (Any) – Default value to replace when original value is not present in the vec dict for the feature.
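
For illustration, a minimal sketch using the value_maps example above:

    import pandas as pd
    from fklearn.training.transformation import value_mapper

    df = pd.DataFrame({"feature1": [1, 3, 6, 7]})
    value_maps = {"feature1": {1: 2, 3: 5, 6: 8}}

    # with ignore_unseen=True, the unmapped value 7 is left as is
    p, mapped_df, log = value_mapper(df, value_maps=value_maps, ignore_unseen=True)
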
fklearn.training.unsupervised.isolation_forest_learner[source]

Fits an anomaly detection algorithm (Isolation Forest) to the dataset

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame with the columns used to fit the anomaly detection model.
  • features (list of str) – A list of column names that are used as features for the model. All of these names should be in df.
  • params (dict) – The IsolationForest parameters in the format {“par_name”: param}. See: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
  • prediction_column (str) – The name of the column with the predictions from the model.
Returns:

  • p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
  • new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
  • log (dict) – A log-like Dict that stores information of the Isolation Forest model.
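
For illustration, a minimal sketch with made-up features and parameters:

    import pandas as pd
    from fklearn.training.unsupervised import isolation_forest_learner

    df = pd.DataFrame({"x1": [0.1, 0.2, 0.1, 9.9],
                       "x2": [1.0, 1.1, 0.9, -8.0]})

    # scored_df gains a "prediction" column with the anomaly scores
    p, scored_df, log = isolation_forest_learner(df,
                                                 features=["x1", "x2"],
                                                 params={"n_estimators": 50})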

fklearn.tuning.model_agnostic_fc.correlation_feature_selection[source]

Feature selection based on correlation

Parameters:
  • train_set (pd.DataFrame) – A Pandas’ DataFrame with the training data
  • features (list of str) – The list of features to consider when dropping with correlation
  • threshold (float) – The correlation threshold. Features with correlation equal to or above this threshold will be dropped.
Returns:

Return type:

log with feature correlation, features to drop and final features

fklearn.tuning.model_agnostic_fc.variance_feature_selection[source]

Feature selection based on variance

Parameters:
  • train_set (pd.DataFrame) – A Pandas’ DataFrame with the training data
  • features (list of str) – The list of features to consider when dropping with variance
  • threshold (float) – The variance threshold. Features with variance equal to or below this threshold will be dropped.
Returns:

Return type:

log with feature variance, features to drop and final features

fklearn.tuning.parameter_tuners.grid_search_cv[source]

Runs several training functions with each run taken from the parameter space

Parameters:
  • space (dict) –

    A dictionary whose keys are model parameter names and whose values are callables that return the parameter values to try. Each callable must take no arguments and may always return a constant value. Example:

    space = {
        'learning_rate': lambda: [1e-3, 1e-2, 1e-1, 1, 10],
        'num_estimators': lambda: [20, 100, 150]
        }
    
  • train_set (pd.DataFrame) – The training set
  • param_train_fn (function(space, train_set) -> p, new_df, train_log) –

    A curried training function that is a function only of the model parameters and the training set. Example:

    @curry
    def param_train_fn(space, train_set):
        return xgb_classification_learner(features=["x"],
                                          target="target",
                                          learning_rate=space["learning_rate"],
                                          num_estimators=space["num_estimators"])(train_set)
    
  • split_fn (function(dataset) -> list of folds) –

    Partially defined split function that takes a dataset and returns a list of folds. Each fold is a Tuple of arrays. The first array in each tuple contains training indexes while the second array contains validation indexes. Examples:

    out_of_time_and_space_splitter(n_splits=n_splits,
                                   in_time_limit=in_time_limit,
                                   space_column=space_column,
                                   time_column=time_column)
    
  • eval_fn (function(dataset) -> eval_log) – A base evaluation function that returns a simple evaluation log. It can’t be a split evaluator, or the extractor won’t work. Example: auc_evaluator(target_column=”target”)
  • save_intermediary_fn (function(log) -> save to file) – Partially defined saver function that receives a log result from a tuning step and saves it into a file. Example: save_intermediary_result(save_path=’tuning.pkl’)
  • load_intermediary_fn (function(path) -> logs) – Partially defined load function that receives a path and loads previous logs from this file. Example: load_intermediary_result(‘tuning.pkl’)
  • warm_start_file (str) – File containing intermediary results for grid search. If this file is present, we will perform grid search from the last combination of parameters.
  • n_jobs (int) – Number of parallel processes to spawn when evaluating a training function
Returns:

tuning_log – A list of tuning logs, each containing a training log and a validation log.

Return type:

list of dict
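
Putting the pieces together, a minimal end-to-end sketch (train_df and the param_train_fn from the example above are assumed):

    from fklearn.tuning.parameter_tuners import grid_search_cv
    from fklearn.validation.splitters import k_fold_splitter
    from fklearn.validation.evaluators import auc_evaluator

    tuning_logs = grid_search_cv(
        space={'learning_rate': lambda: [1e-3, 1e-2, 1e-1],
               'num_estimators': lambda: [20, 100]},
        train_set=train_df,
        param_train_fn=param_train_fn,
        split_fn=k_fold_splitter(n_splits=3, random_state=42),
        eval_fn=auc_evaluator(target_column="target"),
        n_jobs=1)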

fklearn.tuning.parameter_tuners.random_search_tuner[source]

Runs several training functions with each run taken from the parameter space

Parameters:
  • space (dict) –

    A dictionary whose keys are model parameter names and whose values are callables that return the parameter values to sample. Each callable must take no arguments and may always return a constant value. Example:

    space = {
        'learning_rate': lambda: np.random.choice([1e-3, 1e-2, 1e-1, 1, 10]),
        'num_estimators': lambda: np.random.choice([20, 100, 150])
        }
    
  • train_set (pd.DataFrame) – The training set
  • param_train_fn (function(space, train_set) -> p, new_df, train_log) –

    A curried training function that is a function only of the model parameters and the training set. Example:

    @curry
    def param_train_fn(space, train_set):
        return xgb_classification_learner(features=["x"],
                                          target="target",
                                          learning_rate=space["learning_rate"],
                                          num_estimators=space["num_estimators"])(train_set)
    
  • split_fn (function(dataset) -> list of folds) –

    Partially defined split function that takes a dataset and returns a list of folds. Each fold is a Tuple of arrays. The first array in each tuple contains training indexes while the second array contains validation indexes. Examples:

    out_of_time_and_space_splitter(n_splits=n_splits,
                                   in_time_limit=in_time_limit,
                                   space_column=space_column,
                                   time_column=time_column)
    
  • eval_fn (function(dataset) -> eval_log) – A base evaluation function that returns a simple evaluation log. It can’t be a split evaluator, or the extractor won’t work. Example: auc_evaluator(target_column=”target”)
  • iterations (int) – The number of iterations to run the parameter tuner
  • random_seed (int) – Random seed
  • save_intermediary_fn (function(log) -> save to file) – Partially defined saver function that receives a log result from a tuning step and appends it into a file. Example: save_intermediary_result(save_path=’tuning.pkl’)
  • n_jobs (int) – Number of parallel processes to spawn when evaluating a training function
Returns:

tuning_log – A list of tuning logs, each containing a training log and a validation log.

Return type:

list of dict
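
Usage mirrors grid_search_cv, with the extra iterations and random_seed arguments; a minimal sketch (train_df and param_train_fn as above):

    import numpy as np
    from fklearn.tuning.parameter_tuners import random_search_tuner
    from fklearn.validation.splitters import k_fold_splitter
    from fklearn.validation.evaluators import auc_evaluator

    tuning_logs = random_search_tuner(
        space={'learning_rate': lambda: np.random.choice([1e-3, 1e-2, 1e-1]),
               'num_estimators': lambda: np.random.choice([20, 100, 150])},
        train_set=train_df,
        param_train_fn=param_train_fn,
        split_fn=k_fold_splitter(n_splits=3),
        eval_fn=auc_evaluator(target_column="target"),
        iterations=10,
        random_seed=42)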

fklearn.tuning.parameter_tuners.seed(seed=None)

Seed the generator.

This method is called when RandomState is initialized. It can be called again to re-seed the generator. For details, see RandomState.

Parameters: seed (int or 1-d array_like, optional) – Seed for RandomState. Must be convertible to 32 bit unsigned integers.

See also

RandomState()

fklearn.tuning.samplers.remove_by_feature_importance[source]

Performs feature selection based on feature importance

Parameters:
  • log (dict) – A log-like dictionary of evaluations.
  • num_removed_by_step (int (default 5)) – The number of features to remove at each step
Returns:

features – The remaining features after removing based on feature importance

Return type:

list of str

fklearn.tuning.samplers.remove_by_feature_shuffling[source]

Performs feature selection by comparing the evaluation on the test set with the evaluation on the same test set with randomly shuffled features

Parameters:
  • log (LogType) – A log-like dictionary of evaluations.
  • predict_fn (function pandas.DataFrame -> pandas.DataFrame) – A partially defined predictor that takes a DataFrame and returns the predicted score for this dataframe
  • eval_fn (function DataFrame -> log dict) – A partially defined evaluation function that takes a dataset with prediction and returns the evaluation logs.
  • eval_data (pandas.DataFrame) – Data used to evaluate the model after shuffling
  • extractor (function str -> float) – An extractor that takes a string and returns the value of that string in a dict
  • metric_name (str) – String with the name of the column that refers to the metric column to be extracted
  • max_removed_by_step (int (default 5)) – The maximum number of features to remove. It will only consider the least important max_removed_by_step features in terms of feature importance. If speed_up_by_importance=True it will first filter the least relevant features and shuffle only those. If speed_up_by_importance=False it will shuffle all features and drop the last max_removed_by_step in terms of PIMP. In both cases, the features will only be removed if the drop in performance is within the defined threshold.
  • threshold (float (default 0.005)) – Threshold for model performance comparison
  • speed_up_by_importance (bool (default True)) – Whether to narrow the search by looking at feature importance first before computing PIMP importance. If True, will only shuffle the top num_removed_by_step features in terms of feature importance.
  • parallel (bool (default False)) – Run shuffling and prediction in parallel. Only applies if speed_up_by_importance=False.
  • nthread (int (default 1)) – Number of threads to run predictions. Only applies if speed_up_by_importance=False.
  • seed (int (default 7)) – Random seed
Returns:

features – The remaining features after removing based on feature importance

Return type:

list of str

fklearn.tuning.samplers.remove_features_subsets[source]

Performs feature selection based on the best performing model out of several trained models

Parameters:
  • log_list (list of dict) – A list of log-like lists of evaluation dictionaries.
  • extractor (function string -> float) – An extractor that takes a string and returns the value of that string in a dict
  • metric_name (str) – String with the name of the column that refers to the metric column to be extracted
  • num_removed_by_step (int (default 1)) – The number of features to remove
Returns:

keys – The remaining keys of feature sets after choosing the current best subset

Return type:

list of str

fklearn.tuning.selectors.backward_subset_feature_selection(train_data: pandas.core.frame.DataFrame, param_train_fn: Callable[[pandas.core.frame.DataFrame, List[str]], Tuple[Callable[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame], pandas.core.frame.DataFrame, Dict[str, Dict[str, Any]]]], features_sets: Dict[str, List[str]], split_fn: Callable[pandas.core.frame.DataFrame, Tuple[List[Tuple[pandas.core.indexes.base.Index, List[pandas.core.indexes.base.Index]]], List[Dict[str, Any]]]], eval_fn: Callable[pandas.core.frame.DataFrame, Dict[str, Union[float, Dict]]], extractor: Callable[str, float], metric_name: str, threshold: float = 0.005, num_removed_by_step: int = 3, early_stop: int = 2, iter_limit: int = 50, min_remaining_features: int = 50, save_intermediary_fn: Callable[List[Dict[str, Union[Dict[str, Any], List[Dict[str, Any]]]]], None] = None, n_jobs: int = 1) → List[List[Dict[str, Any]]][source]

Performs train-evaluation iterations while testing the subsets of features to compute statistics about the importance of each feature category

Parameters:
  • train_data (pandas.DataFrame) – A Pandas’ DataFrame with training data
  • param_train_fn (function (pandas.DataFrame, list of str) -> prediction_function, predictions_dataset, logs) – A partially defined learning function that takes a training set and a feature list and returns a predict function, a dataset with training predictions and training logs.
  • features_sets (dict of string -> list) – Each string key in the dict maps to a subset of columns from the dataset; the function will analyse the influence of each group of features on the model’s performance.
  • split_fn (function pandas.DataFrame -> list of tuple) – Partially defined split function that takes a dataset and returns a list of folds. Each fold is a Tuple of arrays. The first array in each tuple contains training indexes while the second array contains validation indexes.
  • eval_fn (function pandas.DataFrame -> dict) – A partially defined evaluation function that takes a dataset with prediction and returns the evaluation logs.
  • extractor (function str -> float) – An extractor that takes a string and returns the value of that string in a dict
  • metric_name (str) – String with the name of the column that refers to the metric column to be extracted
  • num_removed_by_step (int (default 3)) – Number of features removed at each iteration
  • threshold (float (default 0.005)) – Threshold for model performance comparison
  • early_stop (int (default 2)) – Number of rounds without improvement before stopping process
  • iter_limit (int (default 50)) – Maximum number of iterations before stopping
  • min_remaining_features (int (default 50)) – Minimum number of features that should remain in the model; combining num_removed_by_step and iter_limit accomplishes the same functionality as this parameter.
  • save_intermediary_fn (function(log) -> save to file) – Partially defined saver function that receives a log result from a tuning step and appends it into a file. Example: save_intermediary_result(save_path=’tuning.pkl’)
  • n_jobs (int) – Number of parallel processes to spawn.
Returns:

logs – A list of log-like lists of evaluation dictionaries. Each element of the list is a validation step of the algorithm.

Return type:

list of list of dict
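
A schematic sketch (the feature groups are made up, and the extractor assumes fklearn’s evaluator_extractor together with auc_evaluator’s default eval_name of "auc_evaluator__" + target_column):

    from fklearn.tuning.selectors import backward_subset_feature_selection
    from fklearn.validation.splitters import k_fold_splitter
    from fklearn.validation.evaluators import auc_evaluator
    from fklearn.metrics.pd_extractors import evaluator_extractor

    logs = backward_subset_feature_selection(
        train_data=train_df,
        param_train_fn=param_train_fn,  # (train_set, features) -> p, new_df, log
        features_sets={"geo": ["lat", "lon"],
                       "raw": ["x1", "x2", "x3"]},
        split_fn=k_fold_splitter(n_splits=3, random_state=42),
        eval_fn=auc_evaluator(target_column="target"),
        extractor=evaluator_extractor(evaluator_name="auc_evaluator__target"),
        metric_name="auc_evaluator__target",
        num_removed_by_step=1)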

fklearn.tuning.selectors.feature_importance_backward_selection(train_data: pandas.core.frame.DataFrame, param_train_fn: Callable[[pandas.core.frame.DataFrame, List[str]], Tuple[Callable[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame], pandas.core.frame.DataFrame, Dict[str, Dict[str, Any]]]], features: List[str], split_fn: Callable[pandas.core.frame.DataFrame, Tuple[List[Tuple[pandas.core.indexes.base.Index, List[pandas.core.indexes.base.Index]]], List[Dict[str, Any]]]], eval_fn: Callable[pandas.core.frame.DataFrame, Dict[str, Union[float, Dict]]], extractor: Callable[str, float], metric_name: str, num_removed_by_step: int = 5, threshold: float = 0.005, early_stop: int = 2, iter_limit: int = 50, min_remaining_features: int = 50, save_intermediary_fn: Callable[List[Dict[str, Union[Dict[str, Any], List[Dict[str, Any]]]]], None] = None, n_jobs: int = 1) → List[List[Dict[str, Any]]][source]

Performs train-evaluation iterations while subsampling the used features to compute statistics about feature relevance

Parameters:
  • train_data (pandas.DataFrame) – A Pandas’ DataFrame with training data
  • auxiliary_columns (list of str) – List of columns from the dataset that are not used as features but are used for evaluation or cross validation. (id, date, etc)
  • param_train_fn (function (DataFrame, List of Strings) -> prediction_function, predictions_dataset, logs) – A partially defined learning function that takes a training set and a feature list and returns a predict function, a dataset with training predictions and training logs.
  • features (list of str) – Elements must be columns of the train_data
  • split_fn (function pandas.DataFrame -> list of tuple) – Partially defined split function that takes a dataset and returns a list of folds. Each fold is a Tuple of arrays. The first array in each tuple contains training indexes while the second array contains validation indexes.
  • eval_fn (function pandas.DataFrame -> dict) – A partially defined evaluation function that takes a dataset with prediction and returns the evaluation logs.
  • extractor (function str -> float) – An extractor that takes a string and returns the value of that string in a dict
  • metric_name (str) – String with the name of the column that refers to the metric column to be extracted
  • num_removed_by_step (int (default 5)) – Number of features removed at each iteration
  • threshold (float (default 0.005)) – Threshold for model performance comparison
  • early_stop (int (default 2)) – Number of rounds without improvement before stopping process
  • iter_limit (int (default 50)) – Maximum number of iterations before stopping
  • min_remaining_features (int (default 50)) – Minimum number of features that should remain in the model; combining num_removed_by_step and iter_limit accomplishes the same functionality as this parameter.
  • save_intermediary_fn (function(log) -> save to file) – Partially defined saver function that receives a log result from a tuning step and appends it into a file. Example: save_intermediary_result(save_path=’tuning.pkl’)
  • n_jobs (int) – Number of parallel processes to spawn.
Returns:

Logs – A list of log-like lists of evaluation dictionaries. Each element of the list is a validation step of the algorithm.

Return type:

list of list of dict

fklearn.tuning.selectors.poor_man_boruta_selection(train_data: pandas.core.frame.DataFrame, test_data: pandas.core.frame.DataFrame, param_train_fn: Callable[[pandas.core.frame.DataFrame, List[str]], Tuple[Callable[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame], pandas.core.frame.DataFrame, Dict[str, Dict[str, Any]]]], features: List[str], eval_fn: Callable[pandas.core.frame.DataFrame, Dict[str, Union[float, Dict]]], extractor: Callable[str, float], metric_name: str, max_removed_by_step: int = 5, threshold: float = 0.005, early_stop: int = 2, iter_limit: int = 50, min_remaining_features: int = 50, save_intermediary_fn: Callable[Dict[str, Any], None] = None, speed_up_by_importance: bool = False, parallel: bool = False, nthread: int = 1, seed: int = 7) → List[Dict[str, Any]][source]

Performs train-evaluation iterations while shuffling the used features to compute statistics about feature relevance

Parameters:
  • train_data (pandas.DataFrame) – A Pandas’ DataFrame with training data
  • test_data (pandas.DataFrame) – A Pandas’ DataFrame with test data
  • param_train_fn (function (pandas.DataFrame, list of str) -> prediction_function, predictions_dataset, logs) – A partially defined AND curried learning function that takes a training set and a feature list and returns a predict function, a dataset with training predictions and training logs.
  • features (list of str) – Elements must be columns of the train_data
  • eval_fn (function pandas.DataFrame -> dict) – A partially defined evaluation function that takes a dataset with prediction and returns the evaluation logs.
  • extractor (function str -> float) – An extractor that takes a string and returns the value of that string in a dict
  • metric_name (str) – String with the name of the column that refers to the metric column to be extracted
  • max_removed_by_step (int (default 5)) – The maximum number of features to remove. It will only consider the least important max_removed_by_step features in terms of feature importance. If speed_up_by_importance=True it will first filter the least relevant features and shuffle only those. If speed_up_by_importance=False it will shuffle all features and drop the last max_removed_by_step in terms of PIMP. In both cases, the features will only be removed if the drop in performance is within the defined threshold.
  • threshold (float (default 0.005)) – Threshold for model performance comparison
  • early_stop (int (default 2)) – Number of rounds without improvement before stopping process
  • iter_limit (int (default 50)) – Maximum number of iterations before stopping
  • min_remaining_features (int (default 50)) – Minimum number of features that should remain in the model; combining num_removed_by_step and iter_limit accomplishes the same functionality as this parameter.
  • save_intermediary_fn (function(log) -> save to file) – Partially defined saver function that receives a log result from a tuning step and appends it into a file. Example: save_intermediary_result(save_path=’tuning.pkl’)
  • speed_up_by_importance (bool (default False)) – Whether to narrow the search by looking at feature importance first before computing PIMP importance. If True, will only shuffle the top num_removed_by_step features in terms of feature importance.
  • max_removed_by_step – If speed_up_by_importance=False, this limits the number of features dropped per iteration. It will only drop the max_removed_by_step features that decrease the metric the least when dropped.
  • parallel (bool (default False)) – Run shuffling and prediction in parallel. Only applies if speed_up_by_importance=False.
  • nthread (int (default 1)) – Number of threads to run predictions. Only applies if speed_up_by_importance=False.
  • seed (int (default 7)) – Random seed for consistency.
Returns:

logs – A list of log-like lists of evaluation dictionaries. Each element of the list is a validation step of the algorithm.

Return type:

list of list of dict

fklearn.tuning.stoppers.aggregate_stop_funcs(*stop_funcs) → Callable[List[List[Dict[str, Any]]], bool][source]

Aggregate stop functions

Parameters: stop_funcs (list of function list of dict -> bool) –
Returns: l – A function that performs the OR logic of all stop functions applied to the logs
Return type: function logs -> bool
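
For illustration, a small sketch combining two of the stoppers documented below:

    from fklearn.tuning.stoppers import (aggregate_stop_funcs,
                                         stop_by_iter_num,
                                         stop_by_num_features)

    # fires when either 10 iterations have run or fewer than 20 features remain
    stop_fn = aggregate_stop_funcs(stop_by_iter_num(iter_limit=10),
                                   stop_by_num_features(min_num_features=20))
    # stop_fn(logs) -> bool
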
fklearn.tuning.stoppers.stop_by_iter_num[source]

Checks the logs to see if feature selection should stop

Parameters:
  • logs (list of list of dict) – A list of log-like lists of evaluation dictionaries.
  • iter_limit (int (default 50)) – Limit of Iterations
Returns:

stop – A boolean indicating whether to stop the recursion or not

Return type:

bool

fklearn.tuning.stoppers.stop_by_no_improvement[source]

Checks the logs to see if feature selection should stop

Parameters:
  • logs (list of list of dict) – A list of log-like lists of evaluation dictionaries.
  • extractor (function str -> float) – An extractor that takes a string and returns the value of that string in a dict
  • metric_name (str) – String with the name of the column that refers to the metric column to be extracted
  • early_stop (int (default 3)) – Number of iterations without improvement before stopping
  • threshold (float (default 0.001)) – Threshold for model performance comparison
Returns:

stop – A boolean indicating whether to stop the recursion or not

Return type:

bool

fklearn.tuning.stoppers.stop_by_no_improvement_parallel[source]

Checks the logs to see if feature selection should stop

Parameters:
  • logs (list of list of dict) – A list of log-like lists of evaluation dictionaries.
  • extractor (function str -> float) – An extractor that takes a string and returns the value of that string in a dict
  • metric_name (str) – String with the name of the column that refers to the metric column to be extracted
  • early_stop (int (default 3)) – Number of iterations without improvements before stopping
  • threshold (float (default 0.001)) – Threshold for model performance comparison
Returns:

stop – A boolean indicating whether to stop the recursion or not

Return type:

bool

fklearn.tuning.stoppers.stop_by_num_features[source]

Checks the logs to see if feature selection should stop

Parameters:
  • logs (list of list of dict) – A list of log-like lists of evaluation dictionaries.
  • min_num_features (int (default 50)) – The minimum number of features the model can have before stopping
Returns:

stop – A boolean indicating whether to stop the recursion or not

Return type:

bool

fklearn.tuning.stoppers.stop_by_num_features_parallel[source]

Selects the best log out of a list to see if feature selection should stop

Parameters:
  • logs (list of list of list of dict) – A list of log-like lists of evaluation dictionaries.
  • extractor (function str -> float) – An extractor that takes a string and returns the value of that string in a dict
  • metric_name (str) – String with the name of the column that refers to the metric column to be extracted
  • min_num_features (int (default 50)) – The minimum number of features the model can have before stopping
Returns:

stop – A boolean indicating whether to stop the recursion or not

Return type:

bool

fklearn.validation.evaluators.auc_evaluator[source]

Computes the ROC AUC score, given true label and prediction scores.

Parameters:
  • test_data (Pandas' DataFrame) – A Pandas’ DataFrame with target and prediction scores.
  • prediction_column (String) – The name of the column in test_data with the prediction scores.
  • target_column (String) – The name of the column in test_data with the binary target.
  • eval_name (String, optional (default=None)) – the name of the evaluator as it will appear in the logs.
Returns:

log – A log-like dictionary with the ROC AUC Score

Return type:

dict
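
For illustration, a minimal sketch (by default the log key is "auc_evaluator__" + target_column):

    import pandas as pd
    from fklearn.validation.evaluators import auc_evaluator

    scored_df = pd.DataFrame({"prediction": [0.2, 0.9, 0.4, 0.7],
                              "target": [0, 1, 0, 1]})

    eval_fn = auc_evaluator(prediction_column="prediction",
                            target_column="target")
    log = eval_fn(scored_df)  # e.g. {"auc_evaluator__target": 1.0}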

fklearn.validation.evaluators.brier_score_evaluator[source]

Computes the Brier score, given true label and prediction scores.

Parameters:
  • test_data (Pandas' DataFrame) – A Pandas’ DataFrame with target and prediction scores.
  • prediction_column (String) – The name of the column in test_data with the prediction scores.
  • target_column (String) – The name of the column in test_data with the binary target.
  • eval_name (String, optional (default=None)) – The name of the evaluator as it will appear in the logs.
Returns:

log – A log-like dictionary with the Brier score.

Return type:

dict

fklearn.validation.evaluators.combined_evaluators[source]

Combines partially applied evaluation functions.

Parameters:
  • test_data (Pandas' DataFrame) – A Pandas’ DataFrame to apply the evaluators on
  • evaluators (List) – List of evaluator functions
Returns:

log – A log-like dictionary combining the results of all evaluators

Return type:

dict

fklearn.validation.evaluators.correlation_evaluator[source]

Computes the Pearson correlation between prediction and target.

Parameters:
  • test_data (Pandas' DataFrame) – A Pandas’ DataFrame with target and prediction.
  • prediction_column (String) – The name of the column in test_data with the prediction.
  • target_column (String) – The name of the column in test_data with the continuous target.
  • eval_name (String, optional (default=None)) – the name of the evaluator as it will appear in the logs.
Returns:

log – A log-like dictionary with the Pearson correlation

Return type:

dict

fklearn.validation.evaluators.expected_calibration_error_evaluator[source]

Computes the expected calibration error (ECE), given true label and prediction scores. See “On Calibration of Modern Neural Networks”(https://arxiv.org/abs/1706.04599) for more information.

The ECE is the distance between the actual observed frequencies and the predicted probabilities, for a given choice of bins.

Perfect calibration results in a score of 0.

For example, if for the bin [0, 0.1] we have the three data points:
  1. prediction: 0.1, actual: 0
  2. prediction: 0.05, actual: 1
  3. prediction: 0.0, actual: 0

Then the predicted average is (0.1 + 0.05 + 0.00)/3 = 0.05, and the empirical frequency is (0 + 1 + 0)/3 = 1/3. Therefore, the distance for this bin is:

|1/3 - 0.05| ~= 0.28.

Graphical intuition:

Actuals (empirical frequency between 0 and 1)
|     *
|   *
| *
 ______ Predictions (probabilities between 0 and 1)
Parameters:
  • test_data (Pandas' DataFrame) – A Pandas’ DataFrame with target and prediction scores.
  • prediction_column (String) – The name of the column in test_data with the prediction scores.
  • target_column (String) – The name of the column in test_data with the binary target.
  • eval_name (String, optional (default=None)) – The name of the evaluator as it will appear in the logs.
  • n_bins (Int (default=100)) – The number of bins. This is a trade-off between the number of points in each bin and the probability range they span. You want a small enough range that still contains a significant number of points for the distance to work.
  • bin_choice (String (default="count")) – Two possibilities: “count” for equally populated bins (e.g. uses pandas.qcut for the bins) “prob” for equally spaced probabilities (e.g. uses pandas.cut for the bins), with distance weighed by the number of samples in each bin.
Returns:

log – A log-like dictionary with the expected calibration error.

Return type:

dict

fklearn.validation.evaluators.fbeta_score_evaluator[source]

Computes the F-beta score, given true label and prediction scores.

Parameters:
  • test_data (pandas.DataFrame) – A Pandas’ DataFrame with target and prediction scores.
  • threshold (float) –
    A threshold for the prediction column above which samples
    will be classified as 1
  • beta (float) – The beta parameter determines the weight of precision in the combined score. beta < 1 lends more weight to precision, while beta > 1 favors recall (beta -> 0 considers only precision, beta -> inf only recall).
  • prediction_column (str) – The name of the column in test_data with the prediction scores.
  • target_column (str) – The name of the column in test_data with the binary target.
  • eval_name (str, optional (default=None)) – the name of the evaluator as it will appear in the logs.
Returns:

log – A log-like dictionary with the F-beta Score

Return type:

dict

fklearn.validation.evaluators.generic_sklearn_evaluator(name_prefix: str, sklearn_metric: Callable[..., float]) → Callable[..., Dict[str, Union[float, Dict]]][source]

Returns an evaluator built from a metric in sklearn.metrics

Parameters:
  • name_prefix (str) – The default name of the evaluator will be name_prefix + target_column.
  • sklearn_metric (Callable) – Metric function from sklearn.metrics. It should take as parameters y_true, y_score, kwargs.
Returns:

eval_fn – An evaluator function that uses the provided metric

Return type:

Callable
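
For illustration, a sketch building an average-precision evaluator (assuming the returned evaluator exposes the same prediction_column/target_column interface as the other evaluators):

    import pandas as pd
    from sklearn.metrics import average_precision_score
    from fklearn.validation.evaluators import generic_sklearn_evaluator

    ap_evaluator = generic_sklearn_evaluator("avg_precision_evaluator__",
                                             average_precision_score)

    scored_df = pd.DataFrame({"prediction": [0.2, 0.9], "target": [0, 1]})
    log = ap_evaluator(scored_df,
                       prediction_column="prediction",
                       target_column="target")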

fklearn.validation.evaluators.hash_evaluator[source]

Computes the hash of a pandas dataframe, filtered by hash columns. The purpose is to uniquely identify a dataframe, to be able to check if two dataframes are equal or not.

Parameters:
  • test_data (Pandas' DataFrame) – A Pandas’ DataFrame to be hashed.
  • hash_columns (List[str], optional (default=None)) – A list of column names to filter the dataframe before hashing. If None, it will hash the dataframe with all the columns
  • eval_name (String, optional (default=None)) – the name of the evaluator as it will appear in the logs.
  • consider_index (bool, optional (default=False)) – If true, will consider the index of the dataframe to calculate the hash. The default behaviour will ignore the index and just hash the content of the features.
Returns:

log – A log-like dictionary with the hash of the dataframe

Return type:

dict

fklearn.validation.evaluators.logloss_evaluator[source]

Computes the logloss score, given true label and prediction scores.

Parameters:
  • test_data (Pandas' DataFrame) – A Pandas’ DataFrame with target and prediction scores.
  • prediction_column (String) – The name of the column in test_data with the prediction scores.
  • target_column (String) – The name of the column in test_data with the binary target.
  • eval_name (String, optional (default=None)) – the name of the evaluator as it will appear in the logs.
Returns:

log – A log-like dictionary with the logloss score.

Return type:

dict

fklearn.validation.evaluators.mean_prediction_evaluator[source]

Computes mean for the specified column.

Parameters:
  • test_data (Pandas' DataFrame) – A Pandas’ DataFrame with a column to compute the mean
  • prediction_column (String) – The name of the column in test_data to compute the mean.
  • eval_name (String, optional (default=None)) – the name of the evaluator as it will appear in the logs.
Returns:

log – A log-like dictionary with the column mean

Return type:

dict

fklearn.validation.evaluators.mse_evaluator[source]

Computes the Mean Squared Error, given true label and predictions.

Parameters:
  • test_data (Pandas' DataFrame) – A Pandas’ DataFrame with target and predictions.
  • prediction_column (String) – The name of the column in test_data with the predictions.
  • target_column (String) – The name of the column in test_data with the continuous target.
  • eval_name (String, optional (default=None)) – the name of the evaluator as it will appear in the logs.
Returns:

log – A log-like dictionary with the MSE Score

Return type:

dict

fklearn.validation.evaluators.permutation_evaluator[source]

Permutation importance evaluator. It works by shuffling one or more features on the test_data dataframe, getting the predictions with predict_fn, and evaluating the results with eval_fn.

Parameters:
  • test_data (Pandas' DataFrame) – A Pandas’ DataFrame with target, predictions and features.
  • predict_fn (function DataFrame -> DataFrame) – Function that receives the input dataframe and returns a dataframe with the pipeline predictions.
  • eval_fn (function DataFrame -> Log Dict) – A partially applied evaluation function.
  • baseline (bool) – Also evaluates the predict_fn on an unshuffled baseline.
  • features (List of strings) – The features to shuffle and then evaluate eval_fn on the shuffled results. The default case shuffles all dataframe columns.
  • shuffle_all_at_once (bool) – Shuffle all features at once instead of one per turn.
  • random_state (int) – Seed to be used by the random number generator.
  • eval_name (String, optional (default=None)) – the name of the evaluator as it will appear in the logs.
Returns:

log – A log-like dictionary with evaluation results by feature shuffle. Use the permutation_extractor for better visualization of the results.

Return type:

dict

fklearn.validation.evaluators.precision_evaluator[source]

Computes the precision score, given true label and prediction scores.

Parameters:
  • test_data (pandas.DataFrame) – A Pandas’ DataFrame with target and prediction scores.
  • threshold (float) –
    A threshold for the prediction column above which samples
    will be classified as 1
  • prediction_column (str) – The name of the column in test_data with the prediction scores.
  • target_column (str) – The name of the column in test_data with the binary target.
  • eval_name (str, optional (default=None)) – the name of the evaluator as it will appear in the logs.
Returns:

log – A log-like dictionary with the Precision Score

Return type:

dict

fklearn.validation.evaluators.r2_evaluator[source]

Computes the R2 score, given true label and predictions.

Parameters:
  • test_data (Pandas' DataFrame) – A Pandas’ DataFrame with target and prediction.
  • prediction_column (String) – The name of the column in test_data with the prediction.
  • target_column (String) – The name of the column in test_data with the continuous target.
  • eval_name (String, optional (default=None)) – the name of the evaluator as it will appear in the logs.
Returns:

log – A log-like dictionary with the R2 Score

Return type:

dict

fklearn.validation.evaluators.recall_evaluator[source]

Computes the recall score, given true label and prediction scores.

Parameters:
  • test_data (pandas.DataFrame) – A Pandas’ DataFrame with target and prediction scores.
  • threshold (float) –
    A threshold for the prediction column above which samples
    will be classified as 1
  • prediction_column (str) – The name of the column in test_data with the prediction scores.
  • target_column (str) – The name of the column in test_data with the binary target.
  • eval_name (str, optional (default=None)) – the name of the evaluator as it will appear in the logs.
Returns:

log – A log-like dictionary with the Recall Score

Return type:

dict

fklearn.validation.evaluators.spearman_evaluator[source]

Computes the Spearman correlation between prediction and target. The Spearman correlation evaluates the rank order between two variables: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient

Parameters:
  • test_data (Pandas' DataFrame) – A Pandas’ DataFrame with target and prediction.
  • prediction_column (String) – The name of the column in test_data with the prediction.
  • target_column (String) – The name of the column in test_data with the continuous target.
  • eval_name (String, optional (default=None)) – the name of the evaluator as it will appear in the logs.
Returns:

log – A log-like dictionary with the Spearman correlation

Return type:

dict

fklearn.validation.evaluators.split_evaluator[source]

Splits the dataset into the categories in split_col and evaluates model performance in each split. Useful when you believe the model performance differs across subpopulations defined by split_col.

Parameters:
  • test_data (Pandas' DataFrame) – A Pandas’ DataFrame with target and predictions.
  • eval_fn (function DataFrame -> Log Dict) – A partially applied evaluation function.
  • split_col (String) – The name of the column in test_data to split by.
  • split_values (Array, optional (default=None)) – An Array to split by. If not provided, test_data[split_col].unique() will be used.
  • eval_name (String, optional (default=None)) – the name of the evaluator as it will appear in the logs.
Returns:

log – A log-like dictionary with evaluation results by split.

Return type:

dict
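
For illustration, a sketch splitting an AUC evaluation by a made-up "region" column:

    import pandas as pd
    from fklearn.validation.evaluators import auc_evaluator, split_evaluator

    scored_df = pd.DataFrame({"prediction": [0.2, 0.9, 0.4, 0.7],
                              "target": [0, 1, 0, 1],
                              "region": ["north", "north", "south", "south"]})

    region_eval = split_evaluator(eval_fn=auc_evaluator(target_column="target"),
                                  split_col="region")
    log = region_eval(scored_df)  # one AUC entry per distinct region value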

fklearn.validation.evaluators.temporal_split_evaluator[source]

Splits the dataset into the temporal categories by time_col and evaluates model performance in each split.

The splits are implicitly defined by the time_format. For example, for the default time format (“%Y-%m”), we will split by year and month.

Parameters:
  • test_data (Pandas' DataFrame) – A Pandas’ DataFrame with target and predictions.
  • eval_fn (function DataFrame -> Log Dict) – A partially applied evaluation function.
  • time_col (string) – The name of the column in test_data to split by.
  • time_format (string) – The way to format the time_col into temporal categories.
  • split_values (Array of string, optional (default=None)) – An array of date formatted strings to split the evaluation by. If not provided, all unique formatted dates will be used.
  • eval_name (String, optional (default=None)) – the name of the evaluator as it will appear in the logs.
Returns:

log – A log-like dictionary with evaluation results by split.

Return type:

dict

fklearn.validation.splitters.forward_stability_curve_time_splitter[source]

Splits the data into temporal buckets with both the training and testing folds moving forward. The folds move forward by a fixed timedelta step. Optionally, there can be a gap between the end of the training period and the start of the holdout period.

Similar to the stability curve time splitter, with the difference that the training period also moves forward with each fold.

The clearest use case is to evaluate a periodic re-training framework.

Parameters:
  • train_data (pandas.DataFrame) – A Pandas’ DataFrame that will be split for stability curve estimation.
  • training_time_start (datetime.datetime or str) – Date for the start of the training period. If move_training_start_with_steps is True, each step will increase this date by step.
  • training_time_end (datetime.datetime or str) – Date for the end of the training period. Each step increases this date by step.
  • time_column (str) – The name of the Date column of train_data.
  • holdout_gap (datetime.timedelta) – Timedelta of the gap between the end of the training period and the start of the validation period.
  • holdout_size (datetime.timedelta) – Timedelta of the range between the start and the end of the holdout period.
  • step (datetime.timedelta) – Timedelta that shifts both the training period and the holdout period by this value.
  • move_training_start_with_steps (bool) – If True, the training start date will increase by step for each fold. If False, the training start date remains fixed at the training_time_start value.
Returns:

  • Folds (list of tuples) – A list of folds. Each fold is a Tuple of arrays. The first array in each tuple contains training indexes while the second array contains validation indexes.
  • logs (list of dict) – A list of logs, one for each fold

fklearn.validation.splitters.k_fold_splitter[source]

Makes K random train/test split folds for cross validation. The folds are made so that every sample is used at least once for evaluating and K-1 times for training.

If stratified is set to True, the split preserves the distribution of stratify_column

Parameters:
  • train_data (pandas.DataFrame) – A Pandas’ DataFrame that will be split into K-Folds for cross validation.
  • n_splits (int) – The number of folds K for the K-Fold cross validation strategy.
  • random_state (int) – Seed to be used by the random number generator.
  • stratify_column (string) – Column name in train_data to be used for stratified split.
Returns:

  • Folds (list of tuples) – A list of folds. Each fold is a Tuple of arrays. The first array in each tuple contains training indexes while the second array contains validation indexes.
  • logs (list of dict) – A list of logs, one for each fold
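
For illustration, a minimal sketch (train_df is assumed):

    from fklearn.validation.splitters import k_fold_splitter

    folds, logs = k_fold_splitter(train_df, n_splits=5, random_state=42)
    # or partially applied, to be passed as a split_fn:
    split_fn = k_fold_splitter(n_splits=5, random_state=42)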

fklearn.validation.splitters.out_of_time_and_space_splitter[source]

Makes K grouped train/test split folds for cross validation. The folds are made so that every ID is used at least once for evaluating and K-1 times for training. Also, for each fold, evaluation will always be out-of-ID and out-of-time.

Parameters:
  • train_data (pandas.DataFrame) – A Pandas’ DataFrame that will be split into K out-of-time and ID folds for cross validation.
  • n_splits (int) – The number of folds K for the K-Fold cross validation strategy.
  • in_time_limit (str or datetime.datetime) – A String representing the end time of the training data. It should be in the same format as the Date column in train_data.
  • time_column (str) – The name of the Date column of train_data.
  • space_column (str) – The name of the ID column of train_data.
  • holdout_gap (datetime.timedelta) – Timedelta of the gap between the end of the training period and the start of the validation period.
Returns:

  • Folds (list of tuples) – A list of folds. Each fold is a Tuple of arrays. The first array in each tuple contains training indexes while the second array contains validation indexes.
  • logs (list of dict) – A list of logs, one for each fold

fklearn.validation.splitters.reverse_time_learning_curve_splitter[source]

Splits the data into temporal buckets given by the specified frequency. Uses a fixed out-of-ID and time hold out set for every fold. Training size increases per fold, with less recent data being added in each fold. Useful for inverse learning curve validation, that is, for seeing how hold out performance increases as the training size increases with less recent data.

Parameters:
  • train_data (pandas.DataFrame) – A Pandas’ DataFrame that will be split for inverse learning curve estimation.
  • time_column (str) – The name of the Date column of train_data.
  • training_time_limit (str) – The Date String for the end of the training period. Should be of the same format as time_column.
  • lower_time_limit (str) – A Date String for the beginning of the training period. This allows limiting the learning curve from below, avoiding heavy computation with very old data.
  • freq (str) – The temporal frequency. See: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
  • holdout_gap (datetime.timedelta) – Timedelta of the gap between the end of the training period and the start of the validation period.
  • min_samples (int) – The minimum number of samples required in the split to keep the split.
Returns:

  • Folds (list of tuples) – A list of folds. Each fold is a Tuple of arrays. The first array in each tuple contains training indexes while the second array contains validation indexes.
  • logs (list of dict) – A list of logs, one for each fold

fklearn.validation.splitters.spatial_learning_curve_splitter[source]

Splits the data for a spatial learning curve. Progressively adds more and more examples to the training in order to verify the impact of having more data available on a validation set.

The validation set starts after the training set, with an optional time gap.

Similar to the temporal learning curves, but with spatial increases in the training set.

Parameters:
  • train_data (pandas.DataFrame) – A Pandas’ DataFrame that will be split for learning curve estimation.
  • space_column (str) – The name of the ID column of train_data.
  • time_column (str) – The name of the temporal column of train_data.
  • training_limit (datetime or str) – The date limiting the training (after which the holdout begins).
  • holdout_gap (timedelta) – The gap between the end of training and the start of the holdout. If you have censored data, use a gap similar to the censor time.
  • train_percentages (list or tuple of floats) – A list containing the percentages of IDs to use in the training. Defaults to (0.25, 0.5, 0.75, 1.0). For example: For the default value, there would be four model trainings, containing respectively 25%, 50%, 75%, and 100% of the IDs that are not part of the held out set.
  • random_state (int) – A seed for the random number generator that shuffles the IDs.
Returns:

  • Folds (list of tuples) – A list of folds. Each fold is a Tuple of arrays. The first array in each tuple contains training indexes while the second array contains validation indexes.
  • logs (list of dict) – A list of logs, one for each fold

fklearn.validation.splitters.stability_curve_time_in_space_splitter[source]

Splits the data into temporal buckets given by the specified frequency. Training set is fixed before hold out and uses a rolling window hold out set. Each fold moves the hold out further into the future. Useful to see how model performance degrades as the training data gets more outdated. Folds are made so that ALL IDs in the holdout also appear in the training set.

Parameters:
  • train_data (pandas.DataFrame) – A Pandas’ DataFrame that will be split for stability curve estimation.
  • training_time_limit (str) – The Date String for the end of the training period. Should be of the same format as time_column.
  • space_column (str) – The name of the ID column of train_data.
  • time_column (str) – The name of the Date column of train_data.
  • freq (str) – The temporal frequency. See: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
  • space_hold_percentage (float (default=0.5)) – The proportion of hold out IDs.
  • random_state (int) – A seed for the random number generator for ID sampling across train and hold out sets.
  • min_samples (int) – The minimum number of samples required in the split to keep the split.
Returns:

  • Folds (list of tuples) – A list of folds. Each fold is a Tuple of arrays. The first array in each tuple contains training indexes while the second array contains validation indexes.
  • logs (list of dict) – A list of logs, one for each fold

fklearn.validation.splitters.stability_curve_time_space_splitter[source]

Splits the data into temporal buckets given by the specified frequency. Training set is fixed before hold out and uses a rolling window hold out set. Each fold moves the hold out further into the future. Useful to see how model performance degrades as the training data gets more outdated. Folds are made so that NONE of the IDs in the holdout appears in the training set.

Parameters:
  • train_data (pandas.DataFrame) – A Pandas’ DataFrame that will be split for stability curve estimation.
  • training_time_limit (str) – The Date String for the end of the training period. Should be of the same format as time_column.
  • space_column (str) – The name of the ID column of train_data
  • time_column (str) – The name of the Date column of train_data
  • freq (str) – The temporal frequency. See: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
  • space_hold_percentage (float) – The proportion of hold out IDs
  • random_state (int) – A seed for the random number generator for ID sampling across train and hold out sets.
  • min_samples (int) – The minimum number of samples required in the split to keep the split.
Returns:

  • Folds (list of tuples) – A list of folds. Each fold is a Tuple of arrays. The first array in each tuple contains training indexes while the second array contains validation indexes.
  • logs (list of dict) – A list of logs, one for each fold

fklearn.validation.splitters.stability_curve_time_splitter[source]

Splits the data into temporal buckets given by the specified frequency. Training set is fixed before hold out and uses a rolling window hold out set. Each fold moves the hold out further into the future. Useful to see how model performance degrades as the training data gets more outdated. Training and holdout sets can have the same IDs.

Parameters:
  • train_data (pandas.DataFrame) – A Pandas’ DataFrame that will be split for stability curve estimation.
  • training_time_limit (str) – The Date String for the end of the training period. Should be of the same format as time_column.
  • time_column (str) – The name of the Date column of train_data.
  • freq (str) – The temporal frequency. See: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
  • min_samples (int) – The minimum number of samples required in a split to keep it.
Returns:

  • Folds (list of tuples) – A list of folds. Each fold is a Tuple of arrays. The first array in each tuple contains training indexes while the second array contains validation indexes.
  • logs (list of dict) – A list of logs, one for each fold

fklearn.validation.splitters.time_and_space_learning_curve_splitter[source]

Splits the data into temporal buckets given by the specified frequency. Uses a fixed out-of-ID and time hold out set for every fold. Training size increases per fold, with more recent data being added in each fold. Useful for learning curve validation, that is, for seeing how hold out performance increases as the training size increases with more recent data.

Parameters:
  • train_data (pandas.DataFrame) – A Pandas’ DataFrame that will be split for learning curve estimation.
  • training_time_limit (str) – The Date String for the end of the training period. Should be of the same format as time_column.
  • space_column (str) – The name of the ID column of train_data.
  • time_column (str) – The name of the Date column of train_data.
  • freq (str) – The temporal frequency. See: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
  • space_hold_percentage (float) – The proportion of hold out IDs.
  • holdout_gap (datetime.timedelta) – Timedelta of the gap between the end of the training period and the start of the validation period.
  • random_state (int) – A seed for the random number generator for ID sampling across train and hold out sets.
  • min_samples (int) – The minimum number of samples required in the split to keep the split.
Returns:

  • Folds (list of tuples) – A list of folds. Each fold is a Tuple of arrays. The first array in each tuple contains training indexes while the second array contains validation indexes.
  • logs (list of dict) – A list of logs, one for each fold

fklearn.validation.splitters.time_learning_curve_splitter[source]

Splits the data into temporal buckets given by the specified frequency.

Uses a fixed out-of-ID and time hold out set for every fold. Training size increases per fold, with more recent data being added in each fold. Useful for learning curve validation, that is, for seeing how hold out performance increases as the training size increases with more recent data.

Parameters:
  • train_data (pandas.DataFrame) – A Pandas’ DataFrame that will be split for learning curve estimation.
  • training_time_limit (str) – The Date String for the end of the training period. Should be of the same format as time_column.
  • time_column (str) – The name of the Date column of train_data.
  • freq (str) – The temporal frequency. See: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
  • holdout_gap (datetime.timedelta) – Timedelta of the gap between the end of the training period and the start of the validation period.
  • min_samples (int) – The minimum number of samples required in the split to keep the split.
Returns:

  • Folds (list of tuples) – A list of folds. Each fold is a Tuple of arrays. The first array in each tuple contains training indexes while the second array contains validation indexes.
  • logs (list of dict) – A list of logs, one for each fold

fklearn.validation.validator.parallel_validator[source]

Splits the training data into folds given by the split function and performs a train-evaluation sequence on each fold. Tries to run each fold in parallel using up to n_jobs processes.

Parameters:
  • train_data (pandas.DataFrame) – A Pandas’ DataFrame with training data
  • split_fn (function pandas.DataFrame -> list of tuple) – Partially defined split function that takes a dataset and returns a list of folds. Each fold is a Tuple of arrays. The first array in each tuple contains training indexes while the second array contains validation indexes.
  • train_fn (function pandas.DataFrame -> prediction_function, predictions_dataset, logs) – A partially defined learning function that takes a training set and returns a predict function, a dataset with training predictions and training logs.
  • eval_fn (function pandas.DataFrame -> dict) – A partially defined evaluation function that takes a dataset with prediction and returns the evaluation logs.
  • n_jobs (int) – Number of parallel processes to spawn.
  • predict_oof (bool) – Whether to return out of fold predictions on the logs
Returns:

Return type:

A list of log-like dictionary evaluations.

fklearn.validation.validator.validator[source]

Splits the training data into folds given by the split function and performs a train-evaluation sequence on each fold by calling validator_iteration.

Parameters:
  • train_data (pandas.DataFrame) – A Pandas’ DataFrame with training data
  • split_fn (function pandas.DataFrame -> list of tuple) – Partially defined split function that takes a dataset and returns a list of folds. Each fold is a Tuple of arrays. The first array in each tuple contains training indexes while the second array contains validation indexes.
  • train_fn (function pandas.DataFrame -> prediction_function, predictions_dataset, logs) – A partially defined learning function that takes a training set and returns a predict function, a dataset with training predictions and training logs.
  • eval_fn (function pandas.DataFrame -> dict) – A partially defined evaluation function that takes a dataset with prediction and returns the evaluation logs.
  • perturb_fn_train (PerturbFnType) – A partially defined corruption function that takes a dataset and returns a corrupted dataset. Perturbation applied at train-time.
  • perturb_fn_test (PerturbFnType) – A partially defined corruption function that takes a dataset and returns a corrupted dataset. Perturbation applied at test-time.
  • predict_oof (bool) – Whether to return out of fold predictions on the logs
Returns:

Return type:

A list of log-like dictionary evaluations.
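
For illustration, a minimal end-to-end sketch tying a learner, a splitter and an evaluator together (train_df and the column names are assumed):

    from fklearn.training.classification import logistic_classification_learner
    from fklearn.validation.validator import validator
    from fklearn.validation.splitters import k_fold_splitter
    from fklearn.validation.evaluators import auc_evaluator

    train_fn = logistic_classification_learner(features=["x1", "x2"],
                                               target="target")

    validation_logs = validator(train_data=train_df,
                                split_fn=k_fold_splitter(n_splits=3,
                                                         random_state=42),
                                train_fn=train_fn,
                                eval_fn=auc_evaluator(target_column="target"))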

fklearn.validation.validator.validator_iteration(data: pandas.core.frame.DataFrame, train_index: pandas.core.indexes.base.Index, test_indexes: pandas.core.indexes.base.Index, fold_num: int, train_fn: Callable[pandas.core.frame.DataFrame, Tuple[Callable[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame], pandas.core.frame.DataFrame, Dict[str, Dict[str, Any]]]], eval_fn: Callable[pandas.core.frame.DataFrame, Dict[str, Union[float, Dict]]], predict_oof: bool = False) → Dict[str, Any][source]

Perform an iteration of train test split, training and evaluation.

Parameters:
  • data (pandas.DataFrame) – A Pandas’ DataFrame with training and testing subsets
  • train_index (numpy.Array) – The index of the training subset of data.
  • test_indexes (list of numpy.Array) – A list of indexes of the testing subsets of data.
  • fold_num (int) – The number of the fold in the current iteration
  • train_fn (function pandas.DataFrame -> prediction_function, predictions_dataset, logs) – A partially defined learning function that takes a training set and returns a predict function, a dataset with training predictions and training logs.
  • eval_fn (function pandas.DataFrame -> dict) – A partially defined evaluation function that takes a dataset with prediction and returns the evaluation logs.
  • predict_oof (bool) – Whether to return out of fold predictions on the logs
Returns:

Return type:

A log-like dictionary of evaluations.