fklearn.training package¶

Submodules¶

fklearn.training.calibration module¶

fklearn.training.calibration.find_thresholds_with_same_risk[source]¶

Calculates a fair calibration in which, within each band, every sensitive factor group has the same target mean.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame with features and target columns. The model will be trained to predict the target column from the features.
sensitive_factor (str) – Column holding the group classifications that should all have the same target mean.
unfair_band_column (str) – Column with the original bands
model_prediction_output (str) – Risk model’s output
target_column (str) – The name of the column in df that should be used as target for the model. This column should be binary, since this is a classification model.
output_column_name (str) – The name of the column with the fair bins.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the find_thresholds_with_same_risk model.
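
The property described above can be checked with a short sketch. This is not fklearn code: plain lists of dicts stand in for a DataFrame, and the helper name `target_mean_by_band_and_group` and the sample rows are hypothetical. It only computes the target mean for each band and sensitive factor group, which is the quantity that should match across groups after fair calibration.

```python
from collections import defaultdict


def target_mean_by_band_and_group(rows, band_column, sensitive_factor, target_column):
    """Return {(band, group): mean target} for a list of dict rows."""
    totals = defaultdict(lambda: [0.0, 0])  # (band, group) -> [sum, count]
    for row in rows:
        key = (row[band_column], row[sensitive_factor])
        totals[key][0] += row[target_column]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical rows that already carry a fair band assignment.
rows = [
    {"fair_band": 1, "group": "a", "target": 0},
    {"fair_band": 1, "group": "a", "target": 1},
    {"fair_band": 1, "group": "b", "target": 1},
    {"fair_band": 1, "group": "b", "target": 0},
]
means = target_mean_by_band_and_group(rows, "fair_band", "group", "target")
```

If the bands are fair in the sense above, all groups within the same band map to the same mean.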

fklearn.training.calibration.isotonic_calibration_learner[source]¶

Fits a single feature isotonic regression to the dataset.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame with features and target columns. The model will be trained to predict the target column from the features.
target_column (str) – The name of the column in df that should be used as target for the model. This column should be binary, since this is a classification model.
prediction_column (str) – The name of the column with the uncalibrated predictions from the model.
output_column (str) – The name of the column with the calibrated predictions from the model.
y_min (float) – Lower bound of Isotonic Regression
y_max (float) – Upper bound of Isotonic Regression

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the Isotonic Calibration model.
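
As a rough illustration of what a single-feature isotonic fit does, the sketch below implements the pool-adjacent-violators idea in plain Python and wraps it in the (p, new_df, log) shape described above. It is an assumption-laden sketch, not the library's implementation: lists of dicts replace DataFrames, the names `pava` and `isotonic_calibration_sketch` are hypothetical, scores are assumed distinct, and new scores are mapped with a step lookup rather than interpolation.

```python
from bisect import bisect_right


def pava(values):
    """Pool adjacent violators: non-decreasing least-squares fit of a sequence."""
    blocks = []  # each block is [sum, count]
    for value in values:
        blocks.append([value, 1])
        # Merge backwards while the previous block mean exceeds the last one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


def isotonic_calibration_sketch(rows, target_column, prediction_column,
                                output_column, y_min=0.0, y_max=1.0):
    ordered = sorted(rows, key=lambda r: r[prediction_column])
    xs = [r[prediction_column] for r in ordered]
    fitted = pava([r[target_column] for r in ordered])
    ys = [min(max(y, y_min), y_max) for y in fitted]  # apply the bounds

    def p(new_rows):
        out = []
        for r in new_rows:
            # Step lookup: use the fitted value of the closest score at or below.
            i = max(bisect_right(xs, r[prediction_column]) - 1, 0)
            out.append({**r, output_column: ys[i]})
        return out

    log = {"isotonic_calibration_sketch": {"n_training_rows": len(rows)}}
    return p, p(rows), log


rows = [{"score": s, "y": y}
        for s, y in zip([0.1, 0.2, 0.3, 0.4, 0.5], [0, 1, 0, 1, 1])]
p, calibrated, log = isotonic_calibration_sketch(rows, "y", "score", "calibrated")
```

The out-of-order pair of targets at scores 0.2 and 0.3 is pooled into a shared value, which is what makes the calibrated output monotonic in the uncalibrated score.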

fklearn.training.classification module¶

fklearn.training.classification.catboost_classification_learner[source]¶

Fits a CatBoost classifier to the dataset. It first generates a Pool with the specified features and labels from df. Then, it fits a CatBoost model to this Pool. Returns the predict function for the model and the predictions for the input dataset.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame with features and target columns. The model will be trained to predict the target column from the features.
features (list of str) – A list of column names that are used as features for the model. All these names should be in df.
target (str) – The name of the column in df that should be used as target for the model. This column should be discrete, since this is a classification model.
learning_rate (float) – Float in the range (0, 1]. Step size shrinkage used in each update to prevent overfitting. After each boosting step, the weights of new features can be obtained directly, and eta shrinks those weights to make the boosting process more conservative. See the eta hyper-parameter in: https://catboost.ai/docs/concepts/python-reference_parameters-list.html
num_estimators (int) – Int in the range (0, inf). Number of boosted trees to fit. See the n_estimators hyper-parameter in: https://catboost.ai/docs/concepts/python-reference_parameters-list.html
extra_params (dict, optional) – Dictionary in the format {“hyperparameter_name” : hyperparameter_value}. Other parameters for the CatBoost model. See the list in: https://catboost.ai/docs/concepts/python-reference_catboostregressor.html If not passed, the default will be used.
prediction_column (str) – The name of the column with the predictions from the model. If a multiclass problem, additional prediction_column_i columns will be added for i in range(0,n_classes).
weight_column (str, optional) – The name of the column with scores to weight the data.
encode_extra_cols (bool (default: True)) – If True, treats all columns in df with name pattern `fklearn_feat__col==val` as feature columns.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the catboost_classification_learner model.
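
The encode_extra_cols parameter refers to columns named with the pattern fklearn_feat__col==val. A minimal sketch of how such names could be recognised follows. The regular expression is one reading of that pattern and the helper `find_encoded_columns` is hypothetical, so treat it as an illustration rather than the library's selection logic.

```python
import re

# Assumed reading of the documented pattern: fklearn_feat__<col>==<val>
EXTRA_COL_PATTERN = re.compile(r"^fklearn_feat__(?P<col>.+?)==(?P<val>.+)$")


def find_encoded_columns(columns):
    """Hypothetical helper: keep only the column names that match the pattern."""
    return [name for name in columns if EXTRA_COL_PATTERN.match(name)]


columns = ["age", "fklearn_feat__city==NY", "fklearn_feat__city==SF", "target"]
encoded = find_encoded_columns(columns)
```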

fklearn.training.classification.lgbm_classification_learner[source]¶

Fits an LGBM classifier to the dataset.

It first generates a Dataset with the specified features and labels from df. Then, it fits an LGBM model to this Dataset. Returns the predict function for the model and the predictions for the input dataset.

Parameters:

df (pandas.DataFrame) – A pandas DataFrame with features and target columns. The model will be trained to predict the target column from the features.
features (list of str) – A list of column names that are used as features for the model. All these names should be in df.
target (str) – The name of the column in df that should be used as target for the model. This column should be discrete, since this is a classification model.
learning_rate (float) – Float in the range (0, 1]. Step size shrinkage used in each update to prevent overfitting. After each boosting step, the weights of new features can be obtained directly, and eta shrinks those weights to make the boosting process more conservative. See the learning_rate hyper-parameter in: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst
num_estimators (int) – Int in the range (0, inf). Number of boosted trees to fit. See the num_iterations hyper-parameter in: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst
extra_params (dict, optional) – Dictionary in the format {“hyperparameter_name” : hyperparameter_value}. Other parameters for the LGBM model. See the list in: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst If not passed, the default will be used.
prediction_column (str) – The name of the column with the predictions from the model.
weight_column (str, optional) – The name of the column with scores to weight the data.
encode_extra_cols (bool (default: True)) – If True, treats all columns in df with name pattern `fklearn_feat__col==val` as feature columns.
valid_sets (list of pandas.DataFrame, optional (default=None)) – A list of datasets to be used for early-stopping during training.
valid_names (list of strings, optional (default=None)) – A list of dataset names matching the list of datasets provided through the valid_sets parameter.
feval (callable, list of callable, or None, optional (default=None)) – Customized evaluation function. Each evaluation function should accept two parameters: preds, eval_data, and return (eval_name, eval_result, is_higher_better) or list of such tuples.
init_model (str, pathlib.Path, Booster or None, optional (default=None)) – Filename of a LightGBM model or a Booster instance used to continue training.
feature_name (list of str, or 'auto', optional (default="auto")) – Feature names. If ‘auto’ and data is pandas DataFrame, data columns names are used.
categorical_feature (list of str or int, or 'auto', optional (default="auto")) – Categorical features. If list of int, interpreted as indices. If list of str, interpreted as feature names (need to specify feature_name as well). If ‘auto’ and data is pandas DataFrame, pandas unordered categorical columns are used. All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero. All negative values in categorical features will be treated as missing values. The output cannot be monotonically constrained with respect to a categorical feature. Floating point numbers in categorical features will be rounded towards 0.
keep_training_booster (bool, optional (default=False)) – Whether the returned Booster will be used to keep training. If False, the returned value will be converted into _InnerPredictor before returning. This means you won’t be able to use the eval, eval_train or eval_valid methods of the returned Booster. When your model is very large and causes a memory error, you can try setting this parameter to True to avoid the model conversion performed during the internal call of model_to_string. You can still use _InnerPredictor as init_model to continue training later.
callbacks (list of callable, or None, optional (default=None)) – List of callback functions that are applied at each iteration. See Callbacks in LightGBM Python API for more information.
dataset_init_score (list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), or None, optional (default=None)) – Init score for Dataset. It could be the prediction of the majority class or a prediction from any other model.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the LGBM Classifier model.
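
The feval contract described above (two parameters, preds and eval_data, returning a tuple of eval_name, eval_result and is_higher_better) can be sketched as follows. The metric name, the 0.5 cut-off and the `FakeEvalData` stand-in are assumptions made for illustration; the sketch also assumes a binary task where preds are probabilities and where the evaluation dataset exposes its labels through `get_label()`.

```python
class FakeEvalData:
    """Stand-in for the evaluation dataset; only get_label() is used here."""

    def __init__(self, labels):
        self._labels = labels

    def get_label(self):
        return self._labels


def accuracy_at_half(preds, eval_data):
    """Custom metric: share of rows classified correctly with a 0.5 cut-off."""
    labels = eval_data.get_label()
    hits = sum(1 for p, y in zip(preds, labels) if (p >= 0.5) == (y == 1))
    # (eval_name, eval_result, is_higher_better)
    return "accuracy_at_0.5", hits / len(labels), True


name, value, higher_is_better = accuracy_at_half(
    [0.9, 0.2, 0.6, 0.4], FakeEvalData([1, 0, 0, 1])
)
```

A function with this shape would be passed through the feval parameter, alone or in a list.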

fklearn.training.classification.logistic_classification_learner[source]¶

Fits a logistic regression classifier to the dataset. Returns the predict function for the model and the predictions for the input dataset.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame with features and target columns. The model will be trained to predict the target column from the features.
features (list of str) – A list of column names that are used as features for the model. All these names should be in df.
target (str) – The name of the column in df that should be used as target for the model. This column should be discrete, since this is a classification model.
params (dict) – The LogisticRegression parameters in the format {“par_name”: param}. See: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
prediction_column (str) – The name of the column with the predictions from the model. If a multiclass problem, additional prediction_column_i columns will be added for i in range(0,n_classes).
weight_column (str, optional) – The name of the column with scores to weight the data.
encode_extra_cols (bool (default: True)) – If True, treats all columns in df with name pattern `fklearn_feat__col==val` as feature columns.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the Logistic Regression model.

fklearn.training.classification.nlp_logistic_classification_learner[source]¶

Fits a text vectorizer (TfidfVectorizer) followed by a logistic regression (LogisticRegression).

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame with features and target columns. The model will be trained to predict the target column from the features.
text_feature_cols (list of str) – A list of column names of the text features used for the model. All these names should be in df.
target (str) – The name of the column in df that should be used as target for the model. This column should be discrete, since this is a classification model.
vectorizer_params (dict) – The TfidfVectorizer parameters in the format {“par_name”: param}. See: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
logistic_params (dict) – The LogisticRegression parameters in the format {“par_name”: param}. See: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
prediction_column (str) – The name of the column with the predictions from the model.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the NLP Logistic Regression model.

fklearn.training.classification.xgb_classification_learner[source]¶

Fits an XGBoost classifier to the dataset. It first generates a DMatrix with the specified features and labels from df. Then, it fits an XGBoost model to this DMatrix. Returns the predict function for the model and the predictions for the input dataset.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame with features and target columns. The model will be trained to predict the target column from the features.
features (list of str) – A list of column names that are used as features for the model. All these names should be in df.
target (str) – The name of the column in df that should be used as target for the model. This column should be discrete, since this is a classification model.
learning_rate (float) – Float in the range (0, 1]. Step size shrinkage used in each update to prevent overfitting. After each boosting step, the weights of new features can be obtained directly, and eta shrinks those weights to make the boosting process more conservative. See the eta hyper-parameter in: http://xgboost.readthedocs.io/en/latest/parameter.html
num_estimators (int) – Int in the range (0, inf). Number of boosted trees to fit. See the n_estimators hyper-parameter in: http://xgboost.readthedocs.io/en/latest/python/python_api.html
extra_params (dict, optional) – Dictionary in the format {“hyperparameter_name” : hyperparameter_value}. Other parameters for the XGBoost model. See the list in: http://xgboost.readthedocs.io/en/latest/parameter.html If not passed, the default will be used.
prediction_column (str) – The name of the column with the predictions from the model. If a multiclass problem, additional prediction_column_i columns will be added for i in range(0,n_classes).
weight_column (str, optional) – The name of the column with scores to weight the data.
encode_extra_cols (bool (default: True)) – If True, treats all columns in df with name pattern `fklearn_feat__col==val` as feature columns.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the XGBoost Classifier model.
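
For multiclass problems, the prediction_column description mentions extra prediction_column_i columns for i in range(0, n_classes). The sketch below shows one literal reading of that naming, where the prediction column name is joined to the class index with an underscore. Both helpers are hypothetical and the exact naming used by the library is an assumption here.

```python
def multiclass_prediction_columns(prediction_column, n_classes):
    """Assumed naming: <prediction_column>_<i> for each class index i."""
    return [f"{prediction_column}_{i}" for i in range(0, n_classes)]


def attach_probabilities(row, probabilities, prediction_column="prediction"):
    """Return a copy of row with one probability column per class."""
    names = multiclass_prediction_columns(prediction_column, len(probabilities))
    return {**row, **dict(zip(names, probabilities))}


scored = attach_probabilities({"id": 1}, [0.2, 0.5, 0.3])
```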

fklearn.training.ensemble module¶

fklearn.training.ensemble.xgb_octopus_classification_learner[source]¶

Octopus ensemble allows you to inject domain specific knowledge to force a split in an initial feature, instead of assuming the tree model will do that intelligent split on its own. It works by first defining a split on your dataset and then training one individual model in each separated dataset.

Parameters:

train_set (pd.DataFrame) – A Pandas’ DataFrame with features, target columns and a splitting column that must be categorical.
learning_rate_by_bin (dict) –
A dictionary of learning rates in the XGBoost model to use in each model split. Ex: if you want to split your training by tenure and you have a tenure column with integer values [1,2,3,…,12], you have to specify a learning rate for each split:
```
{
    1: 0.08,
    2: 0.08,
    ...
    12: 0.1
}
```
num_estimators_by_bin (dict) –
A dictionary of the number of tree estimators in the XGBoost model to use in each model split. Ex: if you want to split your training by tenure and you have a tenure column with integer values [1,2,3,…,12], you have to specify a number of estimators for each split:
```
{
    1: 300,
    2: 250,
    ...
    12: 300
}
```

extra_params_by_bin (dict) –
A dictionary of extra parameters dictionaries in the XGBoost model to use in each model split. Ex: if you want to split your training by tenure and you have a tenure column with integer values [1,2,3,…,12], you have to specify a list of extra parameters for each split:
```
{
    1: {
        'reg_alpha': 0.0,
        'colsample_bytree': 0.4,
        ...
        'colsample_bylevel': 0.8
    },
    2: {
        'reg_alpha': 0.1,
        'colsample_bytree': 0.6,
        ...
        'colsample_bylevel': 0.4
    },
    ...
    12: {
        'reg_alpha': 0.0,
        'colsample_bytree': 0.7,
        ...
        'colsample_bylevel': 1.0
    }
}
```
features_by_bin (dict) –
A dictionary of features to use in each model split. Ex: if you want to split your training by tenure and you have a tenure column with integer values [1,2,3,…,12], you have to specify a list of features for each split:
```
{
    1: [feature-1, feature-2, feature-3, ...],
    2: [feature-1, feature-3, feature-5, ...],
    ...
    12: [feature-2, feature-4, feature-8, ...]
}
```
train_split_col (str) – The name of the categorical column where the model will make the splits. Ex: if you want to split your training by tenure, you can have a categorical column called “tenure”.
train_split_bins (list) – A list with the actual values of the categories from the train_split_col. Ex: if you want to split your training by tenure and you have a tenure column with integer values [1,2,3,…,12] you can pass this list and you will split your training into 12 different models.
nthread (int) – Number of threads for the XGBoost learners.
target_column (str) – The name of the target column.
prediction_column (str) – The name of the column with the predictions from the model.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the Octopus XGB Classifier model.
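
The idea of splitting on one column and training one model per split can be sketched without XGBoost. In the snippet below, plain lists of dicts replace DataFrames and a trivial mean predictor replaces each XGBoost model; `octopus_sketch`, `fit_mean` and the sample data are hypothetical names used only for illustration.

```python
def fit_mean(target_column):
    """Return a toy fit function whose model always predicts the training mean."""
    def fit(rows):
        mean = sum(r[target_column] for r in rows) / len(rows)
        return lambda row: mean
    return fit


def octopus_sketch(train_rows, split_col, split_bins, fit_by_bin):
    """Train one model per bin, then route each new row to its bin's model."""
    models = {}
    for bin_value in split_bins:
        subset = [r for r in train_rows if r[split_col] == bin_value]
        models[bin_value] = fit_by_bin[bin_value](subset)

    def p(new_rows, prediction_column="prediction"):
        return [
            {**r, prediction_column: models[r[split_col]](r)} for r in new_rows
        ]

    return p


train = [
    {"tenure": 1, "target": 0}, {"tenure": 1, "target": 1},
    {"tenure": 2, "target": 1}, {"tenure": 2, "target": 1},
]
p = octopus_sketch(train, "tenure", [1, 2],
                   {1: fit_mean("target"), 2: fit_mean("target")})
scored = p([{"tenure": 1}, {"tenure": 2}])
```

The per-bin dictionaries in the parameter list play the role of `fit_by_bin` here: each bin gets its own configuration, and prediction dispatches on the split column.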

fklearn.training.imputation module¶

fklearn.training.imputation.imputer[source]¶

Fits a missing value imputer to the dataset.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame with columns to impute missing values. It must contain all columns listed in columns_to_impute
columns_to_impute (List of strings) – A list of names of the columns for missing value imputation.
impute_strategy (String, (default="median")) – The imputation strategy. If “mean”, missing values are replaced with the mean along the axis. If “median”, missing values are replaced with the median along the axis. If “most_frequent”, missing values are replaced with the most frequent value along the axis.
placeholder_value (Any, (default=None)) – If not None, this value is used as the default when a feature contains only NA values during training. At transformation time, NA values in those features are replaced by fill_value.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with the missing values in columns_to_impute filled in.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df, with the missing values in columns_to_impute filled in.
log (dict) – A log-like Dict that stores information of the SimpleImputer model.
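
A minimal sketch of the "median" strategy with the placeholder fallback follows. It assumes lists of dicts in place of DataFrames and uses None to represent a missing value; `median_imputer_sketch` is a hypothetical name, and the fill values are learned once at training time and reused on new data.

```python
from statistics import median


def median_imputer_sketch(rows, columns_to_impute, placeholder_value=None):
    # Learn one fill value per column from the training rows.
    fill = {}
    for col in columns_to_impute:
        observed = [r[col] for r in rows if r[col] is not None]
        # Columns with no observed value fall back to the placeholder.
        fill[col] = median(observed) if observed else placeholder_value

    def p(new_rows):
        return [
            {**r, **{c: fill[c] for c in columns_to_impute if r[c] is None}}
            for r in new_rows
        ]

    log = {"median_imputer_sketch": {"fill_values": fill}}
    return p, p(rows), log


train = [
    {"a": 1, "b": None},
    {"a": None, "b": None},
    {"a": 3, "b": None},
]
p, imputed, log = median_imputer_sketch(train, ["a", "b"], placeholder_value=-1)
```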

fklearn.training.imputation.placeholder_imputer[source]¶

Fills missing values with a fixed value.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame with columns to fill missing values. It must contain all columns listed in columns_to_impute
columns_to_impute (List of strings) – A list of names of the columns in which to fill missing values.
placeholder_value (Any, (default=-999)) – The value used to fill in missing values.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with the missing values in columns_to_impute replaced by placeholder_value.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df, with the missing values in columns_to_impute replaced by placeholder_value.
log (dict) – A log-like Dict that stores information of the Placeholder SimpleImputer model.

fklearn.training.pipeline module¶

fklearn.training.pipeline.build_pipeline(*learners, has_repeated_learners: bool = False) → Callable[[pandas.core.frame.DataFrame], Tuple[Callable[[...], pandas.core.frame.DataFrame], pandas.core.frame.DataFrame, Dict[str, Any]]][source]¶

Builds a pipeline of different chained learner functions, with the possibility of using keyword arguments in the predict functions of the pipeline.

Say you have two learners; you create a pipeline with pipeline = build_pipeline(learner1, learner2). Those learners must be functions with just one unfilled argument (the dataset itself).

Then, you train the pipeline with predict_fn, transformed_df, logs = pipeline(df), which will be like applying the learners in the following order: learner2(learner1(df)).

Finally, you predict on different datasets with pred_df = predict_fn(new_df), with optional kwargs. For example, if you have XGBoost or LightGBM, you can get SHAP values with predict_fn(new_df, apply_shap=True).

Parameters:

learners (partially-applied learner functions.) – The learner functions to chain, each with every argument except the dataset already filled in.
has_repeated_learners (bool) – Boolean value indicating whether the pipeline contains learners with the same name or not.

Returns:

p (function pandas.DataFrame, **kwargs -> pandas.DataFrame) – A function that when applied to a DataFrame will apply all learner functions in sequence, with optional kwargs.
new_df (pandas.DataFrame) – A DataFrame that is the result of applying all learner functions in sequence.
log (dict) – A log-like Dict that stores information of all learner functions.
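
The chaining behaviour described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's code: lists of dicts replace DataFrames, `functools.partial` stands in for partially applied learners, the two toy learners are hypothetical, and the keyword arguments and has_repeated_learners handling are left out.

```python
from functools import partial


def build_pipeline_sketch(*learners):
    def pipeline(rows):
        predict_fns, logs = [], {}
        current = rows
        for learner in learners:
            # Each learner trains on the output of the previous one.
            p, current, log = learner(current)
            predict_fns.append(p)
            logs.update(log)

        def predict_fn(new_rows):
            for p in predict_fns:
                new_rows = p(new_rows)
            return new_rows

        return predict_fn, current, logs

    return pipeline


def center_learner(rows, column):
    """Toy learner: subtract the training mean of a column."""
    mean = sum(r[column] for r in rows) / len(rows)

    def p(new_rows):
        return [{**r, column: r[column] - mean} for r in new_rows]

    return p, p(rows), {"center_learner": {"mean": mean}}


def double_learner(rows, column):
    """Toy learner: multiply a column by two."""
    def p(new_rows):
        return [{**r, column: r[column] * 2} for r in new_rows]

    return p, p(rows), {"double_learner": {}}


pipeline = build_pipeline_sketch(
    partial(center_learner, column="x"), partial(double_learner, column="x")
)
predict_fn, transformed, logs = pipeline([{"x": 1.0}, {"x": 3.0}])
```

The second learner sees the already centered data, which matches the learner2(learner1(df)) ordering, and `predict_fn` replays the same trained steps on new rows.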

fklearn.training.regression module¶

fklearn.training.regression.catboost_regressor_learner[source]¶

Fits a CatBoost regressor to the dataset. It first generates a Pool with the specified features and labels from df. Then it fits a CatBoost model to this Pool. Returns the predict function for the model and the predictions for the input dataset.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame with features and target columns. The model will be trained to predict the target column from the features.
features (list of str) – A list of column names that are used as features for the model. All these names should be in df.
target (str) – The name of the column in df that should be used as target for the model. This column should be numerical and continuous, since this is a regression model.
learning_rate (float) – Float in range [0,1]. Step size shrinkage used in each update to prevent overfitting. After each boosting step, the weights of new features can be obtained directly, and eta shrinks those weights to make the boosting process more conservative. See the eta hyper-parameter in: https://catboost.ai/docs/concepts/python-reference_parameters-list.html
num_estimators (int) – Int in range [0, inf]. Number of boosted trees to fit. See the n_estimators hyper-parameter in: https://catboost.ai/docs/concepts/python-reference_parameters-list.html
extra_params (dict, optional) – Dictionary in the format {“hyperparameter_name” : hyperparameter_value}. Other parameters for the CatBoost model. See the list in: https://catboost.ai/docs/concepts/python-reference_catboostregressor.html If not passed, the default will be used.
prediction_column (str) – The name of the column with the predictions from the model.
weight_column (str, optional) – The name of the column with scores to weight the data.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the CatBoostRegressor model.

fklearn.training.regression.custom_supervised_model_learner[source]¶

Fits a custom model to the dataset. Return the predict function, the predictions for the input dataset and a log describing the model.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame with features and target columns. The model will be trained to predict the target column from features.
features (list of str) – A list of column names that are used as features for the model. All these names should be in df.
target (str) – The name of the column in df that should be used as target for the model.
model (Object) – Machine learning model to be used for regression or classification. The model object must have a “.fit” attribute to train on the data. For classification problems, it also needs a “.predict_proba” attribute. For regression problems, it needs a “.predict” attribute.
supervised_type (str) – Type of supervised learning to be used. The options are: ‘classification’ or ‘regression’.
log (Dict[str, Dict]) – Log with additional information of the custom model used. It must start with just one element with the model name.
prediction_column (str) – The name of the column with the predictions from the model. For classification problems, a probability column will be added for each class i in range(0, n_classes). For regression, just prediction_column will be added.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the Custom Supervised Model Learner model.
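
The interface requirements on model (".fit" always, ".predict_proba" for classification, ".predict" for regression) can be illustrated with a small duck-typing check. The helper `missing_model_methods` and the toy `MeanRegressor` class are hypothetical and exist only to show what a compatible object looks like.

```python
class MeanRegressor:
    """Toy model that satisfies the regression interface: fit and predict."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def missing_model_methods(model, supervised_type):
    """Return the required attribute names that the model does not provide."""
    required = {
        "classification": ["fit", "predict_proba"],
        "regression": ["fit", "predict"],
    }
    if supervised_type not in required:
        raise ValueError("supervised_type must be 'classification' or 'regression'")
    return [name for name in required[supervised_type] if not hasattr(model, name)]


model = MeanRegressor().fit([[1], [2], [3]], [10.0, 20.0, 30.0])
predictions = model.predict([[4], [5]])
```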

fklearn.training.regression.elasticnet_regression_learner[source]¶

Fits an elastic net regressor to the dataset. Returns the predict function for the model and the predictions for the input dataset.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame with features and target columns. The model will be trained to predict the target column from the features.
features (list of str) – A list of column names that are used as features for the model. All these names should be in df.
target (str) – The name of the column in df that should be used as target for the model. This column should be continuous, since this is a regression model.
params (dict) – The ElasticNet parameters in the format {“par_name”: param}. See: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html
prediction_column (str) – The name of the column with the predictions from the model.
weight_column (str, optional) – The name of the column with scores to weight the data.
encode_extra_cols (bool (default: True)) – If True, treats all columns in df with name pattern `fklearn_feat__col==val` as feature columns.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the ElasticNet Regression model.

fklearn.training.regression.gp_regression_learner[source]¶

Fits a Gaussian process regressor to the dataset.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame with features and target columns. The model will be trained to predict the target column from the features.
features (list of str) – A list of column names that are used as features for the model. All these names should be in df.
target (str) – The name of the column in df that should be used as target for the model. This column should be numerical and continuous, since this is a regression model.
kernel (sklearn.gaussian_process.kernels) – The kernel specifying the covariance function of the GP. If None is passed, the kernel “1.0 * RBF(1.0)” is used as default. Note that the kernel’s hyperparameters are optimized during fitting.
alpha (float) – Value added to the diagonal of the kernel matrix during fitting. Larger values correspond to increased noise level in the observations. This can also prevent a potential numerical issue during fitting, by ensuring that the calculated values form a positive definite matrix.
extra_variance (float) – The amount of extra variance to scale to the predictions in standard deviations. If left as the default “fit”, uses the standard deviation of the target.
return_std (bool) – If True, the standard-deviation of the predictive distribution at the query points is returned along with the mean.
extra_params (dict {"hyperparameter_name" : hyperparameter_value}, optional) – Other parameters for the GaussianProcessRegressor model. See the list in: http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessRegressor.html If not passed, the default will be used.
prediction_column (str) – The name of the column with the predictions from the model.
encode_extra_cols (bool (default: True)) – If True, treats all columns in df with name pattern `fklearn_feat__col==val` as feature columns.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the Gaussian Process Regressor model.

fklearn.training.regression.lgbm_regression_learner[source]¶

Fits an LGBM regressor to the dataset.

It first generates a Dataset with the specified features and labels from df. Then, it fits an LGBM model to this Dataset. Returns the predict function for the model and the predictions for the input dataset.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame with features and target columns. The model will be trained to predict the target column from the features.
features (list of str) – A list of column names that are used as features for the model. All these names should be in df.
target (str) – The name of the column in df that should be used as target for the model. This column should be numerical and continuous, since this is a regression model.
learning_rate (float) – Float in the range (0, 1]. Step size shrinkage used in each update to prevent overfitting. After each boosting step, the weights of new features can be obtained directly, and eta shrinks those weights to make the boosting process more conservative. See the learning_rate hyper-parameter in: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst
num_estimators (int) – Int in the range (0, inf). Number of boosted trees to fit. See the num_iterations hyper-parameter in: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst
extra_params (dict, optional) – Dictionary in the format {“hyperparameter_name” : hyperparameter_value}. Other parameters for the LGBM model. See the list in: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst If not passed, the default will be used.
prediction_column (str) – The name of the column with the predictions from the model.
weight_column (str, optional) – The name of the column with scores to weight the data.
encode_extra_cols (bool (default: True)) – If True, treats all columns in df with name pattern fklearn_feat__col==val as feature columns.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the LGBM Regressor model.

fklearn.training.regression.linear_regression_learner[source]¶

Fits a linear regressor to the dataset. Returns the predict function for the model and the predictions for the input dataset.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame with features and target columns. The model will be trained to predict the target column from the features.
features (list of str) – A list of column names that are used as features for the model. All these names should be in df.
target (str) – The name of the column in df that should be used as target for the model. This column should be continuous, since this is a regression model.
params (dict) – The LinearRegression parameters in the format {“par_name”: param}. See: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
prediction_column (str) – The name of the column with the predictions from the model.
weight_column (str, optional) – The name of the column with scores to weight the data.
encode_extra_cols (bool (default: True)) – If True, treats all columns in df with name pattern fklearn_feat__col==val as feature columns.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the Linear Regression model.

fklearn.training.regression.xgb_regression_learner[source]¶

Fits an XGBoost regressor to the dataset. It first generates a DMatrix with the specified features and labels from df. Then it fits an XGBoost model to this DMatrix. Returns the predict function for the model and the predictions for the input dataset.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame with features and target columns. The model will be trained to predict the target column from the features.
features (list of str) – A list of column names that are used as features for the model. All these names should be in df.
target (str) – The name of the column in df that should be used as target for the model. This column should be numerical and continuous, since this is a regression model.
learning_rate (float) – Float in the range [0, 1]. Step size shrinkage used in updates to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative. See the eta hyper-parameter in: http://xgboost.readthedocs.io/en/latest/parameter.html
num_estimators (int) – Int in the range [0, inf). Number of boosted trees to fit. See the n_estimators hyper-parameter in: http://xgboost.readthedocs.io/en/latest/python/python_api.html
extra_params (dict, optional) – Dictionary in the format {“hyperparameter_name” : hyperparameter_value}. Other parameters for the XGBoost model. See the list in: http://xgboost.readthedocs.io/en/latest/parameter.html If not passed, the default will be used.
prediction_column (str) – The name of the column with the predictions from the model.
weight_column (str, optional) – The name of the column with scores to weight the data.
encode_extra_cols (bool (default: True)) – If True, treats all columns in df with name pattern fklearn_feat__col==val as feature columns.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the XGBoost Regressor model.

fklearn.training.transformation module¶

fklearn.training.transformation.apply_replacements(df: pandas.core.frame.DataFrame, columns: List[str], vec: Dict[str, Dict], replace_unseen: Any) → pandas.core.frame.DataFrame[source]¶

Base function that applies the replacement values found in the “vec” mappings to the df DataFrame.

Parameters:

df (pandas.DataFrame) – A Pandas DataFrame containing the data to be replaced.
columns (list of str) – The names of the df columns on which to perform the replacements.
vec (dict) – A dict mapping a column to a dict that maps a value to its replacement. For example: vec = {“feature1”: {1: 2, 3: 5, 6: 8}}
replace_unseen (Any) – Default value to use when the original value is not present in the vec dict for the feature.
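The replacement logic described above can be sketched in plain pandas. This is an illustrative re-implementation, not fklearn's own code; the helper name apply_replacements_sketch and the sample data are hypothetical.

```python
from typing import Any, Dict, List

import pandas as pd


def apply_replacements_sketch(df: pd.DataFrame, columns: List[str],
                              vec: Dict[str, Dict], replace_unseen: Any) -> pd.DataFrame:
    # Hypothetical re-implementation for illustration only.
    # Each value found in vec[col] is swapped for its replacement;
    # any value that is not a key of vec[col] becomes replace_unseen.
    replaced = {
        col: df[col].map(lambda value, mapping=vec[col]: mapping.get(value, replace_unseen))
        for col in columns
    }
    return df.assign(**replaced)


df = pd.DataFrame({"feature1": [1, 3, 6, 7], "other": ["w", "x", "y", "z"]})
vec = {"feature1": {1: 2, 3: 5, 6: 8}}
out = apply_replacements_sketch(df, ["feature1"], vec, replace_unseen=-1)
print(out)
```

Columns not listed in columns are passed through untouched, and the input DataFrame is not modified.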

fklearn.training.transformation.capper(df: pandas.core.frame.DataFrame = '__no__default__', columns_to_cap: List[str] = '__no__default__', precomputed_caps: Dict[str, float] = None) → Union[Callable, Tuple[Callable[[...], pandas.core.frame.DataFrame], pandas.core.frame.DataFrame, Dict[str, Any]]][source]¶

Learns the maximum value for each of the columns_to_cap and uses that as the cap for those columns. If precomputed caps are passed, the function uses them as the cap values instead of computing the maximum.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame that must contain columns_to_cap columns.
columns_to_cap (list of str) – A list of column names that should be capped.
precomputed_caps (dict) – A dictionary in the format {“column_name” : cap_value} that maps column names to precomputed cap values.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the Capper model.
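As a rough illustration of the learner convention used throughout this module (a function that returns the triple p, new_df, log), the capping behaviour can be sketched as follows. The function capper_sketch is hypothetical and written in plain pandas; in particular, letting a precomputed cap override the learned maximum per column is an assumption based on the description above.

```python
from typing import Dict, List, Optional

import pandas as pd


def capper_sketch(df: pd.DataFrame, columns_to_cap: List[str],
                  precomputed_caps: Optional[Dict[str, float]] = None):
    # Hypothetical sketch, not fklearn's implementation.
    precomputed = precomputed_caps or {}
    # Assumption: a precomputed cap wins over the maximum seen in training.
    caps = {col: precomputed.get(col, df[col].max()) for col in columns_to_cap}

    def p(new_df: pd.DataFrame) -> pd.DataFrame:
        # Values above the cap are clipped down to it; others are unchanged.
        capped = {col: new_df[col].clip(upper=cap) for col, cap in caps.items()}
        return new_df.assign(**capped)

    log = {"capper_sketch": {"caps": caps}}
    return p, p(df), log


train = pd.DataFrame({"x": [1.0, 5.0, 3.0], "y": [10.0, 20.0, 30.0]})
p, capped_train, log = capper_sketch(train, ["x", "y"], precomputed_caps={"y": 15.0})
scored = p(pd.DataFrame({"x": [2.0, 9.0], "y": [5.0, 50.0]}))
print(scored)
```

The caps are learned once on the training data and then reused unchanged whenever p is applied to new data.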

fklearn.training.transformation.count_categorizer(df: pandas.core.frame.DataFrame = '__no__default__', columns_to_categorize: List[str] = '__no__default__', replace_unseen: int = -1, store_mapping: bool = False) → Union[Callable, Tuple[Callable[[...], pandas.core.frame.DataFrame], pandas.core.frame.DataFrame, Dict[str, Any]]][source]¶

Replaces categorical variables by count.

The default behaviour is to replace the original values. To store the original values in a new column, specify prefix or suffix in the parameters, or specify a dictionary with the desired column mapping using the columns_mapping parameter.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame that must contain columns_to_categorize columns.
columns_to_categorize (list of str) – A list of categorical column names.
replace_unseen (int) – The value to impute unseen categories.
store_mapping (bool (default: False)) – Whether to store the feature value -> integer dictionary in the log

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the Count Categorizer model.
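A minimal sketch of count encoding, assuming each category is replaced by the number of times it appears in the training data. The name count_categorizer_sketch is hypothetical and the code is not fklearn's implementation.

```python
import pandas as pd


def count_categorizer_sketch(df, columns_to_categorize, replace_unseen=-1):
    # Hypothetical sketch: learn how often each category occurs in training.
    counts = {col: df[col].value_counts().to_dict() for col in columns_to_categorize}

    def p(new_df):
        # Categories never seen in training are encoded as replace_unseen.
        encoded = {
            col: new_df[col].map(lambda value, m=mapping: m.get(value, replace_unseen))
            for col, mapping in counts.items()
        }
        return new_df.assign(**encoded)

    log = {"count_categorizer_sketch": {"counts": counts}}
    return p, p(df), log


train = pd.DataFrame({"city": ["a", "a", "b", "a"]})
p, new_df, log = count_categorizer_sketch(train, ["city"])
scored = p(pd.DataFrame({"city": ["b", "z"]}))
print(new_df["city"].tolist(), scored["city"].tolist())
```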

fklearn.training.transformation.custom_transformer(df: pandas.core.frame.DataFrame = '__no__default__', columns_to_transform: List[str] = '__no__default__', transformation_function: Callable[[pandas.core.frame.DataFrame], pandas.core.frame.DataFrame] = '__no__default__', is_vectorized: bool = False) → Union[Callable, Tuple[Callable[[...], pandas.core.frame.DataFrame], pandas.core.frame.DataFrame, Dict[str, Any]]][source]¶

Applies a custom function to the desired columns.

The default behaviour is to replace the original values. To store the original values in a new column, specify prefix or suffix in the parameters, or specify a dictionary with the desired column mapping using the columns_mapping parameter.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame that must contain the columns_to_transform columns.
columns_to_transform (list of str) – A list of the column names to which the transformation function is applied.
transformation_function (function(pandas.DataFrame) -> pandas.DataFrame) – A function that receives a DataFrame as input, performs a transformation on its columns and returns another DataFrame.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the Custom Transformer model.

fklearn.training.transformation.discrete_ecdfer[source]¶

Learns an Empirical Cumulative Distribution Function from the specified column in the input DataFrame. It is usually used in the prediction column to convert a predicted probability into a score from 0 to 1000.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame that must contain the prediction_column column.
ascending (bool) – Whether to compute an ascending ECDF or a descending one.
prediction_column (str) – The name of the column in df to learn the ECDF from.
ecdf_column (str) – The name of the new ECDF column added by this function.
max_range (int) – The maximum value for the ECDF. It will go from 0 to max_range.
round_method (Callable) – A function that rounds the transformed values, e.g. int, ceil, floor, or round.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the Discrete ECDFer model.

fklearn.training.transformation.ecdfer[source]¶

Learns an Empirical Cumulative Distribution Function from the specified column in the input DataFrame. It is usually used in the prediction column to convert a predicted probability into a score from 0 to 1000.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame that must contain the prediction_column column.
ascending (bool) – Whether to compute an ascending ECDF or a descending one.
prediction_column (str) – The name of the column in df to learn the ECDF from.
ecdf_column (str) – The name of the new ECDF column added by this function.
max_range (int) – The maximum value for the ECDF. It will go from 0 to max_range.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the ECDFer model.
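To make the ECDF transformation concrete, here is a small sketch that maps a prediction to max_range times the fraction of training predictions less than or equal to it. It uses numpy.searchsorted, which may differ from what fklearn uses internally; the name ecdfer_sketch and the descending rule (one minus the ascending fraction) are assumptions.

```python
import numpy as np
import pandas as pd


def ecdfer_sketch(df, ascending, prediction_column, ecdf_column, max_range):
    # Hypothetical sketch: sorted training predictions define the empirical CDF.
    train_values = np.sort(df[prediction_column].to_numpy())
    n = len(train_values)

    def p(new_df):
        values = new_df[prediction_column].to_numpy()
        # Fraction of training values that are <= each incoming value.
        fraction = np.searchsorted(train_values, values, side="right") / n
        if not ascending:
            # Assumption: the descending ECDF is the complement.
            fraction = 1.0 - fraction
        return new_df.assign(**{ecdf_column: max_range * fraction})

    log = {"ecdfer_sketch": {"n_training_rows": n, "max_range": max_range}}
    return p, p(df), log


train = pd.DataFrame({"prediction": [0.1, 0.4, 0.2, 0.3]})
p, new_df, log = ecdfer_sketch(train, ascending=True, prediction_column="prediction",
                               ecdf_column="score", max_range=1000)
scored = p(pd.DataFrame({"prediction": [0.25, 0.05, 0.9]}))
print(scored)
```

A discrete variant would additionally pass the resulting score through a rounding function such as int or round.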

fklearn.training.transformation.floorer(df: pandas.core.frame.DataFrame = '__no__default__', columns_to_floor: List[str] = '__no__default__', precomputed_floors: Dict[str, float] = None) → Union[Callable, Tuple[Callable[[...], pandas.core.frame.DataFrame], pandas.core.frame.DataFrame, Dict[str, Any]]][source]¶

Learns the minimum value for each of the columns_to_floor and uses that as the floor for those columns. If precomputed floors are passed, the function uses them as the floor values instead of computing the minimum.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame that must contain columns_to_floor columns.
columns_to_floor (list of str) – A list of column names that should be floored.
precomputed_floors (dict) – A dictionary in the format {“column_name” : floor_value} that maps column names to precomputed floor values.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the Floorer model.

fklearn.training.transformation.label_categorizer(df: pandas.core.frame.DataFrame = '__no__default__', columns_to_categorize: List[str] = '__no__default__', replace_unseen: Union[str, float] = nan, store_mapping: bool = False) → Union[Callable, Tuple[Callable[[...], pandas.core.frame.DataFrame], pandas.core.frame.DataFrame, Dict[str, Any]]][source]¶

Replaces categorical variables with a numeric identifier.

The default behaviour is to replace the original values. To store the original values in a new column, specify prefix or suffix in the parameters, or specify a dictionary with the desired column mapping using the columns_mapping parameter.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame that must contain columns_to_categorize columns.
columns_to_categorize (list of str) – A list of categorical column names.
replace_unseen (int, str, float, or nan) – The value to impute unseen categories.
store_mapping (bool (default: False)) – Whether to store the feature value -> integer dictionary in the log

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the Label Categorizer model.

fklearn.training.transformation.missing_warner[source]¶

Creates a new column to warn about rows in which columns that have no missing values in the training set have missing values at scoring time.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame.
cols_list (list of str) – List of columns to consider when evaluating missingness
new_column_name (str) – Name of the column created to alert the existence of missing values

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the Missing Alerter model.
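One way to picture this behaviour: remember which of the listed columns were complete in training, then flag any scoring row with a null in one of them. The sketch below is hypothetical (missing_warner_sketch is not an fklearn function) and assumes the warning column is boolean.

```python
import numpy as np
import pandas as pd


def missing_warner_sketch(df, cols_list, new_column_name):
    # Hypothetical sketch: columns that had no missing values during training.
    complete_cols = [col for col in cols_list if not df[col].isnull().any()]

    def p(new_df):
        # A row is flagged when any training-complete column is null in it.
        flag = new_df[complete_cols].isnull().any(axis=1)
        return new_df.assign(**{new_column_name: flag})

    log = {"missing_warner_sketch": {"cols_without_missing": complete_cols}}
    return p, p(df), log


train = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, 1.0]})
p, new_df, log = missing_warner_sketch(train, ["a", "b"], "has_unexpected_missing")
scored = p(pd.DataFrame({"a": [np.nan, 3.0], "b": [np.nan, np.nan]}))
print(scored)
```

Column b already had nulls in training, so only a null in column a raises the warning.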

fklearn.training.transformation.null_injector[source]¶

Injects nulls into columns.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame that must contain columns_to_inject as columns
columns_to_inject (list of str) – A list of features to inject nulls. If groups is not None it will be ignored.
proportion (float) – Proportion of nulls to inject in the columns.
groups (list of list of str (default = None)) – A list of groups of features. If not None, features in the same group will be set to NaN together.
seed (int) – Random seed for consistency.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the Null Injector model.

fklearn.training.transformation.onehot_categorizer(df: pandas.core.frame.DataFrame = '__no__default__', columns_to_categorize: List[str] = '__no__default__', hardcode_nans: bool = False, drop_first_column: bool = False, store_mapping: bool = False) → Union[Callable, Tuple[Callable[[...], pandas.core.frame.DataFrame], pandas.core.frame.DataFrame, Dict[str, Any]]][source]¶

Onehot encoding on categorical columns. Encoded columns are removed and substituted by columns named fklearn_feat__col==val, where col is the name of the column and val is one of the values the feature can assume.

The default behaviour is to replace the original values. To store the original values in a new column, specify prefix or suffix in the parameters, or specify a dictionary with the desired column mapping using the columns_mapping parameter.

Parameters:

df (pd.DataFrame) – A Pandas’ DataFrame that must contain columns_to_categorize columns.
columns_to_categorize (list of str) – A list of categorical column names. Must be non-empty.
hardcode_nans (bool) – Hardcodes an extra column with: 1 if nan or unseen else 0.
drop_first_column (bool) – Drops the first column to create (k-1)-sized one-hot arrays for the k distinct values of each categorical column. Can be used to avoid collinearity.
store_mapping (bool (default: False)) – Whether to store the feature value -> integer dictionary in the log

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the Onehot Categorizer model.
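The naming pattern fklearn_feat__col==val can be illustrated with a small pandas sketch. It is a hypothetical re-implementation (onehot_sketch is not part of fklearn) that ignores the hardcode_nans and drop_first_column options, and the ordering of the generated columns is an assumption.

```python
import pandas as pd


def onehot_sketch(df, columns_to_categorize):
    # Hypothetical sketch: remember the values seen for each column in training.
    categories = {col: sorted(df[col].dropna().unique()) for col in columns_to_categorize}

    def p(new_df):
        # The encoded columns are removed and replaced by 0/1 indicator columns.
        out = new_df.drop(columns=columns_to_categorize)
        for col, values in categories.items():
            for val in values:
                out[f"fklearn_feat__{col}=={val}"] = (new_df[col] == val).astype(int)
        return out

    log = {"onehot_sketch": {"categories": categories}}
    return p, p(df), log


train = pd.DataFrame({"col1": [0, 1, 0], "col2": [7, 8, 9]})
p, new_df, log = onehot_sketch(train, ["col1"])
print(list(new_df.columns))
```

Because the categories are fixed at training time, a value that first appears at scoring time gets zeros in every indicator column.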

fklearn.training.transformation.prediction_ranger[source]¶

Caps and floors the specified prediction column to a set range.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame that must contain the prediction_column column.
prediction_min (float) – The floor for the prediction.
prediction_max (float) – The cap for the prediction.
prediction_column (str) – The name of the column in df to cap and floor

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the Prediction Ranger model.

fklearn.training.transformation.quantile_biner(df: pandas.core.frame.DataFrame = '__no__default__', columns_to_bin: List[str] = '__no__default__', q: int = 4, right: bool = False) → Union[Callable, Tuple[Callable[[...], pandas.core.frame.DataFrame], pandas.core.frame.DataFrame, Dict[str, Any]]][source]¶

Discretizes continuous numerical columns into their quantiles. Uses pandas.qcut to find the bins and then numpy.digitize to fit the columns into bins.

The default behaviour is to replace the original values. To store the original values in a new column, specify prefix or suffix in the parameters, or specify a dictionary with the desired column mapping using the columns_mapping parameter.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame that must contain the columns_to_bin columns.
columns_to_bin (list of str) – A list of numerical column names.
q (int) – Number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles. See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html
right (bool) – Indicates whether the intervals include the right or the left bin edge. The default behavior (right==False) means that the interval does not include the right edge and does include the left one, i.e., bins[i-1] <= x < bins[i] for monotonically increasing bins. See https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.digitize.html

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the Quantile Biner model.
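The two-step procedure named above (pandas.qcut to learn the bin edges, numpy.digitize to assign values to bins) can be sketched as follows. The function quantile_biner_sketch is a hypothetical illustration, not fklearn's code.

```python
import numpy as np
import pandas as pd


def quantile_biner_sketch(df, columns_to_bin, q=4, right=False):
    # Hypothetical sketch: qcut with retbins=True returns the quantile edges.
    edges = {col: pd.qcut(df[col], q, retbins=True)[1] for col in columns_to_bin}

    def p(new_df):
        # digitize returns, for each value, the index of the bin it falls in.
        binned = {
            col: np.digitize(new_df[col].to_numpy(), col_edges, right=right)
            for col, col_edges in edges.items()
        }
        return new_df.assign(**binned)

    log = {"quantile_biner_sketch": {"edges": {c: e.tolist() for c, e in edges.items()}}}
    return p, p(df), log


train = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]})
p, new_df, log = quantile_biner_sketch(train, ["x"], q=4)
scored = p(pd.DataFrame({"x": [0.0, 100.0]}))
print(new_df["x"].tolist(), scored["x"].tolist())
```

With right=False, a value below the smallest training edge lands in bin 0, and a value at or above the largest edge lands in bin q + 1.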

fklearn.training.transformation.rank_categorical(df: pandas.core.frame.DataFrame = '__no__default__', columns_to_rank: List[str] = '__no__default__', replace_unseen: Union[str, float] = nan, store_mapping: bool = False) → Union[Callable, Tuple[Callable[[...], pandas.core.frame.DataFrame], pandas.core.frame.DataFrame, Dict[str, Any]]][source]¶

Rank categorical features by their frequency in the train set.

The default behaviour is to replace the original values. To store the original values in a new column, specify prefix or suffix in the parameters, or specify a dictionary with the desired column mapping using the columns_mapping parameter.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame that must contain the columns_to_rank columns.
columns_to_rank (list of str) – The names of the df columns to rank.
replace_unseen (int, str, float, or nan) – The value to impute unseen categories.
store_mapping (bool (default: False)) – Whether to store the feature value -> integer dictionary in the log

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the Rank Categorical model.

fklearn.training.transformation.selector[source]¶

Filters a DataFrame by selecting only the desired columns.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame that must contain the selected columns.
training_columns (list of str) – A list of column names that will remain in the dataframe during training time (fit).
predict_columns (list of str) – A list of column names that will remain in the dataframe during prediction time (transform). If None, it defaults to training_columns.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the Selector model.

fklearn.training.transformation.standard_scaler(df: pandas.core.frame.DataFrame = '__no__default__', columns_to_scale: List[str] = '__no__default__') → Union[Callable, Tuple[Callable[[...], pandas.core.frame.DataFrame], pandas.core.frame.DataFrame, Dict[str, Any]]][source]¶

Fits a standard scaler to the dataset.

The default behaviour is to replace the original values. To store the original values in a new column, specify prefix or suffix in the parameters, or specify a dictionary with the desired column mapping using the columns_mapping parameter.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame with columns to scale. It must contain all columns listed in columns_to_scale.
columns_to_scale (list of str) – A list of names of the columns for standard scaling.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the Standard Scaler model.
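A minimal sketch of standard scaling, assuming the usual definition: subtract the training mean and divide by the training standard deviation (population form, ddof=0). The helper standard_scaler_sketch is hypothetical and does not handle constant columns, where the standard deviation is zero.

```python
import pandas as pd


def standard_scaler_sketch(df, columns_to_scale):
    # Hypothetical sketch: statistics are learned from the training data only.
    means = df[columns_to_scale].mean()
    stds = df[columns_to_scale].std(ddof=0)

    def p(new_df):
        # The same training statistics are reused for any new data.
        scaled = (new_df[columns_to_scale] - means) / stds
        return new_df.assign(**{col: scaled[col] for col in columns_to_scale})

    log = {"standard_scaler_sketch": {"means": means.to_dict(), "stds": stds.to_dict()}}
    return p, p(df), log


train = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "label": [0, 1, 0, 1]})
p, new_df, log = standard_scaler_sketch(train, ["x"])
scored = p(pd.DataFrame({"x": [2.5], "label": [1]}))
print(new_df["x"].round(4).tolist())
```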

fklearn.training.transformation.target_categorizer(df: pandas.core.frame.DataFrame = '__no__default__', columns_to_categorize: List[str] = '__no__default__', target_column: str = '__no__default__', smoothing: float = 1.0, ignore_unseen: bool = True, store_mapping: bool = False) → Union[Callable, Tuple[Callable[[...], pandas.core.frame.DataFrame], pandas.core.frame.DataFrame, Dict[str, Any]]][source]¶

Replaces categorical variables with the smoothed mean of the target variable by category. Uses a weighted average with the overall mean of the target variable for smoothing.

The default behaviour is to replace the original values. To store the original values in a new column, specify prefix or suffix in the parameters, or specify a dictionary with the desired column mapping using the columns_mapping parameter.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame that must contain columns_to_categorize and target_column columns.
columns_to_categorize (list of str) – A list of categorical column names.
target_column (str) – Target column name. Target can be binary or continuous.
smoothing (float (default: 1.0)) – Weight given to overall target mean against target mean by category. The value must be greater than or equal to 0
ignore_unseen (bool (default: True)) – If True, unseen values will be encoded as nan. If False, these will be replaced by the target mean.
store_mapping (bool (default: False)) – Whether to store the feature value -> float dictionary in the log.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the Target Categorizer model.
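The smoothed mean can be read as a weighted average in which each category mean is weighted by its row count and the overall mean is weighted by smoothing. The sketch below uses the formula (count * category_mean + smoothing * overall_mean) / (count + smoothing); this exact formula is an assumption consistent with the description above, and target_categorizer_sketch is a hypothetical name.

```python
import pandas as pd


def target_categorizer_sketch(df, columns_to_categorize, target_column,
                              smoothing=1.0, ignore_unseen=True):
    # Hypothetical sketch, not fklearn's implementation.
    overall_mean = df[target_column].mean()
    unseen_value = float("nan") if ignore_unseen else overall_mean
    mappings = {}
    for col in columns_to_categorize:
        stats = df.groupby(col)[target_column].agg(["count", "mean"])
        # Assumed smoothing formula: weighted average with the overall mean.
        smoothed = (stats["count"] * stats["mean"] + smoothing * overall_mean) / (
            stats["count"] + smoothing
        )
        mappings[col] = smoothed.to_dict()

    def p(new_df):
        encoded = {
            col: new_df[col].map(lambda value, m=mapping: m.get(value, unseen_value))
            for col, mapping in mappings.items()
        }
        return new_df.assign(**encoded)

    log = {"target_categorizer_sketch": {"mappings": mappings}}
    return p, p(df), log


train = pd.DataFrame({"cat": ["a", "a", "b"], "y": [1, 0, 1]})
p, new_df, log = target_categorizer_sketch(train, ["cat"], "y", smoothing=1.0,
                                           ignore_unseen=False)
scored = p(pd.DataFrame({"cat": ["z"]}))
print(new_df["cat"].tolist(), scored["cat"].tolist())
```

With smoothing set to 0 the encoding reduces to the plain per-category mean; larger values pull rare categories toward the overall mean.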

fklearn.training.transformation.truncate_categorical(df: pandas.core.frame.DataFrame = '__no__default__', columns_to_truncate: List[str] = '__no__default__', percentile: float = '__no__default__', replacement: Union[str, float] = -9999, replace_unseen: Union[str, float] = -9999, store_mapping: bool = False) → Union[Callable, Tuple[Callable[[...], pandas.core.frame.DataFrame], pandas.core.frame.DataFrame, Dict[str, Any]]][source]¶

Truncate infrequent categories and replace them by a single one. You can think of it like “others” category.

The default behaviour is to replace the original values. To store the original values in a new column, specify prefix or suffix in the parameters, or specify a dictionary with the desired column mapping using the columns_mapping parameter.

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame that must contain the columns_to_truncate columns.
columns_to_truncate (list of str) – The names of the df columns on which to perform the truncation.
percentile (float) – Categories less frequent than the percentile will be replaced by the same one.
replacement (int, str, float or nan) – The value to use when a category is less frequent than the percentile variable.
replace_unseen (int, str, float, or nan) – The value to impute unseen categories.
store_mapping (bool (default: False)) – Whether to store the feature value -> integer dictionary in the log.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the Truncate Categorical model.
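A sketch of the truncation idea, assuming that percentile is compared against each category's relative frequency in the training data and that categories strictly below it are replaced. Both the comparison rule and the name truncate_categorical_sketch are assumptions for illustration.

```python
import pandas as pd


def truncate_categorical_sketch(df, columns_to_truncate, percentile,
                                replacement=-9999, replace_unseen=-9999):
    # Hypothetical sketch, not fklearn's implementation.
    mappings = {}
    for col in columns_to_truncate:
        freqs = df[col].value_counts(normalize=True)
        # Frequent categories map to themselves, rare ones to the replacement.
        mappings[col] = {
            cat: (cat if freq >= percentile else replacement)
            for cat, freq in freqs.items()
        }

    def p(new_df):
        truncated = {
            col: new_df[col].map(lambda value, m=mapping: m.get(value, replace_unseen))
            for col, mapping in mappings.items()
        }
        return new_df.assign(**truncated)

    log = {"truncate_categorical_sketch": {"mappings": mappings}}
    return p, p(df), log


train = pd.DataFrame({"cat": ["a"] * 5 + ["b"] * 4 + ["c"]})
p, new_df, log = truncate_categorical_sketch(train, ["cat"], percentile=0.2,
                                             replacement="other", replace_unseen="unseen")
scored = p(pd.DataFrame({"cat": ["a", "c", "z"]}))
print(scored["cat"].tolist())
```

Using distinct values for replacement and replace_unseen, as in this usage example, keeps rare training categories distinguishable from categories that were never seen.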

fklearn.training.transformation.value_mapper(df: pandas.core.frame.DataFrame = '__no__default__', value_maps: Dict[str, Dict] = '__no__default__', ignore_unseen: bool = True, replace_unseen_to: Any = nan) → Union[Callable, Tuple[Callable[[...], pandas.core.frame.DataFrame], pandas.core.frame.DataFrame, Dict[str, Any]]][source]¶

Maps values in selected columns in the DataFrame according to dictionaries of replacements. Learner wrapper for apply_replacements.

Parameters:

df (pandas.DataFrame) – A Pandas DataFrame containing the data to be replaced.
value_maps (dict of dicts) – A dict mapping a col to dict mapping a value to its replacement. For example: value_maps = {“feature1”: {1: 2, 3: 5, 6: 8}}
ignore_unseen (bool) – If True, values not explicitly declared in value_maps will be left as is. If False, these will be replaced by replace_unseen_to.
replace_unseen_to (Any) – Default value to use when the original value is not present in the value_maps dict for the feature.

fklearn.training.unsupervised module¶

fklearn.training.unsupervised.isolation_forest_learner[source]¶

Fits an anomaly detection algorithm (Isolation Forest) to the dataset

Parameters:

df (pandas.DataFrame) – A Pandas’ DataFrame with the feature columns. The model will be fitted on these features.
features (list of str) – A list of column names that are used as features for the model. All these names should be in df.
params (dict) – The IsolationForest parameters in the format {“par_name”: param}. See: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
prediction_column (str) – The name of the column with the predictions from the model.
encode_extra_cols (bool (default: True)) – If True, treats all columns in df with name pattern fklearn_feat__col==val as feature columns.

Returns:

p (function pandas.DataFrame -> pandas.DataFrame) – A function that when applied to a DataFrame with the same columns as df returns a new DataFrame with a new column with predictions from the model.
new_df (pandas.DataFrame) – A df-like DataFrame with the same columns as the input df plus a column with predictions from the model.
log (dict) – A log-like Dict that stores information of the Isolation Forest model.

fklearn.training.utils module¶

fklearn.training.utils.expand_features_encoded(df: pandas.core.frame.DataFrame, features: List[str]) → List[str][source]¶

Expand the list of features to include features created automatically by fklearn in encoders such as Onehot-encoder. All features created by fklearn have the naming pattern fklearn_feat__col==val. This function looks for these names in the DataFrame columns, checks if they can be derivative of any of the features listed in features, adds them to the new list of features and removes the original names from the list.

E.g. df has columns col1 with values 0 and 1 and col2. After Onehot-encoding col1, df will have columns fklearn_feat__col1==0, fklearn_feat__col1==1, col2. This function will then add fklearn_feat__col1==0 and fklearn_feat__col1==1 to the list of features and remove col1. If for some reason df also has another column fklearn_feat__col3==x but col3 is not on the list of features, this column will not be added.

Parameters:

df (pd.DataFrame) – A Pandas’ DataFrame with all features.
features (list of str) – The original list of features.
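The expansion rule can be sketched with plain string handling, assuming the fklearn_feat__col==val naming pattern stated in the description. The function expand_features_sketch is hypothetical, and the order of the returned list (untouched features first, derived columns after) is an assumption.

```python
from typing import Dict, List

import pandas as pd

PREFIX = "fklearn_feat__"


def expand_features_sketch(df: pd.DataFrame, features: List[str]) -> List[str]:
    # Hypothetical sketch: group encoded columns by the feature they derive from.
    derived: Dict[str, List[str]] = {}
    for column in df.columns:
        if isinstance(column, str) and column.startswith(PREFIX) and "==" in column:
            base = column[len(PREFIX):].split("==", 1)[0]
            derived.setdefault(base, []).append(column)

    # Features without encoded columns are kept as they are.
    kept = [feature for feature in features if feature not in derived]
    # Encoded columns are added only when their base feature was requested.
    expanded = [column for feature in features for column in derived.get(feature, [])]
    return kept + expanded


df = pd.DataFrame({
    "fklearn_feat__col1==0": [1, 0],
    "fklearn_feat__col1==1": [0, 1],
    "col2": [3.0, 4.0],
    "fklearn_feat__col3==x": [1, 1],
})
result = expand_features_sketch(df, ["col1", "col2"])
print(result)
```

The column derived from col3 is left out because col3 was not in the requested feature list.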

fklearn.training.utils.log_learner_time[source]¶

fklearn.training.utils.print_learner_run[source]¶
