fklearn.preprocessing package

Submodules

fklearn.preprocessing.rebalancing module

fklearn.preprocessing.rebalancing.rebalance_by_categorical

Resample dataset so that the result contains the same number of lines per category in categ_column.

Parameters:

dataset (pandas.DataFrame) – A pandas DataFrame with a categ_column column.
categ_column (str) – The name of the categorical column.
max_lines_by_categ (int (default None)) – The maximum number of lines per category. If None, it is set to the number of lines in the smallest category.
seed (int (default 1)) – Random state, for reproducible sampling.

Returns:	rebalanced_dataset – A dataset with no more lines than dataset, and with the same number of lines per category in categ_column.
Return type:	pandas.DataFrame
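
The resampling described above can be sketched in plain pandas. This is an illustrative stand-in, not fklearn's implementation: the function name rebalance_by_categorical_sketch and the example data are hypothetical, and clamping max_lines_by_categ to the smallest category is an assumption made so that sampling without replacement cannot fail.

```python
import pandas as pd


def rebalance_by_categorical_sketch(dataset, categ_column, max_lines_by_categ=None, seed=1):
    """Downsample every category to the same number of rows (illustrative only)."""
    counts = dataset[categ_column].value_counts()
    smallest = int(counts.min())
    # Default to the size of the smallest category. Clamping a larger cap is an
    # assumption of this sketch, not documented library behaviour.
    n = smallest if max_lines_by_categ is None else min(max_lines_by_categ, smallest)
    parts = [
        group.sample(n=n, random_state=seed)
        for _, group in dataset.groupby(categ_column)
    ]
    return pd.concat(parts).reset_index(drop=True)


# Hypothetical data: 5 red, 3 blue and 2 green rows.
df = pd.DataFrame({
    "color": ["red"] * 5 + ["blue"] * 3 + ["green"] * 2,
    "value": range(10),
})
balanced = rebalance_by_categorical_sketch(df, "color")
capped = rebalance_by_categorical_sketch(df, "color", max_lines_by_categ=1)
print(balanced["color"].value_counts().to_dict())
```

With no cap, every category is reduced to the size of the smallest one (two rows here); with max_lines_by_categ=1, each category keeps a single row.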

fklearn.preprocessing.rebalancing.rebalance_by_continuous

Resample dataset so that the result contains the same number of lines per bucket in a continuous column.

Parameters:

dataset (pandas.DataFrame) – A pandas DataFrame with a continuous_column column.
continuous_column (str) – The name of the continuous column.
buckets (int) – The number of buckets to split the continuous column into.
max_lines_by_categ (int (default None)) – The maximum number of lines per bucket. If None, it is set to the number of lines in the smallest bucket.
by_quantile (bool (default False)) – If True, uses pd.qcut instead of pd.cut to get the buckets from the continuous column.
seed (int (default 1)) – Random state, for reproducible sampling.

Returns:	rebalanced_dataset – A dataset with no more lines than dataset, and with the same number of lines per bucket of continuous_column.
Return type:	pandas.DataFrame
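
The difference between equal-width buckets (pd.cut) and quantile buckets (pd.qcut) is easiest to see on skewed data. The following is a minimal pandas sketch of the behaviour described above, not the library's code; the function name and the data are hypothetical.

```python
import pandas as pd


def rebalance_by_continuous_sketch(dataset, continuous_column, buckets,
                                   max_lines_by_categ=None, by_quantile=False, seed=1):
    """Bucket a continuous column, then downsample each bucket equally (illustrative only)."""
    bin_fn = pd.qcut if by_quantile else pd.cut
    # labels=False returns integer bucket ids instead of interval labels.
    bucket_ids = bin_fn(dataset[continuous_column], buckets, labels=False)
    smallest = int(bucket_ids.value_counts().min())
    # Clamping the cap to the smallest bucket is an assumption of this sketch.
    n = smallest if max_lines_by_categ is None else min(max_lines_by_categ, smallest)
    parts = [
        group.sample(n=n, random_state=seed)
        for _, group in dataset.groupby(bucket_ids)
    ]
    return pd.concat(parts).reset_index(drop=True)


# Hypothetical skewed data: most values are small, two are large.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50, 100]})

equal_width = rebalance_by_continuous_sketch(df, "x", buckets=2)
by_quantile = rebalance_by_continuous_sketch(df, "x", buckets=2, by_quantile=True)
print(len(equal_width), len(by_quantile))
```

With equal-width buckets, only the value 100 falls in the upper bucket, so each bucket is cut down to one row. With quantile buckets, the split point is the median, both buckets hold six rows, and nothing is discarded.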

fklearn.preprocessing.schema module

fklearn.preprocessing.schema.column_duplicatable(columns_to_bind: str) → Callable

Decorator to prepend the feature_duplicator learner.

Identifies the columns to be duplicated and applies the duplicator.

Parameters:	columns_to_bind (str) – The name of the decorated learner's parameter whose value is passed to feature_duplicator as its columns_to_duplicate parameter.
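
The general pattern, a decorator that copies columns before the wrapped function transforms them, can be sketched as follows. This is a simplified illustration and not fklearn's decorator: it works on plain functions rather than curried learners, it supports only a suffix keyword, and all names (column_duplicatable_sketch, double_columns, columns_to_scale) are hypothetical.

```python
import functools

import pandas as pd


def column_duplicatable_sketch(columns_to_bind):
    """Hypothetical decorator: copy the columns named by the wrapped function's
    `columns_to_bind` keyword argument before the function transforms them."""
    def decorator(transform):
        @functools.wraps(transform)
        def wrapper(df, *, suffix=None, **kwargs):
            source_cols = kwargs[columns_to_bind]
            if suffix is None:
                # No duplication requested: the transform overwrites the originals.
                return transform(df, **kwargs)
            out = df.copy()
            new_cols = []
            for col in source_cols:
                out[col + suffix] = out[col]
                new_cols.append(col + suffix)
            # Point the transform at the copies so the originals stay untouched.
            return transform(out, **{**kwargs, columns_to_bind: new_cols})
        return wrapper
    return decorator


@column_duplicatable_sketch("columns_to_scale")
def double_columns(df, columns_to_scale):
    out = df.copy()
    out[columns_to_scale] = out[columns_to_scale] * 2
    return out


df = pd.DataFrame({"x": [1, 2, 3]})
replaced = double_columns(df, columns_to_scale=["x"])
kept = double_columns(df, columns_to_scale=["x"], suffix="_doubled")
print(list(replaced.columns), list(kept.columns))
```

Without a suffix the function behaves as if undecorated and replaces x in place; with a suffix, x is preserved and the transformed values land in x_doubled.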

fklearn.preprocessing.schema.feature_duplicator

Duplicates some columns in the dataframe.

When encoding features, a good practice is to save the encoded version in a different column rather than replacing the original values. The purpose of this function is to duplicate the column to be encoded, to be later replaced by the encoded values.

The duplication method is used to preserve the original behaviour (replace).

Parameters:

df (pandas.DataFrame) – A pandas DataFrame containing the columns_to_duplicate columns.
columns_to_duplicate (list of str) – List of column names.
columns_mapping (dict (default None)) – Mapping of source columns to destination columns.
prefix (str (default None)) – Prefix to add to the names of the duplicated columns.
suffix (str (default None)) – Suffix to add to the names of the duplicated columns.

Returns:	increased_dataset – A dataset with the duplicated columns added.
Return type:	pandas.DataFrame
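
As a rough illustration of how the naming parameters interact, here is a pandas-only sketch that returns just the enlarged DataFrame. The function name duplicate_columns_sketch is hypothetical, and the precedence shown (an explicit columns_mapping wins over prefix and suffix) is an assumption of this sketch rather than documented behaviour.

```python
import pandas as pd


def duplicate_columns_sketch(df, columns_to_duplicate=None, columns_mapping=None,
                             prefix=None, suffix=None):
    """Copy columns under new names, leaving the originals untouched (illustrative only)."""
    if columns_mapping is not None:
        # Assumption: an explicit source -> destination mapping takes precedence.
        mapping = dict(columns_mapping)
    else:
        mapping = {
            col: f"{prefix or ''}{col}{suffix or ''}"
            for col in (columns_to_duplicate or [])
        }
    out = df.copy()
    for source, destination in mapping.items():
        out[destination] = out[source]
    return out


df = pd.DataFrame({"city": ["Lima", "Oslo"], "age": [31, 45]})

with_suffix = duplicate_columns_sketch(df, columns_to_duplicate=["city"], suffix="_raw")
with_mapping = duplicate_columns_sketch(df, columns_mapping={"age": "age_copy"})
print(list(with_suffix.columns), list(with_mapping.columns))
```

An encoder can then overwrite one copy of the column while the other keeps the original values.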

fklearn.preprocessing.splitting module

fklearn.preprocessing.splitting.space_time_split_dataset

Splits panel data using both ID and time columns, resulting in four datasets:

A training set;
An in-time but out-of-sample-ID holdout set;
An out-of-time but in-sample-ID holdout set;
An out-of-time and out-of-sample-ID holdout set.

Parameters:

dataset (pandas.DataFrame) – A pandas DataFrame with an Identifier Column and a Date Column. The model will be trained to predict the target column from the features.
train_start_date (str) – A date string representing the starting time of the training data. It should be in the same format as the Date Column in dataset.
train_end_date (str) – A date string representing the ending time of the training data. This will also be used as the start date of the holdout period if no holdout_start_date is given. It should be in the same format as the Date Column in dataset.
holdout_end_date (str) – A date string representing the ending time of the holdout data. It should be in the same format as the Date Column in dataset.
split_seed (int) – A seed used by the random number generator.
space_holdout_percentage (float) – The out of id holdout size as a proportion of the in id training size.
space_column (str) – The name of the Identifier column of dataset.
time_column (str) – The name of the Date column of dataset.
holdout_space (np.array) – An array containing the holdout IDs. If not specified, a random subset of IDs will be selected for holdout.
holdout_start_date (str) – A date string representing the starting time of the holdout data. If None is given it will be equal to train_end_date. It should be in the same format as the Date Column in dataset.

Returns:

train_set (pandas.DataFrame) – The in ID sample and in time training set.
intime_outspace_hdout (pandas.DataFrame) – The out of ID sample and in time hold out set.
outime_inspace_hdout (pandas.DataFrame) – The in ID sample and out of time hold out set.
outime_outspace_hdout (pandas.DataFrame) – The out of ID sample and out of time hold out set.
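
The four-way split can be pictured as the cross product of a time mask and an ID mask. The sketch below is a plain pandas and NumPy illustration, not fklearn's implementation. The function name is hypothetical, and two details are assumptions: the date intervals are treated as half-open (start inclusive, end exclusive), and the number of randomly held-out IDs is the truncated product of the in-time ID count and space_holdout_percentage.

```python
import numpy as np
import pandas as pd


def space_time_split_sketch(dataset, train_start_date, train_end_date, holdout_end_date,
                            split_seed, space_holdout_percentage, space_column, time_column,
                            holdout_space=None, holdout_start_date=None):
    """Split panel data into four sets by time window and ID membership (illustrative only)."""
    holdout_start_date = holdout_start_date or train_end_date
    dates = dataset[time_column]
    # Assumption: intervals are half-open, [start, end).
    in_time = (dates >= train_start_date) & (dates < train_end_date)
    out_time = (dates >= holdout_start_date) & (dates < holdout_end_date)

    if holdout_space is None:
        # Draw the holdout IDs at random from the IDs seen in the training window.
        ids = np.sort(dataset.loc[in_time, space_column].unique())
        rng = np.random.RandomState(split_seed)
        n_holdout = int(len(ids) * space_holdout_percentage)
        holdout_space = rng.choice(ids, size=n_holdout, replace=False)
    out_space = dataset[space_column].isin(holdout_space)

    train_set = dataset[in_time & ~out_space]
    intime_outspace_hdout = dataset[in_time & out_space]
    outime_inspace_hdout = dataset[out_time & ~out_space]
    outime_outspace_hdout = dataset[out_time & out_space]
    return train_set, intime_outspace_hdout, outime_inspace_hdout, outime_outspace_hdout


# Hypothetical panel: 4 IDs observed on 4 monthly dates (ISO strings sort chronologically).
months = ["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"]
panel = pd.DataFrame({
    "id": [i for i in [1, 2, 3, 4] for _ in months],
    "date": months * 4,
})

train, in_out, out_in, out_out = space_time_split_sketch(
    panel, "2020-01-01", "2020-03-01", "2020-05-01",
    split_seed=42, space_holdout_percentage=0.25,
    space_column="id", time_column="date",
    holdout_space=np.array([4]),
)
print(len(train), len(in_out), len(out_in), len(out_out))
```

Here ID 4 is held out explicitly, so the training set holds IDs 1 to 3 for January and February, and the three holdout sets cover the remaining combinations of ID and period.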

fklearn.preprocessing.splitting.stratified_split_dataset

Splits data into training and testing datasets that maintain the class ratio of the original dataset.

Parameters:

dataset (pandas.DataFrame) – A pandas DataFrame with the target column. The model will be trained to predict the target column from the features.
target_column (str) – The name of the target column of dataset.
test_size (float) – The proportion of the dataset to include in the test split. Should be between 0.0 and 1.0.
random_state (int or None, optional (default=None)) – If int, random_state is the seed used by the random number generator; if None, the random number generator is the RandomState instance used by np.random.

Returns:

train_set (pandas.DataFrame) – The train dataset sampled from the full dataset.
test_set (pandas.DataFrame) – The test dataset sampled from the full dataset.
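
To show what "maintaining the class ratio" means in practice, here is a pandas-only sketch that samples the same fraction from every class. It is not the library's implementation; the function name is hypothetical, and the sketch assumes the DataFrame index is unique so that the test rows can be dropped by label.

```python
import pandas as pd


def stratified_split_sketch(dataset, target_column, test_size, random_state=None):
    """Sample `test_size` of each class for the test set (illustrative only)."""
    test_set = (
        dataset
        .groupby(target_column, group_keys=False)
        .sample(frac=test_size, random_state=random_state)
    )
    # Assumes a unique index: everything not drawn for testing is used for training.
    train_set = dataset.drop(test_set.index)
    return train_set, test_set


# Hypothetical imbalanced data: 80 negatives and 20 positives.
df = pd.DataFrame({"feature": range(100), "y": [0] * 80 + [1] * 20})

train_set, test_set = stratified_split_sketch(df, "y", test_size=0.25, random_state=7)
print(train_set["y"].value_counts().to_dict(), test_set["y"].value_counts().to_dict())
```

Both resulting sets keep the original 4:1 ratio of negatives to positives: 60 to 15 in the training set and 20 to 5 in the test set.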

fklearn.preprocessing.splitting.time_split_dataset

Splits temporal data into training and testing datasets such that all training data comes before the testing data.

Parameters:

dataset (pandas.DataFrame) – A pandas DataFrame with an Identifier Column and a Date Column. The model will be trained to predict the target column from the features.
train_start_date (str) – A date string representing the starting time of the training data. It should be in the same format as the Date Column in dataset.
train_end_date (str) – A date string representing the ending time of the training data. This will also be used as the start date of the holdout period if no holdout_start_date is given. It should be in the same format as the Date Column in dataset.
holdout_end_date (str) – A date string representing the ending time of the holdout data. It should be in the same format as the Date Column in dataset.
time_column (str) – The name of the Date column of dataset.
holdout_start_date (str) – A date string representing the starting time of the holdout data. If None is given it will be equal to train_end_date. It should be in the same format as the Date Column in dataset.

Returns:

train_set (pandas.DataFrame) – The in-time training set.
test_set (pandas.DataFrame) – The out-of-time holdout set.
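
A minimal pandas sketch of a purely temporal split follows. It is illustrative rather than the library's code: the function name is hypothetical, and treating each date interval as half-open (start inclusive, end exclusive) is an assumption.

```python
import pandas as pd


def time_split_sketch(dataset, train_start_date, train_end_date, holdout_end_date,
                      time_column, holdout_start_date=None):
    """Split rows into a training window and a later holdout window (illustrative only)."""
    # The holdout window starts where training ends unless stated otherwise.
    holdout_start_date = holdout_start_date or train_end_date
    dates = dataset[time_column]
    # Assumption: intervals are half-open, [start, end).
    train_set = dataset[(dates >= train_start_date) & (dates < train_end_date)]
    test_set = dataset[(dates >= holdout_start_date) & (dates < holdout_end_date)]
    return train_set, test_set


# Hypothetical monthly data; ISO date strings sort chronologically.
df = pd.DataFrame({
    "date": ["2021-01-01", "2021-02-01", "2021-03-01",
             "2021-04-01", "2021-05-01", "2021-06-01"],
    "value": [10, 11, 12, 13, 14, 15],
})

train_set, test_set = time_split_sketch(
    df, "2021-01-01", "2021-04-01", "2021-07-01", time_column="date"
)
print(train_set["date"].tolist(), test_set["date"].tolist())
```

Because holdout_start_date defaults to train_end_date, the two windows are adjacent and never overlap: January to March go to training, April to June to the holdout.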
