fklearn.preprocessing package

Submodules

fklearn.preprocessing.rebalancing module

fklearn.preprocessing.rebalancing.rebalance_by_categorical

Resample dataset so that the result contains the same number of lines per category in categ_column.

Parameters:
  • dataset (pandas.DataFrame) – A Pandas’ DataFrame with a categ_column column
  • categ_column (str) – The name of the categorical column
  • max_lines_by_categ (int (default None)) – The maximum number of lines by category. If None it will be set to the number of lines for the smallest category
  • seed (int (default 1)) – Random state for consistency.
Returns:

rebalanced_dataset – A dataset with fewer lines than dataset, but with the same number of lines per category in categ_column

Return type:

pandas.DataFrame
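
The behaviour described above can be illustrated with a minimal pandas sketch. This is not fklearn's implementation: the helper name rebalance_sketch and the toy data are hypothetical, and capping max_lines_by_categ at the size of the smallest category is an assumption made so that sampling without replacement cannot fail.

```python
import pandas as pd


def rebalance_sketch(dataset, categ_column, max_lines_by_categ=None, seed=1):
    """Hypothetical helper: downsample every category to the same row count."""
    # Row count of the smallest category; used as the default cap.
    smallest = int(dataset[categ_column].value_counts().min())
    # Assumption: never ask for more rows than the smallest category has.
    n = smallest if max_lines_by_categ is None else min(max_lines_by_categ, smallest)
    # Draw n rows from each category with a fixed random state.
    parts = [
        group.sample(n=n, random_state=seed)
        for _, group in dataset.groupby(categ_column)
    ]
    return pd.concat(parts)


# Toy data: 5 rows of "a", 2 rows of "b", 3 rows of "c".
df = pd.DataFrame({"label": ["a"] * 5 + ["b"] * 2 + ["c"] * 3, "x": range(10)})
balanced = rebalance_sketch(df, "label")
counts = balanced["label"].value_counts().to_dict()
```

In this toy case the smallest category has two rows, so every category is reduced to two rows.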

fklearn.preprocessing.rebalancing.rebalance_by_continuous

Resample dataset so that the result contains the same number of lines per bucket in a continuous column.

Parameters:
  • dataset (pandas.DataFrame) – A Pandas’ DataFrame with a continuous_column column
  • continuous_column (str) – The name of the continuous column
  • buckets (int) – The number of buckets to split the continuous column into
  • max_lines_by_categ (int (default None)) – The maximum number of lines by bucket. If None it will be set to the number of lines for the smallest bucket
  • by_quantile (bool (default False)) – If True, uses pd.qcut instead of pd.cut to get the buckets from the continuous column
  • seed (int (default 1)) – Random state for consistency.
Returns:

rebalanced_dataset – A dataset with fewer lines than dataset, but with the same number of lines per bucket of continuous_column

Return type:

pandas.DataFrame
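
As a rough illustration of bucketing followed by rebalancing, the sketch below uses pd.cut (equal-width buckets) or pd.qcut (quantile buckets) and then downsamples each bucket to the size of the smallest non-empty one. The helper name rebalance_by_bucket_sketch and the toy data are hypothetical, and this is not presented as fklearn's implementation.

```python
import pandas as pd


def rebalance_by_bucket_sketch(dataset, continuous_column, buckets,
                               by_quantile=False, seed=1):
    """Hypothetical helper: equalize row counts across buckets of a column."""
    # pd.qcut builds quantile-based buckets, pd.cut builds equal-width ones.
    cutter = pd.qcut if by_quantile else pd.cut
    bucket_ids = cutter(dataset[continuous_column], buckets, labels=False)
    # Size of the smallest non-empty bucket.
    n = int(bucket_ids.value_counts().min())
    # Draw n rows from each bucket with a fixed random state.
    parts = [
        group.sample(n=n, random_state=seed)
        for _, group in dataset.groupby(bucket_ids)
    ]
    return pd.concat(parts)


df = pd.DataFrame({"value": [0, 1, 2, 3, 4, 10, 11, 20]})
# Equal-width: six values fall in the lower bucket, two in the upper one.
balanced = rebalance_by_bucket_sketch(df, "value", buckets=2)
# Quantile-based: both buckets already hold four values each.
balanced_q = rebalance_by_bucket_sketch(df, "value", buckets=2, by_quantile=True)
```

With skewed data such as this, equal-width buckets discard many more rows than quantile buckets, which is the practical difference the by_quantile flag controls.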

fklearn.preprocessing.schema module

fklearn.preprocessing.schema.column_duplicatable(columns_to_bind: str) → Callable

Decorator to prepend the feature_duplicator learner.

Identifies the columns to be duplicated and applies duplicator.

Parameters:
  • columns_to_bind (str) – The name of the decorated learner’s parameter whose value is passed to feature_duplicator as its “columns_to_duplicate” parameter

fklearn.preprocessing.schema.feature_duplicator

Duplicates some columns in the dataframe.

When encoding features, a good practice is to save the encoded version in a different column rather than replacing the original values. The purpose of this function is to duplicate the column to be encoded, to be later replaced by the encoded values.

Duplication is opt-in: when no columns_mapping, prefix, or suffix is given, each column maps to itself, which preserves the original behaviour (replace).

Parameters:
  • df (pandas.DataFrame) – A Pandas’ DataFrame with columns_to_duplicate columns
  • columns_to_duplicate (list of str) – List of column names
  • columns_mapping (dict (default None)) – Mapping of source columns to destination columns
  • prefix (str (default None)) – Prefix to add to columns to duplicate
  • suffix (str (default None)) – Suffix to add to columns to duplicate
Returns:

increased_dataset – A dataset with repeated columns

Return type:

pandas.DataFrame
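
The column duplication described above can be sketched in plain pandas. The helper name duplicate_columns_sketch and the toy columns are hypothetical; the sketch assumes that an explicit columns_mapping takes precedence over prefix and suffix, and it returns only the new DataFrame.

```python
import pandas as pd


def duplicate_columns_sketch(df, columns_to_duplicate, columns_mapping=None,
                             prefix=None, suffix=None):
    """Hypothetical helper: copy columns under new names, leaving df untouched."""
    if columns_mapping is not None:
        # Assumption: an explicit source -> destination mapping wins.
        mapping = columns_mapping
    else:
        # Otherwise build destination names from the prefix and suffix.
        mapping = {
            col: (prefix or "") + col + (suffix or "")
            for col in columns_to_duplicate
        }
    # assign returns a new DataFrame; a column mapped to itself is unchanged.
    return df.assign(**{dest: df[src] for src, dest in mapping.items()})


df = pd.DataFrame({"city": ["NY", "SF"], "n": [1, 2]})
with_suffix = duplicate_columns_sketch(df, ["city"], suffix="_raw")
with_mapping = duplicate_columns_sketch(df, ["city"],
                                        columns_mapping={"city": "city_copy"})
unchanged = duplicate_columns_sketch(df, ["city"])
```

An encoder can then overwrite one copy while the other keeps the original values.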

fklearn.preprocessing.splitting module

fklearn.preprocessing.splitting.space_time_split_dataset

Splits panel data using both ID and Time columns, resulting in four datasets:

  1. A training set;
  2. A holdout set inside the training time period, but with out-of-sample IDs;
  3. A holdout set outside the training time period, but with in-sample IDs;
  4. A holdout set outside the training time period and with out-of-sample IDs.
Parameters:
  • dataset (pandas.DataFrame) – A Pandas’ DataFrame with an Identifier Column and a Date Column. The model will be trained to predict the target column from the features.
  • train_start_date (str) – A date string representing the starting time of the training data. It should be in the same format as the Date Column in dataset.
  • train_end_date (str) – A date string representing the ending time of the training data. This will also be used as the start date of the holdout period if no holdout_start_date is given. It should be in the same format as the Date Column in dataset.
  • holdout_end_date (str) – A date string representing the ending time of the holdout data. It should be in the same format as the Date Column in dataset.
  • split_seed (int) – A seed used by the random number generator.
  • space_holdout_percentage (float) – The out of id holdout size as a proportion of the in id training size.
  • space_column (str) – The name of the Identifier column of dataset.
  • time_column (str) – The name of the Date column of dataset.
  • holdout_space (np.array) – An array containing the holdout IDs. If not specified, a random subset of IDs will be selected for holdout.
  • holdout_start_date (str) – A date string representing the starting time of the holdout data. If None is given it will be equal to train_end_date. It should be in the same format as the Date Column in dataset.
Returns:

  • train_set (pandas.DataFrame) – The in ID sample and in time training set.
  • intime_outspace_hdout (pandas.DataFrame) – The out of ID sample and in time hold out set.
  • outime_inspace_hdout (pandas.DataFrame) – The in ID sample and out of time hold out set.
  • outime_outspace_hdout (pandas.DataFrame) – The out of ID sample and out of time hold out set.
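
The four-way split can be pictured with the simplified sketch below. It is not fklearn's implementation: the helper name space_time_split_sketch is hypothetical, the date ranges are assumed to be start-inclusive and end-exclusive, and space_holdout_percentage is assumed to be the fraction of IDs seen in the training period that are held out.

```python
import numpy as np
import pandas as pd


def space_time_split_sketch(dataset, train_start_date, train_end_date,
                            holdout_end_date, space_holdout_percentage,
                            space_column, time_column, split_seed=None,
                            holdout_start_date=None):
    """Hypothetical helper: split panel data by time window and by ID."""
    holdout_start_date = holdout_start_date or train_end_date
    dates = dataset[time_column]
    # Assumption: ranges are start-inclusive and end-exclusive.
    in_time = (dates >= pd.Timestamp(train_start_date)) & (dates < pd.Timestamp(train_end_date))
    out_time = (dates >= pd.Timestamp(holdout_start_date)) & (dates < pd.Timestamp(holdout_end_date))
    # Randomly hold out a share of the IDs seen in the training period.
    ids = dataset.loc[in_time, space_column].unique()
    rng = np.random.RandomState(split_seed)
    n_holdout = int(len(ids) * space_holdout_percentage)
    holdout_ids = rng.choice(ids, size=n_holdout, replace=False)
    out_space = dataset[space_column].isin(holdout_ids)
    return (
        dataset[in_time & ~out_space],   # training set
        dataset[in_time & out_space],    # in time, out of ID sample
        dataset[out_time & ~out_space],  # out of time, in ID sample
        dataset[out_time & out_space],   # out of time, out of ID sample
    )


# Toy panel: four IDs observed once in January and once in February.
df = pd.DataFrame({
    "id": [1, 2, 3, 4] * 2,
    "date": pd.to_datetime(["2020-01-15"] * 4 + ["2020-02-15"] * 4),
})
train, intime_out, outtime_in, outtime_out = space_time_split_sketch(
    df, "2020-01-01", "2020-02-01", "2020-03-01",
    space_holdout_percentage=0.5, space_column="id", time_column="date",
    split_seed=42,
)
```

Holding out both IDs and time lets a model be checked against unseen entities, unseen periods, and both at once.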

fklearn.preprocessing.splitting.stratified_split_dataset

Splits data into training and testing datasets such that they maintain the same class ratio as the original dataset.

Parameters:
  • dataset (pandas.DataFrame) – A Pandas’ DataFrame with the target column. The model will be trained to predict the target column from the features.
  • target_column (str) – The name of the target column of dataset.
  • test_size (float) – The proportion of the dataset to include in the test split. Should be between 0.0 and 1.0.
  • random_state (int or None, optional (default=None)) – If int, random_state is the seed used by the random number generator; if None, the random number generator is the RandomState instance used by np.random.
Returns:

  • train_set (pandas.DataFrame) – The train dataset sampled from the full dataset.
  • test_set (pandas.DataFrame) – The test dataset sampled from the full dataset.
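
One way to picture a stratified split is to sample the same fraction from every class, as in the pandas-only sketch below. The helper name stratified_split_sketch is hypothetical, the sketch assumes the DataFrame index is unique, and it is not presented as the method fklearn uses internally.

```python
import pandas as pd


def stratified_split_sketch(dataset, target_column, test_size, random_state=None):
    """Hypothetical helper: sample the same fraction of rows from every class."""
    # Take test_size of each class so the class ratio carries over to the test set.
    test_set = pd.concat([
        group.sample(frac=test_size, random_state=random_state)
        for _, group in dataset.groupby(target_column)
    ])
    # Everything not drawn for the test set is training data
    # (assumes a unique index).
    train_set = dataset.drop(test_set.index)
    return train_set, test_set


# Toy data with a 2:1 class ratio.
df = pd.DataFrame({"target": [0] * 8 + [1] * 4, "x": range(12)})
train_set, test_set = stratified_split_sketch(df, "target", test_size=0.25,
                                              random_state=0)
train_counts = train_set["target"].value_counts().to_dict()
test_counts = test_set["target"].value_counts().to_dict()
```

Both resulting sets keep the 2:1 ratio of the toy data.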

fklearn.preprocessing.splitting.time_split_dataset

Splits temporal data into training and testing datasets such that all training data comes before the testing data.

Parameters:
  • dataset (pandas.DataFrame) – A Pandas’ DataFrame with an Identifier Column and a Date Column. The model will be trained to predict the target column from the features.
  • train_start_date (str) – A date string representing the starting time of the training data. It should be in the same format as the Date Column in dataset.
  • train_end_date (str) – A date string representing the ending time of the training data. This will also be used as the start date of the holdout period if no holdout_start_date is given. It should be in the same format as the Date Column in dataset.
  • holdout_end_date (str) – A date string representing the ending time of the holdout data. It should be in the same format as the Date Column in dataset.
  • time_column (str) – The name of the Date column of dataset.
  • holdout_start_date (str) – A date string representing the starting time of the holdout data. If None is given it will be equal to train_end_date. It should be in the same format as the Date Column in dataset.
Returns:

  • train_set (pandas.DataFrame) – The in time training set.
  • test_set (pandas.DataFrame) – The out of time hold out set.
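
A time-based split reduces to two date filters, as in the minimal sketch below. The helper name time_split_sketch is hypothetical, and treating each range as start-inclusive and end-exclusive is an assumption rather than documented fklearn behaviour.

```python
import pandas as pd


def time_split_sketch(dataset, train_start_date, train_end_date,
                      holdout_end_date, time_column, holdout_start_date=None):
    """Hypothetical helper: split rows into a training and a holdout window."""
    # The holdout period starts where training ends unless stated otherwise.
    holdout_start_date = holdout_start_date or train_end_date
    dates = dataset[time_column]
    # Assumption: ranges are start-inclusive and end-exclusive.
    train_mask = (dates >= pd.Timestamp(train_start_date)) & (dates < pd.Timestamp(train_end_date))
    test_mask = (dates >= pd.Timestamp(holdout_start_date)) & (dates < pd.Timestamp(holdout_end_date))
    return dataset[train_mask], dataset[test_mask]


df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-10", "2020-01-20", "2020-02-10", "2020-03-10"]),
    "y": [1, 2, 3, 4],
})
train_set, test_set = time_split_sketch(
    df, "2020-01-01", "2020-02-01", "2020-03-01", time_column="date"
)
```

Here the two January rows form the training set, the February row forms the holdout set, and the March row falls outside both windows.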

Module contents