hyperparameter_hunter package

Subpackages

Submodules

hyperparameter_hunter.algorithm_handlers module

hyperparameter_hunter.algorithm_handlers.identify_algorithm(model_initializer)

Determine the name and module of the algorithm provided by model_initializer

Parameters
model_initializer: functools.partial, or class, or class instance

The algorithm class being used to initialize a model

Returns
algorithm_name: str

The name of the algorithm provided by model_initializer

module_name: str

The name of the module housing the algorithm provided by model_initializer

Examples

>>> from sklearn.cluster import DBSCAN, SpectralClustering
>>> from functools import partial
>>> identify_algorithm(DBSCAN)
('DBSCAN', 'sklearn')
>>> identify_algorithm(DBSCAN())
('DBSCAN', 'sklearn')
>>> identify_algorithm(partial(SpectralClustering))
('SpectralClustering', 'sklearn')
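The behaviour shown in the examples above can be approximated with plain introspection. The following is a minimal sketch, not the library's implementation: `sketch_identify_algorithm` is a hypothetical name, and `json.JSONDecoder` stands in for an estimator class so that the snippet needs only the standard library.

```python
from functools import partial
from json import JSONDecoder


def sketch_identify_algorithm(model_initializer):
    """Return (class name, top-level module name) for a class, instance, or partial."""
    # A partial keeps the wrapped initializer in its `func` attribute
    if isinstance(model_initializer, partial):
        model_initializer = model_initializer.func
    # An instance is mapped back to its class
    cls = model_initializer if isinstance(model_initializer, type) else type(model_initializer)
    # Keep only the top-level package: "json.decoder" -> "json"
    return cls.__name__, cls.__module__.split(".")[0]


print(sketch_identify_algorithm(JSONDecoder))
print(sketch_identify_algorithm(JSONDecoder()))
print(sketch_identify_algorithm(partial(JSONDecoder)))
```

All three calls resolve to the same class, which mirrors how the documented examples treat a class, an instance, and a partial alike.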
hyperparameter_hunter.algorithm_handlers.identify_algorithm_hyperparameters(model_initializer)

Determine keyword-arguments accepted by model_initializer, along with their default values

Parameters
model_initializer: functools.partial, or class, or class instance

The algorithm class being used to initialize a model

Returns
hyperparameter_defaults: dict

The dict of kwargs accepted by model_initializer and their default values
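One way to collect keyword arguments and their defaults is `inspect.signature`. The sketch below is an assumption about how such a lookup could work, not the library's code: `ToyModel` and `sketch_hyperparameter_defaults` are hypothetical, and arguments already bound by a partial are ignored for simplicity.

```python
import inspect
from functools import partial


class ToyModel:
    """Hypothetical stand-in for an estimator class."""

    def __init__(self, n_estimators=10, max_depth=None, *, learning_rate=0.1):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.learning_rate = learning_rate


def sketch_hyperparameter_defaults(model_initializer):
    """Collect the initializer arguments that declare default values."""
    if isinstance(model_initializer, partial):
        model_initializer = model_initializer.func
    cls = model_initializer if isinstance(model_initializer, type) else type(model_initializer)
    parameters = inspect.signature(cls.__init__).parameters
    return {
        name: parameter.default
        for name, parameter in parameters.items()
        if name != "self" and parameter.default is not inspect.Parameter.empty
    }


defaults = sketch_hyperparameter_defaults(ToyModel)
print(defaults)
```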

hyperparameter_hunter.environment module

This module is central to the proper functioning of the entire library. It defines Environment, which (when activated) is used by the vast majority of the other operation-critical modules in the library. Environment can be viewed as a simple storage container that defines settings that characterize the Experiments/OptimizationProtocols to be conducted, and influence how those processes are carried out

Notes

Despite the fact that hyperparameter_hunter.settings is the only module listed as being “related”, pretty much all the other modules in the library are related to hyperparameter_hunter.environment.Environment by way of this relation

class hyperparameter_hunter.environment.Environment(train_dataset, environment_params_path=None, *, results_path=None, metrics=None, holdout_dataset=None, test_dataset=None, target_column=None, id_column=None, do_predict_proba=None, prediction_formatter=None, metrics_params=None, cv_type=None, runs=None, global_random_seed=None, random_seeds=None, random_seed_bounds=None, cv_params=None, verbose=None, file_blacklist=None, reporting_params=None, to_csv_params=None, do_full_save=None, experiment_callbacks=None, experiment_recorders=None, save_transformed_metrics=None)

Bases: object

Class to organize the parameters that allow Experiments/OptPros to be fairly compared

Environment is the collective starting point for all of HyperparameterHunter’s biggest and best toys: Experiments and OptimizationProtocols. Without an Environment, neither of these will work.

The Environment is where we declare all the parameters that transcend traditional “hyperparameters”. It houses the stuff without which machine learning can’t even really start. Specifically, Environment cares about 1) the data used for fitting/predicting; 2) the cross-validation scheme used to split the data and fit models; and 3) how to evaluate the predictions made on that data. There are plenty of other goodies documented below, but the absolutely mission-critical parameters concerned with the above tasks are train_dataset, cv_type, cv_params, and metrics. Additionally, it’s important to provide results_path, so Experiment/OptPro results can be saved, which is kind of what HyperparameterHunter is all about

Parameters
train_dataset: pandas.DataFrame, or str path

The training data for the experiment. Will be split into train/holdout data, if applicable, and train/validation data if cross-validation is to be performed. If str, will attempt to read file at path via pandas.read_csv(). For more information on which columns will be used during fitting/predicting, see the “Dataset columns” note in the “Notes” section below

environment_params_path: String path, or None, default=None

If not None and a valid .json filepath containing an object (dict), the file’s contents are treated as the default values for all keys that match any of the below kwargs used to initialize Environment

results_path: String path, or None, default=None

If valid directory path and the results directory has not yet been created, it will be created here. If this does not end with <ASSETS_DIRNAME>, it will be appended. If <ASSETS_DIRNAME> already exists at this path, new results will also be stored here. If None or invalid, results will not be stored
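The path-suffix rule can be sketched as follows. This is an illustration only: `sketch_results_dir` is a hypothetical helper, the value of `ASSETS_DIRNAME` is assumed from the 'HyperparameterHunterAssets/Heartbeat.log' path mentioned later in this document, and the sketch neither creates the directory nor checks that the path is valid.

```python
import os

# ASSUMPTION: directory name inferred from the heartbeat path quoted in this document
ASSETS_DIRNAME = "HyperparameterHunterAssets"


def sketch_results_dir(results_path):
    """Return the directory in which results would be stored, or None."""
    if results_path is None:
        return None
    normalized = os.path.normpath(results_path)
    # Append the assets directory unless the path already ends with it
    if os.path.basename(normalized) != ASSETS_DIRNAME:
        normalized = os.path.join(normalized, ASSETS_DIRNAME)
    return normalized


print(sketch_results_dir("output"))
```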

metrics: Dict, List, or None, default=None

Iterable describing the metrics to be recorded, along with a means to compute the value of each metric. Should be of one of the two following forms:

List Form:

  • [“<metric name>”, “<metric name>”, …]: Where each value is a string that names an attribute in sklearn.metrics

  • [Metric, Metric, …]: Where each value of the list is an instance of metrics.Metric

  • [(<name>, <metric_function>, [<direction>]), (<*args>), …]: Where each value of the list is a tuple of arguments that will be used to instantiate a metrics.Metric. Arguments given in tuples must be in order expected by metrics.Metric: (name, metric_function, direction)

Dict Form:

  • {“<metric name>”: <metric_function>, …}: Where each key is a name for the corresponding metric callable, which is used to compute the value of the metric

  • {“<metric name>”: (<metric_function>, <direction>), …}: Where each key is a name for the corresponding metric callable and direction, all of which are used to instantiate a metrics.Metric

  • {“<metric name>”: “<sklearn metric name>”, …}: Where each key is a name for the metric, and each value is the name of the attribute in sklearn.metrics for which the corresponding key is an alias

  • {“<metric name>”: None, …}: Where each key is the name of the attribute in sklearn.metrics

  • {“<metric name>”: Metric, …}: Where each key names an instance of metrics.Metric. This is the internally-used format to which all other formats will be converted

Metric callable functions should expect inputs of form (target, prediction), and should return floats. See the documentation of metrics.Metric for information regarding expected parameters and types
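As an illustration of the callable contract, the sketch below defines two pure-Python metric functions and arranges them in some of the forms listed above. The function names are hypothetical, and the direction strings "min" and "max" are assumed values; see the documentation of metrics.Metric for what is actually accepted.

```python
def mean_absolute_error(target, prediction):
    """Metric callable: receives (target, prediction) and returns a float."""
    errors = [abs(t - p) for t, p in zip(target, prediction)]
    return float(sum(errors) / len(errors))


def accuracy(target, prediction):
    """Fraction of predictions that exactly match the target."""
    matches = [t == p for t, p in zip(target, prediction)]
    return float(sum(matches) / len(matches))


# Dict form: metric name -> callable
metrics_as_dict = {"mae": mean_absolute_error}

# Dict form: metric name -> (callable, direction); direction strings are assumed
metrics_with_direction = {"mae": (mean_absolute_error, "min"), "acc": (accuracy, "max")}

# List form: tuples in the order (name, metric_function, direction)
metrics_as_list = [("mae", mean_absolute_error, "min"), ("acc", accuracy, "max")]

print(mean_absolute_error([1, 2, 3], [1, 2, 5]))
print(accuracy([1, 0, 1], [1, 1, 1]))
```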

holdout_dataset: pandas.DataFrame, callable, str path, or None, default=None

If pd.DataFrame, this is the holdout dataset. If callable, expects a function that takes (self.train: DataFrame, self.target_column: str) as input and returns the new (self.train: DataFrame, self.holdout: DataFrame). If str, will attempt to read file at path via pandas.read_csv(). Else, there is no holdout set. For more information on which columns will be used during fitting/predicting, see the “Dataset columns” note in the “Notes” section below

test_dataset: pandas.DataFrame, str path, or None, default=None

The testing data for the experiment. Structure should be identical to that of train_dataset, except its target_column column can be empty or non-existent, because test_dataset predictions will never be evaluated. If str, will attempt to read file at path via pandas.read_csv(). For more information on which columns will be used during fitting/predicting, see the “Dataset columns” note in the “Notes” section below

target_column: Str, or list, default=’target’

If str, denotes the column name in all provided datasets (except test) that contains the target output. If list, should be a list of strs designating multiple target columns. For example, in a multi-class classification dataset like UCI’s hand-written digits, target_column would be a list containing ten strings. In this example, the target_column data would be sparse, with a 1 to signify that a sample is a written example of a digit (0-9). For a working example, see ‘hyperparameter_hunter/examples/lib_keras_multi_classification_example.py’

id_column: Str, or None, default=None

If not None, str denoting the column name in all provided datasets containing sample IDs

do_predict_proba: Boolean, or int, default=False
  • If False, models.Model.fit() will call models.Model.model.predict()

  • If True, it will call models.Model.model.predict_proba(), and the values in all columns will be used as the actual prediction values

  • If do_predict_proba is an int, models.Model.fit() will call models.Model.model.predict_proba(), as is the case when do_predict_proba is True, but the int supplied as do_predict_proba declares the column index to use as the actual prediction values

  • For example, to have a model call the predict method, use do_predict_proba=False (the default). To call the predict_proba method and use all of the class probabilities, use do_predict_proba=True. To call the predict_proba method and use only the class probabilities in the first column, use do_predict_proba=0. To use the second column (index 1) of the result, use do_predict_proba=1; this often corresponds to the positive class’s probabilities in binary classification problems. To use the third column, use do_predict_proba=2, and so on
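The three cases above can be sketched with a toy model. `ToyClassifier` and `sketch_predict` are hypothetical and only illustrate the dispatch; they are not the code in models.Model. Note that the booleans are tested by identity before the value is treated as a column index, because `bool` is a subclass of `int` in Python.

```python
class ToyClassifier:
    """Hypothetical binary classifier exposing predict and predict_proba."""

    def predict_proba(self, rows):
        # One [P(class 0), P(class 1)] pair per input row
        return [[1.0 - row, row] for row in rows]

    def predict(self, rows):
        return [int(positive >= 0.5) for _, positive in self.predict_proba(rows)]


def sketch_predict(model, rows, do_predict_proba=False):
    """Choose predictions according to the value of do_predict_proba."""
    if do_predict_proba is False:
        return model.predict(rows)
    probabilities = model.predict_proba(rows)
    if do_predict_proba is True:
        return probabilities  # All columns
    # Any other value is treated as a column index
    return [row[do_predict_proba] for row in probabilities]


model, rows = ToyClassifier(), [0.25, 0.75]
print(sketch_predict(model, rows))
print(sketch_predict(model, rows, do_predict_proba=True))
print(sketch_predict(model, rows, do_predict_proba=1))
```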

prediction_formatter: Callable, or None, default=None

If callable, expected to have same signature as utils.result_utils.format_predictions(). That is, the callable will receive (raw_predictions: np.array, dataset_df: pd.DataFrame, target_column: str, id_column: str or None) as input and should return a properly formatted prediction DataFrame. The callable uses raw_predictions as the content, dataset_df to provide any id column, and target_column to identify the column in which to place raw_predictions

metrics_params: Dict, or None, default=dict()

Dictionary of extra parameters to provide to metrics.ScoringMixIn.__init__(). metrics must be provided either 1) as an input kwarg to Environment.__init__() (see metrics), or 2) as a key in metrics_params, but not both. An Exception will be raised if both are given, or if neither is given

cv_type: Class or str, default=’KFold’

The class to define cross-validation splits. If str, it must be an attribute of sklearn.model_selection._split, and it must be a cross-validation class that inherits from one of the following sklearn classes: BaseCrossValidator or _RepeatedSplits. Valid str values include ‘KFold’ and ‘RepeatedKFold’, although there are many more. It must implement the following methods: [__init__, split]. If using a custom class, see the following tested sklearn classes for proper implementations: [KFold, StratifiedKFold, RepeatedKFold, RepeatedStratifiedKFold]. The arguments provided to cv_type.__init__() will be Environment.cv_params, which should include the following: [‘n_splits’ <int>, ‘n_repeats’ <int> (if applicable)]. cv_type.split() will receive the following arguments: [BaseExperiment.train_input_data, BaseExperiment.train_target_data]
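To make the `__init__`/`split` contract concrete, here is a minimal pure-Python splitter. It is a sketch under stated assumptions: `ContiguousKFold` is a hypothetical class, it yields plain lists of indices, and the library may impose further requirements on custom classes (the sklearn classes listed above remain the reference implementations).

```python
class ContiguousKFold:
    """Hypothetical splitter producing contiguous, unshuffled folds."""

    def __init__(self, n_splits=3):
        self.n_splits = n_splits

    def split(self, X, y=None):
        """Yield (train_indices, validation_indices) once per fold."""
        n_samples = len(X)
        indices = list(range(n_samples))
        # Spread any remainder over the first folds
        fold_sizes = [
            n_samples // self.n_splits + (1 if fold < n_samples % self.n_splits else 0)
            for fold in range(self.n_splits)
        ]
        start = 0
        for size in fold_sizes:
            stop = start + size
            yield indices[:start] + indices[stop:], indices[start:stop]
            start = stop


cv_params = {"n_splits": 3}  # Passed to __init__, as cv_params would be
folds = list(ContiguousKFold(**cv_params).split(list("abcdefg")))
for train_indices, validation_indices in folds:
    print(train_indices, validation_indices)
```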

runs: Int, default=1

The number of times to fit a model within each fold to perform multiple-run-averaging with different random seeds

global_random_seed: Int, default=32

The initial random seed used just before generating an Experiment’s random_seeds. This ensures consistency for random_seeds between Experiments, without having to explicitly provide it here

random_seeds: None, or List, default=None

If None, random_seeds of the appropriate shape will be created automatically. Else, must be a list of ints of shape (cv_params[‘n_repeats’], cv_params[‘n_splits’], runs). If cv_params does not have the key n_repeats (because standard cross-validation is being used), the value will default to 1. See experiments.BaseExperiment._random_seed_initializer() for info on expected shape

random_seed_bounds: List, default=[0, 100000]

A list containing two integers: the lower and upper bounds, respectively, for generating an Experiment’s random seeds in experiments.BaseExperiment._random_seed_initializer(). Generally, leave this kwarg alone
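The expected shape of random_seeds, and the roles of global_random_seed and random_seed_bounds, can be sketched with the standard `random` module. This is an illustration only: `sketch_random_seeds` is a hypothetical helper, and the values it produces will not match those generated by experiments.BaseExperiment._random_seed_initializer().

```python
import random


def sketch_random_seeds(cv_params, runs=1, global_random_seed=32, random_seed_bounds=(0, 100000)):
    """Build a nested list of seeds with shape (n_repeats, n_splits, runs)."""
    n_repeats = cv_params.get("n_repeats", 1)  # Defaults to 1 for standard cross-validation
    n_splits = cv_params["n_splits"]
    lower, upper = random_seed_bounds
    # Seeding the generator first makes the result reproducible
    generator = random.Random(global_random_seed)
    return [
        [[generator.randint(lower, upper) for _ in range(runs)] for _ in range(n_splits)]
        for _ in range(n_repeats)
    ]


seeds = sketch_random_seeds({"n_splits": 3, "n_repeats": 2}, runs=4)
print(len(seeds), len(seeds[0]), len(seeds[0][0]))
```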

cv_params: dict, or None, default=dict()

Parameters provided upon initialization of cv_type. Keys may be any args accepted by cv_type.__init__(). Number of fold splits must be provided via “n_splits”, and number of repeats (if applicable for cv_type) must be provided via “n_repeats”

verbose: Int, boolean, default=3

Verbosity of printing for any experiments performed while this Environment is active

Higher values indicate more frequent logging. Logs are still recorded in the heartbeat file regardless of verbosity level. verbose only dictates which logs are visible in the console. The following table illustrates which types of logging messages will be visible with each verbosity level:

| Verbosity | Keys/IDs | Final Score | Repetitions* | Folds | Runs* | Run Starts* | Result Files | Other |
|:---------:|:--------:|:-----------:|:------------:|:-----:|:-----:|:-----------:|:------------:|:-----:|
|     0     |          |             |              |       |       |             |              |       |
|     1     |    Yes   |     Yes     |              |       |       |             |              |       |
|     2     |    Yes   |     Yes     |      Yes     |  Yes  |       |             |              |       |
|     3     |    Yes   |     Yes     |      Yes     |  Yes  |  Yes  |             |              |       |
|     4     |    Yes   |     Yes     |      Yes     |  Yes  |  Yes  |     Yes     |      Yes     |  Yes  |

*: If such logging is deemed appropriate with the given cross-validation parameters. In other words, repetition/run logging will only be verbose if Environment was given more than one repetition/run, respectively
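The table and its footnote can be restated as a small lookup. The sketch below is only a reading aid built from the table; `VISIBLE_FROM` and `visible_messages` are hypothetical names, and the library's logging code may decide visibility differently.

```python
# Lowest verbosity level at which each message type appears in the table above
VISIBLE_FROM = {
    "Keys/IDs": 1,
    "Final Score": 1,
    "Repetitions": 2,
    "Folds": 2,
    "Runs": 3,
    "Run Starts": 4,
    "Result Files": 4,
    "Other": 4,
}


def visible_messages(verbose, n_repeats=1, runs=1):
    """Return the message types shown in the console for a verbosity level."""
    visible = {kind for kind, minimum in VISIBLE_FROM.items() if verbose >= minimum}
    # Footnote: starred columns only apply with more than one repetition/run
    if n_repeats <= 1:
        visible.discard("Repetitions")
    if runs <= 1:
        visible -= {"Runs", "Run Starts"}
    return visible


print(sorted(visible_messages(3, n_repeats=2, runs=2)))
```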

file_blacklist: List of str, or None, or ‘ALL’, default=None

If list of str, the result files named within are not saved to their respective directory in “<ASSETS_DIRNAME>/Experiments”. If None, all result files are saved. If ‘ALL’, nothing at all will be saved for the Experiments. If the path of the file that initializes an Experiment does not end with a “.py” extension, the Experiment proceeds as if “script_backup” had been added to file_blacklist. This means that backup files will not be created for Jupyter notebooks (or any other non-“.py” files). For info on acceptable values, see validate_file_blacklist()

reporting_params: Dict, default=dict()

Parameters passed to initialize reporting.ReportingHandler

to_csv_params: Dict, default=dict()

Parameters passed to the calls to pandas.DataFrame.to_csv() in recorders. In particular, this is where an Experiment’s final prediction files are saved, so the values here will affect the format of the .csv prediction files. Warning: If to_csv_params contains the key “path_or_buf”, it will be removed. Otherwise, all items are supplied directly to to_csv(), including kwargs it might not be expecting if they are given

do_full_save: None, or callable, default=utils.result_utils.default_do_full_save()

If callable, expected to take an Experiment’s result description dict as input and return a boolean. If None, treated as a callable that returns True. This parameter is used by recorders.DescriptionRecorder to determine whether the Experiment result files following the description should also be created. If do_full_save returns False, result file-saving is stopped early, and only the description is saved. If do_full_save returns True, all files not in file_blacklist are saved normally. This allows you to skip creation of an Experiment’s predictions, logs, and heartbeats if its score does not meet some threshold you set, for example. do_full_save receives the Experiment description dict as input, so for help setting do_full_save, just look into one of your Experiment descriptions
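A threshold-based callable might look like the sketch below. The key path into the description dict ("final_evaluations" → "oof" → "roc_auc_score") is an assumption made for illustration, as are the function name and the 0.75 threshold; inspect one of your own Experiment descriptions to find the keys that actually exist.

```python
def sketch_do_full_save(description):
    """Return True only when the out-of-fold score clears a threshold."""
    # ASSUMPTION: this key path is illustrative and depends on your metrics
    return description["final_evaluations"]["oof"]["roc_auc_score"] > 0.75


good = {"final_evaluations": {"oof": {"roc_auc_score": 0.91}}}
poor = {"final_evaluations": {"oof": {"roc_auc_score": 0.62}}}
print(sketch_do_full_save(good), sketch_do_full_save(poor))
```

With a callable like this, only the description would be saved for the second result, and the remaining result files would be skipped.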

experiment_callbacks: LambdaCallback, or list of LambdaCallback (optional)

Callbacks injected directly into Experiments, adding new functionality, or customizing existing processes. Should be a LambdaCallback or a list of such classes. LambdaCallback can be created using callbacks.bases.lambda_callback(), which documents the options for creating callbacks. experiment_callbacks will be added to the MRO of the executed Experiment class by experiment_core.ExperimentMeta at __call__ time, making experiment_callbacks new base classes of the Experiment. See callbacks.bases.lambda_callback() for more information. Note that the Experiments conducted by OptPros will still benefit from experiment_callbacks. The presence of LambdaCallbacks will affect neither Environment keys, nor Experiment keys. In other words, for the purposes of Experiment matching/recording, all other factors being equal, an Experiment with experiment_callbacks is considered identical to an Experiment without, despite whatever custom functionality was added by the LambdaCallbacks

experiment_recorders: List, None, default=None

If not None, may be a list whose values are tuples of (<recorders.BaseRecorder descendant>, <str result_path>). The result_path str should be a path relative to results_path that specifies the directory/file in which the product of the custom recorder should be saved. The contents of experiment_recorders will be provided to recorders.RecorderList upon completion of an Experiment, and, if the subclassing documentation in recorders is followed properly, will create or update a result file for the just-executed Experiment

save_transformed_metrics: Boolean (optional)

Declares manner in which a model’s predictions should be evaluated through the provided metrics, with regard to target data transformations. This setting can be ignored if no transformation of the target variable takes place (either through FeatureEngineer, EngineerStep, or otherwise).

The default value of save_transformed_metrics depends on the dtype of the target data in train_dataset. If all target columns are numeric, save_transformed_metrics=False, meaning metric evaluation should use the original/inverted targets and predictions. Else if any target column is non-numeric, save_transformed_metrics=True, meaning evaluation should use the transformed targets and predictions because most metrics require numeric inputs. This is described further in the documentation of the save_transformed_metrics property. A more descriptive name for this may be “calculate_metrics_using_transformed_predictions”, but that’s a bit verbose, even by my standards

Other Parameters
cross_validation_type: …
  • Alias for cv_type *

cross_validation_params: …
  • Alias for cv_params *

metrics_map: …
  • Alias for metrics *

reporting_handler_params: …
  • Alias for reporting_params *

root_results_path: …
  • Alias for results_path *

Notes

Dataset columns: In order to specify the columns to be used by the three dataset kwargs (train_dataset, holdout_dataset, test_dataset) during fitting and predicting, a few attributes can be used. On Environment initialization, the columns specified by the following kwargs will be separated from the rest of the dataset during training/predicting: 1) target_column, which names the column containing the target output labels for the input data; and 2) id_column, which (if given) represents the name of the column that contains identifying information for each data sample, and should otherwise have no relation to the actual data. Additionally, the feature_selector kwarg of the descendants of hyperparameter_hunter.experiments.BaseExperiment (like hyperparameter_hunter.experiments.CVExperiment) is used to filter out columns of the given datasets prior to fitting. See its documentation for more information, but it can effectively be used to remove any columns from the datasets

Overriding default kwargs at environment_params_path: If you have any of the above kwargs specified in the .json file at environment_params_path (except environment_params_path, which will be ignored), you can override its value by passing it as a kwarg when initializing Environment. The contents at environment_params_path are only used when the matching kwarg supplied at initialization is None. See “/examples/environment_params_path_example.py” for details

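The order of precedence for determining the value of each parameter is as follows, with items at the top having the highest priority, and deferring only to the items below if their own value is None:

  • kwargs passed directly to Environment.__init__() on initialization

  • Keys of the file at environment_params_path (if it is a valid .json file containing an object)

  • Keys of DEFAULT_PARAMS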
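The override behaviour described in this note amounts to a None-aware lookup across the initialization kwargs, the .json file, and DEFAULT_PARAMS. The following sketch illustrates that idea under stated assumptions: `sketch_resolve_params` is a hypothetical helper, only a small subset of DEFAULT_PARAMS is reproduced, and the library's own update_custom_environment_params() may differ in detail.

```python
import json
import os
import tempfile

# Subset of the DEFAULT_PARAMS listed for Environment
DEFAULT_PARAMS = {"cv_type": "KFold", "runs": 1, "verbose": 3}


def sketch_resolve_params(kwargs, environment_params_path=None):
    """Resolve each parameter from kwargs, then the .json file, then defaults."""
    file_params = {}
    if environment_params_path is not None:
        with open(environment_params_path) as params_file:
            loaded = json.load(params_file)
        # Only a JSON object (dict) can supply defaults
        if isinstance(loaded, dict):
            file_params = loaded

    resolved = {}
    for name, default in DEFAULT_PARAMS.items():
        value = kwargs.get(name)
        if value is None:  # Defer to the file only when the kwarg is None
            value = file_params.get(name)
        if value is None:  # Defer to the defaults last
            value = default
        resolved[name] = value
    return resolved


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "environment_params.json")
    with open(path, "w") as params_file:
        json.dump({"runs": 3, "verbose": 1}, params_file)
    resolved = sketch_resolve_params({"verbose": 4, "runs": None}, path)

print(resolved)
```

Here verbose comes from the kwarg, runs from the file (because the kwarg was None), and cv_type from the defaults.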

do_predict_proba: Because this parameter can be either a boolean or an integer, it is important to explicitly pass booleans rather than truthy or falsey values. Similarly, only pass integers if you intend for the value to be used as a column index. Do not pass 0 to mean False, or 1 to mean True

Attributes
train_input: DatasetSentinel

Sentinel replaced with current train input data during Model fitting/predicting. Commonly given in the model_extra_params kwargs of hyperparameter_hunter.experiments.BaseExperiment or hyperparameter_hunter.optimization.protocol_core.BaseOptPro.forge_experiment() for eval_set-like hyperparameters. Importantly, the actual value of this Sentinel is determined after performing cross-validation data splitting, and after executing FeatureEngineer

train_target: DatasetSentinel

Like train_input, except for current train target data

validation_input: DatasetSentinel

Like train_input, except for current validation input data

validation_target: DatasetSentinel

Like train_input, except for current validation target data

holdout_input: DatasetSentinel

Like train_input, except for current holdout input data

holdout_target: DatasetSentinel

Like train_input, except for current holdout target data
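The sentinel idea can be illustrated without the library: a placeholder is stored in the parameters up front and swapped for the real fold data just before fitting. Everything in this sketch is hypothetical (`SketchSentinel`, `resolve_sentinels`, and the toy data); it does not reproduce how DatasetSentinel is implemented.

```python
class SketchSentinel:
    """Hypothetical placeholder naming the dataset it stands for."""

    def __init__(self, dataset_name):
        self.dataset_name = dataset_name


def resolve_sentinels(value, fold_data):
    """Recursively swap placeholders for the current fold's data."""
    if isinstance(value, SketchSentinel):
        return fold_data[value.dataset_name]
    if isinstance(value, dict):
        return {key: resolve_sentinels(item, fold_data) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return type(value)(resolve_sentinels(item, fold_data) for item in value)
    return value


# Declared before any data splitting has happened
fit_params = {
    "eval_set": [(SketchSentinel("validation_input"), SketchSentinel("validation_target"))],
    "verbose": False,
}

# Known only once the current fold has been split
fold_data = {"validation_input": [[0.1], [0.2]], "validation_target": [0, 1]}

resolved_fit_params = resolve_sentinels(fit_params, fold_data)
print(resolved_fit_params)
```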

Methods

environment_workflow(self)

Execute all methods required to validate the environment and run Experiments

format_result_paths(self)

Remove paths contained in file_blacklist, and format others to prepare for saving results

generate_cross_experiment_key(self)

Generate a key to describe the current Environment’s cross-experiment parameters

initialize_reporting(self)

Initialize reporting for the Environment and Experiments conducted during its lifetime

update_custom_environment_params(self)

Try to update null parameters from environment_params_path, or DEFAULT_PARAMS

validate_parameters(self)

Ensure the provided parameters are valid and properly formatted

DEFAULT_PARAMS = {'cv_params': {}, 'cv_type': 'KFold', 'do_full_save': <function default_do_full_save>, 'do_predict_proba': False, 'environment_params_path': None, 'file_blacklist': None, 'global_random_seed': 32, 'id_column': None, 'metrics': None, 'metrics_params': {}, 'prediction_formatter': <function format_predictions>, 'random_seed_bounds': [0, 100000], 'random_seeds': None, 'reporting_params': {'console_params': None, 'float_format': '{:.5f}', 'heartbeat_params': None, 'heartbeat_path': None}, 'results_path': None, 'runs': 1, 'save_transformed_metrics': None, 'target_column': 'target', 'to_csv_params': {}, 'verbose': 3}
property results_path
property target_column
property train_dataset
property test_dataset
property holdout_dataset
property file_blacklist
property cv_type
property to_csv_params
property cross_experiment_params
property experiment_callbacks
property save_transformed_metrics

If save_transformed_metrics is True, and target transformation does occur, then experiment metrics are calculated using the transformed targets and predictions, which is the form returned directly by a fitted model’s predict method. For example, if target data is label-encoded, and a feature_engineering.EngineerStep is used to one-hot encode the target, then metrics functions will receive the following as input: (one-hot-encoded targets, one-hot-encoded predictions).

Conversely, if save_transformed_metrics is False, and target transformation does occur, then experiment metrics are calculated using the inverse of the transformed targets and predictions, which is the same form as the original target data. Continuing the example of label-encoded target data, and a feature_engineering.EngineerStep to one-hot encode the target, in this case, metrics functions will receive the following as input: (label-encoded targets, label-encoded predictions)

property train_input

Get a DatasetSentinel representing an Experiment’s fold_train_input

Returns
DatasetSentinel:

A Sentinel that will be converted to hyperparameter_hunter.experiments.BaseExperiment.fold_train_input upon Model initialization

property train_target

Get a DatasetSentinel representing an Experiment’s fold_train_target

Returns
DatasetSentinel:

A Sentinel that will be converted to hyperparameter_hunter.experiments.BaseExperiment.fold_train_target upon Model initialization

property validation_input

Get a DatasetSentinel representing an Experiment’s fold_validation_input

Returns
DatasetSentinel:

A Sentinel that will be converted to hyperparameter_hunter.experiments.BaseExperiment.fold_validation_input upon Model initialization

property validation_target

Get a DatasetSentinel representing an Experiment’s fold_validation_target

Returns
DatasetSentinel:

A Sentinel that will be converted to hyperparameter_hunter.experiments.BaseExperiment.fold_validation_target upon Model initialization

property holdout_input

Get a DatasetSentinel representing an Experiment’s holdout_input_data

Returns
DatasetSentinel:

A Sentinel that will be converted to hyperparameter_hunter.experiments.BaseExperiment.holdout_input_data upon Model initialization

property holdout_target

Get a DatasetSentinel representing an Experiment’s holdout_target_data

Returns
DatasetSentinel:

A Sentinel that will be converted to hyperparameter_hunter.experiments.BaseExperiment.holdout_target_data upon Model initialization

hyperparameter_hunter.environment.define_holdout_set(train_set: pandas.core.frame.DataFrame, holdout_set: Union[pandas.core.frame.DataFrame, callable, str, NoneType], target_column: Union[str, List[str]]) → Tuple[pandas.core.frame.DataFrame, Union[pandas.core.frame.DataFrame, NoneType]]

Create holdout_set (if necessary) by loading a DataFrame from a .csv file, or by separating train_set, and return the updated (train_set, holdout_set) pair

Parameters
train_set: pandas.DataFrame

Training DataFrame. Will be split into train/holdout data, if holdout_set is callable

holdout_set: pandas.DataFrame, callable, str, or None

If pd.DataFrame, this is the holdout dataset. If callable, expects a function that takes (train_set, target_column) as input and returns the new (train_set, holdout_set). If str, will attempt to read file at path via pandas.read_csv(). Else, no holdout set

target_column: Str, or list

If str, denotes the column name in provided datasets that contains the target output. If list, should be a list of strs designating multiple target columns

Returns
train_set: pandas.DataFrame

train_set if holdout_set is not callable. Else train_set modified by holdout_set

holdout_set: pandas.DataFrame, or None

Original DataFrame, or DataFrame read from str filepath, or a portion of train_set if holdout_set is callable, or None

hyperparameter_hunter.environment.validate_file_blacklist(blacklist)

Validate contents of blacklist. For most values, the corresponding file is saved upon completion of the experiment. See the “Notes” section below for details on some special cases

Parameters
blacklist: List of strings, or None

The result files that should not be saved

Returns
blacklist: List

If not empty, acceptable list of result file types to blacklist

Notes

‘heartbeat’: If the heartbeat file is saved, a new file is not generated and saved to the “Experiments/Heartbeats” directory as is the case with most other files. Instead, the general “Heartbeat.log” file is copied and renamed to the current experiment id, then saved to the appropriate dir. This is because the general “Heartbeat.log” file represents the heartbeat for whatever experiment is currently in progress.

‘script_backup’: This file is saved as quickly as possible after starting a new experiment, rather than waiting for the experiment to end. There are two reasons for this behavior: 1) to avoid saving any changes that may have been made to a file after it has been executed, and 2) to have the offending file in the event of a catastrophic failure that results in no other files being saved. As stated in the documentation of the file_blacklist parameter of Environment, if the path of the file that initializes an Experiment does not end with a “.py” extension, the Experiment proceeds as if “script_backup” had been added to blacklist. This means that backup files will not be created for Jupyter notebooks (or any other non-“.py” files)

‘description’ and ‘tested_keys’: These two result types constitute a bare minimum of sorts for experiment recording. If either of them is blacklisted, then as far as the library is concerned, the experiment never took place.

‘tested_keys’ (continued): If this string is included in the blacklist, then the contents of the “KeyAttributeLookup” directory will also be excluded from the list of files to update

‘current_heartbeat’: The general heartbeat file that should be stored at ‘HyperparameterHunterAssets/Heartbeat.log’. If this value is blacklisted, then ‘heartbeat’ is also added to blacklist automatically out of necessity. This is done because the heartbeat file for the current experiment cannot be created as a copy of the general heartbeat file if the general heartbeat file is never created in the first place
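Two of the special cases in these notes are simple implications, sketched below together with the 'ALL' value accepted by the file_blacklist parameter of Environment. `sketch_effective_blacklist` is a hypothetical helper that combines rules applied at different points in the library, and it does not check entries against the list of acceptable values, which is not reproduced here.

```python
def sketch_effective_blacklist(blacklist, script_path):
    """Apply the implied additions described in the notes above."""
    if blacklist == "ALL":
        return "ALL"  # Nothing at all is saved
    effective = list(blacklist or [])
    # Without the general heartbeat file there is nothing to copy per experiment
    if "current_heartbeat" in effective and "heartbeat" not in effective:
        effective.append("heartbeat")
    # Only ".py" files are backed up
    if not script_path.endswith(".py") and "script_backup" not in effective:
        effective.append("script_backup")
    return effective


print(sketch_effective_blacklist(["current_heartbeat"], "run_experiment.py"))
print(sketch_effective_blacklist(None, "analysis.ipynb"))
```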

hyperparameter_hunter.experiment_core module

This module is the core of all of the experimentation in hyperparameter_hunter, hence its name. It is impossible to understand hyperparameter_hunter.experiments without first having a grasp on what hyperparameter_hunter.experiment_core.ExperimentMeta is doing. This module serves to bridge the gap between Experiments, and hyperparameter_hunter.callbacks by dynamically making Experiments inherit various callbacks depending on the inputs given in order to make Experiments completely functional

Related

hyperparameter_hunter.experiments

Defines the structure of the experimentation process. While certainly very important, hyperparameter_hunter.experiments wouldn’t do much at all without hyperparameter_hunter.callbacks, or hyperparameter_hunter.experiment_core

hyperparameter_hunter.callbacks

Defines parent classes to the classes defined in hyperparameter_hunter.experiments. This not only makes it very easy to find the entire workflow for a given task, but also ensures that each instance of an Experiment inherits exactly the functionality that it needs. For example, if no holdout data was given, then experiment_core.ExperimentMeta will not add callbacks.evaluators.EvaluatorHoldout or callbacks.predictors.PredictorHoldout to the list of callbacks inherited by the Experiment. This means that the Experiment never needs to check for the existence of holdout data in order to determine how it should proceed because it literally doesn’t have the code that deals with holdout data

Notes

Was a metaclass really necessary here? Probably not, but it’s being used for two reasons: 1) metaclasses are fun, and programming (especially artificial intelligence) should be fun; and 2) it allowed for a very clean separation between the various functions demanded by Experiments that are provided by hyperparameter_hunter.callbacks. Having each of the callbacks separated in their own classes makes it very easy to debug existing functionality, and to add new callbacks in the future

class hyperparameter_hunter.experiment_core.ExperimentMeta

Bases: type

Create a new class object that stores necessary class-wide callbacks to __class_wide_bases

Methods

__call__(cls, *args, **kwargs)

Store necessary instance-wide callbacks to __instance_bases, sort all dynamically added callback base classes, then add them to the instance
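
The idea of choosing base classes at instantiation time can be illustrated with a minimal sketch. This is not the library's implementation: SketchMeta, SketchExperiment, EvaluatorHoldoutSketch, and the holdout_data kwarg are hypothetical names invented for the example, and the real ExperimentMeta also sorts the callbacks and tracks class-wide bases.

```python
class BaseCallback:
    """Terminal base so cooperative super() calls have somewhere to stop."""

    def on_exp_end(self):
        pass


class EvaluatorHoldoutSketch(BaseCallback):
    """Hypothetical callback that is only wanted when holdout data exists."""

    def on_exp_end(self):
        self.log.append("evaluate holdout")
        super().on_exp_end()


class SketchMeta(type):
    def __call__(cls, *args, **kwargs):
        # Decide which callbacks this particular instance needs
        extra_bases = []
        if kwargs.get("holdout_data") is not None:
            extra_bases.append(EvaluatorHoldoutSketch)
        if not extra_bases:
            return super().__call__(*args, **kwargs)
        # Build a one-off subclass whose MRO includes the selected callbacks
        dynamic_cls = type(cls)(cls.__name__, (cls, *extra_bases), {})
        return type.__call__(dynamic_cls, *args, **kwargs)


class SketchExperiment(BaseCallback, metaclass=SketchMeta):
    def __init__(self, holdout_data=None):
        self.holdout_data = holdout_data
        self.log = []


with_holdout = SketchExperiment(holdout_data=[1, 2, 3])
without_holdout = SketchExperiment()
with_holdout.on_exp_end()
without_holdout.on_exp_end()
```

Note that SketchExperiment never checks whether holdout data was given: the instance built without it simply has no holdout code in its MRO.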

mro()

Return a type’s method resolution order

hyperparameter_hunter.experiment_core.base_callback_class_sorter(auxiliary_bases, parent_class_order=None)

Sort callback classes in order to preserve the intended MRO of their descendant, and to enable callbacks that may depend on one another to function properly

Parameters
auxiliary_bases: List

The callback classes to be sorted according to the order in which their parent is found in parent_class_order. For example, if a class (x) in auxiliary_bases is the only descendant of the last class in parent_class_order, then class x will be moved to the last position in sorted_auxiliary_bases. If multiple classes in auxiliary_bases are descendants of the same parent in parent_class_order, they will be sorted alphabetically (from A-Z)

parent_class_order: List, or None, default=<See description>

List of base callback classes that define the sort order for auxiliary_bases. Note that these are not the normal callback classes that add to the functionality of an Experiment, but the base classes from which the callback classes are descendants. All the classes in parent_class_order should be defined in hyperparameter_hunter.callbacks.bases. The last class in parent_class_order should be hyperparameter_hunter.callbacks.bases.BaseCallback, which is the parent class for all other base classes. This ensures that custom callbacks defined by hyperparameter_hunter.callbacks.bases.lambda_callback() will be recognized as valid and executed last

Returns
sorted_auxiliary_bases: List

The contents of auxiliary_bases sorted according to their parents’ location in parent_class_order, then alphabetically

Raises
ValueError

If auxiliary_bases contains a class that is not a descendant of any of the classes in parent_class_order

Examples

>>> in_0 = [AggregatorEvaluations, AggregatorTimes, EvaluatorOOF, EvaluatorHoldout, LoggerFitStatus, PredictorOOF, PredictorHoldout, PredictorTest]
>>> out_0 = [PredictorHoldout, PredictorOOF, PredictorTest, EvaluatorHoldout, EvaluatorOOF, AggregatorEvaluations, AggregatorTimes, LoggerFitStatus]
>>> assert base_callback_class_sorter(in_0) == out_0
>>> in_1 = [AggregatorEvaluations, AggregatorTimes, EvaluatorOOF, EvaluatorHoldout, LoggerFitStatus, PredictorOOF, PredictorHoldout, PredictorTest]
>>> out_1 = [PredictorHoldout, PredictorOOF, PredictorTest, EvaluatorHoldout, EvaluatorOOF, AggregatorEvaluations, AggregatorTimes, LoggerFitStatus]
>>> assert base_callback_class_sorter(in_1) == out_1
>>> in_2 = [PredictorOOF, PredictorHoldout, AggregatorTimes, PredictorTest, AggregatorEvaluations, EvaluatorOOF, EvaluatorHoldout, LoggerFitStatus]
>>> out_2 = [PredictorHoldout, PredictorOOF, PredictorTest, EvaluatorHoldout, EvaluatorOOF, AggregatorEvaluations, AggregatorTimes, LoggerFitStatus]
>>> assert base_callback_class_sorter(in_2) == out_2
>>> in_3 = [PredictorTest, EvaluatorHoldout, LoggerFitStatus, AggregatorTimes, PredictorHoldout, PredictorOOF, AggregatorEvaluations, EvaluatorOOF]
>>> out_3 = [PredictorHoldout, PredictorOOF, PredictorTest, EvaluatorHoldout, EvaluatorOOF, AggregatorEvaluations, AggregatorTimes, LoggerFitStatus]
>>> assert base_callback_class_sorter(in_3) == out_3
>>> in_4 = [LoggerFitStatus, EvaluatorOOF, PredictorTest, EvaluatorHoldout, AggregatorTimes, AggregatorEvaluations, PredictorHoldout, PredictorOOF]
>>> out_4 = [PredictorHoldout, PredictorOOF, PredictorTest, EvaluatorHoldout, EvaluatorOOF, AggregatorEvaluations, AggregatorTimes, LoggerFitStatus]
>>> assert base_callback_class_sorter(in_4) == out_4
>>> in_5 = [AggregatorEvaluations, PredictorTest, PredictorOOF, EvaluatorOOF, EvaluatorHoldout]
>>> out_5 = [PredictorOOF, PredictorTest, EvaluatorHoldout, EvaluatorOOF, AggregatorEvaluations]
>>> assert base_callback_class_sorter(in_5) == out_5
>>> in_6 = [EvaluatorOOF, PredictorOOF, EvaluatorHoldout, AggregatorEvaluations, PredictorTest]
>>> out_6 = [PredictorOOF, PredictorTest, EvaluatorHoldout, EvaluatorOOF, AggregatorEvaluations]
>>> assert base_callback_class_sorter(in_6) == out_6
>>> in_7 = [PredictorTest, EvaluatorHoldout, PredictorOOF]
>>> out_7 = [PredictorOOF, PredictorTest, EvaluatorHoldout]
>>> assert base_callback_class_sorter(in_7) == out_7
>>> in_8 = [PredictorTest, PredictorOOF, EvaluatorHoldout]
>>> out_8 = [PredictorOOF, PredictorTest, EvaluatorHoldout]
>>> assert base_callback_class_sorter(in_8) == out_8
>>> base_callback_class_sorter([type("Foo", (object,), {}), PredictorTest, EvaluatorHoldout, PredictorOOF])
Traceback (most recent call last):
    File "experiment_core.py", line ?, in base_callback_class_sorter
ValueError: Base class not descendant of acceptable parent class: [<class 'hyperparameter_hunter.experiment_core.Foo'>]
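
The ordering rule described above (position of the parent in parent_class_order first, class name second) can be sketched as follows. This is an illustration under assumptions, not the library's code: sort_by_parent is a hypothetical helper, the classes are empty stand-ins that only borrow names from the examples, and the sketch raises on the first offending class rather than reporting all of them.

```python
class BaseCallback: ...
class BasePredictor(BaseCallback): ...
class BaseEvaluator(BaseCallback): ...

# Empty stand-ins for real callbacks
class PredictorOOF(BasePredictor): ...
class PredictorTest(BasePredictor): ...
class EvaluatorHoldout(BaseEvaluator): ...
class CustomCallback(BaseCallback): ...


def sort_by_parent(auxiliary_bases, parent_class_order):
    """Sort classes by the index of their first matching parent, then by name."""

    def sort_key(cls):
        for position, parent in enumerate(parent_class_order):
            if issubclass(cls, parent):
                return (position, cls.__name__)
        raise ValueError(f"Base class not descendant of acceptable parent class: {cls}")

    return sorted(auxiliary_bases, key=sort_key)


# BaseCallback goes last so that it only catches classes no other parent claims
order = [BasePredictor, BaseEvaluator, BaseCallback]
result = sort_by_parent(
    [EvaluatorHoldout, PredictorTest, CustomCallback, PredictorOOF], order
)
```

Because every stand-in ultimately descends from BaseCallback, placing it last is what sends CustomCallback to the end of the result instead of letting it match first.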

hyperparameter_hunter.experiments module

This module contains the classes used for constructing and conducting an Experiment (most notably, CVExperiment). Any class contained herein whose name starts with “Base” should not be used directly. CVExperiment is the preferred means of conducting one-off experimentation

Related

hyperparameter_hunter.experiment_core

Defines ExperimentMeta, an understanding of which is critical to being able to understand experiments

hyperparameter_hunter.metrics

Defines ScoringMixIn, a parent of experiments.BaseExperiment that enables scoring and evaluating models

hyperparameter_hunter.models

Used to instantiate the actual learning models, which are a single part of the entire experimentation workflow, albeit the most significant part

Notes

As mentioned above, the inner workings of experiments will be very confusing without a grasp on what’s going on in experiment_core, and its related modules

class hyperparameter_hunter.experiments.BaseExperiment(model_initializer, model_init_params=None, model_extra_params=None, feature_engineer=None, feature_selector=None, notes=None, do_raise_repeated=False, auto_start=True, target_metric=None)

Bases: hyperparameter_hunter.metrics.ScoringMixIn

One-off Experimentation base class

Bare-bones Description: Runs the cross-validation scheme defined by Environment, during which 1) Datasets are processed according to feature_engineer; 2) Models are built by instantiating model_initializer with model_init_params; 3) Models are trained on processed data, optionally using parameters from model_extra_params; 4) Results are logged and recorded for each fitting period; 5) Descriptions, predictions, results (both averages and individual periods), etc. are saved.

What’s the Big Deal? The most important takeaway from the above description is that descriptions/results are THOROUGH and REUSABLE. By thorough, I mean that all of a model’s hyperparameters are saved, not just the ones given in model_init_params. This may sound odd, but it’s important because it makes results reusable during optimization, when you may be using a different set of hyperparameters. It helps with other things like preventing duplicate experiments and ensembling, as well. But the big part is that this transforms hyperparameter optimization from an isolated, throwaway process we can only afford when an ML project is sufficiently “mature” to a process that covers the entire lifespan of a project. No Experiment is forgotten or wasted. Optimization is automatically given the data it needs to succeed by drawing on all your past Experiments and optimization rounds.

The Experiment has three primary missions: 1. Act as scaffold for organizing ML Experimentation and optimization 2. Record Experiment descriptions and results 3. Eliminate lots of repetitive/error-prone boilerplate code

Providing a scaffold for the entire ML process is critical because without a standardized format, everything we do looks different. Without a unified scaffold, development is slower, more confusing, and less adaptable. One of the benefits of standardizing the format of ML Experimentation is that it enables us to exhaustively record all the important characteristics of Experiment, as well as an assortment of customizable result files – all in a way that allows them to be reused in the future.

What About Data/Metrics? Experiments require an active Environment in order to function, from which the Experiment collects important cross-experiment parameters, such as datasets, metrics, cross-validation schemes, and even callbacks to inherit, among many other properties documented in Environment

Parameters
model_initializer: Class, or functools.partial, or class instance

Algorithm class used to initialize a model, such as XGBoost’s XGBRegressor, or SKLearn’s KNeighborsClassifier; although, there are hundreds of possibilities across many different ML libraries. model_initializer is expected to define at least fit and predict methods. model_initializer will be initialized with model_init_params, and its “extra” methods (fit, predict, etc.) will be invoked with parameters in model_extra_params

model_init_params: Dict, or object (optional)

Dictionary of arguments given to create an instance of model_initializer. Any kwargs that are considered valid by the __init__ method of model_initializer are valid in model_init_params.

One of the key features that makes HyperparameterHunter so magical is that ALL hyperparameters in the signature of model_initializer (and their default values) are discovered – whether or not they are explicitly given in model_init_params. Not only does this make Experiment result descriptions incredibly thorough, it also makes optimization smoother, more effective, and far less work for the user. For example, take LightGBM’s LGBMRegressor, with model_init_params=dict(learning_rate=0.2). HyperparameterHunter recognizes that this differs from the default of 0.1. It also recognizes that LGBMRegressor is actually initialized with more than a dozen other hyperparameters we didn’t bother mentioning, and it records their values, too. So if we want to optimize num_leaves tomorrow, the OptPro doesn’t start from scratch. It knows that we ran an Experiment that didn’t explicitly mention num_leaves, but its default value was 31, and it uses this information to fuel optimization – all without us having to manually keep track of tons of janky collections of hyperparameters. In fact, we really don’t need to go out of our way at all. HyperparameterHunter just acts as our faithful lab assistant, keeping track of all the stuff we’d rather not worry about
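
One way such discovery could work is by reading defaults from the initializer's signature and overlaying the explicitly given values. The sketch below assumes exactly that approach; ToyRegressor, signature_defaults, and full_hyperparameters are hypothetical names, and the defaults shown are illustrative rather than taken from LightGBM.

```python
import inspect


class ToyRegressor:
    """Hypothetical stand-in for an algorithm class with keyword defaults."""

    def __init__(self, learning_rate=0.1, num_leaves=31, n_estimators=100):
        self.learning_rate = learning_rate
        self.num_leaves = num_leaves
        self.n_estimators = n_estimators


def signature_defaults(initializer):
    """Collect every parameter of `initializer` that declares a default value."""
    parameters = inspect.signature(initializer).parameters
    return {
        name: parameter.default
        for name, parameter in parameters.items()
        if parameter.default is not inspect.Parameter.empty
    }


def full_hyperparameters(initializer, init_params=None):
    """Overlay explicitly given values on top of the signature defaults."""
    return {**signature_defaults(initializer), **(init_params or {})}


recorded = full_hyperparameters(ToyRegressor, dict(learning_rate=0.2))
```

Here recorded contains num_leaves and n_estimators at their defaults even though only learning_rate was passed, which is the property that makes a saved result reusable when a different hyperparameter is searched later.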

model_extra_params: Dict (optional)

Dictionary of extra parameters for models’ non-initialization methods (like fit, predict, predict_proba, etc.), and for neural networks. To specify parameters for an extra method, place them in a nested dict under a key named for the method that should receive them. For example, to call fit with early_stopping_rounds=5, use model_extra_params=dict(fit=dict(early_stopping_rounds=5)).

For models whose fit methods have a kwarg like eval_set (such as XGBoost’s), one can use the DatasetSentinel attributes of the current active Environment, documented under its “Attributes” section and under train_input. An example using several DatasetSentinels can be found in HyperparameterHunter’s [XGBoost Classification Example](https://github.com/HunterMcGushion/hyperparameter_hunter/blob/master/examples/xgboost_examples/classification.py)
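
The nesting of model_extra_params by method name can be pictured with a small dispatch sketch. ToyModel and call_with_extras are hypothetical and exist only for illustration; how the library actually forwards these parameters is not shown here.

```python
class ToyModel:
    """Hypothetical model that records how its methods were called."""

    def __init__(self):
        self.calls = []

    def fit(self, X, y, early_stopping_rounds=None):
        self.calls.append(("fit", early_stopping_rounds))
        return self

    def predict(self, X):
        self.calls.append(("predict", None))
        return [0 for _ in X]


def call_with_extras(model, method_name, *args, model_extra_params=None):
    """Invoke `method_name`, adding any kwargs stored under that method's key."""
    extras = (model_extra_params or {}).get(method_name, {})
    return getattr(model, method_name)(*args, **extras)


model_extra_params = dict(fit=dict(early_stopping_rounds=5))

model = ToyModel()
call_with_extras(model, "fit", [[0], [1]], [0, 1], model_extra_params=model_extra_params)
predictions = call_with_extras(model, "predict", [[0], [1]], model_extra_params=model_extra_params)
```

Only fit receives early_stopping_rounds; predict has no entry in the dict, so it is called with no extra keyword arguments.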

feature_engineer: `FeatureEngineer`, or list (optional)

Feature engineering/transformation/pre-processing steps to apply to datasets defined in Environment. If list, will be used to initialize FeatureEngineer, and can contain any of the following values:

  1. EngineerStep instance

  2. Function input to EngineerStep

For important information on properly formatting EngineerStep functions, please see the documentation of EngineerStep. OptPros can perform hyperparameter optimization of feature_engineer steps. This capability adds a third allowed value to the above list and is documented in forge_experiment()

feature_selector: List of str, callable, or list of booleans (optional)

Column names to include as input data for all provided DataFrames. If None, feature_selector is set to all columns in train_dataset, less target_column, and id_column. feature_selector is provided as the second argument for calls to pandas.DataFrame.loc when constructing datasets

notes: String (optional)

Additional information about the Experiment that will be saved with the Experiment’s description result file. This serves no purpose other than to facilitate saving Experiment details in a more readable format

do_raise_repeated: Boolean, default=False

If True and this Experiment locates a previous Experiment’s results with matching Environment and Hyperparameter Keys, a RepeatedExperimentError will be raised. Else, a warning will be logged

auto_start: Boolean, default=True

If True, after the Experiment is initialized, it will automatically call BaseExperiment.preparation_workflow(), followed by BaseExperiment.experiment_workflow(), effectively completing all essential tasks without requiring additional method calls

target_metric: Tuple, str, default=(‘oof’, <environment.Environment.metrics[0]>)

Path denoting the metric to be used to compare completed Experiments or to use for certain early stopping procedures in some model classes. The first value should be one of [‘oof’, ‘holdout’, ‘in_fold’]. The second value should be the name of a metric being recorded according to the values supplied in hyperparameter_hunter.environment.Environment.metrics_params. See the documentation for hyperparameter_hunter.metrics.get_formatted_target_metric() for more info. Any values returned by, or used as the target_metric input to this function are acceptable values for target_metric
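
The accepted forms of target_metric can be illustrated with a hypothetical normalizer. This is a sketch and not get_formatted_target_metric() itself: normalize_target_metric and its arguments are invented names, and treating a bare string as a metric name evaluated on ‘oof’ data is an assumption made for the example.

```python
ALLOWED_DATASETS = ("oof", "holdout", "in_fold")


def normalize_target_metric(target_metric, metric_names, default_dataset="oof"):
    """Hypothetical helper: return a (dataset, metric_name) tuple."""
    if target_metric is None:
        # Fall back to the first recorded metric on the default dataset
        target_metric = (default_dataset, metric_names[0])
    elif isinstance(target_metric, str):
        # Assumption: a bare string names a metric on the default dataset
        target_metric = (default_dataset, target_metric)

    dataset, metric_name = target_metric
    if dataset not in ALLOWED_DATASETS:
        raise ValueError(f"Unknown dataset type: {dataset!r}")
    if metric_name not in metric_names:
        raise ValueError(f"Metric is not being recorded: {metric_name!r}")
    return (dataset, metric_name)


metric_names = ["roc_auc_score", "f1_score"]
default_target = normalize_target_metric(None, metric_names)
string_target = normalize_target_metric("f1_score", metric_names)
tuple_target = normalize_target_metric(("holdout", "roc_auc_score"), metric_names)
```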

callbacks: `LambdaCallback`, or list of `LambdaCallback` (optional)

Callbacks injected directly into concrete Experiment (CVExperiment), adding new functionality, or customizing existing processes. Should be a LambdaCallback or a list of such classes. LambdaCallback can be created using callbacks.bases.lambda_callback(), which documents the options for creating callbacks. callbacks will be added to the MRO of the Experiment by experiment_core.ExperimentMeta at __call__ time, making callbacks new base classes of the Experiment. See callbacks.bases.lambda_callback() for more information. The presence of LambdaCallbacks will not affect Experiment keys. In other words, for the purposes of Experiment matching/recording, all other factors being equal, an Experiment with callbacks is considered identical to an Experiment without, despite whatever custom functionality was added by the LambdaCallbacks

See also

hyperparameter_hunter.optimization.protocol_core.BaseOptPro.forge_experiment()

OptPro method to define hyperparameter search scaffold for building Experiments during optimization. This method follows the same format as Experiment initialization, but it adds the ability to provide hyperparameter values as ranges to search over, via subclasses of Dimension. The other notable difference is that forge_experiment removes the auto_start and target_metric kwargs, which is described in the forge_experiment docstring Notes

Environment

Provides critical information on how Experiments should be conducted, as well as the data to be used by Experiments. An Environment must be active before executing any Experiment or OptPro

lambda_callback()

Enables customization of the Experimentation process and access to all Experiment internals through a collection of methods that are invoked at all the important periods over an Experiment’s lifespan. These can be provided via the experiment_callbacks kwarg of Environment, and the callback classes literally get thrown in to the parent classes of the Experiment, so they’re kind of a big deal

Methods

evaluate(self, data_type, target, prediction)

Apply metric(s) to the given data to calculate the value of the prediction

abstract execute(self)

Execute the fitting protocol for the Experiment, comprising the following: instantiation of learners for each run, preprocessing of data as appropriate, training learners, making predictions, and evaluating and aggregating those predictions and other stats/metrics for later use

experiment_workflow(self)

Define the actual experiment process, including execution, result saving, and cleanup

on_exp_start(self)

Prepare data prior to executing fitting protocol (cross-validation), by 1) Initializing formal datasets attributes, 2) Invoking feature_engineer to perform “pre_cv”-stage preprocessing, and 3) Updating datasets to include their (transformed) counterparts in feature_engineer

preparation_workflow(self)

Execute all tasks that must take place before the experiment is actually started. Such tasks include (but are not limited to): Creating experiment IDs and hyperparameter keys, creating script backups, and validating parameters

class hyperparameter_hunter.experiments.BaseCVExperiment(model_initializer, model_init_params=None, model_extra_params=None, feature_engineer=None, feature_selector=None, notes=None, do_raise_repeated=False, auto_start=True, target_metric=None)

Bases: hyperparameter_hunter.experiments.BaseExperiment

One-off Experimentation base class

Bare-bones Description: Runs the cross-validation scheme defined by Environment, during which 1) Datasets are processed according to feature_engineer; 2) Models are built by instantiating model_initializer with model_init_params; 3) Models are trained on processed data, optionally using parameters from model_extra_params; 4) Results are logged and recorded for each fitting period; 5) Descriptions, predictions, results (both averages and individual periods), etc. are saved.

What’s the Big Deal? The most important takeaway from the above description is that descriptions/results are THOROUGH and REUSABLE. By thorough, I mean that all of a model’s hyperparameters are saved, not just the ones given in model_init_params. This may sound odd, but it’s important because it makes results reusable during optimization, when you may be using a different set of hyperparameters. It helps with other things like preventing duplicate experiments and ensembling, as well. But the big part is that this transforms hyperparameter optimization from an isolated, throwaway process we can only afford when an ML project is sufficiently “mature” to a process that covers the entire lifespan of a project. No Experiment is forgotten or wasted. Optimization is automatically given the data it needs to succeed by drawing on all your past Experiments and optimization rounds.

The Experiment has three primary missions: 1. Act as scaffold for organizing ML Experimentation and optimization 2. Record Experiment descriptions and results 3. Eliminate lots of repetitive/error-prone boilerplate code

Providing a scaffold for the entire ML process is critical because without a standardized format, everything we do looks different. Without a unified scaffold, development is slower, more confusing, and less adaptable. One of the benefits of standardizing the format of ML Experimentation is that it enables us to exhaustively record all the important characteristics of Experiment, as well as an assortment of customizable result files – all in a way that allows them to be reused in the future.

What About Data/Metrics? Experiments require an active Environment in order to function, from which the Experiment collects important cross-experiment parameters, such as datasets, metrics, cross-validation schemes, and even callbacks to inherit, among many other properties documented in Environment

Parameters
model_initializer: Class, or functools.partial, or class instance

Algorithm class used to initialize a model, such as XGBoost’s XGBRegressor, or SKLearn’s KNeighborsClassifier; although, there are hundreds of possibilities across many different ML libraries. model_initializer is expected to define at least fit and predict methods. model_initializer will be initialized with model_init_params, and its “extra” methods (fit, predict, etc.) will be invoked with parameters in model_extra_params

model_init_params: Dict, or object (optional)

Dictionary of arguments given to create an instance of model_initializer. Any kwargs that are considered valid by the __init__ method of model_initializer are valid in model_init_params.

One of the key features that makes HyperparameterHunter so magical is that ALL hyperparameters in the signature of model_initializer (and their default values) are discovered – whether or not they are explicitly given in model_init_params. Not only does this make Experiment result descriptions incredibly thorough, it also makes optimization smoother, more effective, and far less work for the user. For example, take LightGBM’s LGBMRegressor, with model_init_params=dict(learning_rate=0.2). HyperparameterHunter recognizes that this differs from the default of 0.1. It also recognizes that LGBMRegressor is actually initialized with more than a dozen other hyperparameters we didn’t bother mentioning, and it records their values, too. So if we want to optimize num_leaves tomorrow, the OptPro doesn’t start from scratch. It knows that we ran an Experiment that didn’t explicitly mention num_leaves, but its default value was 31, and it uses this information to fuel optimization – all without us having to manually keep track of tons of janky collections of hyperparameters. In fact, we really don’t need to go out of our way at all. HyperparameterHunter just acts as our faithful lab assistant, keeping track of all the stuff we’d rather not worry about

model_extra_params: Dict (optional)

Dictionary of extra parameters for models’ non-initialization methods (like fit, predict, predict_proba, etc.), and for neural networks. To specify parameters for an extra method, place them in a nested dict under a key named for the method that should receive them. For example, to call fit with early_stopping_rounds=5, use model_extra_params=dict(fit=dict(early_stopping_rounds=5)).

For models whose fit methods have a kwarg like eval_set (such as XGBoost’s), one can use the DatasetSentinel attributes of the current active Environment, documented under its “Attributes” section and under train_input. An example using several DatasetSentinels can be found in HyperparameterHunter’s [XGBoost Classification Example](https://github.com/HunterMcGushion/hyperparameter_hunter/blob/master/examples/xgboost_examples/classification.py)

feature_engineer: `FeatureEngineer`, or list (optional)

Feature engineering/transformation/pre-processing steps to apply to datasets defined in Environment. If list, will be used to initialize FeatureEngineer, and can contain any of the following values:

  1. EngineerStep instance

  2. Function input to EngineerStep

For important information on properly formatting EngineerStep functions, please see the documentation of EngineerStep. OptPros can perform hyperparameter optimization of feature_engineer steps. This capability adds a third allowed value to the above list and is documented in forge_experiment()

feature_selector: List of str, callable, or list of booleans (optional)

Column names to include as input data for all provided DataFrames. If None, feature_selector is set to all columns in train_dataset, less target_column, and id_column. feature_selector is provided as the second argument for calls to pandas.DataFrame.loc when constructing datasets

notes: String (optional)

Additional information about the Experiment that will be saved with the Experiment’s description result file. This serves no purpose other than to facilitate saving Experiment details in a more readable format

do_raise_repeated: Boolean, default=False

If True and this Experiment locates a previous Experiment’s results with matching Environment and Hyperparameter Keys, a RepeatedExperimentError will be raised. Else, a warning will be logged

auto_start: Boolean, default=True

If True, after the Experiment is initialized, it will automatically call BaseExperiment.preparation_workflow(), followed by BaseExperiment.experiment_workflow(), effectively completing all essential tasks without requiring additional method calls

target_metric: Tuple, str, default=(‘oof’, <environment.Environment.metrics[0]>)

Path denoting the metric to be used to compare completed Experiments or to use for certain early stopping procedures in some model classes. The first value should be one of [‘oof’, ‘holdout’, ‘in_fold’]. The second value should be the name of a metric being recorded according to the values supplied in hyperparameter_hunter.environment.Environment.metrics_params. See the documentation for hyperparameter_hunter.metrics.get_formatted_target_metric() for more info. Any values returned by, or used as the target_metric input to this function are acceptable values for target_metric

callbacks: `LambdaCallback`, or list of `LambdaCallback` (optional)

Callbacks injected directly into concrete Experiment (CVExperiment), adding new functionality, or customizing existing processes. Should be a LambdaCallback or a list of such classes. LambdaCallback can be created using callbacks.bases.lambda_callback(), which documents the options for creating callbacks. callbacks will be added to the MRO of the Experiment by experiment_core.ExperimentMeta at __call__ time, making callbacks new base classes of the Experiment. See callbacks.bases.lambda_callback() for more information. The presence of LambdaCallbacks will not affect Experiment keys. In other words, for the purposes of Experiment matching/recording, all other factors being equal, an Experiment with callbacks is considered identical to an Experiment without, despite whatever custom functionality was added by the LambdaCallbacks

See also

hyperparameter_hunter.optimization.protocol_core.BaseOptPro.forge_experiment()

OptPro method to define hyperparameter search scaffold for building Experiments during optimization. This method follows the same format as Experiment initialization, but it adds the ability to provide hyperparameter values as ranges to search over, via subclasses of Dimension. The other notable difference is that forge_experiment removes the auto_start and target_metric kwargs, which is described in the forge_experiment docstring Notes

Environment

Provides critical information on how Experiments should be conducted, as well as the data to be used by Experiments. An Environment must be active before executing any Experiment or OptPro

lambda_callback()

Enables customization of the Experimentation process and access to all Experiment internals through a collection of methods that are invoked at all the important periods over an Experiment’s lifespan. These can be provided via the experiment_callbacks kwarg of Environment, and the callback classes literally get thrown in to the parent classes of the Experiment, so they’re kind of a big deal

Methods

cross_validation_workflow(self)

Execute workflow for cross-validation process, consisting of the following tasks: 1) Create train and validation split indices for all folds, 2) Iterate through folds, performing cv_fold_workflow for each, 3) Average accumulated predictions over fold splits, 4) Evaluate final predictions, 5) Format final predictions to prepare for saving

cv_fold_workflow(self)

Execute workflow for individual fold, consisting of the following tasks: 1) Execute overridden on_fold_start() tasks, 2) Perform cv_run_workflow for each run, 3) Execute overridden on_fold_end() tasks

cv_run_workflow(self)

Execute run workflow, consisting of: 1) Execute overridden on_run_start() tasks, 2) Initialize and fit Model, 3) Execute overridden on_run_end() tasks

evaluate(self, data_type, target, prediction)

Apply metric(s) to the given data to calculate the value of the prediction

execute(self)

Execute the fitting protocol for the Experiment, comprising the following: instantiation of learners for each run, preprocessing of data as appropriate, training learners, making predictions, and evaluating and aggregating those predictions and other stats/metrics for later use

experiment_workflow(self)

Define the actual experiment process, including execution, result saving, and cleanup

on_exp_start(self)

Prepare data prior to executing fitting protocol (cross-validation), by 1) Initializing formal datasets attributes, 2) Invoking feature_engineer to perform “pre_cv”-stage preprocessing, and 3) Updating datasets to include their (transformed) counterparts in feature_engineer

on_fold_start(self)

Override on_fold_start() tasks set by experiment_core.ExperimentMeta, consisting of: 1) Split train/validation data, 2) Make copies of holdout/test data for current fold (for feature engineering), 3) Log start, 4) Execute original tasks

on_run_start(self)

Override on_run_start() tasks organized by experiment_core.ExperimentMeta, consisting of: 1) Set random seed and update model parameters according to current seed, 2) Log run start, 3) Execute original tasks

preparation_workflow(self)

Execute all tasks that must take place before the experiment is actually started.
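
The nesting of the cross-validation, fold, and run workflows listed above can be sketched as three loops with hook calls around each level. SketchCVExperiment is a hypothetical class that only records the order of events; it omits data splitting, model fitting, prediction averaging, and everything else the real workflows do.

```python
class SketchCVExperiment:
    """Hypothetical skeleton that records the order of workflow events."""

    def __init__(self, n_folds, n_runs):
        self.n_folds, self.n_runs = n_folds, n_runs
        self.fold, self.run = None, None
        self.events = []

    # Hooks that callbacks would normally extend
    def on_fold_start(self):
        self.events.append(("fold_start", self.fold))

    def on_fold_end(self):
        self.events.append(("fold_end", self.fold))

    def on_run_start(self):
        self.events.append(("run_start", self.fold, self.run))

    def on_run_end(self):
        self.events.append(("run_end", self.fold, self.run))

    def cv_run_workflow(self):
        self.on_run_start()
        self.events.append(("fit", self.fold, self.run))  # placeholder for fitting
        self.on_run_end()

    def cv_fold_workflow(self):
        self.on_fold_start()
        for self.run in range(self.n_runs):
            self.cv_run_workflow()
        self.on_fold_end()

    def cross_validation_workflow(self):
        for self.fold in range(self.n_folds):
            self.cv_fold_workflow()


experiment = SketchCVExperiment(n_folds=2, n_runs=2)
experiment.cross_validation_workflow()
```

Each fold wraps all of its runs between one on_fold_start and one on_fold_end, so multiple runs (for example with different random seeds) share the same train/validation split.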

class hyperparameter_hunter.experiments.CVExperiment(model_initializer, model_init_params=None, model_extra_params=None, feature_engineer=None, feature_selector=None, notes=None, do_raise_repeated=False, auto_start=True, target_metric=None, callbacks=None)

Bases: hyperparameter_hunter.experiments.BaseCVExperiment

One-off Experimentation base class

Bare-bones Description: Runs the cross-validation scheme defined by Environment, during which 1) Datasets are processed according to feature_engineer; 2) Models are built by instantiating model_initializer with model_init_params; 3) Models are trained on processed data, optionally using parameters from model_extra_params; 4) Results are logged and recorded for each fitting period; 5) Descriptions, predictions, results (both averages and individual periods), etc. are saved.

What’s the Big Deal? The most important takeaway from the above description is that descriptions/results are THOROUGH and REUSABLE. By thorough, I mean that all of a model’s hyperparameters are saved, not just the ones given in model_init_params. This may sound odd, but it’s important because it makes results reusable during optimization, when you may be using a different set of hyperparameters. It helps with other things like preventing duplicate experiments and ensembling, as well. But the big part is that this transforms hyperparameter optimization from an isolated, throwaway process we can only afford when an ML project is sufficiently “mature” to a process that covers the entire lifespan of a project. No Experiment is forgotten or wasted. Optimization is automatically given the data it needs to succeed by drawing on all your past Experiments and optimization rounds.

The Experiment has three primary missions:

  1. Act as scaffold for organizing ML Experimentation and optimization

  2. Record Experiment descriptions and results

  3. Eliminate lots of repetitive/error-prone boilerplate code

Providing a scaffold for the entire ML process is critical because without a standardized format, everything we do looks different. Without a unified scaffold, development is slower, more confusing, and less adaptable. One of the benefits of standardizing the format of ML Experimentation is that it enables us to exhaustively record all the important characteristics of an Experiment, as well as an assortment of customizable result files – all in a way that allows them to be reused in the future.

What About Data/Metrics? Experiments require an active Environment in order to function, from which the Experiment collects important cross-experiment parameters, such as datasets, metrics, cross-validation schemes, and even callbacks to inherit, among many other properties documented in Environment

Parameters
model_initializer: Class, or functools.partial, or class instance

Algorithm class used to initialize a model, such as XGBoost’s XGBRegressor, or SKLearn’s KNeighborsClassifier; although, there are hundreds of possibilities across many different ML libraries. model_initializer is expected to define at least fit and predict methods. model_initializer will be initialized with model_init_params, and its “extra” methods (fit, predict, etc.) will be invoked with parameters in model_extra_params

model_init_params: Dict, or object (optional)

Dictionary of arguments given to create an instance of model_initializer. Any kwargs that are considered valid by the __init__ method of model_initializer are valid in model_init_params.

One of the key features that makes HyperparameterHunter so magical is that ALL hyperparameters in the signature of model_initializer (and their default values) are discovered – whether or not they are explicitly given in model_init_params. Not only does this make Experiment result descriptions incredibly thorough, it also makes optimization smoother, more effective, and far less work for the user. For example, take LightGBM’s LGBMRegressor, with model_init_params=dict(learning_rate=0.2). HyperparameterHunter recognizes that this differs from the default of 0.1. It also recognizes that LGBMRegressor is actually initialized with more than a dozen other hyperparameters we didn’t bother mentioning, and it records their values, too. So if we want to optimize num_leaves tomorrow, the OptPro doesn’t start from scratch. It knows that we ran an Experiment that didn’t explicitly mention num_leaves, but its default value was 31, and it uses this information to fuel optimization – all without us having to manually keep track of tons of janky collections of hyperparameters. In fact, we really don’t need to go out of our way at all. HyperparameterHunter just acts as our faithful lab assistant, keeping track of all the stuff we’d rather not worry about
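The idea of discovering unmentioned defaults can be sketched with the standard library's `inspect.signature`. The class `ToyRegressor` and the helper `discover_hyperparameters` are hypothetical names used only for illustration. This is not the code HyperparameterHunter uses, just one plausible way to merge explicit parameters over signature defaults.

```python
import inspect


class ToyRegressor:
    """Hypothetical model class used only for this illustration."""

    def __init__(self, learning_rate=0.1, num_leaves=31, n_estimators=100):
        self.learning_rate = learning_rate
        self.num_leaves = num_leaves
        self.n_estimators = n_estimators


def discover_hyperparameters(model_initializer, model_init_params=None):
    """Merge explicit params over the defaults found in the `__init__` signature."""
    signature = inspect.signature(model_initializer.__init__)
    defaults = {
        name: param.default
        for name, param in signature.parameters.items()
        if name != "self" and param.default is not inspect.Parameter.empty
    }
    return {**defaults, **(model_init_params or {})}


all_params = discover_hyperparameters(ToyRegressor, dict(learning_rate=0.2))
print(all_params)
```

Here `num_leaves` ends up recorded with its default even though only `learning_rate` was given explicitly.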

model_extra_params: Dict (optional)

Dictionary of extra parameters for models’ non-initialization methods (like fit, predict, predict_proba, etc.), and for neural networks. To specify parameters for an extra method, place them in a dict named for the extra method to which the parameters should be given. For example, to call fit with early_stopping_rounds=5, use model_extra_params=dict(fit=dict(early_stopping_rounds=5)).

For models whose fit methods have a kwarg like eval_set (such as XGBoost’s), one can use the DatasetSentinel attributes of the current active Environment, documented under its “Attributes” section and under train_input. An example using several DatasetSentinels can be found in HyperparameterHunter’s [XGBoost Classification Example](https://github.com/HunterMcGushion/hyperparameter_hunter/blob/master/examples/xgboost_examples/classification.py)
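As a rough illustration of how a nested model_extra_params dict maps onto method calls, the sketch below forwards the contents of the "fit" key to a toy model's `fit`. `ToyModel` is hypothetical and stands in for any model class. How HyperparameterHunter dispatches these parameters internally is not shown here.

```python
class ToyModel:
    """Hypothetical model that records the keyword arguments `fit` receives."""

    def __init__(self, **init_params):
        self.init_params = init_params
        self.fit_kwargs = None

    def fit(self, X, y, **kwargs):
        self.fit_kwargs = kwargs
        return self


model_init_params = dict(max_depth=3)
model_extra_params = dict(fit=dict(early_stopping_rounds=5))

model = ToyModel(**model_init_params)
# Parameters stored under the "fit" key are forwarded to the `fit` call
model.fit([[0], [1]], [0, 1], **model_extra_params.get("fit", {}))
print(model.fit_kwargs)
```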

feature_engineer: `FeatureEngineer`, or list (optional)

Feature engineering/transformation/pre-processing steps to apply to datasets defined in Environment. If list, will be used to initialize FeatureEngineer, and can contain any of the following values:

  1. EngineerStep instance

  2. Function input to EngineerStep

For important information on properly formatting EngineerStep functions, please see the documentation of EngineerStep. OptPros can perform hyperparameter optimization of feature_engineer steps. This capability adds a third allowed value to the above list and is documented in forge_experiment()

feature_selector: List of str, callable, or list of booleans (optional)

Column names to include as input data for all provided DataFrames. If None, feature_selector is set to all columns in train_dataset, less target_column and id_column. feature_selector is provided as the second argument for calls to pandas.DataFrame.loc when constructing datasets

notes: String (optional)

Additional information about the Experiment that will be saved with the Experiment’s description result file. This serves no purpose other than to facilitate saving Experiment details in a more readable format

do_raise_repeated: Boolean, default=False

If True and this Experiment locates a previous Experiment’s results with matching Environment and Hyperparameter Keys, a RepeatedExperimentError will be raised. Else, a warning will be logged

auto_start: Boolean, default=True

If True, after the Experiment is initialized, it will automatically call BaseExperiment.preparation_workflow(), followed by BaseExperiment.experiment_workflow(), effectively completing all essential tasks without requiring additional method calls

target_metric: Tuple, str, default=(‘oof’, <environment.Environment.metrics[0]>)

Path denoting the metric to be used to compare completed Experiments or to use for certain early stopping procedures in some model classes. The first value should be one of [‘oof’, ‘holdout’, ‘in_fold’]. The second value should be the name of a metric being recorded according to the values supplied in hyperparameter_hunter.environment.Environment.metrics_params. See the documentation for hyperparameter_hunter.metrics.get_formatted_target_metric() for more info. Any values returned by, or used as the target_metric input to this function are acceptable values for target_metric
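The sketch below shows one way a target_metric value could be expanded into a full (dataset, metric) path. The helper `format_target_metric` is hypothetical and its handling of partial inputs is an assumption. The authoritative behavior is that of hyperparameter_hunter.metrics.get_formatted_target_metric(), which may differ.

```python
VALID_DATASETS = ("oof", "holdout", "in_fold")


def format_target_metric(target_metric, metric_names, default_dataset="oof"):
    """Hypothetical helper: expand `target_metric` into a (dataset, metric) tuple."""
    if target_metric is None:
        return (default_dataset, metric_names[0])
    if isinstance(target_metric, str):
        target_metric = (target_metric,)
    target_metric = tuple(target_metric)
    if len(target_metric) == 1:
        # Assumption: a lone dataset name gets the first metric,
        # and a lone metric name gets the default dataset
        if target_metric[0] in VALID_DATASETS:
            return (target_metric[0], metric_names[0])
        return (default_dataset, target_metric[0])
    dataset, metric = target_metric
    if dataset not in VALID_DATASETS:
        raise ValueError(f"First value must be one of {VALID_DATASETS}, not {dataset!r}")
    if metric not in metric_names:
        raise ValueError(f"Unknown metric {metric!r}; expected one of {metric_names}")
    return (dataset, metric)


metric_names = ["roc_auc_score", "f1_score"]
print(format_target_metric(None, metric_names))
print(format_target_metric("f1_score", metric_names))
print(format_target_metric(("holdout", "f1_score"), metric_names))
```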

callbacks: `LambdaCallback`, or list of `LambdaCallback` (optional)

Callbacks injected directly into concrete Experiment (CVExperiment), adding new functionality, or customizing existing processes. Should be a LambdaCallback or a list of such classes. LambdaCallback can be created using callbacks.bases.lambda_callback(), which documents the options for creating callbacks. callbacks will be added to the MRO of the Experiment by experiment_core.ExperimentMeta at __call__ time, making callbacks new base classes of the Experiment. See callbacks.bases.lambda_callback() for more information. The presence of LambdaCallbacks will not affect Experiment keys. In other words, for the purposes of Experiment matching/recording, all other factors being equal, an Experiment with callbacks is considered identical to an Experiment without, despite whatever custom functionality was added by the LambdaCallbacks
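To make the "added to the MRO at __call__ time" idea concrete, here is a minimal sketch of a metaclass that builds a new class with the given callbacks as extra base classes before instantiating it. All names (`CallbackMeta`, `BaseCallback`, `ToyExperiment`, `LoggingCallback`) are hypothetical, and the real experiment_core.ExperimentMeta does considerably more than this.

```python
class CallbackMeta(type):
    """Hypothetical metaclass that adds `callbacks` as extra base classes at call time."""

    def __call__(cls, *args, callbacks=(), **kwargs):
        if callbacks:
            # Build a one-off subclass whose bases are the original class plus callbacks
            cls = type(cls)(cls.__name__, (cls, *callbacks), {})
        return super(CallbackMeta, cls).__call__(*args, **kwargs)


class BaseCallback:
    """Terminates the chain of `super()` calls with a no-op."""

    def on_run_start(self):
        pass


class ToyExperiment(BaseCallback, metaclass=CallbackMeta):
    def __init__(self):
        self.events = []

    def on_run_start(self):
        self.events.append("experiment: run start")
        super().on_run_start()  # Continue along the MRO into any callbacks


class LoggingCallback(BaseCallback):
    def on_run_start(self):
        self.events.append("callback: run start")
        super().on_run_start()


experiment = ToyExperiment(callbacks=[LoggingCallback])
experiment.on_run_start()
print(experiment.events)
```

Because the callback becomes a genuine base class, its hook runs as part of the normal `super()` chain rather than through a separate registry.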

See also

hyperparameter_hunter.optimization.protocol_core.BaseOptPro.forge_experiment()

OptPro method to define hyperparameter search scaffold for building Experiments during optimization. This method follows the same format as Experiment initialization, but it adds the ability to provide hyperparameter values as ranges to search over, via subclasses of Dimension. The other notable difference is that forge_experiment removes the auto_start and target_metric kwargs, a change described in the forge_experiment docstring Notes

Environment

Provides critical information on how Experiments should be conducted, as well as the data to be used by Experiments. An Environment must be active before executing any Experiment or OptPro

lambda_callback()

Enables customization of the Experimentation process and access to all Experiment internals through a collection of methods that are invoked at all the important periods over an Experiment’s lifespan. These can be provided via the experiment_callbacks kwarg of Environment, and the callback classes literally get thrown into the parent classes of the Experiment, so they’re kind of a big deal

Attributes
source_script

Methods

cross_validation_workflow(self)

Execute workflow for cross-validation process, consisting of the following tasks: 1) Create train and validation split indices for all folds, 2) Iterate through folds, performing cv_fold_workflow for each, 3) Average accumulated predictions over fold splits, 4) Evaluate final predictions, 5) Format final predictions to prepare for saving

cv_fold_workflow(self)

Execute workflow for individual fold, consisting of the following tasks: 1) Execute overridden on_fold_start() tasks, 2) Perform cv_run_workflow for each run, 3) Execute overridden on_fold_end() tasks

cv_run_workflow(self)

Execute run workflow, consisting of: 1) Execute overridden on_run_start() tasks, 2) Initialize and fit Model, 3) Execute overridden on_run_end() tasks

evaluate(self, data_type, target, prediction)

Apply metric(s) to the given data to calculate the value of the prediction

execute(self)

Execute the fitting protocol for the Experiment, comprising the following: instantiation of learners for each run, preprocessing of data as appropriate, training learners, making predictions, and evaluating and aggregating those predictions and other stats/metrics for later use

experiment_workflow(self)

Define the actual experiment process, including execution, result saving, and cleanup

on_exp_start(self)

Prepare data prior to executing fitting protocol (cross-validation), by 1) Initializing formal datasets attributes, 2) Invoking feature_engineer to perform “pre_cv”-stage preprocessing, and 3) Updating datasets to include their (transformed) counterparts in feature_engineer

on_fold_start(self)

Override on_fold_start() tasks set by experiment_core.ExperimentMeta, consisting of: 1) Split train/validation data, 2) Make copies of holdout/test data for current fold (for feature engineering), 3) Log start, 4) Execute original tasks

on_run_start(self)

Override on_run_start() tasks organized by experiment_core.ExperimentMeta, consisting of: 1) Set random seed and update model parameters according to current seed, 2) Log run start, 3) Execute original tasks

preparation_workflow(self)

Execute all tasks that must take place before the experiment is actually started.

source_script = None

hyperparameter_hunter.experiments.get_cv_indices(folds, cv_params, input_data, target_data)

Produce iterables of cross validation indices in the shape of (n_repeats, n_folds)

Parameters
folds: Instance of `cv_type`

Cross validation folds object, whose split() receives input_data and target_data

cv_params: Dict

Parameters given to instantiate folds. Must contain n_splits. May contain n_repeats

input_data: pandas.DataFrame

Input data to be split by folds, to which yielded indices will correspond

target_data: pandas.DataFrame

Target data to be split by folds, to which yielded indices will correspond

Yields
Generator

Cross validation indices in shape of (<n_repeats or 1>, <n_splits>)
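The (n_repeats, n_splits) grouping can be sketched by chunking a flat stream of splits into one group per repetition. `ToyKFold` and `grouped_cv_indices` are hypothetical names. The toy splitter only assumes that `split()` yields every fold of every repetition in sequence, and the sketch returns lists where the documented function yields generators.

```python
import itertools


class ToyKFold:
    """Hypothetical splitter whose `split` yields all folds of all repeats in one stream."""

    def __init__(self, n_splits, n_repeats=1):
        self.n_splits = n_splits
        self.n_repeats = n_repeats

    def split(self, input_data, target_data=None):
        n = len(input_data)
        for _ in range(self.n_repeats):
            for fold in range(self.n_splits):
                validation = [i for i in range(n) if i % self.n_splits == fold]
                train = [i for i in range(n) if i % self.n_splits != fold]
                yield train, validation


def grouped_cv_indices(folds, cv_params, input_data, target_data):
    """Yield one list of (train, validation) index pairs per repetition."""
    indices = folds.split(input_data, target_data)
    for _ in range(cv_params.get("n_repeats", 1)):
        yield list(itertools.islice(indices, cv_params["n_splits"]))


cv_params = dict(n_splits=3, n_repeats=2)
data = list(range(6))
repetitions = list(grouped_cv_indices(ToyKFold(**cv_params), cv_params, data, data))
print(len(repetitions), [len(rep) for rep in repetitions])
```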

hyperparameter_hunter.feature_engineering module

This module organizes and executes feature engineering/preprocessing step functions. The central components of the module are FeatureEngineer and EngineerStep - everything else is built to support those two classes. This module works with a very broad definition of “feature engineering”. The following is a non-exhaustive list of transformations that are considered valid for FeatureEngineer step functions:

  • Manual feature creation

  • Input data scaling/normalization/standardization

  • Target data transformation

  • Re-sampling

  • Data imputation

  • Feature selection/elimination

  • Encoding (one-hot, label, etc.)

  • Binarization/binning/discretization

  • Feature extraction (as for NLP/image recognition tasks)

  • Feature shuffling

Related

hyperparameter_hunter.space

Only related when optimizing FeatureEngineer steps within an Optimization Protocol, but defines Categorical, which is the mechanism for defining a feature engineer step search space, and RejectedOptional, which is used to represent the absence of a feature engineer step, when labeled as optional

class hyperparameter_hunter.feature_engineering.EMPTY_SENTINEL

Bases: object

class hyperparameter_hunter.feature_engineering.DatasetNameReport(params: Tuple[str], stage: str)

Bases: object

Characterize the relationships between the dataset names params

Parameters
params: Tuple[str]

Dataset names requested by a feature engineering step callable. Must be a subset of {“train_data”, “train_inputs”, “train_targets”, “validation_data”, “validation_inputs”, “validation_targets”, “holdout_data”, “holdout_inputs”, “holdout_targets”, “test_inputs”, “all_data”, “all_inputs”, “all_targets”, “non_train_data”, “non_train_inputs”, “non_train_targets”}

stage: String in {“pre_cv”, “intra_cv”}

Feature engineering stage during which the datasets params are requested

Attributes
merged_datasets: List[tuple]

Tuples of strings denoting paths to datasets that represent a merge between multiple datasets. Merged datasets are those prefixed with either “all” or “non_train”. These paths are locations in descendants

coupled_datasets: List[tuple]

Tuples of strings denoting paths to datasets that represent a coupling of “inputs” and “targets” datasets. Coupled datasets are those suffixed with “data”. These paths are locations in descendants, and the values at each path should be a dict containing keys with “inputs” and “targets” suffixes

leaves: Dict[tuple, str]

Mapping of full path tuples in descendants to their leaf values. Tuple paths represent the steps necessary to reach the standard dataset leaf value in descendants by traversing merged and coupled datasets. Values in leaves should be identical to the last element of the corresponding tuple key

descendants: DescendantsType

Nested dict in which all keys are dataset name strings, and all leaf values are None. Represents the structure of the requested dataset names, traversing over merged and coupled datasets (if necessary) in order to reach the standard dataset leaves

hyperparameter_hunter.feature_engineering.names_for_merge(merge_to:str, stage:str) → List[str]

Retrieve the names of the standard datasets that are allowed to be included in a merged DataFrame of type merge_to at stage stage

Parameters
merge_to: String

Type of merged dataframe to produce. Should be one of the following: {“all_data”, “all_inputs”, “all_targets”, “non_train_data”, “non_train_inputs”, “non_train_targets”}

stage: String in {“pre_cv”, “intra_cv”}

Feature engineering stage for which the merged dataframe is requested. The results produced with each option differ only in that a merged_df created with stage=“pre_cv” will never contain “validation” data because it doesn’t exist before cross-validation has begun. Conversely, a merged_df created with stage=“intra_cv” will contain the appropriate “validation” data if it exists

Returns
names: List

Subset of {“train_data”, “train_inputs”, “train_targets”, “validation_data”, “validation_inputs”, “validation_targets”, “holdout_data”, “holdout_inputs”, “holdout_targets”, “test_inputs”}

Examples

>>> names_for_merge("all_data", "intra_cv")
['train_data', 'validation_data', 'holdout_data']
>>> names_for_merge("all_inputs", "intra_cv")
['train_inputs', 'validation_inputs', 'holdout_inputs', 'test_inputs']
>>> names_for_merge("all_targets", "intra_cv")
['train_targets', 'validation_targets', 'holdout_targets']
>>> names_for_merge("all_data", "pre_cv")
['train_data', 'holdout_data']
>>> names_for_merge("all_inputs", "pre_cv")
['train_inputs', 'holdout_inputs', 'test_inputs']
>>> names_for_merge("all_targets", "pre_cv")
['train_targets', 'holdout_targets']
>>> names_for_merge("non_train_data", "intra_cv")
['validation_data', 'holdout_data']
>>> names_for_merge("non_train_inputs", "intra_cv")
['validation_inputs', 'holdout_inputs', 'test_inputs']
>>> names_for_merge("non_train_targets", "intra_cv")
['validation_targets', 'holdout_targets']
>>> names_for_merge("non_train_data", "pre_cv")
['holdout_data']
>>> names_for_merge("non_train_inputs", "pre_cv")
['holdout_inputs', 'test_inputs']
>>> names_for_merge("non_train_targets", "pre_cv")
['holdout_targets']
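
The pattern in the examples above can be reproduced with a few filtering rules. The function below, `names_for_merge_sketch`, is a hypothetical re-implementation inferred only from those documented outputs, not a copy of the library's code.

```python
def names_for_merge_sketch(merge_to, stage):
    """Hypothetical re-implementation inferred from the documented examples."""
    prefix, _, suffix = merge_to.rpartition("_")  # e.g. ("non_train", "_", "inputs")
    if prefix not in ("all", "non_train") or suffix not in ("data", "inputs", "targets"):
        raise ValueError(f"Invalid merge_to: {merge_to!r}")
    if stage not in ("pre_cv", "intra_cv"):
        raise ValueError(f"Invalid stage: {stage!r}")

    datasets = ["train", "validation", "holdout", "test"]
    if prefix == "non_train":
        datasets.remove("train")
    if stage == "pre_cv":
        datasets.remove("validation")  # Validation data does not exist before CV
    if suffix != "inputs":
        datasets.remove("test")  # Test data has inputs only

    return [f"{name}_{suffix}" for name in datasets]


print(names_for_merge_sketch("all_inputs", "intra_cv"))
print(names_for_merge_sketch("non_train_targets", "pre_cv"))
```
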
hyperparameter_hunter.feature_engineering.merge_dfs(merge_to:str, stage:str, dfs:Dict[str, pandas.core.frame.DataFrame]) → pandas.core.frame.DataFrame

Construct a multi-indexed DataFrame containing the values of dfs deemed necessary by merge_to and stage. This is the opposite of split_merged_df

Parameters
merge_to: String

Type of merged_df to produce. Should be one of the following: {“all_data”, “all_inputs”, “all_targets”, “non_train_data”, “non_train_inputs”, “non_train_targets”}

stage: String in {“pre_cv”, “intra_cv”}

Feature engineering stage for which merged_df is requested

dfs: Dict

Mapping of dataset names to their DataFrame values. Keys in dfs should be a subset of {“train_data”, “train_inputs”, “train_targets”, “validation_data”, “validation_inputs”, “validation_targets”, “holdout_data”, “holdout_inputs”, “holdout_targets”, “test_inputs”}

Returns
merged_df: pd.DataFrame

Multi-indexed DataFrame, in which the first index is a string naming the dataset in dfs from which the corresponding data originates. The following index(es) are the original index(es) from the dataset in dfs. All primary indexes in merged_df will be one of the strings considered to be valid keys for dfs

Raises
ValueError

If all the DataFrames that would have been used in merged_df are None. This can happen if requesting merge_to=“non_train_targets” during stage=“pre_cv” when there is no holdout dataset available. Under these circumstances, the holdout dataset targets would be the sole contents of merged_df, rendering merged_df invalid since the data is unavailable

See also

names_for_merge

Describes how stage values differ

hyperparameter_hunter.feature_engineering.split_merged_df(merged_df:pandas.core.frame.DataFrame) → Dict[str, pandas.core.frame.DataFrame]

Separate a multi-indexed DataFrame into a dict mapping primary indexes in merged_df to DataFrames containing one fewer dimension than merged_df. This is the opposite of merge_dfs

Parameters
merged_df: pd.DataFrame

Multi-indexed DataFrame of the form returned by merge_dfs() to split into the separate DataFrames named by the primary indexes of merged_df

Returns
dfs: Dict

Mapping of dataset names to their DataFrame values. Keys in dfs will be a subset of {“train_data”, “train_inputs”, “train_targets”, “validation_data”, “validation_inputs”, “validation_targets”, “holdout_data”, “holdout_inputs”, “holdout_targets”, “test_inputs”} containing only those values that are also primary indexes in merged_df
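The round trip between a dict of DataFrames and a multi-indexed DataFrame can be illustrated with plain pandas. This sketch uses `pd.concat` with `keys` to build the primary index and `.loc` to split it again. It shows the data layout only and does not call merge_dfs or split_merged_df, whose internals may differ.

```python
import pandas as pd

dfs = {
    "train_inputs": pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}),
    "holdout_inputs": pd.DataFrame({"a": [7, 8], "b": [9, 10]}),
}

# Merge: the dataset name becomes the first (primary) index level
merged_df = pd.concat(list(dfs.values()), keys=list(dfs.keys()))

# Split: selecting a primary index value drops that level again
split_dfs = {
    name: merged_df.loc[name] for name in merged_df.index.get_level_values(0).unique()
}

print(merged_df)
print(list(split_dfs))
```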

hyperparameter_hunter.feature_engineering.validate_dataset_names(params:Tuple[str], stage:str) → List[str]

Produce the names of merged datasets in params and verify there are no duplicate references to any datasets in params

Parameters
params: Tuple[str]

Dataset names requested by a feature engineering step callable. Must be a subset of {“train_data”, “train_inputs”, “train_targets”, “validation_data”, “validation_inputs”, “validation_targets”, “holdout_data”, “holdout_inputs”, “holdout_targets”, “test_inputs”, “all_data”, “all_inputs”, “all_targets”, “non_train_data”, “non_train_inputs”, “non_train_targets”}

stage: String in {“pre_cv”, “intra_cv”}

Feature engineering stage for which merged_df is requested

Returns
List[str]

Names of merged datasets in params

Raises
ValueError

If requested params contain a duplicate reference to any dataset, either by way of merging/coupling or not
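Duplicate detection of this kind can be sketched by expanding every requested name into the standard datasets it reaches and then looking for leaves reached more than once. The helpers `expand` and `find_duplicates` are hypothetical. The expansion rules are assumptions based on the documented meaning of merged ("all"/"non_train") and coupled ("data") datasets.

```python
def expand(name, stage):
    """Hypothetical helper: paths from `name` down to the standard datasets it contains."""
    prefix, _, suffix = name.rpartition("_")
    if prefix in ("all", "non_train"):  # Merged dataset
        members = ["train", "validation", "holdout", "test"]
        if prefix == "non_train":
            members.remove("train")
        if stage == "pre_cv":
            members.remove("validation")
        if suffix != "inputs":
            members.remove("test")
        return [
            (name,) + path
            for member in members
            for path in expand(f"{member}_{suffix}", stage)
        ]
    if suffix == "data":  # Coupled dataset: inputs plus targets
        return [(name, f"{prefix}_inputs"), (name, f"{prefix}_targets")]
    return [(name,)]  # Standard dataset


def find_duplicates(params, stage):
    """Map each standard dataset requested more than once to the paths that reach it."""
    paths_by_leaf = {}
    for param in params:
        for path in expand(param, stage):
            paths_by_leaf.setdefault(path[-1], []).append(path)
    return {leaf: paths for leaf, paths in paths_by_leaf.items() if len(paths) > 1}


print(find_duplicates(("train_inputs", "all_inputs"), "pre_cv"))
print(find_duplicates(("train_inputs", "non_train_inputs"), "intra_cv"))
```

A validator built on this would raise ValueError whenever `find_duplicates` returns a non-empty dict.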

class hyperparameter_hunter.feature_engineering.EngineerStep(f: Callable, stage=None, name=None, params=None, do_validate=False)

Bases: object

Container for individual FeatureEngineer step functions

Compartmentalizes functions of singular engineer steps and allows for greater customization than a raw engineer step function

Parameters
f: Callable

Feature engineering step function that requests, modifies, and returns datasets params

Step functions should follow these guidelines:

  1. Request as input a subset of the 11 data strings listed in params

  2. Do whatever you want to the DataFrames given as input

  3. Return new DataFrame values of the input parameters in same order as requested

If performing a task like target transformation, causing predictions to be transformed, it is often desirable to inverse-transform the predictions to be of the expected form. This can easily be done by returning an extra value from f (after the datasets) that is either a callable, or a transformer class that was fitted during the execution of f and implements an inverse_transform method. This is the only instance in which it is acceptable for f to return values that don’t mimic its input parameters. See the engineer function definition using SKLearn’s QuantileTransformer in the Examples section below for an actual inverse-transformation-compatible implementation

stage: String in {“pre_cv”, “intra_cv”}, or None, default=None

Feature engineering stage during which the callable f will be given the datasets params to modify and return. If None, will be inferred based on params.

  • “pre_cv” functions are applied only once in the experiment: when it starts

  • “intra_cv” functions are reapplied for each fold in the cross-validation splits

If stage is left to be inferred, “pre_cv” will usually be selected. However, if any params (or parameters in the signature of f) are prefixed with “validation…” or “non_train…”, then stage will be inferred as “intra_cv”. See the Notes section below for suggestions on the stage to use for different functions
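The prefix rule for stage inference is simple enough to sketch with `inspect.signature`. The helper `infer_stage` is hypothetical and encodes only the rule stated above. The library's own logic lives in get_engineering_step_stage() and may consider more cases.

```python
import inspect


def infer_stage(f, params=None):
    """Hypothetical helper applying the prefix rule to `params` or the signature of `f`."""
    names = params if params is not None else tuple(inspect.signature(f).parameters)
    if any(name.startswith(("validation", "non_train")) for name in names):
        return "intra_cv"
    return "pre_cv"


def s_scale(train_inputs, non_train_inputs):
    return train_inputs, non_train_inputs


def sqr_sum(all_inputs):
    return all_inputs


print(infer_stage(s_scale))
print(infer_stage(sqr_sum))
```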

name: String, or None, default=None

Identifier for the transformation applied by this engineering step. If None, f.__name__ will be used

params: Tuple[str], or None, default=None

Dataset names requested by feature engineering step callable f. If None, will be inferred by parsing the signature of f. Must be a subset of the following 11 strings:

Input Data

  1. “train_inputs”

  2. “validation_inputs”

  3. “holdout_inputs”

  4. “test_inputs”

  5. “all_inputs”

    ("train_inputs" + ["validation_inputs"] + "holdout_inputs" + "test_inputs")

  6. “non_train_inputs”

    (["validation_inputs"] + "holdout_inputs" + "test_inputs")

Target Data

  1. “train_targets”

  2. “validation_targets”

  3. “holdout_targets”

  4. “all_targets” ("train_targets" + ["validation_targets"] + "holdout_targets")

  5. “non_train_targets” (["validation_targets"] + "holdout_targets")

As an alternative to the above list, just remember that the first half of all parameter names should be one of {“train”, “validation”, “holdout”, “test”, “all”, “non_train”}, and the second half should be either “inputs” or “targets”. The only exception to this rule is “test_targets”, which doesn’t exist.

Inference of “validation” params is affected by stage. During the “pre_cv” stage, the validation dataset has not yet been created and is still a part of the train dataset. During the “intra_cv” stage, the validation dataset is created by removing a portion of the train dataset, and their values passed to f reflect this fact. This also means that the values of the merged (“all”/“non_train”-prefixed) datasets may or may not contain “validation” data depending on the stage; however, this is all handled internally, so you probably don’t need to worry about it.

params may not include multiple references to the same dataset, either directly or indirectly. This means (“train_inputs”, “train_inputs”) is invalid due to duplicate direct references. Less obviously, (“train_inputs”, “all_inputs”) is invalid because “all_inputs” includes “train_inputs”

do_validate: Boolean, or “strict”, default=False

… Experimental… Whether to validate the datasets resulting from feature engineering steps. If True, hashes of the new datasets will be compared to those of the originals to ensure they were actually modified. Results will be logged. If do_validate = “strict”, an exception will be raised if any anomalies are found, rather than logging a message. If do_validate = False, no validation will be performed

See also

FeatureEngineer

The container for EngineerStep instances - EngineerSteps should always be provided to HyperparameterHunter through a FeatureEngineer

Categorical

Can be used during optimization to search through a group of EngineerSteps given as categories. The optional kwarg of Categorical designates a FeatureEngineer step that may be one of the EngineerSteps in categories, or may be omitted entirely

get_engineering_step_stage()

More information on stage inference and situations where overriding it may be prudent

Notes

stage: Generally, feature engineering conducted in the “pre_cv” stage should treat each sample/row as an independent entity. For example, steps like converting a string day of the week to one-hot encoded columns, or imputing missing values by replacement with -1 might be conducted “pre_cv”, since they are unlikely to introduce information leakage. Conversely, steps like scaling/normalization, whose results for the data in one row are affected by the data in other rows, should be performed “intra_cv” in order to recalculate the final values of the datasets for each cross validation split and avoid information leakage.

params: In the list of the 11 valid params strings, “test_inputs” is notably missing the “…_targets” counterpart accompanying the other datasets. The “targets” suffix is missing because test data targets are never given. Note that although “test_inputs” is still included in both “all_inputs” and “non_train_inputs”, its lack of a target column means that “all_targets” and “non_train_targets” may have different lengths than their “inputs”-suffixed counterparts

Examples

>>> from sklearn.preprocessing import StandardScaler, QuantileTransformer
>>> def s_scale(train_inputs, non_train_inputs):
...     s = StandardScaler()
...     train_inputs[train_inputs.columns] = s.fit_transform(train_inputs.values)
...     non_train_inputs[train_inputs.columns] = s.transform(non_train_inputs.values)
...     return train_inputs, non_train_inputs
>>> # Sensible parameter defaults inferred based on `f`
>>> es_0 = EngineerStep(s_scale)
>>> es_0.stage
'intra_cv'
>>> es_0.name
's_scale'
>>> es_0.params
('train_inputs', 'non_train_inputs')
>>> # Override `stage` if you want to fit your scaler on OOF data like a crazy person
>>> es_1 = EngineerStep(s_scale, stage="pre_cv")
>>> es_1.stage
'pre_cv'

Watch out for multiple requests to the same data

>>> es_2 = EngineerStep(s_scale, params=("train_inputs", "all_inputs"))
Traceback (most recent call last):
    File "feature_engineering.py", line ? in validate_dataset_names
ValueError: Requested params include duplicate references to `train_inputs` by way of:
   - ('all_inputs', 'train_inputs')
   - ('train_inputs',)
Each dataset may only be requested by a single param for each function

Error is the same if `(train_inputs, all_inputs)` is in the actual function signature

EngineerStep functions aren’t just limited to transformations. Make your own features!

>>> import numpy as np
>>> def sqr_sum(all_inputs):
...     all_inputs["square_sum"] = all_inputs.agg(
...         lambda row: np.sqrt(np.sum([np.square(_) for _ in row])), axis="columns"
...     )
...     return all_inputs
>>> es_3 = EngineerStep(sqr_sum)
>>> es_3.stage
'pre_cv'
>>> es_3.name
'sqr_sum'
>>> es_3.params
('all_inputs',)

Inverse-transformation Implementation:

>>> def q_transform(train_targets, non_train_targets):
...     t = QuantileTransformer(output_distribution="normal")
...     train_targets[train_targets.columns] = t.fit_transform(train_targets.values)
...     non_train_targets[train_targets.columns] = t.transform(non_train_targets.values)
...     return train_targets, non_train_targets, t
>>> # Note that `train_targets` and `non_train_targets` must still be returned in order,
>>> #   but they are followed by `t`, an instance of `QuantileTransformer` we just fitted,
>>> #   whose `inverse_transform` method will be called on predictions
>>> es_4 = EngineerStep(q_transform)
>>> es_4.stage
'intra_cv'
>>> es_4.name
'q_transform'
>>> es_4.params
('train_targets', 'non_train_targets')
>>> # `params` does not include any returned transformers - Only data requested as input

Attributes
f

Feature engineering step callable that requests, modifies, and returns datasets

name

Identifier for the transformation applied by this engineering step

params

Dataset names requested by feature engineering step callable f.

stage

Feature engineering stage during which the EngineerStep will be executed

Methods

__call__(self, **datasets, …)

Apply f to datasets to produce updated datasets.

get_comparison_attrs(step_obj)

Build a dict of critical EngineerStep attributes

get_datasets_for_f(self, datasets, …)

Produce a dict of DataFrames containing only the merged datasets and standard datasets requested in params.

get_key_data(self)

Produce a dict of critical attributes describing the EngineerStep instance for use by key-making classes

honorary_step_from_dict(step_dict, dimension)

Get an EngineerStep from dimension that is equal to its dict form, step_dict

inverse_transform(self, data)

Perform the inverse transformation for this engineer step (if it exists)

stringify(self)

Make a stringified representation of self, compatible with EngineerStep.__eq__()

inverse_transform(self, data)

Perform the inverse transformation for this engineer step (if it exists)

Parameters
data: Array-like

Data to inverse transform with inversion or inversion.inverse_transform

Returns
Array-like

If inversion is None, return data unmodified. Else, return the result of inversion or inversion.inverse_transform, given data
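The three documented outcomes (no inversion, a callable, or an object with inverse_transform) can be sketched as a small dispatch function. `apply_inversion` and `DoubleTransformer` are hypothetical names, and the order in which the callable and inverse_transform cases are checked is an assumption.

```python
def apply_inversion(inversion, data):
    """Hypothetical helper mirroring the documented return behavior."""
    if inversion is None:
        return data  # Nothing to invert
    if callable(inversion):
        return inversion(data)
    return inversion.inverse_transform(data)


class DoubleTransformer:
    """Toy transformer: `transform` doubles values, `inverse_transform` halves them."""

    def transform(self, data):
        return [x * 2 for x in data]

    def inverse_transform(self, data):
        return [x / 2 for x in data]


predictions = [2, 4]
print(apply_inversion(None, predictions))
print(apply_inversion(lambda d: [x - 1 for x in d], predictions))
print(apply_inversion(DoubleTransformer(), predictions))
```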

get_datasets_for_f(self, datasets:Dict[str, pandas.core.frame.DataFrame]) → Dict[str, pandas.core.frame.DataFrame]

Produce a dict of DataFrames containing only the merged datasets and standard datasets requested in params. In other words, add the requested merged datasets and remove unnecessary standard datasets

Parameters
datasets: DFDict

Original dict of datasets, containing all datasets provided to EngineerStep.__call__(), some of which may be superfluous, or may require additional processing to resolve merged/coupled datasets

Returns
DFDict

Updated version of datasets, in which unnecessary datasets have been filtered out, and the requested merged datasets have been added

get_key_data(self) → dict

Produce a dict of critical attributes describing the EngineerStep instance for use by key-making classes

Returns
Dict

Important attributes describing this EngineerStep instance

property f

Feature engineering step callable that requests, modifies, and returns datasets

property name

Identifier for the transformation applied by this engineering step

property params

Dataset names requested by feature engineering step callable f. See documentation in EngineerStep.__init__() for more information/restrictions

property stage

Feature engineering stage during which the EngineerStep will be executed

static get_comparison_attrs(step_obj:Union[EngineerStep, dict]) → dict

Build a dict of critical EngineerStep attributes

Parameters
step_obj: EngineerStep, dict

Object for which critical EngineerStep attributes should be collected

Returns
attr_vals: Dict

Critical EngineerStep attributes. If step_obj does not have a necessary attribute (for EngineerStep) or a necessary key (for dict), its value in attr_vals will be a placeholder object. This is to facilitate comparison, while also ensuring missing values will always be considered unequal to other values

Examples

>>> def dummy_f(train_inputs, non_train_inputs):
...     return train_inputs, non_train_inputs
>>> es_0 = EngineerStep(dummy_f)
>>> EngineerStep.get_comparison_attrs(es_0)  
{'name': 'dummy_f',
 'f': <function dummy_f at ...>,
 'params': ('train_inputs', 'non_train_inputs'),
 'stage': 'intra_cv',
 'do_validate': False}
>>> EngineerStep.get_comparison_attrs(
...     dict(foo="hello", f=dummy_f, params=["all_inputs", "all_targets"], stage="pre_cv")
... )  
{'name': <object object at ...>,
 'f': <function dummy_f at ...>,
 'params': ('all_inputs', 'all_targets'),
 'stage': 'pre_cv',
 'do_validate': <object object at ...>}
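
The placeholder behavior described above can be illustrated with a minimal sketch. It is not the library's implementation: `comparison_attrs_sketch` and `CRITICAL_ATTRS` are hypothetical names, and only the dict form of step_obj is handled. A fresh `object()` is equal only to itself, so a missing value never compares equal to anything else.

```python
from typing import Any, Dict

# Attribute names taken from the documented output of `get_comparison_attrs`
CRITICAL_ATTRS = ("name", "f", "params", "stage", "do_validate")


def comparison_attrs_sketch(step_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Collect critical attributes from a dict, filling gaps with unique placeholders"""
    attr_vals = {}
    for attr in CRITICAL_ATTRS:
        # A fresh `object()` is equal only to itself, so a missing value can
        # never match a real value (or another missing value)
        attr_vals[attr] = step_dict.get(attr, object())
    # Assumption: normalize `params` to a tuple so list and tuple forms compare equal
    if isinstance(attr_vals["params"], list):
        attr_vals["params"] = tuple(attr_vals["params"])
    return attr_vals


first = comparison_attrs_sketch(dict(foo="hello", params=["all_inputs"], stage="pre_cv"))
second = comparison_attrs_sketch(dict(foo="hello", params=["all_inputs"], stage="pre_cv"))
```

Here `first` and `second` agree on "params" and "stage", yet the two dicts are unequal because each missing attribute holds its own placeholder.
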
stringify(self) → str

Make a stringified representation of self, compatible with EngineerStep.__eq__()

Returns
String

String describing all critical attributes of the EngineerStep instance. This value is not particularly human-friendly due to both its length and the fact that EngineerStep.f is represented by its hash

Examples

>>> def dummy_f(train_inputs, non_train_inputs):
...     return train_inputs, non_train_inputs
>>> EngineerStep(dummy_f).stringify()  
"EngineerStep(dummy_f, ..., ('train_inputs', 'non_train_inputs'), intra_cv, False)"
>>> EngineerStep(dummy_f, stage="pre_cv").stringify()  
"EngineerStep(dummy_f, ..., ('train_inputs', 'non_train_inputs'), pre_cv, False)"
classmethod honorary_step_from_dict(step_dict:dict, dimension:hyperparameter_hunter.space.dimensions.Categorical)

Get an EngineerStep from dimension that is equal to its dict form, step_dict

Parameters
step_dict: Dict

Dict of form saved in Experiment description files for EngineerStep. Expected to have following keys, with values of the given types:

  • “name”: String

  • “f”: String (SHA256 hash)

  • “params”: List[str], or Tuple[str, …]

  • “stage”: String in {“pre_cv”, “intra_cv”}

  • “do_validate”: Boolean

dimension: Categorical

Categorical instance expected to contain the EngineerStep equivalent of step_dict in its categories

Returns
EngineerStep

From dimension.categories if it is the EngineerStep equivalent of step_dict

Raises
ValueError

If dimension.categories does not contain an EngineerStep matching step_dict

class hyperparameter_hunter.feature_engineering.FeatureEngineer(steps=None, do_validate=False, **datasets: Dict[str, pandas.core.frame.DataFrame])

Bases: object

Class to organize feature engineering step callables steps (EngineerStep instances) and the datasets that the steps request and return.

Parameters
steps: List, or None, default=None

List of arbitrary length, containing any of the following values:

  1. EngineerStep instance,

  2. Function to provide as input to EngineerStep, or

  3. Categorical, with categories comprising a selection of the previous two steps values (optimization only)

The third value can only be used during optimization. The feature_engineer provided to CVExperiment, for example, may only contain the first two values. To search a space optionally including an EngineerStep, use the optional kwarg of Categorical.

See EngineerStep for information on properly formatted EngineerStep functions. Additional engineering steps may be added via add_step()

do_validate: Boolean, or “strict”, default=False

… Experimental… Whether to validate the datasets resulting from feature engineering steps. If True, hashes of the new datasets will be compared to those of the originals to ensure they were actually modified. Results will be logged. If do_validate = “strict”, an exception will be raised if any anomalies are found, rather than logging a message. If do_validate = False, no validation will be performed

**datasets: DFDict

This is not expected to be provided on initialization and is offered primarily for debugging/testing. Mapping of datasets necessary to perform feature engineering steps

See also

EngineerStep

For proper formatting of non-Categorical values of steps

Notes

If steps does include any instances of hyperparameter_hunter.space.dimensions.Categorical, this FeatureEngineer instance will not be usable by Experiments. It can only be used by Optimization Protocols. Furthermore, the FeatureEngineer that the Optimization Protocol actually ends up using will not pass identity checks against the original FeatureEngineer that contained Categorical steps

Examples

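>>> import numpy as np
>>> from sklearn.preprocessing import StandardScaler, MinMaxScaler, QuantileTransformer
>>> from hyperparameter_hunter.feature_engineering import EngineerStep, FeatureEngineer
>>> from hyperparameter_hunter.space.dimensions import Categorical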
>>> # Define some engineer step functions to play with
>>> def s_scale(train_inputs, non_train_inputs):
...     s = StandardScaler()
...     train_inputs[train_inputs.columns] = s.fit_transform(train_inputs.values)
...     non_train_inputs[train_inputs.columns] = s.transform(non_train_inputs.values)
...     return train_inputs, non_train_inputs
>>> def mm_scale(train_inputs, non_train_inputs):
...     s = MinMaxScaler()
...     train_inputs[train_inputs.columns] = s.fit_transform(train_inputs.values)
...     non_train_inputs[train_inputs.columns] = s.transform(non_train_inputs.values)
...     return train_inputs, non_train_inputs
>>> def q_transform(train_targets, non_train_targets):
...     t = QuantileTransformer(output_distribution="normal")
...     train_targets[train_targets.columns] = t.fit_transform(train_targets.values)
...     non_train_targets[train_targets.columns] = t.transform(non_train_targets.values)
...     return train_targets, non_train_targets, t
>>> def sqr_sum(all_inputs):
...     all_inputs["square_sum"] = all_inputs.agg(
...         lambda row: np.sqrt(np.sum([np.square(_) for _ in row])), axis="columns"
...     )
...     return all_inputs

FeatureEngineer steps wrapped by `EngineerStep` == raw function steps - as long as the `EngineerStep` is using the default parameters

>>> # FeatureEngineer steps wrapped by `EngineerStep` == raw function steps
>>> #   ... As long as the `EngineerStep` is using the default parameters
>>> fe_0 = FeatureEngineer([sqr_sum, s_scale])
>>> fe_1 = FeatureEngineer([EngineerStep(sqr_sum), EngineerStep(s_scale)])
>>> fe_0.steps == fe_1.steps
True
>>> fe_2 = FeatureEngineer([sqr_sum, EngineerStep(s_scale), q_transform])

`Categorical` can be used during optimization and placed anywhere in `steps`. `Categorical` can also handle either `EngineerStep` categories or raw functions. Use the `optional` kwarg of `Categorical` to test some questionable steps

>>> fe_3 = FeatureEngineer([sqr_sum, Categorical([s_scale, mm_scale]), q_transform])
>>> fe_4 = FeatureEngineer([Categorical([sqr_sum], optional=True), s_scale, q_transform])
>>> fe_5 = FeatureEngineer([
...     Categorical([sqr_sum], optional=True),
...     Categorical([EngineerStep(s_scale), mm_scale]),
...     q_transform
... ])
Attributes
steps

Feature engineering steps to execute in sequence on FeatureEngineer.__call__()

Methods

__call__(self, stage, **datasets, …)

Execute all feature engineering steps in steps for stage, with datasets datasets as inputs

add_step(self, step, …)

Add an engineering step to steps to be executed with the other contents of steps on FeatureEngineer.__call__()

get_key_data(self)

Produce a dict of critical attributes describing the FeatureEngineer instance for use by key-making classes

inverse_transform(self, data)

Perform the inverse transformation for all engineer steps in steps in sequence on data

inverse_transform(self, data)

Perform the inverse transformation for all engineer steps in steps in sequence on data

Parameters
data: Array-like

Data to inverse transform with any inversions present in steps

Returns
Array-like

Result of sequentially calling inverse transformations in steps on data. If any step has EngineerStep.inversion = None, data is unmodified for that step, and proceeds to next engineer step inversion
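
As a rough illustration of the pass-through behavior, the sketch below chains inversions over a list of steps. `StepSketch`, `ShiftInverter`, and `inverse_transform_all` are hypothetical stand-ins, not library classes. Visiting steps in listed order is one reading of "in sequence" and is an assumption.

```python
from typing import Any, List, Optional


class StepSketch:
    """Hypothetical stand-in for an engineer step holding an optional `inversion`"""

    def __init__(self, inversion: Optional[Any] = None):
        self.inversion = inversion

    def inverse_transform(self, data):
        if self.inversion is None:
            return data  # Nothing to invert: pass data through unmodified
        if callable(self.inversion):
            return self.inversion(data)  # `inversion` is itself a callable
        return self.inversion.inverse_transform(data)  # Transformer-like object


class ShiftInverter:
    """Hypothetical transformer-like object exposing `inverse_transform`"""

    def __init__(self, offset: float):
        self.offset = offset

    def inverse_transform(self, data: List[float]) -> List[float]:
        return [x - self.offset for x in data]


def inverse_transform_all(steps: List[StepSketch], data):
    """Apply each step's inversion in listed order (ordering is an assumption)"""
    for step in steps:
        data = step.inverse_transform(data)
    return data


steps = [
    StepSketch(lambda xs: [x / 2 for x in xs]),  # Callable inversion
    StepSketch(None),  # No inversion: data passes through
    StepSketch(ShiftInverter(1)),  # Object with `inverse_transform`
]
restored = inverse_transform_all(steps, [4, 6])
```
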

property steps

Feature engineering steps to execute in sequence on FeatureEngineer.__call__()

get_key_data(self) → dict

Produce a dict of critical attributes describing the FeatureEngineer instance for use by key-making classes

Returns
Dict

Important attributes describing this FeatureEngineer instance

add_step(self, step:Union[Callable, hyperparameter_hunter.space.dimensions.Categorical], stage:str=None, name:str=None, before:str=EMPTY_SENTINEL, after:str=EMPTY_SENTINEL, number:int=EMPTY_SENTINEL)

Add an engineering step to steps to be executed with the other contents of steps on FeatureEngineer.__call__()

Parameters
step: Callable, or `EngineerStep`, or `Categorical`

If EngineerStep instance, will be added directly to steps. Otherwise, must be a feature engineering step callable that requests, modifies, and returns datasets, which will be used to instantiate an EngineerStep to add to steps. If Categorical, categories should contain EngineerStep instances or callables

stage: String in {“pre_cv”, “intra_cv”}, or None, default=None

Feature engineering stage during which the callable step will be executed

name: String, or None, default=None

Identifier for the transformation applied by this engineering step. If None and step is not an EngineerStep, will be inferred during EngineerStep instantiation

before: String, default=EMPTY_SENTINEL

… Experimental…

after: String, default=EMPTY_SENTINEL

… Experimental…

number: Integer, default=EMPTY_SENTINEL

… Experimental…

hyperparameter_hunter.feature_engineering.get_engineering_step_stage(datasets:Tuple[str, ...]) → str

Determine the stage in which a feature engineering step that requests datasets as input should be executed

Parameters
datasets: Tuple[str]

Dataset names requested by a feature engineering step callable

Returns
stage: {“pre_cv”, “intra_cv”}

“pre_cv” if a step processing the given datasets should be executed in the pre-cross-validation stage. “intra_cv” if the step should be executed for each cross-validation split. If any of the elements in datasets is prefixed with “validation” or “non_train”, stage will be “intra_cv”. Otherwise, it will be “pre_cv”

Notes

Generally, feature engineering conducted in the “pre_cv” stage should treat each sample/row as an independent entity. For example, steps like converting a string day of the week to one-hot encoded columns, or imputing missing values by replacement with -1 might be conducted “pre_cv”, since they are unlikely to introduce information leakage. Conversely, steps like scaling/normalization, whose results for the data in one row are affected by the data in other rows, should be performed “intra_cv” in order to recalculate the final values of the datasets for each cross-validation split and avoid information leakage

Technically, the inference of stage=“intra_cv” due to the existence of a “non_train”-prefixed value in datasets could unnecessarily force steps to be executed “intra_cv” if, for example, there is no validation data. However, this is safer than the alternative of executing these steps “pre_cv”, in which validation data would be a subset of train data, probably introducing information leakage. A simple workaround for this is to explicitly provide EngineerStep with the desired stage parameter to bypass this inference

Examples

>>> get_engineering_step_stage(("train_inputs", "validation_inputs", "holdout_inputs"))
'intra_cv'
>>> get_engineering_step_stage(("all_data",))
'pre_cv'
>>> get_engineering_step_stage(("all_inputs", "all_targets"))
'pre_cv'
>>> get_engineering_step_stage(("train_data", "non_train_data"))
'intra_cv'
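
The prefix rule stated under Returns can be written as a short sketch. `infer_stage_sketch` is a hypothetical name and follows only the rule as documented here, not necessarily the library's code.

```python
from typing import Tuple

# Prefixes that, per the documented rule, force per-split execution
INTRA_CV_PREFIXES = ("validation", "non_train")


def infer_stage_sketch(datasets: Tuple[str, ...]) -> str:
    """Return "intra_cv" if any requested dataset name has an intra-CV prefix"""
    if any(name.startswith(INTRA_CV_PREFIXES) for name in datasets):
        return "intra_cv"
    return "pre_cv"
```

Note that a one-element request must be written as a real tuple, such as `("all_data",)`. A bare string in parentheses would be iterated character by character.
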
class hyperparameter_hunter.feature_engineering.ParameterParser

Bases: ast.NodeVisitor

ast.NodeVisitor subclass that collects the arguments specified in the signature of a callable node, as well as the values returned by the callable, in the attributes args and returns, respectively

Methods

generic_visit(self, node)

Called if no explicit visitor function exists for a node.

visit(self, node)

Visit a node.

visit_Return

visit_arg

visit_arg(self, node)
visit_Return(self, node)
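
For orientation, a minimal `ast.NodeVisitor` subclass that gathers argument names and returned names might look like the sketch below. `ParameterParserSketch` is a hypothetical class, and the way it stores `returns` (one tuple of names per return statement) is an assumption rather than the library's format.

```python
import ast
from typing import List, Tuple


class ParameterParserSketch(ast.NodeVisitor):
    """Collect signature argument names in `args` and returned names in `returns`"""

    def __init__(self):
        self.args: List[str] = []
        self.returns: List[Tuple[str, ...]] = []

    def visit_arg(self, node: ast.arg):
        self.args.append(node.arg)  # Name of one argument in the signature
        self.generic_visit(node)

    def visit_Return(self, node: ast.Return):
        value = node.value
        if isinstance(value, ast.Tuple):
            # `return a, b` yields a tuple node; keep only plain names
            names = tuple(e.id for e in value.elts if isinstance(e, ast.Name))
        elif isinstance(value, ast.Name):
            names = (value.id,)  # `return a`
        else:
            names = ()  # Bare `return` or a non-name expression
        self.returns.append(names)
        self.generic_visit(node)


SOURCE = '''
def standard_scale(train_inputs, non_train_inputs):
    return train_inputs, non_train_inputs
'''

parser = ParameterParserSketch()
parser.visit(ast.parse(SOURCE))
```
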
hyperparameter_hunter.feature_engineering.get_engineering_step_params(f:Callable) → Tuple[str]

Verify that callable f requests valid input parameters, and returns a tuple of the same parameters, with the assumption that the parameters are modified by f

Parameters
f: Callable

Feature engineering step function that requests, modifies, and returns datasets

Returns
Tuple

Argument/return value names declared by f

Examples

>>> def impute_negative_one(all_inputs):
...     all_inputs.fillna(-1, inplace=True)
...     return all_inputs
>>> get_engineering_step_params(impute_negative_one)
('all_inputs',)
>>> def standard_scale(train_inputs, non_train_inputs):
...     scaler = StandardScaler()
...     train_inputs[train_inputs.columns] = scaler.fit_transform(train_inputs.values)
...     non_train_inputs[train_inputs.columns] = scaler.transform(non_train_inputs.values)
...     return train_inputs, non_train_inputs
>>> get_engineering_step_params(standard_scale)
('train_inputs', 'non_train_inputs')
>>> def error_invalid_dataset(train_inputs, foo):
...     return train_inputs, foo
>>> get_engineering_step_params(error_invalid_dataset)
Traceback (most recent call last):
    File "feature_engineering.py", line ?, in get_engineering_step_params
ValueError: Invalid dataset name: 'foo'
hyperparameter_hunter.feature_engineering.hash_datasets(datasets:dict) → dict

Describe datasets with dicts of hashes for their values, column names, and column values

Parameters
datasets: Dict

Mapping of dataset names to pandas.DataFrame instances

Returns
hashes: Dict

Mapping with same keys as datasets, whose values are dicts returned from _hash_dataset() that provide hashes for each DataFrame and its column names/values

Examples

>>> df_x = pd.DataFrame(dict(a=[0, 1], b=[2, 3], c=[4, 5]))
>>> df_y = pd.DataFrame(dict(a=[0, 1], b=[6, 7], d=[8, 9]))
>>> hash_datasets(dict(x=df_x, y=df_y)) == dict(x=_hash_dataset(df_x), y=_hash_dataset(df_y))
True

hyperparameter_hunter.importer module

This module provides utilities to intercept external imports and load them using custom logic

Related

hyperparameter_hunter.__init__

Executes the import hooks to ensure assets are properly imported prior to starting any real work

hyperparameter_hunter.tracers

Defines tracing metaclasses applied by hyperparameter_hunter.importer to imports

class hyperparameter_hunter.importer.Interceptor(module_name, custom_loader, asset_name=None)

Bases: _frozen_importlib_external.PathFinder

Class to intercept loading of an external module in order to provide custom loading logic

Parameters
module_name: String

The path of the module, for which loading should be handled by custom_loader

custom_loader: Descendant of `importlib.machinery.SourceFileLoader`

Should implement exec_module(), which should call its superclass’s exec_module(), then perform the custom loading logic, and return module

Methods

find_module(fullname[, path])

Find the module on sys.path or ‘path’ based on sys.path_hooks and sys.path_importer_cache.

find_spec(self, full_name[, path, target])

Perform custom loading logic if full_name == module_name

invalidate_caches()

Call the invalidate_caches() method on all path entry finders stored in sys.path_importer_cache (where implemented).

find_spec(self, full_name, path=None, target=None)

Perform custom loading logic if full_name == module_name
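
The interception pattern can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not the library's `Interceptor`: `InterceptorSketch`, `TaggingLoader`, and the module name `interceptor_sketch_target` are hypothetical, and the finder is a plain object rather than a `PathFinder` subclass.

```python
import importlib
import importlib.machinery
import importlib.util
import sys
import tempfile
from pathlib import Path


class TaggingLoader(importlib.machinery.SourceFileLoader):
    """Hypothetical custom loader: run the module normally, then modify it"""

    def exec_module(self, module):
        super().exec_module(module)  # Standard loading first
        module.INTERCEPTED = True  # Custom logic: tag the loaded module


class InterceptorSketch:
    """Meta path finder that hands one named module to `custom_loader`"""

    def __init__(self, module_name, custom_loader):
        self.module_name = module_name
        self.custom_loader = custom_loader

    def find_spec(self, full_name, path=None, target=None):
        if full_name != self.module_name:
            return None  # Defer to the remaining finders on `sys.meta_path`
        # Locate the source file the standard way, then swap in the custom loader
        spec = importlib.machinery.PathFinder.find_spec(full_name, path, target)
        if spec is None or spec.origin is None:
            return None
        loader = self.custom_loader(full_name, spec.origin)
        return importlib.util.spec_from_file_location(full_name, spec.origin, loader=loader)


def demo() -> bool:
    """Import a throwaway module through the interceptor; report whether it was tagged"""
    module_name = "interceptor_sketch_target"  # Hypothetical module name
    with tempfile.TemporaryDirectory() as tmp_dir:
        Path(tmp_dir, module_name + ".py").write_text("VALUE = 1\n")
        interceptor = InterceptorSketch(module_name, TaggingLoader)
        sys.path.insert(0, tmp_dir)
        sys.meta_path.insert(0, interceptor)  # Must precede the default finders
        importlib.invalidate_caches()
        try:
            module = importlib.import_module(module_name)
            return module.VALUE == 1 and getattr(module, "INTERCEPTED", False)
        finally:
            # Undo every global change so the demo leaves no trace
            sys.meta_path.remove(interceptor)
            sys.path.remove(tmp_dir)
            sys.modules.pop(module_name, None)
```

Placing the finder at the front of `sys.meta_path` matters: the default finders would otherwise claim the module before the custom loader is consulted.
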

class hyperparameter_hunter.importer.KerasLayerLoader(fullname, path)

Bases: _frozen_importlib_external.SourceFileLoader

Cache the module name and the path to the file found by the finder.

Methods

create_module(self, spec)

Use default semantics for module creation.

exec_module(self, module)

Set module.Layer to a traced version of itself via tracers.ArgumentTracer

get_code(self, fullname)

Concrete implementation of InspectLoader.get_code.

get_data(self, path)

Return the data from path as raw bytes.

get_filename(self[, name])

Return the path to the source file as found by the finder.

get_source(self, fullname)

Concrete implementation of InspectLoader.get_source.

is_package(self, fullname)

Concrete implementation of InspectLoader.is_package by checking if the path returned by get_filename has a filename of ‘__init__.py’.

load_module(self[, name])

Load a module from a file.

path_mtime(self, path)

Optional method that returns the modification time (an int) for the specified path, where path is a str.

path_stats(self, path)

Return the metadata for the path.

set_data(self, path, data, *[, _mode])

Write bytes data to a file.

source_to_code(self, data, path, *[, _optimize])

Return the code object compiled from source.

exec_module(self, module)

Set module.Layer to a traced version of itself via tracers.ArgumentTracer

hyperparameter_hunter.importer.hook_keras_layer()

If Keras has yet to be imported, modify the inheritance structure of its base Layer class to inject attributes that keep track of the parameters provided to each layer

class hyperparameter_hunter.importer.KerasMultiInitializerLoader(fullname, path)

Bases: _frozen_importlib_external.SourceFileLoader

Cache the module name and the path to the file found by the finder.

Methods

create_module(self, spec)

Use default semantics for module creation.

exec_module(self, module)

Execute the module.

get_code(self, fullname)

Concrete implementation of InspectLoader.get_code.

get_data(self, path)

Return the data from path as raw bytes.

get_filename(self[, name])

Return the path to the source file as found by the finder.

get_source(self, fullname)

Concrete implementation of InspectLoader.get_source.

is_package(self, fullname)

Concrete implementation of InspectLoader.is_package by checking if the path returned by get_filename has a filename of ‘__init__.py’.

load_module(self[, name])

Load a module from a file.

path_mtime(self, path)

Optional method that returns the modification time (an int) for the specified path, where path is a str.

path_stats(self, path)

Return the metadata for the path.

set_data(self, path, data, *[, _mode])

Write bytes data to a file.

source_to_code(self, data, path, *[, _optimize])

Return the code object compiled from source.

exec_module(self, module)

Execute the module.

hyperparameter_hunter.importer.hook_keras_initializers()

hyperparameter_hunter.metrics module

This module defines hyperparameter_hunter.metrics.ScoringMixIn which enables hyperparameter_hunter.experiments.BaseExperiment to score predictions and collect the results of those evaluations

Related

hyperparameter_hunter.experiments

This module uses hyperparameter_hunter.metrics.ScoringMixIn as the only explicit parent class to hyperparameter_hunter.experiments.BaseExperiment (that is, the only parent class that isn’t bestowed upon it by hyperparameter_hunter.experiment_core.ExperimentMeta)

class hyperparameter_hunter.metrics.Metric(name: str, metric_function: Union[callable, str, None] = None, direction: str = 'infer')

Bases: object

Class to encapsulate all necessary information for identifying, calculating, and evaluating metrics results

Parameters
name: String

Identifying name of the metric. Should be unique relative to any other metric names that might be provided by the user

metric_function: Callable, string, None, default=None

If callable, should expect inputs of form (target, prediction), and return a float. If string, will be treated as an attribute in sklearn.metrics. If None, name will be treated as an attribute in sklearn.metrics, the value of which will be retrieved and used as metric_function

direction: {“infer”, “max”, “min”}, default=“infer”

How to compare the result of metric_function relative to previous evaluations

  • “max”: Metric values should be maximized, and higher metric values are better than lower values; it should be used for measures of accuracy

  • “min”: Metric values should be minimized, and lower metric values are better than higher values; it should be used for measures of error or loss

  • “infer”: direction will be set to:

    1. “min” if name (or metric_function’s name) contains “error” or “loss”

    2. “max” if name contains neither of the aforementioned strings

Notes

direction = “infer” looks for “error”/“loss” in name first, then in the name of metric_function. This means that name can be an abbreviation/anything for error measures and direction will still be correctly inferred as long as the actual callable for metric_function has “error”/“loss” in its name. For example, direction = “min” is safely inferred when using “mae” for “mean_absolute_error” or “rmsle” for “root_mean_squared_logarithmic_error”. This functions as described whether metric_function is a string naming an SKLearn metric, or a callable whose name includes “error”/“loss”

Examples

>>> Metric("roc_auc_score")  
Metric(roc_auc_score, <function roc_auc_score at 0x...>, max)
>>> Metric("roc_auc_score", sk_metrics.roc_auc_score)  
Metric(roc_auc_score, <function roc_auc_score at 0x...>, max)
>>> Metric("my_f1_score", "f1_score")  
Metric(my_f1_score, <function f1_score at 0x...>, max)
>>> Metric("hamming_loss", sk_metrics.hamming_loss)  
Metric(hamming_loss, <function hamming_loss at 0x...>, min)

Respect explicit `direction` even if it doesn’t make sense for the `metric_function`

>>> Metric("r2_score", sk_metrics.r2_score, direction="min")  
Metric(r2_score, <function r2_score at 0x...>, min)

Direction inference based on `metric_function` name, rather than `name` itself

>>> Metric("mae", "median_absolute_error")  
Metric(mae, <function median_absolute_error at 0x...>, min)
>>> Metric("hl", sk_metrics.hamming_loss)  
Metric(hl, <function hamming_loss at 0x...>, min)
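
The inference rule from the Notes can be condensed into a small sketch. `infer_direction_sketch` is a hypothetical helper that works on plain strings. It is not the method `Metric` uses internally.

```python
def infer_direction_sketch(name: str, metric_function_name: str = "") -> str:
    """Return "min" if either name mentions "error" or "loss", else "max" """
    # Check the metric's own name first, then the name of its function
    for candidate in (name, metric_function_name):
        if "error" in candidate or "loss" in candidate:
            return "min"
    return "max"
```

With this rule, an alias such as "mae" still resolves to "min" as long as the function name passed alongside it is "median_absolute_error".
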

Methods

__call__(self, target, prediction)

Call self as a function.

hyperparameter_hunter.metrics.format_metrics(metrics:Union[Dict, List]) → Dict[str, hyperparameter_hunter.metrics.Metric]

Properly format iterable metrics to contain instances of Metric

Parameters
metrics: Dict, List

Iterable describing the metrics to be recorded, along with a means to compute the value of each metric. Should be of one of the two following forms:

List Form:

  • [“<metric name>”, “<metric name>”, …]: Where each value of the list is a string that names an attribute in sklearn.metrics

  • [Metric, Metric, …]: Where each value of the list is an instance of Metric

  • [(<*args>), (<*args>), …]: Where each value of the list is a tuple of arguments that will be used to instantiate a Metric. Arguments given in tuples must be in order expected by Metric

Dict Form:

  • {“<metric name>”: <metric_function>, …}: Where each key is a name for the corresponding metric callable, which is used to compute the value of the metric

  • {“<metric name>”: (<metric_function>, <direction>), …}: Where each key is a name for the corresponding metric callable and direction, all of which are used to instantiate a Metric

  • {“<metric name>”: “<sklearn metric name>”, …}: Where each key is a name for the metric, and each value is the name of the attribute in sklearn.metrics for which the corresponding key is an alias

  • {“<metric name>”: None, …}: Where each key is the name of the attribute in sklearn.metrics

  • {“<metric name>”: Metric, …}: Where each key names an instance of Metric. This is the internally-used format to which all other formats will be converted

Metric callable functions should expect inputs of form (target, prediction), and should return floats. See the documentation of Metric for information regarding expected parameters and types

Returns
metrics_dict: Dict

Cast of metrics to a dict, in which values are instances of Metric

Examples

>>> format_metrics(["roc_auc_score", "f1_score"])  
{'roc_auc_score': Metric(roc_auc_score, <function roc_auc_score at 0x...>, max), 'f1_score': Metric(f1_score, <function f1_score at 0x...>, max)}
>>> format_metrics([Metric("log_loss"), Metric("r2_score", direction="min")])  
{'log_loss': Metric(log_loss, <function log_loss at 0x...>, min), 'r2_score': Metric(r2_score, <function r2_score at 0x...>, min)}
>>> format_metrics({"log_loss": Metric("log_loss"), "r2_score": Metric("r2_score", direction="min")})  
{'log_loss': Metric(log_loss, <function log_loss at 0x...>, min), 'r2_score': Metric(r2_score, <function r2_score at 0x...>, min)}
>>> format_metrics([("log_loss", None), ("my_r2_score", "r2_score", "min")])  
{'log_loss': Metric(log_loss, <function log_loss at 0x...>, min), 'my_r2_score': Metric(my_r2_score, <function r2_score at 0x...>, min)}
>>> format_metrics({"roc_auc": sk_metrics.roc_auc_score, "f1": sk_metrics.f1_score})  
{'roc_auc': Metric(roc_auc, <function roc_auc_score at 0x...>, max), 'f1': Metric(f1, <function f1_score at 0x...>, max)}
>>> format_metrics({"log_loss": (None, ), "my_r2_score": ("r2_score", "min")})  
{'log_loss': Metric(log_loss, <function log_loss at 0x...>, min), 'my_r2_score': Metric(my_r2_score, <function r2_score at 0x...>, min)}
>>> format_metrics({"roc_auc": "roc_auc_score", "f1": "f1_score"})  
{'roc_auc': Metric(roc_auc, <function roc_auc_score at 0x...>, max), 'f1': Metric(f1, <function f1_score at 0x...>, max)}
>>> format_metrics({"roc_auc_score": None, "f1_score": None})  
{'roc_auc_score': Metric(roc_auc_score, <function roc_auc_score at 0x...>, max), 'f1_score': Metric(f1_score, <function f1_score at 0x...>, max)}
hyperparameter_hunter.metrics.get_formatted_target_metric(target_metric:Union[tuple, str, NoneType], metrics:dict, default_dataset:str='oof') → Tuple[str, str]

Return a properly formatted target_metric tuple for use with navigating evaluation results

Parameters
target_metric: Tuple, String, or None

Path denoting the metric to be used. If tuple, the first value should be in [‘oof’, ‘holdout’, ‘in_fold’], and the second value should be the name of a metric supplied in metrics. If string, should be one of the two values from the tuple form; the other value is filled in with default_dataset or with the first metric in metrics. If None, both values are filled in this way

metrics: Dict

Properly formatted metrics as produced by metrics.format_metrics(), in which keys are strings identifying metrics, and values are instances of metrics.Metric. See the documentation of metrics.format_metrics() for more information on different metrics formats

default_dataset: {“oof”, “holdout”, “in_fold”}, default=“oof”

The default dataset type value to use if one is not provided

Returns
target_metric: Tuple

A formatted target_metric containing two strings: a dataset_type, followed by a metric name

Examples

>>> get_formatted_target_metric(('holdout', 'roc_auc_score'), format_metrics(['roc_auc_score', 'f1_score']))
('holdout', 'roc_auc_score')
>>> get_formatted_target_metric(('holdout',), format_metrics(['roc_auc_score', 'f1_score']))
('holdout', 'roc_auc_score')
>>> get_formatted_target_metric('holdout', format_metrics(['roc_auc_score', 'f1_score']))
('holdout', 'roc_auc_score')
>>> get_formatted_target_metric('holdout', format_metrics({'roc': 'roc_auc_score', 'f1': 'f1_score'}))
('holdout', 'roc')
>>> get_formatted_target_metric('roc_auc_score', format_metrics(['roc_auc_score', 'f1_score']))
('oof', 'roc_auc_score')
>>> get_formatted_target_metric(None, format_metrics(['f1_score', 'roc_auc_score']))
('oof', 'f1_score')
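
The resolution behavior shown in these examples can be reproduced with the sketch below. It is inferred from the documented examples only: `format_target_metric_sketch` is a hypothetical name, metrics is treated as any insertion-ordered dict keyed by metric name, and validation of unknown names is omitted.

```python
from typing import Dict, Optional, Tuple, Union

DATASET_TYPES = ("oof", "holdout", "in_fold")


def format_target_metric_sketch(
    target_metric: Union[Tuple[str, ...], str, None],
    metrics: Dict[str, object],
    default_dataset: str = "oof",
) -> Tuple[str, str]:
    """Fill in whichever half of (dataset_type, metric_name) is missing"""
    first_metric = next(iter(metrics))  # Dicts preserve insertion order
    if target_metric is None:
        return (default_dataset, first_metric)
    if isinstance(target_metric, str):
        target_metric = (target_metric,)  # Treat a lone string as a 1-tuple
    target_metric = tuple(target_metric)
    if len(target_metric) == 1:
        (value,) = target_metric
        if value in DATASET_TYPES:
            return (value, first_metric)  # Only the dataset type was given
        return (default_dataset, value)  # Only the metric name was given
    dataset_type, metric_name = target_metric
    return (dataset_type, metric_name)
```
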
class hyperparameter_hunter.metrics.ScoringMixIn(metrics, in_fold='all', oof='all', holdout='all', do_score=True)

Bases: object

MixIn class to manage metrics to record for each dataset type, and perform evaluations

Parameters
metrics: Dict, List

Specifies all metrics to be used by their id keys, along with a means to compute the metric. If list, all values must be strings that are attributes in sklearn.metrics. If dict, key/value pairs must be of the form: (<id>, <callable/None/str sklearn.metrics attribute>), where “id” is a str name for the metric. Its corresponding value must be one of: 1) a callable to calculate the metric, 2) None if the “id” key is an attribute in sklearn.metrics and should be used to fetch a callable, 3) a string that is an attribute in sklearn.metrics and should be used to fetch a callable. Metric callable functions should expect inputs of form (target, prediction), and should return floats

in_fold: List of strings, None, default=<all ids in `metrics`>

Which metrics (from ids in metrics) should be recorded for in-fold data

oof: List of strings, None, default=<all ids in `metrics`>

Which metrics (from ids in metrics) should be recorded for out-of-fold data

holdout: List of strings, None, default=<all ids in `metrics`>

Which metrics (from ids in metrics) should be recorded for holdout data

do_score: Boolean, default=True

This is experimental. If False, scores will be neither calculated nor recorded for the duration of the experiment

Notes

For each kwarg in [in_fold, oof, holdout], the following must be true: if the value of the kwarg is a list, its contents must be a subset of metrics

Methods

evaluate(self, data_type, target, prediction)

Apply metric(s) to the given data to calculate the value of the prediction

evaluate(self, data_type, target, prediction, return_list=False, dry_run=False)

Apply metric(s) to the given data to calculate the value of the prediction

Parameters
data_type: {“in_fold”, “oof”, “holdout”}

The type of dataset for which target and prediction arguments are being provided

target: Array-like

True labels for the data. Should be same shape as prediction

prediction: Array-like

Predicted labels for the data. Should be same shape as target

return_list: Boolean, default=False

If True, return list of tuples instead of dict. See “Returns” section below for details

dry_run: Boolean, default=False

If True, the value of last_evaluation_results will not be updated to include the returned _result. The core library callbacks operate under the assumption that last_evaluation_results will be updated as usual, so restrict usage to debugging or lambda_callback() implementations

Returns
_result: OrderedDict, or list

A dict whose keys are all metric keys supplied for data_type, and whose values are the results of each metric. If return_list is True, returns a list of tuples of: (<data_type metric str>, <metric result>)

Notes

The required types of target and prediction are entirely dependent on the metric callable’s expectations

hyperparameter_hunter.metrics.get_clean_prediction(target:Iterable, prediction:Iterable)

Create prediction that is of a form comparable to target

Parameters
target: Array-like

True labels for the data. Should be same shape as prediction

prediction: Array-like

Predicted labels for the data. Should be same shape as target

Returns
prediction: Array-like

If the target values are integers and the prediction values are not, the predicted labels clipped between the min and max of target, then rounded to the nearest integer. Otherwise, the original predicted labels
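
A pure-Python sketch of the clip-then-round behavior follows. `clean_prediction_sketch` is a hypothetical name. It works on plain lists rather than arrays, and its handling of exact half-way values (Python's built-in `round`) is an assumption.

```python
from typing import List, Sequence, Union

Number = Union[int, float]


def clean_prediction_sketch(target: Sequence[Number], prediction: Sequence[Number]) -> List[Number]:
    """Clip and round `prediction` only when `target` is integer-valued and it is not"""
    target_is_int = all(isinstance(t, int) for t in target)
    prediction_is_int = all(isinstance(p, int) for p in prediction)
    if target_is_int and not prediction_is_int:
        low, high = min(target), max(target)
        # Clip into [low, high] first, then round to the nearest integer
        return [round(min(max(p, low), high)) for p in prediction]
    return list(prediction)  # Already comparable: return unchanged
```

For example, with target `[0, 1, 2]`, the prediction `[-0.3, 0.6, 2.7, 1.2]` is clipped to `[0, 0.6, 2, 1.2]` and then rounded to `[0, 1, 2, 1]`.
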

hyperparameter_hunter.metrics.classify_output(target, prediction)

Force continuous prediction into the discrete, classified space of target. This is not an output/feature transformer akin to SKLearn’s discretization transformers. It is intended for the very specific case in which target is classification-like (“binary”, “multiclass”, etc.), but prediction resembles a “continuous” target despite having been made for target. The most common cause is that prediction is actually the average of the division predictions collected over the course of a CVExperiment. In this case, the original model predictions should have been classification-like; however, because the division predictions disagree, the resulting averaged predictions appear to be continuous

Parameters
target: Array-like

# TODO: …

prediction: Array-like

# TODO: …

Returns
numpy.array

# TODO: …

Notes

Target types used by this function are defined by sklearn.utils.multiclass.type_of_target.

If a prediction value is exactly between two target values, it will assume the lower of the two values. For example, given a single prediction of 1.5 and unique labels of [0, 1, 2, 3], the value of that prediction will be 1, rather than 2
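The tie-breaking rule in the note above can be sketched as a nearest-label lookup. This is a hedged illustration only: classify_output_sketch is a hypothetical name, it works on plain lists rather than numpy arrays, and it does not consult sklearn.utils.multiclass.type_of_target as the real function is documented to do.

```python
def classify_output_sketch(target, prediction):
    """Snap each prediction to the nearest unique label in `target`, preferring the lower label on ties."""
    labels = sorted(set(target))
    # Sorting key: distance first, then the label itself, so an exact tie picks the lower label
    return [min(labels, key=lambda label: (abs(p - label), label)) for p in prediction]


snapped = classify_output_sketch([0, 3, 1, 2], [0.5, 1.51, 0.66, 4.9])
tie = classify_output_sketch([0, 1, 2, 3], [1.5])
```

Values outside the label range, such as 4.9 here, land on the closest extreme label.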

Examples

>>> import numpy as np
>>> classify_output(np.array([0, 3, 1, 2]), [0.5, 1.51, 0.66, 4.9])
array([0, 2, 1, 3])
>>> classify_output(np.array([0, 1, 2, 3]), [0.5, 1.51, 0.66, 4.9])
array([0, 2, 1, 3])
>>> # TODO: ... Add more examples, including binary classification
hyperparameter_hunter.metrics.wrap_xgboost_metric(metric, metric_name)

Create a function to use as the eval_metric kwarg for xgboost.sklearn.XGBModel.fit()

Parameters
metric: Function

The function to calculate the value of metric, with signature: (target, prediction)

metric_name: String

The name of the metric being evaluated

Returns
eval_metric: Function

The function to pass to XGBoost’s fit(), with signature: (prediction, target). It will return a tuple of (metric_name: str, metric_value: float)
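The argument swap between the two signatures can be sketched as a simple closure. This is a minimal sketch, assuming both arguments arrive as plain sequences; the real wrapper may need to extract labels from an XGBoost data object first. wrap_metric_sketch and bias are hypothetical names.

```python
def bias(target, prediction):
    """Mean signed error; deliberately asymmetric so that argument order matters."""
    return sum(t - p for t, p in zip(target, prediction)) / len(target)


def wrap_metric_sketch(metric, metric_name):
    """Adapt a (target, prediction) metric to a (prediction, target) evaluation callable."""

    def eval_metric(prediction, target):
        # Swap the arguments back into the order the metric expects
        return (metric_name, float(metric(target, prediction)))

    return eval_metric


wrapped = wrap_metric_sketch(bias, "bias")
name, value = wrapped([1.0, 2.0], [2.0, 4.0])  # prediction first, then target
```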

hyperparameter_hunter.models module

This module provides wrapper classes around the raw algorithms being executed to facilitate use by hyperparameter_hunter.experiments.BaseExperiment. The algorithms created by most libraries can be handled by hyperparameter_hunter.models.Model, but some need special attention, hence KerasModel and XGBoostModel. The model classes defined herein handle algorithm instantiation, as well as fitting and predicting

Related

hyperparameter_hunter.experiments

This module is the primary user of the classes defined in hyperparameter_hunter.models

hyperparameter_hunter.sentinels

This module defines the Sentinel classes that will be converted to the actual values they represent in hyperparameter_hunter.models.Model.__init__()

hyperparameter_hunter.models.load_model(_)
hyperparameter_hunter.models.model_selector(model_initializer)

Selects the appropriate Model class to use for model_initializer

Parameters
model_initializer: callable

The callable used to create an instance of some algorithm

Returns
Model, or one of its children

Examples

>>> from sklearn.svm import SVC
>>> model_selector(SVC) == Model
True
>>> model_selector(None) == Model
True
class hyperparameter_hunter.models.Model(model_initializer, initialization_params, extra_params, train_input=None, train_target=None, validation_input=None, validation_target=None, do_predict_proba=False, target_metric=None, metrics=None)

Bases: object

Handles initialization, fitting, and prediction for provided algorithms. Consider documentation for children of Model to be identical to that of Model, except where noted

Parameters
model_initializer: Class

Expected to implement at least the following methods: 1) __init__, to which initialization_params will usually be provided unless stated otherwise in a child class’s documentation - like KerasModel. 2) fit, to which train_input, and train_target will be provided, in addition to the contents of extra_params['fit'] in some child classes - like XGBoostModel. 3) predict, or predict_proba if applicable, which should accept any array-like input of shape: (<num_samples>, train_input.shape[1])

initialization_params: Dict

A dict containing all arguments accepted by __init__() of the class model_initializer, unless stated otherwise in a child class’s documentation - like KerasModel. Arguments pertaining to random seeds will be ignored

extra_params: Dict, default={}

A dict of special parameters that are passed to a model’s non-initialization methods in special cases (such as fit, predict, predict_proba, and score). extra_params are not used for all models. See the documentation for the appropriate descendant of models.Model for information about how it handles extra_params

train_input: pandas.DataFrame

The model’s training input data

train_target: pandas.DataFrame

The true labels corresponding to the rows of train_input

validation_input: pandas.DataFrame, or None

The model’s validation input data to evaluate performance during fitting

validation_target: pandas.DataFrame, or None

The true labels corresponding to the rows of validation_input

do_predict_proba: Boolean, or int, default=False
  • If False, models.Model.fit() will call models.Model.model.predict()

  • If True, it will call models.Model.model.predict_proba(), and the values in all columns will be used as the actual prediction values

  • If do_predict_proba is an int, models.Model.fit() will call models.Model.model.predict_proba(), as is the case when do_predict_proba is True, but the int supplied as do_predict_proba declares the column index to use as the actual prediction values

  • For example, for a model to call the predict method, do_predict_proba=False (default). For a model to call the predict_proba method, and use all of the class probabilities, do_predict_proba=True. To call the predict_proba method, and use the class probabilities in the first column, do_predict_proba=0. To use the second column (index 1) of the result, do_predict_proba=1 - This often corresponds to the positive class’s probabilities in binary classification problems. To use the third column do_predict_proba=2, and so on

  • See the notes for the do_predict_proba parameter in the documentation of environment.Environment for additional usage notes

target_metric: Tuple

Used by some child classes (like XGBoostModel) to provide validation data to model.fit()

metrics: Dict

Used by some child classes (like XGBoostModel) to provide validation data to model.fit()
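The three modes of do_predict_proba described above can be sketched as follows. This is an illustration, not the library's code: TinyClassifier and select_prediction are hypothetical, and the stand-in model returns fixed values. Note that the check uses identity (is False / is True), because 0 == False in Python and an index of 0 must still select the first probability column.

```python
class TinyClassifier:
    """Stand-in model with fixed outputs (hypothetical, for illustration only)."""

    def predict(self, rows):
        return [1 for _ in rows]

    def predict_proba(self, rows):
        return [[0.2, 0.8] for _ in rows]


def select_prediction(model, rows, do_predict_proba=False):
    """Choose between predict, all predict_proba columns, or a single predict_proba column."""
    if do_predict_proba is False:
        return model.predict(rows)
    probabilities = model.predict_proba(rows)
    if do_predict_proba is True:
        return probabilities
    # An int selects one column of the class probabilities
    return [row[do_predict_proba] for row in probabilities]


rows = [[0], [1]]
labels = select_prediction(TinyClassifier(), rows)
all_columns = select_prediction(TinyClassifier(), rows, do_predict_proba=True)
first_column = select_prediction(TinyClassifier(), rows, do_predict_proba=0)
second_column = select_prediction(TinyClassifier(), rows, do_predict_proba=1)
```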

Methods

fit(self)

Train model according to extra_params['fit'] (if appropriate) on training data

initialize_model(self)

Create an instance of a model using model_initializer, with initialization_params as input

predict(self, input_data)

Generate model predictions for input_data

initialize_model(self)

Create an instance of a model using model_initializer, with initialization_params as input

fit(self)

Train model according to extra_params['fit'] (if appropriate) on training data

predict(self, input_data)

Generate model predictions for input_data

Parameters
input_data: Array-like

Data containing the same number of features as were trained on, for which the model will predict output values

Returns
prediction: Array-like

Output predictions made by the model, using input_data

class hyperparameter_hunter.models.XGBoostModel(model_initializer, initialization_params, extra_params, train_input=None, train_target=None, validation_input=None, validation_target=None, do_predict_proba=False, target_metric=None, metrics=None)

Bases: hyperparameter_hunter.models.Model

A special Model class for handling XGBoost algorithms. Consider documentation to be identical to that of Model, except where noted

Parameters
model_initializer: xgboost.sklearn.XGBClassifier, or xgboost.sklearn.XGBRegressor

See Model

initialization_params: See Model
extra_params: Dict, default={}

Useful keys: [‘fit’, ‘predict’]. If ‘fit’ is a key with a dict value, its contents will be provided to xgboost.sklearn.XGBModel.fit(), with the exception of the following: [‘X’, ‘y’]. If any of the aforementioned keys are in extra_params['fit'] or if extra_params['fit'] is provided, but is not a dict, an Exception will be raised

train_input: See Model
train_target: See Model
validation_input: See Model
validation_target: See Model
do_predict_proba: See Model
target_metric: Tuple

Used to determine the ‘eval_metric’ argument to xgboost.sklearn.XGBModel.fit(). See the documentation for XGBoostModel.extra_params for more information

metrics: See Model
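The constraints on extra_params['fit'] described above can be sketched as a small validation helper. This is a hedged sketch: check_fit_params is a hypothetical name, and the specific exception types are assumptions, since the documentation only states that an Exception will be raised.

```python
RESERVED_FIT_KEYS = ("X", "y")  # supplied by the model wrapper itself, per the text above


def check_fit_params(extra_params):
    """Return the kwargs that may be forwarded to fit(), or raise if they are malformed."""
    fit_params = extra_params.get("fit")
    if fit_params is None:
        return {}
    if not isinstance(fit_params, dict):
        # Assumed exception type; the documentation only promises "an Exception"
        raise TypeError("extra_params['fit'] must be a dict")
    clashes = [key for key in RESERVED_FIT_KEYS if key in fit_params]
    if clashes:
        raise ValueError(f"extra_params['fit'] may not contain: {clashes}")
    return dict(fit_params)


forwarded = check_fit_params({"fit": {"verbose": False}})
```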

Methods

fit(self)

Train model according to extra_params['fit'] (if appropriate) on training data

initialize_model(self)

Create an instance of a model using model_initializer, with initialization_params as input

predict(self, input_data)

Generate model predictions for input_data

class hyperparameter_hunter.models.KerasModel(model_initializer, initialization_params, extra_params, train_input=None, train_target=None, validation_input=None, validation_target=None, do_predict_proba=False, target_metric=None, metrics=None)

Bases: hyperparameter_hunter.models.Model

A special Model class for handling Keras neural networks. Consider documentation to be identical to that of Model, except where noted

Parameters
model_initializer: keras.wrappers.scikit_learn.KerasClassifier, or keras.wrappers.scikit_learn.KerasRegressor

Expected to implement at least the following methods: 1) __init__, to which initialization_params will usually be provided unless stated otherwise in a child class’s documentation - like KerasModel. 2) fit, to which train_input, and train_target will be provided, in addition to the contents of extra_params['fit'] in some child classes - like XGBoostModel. 3) predict, or predict_proba if applicable, which should accept any array-like input of shape: (<num_samples>, train_input.shape[1])

initialization_params: Dict containing build_fn

A dictionary containing the single key: build_fn, which is a callable function that returns a compiled Keras model

extra_params: Dict, default={}

The parameters expected to be passed to the extra methods of the compiled Keras model. Such methods include (but are not limited to) fit, predict, and predict_proba. Some of the common parameters given here include epochs, batch_size, and callbacks

train_input: pandas.DataFrame

The model’s training input data

train_target: pandas.DataFrame

The true labels corresponding to the rows of train_input

validation_input: pandas.DataFrame, or None

The model’s validation input data to evaluate performance during fitting

validation_target: pandas.DataFrame, or None

The true labels corresponding to the rows of validation_input

do_predict_proba: Boolean, or int, default=False
  • If False, models.Model.fit() will call models.Model.model.predict()

  • If True, it will call models.Model.model.predict_proba(), and the values in all columns will be used as the actual prediction values

  • If int, models.Model.fit() will call models.Model.model.predict_proba(), as is the case when do_predict_proba is True, but the int supplied as do_predict_proba declares the column index to use as the actual prediction values

For example, for a model to call the predict method, do_predict_proba=False (default). For a model to call the predict_proba method, and use all of the class probabilities, do_predict_proba=True. To call the predict_proba method, and use the class probabilities in the first column, do_predict_proba=0. To use the second column (index 1) of the result, do_predict_proba=1 - This often corresponds to the positive class’s probabilities in binary classification problems. To use the third column do_predict_proba=2, and so on.

See the notes for the do_predict_proba parameter of Environment for additional usage notes

target_metric: Tuple

Used by some child classes (like XGBoostModel) to provide validation data to model.fit()

metrics: Dict

Used by some child classes (like XGBoostModel) to provide validation data to model.fit()

Methods

fit(self)

Train model according to extra_params['fit'] (if appropriate) on training data

get_input_shape(self[, get_dim])

Calculate the shape of the input that should be expected by the model

initialize_keras_neural_network(self)

Initialize Keras model wrapper (model_initializer) with initialization_params, extra_params, and validation_data if it can be found, as well as the input dimensions for the model

initialize_model(self)

Create an instance of a model using model_initializer, with initialization_params as input

predict(self, input_data)

Generate model predictions for input_data

validate_keras_params(self)

Ensure provided input parameters are properly formatted

initialize_model(self)

Create an instance of a model using model_initializer, with initialization_params as input

fit(self)

Train model according to extra_params['fit'] (if appropriate) on training data

get_input_shape(self, get_dim=False)

Calculate the shape of the input that should be expected by the model

Parameters
get_dim: Boolean, default=False

If True, instead of returning an input_shape tuple, an input_dim scalar will be returned

Returns
Tuple, or scalar

If get_dim=False, an input_shape tuple. Else, an input_dim scalar
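The difference between the two return forms can be sketched from the shape of the training input alone. This is an assumption-laden illustration: input_shape_sketch is a hypothetical name, and it assumes the first axis of the training data indexes samples, so the expected input shape is everything after that axis.

```python
def input_shape_sketch(train_shape, get_dim=False):
    """Derive an input_shape tuple, or an input_dim scalar, from (num_samples, num_features, ...)."""
    if get_dim:
        # Scalar form: the number of features per sample
        return train_shape[1]
    # Tuple form: every axis except the leading sample axis
    return tuple(train_shape[1:])


shape = input_shape_sketch((100, 8))
dim = input_shape_sketch((100, 8), get_dim=True)
```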

validate_keras_params(self)

Ensure provided input parameters are properly formatted

initialize_keras_neural_network(self)

Initialize Keras model wrapper (model_initializer) with initialization_params, extra_params, and validation_data if it can be found, as well as the input dimensions for the model

hyperparameter_hunter.sentinels module

This module defines Sentinel objects that are used to represent data that is not yet available. For example, hyperparameter_hunter.sentinels.DatasetSentinel is used in hyperparameter_hunter.environment.Environment to enable a user to pass the fold validation dataset as an argument on Experiment initialization. At the point that the sentinel is provided, the training dataset has not yet been split into folds, which is why the Sentinel is necessary

Related

hyperparameter_hunter.environment

hyperparameter_hunter.environment.Environment has the following properties that utilize hyperparameter_hunter.sentinels.DatasetSentinel: [train_input, train_target, validation_input, validation_target, holdout_input, holdout_target]. These properties can be passed as arguments to Experiment or OptimizationProtocol initialization in order to provide the dataset to a Model’s fit call, for example

hyperparameter_hunter.experiments

This is one of the points at which one might want to use the Sentinels exposed by hyperparameter_hunter.environment.Environment, specifically as values in the model_init_params and model_extra_params arguments to a descendant of hyperparameter_hunter.experiments.BaseExperiment

hyperparameter_hunter.optimization.protocol_core

This is a second point at which one might use the Sentinels exposed by hyperparameter_hunter.environment.Environment. In this case, they could be provided as values in the model_init_params and model_extra_params arguments in a call to hyperparameter_hunter.optimization.protocol_core.BaseOptPro.forge_experiment(), the structure of which intentionally mirrors that of hyperparameter_hunter.experiments.BaseExperiment.__init__()

hyperparameter_hunter.models

This is ultimately where Sentinel instances will be converted to the actual values that they represent via calls to hyperparameter_hunter.sentinels.locate_sentinels()

class hyperparameter_hunter.sentinels.Sentinel(*args, **kwargs)

Bases: object

Base class for Sentinels representing data that is not yet available. Subclasses should call super().__init__() at the end of their __init__ methods

Parameters
*args: List

Extra arguments for subclasses of Sentinel

**kwargs: Dict

Extra keyword arguments for subclasses of Sentinel

Attributes
sentinel

Retrieve Sentinel._sentinel

Methods

retrieve_by_sentinel(self)

Retrieve the actual object represented by the sentinel

property sentinel

Retrieve Sentinel._sentinel

Returns
Str

The value of Sentinel._sentinel

abstract retrieve_by_sentinel(self) → object

Retrieve the actual object represented by the sentinel

Returns
object

The object for which the sentinel was being used as a placeholder

hyperparameter_hunter.sentinels.locate_sentinels(parameters)

Produce a mirrored parameters dict, wherein Sentinel values are converted to the objects they represent

Parameters
parameters: Dict

Dict of parameters, which may contain nested Sentinel values

Returns
Dict

Mirror of parameters, except where a Sentinel was found, the value it represents is returned instead
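The mirroring behavior can be sketched as a recursive walk over nested containers. This is a minimal sketch, not the library's mechanism: ConstantSentinel and resolve_sentinels are hypothetical, and sentinels are detected here with isinstance checks, whereas the real function may identify them differently.

```python
class ConstantSentinel:
    """Hypothetical placeholder that stands in for a value until it is resolved."""

    def __init__(self, value):
        self._value = value

    def retrieve_by_sentinel(self):
        return self._value


def resolve_sentinels(parameters):
    """Build a mirrored structure in which every sentinel is replaced by the object it represents."""
    if isinstance(parameters, ConstantSentinel):
        return parameters.retrieve_by_sentinel()
    if isinstance(parameters, dict):
        return {key: resolve_sentinels(value) for key, value in parameters.items()}
    if isinstance(parameters, (list, tuple)):
        return type(parameters)(resolve_sentinels(value) for value in parameters)
    return parameters


params = {
    "fit": {"eval_set": [(ConstantSentinel([[1.0]]), ConstantSentinel([0]))]},
    "n_estimators": 3,
}
resolved = resolve_sentinels(params)
```

The input dict is left unmodified, so the same parameters can be resolved again for each fold.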

class hyperparameter_hunter.sentinels.DatasetSentinel(dataset_type, dataset_hash, cv_type=None, global_random_seed=None, random_seeds=None)

Bases: hyperparameter_hunter.sentinels.Sentinel

Class to create sentinels representing dataset input/target values

Parameters
dataset_type: Str

Dataset type, suffixed with ‘_input’, or ‘_target’, for which a sentinel should be created. Acceptable values are as follows: [‘train_input’, ‘train_target’, ‘validation_input’, ‘validation_target’, ‘holdout_input’, ‘holdout_target’]

dataset_hash: Str

The hash of the dataset for which a sentinel should be created. This hash is generated while creating hyperparameter_hunter.environment.Environment.cross_experiment_key

cv_type: Str, or None, default=None

If None, dataset_type should be one of [‘holdout_input’, ‘holdout_target’]. Else, should be a string that is one of the following: 1) a string attribute of sklearn.model_selection._split, or 2) a hash produced while creating hyperparameter_hunter.environment.Environment.cross_experiment_key

global_random_seed: Int, or None, default=None

If None, dataset_type should be one of [‘holdout_input’, ‘holdout_target’]. If int, should be hyperparameter_hunter.environment.Environment.global_random_seed

random_seeds: List, or None, default=None

If None, dataset_type should be one of [‘holdout_input’, ‘holdout_target’]. If list, should be hyperparameter_hunter.environment.Environment.random_seeds

Attributes
sentinel

Retrieve Sentinel._sentinel

Methods

retrieve_by_sentinel(self)

Retrieve the actual dataset represented by the sentinel

retrieve_by_sentinel(self)

Retrieve the actual dataset represented by the sentinel

Returns
object

The dataset for which the sentinel was being used as a placeholder

hyperparameter_hunter.settings module

This module is the doorway for other modules to access the information set by the active hyperparameter_hunter.environment.Environment, and to access the appropriate logging methods. Specifically, other modules will most often use hyperparameter_hunter.settings.G to access the aforementioned information. Additionally, this module defines several variables to assist in navigating the ‘HyperparameterHunterAssets’ directory structure

Related

hyperparameter_hunter.environment

This module sets hyperparameter_hunter.settings.G.Env to itself, creating the primary gateway used by other modules to access the active Environment’s information

class hyperparameter_hunter.settings.G

Bases: object

This class defines global attributes that are set upon instantiation of environment.Environment. All attributes contained herein are class variables (not instance variables) because the expectation is for the attributes of this class to be set only once, then referenced by operations that may be executed after instantiating an environment.Environment. This allows functions to be called or classes to be initialized without passing a reference to the currently active Environment, because they check the attributes of this class instead

Attributes
Env: None

This is set to “self” in environment.Environment.__init__(). This fact allows other modules to check if settings.G.Env is None. If None, an environment.Environment has not yet been instantiated. If not None, any attributes or methods of the instantiated Env may be called

save_transformed_predictions: False

Declares format in which a model’s predictions should be saved, with regard to feature_engineering.FeatureEngineer transformations. If no transformation of the target variable takes place (either through feature_engineering.FeatureEngineer, feature_engineering.EngineerStep, or otherwise), then this setting can be ignored.

If save_transformed_predictions is True, and target transformation does occur, then experiment predictions are saved in the same form as the transformed target, which is the form returned directly by a fitted model’s predict method. For example, if target data is label-encoded, and a feature_engineering.EngineerStep is used to one-hot encode the target, then one-hot-encoded predictions will be saved.

Conversely, if save_transformed_predictions is False (default), and target transformation does occur, then experiment predictions are saved in the inverted form of the transformed target, which is the same form as the original target data. Continuing the example of label-encoded target data, and a feature_engineering.EngineerStep to one-hot encode the target, in this case, label-encoded predictions will be saved.

priority_callbacks: Tuple

Intended for internal use only. The contents of this tuple are inserted at the front of an Experiment’s list of callback bases via experiment_core.ExperimentMeta, ahead of even the Experiment’s original base classes. This is used primarily for testing callbacks, but it can also be used if you absolutely need a callback to be placed before the Experiment’s other ancestors in its MRO

log_: print

debug_: print

warn_: print

import_hooks: List

sentinel_registry: List

Methods

debug(content, *args, **kwargs)

Set in environment.Environment.initialize_reporting() to the updated version of reporting.ReportingHandler.debug()

debug_(value, …[, sep, end, file, flush])

Prints the values to a stream, or to sys.stdout by default.

log(content, *args, **kwargs)

Set in environment.Environment.initialize_reporting() to the updated version of reporting.ReportingHandler.log()

log_(value, …[, sep, end, file, flush])

Prints the values to a stream, or to sys.stdout by default.

reset_attributes()

Return the attributes of settings.G to their original values

warn(content, *args, **kwargs)

Set in environment.Environment.initialize_reporting() to the updated version of reporting.ReportingHandler.warn()

warn_()

Issue a warning, or maybe ignore it or raise an exception.

Env = None
save_transformed_predictions = False
priority_callbacks = ()
static log(content, *args, **kwargs)

Set in environment.Environment.initialize_reporting() to the updated version of reporting.ReportingHandler.log()

static debug(content, *args, **kwargs)

Set in environment.Environment.initialize_reporting() to the updated version of reporting.ReportingHandler.debug()

static warn(content, *args, **kwargs)

Set in environment.Environment.initialize_reporting() to the updated version of reporting.ReportingHandler.warn()

log_(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)

Prints the values to a stream, or to sys.stdout by default. Optional keyword arguments: file: a file-like object (stream); defaults to the current sys.stdout. sep: string inserted between values, default a space. end: string appended after the last value, default a newline. flush: whether to forcibly flush the stream.

debug_(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)

Prints the values to a stream, or to sys.stdout by default. Optional keyword arguments: file: a file-like object (stream); defaults to the current sys.stdout. sep: string inserted between values, default a space. end: string appended after the last value, default a newline. flush: whether to forcibly flush the stream.

warn_()

Issue a warning, or maybe ignore it or raise an exception.

import_hooks = ['keras_layer', 'keras_initializer', 'keras_variance_scaling']
sentinel_registry = []
classmethod reset_attributes()

Return the attributes of settings.G to their original values
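The class-variable pattern described for settings.G can be sketched with standard-library Python. All names here (GlobalSettings, FakeEnvironment, active_target_column) are hypothetical stand-ins, and only a few of the documented attributes are mirrored; this is not the library's code.

```python
class GlobalSettings:
    """Holds shared state as class variables, so no instance needs to be passed around."""

    Env = None
    save_transformed_predictions = False
    priority_callbacks = ()

    @classmethod
    def reset_attributes(cls):
        """Return the class variables to their original values."""
        cls.Env = None
        cls.save_transformed_predictions = False
        cls.priority_callbacks = ()


class FakeEnvironment:
    """Stand-in environment that registers itself as the active one on creation."""

    def __init__(self, target_column="target"):
        self.target_column = target_column
        GlobalSettings.Env = self


def active_target_column():
    """Read a setting from the active environment without receiving it as an argument."""
    if GlobalSettings.Env is None:
        raise RuntimeError("No environment has been instantiated yet")
    return GlobalSettings.Env.target_column


FakeEnvironment(target_column="label")
column = active_target_column()
GlobalSettings.reset_attributes()
```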

hyperparameter_hunter.tracers module

This module defines metaclasses used to trace the parameters passed through operation-critical classes that are members of other libraries. These are only used in cases where it is impractical or impossible to effectively retrieve the arguments explicitly provided by a user, as well as the default arguments for the classes being traced. Generally, a tracer metaclass adds attributes to the traced class that collect its default values, along with the arguments provided when the class is created and when an instance is called

Related

hyperparameter_hunter.importer

This module handles the interception of certain imports in order to inject the tracer metaclasses defined in hyperparameter_hunter.tracers into the inheritance structure of objects that need to be traced

class hyperparameter_hunter.tracers.ArgumentTracer

Bases: type

Metaclass to trace the default arguments and explicitly provided arguments of its descendants. It also has special provisions for instantiating dummy models if directed to

Methods

__call__(cls, *args, **kwargs)

Call self as a function.

mro()

Return a type’s method resolution order
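The general idea of tracing default and explicitly provided arguments through a metaclass can be sketched as follows. This is a simplified illustration, not ArgumentTracer itself: ArgumentRecorder, Layer, and the traced_defaults/traced_provided attribute names are all hypothetical, and the dummy-model provisions mentioned above are not modeled.

```python
import inspect


class ArgumentRecorder(type):
    """Metaclass that records default and explicitly provided __init__ arguments."""

    def __call__(cls, *args, **kwargs):
        signature = inspect.signature(cls.__init__)
        # Bind the call's arguments to parameter names (None stands in for the instance)
        provided = dict(signature.bind(None, *args, **kwargs).arguments)
        provided.pop(next(iter(signature.parameters)), None)  # drop the instance parameter
        defaults = {
            name: parameter.default
            for name, parameter in signature.parameters.items()
            if parameter.default is not inspect.Parameter.empty
        }
        instance = super().__call__(*args, **kwargs)
        instance.traced_defaults = defaults
        instance.traced_provided = provided
        return instance


class Layer(metaclass=ArgumentRecorder):
    def __init__(self, units, activation="linear", use_bias=True):
        self.units = units
        self.activation = activation
        self.use_bias = use_bias


layer = Layer(32, activation="relu")
```

After construction, the instance carries both what the caller passed and what the class would have used by default.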

class hyperparameter_hunter.tracers.LocationTracer

Bases: hyperparameter_hunter.tracers.ArgumentTracer

Metaclass to trace the origin of the call to initialize the descending class

Methods

__call__(cls, *args, **kwargs)

Call self as a function.

mro()

Return a type’s method resolution order

Module contents

class hyperparameter_hunter.Environment(train_dataset, environment_params_path=None, *, results_path=None, metrics=None, holdout_dataset=None, test_dataset=None, target_column=None, id_column=None, do_predict_proba=None, prediction_formatter=None, metrics_params=None, cv_type=None, runs=None, global_random_seed=None, random_seeds=None, random_seed_bounds=None, cv_params=None, verbose=None, file_blacklist=None, reporting_params=None, to_csv_params=None, do_full_save=None, experiment_callbacks=None, experiment_recorders=None, save_transformed_metrics=None)

Bases: object

Class to organize the parameters that allow Experiments/OptPros to be fairly compared

Environment is the collective starting point for all of HyperparameterHunter’s biggest and best toys: Experiments and OptimizationProtocols. Without an Environment, neither of these will work.

The Environment is where we declare all the parameters that transcend traditional “hyperparameters”. It houses the stuff without which machine learning can’t even really start. Specifically, Environment cares about 1) the data used for fitting/predicting; 2) the cross-validation scheme used to split the data and fit models; and 3) how to evaluate the predictions made on that data. There are plenty of other goodies documented below, but the absolutely mission-critical parameters concerned with the above tasks are train_dataset, cv_type, cv_params, and metrics. Additionally, it’s important to provide results_path, so Experiment/OptPro results can be saved, which is kind of what HyperparameterHunter is all about

Parameters
train_dataset: pandas.DataFrame, or str path

The training data for the experiment. Will be split into train/holdout data, if applicable, and train/validation data if cross-validation is to be performed. If str, will attempt to read file at path via pandas.read_csv(). For more information on which columns will be used during fitting/predicting, see the “Dataset columns” note in the “Notes” section below

environment_params_path: String path, or None, default=None

If not None and a valid .json filepath containing an object (dict), the file’s contents are treated as the default values for all keys that match any of the below kwargs used to initialize Environment

results_path: String path, or None, default=None

If this is a valid directory path and the results directory has not yet been created, it will be created here. If the path does not end with <ASSETS_DIRNAME>, <ASSETS_DIRNAME> will be appended to it. If <ASSETS_DIRNAME> already exists at this path, new results will also be stored here. If None or invalid, results will not be stored

metrics: Dict, List, or None, default=None

Iterable describing the metrics to be recorded, along with a means to compute the value of each metric. Should be of one of the two following forms:

List Form:

  • [“<metric name>”, “<metric name>”, …]: Where each value is a string that names an attribute in sklearn.metrics

  • [Metric, Metric, …]: Where each value of the list is an instance of metrics.Metric

  • [(<name>, <metric_function>, [<direction>]), (<*args>), …]: Where each value of the list is a tuple of arguments that will be used to instantiate a metrics.Metric. Arguments given in tuples must be in order expected by metrics.Metric: (name, metric_function, direction)

Dict Form:

  • {“<metric name>”: <metric_function>, …}: Where each key is a name for the corresponding metric callable, which is used to compute the value of the metric

  • {“<metric name>”: (<metric_function>, <direction>), …}: Where each key is a name for the corresponding metric callable and direction, all of which are used to instantiate a metrics.Metric

  • {“<metric name>”: “<sklearn metric name>”, …}: Where each key is a name for the metric, and each value is the name of the attribute in sklearn.metrics for which the corresponding key is an alias

  • {“<metric name>”: None, …}: Where each key is the name of the attribute in sklearn.metrics

  • {“<metric name>”: Metric, …}: Where each key names an instance of metrics.Metric. This is the internally-used format to which all other formats will be converted

Metric callable functions should expect inputs of form (target, prediction), and should return floats. See the documentation of metrics.Metric for information regarding expected parameters and types
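The callable-based list and dict forms above can be reduced to one common shape, as the following sketch shows. It is an illustration only: normalize_metrics is a hypothetical helper, the direction strings "max" and "min" are assumed values, the forms that name sklearn.metrics attributes by string are deliberately left out, and the result is a plain tuple rather than a metrics.Metric instance.

```python
def absolute_error(target, prediction):
    """Toy metric with the documented (target, prediction) signature, returning a float."""
    return float(sum(abs(t - p) for t, p in zip(target, prediction)))


def normalize_metrics(metrics):
    """Return {name: (metric_function, direction)}, with direction None when it is not given."""
    if isinstance(metrics, dict):
        items = list(metrics.items())
    else:
        # List form: each entry is a tuple of (name, metric_function, [direction])
        items = [(spec[0], tuple(spec[1:])) for spec in metrics]
    normalized = {}
    for name, spec in items:
        if callable(spec):
            spec = (spec,)
        function, *rest = spec
        if not callable(function):
            raise TypeError(f"Metric {name!r} needs a callable, got {function!r}")
        normalized[name] = (function, rest[0] if rest else None)
    return normalized


from_list = normalize_metrics([("abs_err", absolute_error, "min")])
from_dict = normalize_metrics({"abs_err": (absolute_error, "min"), "plain": absolute_error})
```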

holdout_dataset: pandas.DataFrame, callable, str path, or None, default=None

If pd.DataFrame, this is the holdout dataset. If callable, expects a function that takes (self.train: DataFrame, self.target_column: str) as input and returns the new (self.train: DataFrame, self.holdout: DataFrame). If str, will attempt to read file at path via pandas.read_csv(). Else, there is no holdout set. For more information on which columns will be used during fitting/predicting, see the “Dataset columns” note in the “Notes” section below

test_dataset: pandas.DataFrame, str path, or None, default=None

The testing data for the experiment. Structure should be identical to that of train_dataset, except its target_column column can be empty or non-existent, because test_dataset predictions will never be evaluated. If str, will attempt to read file at path via pandas.read_csv(). For more information on which columns will be used during fitting/predicting, see the “Dataset columns” note in the “Notes” section below

target_column: Str, or list, default=’target’

If str, denotes the column name in all provided datasets (except test) that contains the target output. If list, should be a list of strs designating multiple target columns. For example, in a multi-class classification dataset like UCI’s hand-written digits, target_column would be a list containing ten strings. In this example, the target_column data would be sparse, with a 1 to signify that a sample is a written example of a digit (0-9). For a working example, see ‘hyperparameter_hunter/examples/lib_keras_multi_classification_example.py’

id_column: Str, or None, default=None

If not None, str denoting the column name in all provided datasets containing sample IDs

do_predict_proba: Boolean, or int, default=False
  • If False, models.Model.fit() will call models.Model.model.predict()

  • If True, it will call models.Model.model.predict_proba(), and the values in all columns will be used as the actual prediction values

  • If do_predict_proba is an int, models.Model.fit() will call models.Model.model.predict_proba(), as is the case when do_predict_proba is True, but the int supplied as do_predict_proba declares the column index to use as the actual prediction values

  • For example, for a model to call the predict method, use do_predict_proba=False (the default). To call the predict_proba method and use all of the class probabilities, use do_predict_proba=True. To call the predict_proba method and use only the class probabilities in the first column, use do_predict_proba=0. To use the second column (index 1) of the result, use do_predict_proba=1; this often corresponds to the positive class’s probabilities in binary classification problems. To use the third column, use do_predict_proba=2, and so on
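The rules listed above can be sketched with a small stand-in model. Both select_prediction_values and StubModel are hypothetical names used only for illustration; this is not how models.Model is implemented.

```python
def select_prediction_values(model, inputs, do_predict_proba=False):
    """Hypothetical helper mirroring the documented do_predict_proba rules."""
    # bool is a subclass of int in Python, so booleans are checked by identity
    # before any integer handling.
    if do_predict_proba is False:
        return model.predict(inputs)
    probabilities = model.predict_proba(inputs)
    if do_predict_proba is True:
        return probabilities  # use every column
    if isinstance(do_predict_proba, int):
        return [row[do_predict_proba] for row in probabilities]
    raise TypeError("do_predict_proba must be a bool or an int")


class StubModel:
    """Stand-in for a fitted binary classifier (not a real estimator)."""

    def predict(self, inputs):
        return [int(x > 0) for x in inputs]

    def predict_proba(self, inputs):
        return [[0.2, 0.8] if x > 0 else [0.9, 0.1] for x in inputs]


model = StubModel()
labels = select_prediction_values(model, [-1, 2])
all_columns = select_prediction_values(model, [-1, 2], do_predict_proba=True)
first_column = select_prediction_values(model, [-1, 2], do_predict_proba=0)
second_column = select_prediction_values(model, [-1, 2], do_predict_proba=1)
```

Note that 0 and False select different behavior here, which is why passing an explicit boolean or an explicit column index matters.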

prediction_formatter: Callable, or None, default=None

If callable, expected to have same signature as utils.result_utils.format_predictions(). That is, the callable will receive (raw_predictions: np.array, dataset_df: pd.DataFrame, target_column: str, id_column: str or None) as input and should return a properly formatted prediction DataFrame. The callable uses raw_predictions as the content, dataset_df to provide any id column, and target_column to identify the column in which to place raw_predictions

metrics_params: Dict, or None, default=dict()

Dictionary of extra parameters to provide to metrics.ScoringMixIn.__init__(). metrics must be provided either 1) as an input kwarg to Environment.__init__() (see metrics), or 2) as a key in metrics_params, but not both. An Exception will be raised if both are given, or if neither is given

cv_type: Class or str, default=’KFold’

The class to define cross-validation splits. If str, it must be an attribute of sklearn.model_selection._split, and it must be a cross-validation class that inherits one of the following sklearn classes: BaseCrossValidator, or _RepeatedSplits. Valid str values include ‘KFold’, and ‘RepeatedKFold’, although there are many more. It must implement the following methods: [__init__, split]. If using a custom class, see the following tested sklearn classes for proper implementations: [KFold, StratifiedKFold, RepeatedKFold, RepeatedStratifiedKFold]. The arguments provided to cv_type.__init__() will be Environment.cv_params, which should include the following: [‘n_splits’ <int>, ‘n_repeats’ <int> (if applicable)]. cv_type.split() will receive the following arguments: [BaseExperiment.train_input_data, BaseExperiment.train_target_data]
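A minimal sketch of the two methods a cross-validation class is described as needing, __init__ and split, is shown below using only the standard library. ContiguousKFoldSketch is a hypothetical class; it does not inherit from any sklearn base class, and whether Environment accepts such a class without that inheritance is an assumption not confirmed here.

```python
class ContiguousKFoldSketch:
    """Hypothetical splitter exposing only __init__ and split."""

    def __init__(self, n_splits=3):
        if n_splits < 2:
            raise ValueError("n_splits must be at least 2")
        self.n_splits = n_splits

    def split(self, X, y=None):
        """Yield (train_indices, validation_indices) for each fold."""
        n_samples = len(X)
        indices = list(range(n_samples))
        # Spread any remainder over the first folds
        fold_sizes = [n_samples // self.n_splits] * self.n_splits
        for i in range(n_samples % self.n_splits):
            fold_sizes[i] += 1
        start = 0
        for size in fold_sizes:
            validation = indices[start:start + size]
            train = indices[:start] + indices[start + size:]
            yield train, validation
            start += size


cv_params = dict(n_splits=3)
folds = list(ContiguousKFoldSketch(**cv_params).split(list(range(7))))
```

The class is initialized from cv_params, and split receives the input and target data, matching the calling convention described above.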

runs: Int, default=1

The number of times to fit a model within each fold to perform multiple-run-averaging with different random seeds

global_random_seed: Int, default=32

The initial random seed used just before generating an Experiment’s random_seeds. This ensures consistency for random_seeds between Experiments, without having to explicitly provide it here

random_seeds: None, or List, default=None

If None, random_seeds of the appropriate shape will be created automatically. Else, must be a list of ints of shape (cv_params[‘n_repeats’], cv_params[‘n_splits’], runs). If cv_params does not have the key n_repeats (because standard cross-validation is being used), the value will default to 1. See experiments.BaseExperiment._random_seed_initializer() for info on expected shape
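The expected shape can be illustrated with a short sketch that builds a nested list of seeds. make_random_seeds is a hypothetical helper, not the library's _random_seed_initializer(), and treating the upper bound as inclusive is an assumption.

```python
import random


def make_random_seeds(cv_params, runs, global_random_seed=32, bounds=(0, 100000)):
    """Hypothetical generator of seeds shaped (n_repeats, n_splits, runs)."""
    rng = random.Random(global_random_seed)
    n_repeats = cv_params.get("n_repeats", 1)  # standard CV has no repeats
    return [
        [
            [rng.randint(bounds[0], bounds[1]) for _ in range(runs)]
            for _ in range(cv_params["n_splits"])
        ]
        for _ in range(n_repeats)
    ]


seeds = make_random_seeds(dict(n_splits=5), runs=2)
```

Because the generator is seeded once from global_random_seed, calling the helper again with the same arguments yields the same nested list.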

random_seed_bounds: List, default=[0, 100000]

A list containing two integers: the lower and upper bounds, respectively, for generating an Experiment’s random seeds in experiments.BaseExperiment._random_seed_initializer(). Generally, leave this kwarg alone

cv_params: dict, or None, default=dict()

Parameters provided upon initialization of cv_type. Keys may be any args accepted by cv_type.__init__(). Number of fold splits must be provided via “n_splits”, and number of repeats (if applicable for cv_type) must be provided via “n_repeats”

verbose: Int, boolean, default=3

Verbosity of printing for any experiments performed while this Environment is active

Higher values indicate more frequent logging. Logs are still recorded in the heartbeat file regardless of verbosity level. verbose only dictates which logs are visible in the console. The following table illustrates which types of logging messages will be visible with each verbosity level:

| Verbosity | Keys/IDs | Final Score | Repetitions* | Folds | Runs* | Run Starts* | Result Files | Other |
|:---------:|:--------:|:-----------:|:------------:|:-----:|:-----:|:-----------:|:------------:|:-----:|
|     0     |          |             |              |       |       |             |              |       |
|     1     |    Yes   |     Yes     |              |       |       |             |              |       |
|     2     |    Yes   |     Yes     |      Yes     |  Yes  |       |             |              |       |
|     3     |    Yes   |     Yes     |      Yes     |  Yes  |  Yes  |             |              |       |
|     4     |    Yes   |     Yes     |      Yes     |  Yes  |  Yes  |     Yes     |      Yes     |  Yes  |

*: If such logging is deemed appropriate with the given cross-validation parameters. In other words, repetition/run logging will only be verbose if Environment was given more than one repetition/run, respectively

file_blacklist: List of str, or None, or ‘ALL’, default=None

If list of str, the result files named within are not saved to their respective directory in “<ASSETS_DIRNAME>/Experiments”. If None, all result files are saved. If ‘ALL’, nothing at all will be saved for the Experiments. If the path of the file that initializes an Experiment does not end with a “.py” extension, the Experiment proceeds as if “script_backup” had been added to file_blacklist. This means that backup files will not be created for Jupyter notebooks (or any other non-“.py” files). For info on acceptable values, see validate_file_blacklist()

reporting_params: Dict, default=dict()

Parameters passed to initialize reporting.ReportingHandler

to_csv_params: Dict, default=dict()

Parameters passed to the calls to pandas.frame.DataFrame.to_csv() in recorders. In particular, this is where an Experiment’s final prediction files are saved, so the values here will affect the format of the .csv prediction files. Warning: If to_csv_params contains the key “path_or_buf”, it will be removed. Otherwise, all items are supplied directly to to_csv(), including kwargs it might not be expecting if they are given

do_full_save: None, or callable, default=utils.result_utils.default_do_full_save()

If callable, expected to take an Experiment’s result description dict as input and return a boolean. If None, treated as a callable that returns True. This parameter is used by recorders.DescriptionRecorder to determine whether the Experiment result files following the description should also be created. If do_full_save returns False, result file-saving is stopped early, and only the description is saved. If do_full_save returns True, all files not in file_blacklist are saved normally. This allows you to skip creation of an Experiment’s predictions, logs, and heartbeats if its score does not meet some threshold you set, for example. do_full_save receives the Experiment description dict as input, so for help setting do_full_save, just look into one of your Experiment descriptions
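A threshold-based do_full_save might look like the sketch below. The key path final_evaluations, then oof, then the metric name is an assumed layout of the description dict, and the metric and threshold parameters are hypothetical; inspect one of your own Experiment descriptions before relying on it.

```python
def do_full_save_sketch(description, metric="roc_auc_score", threshold=0.75):
    """Hypothetical do_full_save: True only when the OOF score beats threshold."""
    # The key path below is an assumed layout; check a real description file.
    score = description["final_evaluations"]["oof"][metric]
    return bool(score > threshold)


good = {"final_evaluations": {"oof": {"roc_auc_score": 0.81}}}
poor = {"final_evaluations": {"oof": {"roc_auc_score": 0.62}}}
```

With a callable like this, only Experiments scoring above the threshold would have their predictions, logs, and heartbeats written.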

experiment_callbacks: `LambdaCallback`, or list of `LambdaCallback` (optional)

Callbacks injected directly into Experiments, adding new functionality, or customizing existing processes. Should be a LambdaCallback or a list of such classes. LambdaCallback can be created using callbacks.bases.lambda_callback(), which documents the options for creating callbacks. experiment_callbacks will be added to the MRO of the executed Experiment class by experiment_core.ExperimentMeta at __call__ time, making experiment_callbacks new base classes of the Experiment. See callbacks.bases.lambda_callback() for more information. Note that the Experiments conducted by OptPros will still benefit from experiment_callbacks. The presence of LambdaCallbacks will affect neither Environment keys, nor Experiment keys. In other words, for the purposes of Experiment matching/recording, all other factors being equal, an Experiment with experiment_callbacks is considered identical to an Experiment without, despite whatever custom functionality was added by the LambdaCallbacks
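To make the idea of callbacks becoming base classes concrete, here is a standard-library sketch that builds a new class whose bases include a callback class. All names are hypothetical, and the ordering of bases used by experiment_core.ExperimentMeta is an assumption.

```python
class CoreExperimentSketch:
    """Stand-in for an Experiment class with one lifecycle hook."""

    def __init__(self):
        self.events = []

    def on_run_end(self):
        self.events.append("core on_run_end")


class RecordRunEnd:
    """Stand-in for a callback class contributing to the same hook."""

    def on_run_end(self):
        self.events.append("callback on_run_end")
        super().on_run_end()  # continue along the MRO


def with_callbacks(experiment_class, callbacks):
    """Build a subclass whose bases are the callbacks plus the original class."""
    bases = tuple(callbacks) + (experiment_class,)
    return type(experiment_class.__name__, bases, {})


Patched = with_callbacks(CoreExperimentSketch, [RecordRunEnd])
experiment = Patched()
experiment.on_run_end()
```

Each hook call travels along the method resolution order, so the callback and the original class both get to run.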

experiment_recorders: List, None, default=None

If not None, may be a list whose values are tuples of (<recorders.BaseRecorder descendant>, <str result_path>). The result_path str should be a path relative to results_path that specifies the directory/file in which the product of the custom recorder should be saved. The contents of experiment_recorders will be provided to recorders.RecorderList upon completion of an Experiment, and, if the subclassing documentation in recorders is followed properly, will create or update a result file for the just-executed Experiment

save_transformed_metrics: Boolean (optional)

Declares manner in which a model’s predictions should be evaluated through the provided metrics, with regard to target data transformations. This setting can be ignored if no transformation of the target variable takes place (either through FeatureEngineer, EngineerStep, or otherwise).

The default value of save_transformed_metrics depends on the dtype of the target data in train_dataset. If all target columns are numeric, save_transformed_metrics=False, meaning metric evaluation should use the original/inverted targets and predictions. Else if any target column is non-numeric, save_transformed_metrics=True, meaning evaluation should use the transformed targets and predictions because most metrics require numeric inputs. This is described further in the documentation of the save_transformed_metrics property. A more descriptive name for this may be “calculate_metrics_using_transformed_predictions”, but that’s a bit verbose, even by my standards

Other Parameters
cross_validation_type: …
  • Alias for cv_type

cross_validation_params: …
  • Alias for cv_params

metrics_map: …
  • Alias for metrics

reporting_handler_params: …
  • Alias for reporting_params

root_results_path: …
  • Alias for results_path

Notes

Dataset columns: In order to specify the columns to be used by the three dataset kwargs (train_dataset, holdout_dataset, test_dataset) during fitting and predicting, a few attributes can be used. On Environment initialization, the columns specified by the following kwargs will be separated from the rest of the dataset during training/predicting: 1) target_column, which names the column containing the target output labels for the input data; and 2) id_column, which (if given) represents the name of the column that contains identifying information for each data sample, and should otherwise have no relation to the actual data. Additionally, the feature_selector kwarg of the descendants of hyperparameter_hunter.experiments.BaseExperiment (like hyperparameter_hunter.experiments.CVExperiment) is used to filter out columns of the given datasets prior to fitting. See its documentation for more information, but it can effectively be used to remove any columns from the datasets

Overriding default kwargs at environment_params_path: If you have any of the above kwargs specified in the .json file at environment_params_path (except environment_params_path, which will be ignored), you can override its value by passing it as a kwarg when initializing Environment. The contents at environment_params_path are only used when the matching kwarg supplied at initialization is None. See “/examples/environment_params_path_example.py” for details

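The order of precedence for determining the value of each parameter is as follows, with items at the top having the highest priority, and deferring only to the items below if their own value is None:

  1. kwargs passed directly to Environment.__init__() on initialization

  2. keys of the file at environment_params_path (if it is a valid .json file)

  3. keys of Environment.DEFAULT_PARAMS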
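The fallback from explicit kwargs, to the file at environment_params_path, to DEFAULT_PARAMS can be sketched as below. resolve_params and DEFAULTS_SKETCH are hypothetical stand-ins with only three keys; this is not the body of update_custom_environment_params().

```python
import json
import os
import tempfile

DEFAULTS_SKETCH = {"runs": 1, "verbose": 3, "target_column": "target"}


def resolve_params(kwargs, params_path=None, defaults=DEFAULTS_SKETCH):
    """Hypothetical resolver: kwargs, then the .json file, then defaults."""
    from_file = {}
    if params_path is not None:
        with open(params_path) as f:
            from_file = json.load(f)
    resolved = {}
    for key, default in defaults.items():
        value = kwargs.get(key)
        if value is None:  # defer to the file only when the kwarg is None
            value = from_file.get(key)
        if value is None:  # defer to the defaults only when both are None
            value = default
        resolved[key] = value
    return resolved


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "environment_params.json")
    with open(path, "w") as f:
        json.dump({"runs": 3, "verbose": 1}, f)
    resolved = resolve_params(dict(runs=5, verbose=None), params_path=path)
```

Here runs comes from the explicit kwarg, verbose from the file, and target_column from the defaults.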

do_predict_proba: Because this parameter can be either a boolean or an integer, it is important to explicitly pass booleans rather than truthy or falsey values. Similarly, only pass integers if you intend for the value to be used as a column index. Do not pass 0 to mean False, or 1 to mean True

Attributes
train_input: DatasetSentinel

Sentinel replaced with current train input data during Model fitting/predicting. Commonly given in the model_extra_params kwargs of hyperparameter_hunter.experiments.BaseExperiment or hyperparameter_hunter.optimization.protocol_core.BaseOptPro.forge_experiment() for eval_set-like hyperparameters. Importantly, the actual value of this Sentinel is determined after performing cross-validation data splitting, and after executing FeatureEngineer

train_target: DatasetSentinel

Like train_input, except for current train target data

validation_input: DatasetSentinel

Like train_input, except for current validation input data

validation_target: DatasetSentinel

Like train_input, except for current validation target data

holdout_input: DatasetSentinel

Like train_input, except for current holdout input data

holdout_target: DatasetSentinel

Like train_input, except for current holdout target data

Methods

environment_workflow(self)

Execute all methods required to validate the environment and run Experiments

format_result_paths(self)

Remove paths contained in file_blacklist, and format others to prepare for saving results

generate_cross_experiment_key(self)

Generate a key to describe the current Environment’s cross-experiment parameters

initialize_reporting(self)

Initialize reporting for the Environment and Experiments conducted during its lifetime

update_custom_environment_params(self)

Try to update null parameters from environment_params_path, or DEFAULT_PARAMS

validate_parameters(self)

Ensure the provided parameters are valid and properly formatted

DEFAULT_PARAMS = {'cv_params': {}, 'cv_type': 'KFold', 'do_full_save': <function default_do_full_save>, 'do_predict_proba': False, 'environment_params_path': None, 'file_blacklist': None, 'global_random_seed': 32, 'id_column': None, 'metrics': None, 'metrics_params': {}, 'prediction_formatter': <function format_predictions>, 'random_seed_bounds': [0, 100000], 'random_seeds': None, 'reporting_params': {'console_params': None, 'float_format': '{:.5f}', 'heartbeat_params': None, 'heartbeat_path': None}, 'results_path': None, 'runs': 1, 'save_transformed_metrics': None, 'target_column': 'target', 'to_csv_params': {}, 'verbose': 3}
property results_path
property target_column
property train_dataset
property test_dataset
property holdout_dataset
property file_blacklist
property cv_type
property to_csv_params
property cross_experiment_params
property experiment_callbacks
property save_transformed_metrics

If save_transformed_metrics is True, and target transformation does occur, then experiment metrics are calculated using the transformed targets and predictions, which is the form returned directly by a fitted model’s predict method. For example, if target data is label-encoded, and a feature_engineering.EngineerStep is used to one-hot encode the target, then metrics functions will receive the following as input: (one-hot-encoded targets, one-hot-encoded predictions).

Conversely, if save_transformed_metrics is False, and target transformation does occur, then experiment metrics are calculated using the inverse of the transformed targets and predictions, which is the same form as the original target data. Continuing the example of label-encoded target data, and a feature_engineering.EngineerStep to one-hot encode the target, in this case, metrics functions will receive the following as input: (label-encoded targets, label-encoded predictions)
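The two cases can be illustrated with plain lists. The helpers below are hypothetical and only show which form of the data a metric function would see; the inversion by argmax is an assumption standing in for whatever inverse the EngineerStep defines.

```python
def one_hot(labels, n_classes):
    """Transform label-encoded targets into one-hot rows."""
    return [[1 if i == label else 0 for i in range(n_classes)] for label in labels]


def invert_one_hot(rows):
    """Invert one-hot rows back to label encoding via argmax."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


def metric_inputs(targets, predictions, save_transformed_metrics):
    """Hypothetical selector for the form handed to metric functions."""
    if save_transformed_metrics:
        return targets, predictions  # transformed (one-hot) form
    return invert_one_hot(targets), invert_one_hot(predictions)


transformed_targets = one_hot([2, 0, 1], n_classes=3)
transformed_predictions = [[0, 0, 1], [1, 0, 0], [1, 0, 0]]
as_labels = metric_inputs(transformed_targets, transformed_predictions, False)
as_one_hot = metric_inputs(transformed_targets, transformed_predictions, True)
```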

property train_input

Get a DatasetSentinel representing an Experiment’s fold_train_input

Returns
DatasetSentinel:

A Sentinel that will be converted to hyperparameter_hunter.experiments.BaseExperiment.fold_train_input upon Model initialization

property train_target

Get a DatasetSentinel representing an Experiment’s fold_train_target

Returns
DatasetSentinel:

A Sentinel that will be converted to hyperparameter_hunter.experiments.BaseExperiment.fold_train_target upon Model initialization

property validation_input

Get a DatasetSentinel representing an Experiment’s fold_validation_input

Returns
DatasetSentinel:

A Sentinel that will be converted to hyperparameter_hunter.experiments.BaseExperiment.fold_validation_input upon Model initialization

property validation_target

Get a DatasetSentinel representing an Experiment’s fold_validation_target

Returns
DatasetSentinel:

A Sentinel that will be converted to hyperparameter_hunter.experiments.BaseExperiment.fold_validation_target upon Model initialization

property holdout_input

Get a DatasetSentinel representing an Experiment’s holdout_input_data

Returns
DatasetSentinel:

A Sentinel that will be converted to hyperparameter_hunter.experiments.BaseExperiment.holdout_input_data upon Model initialization

property holdout_target

Get a DatasetSentinel representing an Experiment’s holdout_target_data

Returns
DatasetSentinel:

A Sentinel that will be converted to hyperparameter_hunter.experiments.BaseExperiment.holdout_target_data upon Model initialization

class hyperparameter_hunter.CVExperiment(model_initializer, model_init_params=None, model_extra_params=None, feature_engineer=None, feature_selector=None, notes=None, do_raise_repeated=False, auto_start=True, target_metric=None, callbacks=None)

Bases: hyperparameter_hunter.experiments.BaseCVExperiment

One-off Experimentation base class

Bare-bones Description: Runs the cross-validation scheme defined by Environment, during which 1) Datasets are processed according to feature_engineer; 2) Models are built by instantiating model_initializer with model_init_params; 3) Models are trained on processed data, optionally using parameters from model_extra_params; 4) Results are logged and recorded for each fitting period; 5) Descriptions, predictions, results (both averages and individual periods), etc. are saved.

What’s the Big Deal? The most important takeaway from the above description is that descriptions/results are THOROUGH and REUSABLE. By thorough, I mean that all of a model’s hyperparameters are saved, not just the ones given in model_init_params. This may sound odd, but it’s important because it makes results reusable during optimization, when you may be using a different set of hyperparameters. It helps with other things like preventing duplicate experiments and ensembling, as well. But the big part is that this transforms hyperparameter optimization from an isolated, throwaway process we can only afford when an ML project is sufficiently “mature” to a process that covers the entire lifespan of a project. No Experiment is forgotten or wasted. Optimization is automatically given the data it needs to succeed by drawing on all your past Experiments and optimization rounds.

The Experiment has three primary missions: 1. Act as scaffold for organizing ML Experimentation and optimization 2. Record Experiment descriptions and results 3. Eliminate lots of repetitive/error-prone boilerplate code

Providing a scaffold for the entire ML process is critical because without a standardized format, everything we do looks different. Without a unified scaffold, development is slower, more confusing, and less adaptable. One of the benefits of standardizing the format of ML Experimentation is that it enables us to exhaustively record all the important characteristics of an Experiment, as well as an assortment of customizable result files – all in a way that allows them to be reused in the future.

What About Data/Metrics? Experiments require an active Environment in order to function, from which the Experiment collects important cross-experiment parameters, such as datasets, metrics, cross-validation schemes, and even callbacks to inherit, among many other properties documented in Environment

Parameters
model_initializer: Class, or functools.partial, or class instance

Algorithm class used to initialize a model, such as XGBoost’s XGBRegressor, or SKLearn’s KNeighborsClassifier; although, there are hundreds of possibilities across many different ML libraries. model_initializer is expected to define at least fit and predict methods. model_initializer will be initialized with model_init_params, and its “extra” methods (fit, predict, etc.) will be invoked with parameters in model_extra_params

model_init_params: Dict, or object (optional)

Dictionary of arguments given to create an instance of model_initializer. Any kwargs that are considered valid by the __init__ method of model_initializer are valid in model_init_params.

One of the key features that makes HyperparameterHunter so magical is that ALL hyperparameters in the signature of model_initializer (and their default values) are discovered – whether or not they are explicitly given in model_init_params. Not only does this make Experiment result descriptions incredibly thorough, it also makes optimization smoother, more effective, and far less work for the user. For example, take LightGBM’s LGBMRegressor, with model_init_params=dict(learning_rate=0.2). HyperparameterHunter recognizes that this differs from the default of 0.1. It also recognizes that LGBMRegressor is actually initialized with more than a dozen other hyperparameters we didn’t bother mentioning, and it records their values, too. So if we want to optimize num_leaves tomorrow, the OptPro doesn’t start from scratch. It knows that we ran an Experiment that didn’t explicitly mention num_leaves, but its default value was 31, and it uses this information to fuel optimization – all without us having to manually keep track of tons of janky collections of hyperparameters. In fact, we really don’t need to go out of our way at all. HyperparameterHunter just acts as our faithful lab assistant, keeping track of all the stuff we’d rather not worry about
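The general technique of recovering unspecified hyperparameters from a signature can be sketched with the standard library's inspect module. ToyRegressor and full_hyperparameters are hypothetical, with defaults chosen to echo the example above; this is not HyperparameterHunter's own discovery code.

```python
import inspect


class ToyRegressor:
    """Hypothetical estimator standing in for something like LGBMRegressor."""

    def __init__(self, learning_rate=0.1, num_leaves=31, n_estimators=100):
        self.learning_rate = learning_rate
        self.num_leaves = num_leaves
        self.n_estimators = n_estimators


def full_hyperparameters(model_initializer, model_init_params=None):
    """Merge signature defaults with the explicitly given parameters."""
    signature = inspect.signature(model_initializer.__init__)
    defaults = {
        name: parameter.default
        for name, parameter in signature.parameters.items()
        if name != "self" and parameter.default is not inspect.Parameter.empty
    }
    # Explicit values win over the defaults found in the signature
    return {**defaults, **(model_init_params or {})}


recorded = full_hyperparameters(ToyRegressor, dict(learning_rate=0.2))
```

Even though only learning_rate was given, the recorded dict also contains num_leaves and n_estimators with their default values.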

model_extra_params: Dict (optional)

Dictionary of extra parameters for models’ non-initialization methods (like fit, predict, predict_proba, etc.), and for neural networks. To specify parameters for an extra method, place them in a dict named for the extra method to which the parameters should be given. For example, to call fit with early_stopping_rounds=5, use model_extra_params=dict(fit=dict(early_stopping_rounds=5)).
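How a dict keyed by method name maps onto method calls can be sketched with a stand-in model. ToyModel is hypothetical, and the unpacking shown is an illustration of the nesting, not the library's internal dispatch.

```python
class ToyModel:
    """Hypothetical model that records the keyword arguments fit receives."""

    def fit(self, inputs, targets, **fit_kwargs):
        self.fit_kwargs = fit_kwargs
        return self


model_extra_params = dict(fit=dict(early_stopping_rounds=5))

model = ToyModel()
# Parameters are grouped by method name, so "fit" options go only to fit()
model.fit([[0.0], [1.0]], [0, 1], **model_extra_params.get("fit", {}))
```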

For models whose fit methods have a kwarg like eval_set (such as XGBoost’s), one can use the DatasetSentinel attributes of the current active Environment, documented under its “Attributes” section and under train_input. An example using several DatasetSentinels can be found in HyperparameterHunter’s [XGBoost Classification Example](https://github.com/HunterMcGushion/hyperparameter_hunter/blob/master/examples/xgboost_examples/classification.py)

feature_engineer: `FeatureEngineer`, or list (optional)

Feature engineering/transformation/pre-processing steps to apply to datasets defined in Environment. If list, will be used to initialize FeatureEngineer, and can contain any of the following values:

  1. EngineerStep instance

  2. Function input to EngineerStep

For important information on properly formatting EngineerStep functions, please see the documentation of EngineerStep. OptPros can perform hyperparameter optimization of feature_engineer steps. This capability adds a third allowed value to the above list and is documented in forge_experiment()

feature_selector: List of str, callable, or list of booleans (optional)

Column names to include as input data for all provided DataFrames. If None, feature_selector is set to all columns in train_dataset, less target_column, and id_column. feature_selector is provided as the second argument for calls to pandas.DataFrame.loc when constructing datasets

notes: String (optional)

Additional information about the Experiment that will be saved with the Experiment’s description result file. This serves no purpose other than to facilitate saving Experiment details in a more readable format

do_raise_repeated: Boolean, default=False

If True and this Experiment locates a previous Experiment’s results with matching Environment and Hyperparameter Keys, a RepeatedExperimentError will be raised. Else, a warning will be logged

auto_start: Boolean, default=True

If True, after the Experiment is initialized, it will automatically call BaseExperiment.preparation_workflow(), followed by BaseExperiment.experiment_workflow(), effectively completing all essential tasks without requiring additional method calls

target_metric: Tuple, str, default=(‘oof’, <environment.Environment.metrics[0]>)

Path denoting the metric to be used to compare completed Experiments or to use for certain early stopping procedures in some model classes. The first value should be one of [‘oof’, ‘holdout’, ‘in_fold’]. The second value should be the name of a metric being recorded according to the values supplied in hyperparameter_hunter.environment.Environment.metrics_params. See the documentation for hyperparameter_hunter.metrics.get_formatted_target_metric() for more info. Any values returned by, or used as the target_metric input to this function are acceptable values for target_metric
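The two-part structure of target_metric can be sketched as a small normalizer. normalize_target_metric is hypothetical and is not metrics.get_formatted_target_metric(); in particular, treating a bare string as a metric name evaluated on 'oof' data is an assumption.

```python
def normalize_target_metric(target_metric, metric_names):
    """Hypothetical normalizer producing a (data_type, metric_name) tuple."""
    valid_data_types = ("oof", "holdout", "in_fold")
    if target_metric is None:
        return ("oof", metric_names[0])  # documented default
    if isinstance(target_metric, str):
        target_metric = ("oof", target_metric)  # assumption for bare strings
    data_type, metric_name = target_metric
    if data_type not in valid_data_types:
        raise ValueError(f"Unknown data type: {data_type!r}")
    if metric_name not in metric_names:
        raise ValueError(f"Unknown metric: {metric_name!r}")
    return (data_type, metric_name)


metric_names = ["roc_auc_score", "f1_score"]
default_metric = normalize_target_metric(None, metric_names)
holdout_metric = normalize_target_metric(("holdout", "f1_score"), metric_names)
```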

callbacks: `LambdaCallback`, or list of `LambdaCallback` (optional)

Callbacks injected directly into concrete Experiment (CVExperiment), adding new functionality, or customizing existing processes. Should be a LambdaCallback or a list of such classes. LambdaCallback can be created using callbacks.bases.lambda_callback(), which documents the options for creating callbacks. callbacks will be added to the MRO of the Experiment by experiment_core.ExperimentMeta at __call__ time, making callbacks new base classes of the Experiment. See callbacks.bases.lambda_callback() for more information. The presence of LambdaCallbacks will not affect Experiment keys. In other words, for the purposes of Experiment matching/recording, all other factors being equal, an Experiment with callbacks is considered identical to an Experiment without, despite whatever custom functionality was added by the LambdaCallbacks

See also

hyperparameter_hunter.optimization.protocol_core.BaseOptPro.forge_experiment()

OptPro method to define hyperparameter search scaffold for building Experiments during optimization. This method follows the same format as Experiment initialization, but it adds the ability to provide hyperparameter values as ranges to search over, via subclasses of Dimension. The other notable difference is that forge_experiment removes the auto_start and target_metric kwargs, which is described in the forge_experiment docstring Notes

Environment

Provides critical information on how Experiments should be conducted, as well as the data to be used by Experiments. An Environment must be active before executing any Experiment or OptPro

lambda_callback()

Enables customization of the Experimentation process and access to all Experiment internals through a collection of methods that are invoked at all the important periods over an Experiment’s lifespan. These can be provided via the experiment_callbacks kwarg of Environment, and the callback classes literally get thrown in to the parent classes of the Experiment, so they’re kind of a big deal

Attributes
source_script

Methods

cross_validation_workflow(self)

Execute workflow for cross-validation process, consisting of the following tasks: 1) Create train and validation split indices for all folds, 2) Iterate through folds, performing cv_fold_workflow for each, 3) Average accumulated predictions over fold splits, 4) Evaluate final predictions, 5) Format final predictions to prepare for saving

cv_fold_workflow(self)

Execute workflow for individual fold, consisting of the following tasks: 1) Execute overridden on_fold_start() tasks, 2) Perform cv_run_workflow for each run, 3) Execute overridden on_fold_end() tasks

cv_run_workflow(self)

Execute run workflow, consisting of: 1) Execute overridden on_run_start() tasks, 2) Initialize and fit Model, 3) Execute overridden on_run_end() tasks
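The nesting of the fold and run workflows described above can be outlined as follows. run_cross_validation_sketch is a hypothetical function that only records the order of events; it omits repetitions, data splitting, and prediction averaging.

```python
def run_cross_validation_sketch(n_folds, n_runs, log):
    """Hypothetical outline of the fold/run nesting and hook order."""
    for fold in range(n_folds):
        log.append(f"on_fold_start {fold}")
        for run in range(n_runs):
            log.append(f"on_run_start {fold}.{run}")
            log.append(f"fit model {fold}.{run}")
            log.append(f"on_run_end {fold}.{run}")
        log.append(f"on_fold_end {fold}")
    return log


events = run_cross_validation_sketch(n_folds=2, n_runs=2, log=[])
```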

evaluate(self, data_type, target, prediction)

Apply metric(s) to the given data to calculate the value of the prediction

execute(self)

Execute the fitting protocol for the Experiment, comprising the following: instantiation of learners for each run, preprocessing of data as appropriate, training learners, making predictions, and evaluating and aggregating those predictions and other stats/metrics for later use

experiment_workflow(self)

Define the actual experiment process, including execution, result saving, and cleanup

on_exp_start(self)

Prepare data prior to executing fitting protocol (cross-validation), by 1) Initializing formal datasets attributes, 2) Invoking feature_engineer to perform “pre_cv”-stage preprocessing, and 3) Updating datasets to include their (transformed) counterparts in feature_engineer

on_fold_start(self)

Override on_fold_start() tasks set by experiment_core.ExperimentMeta, consisting of: 1) Split train/validation data, 2) Make copies of holdout/test data for current fold (for feature engineering), 3) Log start, 4) Execute original tasks

on_run_start(self)

Override on_run_start() tasks organized by experiment_core.ExperimentMeta, consisting of: 1) Set random seed and update model parameters according to current seed, 2) Log run start, 3) Execute original tasks

preparation_workflow(self)

Execute all tasks that must take place before the experiment is actually started.

source_script = None
class hyperparameter_hunter.BayesianOptPro(target_metric=None, iterations=1, verbose=1, read_experiments=True, reporter_parameters=None, warn_on_re_ask=False, base_estimator='GP', n_initial_points=10, acquisition_function='gp_hedge', acquisition_optimizer='auto', random_state=32, acquisition_function_kwargs=None, acquisition_optimizer_kwargs=None, n_random_starts='DEPRECATED', callbacks=None, base_estimator_kwargs=None)

Bases: hyperparameter_hunter.optimization.protocol_core.SKOptPro

Base class for SKOpt-based Optimization Protocols

There are two important methods for all SKOptPro descendants that should be invoked after initialization:

  1. forge_experiment()

  2. go()

Parameters
target_metric: Tuple, default=(“oof”, <environment.Environment.metrics[0]>)

Rarely necessary to explicitly provide this, as the default is usually sufficient. Path denoting the metric to be used to compare Experiment performance. The first value should be one of [“oof”, “holdout”, “in_fold”]. The second value should be the name of a metric being recorded according to environment.Environment.metrics_params. See the documentation for metrics.get_formatted_target_metric() for more info. Any values returned by, or given as the target_metric input to, get_formatted_target_metric() are acceptable values for BaseOptPro.target_metric

iterations: Int, default=1

Number of Experiments to conduct during optimization upon invoking BaseOptPro.go()

verbose: {0, 1, 2}, default=1

Verbosity mode for console logging. 0: Silent. 1: Show only logs from the Optimization Protocol. 2: In addition to logs shown when verbose=1, also show the logs from individual Experiments

read_experiments: Boolean, default=True

If True, all Experiment records that fit in the current space and guidelines, and match algorithm_name, will be read in and used to fit any optimizers

reporter_parameters: Dict, or None, default=None

Additional parameters passed to reporting.OptimizationReporter.__init__(). Note: Unless provided explicitly, the key “do_maximize” will be added by default to reporter_parameters, with a value inferred from the direction of target_metric in G.Env.metrics. In nearly all cases, the “do_maximize” key should be ignored, as there are very few reasons to explicitly include it

warn_on_re_ask: Boolean, default=False

If True, and the internal optimizer recommends a point that has already been evaluated on invocation of ask, a warning is logged before recommending a random point. Either way, a random point is used instead of already-evaluated recommendations. However, logging the fact that this has taken place can be useful to indicate that the optimizer may be stalling, especially if it repeatedly recommends the same point. In these cases, if the suggested point is not optimal, it can be helpful to switch to a different OptPro (especially DummyOptPro), which will suggest points using different criteria
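
The re-ask behavior described above can be illustrated with a minimal sketch. This is not the library's implementation: `resolve_re_ask`, `sample_random`, and the example points are hypothetical names introduced only to show that the random replacement always happens, while the warning is optional.

```python
import random
import warnings


def resolve_re_ask(proposed, evaluated, sample_random, warn_on_re_ask=False):
    """Return `proposed` unless it was already evaluated, else a random point."""
    proposed = tuple(proposed)
    if proposed not in evaluated:
        return proposed
    if warn_on_re_ask:
        # Only the logging is optional; the random replacement always happens
        warnings.warn(f"Optimizer re-asked for already-evaluated point {proposed}")
    return tuple(sample_random())


rng = random.Random(32)
evaluated = {(3, 0.1)}  # Hypothetical (max_depth, learning_rate) points


def sample_random():
    # Hypothetical sampler: max_depth in [4, 10], learning_rate in [0.2, 1.0]
    return (rng.randint(4, 10), rng.uniform(0.2, 1.0))


fresh = resolve_re_ask((5, 0.5), evaluated, sample_random)
replaced = resolve_re_ask((3, 0.1), evaluated, sample_random, warn_on_re_ask=False)
```

Here `fresh` is returned unchanged because it has not been evaluated, whereas `replaced` is a randomly sampled point, with or without the warning.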

Other Parameters
base_estimator: {SKLearn Regressor, “GP”, “RF”, “ET”, “GBRT”, “DUMMY”}, default=”GP”

If not string, should inherit from sklearn.base.RegressorMixin. In addition, the predict method should have an optional return_std argument, which returns std(Y | x), along with E[Y | x].

If base_estimator is a string in {“GP”, “RF”, “ET”, “GBRT”, “DUMMY”}, a surrogate model corresponding to the relevant X_minimize function is created

n_initial_points: Int, default=10

Number of complete evaluation points necessary before allowing Experiments to be approximated with base_estimator. Any valid Experiment records found will count as initialization points. If enough Experiment records are not found, additional points will be randomly sampled

acquisition_function: {“LCB”, “EI”, “PI”, “gp_hedge”}, default=“gp_hedge”

Function to minimize over the posterior distribution. Can be any of the following:

  • “LCB”: Lower confidence bound

  • “EI”: Negative expected improvement

  • “PI”: Negative probability of improvement

  • “gp_hedge”: Probabilistically choose one of the above three acquisition functions at every iteration

    • The gains g_i are initialized to zero

    • At every iteration,

      • Each acquisition function is optimised independently to propose a candidate point X_i

      • Out of all these candidate points, the next point X_best is chosen by softmax(eta g_i)

      • After fitting the surrogate model with (X_best, y_best), the gains are updated such that g_i -= mu(X_i)
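
The “gp_hedge” steps listed above can be sketched in plain Python. This is an illustrative sketch rather than the Scikit-Optimize code: the helper names, the value of `eta`, and the predicted means passed to `hedge_update` are all assumptions made for the example.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def hedge_choose(gains, eta, rng):
    """Pick the index of one acquisition function with probability softmax(eta * g_i)."""
    probs = softmax([eta * g for g in gains])
    return rng.choices(range(len(gains)), weights=probs, k=1)[0]


def hedge_update(gains, candidate_means):
    """Apply g_i -= mu(X_i), where mu(X_i) is the surrogate's mean at candidate X_i."""
    return [g - mu for g, mu in zip(gains, candidate_means)]


rng = random.Random(32)
gains = [0.0, 0.0, 0.0]  # One gain per acquisition function: LCB, EI, PI
initial_probs = softmax([1.0 * g for g in gains])  # Uniform while all gains are zero
chosen = hedge_choose(gains, eta=1.0, rng=rng)
gains = hedge_update(gains, candidate_means=[0.4, 0.1, 0.7])  # Hypothetical means
updated_probs = softmax([1.0 * g for g in gains])
```

Because the objective is minimized, the acquisition function whose candidate had the lowest predicted mean (the second one here) ends up with the largest gain and is the most likely to be chosen next.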

acquisition_optimizer: {“sampling”, “lbfgs”, “auto”}, default=”auto”

Method to minimize the acquisition function. The fit model is updated with the optimal value obtained by optimizing acq_func with acq_optimizer

  • “sampling”: acq_func is optimized by computing acq_func at n_points (from acq_optimizer_kwargs) randomly sampled points.

  • “lbfgs”: acq_func is optimized by

    • Randomly sampling n_restarts_optimizer (from acq_optimizer_kwargs) points

    • “lbfgs” is run for 20 iterations with these initial points to find local minima

    • The best of these local minima is used to update the prior

  • “auto”: acq_optimizer is configured on the basis of the base_estimator and the search space. If the space is Categorical or if the provided estimator is based on tree-models, then this is set to “sampling”
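
As a rough illustration of the “sampling” strategy, the sketch below evaluates a toy acquisition function at a batch of random points and keeps the minimizer. `minimize_by_sampling`, `toy_acq`, and the bounds are hypothetical; the batch size of 10000 mirrors the n_points default documented for acquisition_optimizer_kwargs.

```python
import random


def minimize_by_sampling(acq_func, bounds, n_points, rng):
    """Evaluate acq_func at n_points uniformly sampled points and return the best one."""
    candidates = [
        tuple(rng.uniform(low, high) for low, high in bounds) for _ in range(n_points)
    ]
    return min(candidates, key=acq_func)


def toy_acq(x):
    # Hypothetical acquisition surface with its minimum at (0.3, -1.0)
    return (x[0] - 0.3) ** 2 + (x[1] + 1.0) ** 2


rng = random.Random(32)
bounds = [(0.0, 1.0), (-2.0, 2.0)]
best = minimize_by_sampling(toy_acq, bounds, n_points=10000, rng=rng)
```

No gradients are required, which is one reason this strategy suits categorical spaces and tree-based estimators.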

random_state: Int, `RandomState` instance, or None, default=32

Set to something other than None for reproducible results

acquisition_function_kwargs: Dict, or None, default=dict(xi=0.01, kappa=1.96)

Additional arguments passed to the acquisition function

acquisition_optimizer_kwargs: Dict, or None, default=dict(n_points=10000, n_restarts_optimizer=5, n_jobs=1)

Additional arguments passed to the acquisition optimizer

n_random_starts: …

Deprecated since version 3.0.0: Use n_initial_points, instead. Will be removed in 3.2.0

callbacks: Callable, list of callables, or None, default=[]

If callable, then callbacks(self.optimizer_result) is called after each update to optimizer. If list, then each callable is called

base_estimator_kwargs: Dict, or None, default={}

Additional arguments passed to base_estimator when it is initialized

Notes

To provide initial input points for evaluation, individual Experiments can be executed prior to instantiating an Optimization Protocol. The results of these Experiments will automatically be detected and cherished by the optimizer.

SKOptPro and its children in optimization rely heavily on the utilities provided by the Scikit-Optimize library, so thank you to the creators and contributors for their excellent work.

Attributes
search_space_size

The number of different hyperparameter permutations possible given the current hyperparameter search space

source_script

Methods

forge_experiment(self, model_initializer[, …])

Define hyperparameter search scaffold for building Experiments during optimization

go(self[, force_ready])

Execute hyperparameter optimization, building an Experiment for each iteration

source_script = None
class hyperparameter_hunter.GradientBoostedRegressionTreeOptPro(target_metric=None, iterations=1, verbose=1, read_experiments=True, reporter_parameters=None, warn_on_re_ask=False, base_estimator='GBRT', n_initial_points=10, acquisition_function='EI', acquisition_optimizer='sampling', random_state=32, acquisition_function_kwargs=None, acquisition_optimizer_kwargs=None, n_random_starts='DEPRECATED', callbacks=None, base_estimator_kwargs=None)

Bases: hyperparameter_hunter.optimization.protocol_core.SKOptPro

Base class for SKOpt-based Optimization Protocols

There are two important methods for all SKOptPro descendants that should be invoked after initialization:

  1. forge_experiment()

  2. go()

Parameters
target_metric: Tuple, default=(“oof”, <environment.Environment.metrics[0]>)

Rarely necessary to explicitly provide this, as the default is usually sufficient. Path denoting the metric to be used to compare Experiment performance. The first value should be one of [“oof”, “holdout”, “in_fold”]. The second value should be the name of a metric being recorded according to environment.Environment.metrics_params. See the documentation for metrics.get_formatted_target_metric() for more info. Any values returned by, or given as the target_metric input to, get_formatted_target_metric() are acceptable values for BaseOptPro.target_metric

iterations: Int, default=1

Number of Experiments to conduct during optimization upon invoking BaseOptPro.go()

verbose: {0, 1, 2}, default=1

Verbosity mode for console logging. 0: Silent. 1: Show only logs from the Optimization Protocol. 2: In addition to logs shown when verbose=1, also show the logs from individual Experiments

read_experiments: Boolean, default=True

If True, all Experiment records that fit in the current space and guidelines, and match algorithm_name, will be read in and used to fit any optimizers

reporter_parameters: Dict, or None, default=None

Additional parameters passed to reporting.OptimizationReporter.__init__(). Note: Unless provided explicitly, the key “do_maximize” will be added by default to reporter_parameters, with a value inferred from the direction of target_metric in G.Env.metrics. In nearly all cases, the “do_maximize” key should be ignored, as there are very few reasons to explicitly include it

warn_on_re_ask: Boolean, default=False

If True, and the internal optimizer recommends a point that has already been evaluated on invocation of ask, a warning is logged before recommending a random point. Either way, a random point is used instead of already-evaluated recommendations. However, logging the fact that this has taken place can be useful to indicate that the optimizer may be stalling, especially if it repeatedly recommends the same point. In these cases, if the suggested point is not optimal, it can be helpful to switch to a different OptPro (especially DummyOptPro), which will suggest points using different criteria

Other Parameters
base_estimator: {SKLearn Regressor, “GP”, “RF”, “ET”, “GBRT”, “DUMMY”}, default=”GP”

If not string, should inherit from sklearn.base.RegressorMixin. In addition, the predict method should have an optional return_std argument, which returns std(Y | x), along with E[Y | x].

If base_estimator is a string in {“GP”, “RF”, “ET”, “GBRT”, “DUMMY”}, a surrogate model corresponding to the relevant X_minimize function is created

n_initial_points: Int, default=10

Number of complete evaluation points necessary before allowing Experiments to be approximated with base_estimator. Any valid Experiment records found will count as initialization points. If enough Experiment records are not found, additional points will be randomly sampled

acquisition_function: {“LCB”, “EI”, “PI”, “gp_hedge”}, default=“gp_hedge”

Function to minimize over the posterior distribution. Can be any of the following:

  • “LCB”: Lower confidence bound

  • “EI”: Negative expected improvement

  • “PI”: Negative probability of improvement

  • “gp_hedge”: Probabilistically choose one of the above three acquisition functions at every iteration

    • The gains g_i are initialized to zero

    • At every iteration,

      • Each acquisition function is optimised independently to propose a candidate point X_i

      • Out of all these candidate points, the next point X_best is chosen by softmax(eta g_i)

      • After fitting the surrogate model with (X_best, y_best), the gains are updated such that g_i -= mu(X_i)
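
For reference, the three basic acquisition functions named above can be written out in the form that is minimized. This is a sketch of the standard Gaussian formulas, not the library's code; the function names are hypothetical, `mu` and `sigma` stand for the surrogate's predicted mean and standard deviation at a point, `y_best` is the lowest value observed so far, and the `xi` and `kappa` defaults follow the values documented for acquisition_function_kwargs.

```python
import math


def _normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def lcb(mu, sigma, kappa=1.96):
    """Lower confidence bound: optimistic estimate of the objective."""
    return mu - kappa * sigma


def neg_ei(mu, sigma, y_best, xi=0.01):
    """Negative expected improvement over y_best (lower is better)."""
    if sigma <= 0.0:
        return 0.0  # No uncertainty, so no expected improvement
    improvement = y_best - xi - mu
    z = improvement / sigma
    return -(improvement * _normal_cdf(z) + sigma * _normal_pdf(z))


def neg_pi(mu, sigma, y_best, xi=0.01):
    """Negative probability of improving on y_best by at least xi."""
    if sigma <= 0.0:
        return 0.0
    return -_normal_cdf((y_best - xi - mu) / sigma)
```

Larger `kappa` or `xi` values favor exploration, since they reward uncertain regions or demand a bigger margin of improvement.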

acquisition_optimizer: {“sampling”, “lbfgs”, “auto”}, default=”auto”

Method to minimize the acquisition function. The fit model is updated with the optimal value obtained by optimizing acq_func with acq_optimizer

  • “sampling”: acq_func is optimized by computing acq_func at n_points (from acq_optimizer_kwargs) randomly sampled points.

  • “lbfgs”: acq_func is optimized by

    • Randomly sampling n_restarts_optimizer (from acq_optimizer_kwargs) points

    • “lbfgs” is run for 20 iterations with these initial points to find local minima

    • The best of these local minima is used to update the prior

  • “auto”: acq_optimizer is configured on the basis of the base_estimator and the search space. If the space is Categorical or if the provided estimator is based on tree-models, then this is set to “sampling”

random_state: Int, `RandomState` instance, or None, default=32

Set to something other than None for reproducible results

acquisition_function_kwargs: Dict, or None, default=dict(xi=0.01, kappa=1.96)

Additional arguments passed to the acquisition function

acquisition_optimizer_kwargs: Dict, or None, default=dict(n_points=10000, n_restarts_optimizer=5, n_jobs=1)

Additional arguments passed to the acquisition optimizer

n_random_starts: …

Deprecated since version 3.0.0: Use n_initial_points, instead. Will be removed in 3.2.0

callbacks: Callable, list of callables, or None, default=[]

If callable, then callbacks(self.optimizer_result) is called after each update to optimizer. If list, then each callable is called
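
The accepted shapes of callbacks (a single callable, a list of callables, or None) can be handled along these lines. This is a hypothetical sketch; `normalize_callbacks` and `run_callbacks` are not library functions, and the result dict stands in for the real optimizer result object.

```python
def normalize_callbacks(callbacks):
    """Coerce None, a single callable, or a list of callables into a list."""
    if callbacks is None:
        return []
    if callable(callbacks):
        return [callbacks]
    return list(callbacks)


def run_callbacks(callbacks, optimizer_result):
    """Call each callback with the latest optimizer result."""
    for callback in normalize_callbacks(callbacks):
        callback(optimizer_result)


seen = []
run_callbacks(seen.append, {"fun": 0.42})  # Single callable
run_callbacks([seen.append, lambda res: seen.append(res["fun"])], {"fun": 0.17})
run_callbacks(None, {"fun": 0.0})  # No callbacks: nothing happens
```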

base_estimator_kwargs: Dict, or None, default={}

Additional arguments passed to base_estimator when it is initialized

Notes

To provide initial input points for evaluation, individual Experiments can be executed prior to instantiating an Optimization Protocol. The results of these Experiments will automatically be detected and cherished by the optimizer.

SKOptPro and its children in optimization rely heavily on the utilities provided by the Scikit-Optimize library, so thank you to the creators and contributors for their excellent work.

Attributes
search_space_size

The number of different hyperparameter permutations possible given the current hyperparameter search space

source_script

Methods

forge_experiment(self, model_initializer[, …])

Define hyperparameter search scaffold for building Experiments during optimization

go(self[, force_ready])

Execute hyperparameter optimization, building an Experiment for each iteration

source_script = None
hyperparameter_hunter.GBRT

alias of hyperparameter_hunter.optimization.backends.skopt.protocols.GradientBoostedRegressionTreeOptPro

class hyperparameter_hunter.RandomForestOptPro(target_metric=None, iterations=1, verbose=1, read_experiments=True, reporter_parameters=None, warn_on_re_ask=False, base_estimator='RF', n_initial_points=10, acquisition_function='EI', acquisition_optimizer='sampling', random_state=32, acquisition_function_kwargs=None, acquisition_optimizer_kwargs=None, n_random_starts='DEPRECATED', callbacks=None, base_estimator_kwargs=None)

Bases: hyperparameter_hunter.optimization.protocol_core.SKOptPro

Base class for SKOpt-based Optimization Protocols

There are two important methods for all SKOptPro descendants that should be invoked after initialization:

  1. forge_experiment()

  2. go()

Parameters
target_metric: Tuple, default=(“oof”, <environment.Environment.metrics[0]>)

Rarely necessary to explicitly provide this, as the default is usually sufficient. Path denoting the metric to be used to compare Experiment performance. The first value should be one of [“oof”, “holdout”, “in_fold”]. The second value should be the name of a metric being recorded according to environment.Environment.metrics_params. See the documentation for metrics.get_formatted_target_metric() for more info. Any values returned by, or given as the target_metric input to, get_formatted_target_metric() are acceptable values for BaseOptPro.target_metric

iterations: Int, default=1

Number of Experiments to conduct during optimization upon invoking BaseOptPro.go()

verbose: {0, 1, 2}, default=1

Verbosity mode for console logging. 0: Silent. 1: Show only logs from the Optimization Protocol. 2: In addition to logs shown when verbose=1, also show the logs from individual Experiments

read_experiments: Boolean, default=True

If True, all Experiment records that fit in the current space and guidelines, and match algorithm_name, will be read in and used to fit any optimizers

reporter_parameters: Dict, or None, default=None

Additional parameters passed to reporting.OptimizationReporter.__init__(). Note: Unless provided explicitly, the key “do_maximize” will be added by default to reporter_parameters, with a value inferred from the direction of target_metric in G.Env.metrics. In nearly all cases, the “do_maximize” key should be ignored, as there are very few reasons to explicitly include it

warn_on_re_ask: Boolean, default=False

If True, and the internal optimizer recommends a point that has already been evaluated on invocation of ask, a warning is logged before recommending a random point. Either way, a random point is used instead of already-evaluated recommendations. However, logging the fact that this has taken place can be useful to indicate that the optimizer may be stalling, especially if it repeatedly recommends the same point. In these cases, if the suggested point is not optimal, it can be helpful to switch to a different OptPro (especially DummyOptPro), which will suggest points using different criteria

Other Parameters
base_estimator: {SKLearn Regressor, “GP”, “RF”, “ET”, “GBRT”, “DUMMY”}, default=”GP”

If not string, should inherit from sklearn.base.RegressorMixin. In addition, the predict method should have an optional return_std argument, which returns std(Y | x), along with E[Y | x].

If base_estimator is a string in {“GP”, “RF”, “ET”, “GBRT”, “DUMMY”}, a surrogate model corresponding to the relevant X_minimize function is created

n_initial_points: Int, default=10

Number of complete evaluation points necessary before allowing Experiments to be approximated with base_estimator. Any valid Experiment records found will count as initialization points. If enough Experiment records are not found, additional points will be randomly sampled
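
The interplay between saved Experiment records and random sampling can be sketched as follows. The helper `initial_points` and the point format are assumptions for illustration only; the sketch simply tops up the saved points with random ones until n_initial_points is reached.

```python
import random


def initial_points(saved_points, n_initial_points, sample_random):
    """Return saved points plus enough random ones to reach n_initial_points."""
    n_missing = max(0, n_initial_points - len(saved_points))
    return list(saved_points) + [sample_random() for _ in range(n_missing)]


rng = random.Random(32)
saved = [(3, 0.1), (5, 0.3), (7, 0.05)]  # Hypothetical points from earlier Experiments


def sample_random():
    # Hypothetical sampler: max_depth in [2, 10], learning_rate in [0.01, 0.5]
    return (rng.randint(2, 10), rng.uniform(0.01, 0.5))


points = initial_points(saved, n_initial_points=10, sample_random=sample_random)
```

With three saved records and n_initial_points=10, seven random points are added before the surrogate model is allowed to take over.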

acquisition_function: {“LCB”, “EI”, “PI”, “gp_hedge”}, default=“gp_hedge”

Function to minimize over the posterior distribution. Can be any of the following:

  • “LCB”: Lower confidence bound

  • “EI”: Negative expected improvement

  • “PI”: Negative probability of improvement

  • “gp_hedge”: Probabilistically choose one of the above three acquisition functions at every iteration

    • The gains g_i are initialized to zero

    • At every iteration,

      • Each acquisition function is optimised independently to propose a candidate point X_i

      • Out of all these candidate points, the next point X_best is chosen by softmax(eta g_i)

      • After fitting the surrogate model with (X_best, y_best), the gains are updated such that g_i -= mu(X_i)

acquisition_optimizer: {“sampling”, “lbfgs”, “auto”}, default=”auto”

Method to minimize the acquisition function. The fit model is updated with the optimal value obtained by optimizing acq_func with acq_optimizer

  • “sampling”: acq_func is optimized by computing acq_func at n_points (from acq_optimizer_kwargs) randomly sampled points.

  • “lbfgs”: acq_func is optimized by

    • Randomly sampling n_restarts_optimizer (from acq_optimizer_kwargs) points

    • “lbfgs” is run for 20 iterations with these initial points to find local minima

    • The best of these local minima is used to update the prior

  • “auto”: acq_optimizer is configured on the basis of the base_estimator and the search space. If the space is Categorical or if the provided estimator is based on tree-models, then this is set to “sampling”

random_state: Int, `RandomState` instance, or None, default=32

Set to something other than None for reproducible results

acquisition_function_kwargs: Dict, or None, default=dict(xi=0.01, kappa=1.96)

Additional arguments passed to the acquisition function

acquisition_optimizer_kwargs: Dict, or None, default=dict(n_points=10000, n_restarts_optimizer=5, n_jobs=1)

Additional arguments passed to the acquisition optimizer

n_random_starts: …

Deprecated since version 3.0.0: Use n_initial_points, instead. Will be removed in 3.2.0

callbacks: Callable, list of callables, or None, default=[]

If callable, then callbacks(self.optimizer_result) is called after each update to optimizer. If list, then each callable is called

base_estimator_kwargs: Dict, or None, default={}

Additional arguments passed to base_estimator when it is initialized

Notes

To provide initial input points for evaluation, individual Experiments can be executed prior to instantiating an Optimization Protocol. The results of these Experiments will automatically be detected and cherished by the optimizer.

SKOptPro and its children in optimization rely heavily on the utilities provided by the Scikit-Optimize library, so thank you to the creators and contributors for their excellent work.

Attributes
search_space_size

The number of different hyperparameter permutations possible given the current hyperparameter search space

source_script

Methods

forge_experiment(self, model_initializer[, …])

Define hyperparameter search scaffold for building Experiments during optimization

go(self[, force_ready])

Execute hyperparameter optimization, building an Experiment for each iteration

source_script = None
hyperparameter_hunter.RF

alias of hyperparameter_hunter.optimization.backends.skopt.protocols.RandomForestOptPro

class hyperparameter_hunter.ExtraTreesOptPro(target_metric=None, iterations=1, verbose=1, read_experiments=True, reporter_parameters=None, warn_on_re_ask=False, base_estimator='ET', n_initial_points=10, acquisition_function='EI', acquisition_optimizer='sampling', random_state=32, acquisition_function_kwargs=None, acquisition_optimizer_kwargs=None, n_random_starts='DEPRECATED', callbacks=None, base_estimator_kwargs=None)

Bases: hyperparameter_hunter.optimization.protocol_core.SKOptPro

Base class for SKOpt-based Optimization Protocols

There are two important methods for all SKOptPro descendants that should be invoked after initialization:

  1. forge_experiment()

  2. go()

Parameters
target_metric: Tuple, default=(“oof”, <environment.Environment.metrics[0]>)

Rarely necessary to explicitly provide this, as the default is usually sufficient. Path denoting the metric to be used to compare Experiment performance. The first value should be one of [“oof”, “holdout”, “in_fold”]. The second value should be the name of a metric being recorded according to environment.Environment.metrics_params. See the documentation for metrics.get_formatted_target_metric() for more info. Any values returned by, or given as the target_metric input to, get_formatted_target_metric() are acceptable values for BaseOptPro.target_metric
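
To make the two-part structure of target_metric concrete, here is a small validation sketch. It is not metrics.get_formatted_target_metric(); the helpers `check_target_metric` and `default_target_metric` and the metric names are hypothetical, and only the three dataset prefixes come from the description above.

```python
VALID_DATASETS = ("oof", "holdout", "in_fold")


def default_target_metric(recorded_metrics):
    """Mirror the documented default: ("oof", <first recorded metric>)."""
    return ("oof", next(iter(recorded_metrics)))


def check_target_metric(target_metric, recorded_metrics):
    """Validate a (dataset, metric_name) path against the recorded metrics."""
    dataset, metric_name = target_metric
    if dataset not in VALID_DATASETS:
        raise ValueError(f"First value must be one of {VALID_DATASETS}, not {dataset!r}")
    if metric_name not in recorded_metrics:
        raise ValueError(f"Metric {metric_name!r} is not being recorded")
    return (dataset, metric_name)


recorded = ["roc_auc_score", "f1_score"]  # Hypothetical metrics being recorded
default = default_target_metric(recorded)
explicit = check_target_metric(("holdout", "f1_score"), recorded)
```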

iterations: Int, default=1

Number of Experiments to conduct during optimization upon invoking BaseOptPro.go()

verbose: {0, 1, 2}, default=1

Verbosity mode for console logging. 0: Silent. 1: Show only logs from the Optimization Protocol. 2: In addition to logs shown when verbose=1, also show the logs from individual Experiments

read_experiments: Boolean, default=True

If True, all Experiment records that fit in the current space and guidelines, and match algorithm_name, will be read in and used to fit any optimizers

reporter_parameters: Dict, or None, default=None

Additional parameters passed to reporting.OptimizationReporter.__init__(). Note: Unless provided explicitly, the key “do_maximize” will be added by default to reporter_parameters, with a value inferred from the direction of target_metric in G.Env.metrics. In nearly all cases, the “do_maximize” key should be ignored, as there are very few reasons to explicitly include it

warn_on_re_ask: Boolean, default=False

If True, and the internal optimizer recommends a point that has already been evaluated on invocation of ask, a warning is logged before recommending a random point. Either way, a random point is used instead of already-evaluated recommendations. However, logging the fact that this has taken place can be useful to indicate that the optimizer may be stalling, especially if it repeatedly recommends the same point. In these cases, if the suggested point is not optimal, it can be helpful to switch to a different OptPro (especially DummyOptPro), which will suggest points using different criteria

Other Parameters
base_estimator: {SKLearn Regressor, “GP”, “RF”, “ET”, “GBRT”, “DUMMY”}, default=”GP”

If not string, should inherit from sklearn.base.RegressorMixin. In addition, the predict method should have an optional return_std argument, which returns std(Y | x), along with E[Y | x].

If base_estimator is a string in {“GP”, “RF”, “ET”, “GBRT”, “DUMMY”}, a surrogate model corresponding to the relevant X_minimize function is created

n_initial_points: Int, default=10

Number of complete evaluation points necessary before allowing Experiments to be approximated with base_estimator. Any valid Experiment records found will count as initialization points. If enough Experiment records are not found, additional points will be randomly sampled

acquisition_function: {“LCB”, “EI”, “PI”, “gp_hedge”}, default=“gp_hedge”

Function to minimize over the posterior distribution. Can be any of the following:

  • “LCB”: Lower confidence bound

  • “EI”: Negative expected improvement

  • “PI”: Negative probability of improvement

  • “gp_hedge”: Probabilistically choose one of the above three acquisition functions at every iteration

    • The gains g_i are initialized to zero

    • At every iteration,

      • Each acquisition function is optimised independently to propose a candidate point X_i

      • Out of all these candidate points, the next point X_best is chosen by softmax(eta g_i)

      • After fitting the surrogate model with (X_best, y_best), the gains are updated such that g_i -= mu(X_i)

acquisition_optimizer: {“sampling”, “lbfgs”, “auto”}, default=”auto”

Method to minimize the acquisition function. The fit model is updated with the optimal value obtained by optimizing acq_func with acq_optimizer

  • “sampling”: acq_func is optimized by computing acq_func at n_points (from acq_optimizer_kwargs) randomly sampled points.

  • “lbfgs”: acq_func is optimized by

    • Randomly sampling n_restarts_optimizer (from acq_optimizer_kwargs) points

    • “lbfgs” is run for 20 iterations with these initial points to find local minima

    • The best of these local minima is used to update the prior

  • “auto”: acq_optimizer is configured on the basis of the base_estimator and the search space. If the space is Categorical or if the provided estimator is based on tree-models, then this is set to “sampling”

random_state: Int, `RandomState` instance, or None, default=32

Set to something other than None for reproducible results

acquisition_function_kwargs: Dict, or None, default=dict(xi=0.01, kappa=1.96)

Additional arguments passed to the acquisition function

acquisition_optimizer_kwargs: Dict, or None, default=dict(n_points=10000, n_restarts_optimizer=5, n_jobs=1)

Additional arguments passed to the acquisition optimizer

n_random_starts: …

Deprecated since version 3.0.0: Use n_initial_points, instead. Will be removed in 3.2.0

callbacks: Callable, list of callables, or None, default=[]

If callable, then callbacks(self.optimizer_result) is called after each update to optimizer. If list, then each callable is called

base_estimator_kwargs: Dict, or None, default={}

Additional arguments passed to base_estimator when it is initialized

Notes

To provide initial input points for evaluation, individual Experiments can be executed prior to instantiating an Optimization Protocol. The results of these Experiments will automatically be detected and cherished by the optimizer.

SKOptPro and its children in optimization rely heavily on the utilities provided by the Scikit-Optimize library, so thank you to the creators and contributors for their excellent work.

Attributes
search_space_size

The number of different hyperparameter permutations possible given the current hyperparameter search space

source_script

Methods

forge_experiment(self, model_initializer[, …])

Define hyperparameter search scaffold for building Experiments during optimization

go(self[, force_ready])

Execute hyperparameter optimization, building an Experiment for each iteration

source_script = None
hyperparameter_hunter.ET

alias of hyperparameter_hunter.optimization.backends.skopt.protocols.ExtraTreesOptPro

class hyperparameter_hunter.DummyOptPro(target_metric=None, iterations=1, verbose=1, read_experiments=True, reporter_parameters=None, warn_on_re_ask=False, base_estimator='DUMMY', n_initial_points=10, acquisition_function='EI', acquisition_optimizer='sampling', random_state=32, acquisition_function_kwargs=None, acquisition_optimizer_kwargs=None, n_random_starts='DEPRECATED', callbacks=None, base_estimator_kwargs=None)

Bases: hyperparameter_hunter.optimization.protocol_core.SKOptPro

Base class for SKOpt-based Optimization Protocols

There are two important methods for all SKOptPro descendants that should be invoked after initialization:

  1. forge_experiment()

  2. go()

Parameters
target_metric: Tuple, default=(“oof”, <environment.Environment.metrics[0]>)

Rarely necessary to explicitly provide this, as the default is usually sufficient. Path denoting the metric to be used to compare Experiment performance. The first value should be one of [“oof”, “holdout”, “in_fold”]. The second value should be the name of a metric being recorded according to environment.Environment.metrics_params. See the documentation for metrics.get_formatted_target_metric() for more info. Any values returned by, or given as the target_metric input to, get_formatted_target_metric() are acceptable values for BaseOptPro.target_metric

iterations: Int, default=1

Number of Experiments to conduct during optimization upon invoking BaseOptPro.go()

verbose: {0, 1, 2}, default=1

Verbosity mode for console logging. 0: Silent. 1: Show only logs from the Optimization Protocol. 2: In addition to logs shown when verbose=1, also show the logs from individual Experiments

read_experiments: Boolean, default=True

If True, all Experiment records that fit in the current space and guidelines, and match algorithm_name, will be read in and used to fit any optimizers
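
The filtering implied by read_experiments can be pictured with the sketch below. The record layout, the `(low, high)` space format, and both helper functions are hypothetical simplifications: the real matching also takes the “guidelines” into account, which this sketch does not model.

```python
def fits_space(params, space):
    """True if every dimension in `space` has a value in `params` within its bounds."""
    return all(
        name in params and low <= params[name] <= high
        for name, (low, high) in space.items()
    )


def select_records(records, algorithm_name, space):
    """Keep records for the same algorithm whose hyperparameters fit the space."""
    return [
        record
        for record in records
        if record["algorithm_name"] == algorithm_name
        and fits_space(record["params"], space)
    ]


space = {"max_depth": (2, 10), "learning_rate": (0.01, 0.5)}
records = [
    {"algorithm_name": "XGBClassifier", "params": {"max_depth": 4, "learning_rate": 0.1}},
    {"algorithm_name": "XGBClassifier", "params": {"max_depth": 20, "learning_rate": 0.1}},
    {"algorithm_name": "SVC", "params": {"max_depth": 4, "learning_rate": 0.1}},
]
usable = select_records(records, "XGBClassifier", space)
```

Only the first record survives: the second falls outside the max_depth bounds and the third belongs to a different algorithm.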

reporter_parameters: Dict, or None, default=None

Additional parameters passed to reporting.OptimizationReporter.__init__(). Note: Unless provided explicitly, the key “do_maximize” will be added by default to reporter_parameters, with a value inferred from the direction of target_metric in G.Env.metrics. In nearly all cases, the “do_maximize” key should be ignored, as there are very few reasons to explicitly include it

warn_on_re_ask: Boolean, default=False

If True, and the internal optimizer recommends a point that has already been evaluated on invocation of ask, a warning is logged before recommending a random point. Either way, a random point is used instead of already-evaluated recommendations. However, logging the fact that this has taken place can be useful to indicate that the optimizer may be stalling, especially if it repeatedly recommends the same point. In these cases, if the suggested point is not optimal, it can be helpful to switch to a different OptPro (especially DummyOptPro), which will suggest points using different criteria

Other Parameters
base_estimator: {SKLearn Regressor, “GP”, “RF”, “ET”, “GBRT”, “DUMMY”}, default=”GP”

If not string, should inherit from sklearn.base.RegressorMixin. In addition, the predict method should have an optional return_std argument, which returns std(Y | x), along with E[Y | x].

If base_estimator is a string in {“GP”, “RF”, “ET”, “GBRT”, “DUMMY”}, a surrogate model corresponding to the relevant X_minimize function is created

n_initial_points: Int, default=10

Number of complete evaluation points necessary before allowing Experiments to be approximated with base_estimator. Any valid Experiment records found will count as initialization points. If enough Experiment records are not found, additional points will be randomly sampled

acquisition_function: {“LCB”, “EI”, “PI”, “gp_hedge”}, default=“gp_hedge”

Function to minimize over the posterior distribution. Can be any of the following:

  • “LCB”: Lower confidence bound

  • “EI”: Negative expected improvement

  • “PI”: Negative probability of improvement

  • “gp_hedge”: Probabilistically choose one of the above three acquisition functions at every iteration

    • The gains g_i are initialized to zero

    • At every iteration,

      • Each acquisition function is optimised independently to propose a candidate point X_i

      • Out of all these candidate points, the next point X_best is chosen by softmax(eta g_i)

      • After fitting the surrogate model with (X_best, y_best), the gains are updated such that g_i -= mu(X_i)

acquisition_optimizer: {“sampling”, “lbfgs”, “auto”}, default=”auto”

Method to minimize the acquisition function. The fit model is updated with the optimal value obtained by optimizing acq_func with acq_optimizer

  • “sampling”: acq_func is optimized by computing acq_func at n_points (from acq_optimizer_kwargs) randomly sampled points.

  • “lbfgs”: acq_func is optimized by

    • Randomly sampling n_restarts_optimizer (from acq_optimizer_kwargs) points

    • “lbfgs” is run for 20 iterations with these initial points to find local minima

    • The best of these local minima is used to update the prior

  • “auto”: acq_optimizer is configured on the basis of the base_estimator and the search space. If the space is Categorical or if the provided estimator is based on tree-models, then this is set to “sampling”

random_state: Int, `RandomState` instance, or None, default=32

Set to something other than None for reproducible results

acquisition_function_kwargs: Dict, or None, default=dict(xi=0.01, kappa=1.96)

Additional arguments passed to the acquisition function

acquisition_optimizer_kwargs: Dict, or None, default=dict(n_points=10000, n_restarts_optimizer=5, n_jobs=1)

Additional arguments passed to the acquisition optimizer

n_random_starts: …

Deprecated since version 3.0.0: Use n_initial_points, instead. Will be removed in 3.2.0

callbacks: Callable, list of callables, or None, default=[]

If callable, then callbacks(self.optimizer_result) is called after each update to optimizer. If list, then each callable is called

base_estimator_kwargs: Dict, or None, default={}

Additional arguments passed to base_estimator when it is initialized

Notes

To provide initial input points for evaluation, individual Experiments can be executed prior to instantiating an Optimization Protocol. The results of these Experiments will automatically be detected and cherished by the optimizer.

SKOptPro and its children in optimization rely heavily on the utilities provided by the Scikit-Optimize library, so thank you to the creators and contributors for their excellent work.

Attributes
search_space_size

The number of different hyperparameter permutations possible given the current

source_script

Methods

forge_experiment(self, model_initializer[, …])

Define hyperparameter search scaffold for building Experiments during optimization

go(self[, force_ready])

Execute hyperparameter optimization, building an Experiment for each iteration

source_script = None
class hyperparameter_hunter.Real(low, high, prior='uniform', transform='identity', name=None)

Bases: hyperparameter_hunter.space.dimensions.NumericalDimension

Search space dimension that can assume any real value in a given range

Parameters
low: Float

Lower bound (inclusive)

high: Float

Upper bound (inclusive)

prior: {“uniform”, “log-uniform”}, default=”uniform”

Distribution to use when sampling random points for this dimension. If “uniform”, points are sampled uniformly between the lower and upper bounds. If “log-uniform”, points are sampled uniformly between log10(lower) and log10(upper)

transform: {“identity”, “normalize”}, default=”identity”

Transformation to apply to the original space. If “identity”, the transformed space is the same as the original space. If “normalize”, the transformed space is scaled between 0 and 1

name: String, tuple, or None, default=None

A name associated with the dimension
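The difference between the two prior values can be shown with a small standard-library sketch. `sample_real` is a hypothetical helper written for this illustration, not part of the library; it only restates the sampling rule described for prior.

```python
import math
import random


def sample_real(low, high, prior="uniform", rng=random):
    """Draw one value in [low, high] following the documented `prior` behaviour."""
    if prior == "uniform":
        return rng.uniform(low, high)
    if prior == "log-uniform":
        # Sample the exponent uniformly, then map back to the original scale
        exponent = rng.uniform(math.log10(low), math.log10(high))
        return 10 ** exponent
    raise ValueError(f"Unknown prior: {prior!r}")


rng = random.Random(32)
samples = [sample_real(1e-4, 1.0, prior="log-uniform", rng=rng) for _ in range(1000)]
# Roughly half of log-uniform samples fall below the geometric midpoint, 1e-2
share_below_midpoint = sum(s < 1e-2 for s in samples) / len(samples)
```

A "log-uniform" prior is usually the better fit for hyperparameters such as learning rates, whose plausible values span several orders of magnitude.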

Attributes
distribution: rv_generic

See documentation of _make_distribution() or distribution()

transform_: String

Original value passed through the transform kwarg - Because transform() exists

transformer: Transformer

See documentation of _make_transformer() or transformer()

Methods

distance(self, a, b)

Calculate distance between two points in the dimension’s bounds

get_params(self)

Get dict of parameters used to initialize the Real, or their defaults

inverse_transform(self, data_t)

Inverse transform samples from the warped space back to the original space

rvs(self[, n_samples, random_state])

Draw random samples.

transform(self, data)

Transform samples from the original space into a warped space

inverse_transform(self, data_t)

Inverse transform samples from the warped space back to the original space

Parameters
data_t: List

Samples to inverse transform. Should be of shape (<# samples>, transformed_size)

Returns
List

Samples transformed back to original space. Will be shape (<# samples>, size)

property transformed_bounds

Dimension bounds in the warped space

Returns
low: Float

0.0 if transform_="normalize". If transform_="identity" and prior="uniform", then low. Else log10(low)

high: Float

1.0 if transform_="normalize". If transform_="identity" and prior="uniform", then high. Else log10(high)
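The return rules for transformed_bounds can be restated as a short function. This is a sketch of the documented behaviour only; `real_transformed_bounds` is a hypothetical name and does not reflect how the property is implemented internally.

```python
import math


def real_transformed_bounds(low, high, prior="uniform", transform="identity"):
    """Return (low, high) of a real dimension in the warped space, per the rules above."""
    if transform == "normalize":
        return 0.0, 1.0
    if prior == "uniform":
        return low, high
    # "identity" transform combined with a "log-uniform" prior
    return math.log10(low), math.log10(high)


normalized = real_transformed_bounds(2.0, 5.0, transform="normalize")
unchanged = real_transformed_bounds(2.0, 5.0)
log_scaled = real_transformed_bounds(0.001, 10.0, prior="log-uniform")
```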

get_params(self) → dict

Get dict of parameters used to initialize the Real, or their defaults

class hyperparameter_hunter.Integer(low, high, transform='identity', name=None)

Bases: hyperparameter_hunter.space.dimensions.NumericalDimension

Search space dimension that can assume any integer value in a given range

Parameters
low: Int

Lower bound (inclusive)

high: Int

Upper bound (inclusive)

transform: {“identity”, “normalize”}, default=”identity”

Transformation to apply to the original space. If “identity”, the transformed space is the same as the original space. If “normalize”, the transformed space is scaled between 0 and 1

name: String, tuple, or None, default=None

A name associated with the dimension
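To make the "normalize" transform concrete, here is a minimal sketch of scaling an integer range to [0, 1] and back. The helper names are hypothetical, and rounding to the nearest integer on the way back is an assumption of this sketch rather than documented behaviour.

```python
def normalize(value, low, high):
    """Scale `value` from [low, high] into [0, 1]."""
    return (value - low) / (high - low)


def denormalize_to_int(value_t, low, high):
    """Map a point in [0, 1] back to [low, high], rounding to the nearest integer."""
    return int(round(value_t * (high - low) + low))


warped = normalize(7, low=5, high=15)
restored = denormalize_to_int(warped, low=5, high=15)
```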

Attributes
distribution: rv_generic

See documentation of _make_distribution() or distribution()

transform_: String

Original value passed through the transform kwarg - Because transform() exists

transformer: Transformer

See documentation of _make_transformer() or transformer()

Methods

distance(self, a, b)

Calculate distance between two points in the dimension’s bounds

get_params(self)

Get dict of parameters used to initialize the Integer, or their defaults

inverse_transform(self, data_t)

Inverse transform samples from the warped space back to the original space

rvs(self[, n_samples, random_state])

Draw random samples.

transform(self, data)

Transform samples from the original space into a warped space

inverse_transform(self, data_t)

Inverse transform samples from the warped space back to the original space

Parameters
data_t: List

Samples to inverse transform. Should be of shape (<# samples>, transformed_size)

Returns
List

Samples transformed back to original space. Will be shape (<# samples>, size)

property transformed_bounds

Dimension bounds in the warped space

Returns
low: Int

0 if transform_="normalize", else low

high: Int

1 if transform_="normalize", else high

get_params(self) → dict

Get dict of parameters used to initialize the Integer, or their defaults

class hyperparameter_hunter.Categorical(categories: list, prior: list = None, transform='onehot', optional=False, name=None)

Bases: hyperparameter_hunter.space.dimensions.Dimension

Search space dimension that can assume any categorical value in a given list

Parameters
categories: List

Sequence of possible categories of shape (n_categories,)

prior: List, or None, default=None

If list, prior probabilities for each category of shape (n_categories,). By default all categories are equally likely

transform: {“onehot”, “identity”}, default=”onehot”

Transformation to apply to the original space. If “identity”, the transformed space is the same as the original space. If “onehot”, the transformed space is a one-hot encoded representation of the original space

optional: Boolean, default=False

Intended for use by FeatureEngineer when optimizing an EngineerStep. Specifically, this enables searching through a space in which an EngineerStep either may or may not be used. This is contrary to Categorical’s usual function of creating a space comprising multiple categories. When optional = True, the space created will represent any of the values in categories either being included in the entire FeatureEngineer process, or being skipped entirely. Internally, a value excluded by optional is represented by a sentinel value that signals it should be removed from the containing list, so optional will not work for choosing between a single value and None, for example
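The reason optional cannot express "this value or None" follows from how the sentinel is handled: a rejected entry is removed from its containing list rather than replaced. The sketch below illustrates that idea only. The `RejectedOptional` class here is a local stand-in defined for the example, and `drop_rejected` is a hypothetical helper.

```python
class RejectedOptional:
    """Stand-in sentinel marking an optional step that was not selected."""


def drop_rejected(steps):
    """Remove sentinel entries so skipped steps vanish from the step list."""
    return [step for step in steps if not isinstance(step, RejectedOptional)]


# Step names are plain strings here purely for illustration
selected = ["sqr_sum", RejectedOptional(), "q_transform"]
remaining = drop_rejected(selected)
```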

name: String, tuple, or None, default=None

A name associated with the dimension

Attributes
categories: Tuple

Original value passed through the categories kwarg, cast to a tuple. If optional is True, then an instance of RejectedOptional will be appended to categories

distribution: rv_generic

See documentation of _make_distribution() or distribution()

optional: Boolean

Original value passed through the optional kwarg

prior: List, or None

Original value passed through the prior kwarg

prior_actual: List

Calculated prior value, initially equivalent to prior, but then set to a default array if None

transform_: String

Original value passed through the transform kwarg - Because transform() exists

transformer: Transformer

See documentation of _make_transformer() or transformer()

Methods

distance(self, a, b)

Calculate distance between two points in the dimension’s bounds

get_params(self)

Get dict of parameters used to initialize the Categorical, or their defaults

inverse_transform(self, data_t)

Inverse transform samples from the warped space back to the original space

rvs(self[, n_samples, random_state])

Draw random samples.

transform(self, data)

Transform samples from the original space into a warped space

rvs(self, n_samples=None, random_state=None)

Draw random samples. Samples are in the original (untransformed) space. They must be transformed before being passed to a model or minimizer via transform()

Parameters
n_samples: Int (optional)

Number of samples to be drawn. If not given, a single sample will be returned

random_state: Int, RandomState, or None, default=None

Set random state to something other than None for reproducible results

Returns
List

Randomly drawn samples from the original space

property transformed_size

Size of the transformed space for the dimension

Returns
Int
  • 1 if transform_ == “identity”

  • 1 if transform_ == “onehot” and length of categories is 1 or 2

  • Length of categories in all other cases

property bounds

Dimension bounds in the original space

Returns
Tuple

categories

property transformed_bounds

Dimension bounds in the warped space

Returns
Tuple, or list

If transformed_size == 1, then a tuple of (0.0, 1.0). Otherwise, returns a list containing transformed_size-many tuples of (0.0, 1.0)

Notes

transformed_size == 1 when the length of categories == 2, so if there are two items in categories, (0.0, 1.0) is returned. If there are three items in categories, [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)] is returned, and so on.

Because transformed_bounds uses transformed_size, it is affected by transform_. Specifically, the returns described above are for transform_ == “onehot” (default).
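The interaction between transformed_size and transformed_bounds can be summarized in a few lines. These two functions are hypothetical restatements of the documented rules, not the library's code.

```python
def categorical_transformed_size(categories, transform="onehot"):
    """Number of columns a categorical dimension occupies in the warped space."""
    if transform == "identity":
        return 1
    # One-hot encoding one or two categories needs only a single column
    return 1 if len(categories) <= 2 else len(categories)


def categorical_transformed_bounds(categories, transform="onehot"):
    """Mirror the documented return: one (0.0, 1.0) tuple, or a list of them."""
    size = categorical_transformed_size(categories, transform)
    return (0.0, 1.0) if size == 1 else [(0.0, 1.0)] * size
```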

Examples

>>> Categorical(["a", "b"]).transformed_bounds
(0.0, 1.0)
>>> Categorical(["a", "b", "c"]).transformed_bounds
[(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]
>>> Categorical(["a", "b", "c", "d"]).transformed_bounds
[(0.0, 1.0), (0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]
distance(self, a, b) → int

Calculate distance between two points in the dimension’s bounds

Parameters
a

First category

b

Second category

Returns
Int

0 if a == b. Else 1 (because categories have no order)

get_params(self) → dict

Get dict of parameters used to initialize the Categorical, or their defaults

hyperparameter_hunter.lambda_callback(on_exp_start=None, on_exp_end=None, on_rep_start=None, on_rep_end=None, on_fold_start=None, on_fold_end=None, on_run_start=None, on_run_end=None, agg_name=None, do_reshape_aggs=True, method_agg_keys=False, on_experiment_start=<object object at 0x7fe183d5ebd0>, on_experiment_end=<object object at 0x7fe183d5ebd0>, on_repetition_start=<object object at 0x7fe183d5ebd0>, on_repetition_end=<object object at 0x7fe183d5ebd0>)

Utility for creating custom callbacks to be declared by Environment and used by Experiments. The callable “on_<…>_<start/end>” parameters provided will receive as input whichever attributes of the Experiment are included in the signature of the given callable. If **kwargs is given in the callable’s signature, a dict of all of the Experiment’s attributes will be provided. This can be helpful for trying to figure out how to build a custom callback, but should not be used unless absolutely necessary. If the Experiment does not have an attribute specified in the callable’s signature, the following placeholder will be given: “INVALID KWARG”
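The signature-driven argument passing described above can be approximated with the standard library's inspect module. This is a sketch of the idea, not the library's mechanism: `FakeExperiment` and `call_with_requested_attributes` are hypothetical names, and the three attributes are chosen only for illustration.

```python
import inspect


class FakeExperiment:
    """Hypothetical stand-in exposing a few Experiment-like attributes."""

    def __init__(self):
        self._rep, self._fold, self._run = 0, 1, 2


def call_with_requested_attributes(func, experiment):
    """Call `func` with the experiment attributes named in its signature."""
    parameters = inspect.signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in parameters.values()):
        # `**kwargs` in the signature: hand over every attribute
        return func(**vars(experiment))
    # Unknown names receive the documented placeholder string
    kwargs = {name: getattr(experiment, name, "INVALID KWARG") for name in parameters}
    return func(**kwargs)


def describe(_rep, _fold, not_an_attribute):
    return f"{_rep}.{_fold} {not_an_attribute}"


result = call_with_requested_attributes(describe, FakeExperiment())
```

Requesting attributes by name keeps each callback's dependencies explicit, which is why the `**kwargs` form is best reserved for exploration.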

Parameters
on_exp_start: Callable, or None, default=None

Callable that receives Experiment’s values for parameters in the signature at Experiment start

on_exp_end: Callable, or None, default=None

Callable that receives Experiment’s values for parameters in the signature at Experiment end

on_rep_start: Callable, or None, default=None

Callable that receives Experiment’s values for parameters in the signature at repetition start

on_rep_end: Callable, or None, default=None

Callable that receives Experiment’s values for parameters in the signature at repetition end

on_fold_start: Callable, or None, default=None

Callable that receives Experiment’s values for parameters in the signature at fold start

on_fold_end: Callable, or None, default=None

Callable that receives Experiment’s values for parameters in the signature at fold end

on_run_start: Callable, or None, default=None

Callable that receives Experiment’s values for parameters in the signature at run start

on_run_end: Callable, or None, default=None

Callable that receives Experiment’s values for parameters in the signature at run end

agg_name: Str, default=uuid.uuid4

This parameter is only used if the callables are behaving like AggregatorCallbacks by returning values (see the “Notes” section below for details on this). If the callables do return values, they will be stored under a key named (“_” + agg_name) in a dict in hyperparameter_hunter.experiments.BaseExperiment.stat_aggregates. The purpose of this parameter is to make it easier to understand an Experiment’s description file, as agg_name will default to a UUID if it is not given

do_reshape_aggs: Boolean, default=True

Whether to reshape the aggregated values to reflect the nested repetitions/folds/runs structure used for other aggregated values. If False, lists of aggregated values are left in their original shapes. This parameter is only used if the callables are behaving like AggregatorCallbacks (see the “Notes” section below and agg_name for details on this)

method_agg_keys: Boolean, default=False

If True, the aggregate keys for the items added to the dict at agg_name are equivalent to the names of the “on_<…>_<start/end>” pseudo-methods whose values are being aggregated. In other words, the pool of all possible aggregate keys goes from [“runs”, “folds”, “reps”, “final”] to the names of the eight “on_<…>_<start/end>” kwargs of lambda_callback(). See the “Notes” section below for further details and a rough outline

on_experiment_start: …

Deprecated since version 3.0.0: Renamed to on_exp_start. Will be removed in 3.2.0

on_experiment_end: …

Deprecated since version 3.0.0: Renamed to on_exp_end. Will be removed in 3.2.0

on_repetition_start: …

Deprecated since version 3.0.0: Renamed to on_rep_start. Will be removed in 3.2.0

on_repetition_end: …

Deprecated since version 3.0.0: Renamed to on_rep_end. Will be removed in 3.2.0

Returns
LambdaCallback: LambdaCallback

Uninitialized class, whose methods are the callables of the corresponding “on…” kwarg

Notes

For all of the “on_<…>_<start/end>” callables provided as input to lambda_callback, consider the following guidelines (for example function “f”, which can represent any of the callables):

  • All input parameters in the signature of “f” are attributes of the Experiment being executed

    • If “**kwargs” is a parameter, a dict of all the Experiment’s attributes will be provided

  • “f” will be treated as a method of a parent class of the Experiment

    • Take care when modifying attributes, as changes are reflected in the Experiment itself

  • If “f” returns something, it will automatically behave like an AggregatorCallback (see hyperparameter_hunter.callbacks.aggregators). Specifically, the following will occur:

    • A new key (named by agg_name if given, else a UUID) with a dict value is added to hyperparameter_hunter.experiments.BaseExperiment.stat_aggregates

      • This new dict can have up to four keys: “runs” (list), “folds” (list), “reps” (list), and “final” (object)

    • If “f” is an “on_run…” function, the returned value is appended to the “runs” list in the new dict

    • Similarly, if “f” is an “on_fold…” or “on_rep…” function, the returned value is appended to the “folds”, or “reps” list, respectively

    • If “f” is an “on_exp…” function, the “final” key in the new dict is set to the returned value

    • If values were aggregated in the aforementioned manner, the lists of collected values will be reshaped according to runs/folds/reps on Experiment end

    • The aggregated values will be saved in the Experiment’s description file

      • This is because hyperparameter_hunter.experiments.BaseExperiment.stat_aggregates is saved in its entirety

What follows is a rough outline of the structure produced when using an aggregator-like callback that automatically populates experiments.BaseExperiment.stat_aggregates with results of the functions used as arguments to lambda_callback():

BaseExperiment.stat_aggregates = dict(
    ...,
    <`agg_name`>=dict(
        <agg_key "runs">  = [...],
        <agg_key "folds"> = [...],
        <agg_key "reps">  = [...],
        <agg_key "final"> = object(),
        ...
    ),
    ...
)

In the above outline, the actual agg_keys included in the dict at agg_name depend on which “on_<…>_<start/end>” callables are behaving like aggregators. For example, if neither on_run_start nor on_run_end explicitly returns something, then the “runs” agg_key is not included in the agg_name dict. Similarly, if, for example, neither on_exp_start nor on_exp_end is provided, then the “final” agg_key is not included. If method_agg_keys=True, then the agg keys used in the dict are modified to be named after the method called. For example, if method_agg_keys=True and on_fold_start and on_fold_end are both callables returning values to be aggregated, then the agg_keys used for each will be “on_fold_start” and “on_fold_end”, respectively. In this example, if method_agg_keys=False (default) and do_reshape_aggs=False, then the single “folds” agg_key would contain the combined contents returned by both methods in the order in which they were returned
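The effect of method_agg_keys on the resulting dict can be sketched as follows. The `record` helper and the `DEFAULT_AGG_KEYS` mapping are hypothetical; they model only the key-naming rules described in these Notes and ignore reshaping.

```python
# Default agg keys for each pseudo-method prefix, as listed in the Notes
DEFAULT_AGG_KEYS = {"on_run": "runs", "on_fold": "folds", "on_rep": "reps", "on_exp": "final"}


def record(aggregates, method_name, value, method_agg_keys=False):
    """Store `value` returned by the callable registered as `method_name`."""
    prefix = method_name.rsplit("_", 1)[0]  # "on_fold_end" -> "on_fold"
    key = method_name if method_agg_keys else DEFAULT_AGG_KEYS[prefix]
    if prefix == "on_exp":
        aggregates[key] = value  # Experiment-level values are set, not appended
    else:
        aggregates.setdefault(key, []).append(value)
    return aggregates


shared = {}
record(shared, "on_fold_start", "a")
record(shared, "on_fold_end", "b")

separate = {}
record(separate, "on_fold_start", "a", method_agg_keys=True)
record(separate, "on_fold_end", "b", method_agg_keys=True)
```

With the default setting both fold callables feed one "folds" list in call order, while method_agg_keys=True keeps their values under separate keys.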

For examples using lambda_callback to create custom callbacks, see hyperparameter_hunter.callbacks.recipes

Examples

>>> from hyperparameter_hunter import lambda_callback
>>> from hyperparameter_hunter.environment import Environment
>>> def printer_helper(_rep, _fold, _run, last_evaluation_results):
...     print(f"{_rep}.{_fold}.{_run}   {last_evaluation_results}")
>>> my_lambda_callback = lambda_callback(
...     on_exp_end=printer_helper,
...     on_rep_end=printer_helper,
...     on_fold_end=printer_helper,
...     on_run_end=printer_helper,
... )
... # env = Environment(
... #     train_dataset="i am a dataset",
... #     results_path="path/to/HyperparameterHunterAssets",
... #     metrics=["roc_auc_score"],
... #     experiment_callbacks=[my_lambda_callback]
... # )
... # ... Now execute an Experiment, or an Optimization Protocol...

See hyperparameter_hunter.examples.lambda_callback_example for more information

class hyperparameter_hunter.FeatureEngineer(steps=None, do_validate=False, **datasets: Dict[str, pandas.core.frame.DataFrame])

Bases: object

Class to organize feature engineering steps (EngineerStep instances, stored in steps) and the datasets that the steps request and return.

Parameters
steps: List, or None, default=None

List of arbitrary length, containing any of the following values:

  1. EngineerStep instance,

  2. Function to provide as input to EngineerStep, or

  3. Categorical, with categories comprising a selection of the previous two kinds of values (optimization only)

The third value can only be used during optimization. The feature_engineer provided to CVExperiment, for example, may only contain the first two values. To search a space optionally including an EngineerStep, use the optional kwarg of Categorical.

See EngineerStep for information on properly formatted EngineerStep functions. Additional engineering steps may be added via add_step()

do_validate: Boolean, or “strict”, default=False

… Experimental… Whether to validate the datasets resulting from feature engineering steps. If True, hashes of the new datasets will be compared to those of the originals to ensure they were actually modified. Results will be logged. If do_validate = “strict”, an exception will be raised if any anomalies are found, rather than logging a message. If do_validate = False, no validation will be performed

**datasets: DFDict

This is not expected to be provided on initialization and is offered primarily for debugging/testing. Mapping of datasets necessary to perform feature engineering steps

See also

EngineerStep

For proper formatting of non-Categorical values of steps

Notes

If steps does include any instances of hyperparameter_hunter.space.dimensions.Categorical, this FeatureEngineer instance will not be usable by Experiments. It can only be used by Optimization Protocols. Furthermore, the FeatureEngineer that the Optimization Protocol actually ends up using will not pass identity checks against the original FeatureEngineer that contained Categorical steps

Examples

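>>> import numpy as np
>>> from sklearn.preprocessing import StandardScaler, MinMaxScaler, QuantileTransformer
>>> from hyperparameter_hunter import Categorical, EngineerStep, FeatureEngineer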
>>> # Define some engineer step functions to play with
>>> def s_scale(train_inputs, non_train_inputs):
...     s = StandardScaler()
...     train_inputs[train_inputs.columns] = s.fit_transform(train_inputs.values)
...     non_train_inputs[train_inputs.columns] = s.transform(non_train_inputs.values)
...     return train_inputs, non_train_inputs
>>> def mm_scale(train_inputs, non_train_inputs):
...     s = MinMaxScaler()
...     train_inputs[train_inputs.columns] = s.fit_transform(train_inputs.values)
...     non_train_inputs[train_inputs.columns] = s.transform(non_train_inputs.values)
...     return train_inputs, non_train_inputs
>>> def q_transform(train_targets, non_train_targets):
...     t = QuantileTransformer(output_distribution="normal")
...     train_targets[train_targets.columns] = t.fit_transform(train_targets.values)
...     non_train_targets[train_targets.columns] = t.transform(non_train_targets.values)
...     return train_targets, non_train_targets, t
>>> def sqr_sum(all_inputs):
...     all_inputs["square_sum"] = all_inputs.agg(
...         lambda row: np.sqrt(np.sum([np.square(_) for _ in row])), axis="columns"
...     )
...     return all_inputs

FeatureEngineer steps wrapped by `EngineerStep` == raw function steps - as long as the `EngineerStep` is using the default parameters

>>> # FeatureEngineer steps wrapped by `EngineerStep` == raw function steps
>>> #   ... As long as the `EngineerStep` is using the default parameters
>>> fe_0 = FeatureEngineer([sqr_sum, s_scale])
>>> fe_1 = FeatureEngineer([EngineerStep(sqr_sum), EngineerStep(s_scale)])
>>> fe_0.steps == fe_1.steps
True
>>> fe_2 = FeatureEngineer([sqr_sum, EngineerStep(s_scale), q_transform])

`Categorical` can be used during optimization and placed anywhere in `steps`. `Categorical` can also handle either `EngineerStep` categories or raw functions. Use the `optional` kwarg of `Categorical` to test some questionable steps

>>> fe_3 = FeatureEngineer([sqr_sum, Categorical([s_scale, mm_scale]), q_transform])
>>> fe_4 = FeatureEngineer([Categorical([sqr_sum], optional=True), s_scale, q_transform])
>>> fe_5 = FeatureEngineer([
...     Categorical([sqr_sum], optional=True),
...     Categorical([EngineerStep(s_scale), mm_scale]),
...     q_transform
... ])
Attributes
steps

Feature engineering steps to execute in sequence on FeatureEngineer.__call__()

Methods

__call__(self, stage, **datasets, …)

Execute all feature engineering steps in steps for stage, with datasets datasets as inputs

add_step(self, step, …)

Add an engineering step to steps to be executed with the other contents of steps on FeatureEngineer.__call__()

get_key_data(self)

Produce a dict of critical attributes describing the FeatureEngineer instance for use by key-making classes

inverse_transform(self, data)

Perform the inverse transformation for all engineer steps in steps in sequence on data

inverse_transform(self, data)

Perform the inverse transformation for all engineer steps in steps in sequence on data

Parameters
data: Array-like

Data to inverse transform with any inversions present in steps

Returns
Array-like

Result of sequentially calling inverse transformations in steps on data. If any step has EngineerStep.inversion = None, data is unmodified for that step, and proceeds to next engineer step inversion
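The skip-on-None behaviour described in the return value can be illustrated with a short loop. `apply_inversions` is a hypothetical helper; it applies the inversions in the order they are given, and whether the library walks steps forward or in reverse is not stated here, so treat the ordering as an assumption.

```python
def apply_inversions(data, inversions):
    """Run each available inversion on `data` in turn, skipping steps that have none."""
    for inversion in inversions:
        if inversion is None:
            continue  # Mirrors `EngineerStep.inversion = None`: data passes through
        data = inversion(data)
    return data


# Hypothetical inversions for three steps; the middle step has nothing to undo
restored = apply_inversions(10.0, [lambda x: x - 1.0, None, lambda x: x / 3.0])
```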

property steps

Feature engineering steps to execute in sequence on FeatureEngineer.__call__()

get_key_data(self) → dict

Produce a dict of critical attributes describing the FeatureEngineer instance for use by key-making classes

Returns
Dict

Important attributes describing this FeatureEngineer instance

add_step(self, step:Union[Callable, hyperparameter_hunter.space.dimensions.Categorical], stage:str=None, name:str=None, before:str=<class 'hyperparameter_hunter.feature_engineering.EMPTY_SENTINEL'>, after:str=<class 'hyperparameter_hunter.feature_engineering.EMPTY_SENTINEL'>, number:int=<class 'hyperparameter_hunter.feature_engineering.EMPTY_SENTINEL'>)

Add an engineering step to steps to be executed with the other contents of steps on FeatureEngineer.__call__()

Parameters
step: Callable, or `EngineerStep`, or `Categorical`

If EngineerStep instance, will be added directly to steps. Otherwise, must be a feature engineering step callable that requests, modifies, and returns datasets, which will be used to instantiate an EngineerStep to add to steps. If Categorical, categories should contain EngineerStep instances or callables

stage: String in {“pre_cv”, “intra_cv”}, or None, default=None

Feature engineering stage during which the callable step will be executed

name: String, or None, default=None

Identifier for the transformation applied by this engineering step. If None and step is not an EngineerStep, will be inferred during EngineerStep instantiation

before: String, default=EMPTY_SENTINEL

… Experimental…

after: String, default=EMPTY_SENTINEL

… Experimental…

number: Int, default=EMPTY_SENTINEL

… Experimental…

class hyperparameter_hunter.EngineerStep(f: Callable, stage=None, name=None, params=None, do_validate=False)

Bases: object

Container for individual FeatureEngineer step functions

Compartmentalizes functions of singular engineer steps and allows for greater customization than a raw engineer step function

Parameters
f: Callable

Feature engineering step function that requests, modifies, and returns datasets params

Step functions should follow these guidelines:

  1. Request as input a subset of the 11 data strings listed in params

  2. Do whatever you want to the DataFrames given as input

  3. Return new DataFrame values of the input parameters in same order as requested

If performing a task like target transformation, causing predictions to be transformed, it is often desirable to inverse-transform the predictions to be of the expected form. This can easily be done by returning an extra value from f (after the datasets) that is either a callable, or a transformer class that was fitted during the execution of f and implements an inverse_transform method. This is the only instance in which it is acceptable for f to return values that don’t mimic its input parameters. See the engineer function definition using SKLearn’s QuantileTransformer in the Examples section below for an actual inverse-transformation-compatible implementation
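One way to picture the extra return value is as a trailing element that is split off from the datasets. The sketch below is an assumption-laden illustration: `split_step_output` and `HalvingTransformer` are hypothetical names, and strings stand in for DataFrames.

```python
def split_step_output(output, n_params):
    """Separate returned datasets from an optional trailing inversion."""
    if not isinstance(output, tuple):
        output = (output,)
    datasets, extras = output[:n_params], output[n_params:]
    if not extras:
        return datasets, None
    inversion = extras[0]
    # Accept a plain callable, or a fitted transformer exposing `inverse_transform`
    if callable(inversion):
        return datasets, inversion
    return datasets, inversion.inverse_transform


class HalvingTransformer:
    """Hypothetical fitted transformer used only for this illustration."""

    def inverse_transform(self, data):
        return [value * 2 for value in data]


datasets, inversion = split_step_output(("train", "non_train", HalvingTransformer()), n_params=2)
```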

stage: String in {“pre_cv”, “intra_cv”}, or None, default=None

Feature engineering stage during which the callable f will be given the datasets params to modify and return. If None, will be inferred based on params.

  • “pre_cv” functions are applied only once in the experiment: when it starts

  • “intra_cv” functions are reapplied for each fold in the cross-validation splits

If stage is left to be inferred, “pre_cv” will usually be selected. However, if any params (or parameters in the signature of f) are prefixed with “validation…” or “non_train…”, then stage will be inferred as “intra_cv”. See the Notes section below for suggestions on the stage to use for different functions
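The inference rule can be sketched with inspect.signature. The helpers `infer_params` and `infer_stage` are hypothetical and only restate the rule described here; the two step functions are trimmed-down placeholders that return their inputs unchanged.

```python
import inspect


def infer_params(f):
    """Read the requested dataset names from the signature of `f`."""
    return tuple(inspect.signature(f).parameters)


def infer_stage(params):
    """Guess the engineering stage from the dataset names a step requests."""
    intra_cv_prefixes = ("validation", "non_train")
    if any(name.startswith(intra_cv_prefixes) for name in params):
        return "intra_cv"
    return "pre_cv"


def s_scale(train_inputs, non_train_inputs):
    return train_inputs, non_train_inputs


def sqr_sum(all_inputs):
    return all_inputs


stages = {f.__name__: infer_stage(infer_params(f)) for f in (s_scale, sqr_sum)}
```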

name: String, or None, default=None

Identifier for the transformation applied by this engineering step. If None, f.__name__ will be used

params: Tuple[str], or None, default=None

Dataset names requested by feature engineering step callable f. If None, will be inferred by parsing the signature of f. Must be a subset of the following 11 strings:

Input Data

  1. “train_inputs”

  2. “validation_inputs”

  3. “holdout_inputs”

  4. “test_inputs”

  5. “all_inputs”

    ("train_inputs" + ["validation_inputs"] + "holdout_inputs" + "test_inputs")

  6. “non_train_inputs”

    (["validation_inputs"] + "holdout_inputs" + "test_inputs")

Target Data

  1. “train_targets”

  2. “validation_targets”

  3. “holdout_targets”

  4. “all_targets” ("train_targets" + ["validation_targets"] + "holdout_targets")

  5. “non_train_targets” (["validation_targets"] + "holdout_targets")

As an alternative to the above list, just remember that the first half of all parameter names should be one of {“train”, “validation”, “holdout”, “test”, “all”, “non_train”}, and the second half should be either “inputs” or “targets”. The only exception to this rule is “test_targets”, which doesn’t exist.

Inference of “validation” params is affected by stage. During the “pre_cv” stage, the validation dataset has not yet been created and is still a part of the train dataset. During the “intra_cv” stage, the validation dataset is created by removing a portion of the train dataset, and their values passed to f reflect this fact. This also means that the values of the merged (“all”/”non_train”-prefixed) datasets may or may not contain “validation” data depending on the stage; however, this is all handled internally, so you probably don’t need to worry about it.

params may not include multiple references to the same dataset, either directly or indirectly. This means (“train_inputs”, “train_inputs”) is invalid due to duplicate direct references. Less obviously, (“train_inputs”, “all_inputs”) is invalid because “all_inputs” includes “train_inputs”
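The duplicate-reference rule can be checked by expanding each merged name into the base datasets it covers. The sketch below is a hypothetical restatement of that rule, using the memberships listed for the “all” and “non_train” datasets; `MERGED` and `find_duplicate_references` are names invented for the example.

```python
# Base datasets covered by each merged dataset name, per the lists above
MERGED = {
    "all_inputs": ("train_inputs", "validation_inputs", "holdout_inputs", "test_inputs"),
    "non_train_inputs": ("validation_inputs", "holdout_inputs", "test_inputs"),
    "all_targets": ("train_targets", "validation_targets", "holdout_targets"),
    "non_train_targets": ("validation_targets", "holdout_targets"),
}


def find_duplicate_references(params):
    """Map each base dataset to the params that reach it, keeping only conflicts."""
    requested_by = {}
    for param in params:
        for base in MERGED.get(param, (param,)):
            requested_by.setdefault(base, []).append(param)
    return {base: names for base, names in requested_by.items() if len(names) > 1}


ok = find_duplicate_references(("train_inputs", "non_train_inputs"))
clash = find_duplicate_references(("train_inputs", "all_inputs"))
```

An empty result means every base dataset is requested at most once, which is the condition params must satisfy.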

do_validate: Boolean, or “strict”, default=False

… Experimental… Whether to validate the datasets resulting from feature engineering steps. If True, hashes of the new datasets will be compared to those of the originals to ensure they were actually modified. Results will be logged. If do_validate = “strict”, an exception will be raised if any anomalies are found, rather than logging a message. If do_validate = False, no validation will be performed

See also

FeatureEngineer

The container for EngineerStep instances - EngineerSteps should always be provided to HyperparameterHunter through a FeatureEngineer

Categorical

Can be used during optimization to search through a group of EngineerSteps given as categories. The optional kwarg of Categorical designates a FeatureEngineer step that may be one of the EngineerSteps in categories, or may be omitted entirely

get_engineering_step_stage()

More information on stage inference and situations where overriding it may be prudent

Notes

stage: Generally, feature engineering conducted in the “pre_cv” stage should treat each sample/row as an independent entity. For example, steps like converting a string day of the week to one-hot encoded columns, or imputing missing values by replacement with -1 might be conducted “pre_cv”, since they are unlikely to introduce information leakage. Conversely, steps like scaling/normalization, whose results for the data in one row are affected by the data in other rows, should be performed “intra_cv” in order to recalculate the final values of the datasets for each cross-validation split and avoid information leakage.

params: In the list of the 11 valid params strings, “test_inputs” is notably missing the “…_targets” counterpart accompanying the other datasets. The “targets” suffix is missing because test data targets are never given. Note that although “test_inputs” is still included in both “all_inputs” and “non_train_inputs”, its lack of a target column means that “all_targets” and “non_train_targets” may have different lengths than their “inputs”-suffixed counterparts

Examples

>>> import numpy as np
>>> from sklearn.preprocessing import StandardScaler, QuantileTransformer
>>> from hyperparameter_hunter import EngineerStep
>>> def s_scale(train_inputs, non_train_inputs):
...     s = StandardScaler()
...     train_inputs[train_inputs.columns] = s.fit_transform(train_inputs.values)
...     non_train_inputs[train_inputs.columns] = s.transform(non_train_inputs.values)
...     return train_inputs, non_train_inputs
>>> # Sensible parameter defaults inferred based on `f`
>>> es_0 = EngineerStep(s_scale)
>>> es_0.stage
'intra_cv'
>>> es_0.name
's_scale'
>>> es_0.params
('train_inputs', 'non_train_inputs')
>>> # Override `stage` if you want to fit your scaler on OOF data like a crazy person
>>> es_1 = EngineerStep(s_scale, stage="pre_cv")
>>> es_1.stage
'pre_cv'

Watch out for multiple requests to the same data

>>> es_2 = EngineerStep(s_scale, params=("train_inputs", "all_inputs"))
Traceback (most recent call last):
    File "feature_engineering.py", line ? in validate_dataset_names
ValueError: Requested params include duplicate references to `train_inputs` by way of:
   - ('all_inputs', 'train_inputs')
   - ('train_inputs',)
Each dataset may only be requested by a single param for each function

The error is the same if `(train_inputs, all_inputs)` is in the actual function signature
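The duplicate check can be sketched as expanding each requested name into the standard datasets it covers and rejecting any overlap. This is only an illustration: `MERGED`, `find_duplicate_requests` and `validate_params` are hypothetical, and the library's own `validate_dataset_names` may work differently.

```python
from collections import defaultdict

# Assumed expansion of merged names into standard dataset names (inputs only, for brevity)
MERGED = {
    "all_inputs": ("train_inputs", "validation_inputs", "holdout_inputs", "test_inputs"),
    "non_train_inputs": ("validation_inputs", "holdout_inputs", "test_inputs"),
}


def find_duplicate_requests(params):
    """Map each standard dataset to the requested params that cover it, if more than one does."""
    requested_by = defaultdict(list)
    for param in params:
        for dataset in MERGED.get(param, (param,)):
            requested_by[dataset].append(param)
    return {d: ps for d, ps in requested_by.items() if len(ps) > 1}


def validate_params(params):
    """Raise ValueError if any dataset is requested by more than one param."""
    duplicates = find_duplicate_requests(params)
    if duplicates:
        raise ValueError(f"Requested params include duplicate references: {duplicates}")
    return tuple(params)
```

Under these assumptions, `("train_inputs", "non_train_inputs")` passes because the two names cover disjoint datasets, while `("train_inputs", "all_inputs")` fails because both cover `train_inputs`.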

EngineerStep functions aren’t just limited to transformations. Make your own features!

>>> import numpy as np
>>> def sqr_sum(all_inputs):
...     all_inputs["square_sum"] = all_inputs.agg(
...         lambda row: np.sqrt(np.sum([np.square(_) for _ in row])), axis="columns"
...     )
...     return all_inputs
>>> es_3 = EngineerStep(sqr_sum)
>>> es_3.stage
'pre_cv'
>>> es_3.name
'sqr_sum'
>>> es_3.params
('all_inputs',)

Inverse-transformation Implementation:

>>> def q_transform(train_targets, non_train_targets):
...     t = QuantileTransformer(output_distribution="normal")
...     train_targets[train_targets.columns] = t.fit_transform(train_targets.values)
...     non_train_targets[train_targets.columns] = t.transform(non_train_targets.values)
...     return train_targets, non_train_targets, t
>>> # Note that `train_targets` and `non_train_targets` must still be returned in order,
>>> #   but they are followed by `t`, an instance of `QuantileTransformer` we just fitted,
>>> #   whose `inverse_transform` method will be called on predictions
>>> es_4 = EngineerStep(q_transform)
>>> es_4.stage
'intra_cv'
>>> es_4.name
'q_transform'
>>> es_4.params
('train_targets', 'non_train_targets')
>>> # `params` does not include any returned transformers - Only data requested as input
Attributes
f

Feature engineering step callable that requests, modifies, and returns datasets

name

Identifier for the transformation applied by this engineering step

params

Dataset names requested by feature engineering step callable f.

stage

Feature engineering stage during which the EngineerStep will be executed

Methods

__call__(self, **datasets, …)

Apply f to datasets to produce updated datasets.

get_comparison_attrs(step_obj)

Build a dict of critical EngineerStep attributes

get_datasets_for_f(self, datasets, …)

Produce a dict of DataFrames containing only the merged datasets and standard datasets requested in params.

get_key_data(self)

Produce a dict of critical attributes describing the EngineerStep instance for use by key-making classes

honorary_step_from_dict(step_dict, dimension)

Get an EngineerStep from dimension that is equal to its dict form, step_dict

inverse_transform(self, data)

Perform the inverse transformation for this engineer step (if it exists)

stringify(self)

Make a stringified representation of self, compatible with EngineerStep.__eq__()

inverse_transform(self, data)

Perform the inverse transformation for this engineer step (if it exists)

Parameters
data: Array-like

Data to inverse transform with inversion or inversion.inverse_transform

Returns
Array-like

If inversion is None, return data unmodified. Else, return the result of inversion or inversion.inverse_transform, given data
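The return rule above amounts to a three-way dispatch. A minimal sketch, assuming `inversion` is either None, a plain callable, or an object exposing `inverse_transform`; `apply_inversion` and `Doubler` are hypothetical names, not part of the library, and the order in which the last two cases are checked is a guess.

```python
def apply_inversion(inversion, data):
    """Undo a transformation on `data` using `inversion`, if one was provided."""
    if inversion is None:
        return data  # Nothing to invert
    if callable(inversion):
        return inversion(data)  # `inversion` is itself the inverse function
    return inversion.inverse_transform(data)  # Fitted transformer-like object


class Doubler:
    """Toy transformer whose forward step doubles values."""

    def transform(self, data):
        return [x * 2 for x in data]

    def inverse_transform(self, data):
        return [x / 2 for x in data]
```

For example, `apply_inversion(Doubler(), [2, 4])` halves each value, whereas `apply_inversion(None, [2, 4])` hands the list back untouched.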

get_datasets_for_f(self, datasets:Dict[str, pandas.core.frame.DataFrame]) → Dict[str, pandas.core.frame.DataFrame]

Produce a dict of DataFrames containing only the merged datasets and standard datasets requested in params. In other words, add the requested merged datasets and remove unnecessary standard datasets

Parameters
datasets: DFDict

Original dict of datasets, containing all datasets provided to EngineerStep.__call__(), some of which may be superfluous, or may require additional processing to resolve merged/coupled datasets

Returns
DFDict

Updated version of datasets, in which unnecessary datasets have been filtered out, and the requested merged datasets have been added
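As a rough picture of this filtering, the sketch below uses lists of rows in place of DataFrames. `select_datasets` and the `MERGED` table are hypothetical, and the concatenation order is an assumption.

```python
from typing import Dict, List, Sequence

Rows = List[list]

# Assumed composition of two merged dataset names
MERGED = {
    "all_targets": ("train_targets", "validation_targets", "holdout_targets"),
    "non_train_targets": ("validation_targets", "holdout_targets"),
}


def select_datasets(datasets: Dict[str, Rows], params: Sequence[str]) -> Dict[str, Rows]:
    """Keep only the datasets named in `params`, building merged ones on request."""
    selected = {}
    for param in params:
        if param in MERGED:
            # Concatenate the standard datasets that make up the merged one
            selected[param] = [row for part in MERGED[param] for row in datasets.get(part, [])]
        else:
            selected[param] = datasets[param]
    return selected
```

Datasets that were supplied but not requested (such as `train_inputs` when only targets are requested) simply do not appear in the result.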

get_key_data(self) → dict

Produce a dict of critical attributes describing the EngineerStep instance for use by key-making classes

Returns
Dict

Important attributes describing this EngineerStep instance

property f

Feature engineering step callable that requests, modifies, and returns datasets

property name

Identifier for the transformation applied by this engineering step

property params

Dataset names requested by feature engineering step callable f. See documentation in EngineerStep.__init__() for more information/restrictions

property stage

Feature engineering stage during which the EngineerStep will be executed

static get_comparison_attrs(step_obj: Union[EngineerStep, dict]) → dict

Build a dict of critical EngineerStep attributes

Parameters
step_obj: EngineerStep, dict

Object for which critical EngineerStep attributes should be collected

Returns
attr_vals: Dict

Critical EngineerStep attributes. If step_obj does not have a necessary attribute (for EngineerStep) or a necessary key (for dict), its value in attr_vals will be a placeholder object. This is to facilitate comparison, while also ensuring missing values will always be considered unequal to other values
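The placeholder behaviour can be reproduced with a fresh `object()` per missing value: a bare object compares equal only to itself, so two records that both lack a key still compare unequal on it. A minimal sketch with hypothetical names (`collect_attrs`, `ATTRS`), not the library's code:

```python
ATTRS = ("name", "f", "params", "stage", "do_validate")


def collect_attrs(step_obj):
    """Collect the attributes in `ATTRS` from a dict or an arbitrary object."""
    attr_vals = {}
    for attr in ATTRS:
        if isinstance(step_obj, dict):
            value = step_obj.get(attr, object())  # New placeholder on every lookup
        else:
            value = getattr(step_obj, attr, object())
        attr_vals[attr] = value
    return attr_vals


a = collect_attrs({"stage": "pre_cv"})
b = collect_attrs({"stage": "pre_cv"})
same_stage = a["stage"] == b["stage"]  # Both present and equal
same_name = a["name"] == b["name"]  # Both missing, so never equal
```

Using `None` as the placeholder instead would make two incomplete records look identical on every missing field, which is the outcome the description above is guarding against.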

Examples

>>> def dummy_f(train_inputs, non_train_inputs):
...     return train_inputs, non_train_inputs
>>> es_0 = EngineerStep(dummy_f)
>>> EngineerStep.get_comparison_attrs(es_0)  
{'name': 'dummy_f',
 'f': <function dummy_f at ...>,
 'params': ('train_inputs', 'non_train_inputs'),
 'stage': 'intra_cv',
 'do_validate': False}
>>> EngineerStep.get_comparison_attrs(
...     dict(foo="hello", f=dummy_f, params=["all_inputs", "all_targets"], stage="pre_cv")
... )  
{'name': <object object at ...>,
 'f': <function dummy_f at ...>,
 'params': ('all_inputs', 'all_targets'),
 'stage': 'pre_cv',
 'do_validate': <object object at ...>}
stringify(self) → str

Make a stringified representation of self, compatible with EngineerStep.__eq__()

Returns
String

String describing all critical attributes of the EngineerStep instance. This value is not particularly human-friendly due to both its length and the fact that EngineerStep.f is represented by its hash
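Representing a callable by a hash makes the string depend on the function's content rather than on its memory address, which the default repr would embed. How the library derives its hash is not stated here; the sketch below hashes the function's compiled bytecode with SHA-256 purely as an assumption for illustration, and `hash_callable` and `describe_step` are hypothetical names.

```python
import hashlib


def hash_callable(f):
    """Return a SHA-256 hex digest of the compiled bytecode of `f`."""
    return hashlib.sha256(f.__code__.co_code).hexdigest()


def describe_step(f, params, stage):
    """Build a repr-like string in which `f` appears as its name plus hash."""
    return f"Step({f.__name__}, {hash_callable(f)}, {tuple(params)}, {stage})"


def dummy_f(train_inputs, non_train_inputs):
    return train_inputs, non_train_inputs
```

The 64-character digest is what makes such a string long and hard to read, but it lets two descriptions be compared for equality as plain strings.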

Examples

>>> def dummy_f(train_inputs, non_train_inputs):
...     return train_inputs, non_train_inputs
>>> EngineerStep(dummy_f).stringify()  
"EngineerStep(dummy_f, ..., ('train_inputs', 'non_train_inputs'), intra_cv, False)"
>>> EngineerStep(dummy_f, stage="pre_cv").stringify()  
"EngineerStep(dummy_f, ..., ('train_inputs', 'non_train_inputs'), pre_cv, False)"
classmethod honorary_step_from_dict(step_dict:dict, dimension:hyperparameter_hunter.space.dimensions.Categorical)

Get an EngineerStep from dimension that is equal to its dict form, step_dict

Parameters
step_dict: Dict

Dict of the form saved in Experiment description files for EngineerStep. Expected to have the following keys, with values of the given types:

  • “name”: String

  • “f”: String (SHA256 hash)

  • “params”: List[str], or Tuple[str, …]

  • “stage”: String in {“pre_cv”, “intra_cv”}

  • “do_validate”: Boolean

dimension: Categorical

Categorical instance expected to contain the EngineerStep equivalent of step_dict in its categories

Returns
EngineerStep

From dimension.categories if it is the EngineerStep equivalent of step_dict

Raises
ValueError

If dimension.categories does not contain an EngineerStep matching step_dict
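The lookup described above is a linear search with a failure case. A minimal sketch, assuming each category can be reduced to a dict for comparison; `Step` and `step_from_dict` are hypothetical stand-ins, and only three of the documented keys are compared.

```python
from dataclasses import asdict, dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Step:
    """Toy stand-in for an engineering step."""

    name: str
    params: Tuple[str, ...]
    stage: str


def step_from_dict(step_dict: dict, categories: Iterable[Step]) -> Step:
    """Return the first category whose dict form matches `step_dict`."""
    wanted = dict(step_dict, params=tuple(step_dict["params"]))  # Accept list or tuple
    for candidate in categories:
        if asdict(candidate) == wanted:
            return candidate
    raise ValueError(f"No category matches {step_dict!r}")
```

Returning the matching object from the categories, rather than rebuilding a step from the dict, keeps the original callable attached, which a saved description cannot carry on its own.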

class hyperparameter_hunter.BayesianOptimization(**kwargs)

Bases: hyperparameter_hunter.optimization.backends.skopt.protocols.BayesianOptPro

Deprecated since version 3.0.0a2: Will be removed in 3.2.0. Renamed to BayesianOptPro
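For readers unfamiliar with this pattern, a renamed class is commonly kept as a thin subclass that warns on construction. The sketch below is generic: `NewName` and `OldName` are hypothetical, and nothing here is taken from the library's source.

```python
import warnings


class NewName:
    """The class under its current name."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


class OldName(NewName):
    """Deprecated alias kept so that existing imports continue to work."""

    def __init__(self, **kwargs):
        warnings.warn(
            "OldName is deprecated; use NewName instead",
            DeprecationWarning,
            stacklevel=2,  # Attribute the warning to the caller's line
        )
        super().__init__(**kwargs)
```

Because the alias subclasses the new class, existing `isinstance` checks and method calls keep working until the alias is removed.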

Attributes
search_space_size

The number of different hyperparameter permutations possible given the current hyperparameter search space

source_script

Methods

forge_experiment(self, model_initializer[, …])

Define hyperparameter search scaffold for building Experiments during optimization

get_ready(self)

Prepare for optimization by finalizing hyperparameter space and identifying similar Experiments.

go(self[, force_ready])

Execute hyperparameter optimization, building an Experiment for each iteration

set_dimensions(self)

Locate given hyperparameters that are space choice declarations and add them to dimensions

set_experiment_guidelines(self, *args, …)

Deprecated since version 3.0.0a2.

source_script = None
class hyperparameter_hunter.GradientBoostedRegressionTreeOptimization(**kwargs)

Bases: hyperparameter_hunter.optimization.backends.skopt.protocols.GradientBoostedRegressionTreeOptPro

Deprecated since version 3.0.0a2: Will be removed in 3.2.0. Renamed to GradientBoostedRegressionTreeOptPro

Attributes
search_space_size

The number of different hyperparameter permutations possible given the current hyperparameter search space

source_script

Methods

forge_experiment(self, model_initializer[, …])

Define hyperparameter search scaffold for building Experiments during optimization

get_ready(self)

Prepare for optimization by finalizing hyperparameter space and identifying similar Experiments.

go(self[, force_ready])

Execute hyperparameter optimization, building an Experiment for each iteration

set_dimensions(self)

Locate given hyperparameters that are space choice declarations and add them to dimensions

set_experiment_guidelines(self, *args, …)

Deprecated since version 3.0.0a2.

source_script = None
class hyperparameter_hunter.RandomForestOptimization(**kwargs)

Bases: hyperparameter_hunter.optimization.backends.skopt.protocols.RandomForestOptPro

Deprecated since version 3.0.0a2: Will be removed in 3.2.0. Renamed to RandomForestOptPro

Attributes
search_space_size

The number of different hyperparameter permutations possible given the current hyperparameter search space

source_script

Methods

forge_experiment(self, model_initializer[, …])

Define hyperparameter search scaffold for building Experiments during optimization

get_ready(self)

Prepare for optimization by finalizing hyperparameter space and identifying similar Experiments.

go(self[, force_ready])

Execute hyperparameter optimization, building an Experiment for each iteration

set_dimensions(self)

Locate given hyperparameters that are space choice declarations and add them to dimensions

set_experiment_guidelines(self, *args, …)

Deprecated since version 3.0.0a2.

source_script = None
class hyperparameter_hunter.ExtraTreesOptimization(**kwargs)

Bases: hyperparameter_hunter.optimization.backends.skopt.protocols.ExtraTreesOptPro

Deprecated since version 3.0.0a2: Will be removed in 3.2.0. Renamed to ExtraTreesOptPro

Attributes
search_space_size

The number of different hyperparameter permutations possible given the current hyperparameter search space

source_script

Methods

forge_experiment(self, model_initializer[, …])

Define hyperparameter search scaffold for building Experiments during optimization

get_ready(self)

Prepare for optimization by finalizing hyperparameter space and identifying similar Experiments.

go(self[, force_ready])

Execute hyperparameter optimization, building an Experiment for each iteration

set_dimensions(self)

Locate given hyperparameters that are space choice declarations and add them to dimensions

set_experiment_guidelines(self, *args, …)

Deprecated since version 3.0.0a2.

source_script = None
class hyperparameter_hunter.DummySearch(**kwargs)

Bases: hyperparameter_hunter.optimization.backends.skopt.protocols.DummyOptPro

Deprecated since version 3.0.0a2: Will be removed in 3.2.0. Renamed to DummyOptPro

Attributes
search_space_size

The number of different hyperparameter permutations possible given the current hyperparameter search space

source_script

Methods

forge_experiment(self, model_initializer[, …])

Define hyperparameter search scaffold for building Experiments during optimization

get_ready(self)

Prepare for optimization by finalizing hyperparameter space and identifying similar Experiments.

go(self[, force_ready])

Execute hyperparameter optimization, building an Experiment for each iteration

set_dimensions(self)

Locate given hyperparameters that are space choice declarations and add them to dimensions

set_experiment_guidelines(self, *args, …)

Deprecated since version 3.0.0a2.

source_script = None