hyperparameter_hunter package¶
Subpackages¶
- hyperparameter_hunter.callbacks package
- hyperparameter_hunter.data package
- hyperparameter_hunter.keys package
- hyperparameter_hunter.library_helpers package
- hyperparameter_hunter.optimization package
- hyperparameter_hunter.space package
- hyperparameter_hunter.utils package
- Submodules
- hyperparameter_hunter.utils.boltons_utils module
- hyperparameter_hunter.utils.file_utils module
- hyperparameter_hunter.utils.general_utils module
- hyperparameter_hunter.utils.learning_utils module
- hyperparameter_hunter.utils.optimization_utils module
- hyperparameter_hunter.utils.parsing_utils module
- hyperparameter_hunter.utils.result_utils module
- hyperparameter_hunter.utils.version_utils module
- Module contents
Submodules¶
hyperparameter_hunter.algorithm_handlers module¶
- hyperparameter_hunter.algorithm_handlers.identify_algorithm(model_initializer)¶
Determine the name and module of the algorithm provided by model_initializer
- Parameters
- model_initializer: functools.partial, or class, or class instance
The algorithm class being used to initialize a model
- Returns
- algorithm_name: str
The name of the algorithm provided by model_initializer
- module_name: str
The name of the module housing the algorithm provided by model_initializer
Examples
>>> from sklearn.cluster import DBSCAN, SpectralClustering
>>> from functools import partial
>>> identify_algorithm(DBSCAN)
('DBSCAN', 'sklearn')
>>> identify_algorithm(DBSCAN())
('DBSCAN', 'sklearn')
>>> identify_algorithm(partial(SpectralClustering))
('SpectralClustering', 'sklearn')
- hyperparameter_hunter.algorithm_handlers.identify_algorithm_hyperparameters(model_initializer)¶
Determine the keyword arguments accepted by model_initializer, along with their default values
- Parameters
- model_initializer: functools.partial, or class, or class instance
The algorithm class being used to initialize a model
- Returns
- hyperparameter_defaults: dict
The dict of kwargs accepted by model_initializer and their default values
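Examples
A minimal sketch of the expected behavior, assuming scikit-learn is installed (eps=0.5 and min_samples=5 are DBSCAN's defaults at the time of writing):
>>> from sklearn.cluster import DBSCAN
>>> defaults = identify_algorithm_hyperparameters(DBSCAN)
>>> defaults["eps"], defaults["min_samples"]
(0.5, 5)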
hyperparameter_hunter.environment module¶
This module is central to the proper functioning of the entire library. It defines Environment, which (when activated) is used by the vast majority of the other operation-critical modules in the library. Environment can be viewed as a simple storage container that defines the settings that characterize the Experiments/OptimizationProtocols to be conducted, and that influence how those processes are carried out
Notes¶
Although hyperparameter_hunter.settings is the only module listed as “related”, pretty much all the other modules in the library are related to hyperparameter_hunter.environment.Environment by way of this relation
- class hyperparameter_hunter.environment.Environment(train_dataset, environment_params_path=None, *, results_path=None, metrics=None, holdout_dataset=None, test_dataset=None, target_column=None, id_column=None, do_predict_proba=None, prediction_formatter=None, metrics_params=None, cv_type=None, runs=None, global_random_seed=None, random_seeds=None, random_seed_bounds=None, cv_params=None, verbose=None, file_blacklist=None, reporting_params=None, to_csv_params=None, do_full_save=None, experiment_callbacks=None, experiment_recorders=None, save_transformed_metrics=None)¶
Bases: object
Class to organize the parameters that allow Experiments/OptPros to be fairly compared
Environment is the collective starting point for all of HyperparameterHunter’s biggest and best toys: Experiments and OptimizationProtocols. Without an Environment, neither of these will work.
The Environment is where we declare all the parameters that transcend traditional “hyperparameters”. It houses the stuff without which machine learning can’t even really start. Specifically, Environment cares about 1) The data used for fitting/predicting, 2) The cross-validation scheme used to split the data and fit models; and 3) How to evaluate the predictions made on that data. There are plenty of other goodies documented below, but the absolutely mission-critical parameters concerned with the above tasks are train_dataset, cv_type, cv_params, and metrics. Additionally, it’s important to provide results_path, so Experiment/OptPro results can be saved, which is kind of what HyperparameterHunter is all about
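A minimal sketch of declaring these mission-critical parameters (train_df is a hypothetical DataFrame containing a “target” column, and the metric name is illustrative):
>>> from hyperparameter_hunter import Environment
>>> env = Environment(
...     train_dataset=train_df,  # hypothetical pd.DataFrame with a "target" column
...     results_path="HyperparameterHunterAssets",  # where result files are saved
...     metrics=["roc_auc_score"],  # name of an attribute in sklearn.metrics
...     cv_type="KFold",
...     cv_params=dict(n_splits=5, shuffle=True, random_state=32),
... )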
- Parameters
- train_dataset: Pandas.DataFrame, or str path
The training data for the experiment. Will be split into train/holdout data, if applicable, and train/validation data if cross-validation is to be performed. If str, will attempt to read the file at that path via pandas.read_csv(). For more information on which columns will be used during fitting/predicting, see the “Dataset columns” note in the “Notes” section below
- environment_params_path: String path, or None, default=None
If not None and a valid .json filepath containing an object (dict), the file’s contents are treated as the default values for all keys that match any of the below kwargs used to initialize Environment
- results_path: String path, or None, default=None
If valid directory path and the results directory has not yet been created, it will be created here. If this does not end with <ASSETS_DIRNAME>, it will be appended. If <ASSETS_DIRNAME> already exists at this path, new results will also be stored here. If None or invalid, results will not be stored
- metrics: Dict, List, or None, default=None
Iterable describing the metrics to be recorded, along with a means to compute the value of each metric. Should take one of the two following forms:
List Form:
- [“<metric name>”, “<metric name>”, …]: Where each value is a string that names an attribute in sklearn.metrics
- [Metric, Metric, …]: Where each value of the list is an instance of metrics.Metric
- [(<name>, <metric_function>, [<direction>]), (<*args>), …]: Where each value of the list is a tuple of arguments that will be used to instantiate a metrics.Metric. Arguments given in tuples must be in the order expected by metrics.Metric: (name, metric_function, direction)
Dict Form:
- {“<metric name>”: <metric_function>, …}: Where each key is a name for the corresponding metric callable, which is used to compute the value of the metric
- {“<metric name>”: (<metric_function>, <direction>), …}: Where each key is a name for the corresponding metric callable and direction, all of which are used to instantiate a metrics.Metric
- {“<metric name>”: “<sklearn metric name>”, …}: Where each key is a name for the metric, and each value is the name of the attribute in sklearn.metrics for which the corresponding key is an alias
- {“<metric name>”: None, …}: Where each key is the name of the attribute in sklearn.metrics
- {“<metric name>”: Metric, …}: Where each key names an instance of metrics.Metric. This is the internally-used format to which all other formats will be converted
Metric callable functions should expect inputs of form (target, prediction), and should return floats. See the documentation of metrics.Metric for information regarding expected parameters and types. A short sketch of several equivalent forms appears after this parameter list
- holdout_dataset: Pandas.DataFrame, callable, str path, or None, default=None
If pd.DataFrame, this is the holdout dataset. If callable, expects a function that takes (self.train: DataFrame, self.target_column: str) as input and returns the new (self.train: DataFrame, self.holdout: DataFrame). If str, will attempt to read the file at that path via pandas.read_csv(). Else, there is no holdout set. For more information on which columns will be used during fitting/predicting, see the “Dataset columns” note in the “Notes” section below
- test_dataset: Pandas.DataFrame, str path, or None, default=None
The testing data for the experiment. Structure should be identical to that of train_dataset, except its target_column column can be empty or non-existent, because test_dataset predictions will never be evaluated. If str, will attempt to read the file at that path via pandas.read_csv(). For more information on which columns will be used during fitting/predicting, see the “Dataset columns” note in the “Notes” section below
- target_column: Str, or list, default=’target’
If str, denotes the column name in all provided datasets (except test) that contains the target output. If list, should be a list of strs designating multiple target columns. For example, in a multi-class classification dataset like UCI’s hand-written digits, target_column would be a list containing ten strings. In this example, the target_column data would be sparse, with a 1 to signify that a sample is a written example of a digit (0-9). For a working example, see ‘hyperparameter_hunter/examples/lib_keras_multi_classification_example.py’
- id_column: Str, or None, default=None
If not None, str denoting the column name in all provided datasets containing sample IDs
- do_predict_proba: Boolean, or int, default=False
If False, models.Model.fit() will call models.Model.model.predict(). If True, it will call models.Model.model.predict_proba(), and the values in all columns will be used as the actual prediction values. If do_predict_proba is an int, models.Model.fit() will call models.Model.model.predict_proba(), as is the case when do_predict_proba is True, but the int supplied as do_predict_proba declares the column index to use as the actual prediction values. For example, for a model to call the predict method, use do_predict_proba=False (default). For a model to call the predict_proba method and use all of the class probabilities, use do_predict_proba=True. To call the predict_proba method and use the class probabilities in the first column, use do_predict_proba=0. To use the second column (index 1) of the result, use do_predict_proba=1 - this often corresponds to the positive class’s probabilities in binary classification problems. To use the third column, use do_predict_proba=2, and so on
- prediction_formatter: Callable, or None, default=None
If callable, expected to have the same signature as utils.result_utils.format_predictions(). That is, the callable will receive (raw_predictions: np.array, dataset_df: pd.DataFrame, target_column: str, id_column: str or None) as input and should return a properly formatted prediction DataFrame. The callable uses raw_predictions as the content, dataset_df to provide any id column, and target_column to identify the column in which to place raw_predictions
- metrics_params: Dict, or None, default=dict()
Dictionary of extra parameters to provide to metrics.ScoringMixIn.__init__(). metrics must be provided either 1) as an input kwarg to Environment.__init__() (see metrics), or 2) as a key in metrics_params, but not both. An Exception will be raised if both are given, or if neither is given
- cv_type: Class or str, default=’KFold’
The class to define cross-validation splits. If str, it must be an attribute of sklearn.model_selection._split, and it must be a cross-validation class that inherits one of the following sklearn classes: BaseCrossValidator, or _RepeatedSplits. Valid str values include ‘KFold’, and ‘RepeatedKFold’, although there are many more. It must implement the following methods: [__init__, split]. If using a custom class, see the following tested sklearn classes for proper implementations: [KFold, StratifiedKFold, RepeatedKFold, RepeatedStratifiedKFold]. The arguments provided to cv_type.__init__() will be Environment.cv_params, which should include the following: [‘n_splits’ <int>, ‘n_repeats’ <int> (if applicable)]. cv_type.split() will receive the following arguments: [BaseExperiment.train_input_data, BaseExperiment.train_target_data]
- runs: Int, default=1
The number of times to fit a model within each fold to perform multiple-run-averaging with different random seeds
- global_random_seed: Int, default=32
The initial random seed used just before generating an Experiment’s random_seeds. This ensures consistency for random_seeds between Experiments, without having to explicitly provide it here
- random_seeds: None, or List, default=None
If None, random_seeds of the appropriate shape will be created automatically. Else, must be a list of ints of shape (cv_params[‘n_repeats’], cv_params[‘n_splits’], runs). If cv_params does not have the key n_repeats (because standard cross-validation is being used), the value will default to 1. See experiments.BaseExperiment._random_seed_initializer() for info on the expected shape
- random_seed_bounds: List, default=[0, 100000]
A list containing two integers: the lower and upper bounds, respectively, for generating an Experiment’s random seeds in experiments.BaseExperiment._random_seed_initializer(). Generally, leave this kwarg alone
- cv_params: dict, or None, default=dict()
Parameters provided upon initialization of cv_type. Keys may be any args accepted by cv_type.__init__(). The number of fold splits must be provided via “n_splits”, and the number of repeats (if applicable for cv_type) must be provided via “n_repeats”
- verbose: Int, boolean, default=3
Verbosity of printing for any experiments performed while this Environment is active
Higher values indicate more frequent logging. Logs are still recorded in the heartbeat file regardless of verbosity level. verbose only dictates which logs are visible in the console. The following table illustrates which types of logging messages will be visible with each verbosity level:
| Verbosity | Keys/IDs | Final Score | Repetitions* | Folds | Runs* | Run Starts* | Result Files | Other |
|:---------:|:--------:|:-----------:|:------------:|:-----:|:-----:|:-----------:|:------------:|:-----:|
| 0         |          |             |              |       |       |             |              |       |
| 1         | Yes      | Yes         |              |       |       |             |              |       |
| 2         | Yes      | Yes         | Yes          | Yes   |       |             |              |       |
| 3         | Yes      | Yes         | Yes          | Yes   | Yes   |             |              |       |
| 4         | Yes      | Yes         | Yes          | Yes   | Yes   | Yes         | Yes          | Yes   |
*: If such logging is deemed appropriate with the given cross-validation parameters. In other words, repetition/run logging will only be verbose if Environment was given more than one repetition/run, respectively
- file_blacklist: List of str, or None, or ‘ALL’, default=None
If a list of str, the result files named within are not saved to their respective directory in “<ASSETS_DIRNAME>/Experiments”. If None, all result files are saved. If ‘ALL’, nothing at all will be saved for the Experiments. If the path of the file that initializes an Experiment does not end with a “.py” extension, the Experiment proceeds as if “script_backup” had been added to file_blacklist. This means that backup files will not be created for Jupyter notebooks (or any other non-“.py” files). For info on acceptable values, see validate_file_blacklist()
- reporting_params: Dict, default=dict()
Parameters passed to initialize reporting.ReportingHandler
- to_csv_params: Dict, default=dict()
Parameters passed to the calls to pandas.frame.DataFrame.to_csv() in recorders. In particular, this is where an Experiment’s final prediction files are saved, so the values here will affect the format of the .csv prediction files. Warning: If to_csv_params contains the key “path_or_buf”, it will be removed. Otherwise, all items are supplied directly to to_csv(), including kwargs it might not be expecting if they are given
, including kwargs it might not be expecting if they are given- do_full_save: None, or callable, default=:func:`utils.result_utils.default_do_full_save`
If callable, expected to take an Experiment’s result description dict as input and return a boolean. If None, treated as a callable that returns True. This parameter is used by
recorders.DescriptionRecorder
to determine whether the Experiment result files following the description should also be created. If do_full_save returns False, result file-saving is stopped early, and only the description is saved. If do_full_save returns True, all files not in file_blacklist are saved normally. This allows you to skip creation of an Experiment’s predictions, logs, and heartbeats if its score does not meet some threshold you set, for example. do_full_save receives the Experiment description dict as input, so for help setting do_full_save, just look into one of your Experiment descriptions- experiment_callbacks: `LambdaCallback`, or list of `LambdaCallback` (optional)
Callbacks injected directly into Experiments, adding new functionality, or customizing existing processes. Should be a LambdaCallback or a list of such classes. LambdaCallback can be created using callbacks.bases.lambda_callback(), which documents the options for creating callbacks. experiment_callbacks will be added to the MRO of the executed Experiment class by experiment_core.ExperimentMeta at __call__ time, making experiment_callbacks new base classes of the Experiment. See callbacks.bases.lambda_callback() for more information. Note that the Experiments conducted by OptPros will still benefit from experiment_callbacks. The presence of LambdaCallbacks will affect neither Environment keys, nor Experiment keys. In other words, for the purposes of Experiment matching/recording, all other factors being equal, an Experiment with experiment_callbacks is considered identical to an Experiment without, despite whatever custom functionality was added by the LambdaCallbacks
- experiment_recorders: List, None, default=None
If not None, may be a list whose values are tuples of (<recorders.BaseRecorder descendant>, <str result_path>). The result_path str should be a path relative to results_path that specifies the directory/file in which the product of the custom recorder should be saved. The contents of experiment_recorders will be provided to recorders.RecorderList upon completion of an Experiment, and, if the subclassing documentation in recorders is followed properly, will create or update a result file for the just-executed Experiment
- save_transformed_metrics: Boolean (optional)
Declares the manner in which a model’s predictions should be evaluated through the provided metrics, with regard to target data transformations. This setting can be ignored if no transformation of the target variable takes place (either through FeatureEngineer, EngineerStep, or otherwise). The default value of save_transformed_metrics depends on the dtype of the target data in train_dataset. If all target columns are numeric, save_transformed_metrics=False, meaning metric evaluation should use the original/inverted targets and predictions. Else if any target column is non-numeric, save_transformed_metrics=True, meaning evaluation should use the transformed targets and predictions, because most metrics require numeric inputs. This is described further in the save_transformed_metrics attribute. A more descriptive name for this may be “calculate_metrics_using_transformed_predictions”, but that’s a bit verbose–even by my standards
- Other Parameters
- cross_validation_type: …
Alias for cv_type
- cross_validation_params: …
Alias for cv_params
- metrics_map: …
Alias for metrics
- reporting_handler_params: …
Alias for reporting_params
- root_results_path: …
Alias for results_path
Notes
Dataset columns: In order to specify the columns to be used by the three dataset kwargs (train_dataset, holdout_dataset, test_dataset) during fitting and predicting, a few attributes can be used. On Environment initialization, the columns specified by the following kwargs will be separated from the rest of the dataset during training/predicting: 1) target_column, which names the column containing the target output labels for the input data; and 2) id_column, which (if given) represents the name of the column that contains identifying information for each data sample, and should otherwise have no relation to the actual data. Additionally, the feature_selector kwarg of the descendants of hyperparameter_hunter.experiments.BaseExperiment (like hyperparameter_hunter.experiments.CVExperiment) is used to filter out columns of the given datasets prior to fitting. See its documentation for more information, but it can effectively be used to remove any columns from the datasets
Overriding default kwargs at environment_params_path: If you have any of the above kwargs specified in the .json file at environment_params_path (except environment_params_path, which will be ignored), you can override its value by passing it as a kwarg when initializing Environment. The contents at environment_params_path are only used when the matching kwarg supplied at initialization is None. See “/examples/environment_params_path_example.py” for details
The order of precedence for determining the value of each parameter is as follows, with items at the top having the highest priority, and deferring only to the items below if their own value is None:
1) kwargs passed directly to Environment.__init__() on initialization,
2) keys of the file at environment_params_path (if valid .json object),
3) keys of hyperparameter_hunter.environment.Environment.DEFAULT_PARAMS
do_predict_proba: Because this parameter can be either a boolean or an integer, it is important to explicitly pass booleans rather than truthy or falsey values. Similarly, only pass integers if you intend for the value to be used as a column index. Do not pass 0 to mean False, or 1 to mean True
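A minimal sketch of this precedence (the file name and its contents are hypothetical): if “env_params.json” contains {"cv_type": "RepeatedKFold", "verbose": 1}, then:
>>> env = Environment(
...     train_dataset=train_df,  # hypothetical DataFrame
...     environment_params_path="env_params.json",
...     cv_type="KFold",  # passed directly, so it overrides the file's "RepeatedKFold"
... )  # verbose is not passed (None here), so the file's value of 1 is used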
- Attributes
- train_input: DatasetSentinel
Sentinel replaced with current train input data during Model fitting/predicting. Commonly given in the model_extra_params kwargs of hyperparameter_hunter.experiments.BaseExperiment or hyperparameter_hunter.optimization.protocol_core.BaseOptPro.forge_experiment() for eval_set-like hyperparameters. Importantly, the actual value of this Sentinel is determined after performing cross-validation data splitting, and after executing FeatureEngineer
- train_target: DatasetSentinel
Like train_input, except for current train target data
- validation_input: DatasetSentinel
Like train_input, except for current validation input data
- validation_target: DatasetSentinel
Like train_input, except for current validation target data
- holdout_input: DatasetSentinel
Like train_input, except for current holdout input data
- holdout_target: DatasetSentinel
Like train_input, except for current holdout target data
Methods
environment_workflow(self)
Execute all methods required to validate the environment and run Experiments
format_result_paths(self)
Remove paths contained in file_blacklist, and format others to prepare for saving results
generate_cross_experiment_key(self)
Generate a key to describe the current Environment’s cross-experiment parameters
initialize_reporting(self)
Initialize reporting for the Environment and Experiments conducted during its lifetime
update_custom_environment_params(self)
Try to update null parameters from environment_params_path, or DEFAULT_PARAMS
validate_parameters(self)
Ensure the provided parameters are valid and properly formatted
- DEFAULT_PARAMS = {'cv_params': {}, 'cv_type': 'KFold', 'do_full_save': <function default_do_full_save>, 'do_predict_proba': False, 'environment_params_path': None, 'file_blacklist': None, 'global_random_seed': 32, 'id_column': None, 'metrics': None, 'metrics_params': {}, 'prediction_formatter': <function format_predictions>, 'random_seed_bounds': [0, 100000], 'random_seeds': None, 'reporting_params': {'console_params': None, 'float_format': '{:.5f}', 'heartbeat_params': None, 'heartbeat_path': None}, 'results_path': None, 'runs': 1, 'save_transformed_metrics': None, 'target_column': 'target', 'to_csv_params': {}, 'verbose': 3}¶
- property results_path¶
- property target_column¶
- property train_dataset¶
- property test_dataset¶
- property holdout_dataset¶
- property file_blacklist¶
- property cv_type¶
- property to_csv_params¶
- property cross_experiment_params¶
- property experiment_callbacks¶
- property save_transformed_metrics¶
If save_transformed_metrics is True, and target transformation does occur, then experiment metrics are calculated using the transformed targets and predictions, which is the form returned directly by a fitted model’s predict method. For example, if target data is label-encoded, and a feature_engineering.EngineerStep is used to one-hot encode the target, then metrics functions will receive the following as input: (one-hot-encoded targets, one-hot-encoded predictions).
Conversely, if save_transformed_metrics is False, and target transformation does occur, then experiment metrics are calculated using the inverse of the transformed targets and predictions, which is the same form as the original target data. Continuing the example of label-encoded target data and a feature_engineering.EngineerStep to one-hot encode the target, in this case metrics functions will receive the following as input: (label-encoded targets, label-encoded predictions)
- environment_workflow(self)¶
Execute all methods required to validate the environment and run Experiments
- validate_parameters(self)¶
Ensure the provided parameters are valid and properly formatted
- format_result_paths(self)¶
Remove paths contained in file_blacklist, and format others to prepare for saving results
- update_custom_environment_params(self)¶
Try to update null parameters from environment_params_path, or DEFAULT_PARAMS
- generate_cross_experiment_key(self)¶
Generate a key to describe the current Environment’s cross-experiment parameters
- initialize_reporting(self)¶
Initialize reporting for the Environment and Experiments conducted during its lifetime
- property train_input¶
Get a DatasetSentinel representing an Experiment’s fold_train_input
- Returns
- DatasetSentinel:
A Sentinel that will be converted to hyperparameter_hunter.experiments.BaseExperiment.fold_train_input upon Model initialization
- property train_target¶
Get a DatasetSentinel representing an Experiment’s fold_train_target
- Returns
- DatasetSentinel:
A Sentinel that will be converted to hyperparameter_hunter.experiments.BaseExperiment.fold_train_target upon Model initialization
- property validation_input¶
Get a DatasetSentinel representing an Experiment’s fold_validation_input
- Returns
- DatasetSentinel:
A Sentinel that will be converted to hyperparameter_hunter.experiments.BaseExperiment.fold_validation_input upon Model initialization
- property validation_target¶
Get a DatasetSentinel representing an Experiment’s fold_validation_target
- Returns
- DatasetSentinel:
A Sentinel that will be converted to hyperparameter_hunter.experiments.BaseExperiment.fold_validation_target upon Model initialization
- property holdout_input¶
Get a DatasetSentinel representing an Experiment’s holdout_input_data
- Returns
- DatasetSentinel:
A Sentinel that will be converted to hyperparameter_hunter.experiments.BaseExperiment.holdout_input_data upon Model initialization
- property holdout_target¶
Get a DatasetSentinel representing an Experiment’s holdout_target_data
- Returns
- DatasetSentinel:
A Sentinel that will be converted to hyperparameter_hunter.experiments.BaseExperiment.holdout_target_data upon Model initialization
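As an illustration of these Sentinels, a minimal sketch of an eval_set-like hyperparameter (assuming an active Environment env and XGBoost installed; the hyperparameter values are illustrative):
>>> from hyperparameter_hunter import CVExperiment
>>> from xgboost import XGBClassifier
>>> experiment = CVExperiment(
...     model_initializer=XGBClassifier,
...     model_init_params=dict(subsample=0.5),
...     model_extra_params=dict(
...         fit=dict(
...             eval_set=[(env.validation_input, env.validation_target)],  # Sentinels resolve to fold data at fit time
...             early_stopping_rounds=5,
...         )
...     ),
... )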
- hyperparameter_hunter.environment.define_holdout_set(train_set: pandas.core.frame.DataFrame, holdout_set: Union[pandas.core.frame.DataFrame, callable, str, NoneType], target_column: Union[str, List[str]]) → Tuple[pandas.core.frame.DataFrame, Union[pandas.core.frame.DataFrame, NoneType]]¶
Create holdout_set (if necessary) by loading a DataFrame from a .csv file, or by separating train_set, and return the updated (train_set, holdout_set) pair
- Parameters
- train_set: Pandas.DataFrame
Training DataFrame. Will be split into train/holdout data, if holdout_set is callable
- holdout_set: Pandas.DataFrame, callable, str, or None
If pd.DataFrame, this is the holdout dataset. If callable, expects a function that takes (train_set, target_column) as input and returns the new (train_set, holdout_set). If str, will attempt to read the file at that path via pandas.read_csv(). Else, no holdout set
- target_column: Str, or list
If str, denotes the column name in provided datasets that contains the target output. If list, should be a list of strs designating multiple target columns
- Returns
- train_set: Pandas.DataFrame
train_set if holdout_set is not callable. Else train_set modified by holdout_set
- holdout_set: Pandas.DataFrame, or None
Original DataFrame, or DataFrame read from str filepath, or a portion of train_set if holdout_set is callable, or None
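For the callable form of holdout_set, a minimal sketch of a conforming splitter (the function name and split ratio are illustrative):
>>> from sklearn.model_selection import train_test_split
>>> def extract_holdout(train_set, target_column):
...     # Reserve 20% of train_set as holdout data; target_column is unused here,
...     # but the callable must accept it
...     train, holdout = train_test_split(train_set, test_size=0.2, random_state=32)
...     return train, holdout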
- hyperparameter_hunter.environment.validate_file_blacklist(blacklist)¶
Validate the contents of blacklist. For most values, the corresponding file is saved upon completion of the experiment. See the “Notes” section below for details on some special cases
- Parameters
- blacklist: List of strings, or None
The result files that should not be saved
- Returns
- blacklist: List
If not empty, acceptable list of result file types to blacklist
Notes
‘heartbeat’: If the heartbeat file is saved, a new file is not generated and saved to the “Experiments/Heartbeats” directory as is the case with most other files. Instead, the general “Heartbeat.log” file is copied and renamed to the current experiment id, then saved to the appropriate dir. This is because the general “Heartbeat.log” file represents the heartbeat for whatever experiment is currently in progress.
‘script_backup’: This file is saved as quickly as possible after starting a new experiment, rather than waiting for the experiment to end. There are two reasons for this behavior: 1) to avoid saving any changes that may have been made to a file after it has been executed, and 2) to have the offending file in the event of a catastrophic failure that results in no other files being saved. As stated in the documentation of the file_blacklist parameter of Environment, if the path of the file that initializes an Experiment does not end with a “.py” extension, the Experiment proceeds as if “script_backup” had been added to blacklist. This means that backup files will not be created for Jupyter notebooks (or any other non-“.py” files)
‘description’ and ‘tested_keys’: These two results types constitute a bare minimum of sorts for experiment recording. If either of these two are blacklisted, then as far as the library is concerned, the experiment never took place.
‘tested_keys’ (continued): If this string is included in the blacklist, then the contents of the “KeyAttributeLookup” directory will also be excluded from the list of files to update
‘current_heartbeat’: The general heartbeat file that should be stored at ‘HyperparameterHunterAssets/Heartbeat.log’. If this value is blacklisted, then ‘heartbeat’ is also added to blacklist automatically out of necessity. This is done because the heartbeat file for the current experiment cannot be created as a copy of the general heartbeat file if the general heartbeat file is never created in the first place
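As a sketch of blacklisting in practice (other Environment kwargs omitted for brevity), the special values discussed above can be combined freely:
>>> env = Environment(
...     train_dataset=train_df,  # hypothetical DataFrame
...     file_blacklist=["script_backup", "heartbeat"],  # skip these result files
... )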
hyperparameter_hunter.exceptions module¶
This module defines a few custom Exception classes, and it provides the means for Exceptions to be added to the Heartbeat result files of Experiments
Related¶
hyperparameter_hunter.reporting
This module executes hyperparameter_hunter.exception_handler.hook_exception_handler() to ensure that any raised Exceptions are also recorded in the Heartbeat files of the Experiment for which the Exception was raised, in order to assist in debugging
- hyperparameter_hunter.exceptions.handle_exception(exception_type, exception_value, exception_traceback)¶
Intercept raised exceptions to ensure they are included in an Experiment’s log files
- Parameters
- exception_type: Exception
The class type of the exception that was raised
- exception_value: Str
The message produced by the exception
- exception_traceback: Exception.traceback
The traceback provided by the raised exception
- Raises
- SystemExit
If exception_type is a subclass of KeyboardInterrupt
- hyperparameter_hunter.exceptions.hook_exception_handler()¶
Set sys.excepthook to hyperparameter_hunter.exception_handler.handle_exception()
- exception hyperparameter_hunter.exceptions.EnvironmentInactiveError(message=None, extra='')¶
Bases: Exception
Exception raised when an active instance of hyperparameter_hunter.environments.Environment is not detected
- Parameters
- message: String, or None, default=None
A message to provide upon raising EnvironmentInactiveError
- extra: String, default=’’
Extra content to append onto the end of message before raising the Exception
- exception hyperparameter_hunter.exceptions.EnvironmentInvalidError(message=None, extra='')¶
Bases: Exception
Exception raised when there is an active instance of hyperparameter_hunter.environments.Environment, but it is invalid for some reason
- Parameters
- message: String, or None, default=None
A message to provide upon raising EnvironmentInvalidError
- extra: String, default=’’
Extra content to append onto the end of message before raising the Exception
- exception hyperparameter_hunter.exceptions.RepeatedExperimentError(message=None, extra='')¶
Bases: Exception
Exception raised when a saved Experiment is found with the same hyperparameters as the Experiment being executed
- Parameters
- message: String, or None, default=None
A message to provide upon raising RepeatedExperimentError
- extra: String, default=’’
Extra content to append onto the end of message before raising the Exception
- exception hyperparameter_hunter.exceptions.IncompatibleCandidateError(candidate, template)¶
Bases: Exception
Exception raised when a candidate hyperparameter set is incompatible with a template
- Parameters
- candidate: Any
Hyperparameter set that is incompatible with the choices/concrete values of template
- template: Any
Hyperparameter set defined by forge_experiment(). May include any combination of space choices and concrete values
- exception hyperparameter_hunter.exceptions.ContinueRemap¶
Bases: Exception
- exception hyperparameter_hunter.exceptions.DeprecatedWarning(obj_name, v_deprecate, v_remove, details='')¶
Bases: DeprecationWarning
Warning class for deprecated callables. This is a specialization of the built-in DeprecationWarning, adding parameters that allow us to get information into the __str__ that ends up being sent through the warnings system. The attributes cannot be retrieved after the warning gets raised and passed through the system, as only the class–not the instance–and the message are preserved
- Parameters
- obj_name: String
The name of the callable being deprecated
- v_deprecate: String
The version that obj is deprecated in
- v_remove: String
The version that obj gets removed in
- details: String, default=””
Deprecation details, such as directions on what to use instead of the deprecated code
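A minimal sketch of issuing this warning (the callable name and version strings are illustrative):
>>> import warnings
>>> warnings.warn(DeprecatedWarning("old_func", "3.0.0", "3.2.0", details="Use `new_func` instead"))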
- exception hyperparameter_hunter.exceptions.UnsupportedWarning(obj_name, v_deprecate, v_remove, details='')¶
Bases: hyperparameter_hunter.exceptions.DeprecatedWarning
Warning class for a callable, warning that it is unsupported
hyperparameter_hunter.experiment_core module¶
This module is the core of all of the experimentation in hyperparameter_hunter, hence its name. It is impossible to understand hyperparameter_hunter.experiments without first having a grasp on what hyperparameter_hunter.experiment_core.ExperimentMeta is doing. This module serves to bridge the gap between Experiments and hyperparameter_hunter.callbacks by dynamically making Experiments inherit various callbacks, depending on the inputs given, in order to make Experiments completely functional
Related¶
hyperparameter_hunter.experiments
Defines the structure of the experimentation process. While certainly very important, hyperparameter_hunter.experiments wouldn’t do much at all without hyperparameter_hunter.callbacks, or hyperparameter_hunter.experiment_core
hyperparameter_hunter.callbacks
Defines parent classes to the classes defined in hyperparameter_hunter.experiments. This not only makes it very easy to find the entire workflow for a given task, but also ensures that each instance of an Experiment inherits exactly the functionality that it needs. For example, if no holdout data was given, then experiment_core.ExperimentMeta will not add callbacks.evaluators.EvaluatorHoldout or callbacks.predictors.PredictorHoldout to the list of callbacks inherited by the Experiment. This means that the Experiment never needs to check for the existence of holdout data in order to determine how it should proceed, because it literally doesn’t have the code that deals with holdout data
Notes¶
Was a metaclass really necessary here? Probably not, but it’s being used for two reasons: 1) metaclasses are fun, and programming (especially artificial intelligence) should be fun; and 2) it allowed for a very clean separation between the various functions demanded by Experiments that are provided by hyperparameter_hunter.callbacks. Having each of the callbacks separated in their own classes makes it very easy to debug existing functionality, and to add new callbacks in the future
- class hyperparameter_hunter.experiment_core.ExperimentMeta¶
Bases: type
Create a new class object that stores necessary class-wide callbacks to __class_wide_bases
Methods
__call__(cls, *args, **kwargs)
Store necessary instance-wide callbacks to __instance_bases, sort all dynamically added callback base classes, then add them to the instance
mro()
Return a type’s method resolution order
- hyperparameter_hunter.experiment_core.base_callback_class_sorter(auxiliary_bases, parent_class_order=None)¶
Sort callback classes in order to preserve the intended MRO of their descendant, and to enable callbacks that may depend on one another to function properly
- Parameters
- auxiliary_bases: List
The callback classes to be sorted according to the order in which their parent is found in parent_class_order. For example, if a class (x) in auxiliary_bases is the only descendant of the last class in parent_class_order, then class x will be moved to the last position in sorted_auxiliary_bases. If multiple classes in auxiliary_bases are descendants of the same parent in parent_class_order, they will be sorted alphabetically (from A-Z)
- parent_class_order: List, or None, default=<See description>
List of base callback classes that define the sort order for auxiliary_bases. Note that these are not the normal callback classes that add to the functionality of an Experiment, but the base classes from which the callback classes descend. All the classes in parent_class_order should be defined in hyperparameter_hunter.callbacks.bases. The last class in parent_class_order should be hyperparameter_hunter.callbacks.bases.BaseCallback, which is the parent class for all other base classes. This ensures that custom callbacks defined by hyperparameter_hunter.callbacks.bases.lambda_callback() will be recognized as valid and executed last
- Returns
- sorted_auxiliary_bases: List
The contents of auxiliary_bases sorted according to their parents’ location in parent_class_order, then alphabetically
- Raises
- ValueError
If auxiliary_bases contains a class that is not a descendant of any of the classes in parent_class_order
Examples
>>> in_0 = [AggregatorEvaluations, AggregatorTimes, EvaluatorOOF, EvaluatorHoldout, LoggerFitStatus, PredictorOOF, PredictorHoldout, PredictorTest]
>>> out_0 = [PredictorHoldout, PredictorOOF, PredictorTest, EvaluatorHoldout, EvaluatorOOF, AggregatorEvaluations, AggregatorTimes, LoggerFitStatus]
>>> assert base_callback_class_sorter(in_0) == out_0
>>> in_1 = [AggregatorEvaluations, AggregatorTimes, EvaluatorOOF, EvaluatorHoldout, LoggerFitStatus, PredictorOOF, PredictorHoldout, PredictorTest]
>>> out_1 = [PredictorHoldout, PredictorOOF, PredictorTest, EvaluatorHoldout, EvaluatorOOF, AggregatorEvaluations, AggregatorTimes, LoggerFitStatus]
>>> assert base_callback_class_sorter(in_1) == out_1
>>> in_2 = [PredictorOOF, PredictorHoldout, AggregatorTimes, PredictorTest, AggregatorEvaluations, EvaluatorOOF, EvaluatorHoldout, LoggerFitStatus]
>>> out_2 = [PredictorHoldout, PredictorOOF, PredictorTest, EvaluatorHoldout, EvaluatorOOF, AggregatorEvaluations, AggregatorTimes, LoggerFitStatus]
>>> assert base_callback_class_sorter(in_2) == out_2
>>> in_3 = [PredictorTest, EvaluatorHoldout, LoggerFitStatus, AggregatorTimes, PredictorHoldout, PredictorOOF, AggregatorEvaluations, EvaluatorOOF]
>>> out_3 = [PredictorHoldout, PredictorOOF, PredictorTest, EvaluatorHoldout, EvaluatorOOF, AggregatorEvaluations, AggregatorTimes, LoggerFitStatus]
>>> assert base_callback_class_sorter(in_3) == out_3
>>> in_4 = [LoggerFitStatus, EvaluatorOOF, PredictorTest, EvaluatorHoldout, AggregatorTimes, AggregatorEvaluations, PredictorHoldout, PredictorOOF]
>>> out_4 = [PredictorHoldout, PredictorOOF, PredictorTest, EvaluatorHoldout, EvaluatorOOF, AggregatorEvaluations, AggregatorTimes, LoggerFitStatus]
>>> assert base_callback_class_sorter(in_4) == out_4
>>> in_5 = [AggregatorEvaluations, PredictorTest, PredictorOOF, EvaluatorOOF, EvaluatorHoldout]
>>> out_5 = [PredictorOOF, PredictorTest, EvaluatorHoldout, EvaluatorOOF, AggregatorEvaluations]
>>> assert base_callback_class_sorter(in_5) == out_5
>>> in_6 = [EvaluatorOOF, PredictorOOF, EvaluatorHoldout, AggregatorEvaluations, PredictorTest]
>>> out_6 = [PredictorOOF, PredictorTest, EvaluatorHoldout, EvaluatorOOF, AggregatorEvaluations]
>>> assert base_callback_class_sorter(in_6) == out_6
>>> in_7 = [PredictorTest, EvaluatorHoldout, PredictorOOF]
>>> out_7 = [PredictorOOF, PredictorTest, EvaluatorHoldout]
>>> assert base_callback_class_sorter(in_7) == out_7
>>> in_8 = [PredictorTest, PredictorOOF, EvaluatorHoldout]
>>> out_8 = [PredictorOOF, PredictorTest, EvaluatorHoldout]
>>> assert base_callback_class_sorter(in_8) == out_8
>>> base_callback_class_sorter([type("Foo", (object,), {}), PredictorTest, EvaluatorHoldout, PredictorOOF])
Traceback (most recent call last):
  File "experiment_core.py", line ?, in base_callback_class_sorter
ValueError: Base class not descendant of acceptable parent class: [<class 'hyperparameter_hunter.experiment_core.Foo'>]
hyperparameter_hunter.experiments module¶
This module contains the classes used for constructing and conducting an Experiment (most notably, CVExperiment). Any class contained herein whose name starts with “Base” should not be used directly. CVExperiment is the preferred means of conducting one-off experimentation
Related¶
hyperparameter_hunter.experiment_core
Defines ExperimentMeta, an understanding of which is critical to being able to understand experiments
hyperparameter_hunter.metrics
Defines ScoringMixIn, a parent of experiments.BaseExperiment that enables scoring and evaluating models
hyperparameter_hunter.models
Used to instantiate the actual learning models, which are a single part of the entire experimentation workflow, albeit the most significant part
Notes¶
As mentioned above, the inner workings of experiments will be very confusing without a grasp on what’s going on in experiment_core and its related modules
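A minimal sketch of one-off experimentation with CVExperiment (assuming an active Environment and scikit-learn installed; the hyperparameter values are illustrative):
>>> from hyperparameter_hunter import CVExperiment
>>> from sklearn.ensemble import RandomForestClassifier
>>> experiment = CVExperiment(
...     model_initializer=RandomForestClassifier,
...     model_init_params=dict(n_estimators=200, max_depth=5),
... )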
- class hyperparameter_hunter.experiments.BaseExperiment(model_initializer, model_init_params=None, model_extra_params=None, feature_engineer=None, feature_selector=None, notes=None, do_raise_repeated=False, auto_start=True, target_metric=None)¶
Bases: hyperparameter_hunter.metrics.ScoringMixIn
One-off Experimentation base class
Bare-bones Description: Runs the cross-validation scheme defined by Environment, during which 1) Datasets are processed according to feature_engineer; 2) Models are built by instantiating model_initializer with model_init_params; 3) Models are trained on processed data, optionally using parameters from model_extra_params; 4) Results are logged and recorded for each fitting period; 5) Descriptions, predictions, results (both averages and individual periods), etc. are saved.
What’s the Big Deal? The most important takeaway from the above description is that descriptions/results are THOROUGH and REUSABLE. By thorough, I mean that all of a model’s hyperparameters are saved, not just the ones given in model_init_params. This may sound odd, but it’s important because it makes results reusable during optimization, when you may be using a different set of hyperparameters. It helps with other things like preventing duplicate experiments and ensembling, as well. But the big part is that this transforms hyperparameter optimization from an isolated, throwaway process we can only afford when an ML project is sufficiently “mature” to a process that covers the entire lifespan of a project. No Experiment is forgotten or wasted. Optimization is automatically given the data it needs to succeed by drawing on all your past Experiments and optimization rounds.
The Experiment has three primary missions:
1. Act as a scaffold for organizing ML Experimentation and optimization
2. Record Experiment descriptions and results
3. Eliminate lots of repetitive/error-prone boilerplate code
Providing a scaffold for the entire ML process is critical because without a standardized format, everything we do looks different. Without a unified scaffold, development is slower, more confusing, and less adaptable. One of the benefits of standardizing the format of ML Experimentation is that it enables us to exhaustively record all the important characteristics of an Experiment, as well as an assortment of customizable result files – all in a way that allows them to be reused in the future.
What About Data/Metrics? Experiments require an active Environment in order to function, from which the Experiment collects important cross-experiment parameters, such as datasets, metrics, cross-validation schemes, and even callbacks to inherit, among many other properties documented in Environment
- Parameters
- model_initializer: Class, or functools.partial, or class instance
Algorithm class used to initialize a model, such as XGBoost’s XGBRegressor, or SKLearn’s KNeighborsClassifier; although, there are hundreds of possibilities across many different ML libraries. model_initializer is expected to define at least fit and predict methods. model_initializer will be initialized with model_init_params, and its “extra” methods (fit, predict, etc.) will be invoked with parameters in model_extra_params
- model_init_params: Dict, or object (optional)
Dictionary of arguments given to create an instance of model_initializer. Any kwargs that are considered valid by the __init__ method of model_initializer are valid in model_init_params.
One of the key features that makes HyperparameterHunter so magical is that ALL hyperparameters in the signature of model_initializer (and their default values) are discovered – whether or not they are explicitly given in model_init_params. Not only does this make Experiment result descriptions incredibly thorough, it also makes optimization smoother, more effective, and far less work for the user. For example, take LightGBM’s LGBMRegressor, with model_init_params=dict(learning_rate=0.2). HyperparameterHunter recognizes that this differs from the default of 0.1. It also recognizes that LGBMRegressor is actually initialized with more than a dozen other hyperparameters we didn’t bother mentioning, and it records their values, too. So if we want to optimize num_leaves tomorrow, the OptPro doesn’t start from scratch. It knows that we ran an Experiment that didn’t explicitly mention num_leaves, but its default value was 31, and it uses this information to fuel optimization – all without us having to manually keep track of tons of janky collections of hyperparameters. In fact, we really don’t need to go out of our way at all. HyperparameterHunter just acts as our faithful lab assistant, keeping track of all the stuff we’d rather not worry about
- model_extra_params: Dict (optional)
Dictionary of extra parameters for models’ non-initialization methods (like fit, predict, predict_proba, etc.), and for neural networks. To specify parameters for an extra method, place them in a dict named for the extra method to which the parameters should be given. For example, to call fit with early_stopping_rounds=5, use model_extra_params=dict(fit=dict(early_stopping_rounds=5)).
For models whose fit methods have a kwarg like eval_set (such as XGBoost’s), one can use the DatasetSentinel attributes of the currently active Environment, documented under its “Attributes” section and under train_input. An example using several DatasetSentinels can be found in HyperparameterHunter’s XGBoost Classification Example: https://github.com/HunterMcGushion/hyperparameter_hunter/blob/master/examples/xgboost_examples/classification.py
- feature_engineer: FeatureEngineer, or list (optional)
Feature engineering/transformation/pre-processing steps to apply to datasets defined in Environment. If a list, it will be used to initialize FeatureEngineer, and can contain any of the following values:
1. EngineerStep instance
2. Function input to EngineerStep
For important information on properly formatting EngineerStep functions, please see the documentation of EngineerStep. OptPros can perform hyperparameter optimization of feature_engineer steps. This capability adds a third allowed value to the above list and is documented in forge_experiment()
- feature_selector: List of str, callable, or list of booleans (optional)
Column names to include as input data for all provided DataFrames. If None, feature_selector is set to all columns in train_dataset, less target_column, and id_column. feature_selector is provided as the second argument for calls to pandas.DataFrame.loc when constructing datasets
- notes: String (optional)
Additional information about the Experiment that will be saved with the Experiment’s description result file. This serves no purpose other than to facilitate saving Experiment details in a more readable format
- do_raise_repeated: Boolean, default=False
If True and this Experiment locates a previous Experiment’s results with matching Environment and Hyperparameter Keys, a RepeatedExperimentError will be raised. Else, a warning will be logged
- auto_start: Boolean, default=True
If True, after the Experiment is initialized, it will automatically call BaseExperiment.preparation_workflow(), followed by BaseExperiment.experiment_workflow(), effectively completing all essential tasks without requiring additional method calls
- target_metric: Tuple, str, default=(‘oof’, <environment.Environment.metrics[0]>)
Path denoting the metric to be used to compare completed Experiments or to use for certain early stopping procedures in some model classes. The first value should be one of [‘oof’, ‘holdout’, ‘in_fold’]. The second value should be the name of a metric being recorded according to the values supplied in hyperparameter_hunter.environment.Environment.metrics_params. See the documentation for hyperparameter_hunter.metrics.get_formatted_target_metric() for more info. Any values returned by, or used as the target_metric input to, this function are acceptable values for target_metric
- callbacks: LambdaCallback, or list of LambdaCallback (optional)
Callbacks injected directly into the concrete Experiment (CVExperiment), adding new functionality, or customizing existing processes. Should be a LambdaCallback or a list of such classes. LambdaCallback can be created using callbacks.bases.lambda_callback(), which documents the options for creating callbacks. callbacks will be added to the MRO of the Experiment by experiment_core.ExperimentMeta at __call__ time, making callbacks new base classes of the Experiment. See callbacks.bases.lambda_callback() for more information. The presence of LambdaCallbacks will not affect Experiment keys. In other words, for the purposes of Experiment matching/recording, all other factors being equal, an Experiment with callbacks is considered identical to an Experiment without, despite whatever custom functionality was added by the LambdaCallbacks
See also
hyperparameter_hunter.optimization.protocol_core.BaseOptPro.forge_experiment()
OptPro method to define a hyperparameter search scaffold for building Experiments during optimization. This method follows the same format as Experiment initialization, but it adds the ability to provide hyperparameter values as ranges to search over, via subclasses of Dimension. The other notable difference is that forge_experiment removes the auto_start and target_metric kwargs, which is described in the forge_experiment docstring Notes
Environment
Provides critical information on how Experiments should be conducted, as well as the data to be used by Experiments. An Environment must be active before executing any Experiment or OptPro
lambda_callback()
Enables customization of the Experimentation process and access to all Experiment internals through a collection of methods that are invoked at all the important periods over an Experiment’s lifespan. These can be provided via the experiment_callbacks kwarg of Environment, and the callback classes literally get thrown in to the parent classes of the Experiment, so they’re kind of a big deal
Methods
evaluate(self, data_type, target, prediction)
Apply metric(s) to the given data to calculate the value of the prediction
execute(self)
Execute the fitting protocol for the Experiment, comprising the following: instantiation of learners for each run, preprocessing of data as appropriate, training learners, making predictions, and evaluating and aggregating those predictions and other stats/metrics for later use
experiment_workflow(self)
Define the actual experiment process, including execution, result saving, and cleanup
on_exp_start(self)
Prepare data prior to executing the fitting protocol (cross-validation), by 1) Initializing formal datasets attributes, 2) Invoking feature_engineer to perform “pre_cv”-stage preprocessing, and 3) Updating datasets to include their (transformed) counterparts in feature_engineer
preparation_workflow(self)
Execute all tasks that must take place before the experiment is actually started.
- experiment_workflow(self)¶
Define the actual experiment process, including execution, result saving, and cleanup
- preparation_workflow(self)¶
Execute all tasks that must take place before the experiment is actually started. Such tasks include (but are not limited to): Creating experiment IDs and hyperparameter keys, creating script backups, and validating parameters
- abstract execute(self)¶
Execute the fitting protocol for the Experiment, comprising the following: instantiation of learners for each run, preprocessing of data as appropriate, training learners, making predictions, and evaluating and aggregating those predictions and other stats/metrics for later use
- class hyperparameter_hunter.experiments.BaseCVExperiment(model_initializer, model_init_params=None, model_extra_params=None, feature_engineer=None, feature_selector=None, notes=None, do_raise_repeated=False, auto_start=True, target_metric=None)¶
Bases: hyperparameter_hunter.experiments.BaseExperiment
Methods
cross_validation_workflow(self)
Execute workflow for cross-validation process, consisting of the following tasks: 1) Create train and validation split indices for all folds, 2) Iterate through folds, performing cv_fold_workflow for each, 3) Average accumulated predictions over fold splits, 4) Evaluate final predictions, 5) Format final predictions to prepare for saving
cv_fold_workflow(self)
Execute workflow for individual fold, consisting of the following tasks: 1) Execute overridden on_fold_start() tasks, 2) Perform cv_run_workflow for each run, 3) Execute overridden on_fold_end() tasks
cv_run_workflow(self)
Execute run workflow, consisting of: 1) Execute overridden on_run_start() tasks, 2) Initialize and fit Model, 3) Execute overridden on_run_end() tasks
evaluate(self, data_type, target, prediction)
Apply metric(s) to the given data to calculate the value of the prediction
execute(self)
Execute the fitting protocol for the Experiment, comprising the following: instantiation of learners for each run, preprocessing of data as appropriate, training learners, making predictions, and evaluating and aggregating those predictions and other stats/metrics for later use
experiment_workflow(self)
Define the actual experiment process, including execution, result saving, and cleanup
on_exp_start(self)
Prepare data prior to executing the fitting protocol (cross-validation), by 1) Initializing formal datasets attributes, 2) Invoking feature_engineer to perform “pre_cv”-stage preprocessing, and 3) Updating datasets to include their (transformed) counterparts in feature_engineer
on_fold_start(self)
Override on_fold_start() tasks set by experiment_core.ExperimentMeta, consisting of: 1) Split train/validation data, 2) Make copies of holdout/test data for current fold (for feature engineering), 3) Log start, 4) Execute original tasks
on_run_start(self)
Override on_run_start() tasks organized by experiment_core.ExperimentMeta, consisting of: 1) Set random seed and update model parameters according to current seed, 2) Log run start, 3) Execute original tasks
preparation_workflow(self)
Execute all tasks that must take place before the experiment is actually started.
-
execute
(self)¶ Execute the fitting protocol for the Experiment, comprising the following: instantiation of learners for each run, preprocessing of data as appropriate, training learners, making predictions, and evaluating and aggregating those predictions and other stats/metrics for later use
-
cross_validation_workflow
(self)¶ Execute workflow for cross-validation process, consisting of the following tasks: 1) Create train and validation split indices for all folds, 2) Iterate through folds, performing cv_fold_workflow for each, 3) Average accumulated predictions over fold splits, 4) Evaluate final predictions, 5) Format final predictions to prepare for saving
-
on_fold_start
(self)¶ Override
on_fold_start()
tasks set by experiment_core.ExperimentMeta
, consisting of: 1) Split train/validation data, 2) Make copies of holdout/test data for current fold (for feature engineering), 3) Log start, 4) Execute original tasks
-
cv_fold_workflow
(self)¶ Execute workflow for individual fold, consisting of the following tasks: 1) Execute overridden
on_fold_start()
tasks, 2) Perform cv_run_workflow for each run, 3) Execute overridden on_fold_end()
tasks
-
on_run_start
(self)¶ Override
on_run_start()
tasks organized by experiment_core.ExperimentMeta
, consisting of: 1) Set random seed and update model parameters according to current seed, 2) Log run start, 3) Execute original tasks
-
cv_run_workflow
(self)¶ Execute run workflow, consisting of: 1) Execute overridden
on_run_start()
tasks, 2) Initialize and fit Model, 3) Execute overridden on_run_end()
tasks
-
-
class
hyperparameter_hunter.experiments.
CVExperiment
(model_initializer, model_init_params=None, model_extra_params=None, feature_engineer=None, feature_selector=None, notes=None, do_raise_repeated=False, auto_start=True, target_metric=None, callbacks=None)¶ Bases:
hyperparameter_hunter.experiments.BaseCVExperiment
- Attributes
- source_script
Methods
cross_validation_workflow
(self)Execute workflow for cross-validation process, consisting of the following tasks: 1) Create train and validation split indices for all folds, 2) Iterate through folds, performing cv_fold_workflow for each, 3) Average accumulated predictions over fold splits, 4) Evaluate final predictions, 5) Format final predictions to prepare for saving
cv_fold_workflow
(self)Execute workflow for individual fold, consisting of the following tasks: 1) Execute overridden on_fold_start() tasks, 2) Perform cv_run_workflow for each run, 3) Execute overridden on_fold_end() tasks
cv_run_workflow
(self)Execute run workflow, consisting of: 1) Execute overridden on_run_start() tasks, 2) Initialize and fit Model, 3) Execute overridden on_run_end() tasks
evaluate
(self, data_type, target, prediction)Apply metric(s) to the given data to calculate the value of the prediction
execute
(self)Execute the fitting protocol for the Experiment, comprising the following: instantiation of learners for each run, preprocessing of data as appropriate, training learners, making predictions, and evaluating and aggregating those predictions and other stats/metrics for later use
experiment_workflow
(self)Define the actual experiment process, including execution, result saving, and cleanup
on_exp_start
(self)Prepare data prior to executing fitting protocol (cross-validation), by 1) Initializing formal datasets attributes, 2) Invoking feature_engineer to perform “pre_cv”-stage preprocessing, and 3) Updating datasets to include their (transformed) counterparts in feature_engineer
on_fold_start
(self)Override on_fold_start() tasks set by experiment_core.ExperimentMeta, consisting of: 1) Split train/validation data, 2) Make copies of holdout/test data for current fold (for feature engineering), 3) Log start, 4) Execute original tasks
on_run_start
(self)Override on_run_start() tasks organized by experiment_core.ExperimentMeta, consisting of: 1) Set random seed and update model parameters according to current seed, 2) Log run start, 3) Execute original tasks
preparation_workflow
(self)Execute all tasks that must take place before the experiment is actually started.
-
source_script
= None¶
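The snippet below is a minimal usage sketch, not part of the API reference proper: it assumes an active Environment, and the my_train_df DataFrame and XGBoost classifier are illustrative stand-ins, not prescribed by the library.
>>> from hyperparameter_hunter import Environment, CVExperiment
>>> from xgboost import XGBClassifier
>>> env = Environment(
...     train_dataset=my_train_df,  # Hypothetical DataFrame containing a "target" column
...     results_path="HyperparameterHunterAssets",
...     metrics=["roc_auc_score"],
...     cv_type="StratifiedKFold",
...     cv_params=dict(n_splits=5, shuffle=True, random_state=32),
... )
>>> # With an active Environment and the default auto_start=True,
>>> # cross-validation runs on instantiation
>>> experiment = CVExperiment(XGBClassifier, model_init_params=dict(max_depth=3))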
-
hyperparameter_hunter.experiments.
get_cv_indices
(folds, cv_params, input_data, target_data)¶ Produce iterables of cross validation indices in the shape of (n_repeats, n_folds)
- Parameters
- folds: Instance of `cv_type`
Cross validation folds object, whose
split()
receives input_data and target_data
- cv_params: Dict
Parameters given to instantiate folds. Must contain n_splits. May contain n_repeats
- input_data: pandas.DataFrame
Input data to be split by folds, to which yielded indices will correspond
- target_data: pandas.DataFrame
Target data to be split by folds, to which yielded indices will correspond
- Yields
- Generator
Cross validation indices in shape of (<n_repeats or 1>, <n_splits>)
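Examples
A hedged sketch of driving get_cv_indices directly, assuming scikit-learn’s RepeatedKFold as the folds object; the toy DataFrames are illustrative, and the nesting follows the documented (n_repeats, n_folds) shape:
>>> import pandas as pd
>>> from sklearn.model_selection import RepeatedKFold
>>> X = pd.DataFrame(dict(a=range(6)))
>>> y = pd.DataFrame(dict(target=[0, 1, 0, 1, 0, 1]))
>>> folds = RepeatedKFold(n_splits=3, n_repeats=2)
>>> for repetition in get_cv_indices(folds, dict(n_splits=3, n_repeats=2), X, y):
...     for train_indices, validation_indices in repetition:
...         pass  # One (train, validation) index pair per fold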
hyperparameter_hunter.feature_engineering module¶
This module organizes and executes feature engineering/preprocessing step functions. The central
components of the module are FeatureEngineer
and EngineerStep
- everything else
is built to support those two classes. This module works with a very broad definition of
“feature engineering”. The following is a non-exhaustive list of transformations that are
considered valid for FeatureEngineer step functions:
Manual feature creation
Input data scaling/normalization/standardization
Target data transformation
Re-sampling
Data imputation
Feature selection/elimination
Encoding (one-hot, label, etc.)
Binarization/binning/discretization
Feature extraction (as for NLP/image recognition tasks)
Feature shuffling
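To make the definition concrete, here is a minimal sketch of one such step (data imputation) handed to FeatureEngineer; the function name is illustrative:
>>> def impute_negative_one(all_inputs):
...     all_inputs.fillna(-1, inplace=True)
...     return all_inputs
>>> fe = FeatureEngineer([impute_negative_one])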
Related¶
hyperparameter_hunter.space
Only related when optimizing FeatureEngineer steps within an Optimization Protocol, but defines
Categorical
, which is the mechanism for defining a feature engineer step search space, and RejectedOptional
, which is used to represent the absence of a feature engineer step, when labeled as optional
-
class
hyperparameter_hunter.feature_engineering.
EMPTY_SENTINEL
¶ Bases:
object
-
class
hyperparameter_hunter.feature_engineering.
DatasetNameReport
(params: Tuple[str], stage: str)¶ Bases:
object
Characterize the relationships between the dataset names in params
- Parameters
- params: Tuple[str]
Dataset names requested by a feature engineering step callable. Must be a subset of {“train_data”, “train_inputs”, “train_targets”, “validation_data”, “validation_inputs”, “validation_targets”, “holdout_data”, “holdout_inputs”, “holdout_targets”, “test_inputs”, “all_data”, “all_inputs”, “all_targets”, “non_train_data”, “non_train_inputs”, “non_train_targets”}
- stage: String in {“pre_cv”, “intra_cv”}
Feature engineering stage during which the datasets params are requested
- Attributes
- merged_datasets: List[tuple]
Tuples of strings denoting paths to datasets that represent a merge between multiple datasets. Merged datasets are those prefixed with either “all” or “non_train”. These paths are locations in descendants
- coupled_datasets: List[tuple]
Tuples of strings denoting paths to datasets that represent a coupling of “inputs” and “targets” datasets. Coupled datasets are those suffixed with “data”. These paths are locations in descendants, and the values at each path should be a dict containing keys with “inputs” and “targets” suffixes
- leaves: Dict[tuple, str]
Mapping of full path tuples in descendants to their leaf values. Tuple paths represent the steps necessary to reach the standard dataset leaf value in descendants by traversing merged and coupled datasets. Values in leaves should be identical to the last element of the corresponding tuple key
- descendants: DescendantsType
Nested dict in which all keys are dataset name strings, and all leaf values are None. Represents the structure of the requested dataset names, traversing over merged and coupled datasets (if necessary) in order to reach the standard dataset leaves
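A hedged sketch of direct instantiation (DatasetNameReport is primarily used internally; the comments restate what the attribute descriptions above imply, rather than verified output):
>>> report = DatasetNameReport(("train_inputs", "non_train_inputs"), "intra_cv")
>>> merged = report.merged_datasets  # Paths for the "non_train"-prefixed merge
>>> leaves = report.leaves           # Full traversal paths to standard dataset names
>>> tree = report.descendants        # Nested dict whose leaf values are all None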
-
hyperparameter_hunter.feature_engineering.
names_for_merge
(merge_to:str, stage:str) → List[str]¶ Retrieve the names of the standard datasets that are allowed to be included in a merged DataFrame of type merge_to at stage stage
- Parameters
- merge_to: String
Type of merged dataframe to produce. Should be one of the following: {“all_data”, “all_inputs”, “all_targets”, “non_train_data”, “non_train_inputs”, “non_train_targets”}
- stage: String in {“pre_cv”, “intra_cv”}
Feature engineering stage for which the merged dataframe is requested. The results produced with each option differ only in that a merged_df created with stage=”pre_cv” will never contain “validation” data because it doesn’t exist before cross-validation has begun. Conversely, a merged_df created with stage=”intra_cv” will contain the appropriate “validation” data if it exists
- Returns
- names: List
Subset of {“train_data”, “train_inputs”, “train_targets”, “validation_data”, “validation_inputs”, “validation_targets”, “holdout_data”, “holdout_inputs”, “holdout_targets”, “test_inputs”}
Examples
>>> names_for_merge("all_data", "intra_cv") ['train_data', 'validation_data', 'holdout_data'] >>> names_for_merge("all_inputs", "intra_cv") ['train_inputs', 'validation_inputs', 'holdout_inputs', 'test_inputs'] >>> names_for_merge("all_targets", "intra_cv") ['train_targets', 'validation_targets', 'holdout_targets'] >>> names_for_merge("all_data", "pre_cv") ['train_data', 'holdout_data'] >>> names_for_merge("all_inputs", "pre_cv") ['train_inputs', 'holdout_inputs', 'test_inputs'] >>> names_for_merge("all_targets", "pre_cv") ['train_targets', 'holdout_targets'] >>> names_for_merge("non_train_data", "intra_cv") ['validation_data', 'holdout_data'] >>> names_for_merge("non_train_inputs", "intra_cv") ['validation_inputs', 'holdout_inputs', 'test_inputs'] >>> names_for_merge("non_train_targets", "intra_cv") ['validation_targets', 'holdout_targets'] >>> names_for_merge("non_train_data", "pre_cv") ['holdout_data'] >>> names_for_merge("non_train_inputs", "pre_cv") ['holdout_inputs', 'test_inputs'] >>> names_for_merge("non_train_targets", "pre_cv") ['holdout_targets']
-
hyperparameter_hunter.feature_engineering.
merge_dfs
(merge_to:str, stage:str, dfs:Dict[str, pandas.core.frame.DataFrame]) → pandas.core.frame.DataFrame¶ Construct a multi-indexed DataFrame containing the values of dfs deemed necessary by merge_to and stage. This is the opposite of split_merged_df
- Parameters
- merge_to: String
Type of merged_df to produce. Should be one of the following: {“all_data”, “all_inputs”, “all_targets”, “non_train_data”, “non_train_inputs”, “non_train_targets”}
- stage: String in {“pre_cv”, “intra_cv”}
Feature engineering stage for which merged_df is requested
- dfs: Dict
Mapping of dataset names to their DataFrame values. Keys in dfs should be a subset of {“train_data”, “train_inputs”, “train_targets”, “validation_data”, “validation_inputs”, “validation_targets”, “holdout_data”, “holdout_inputs”, “holdout_targets”, “test_inputs”}
- Returns
- merged_df: pd.DataFrame
Multi-indexed DataFrame, in which the first index is a string naming the dataset in dfs from which the corresponding data originates. The following index(es) are the original index(es) from the dataset in dfs. All primary indexes in merged_df will be one of the strings considered to be valid keys for dfs
- Raises
- ValueError
If all the DataFrames that would have been used in merged_df are None. This can happen if requesting merge_to=”non_train_targets” during stage=”pre_cv” when there is no holdout dataset available. Under these circumstances, the holdout dataset targets would be the sole contents of merged_df, rendering merged_df invalid since the data is unavailable
See also
names_for_merge
Describes how stage values differ
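Examples
A minimal round-trip sketch using toy DataFrames; the primary-index structure is as described above:
>>> import pandas as pd
>>> dfs = {
...     "train_inputs": pd.DataFrame(dict(a=[0, 1])),
...     "holdout_inputs": pd.DataFrame(dict(a=[2])),
...     "test_inputs": pd.DataFrame(dict(a=[3])),
... }
>>> merged = merge_dfs("all_inputs", "pre_cv", dfs)  # Primary index names the source dataset
>>> recovered = split_merged_df(merged)  # The inverse operation restores the dict form
>>> sorted(recovered.keys())
['holdout_inputs', 'test_inputs', 'train_inputs']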
-
hyperparameter_hunter.feature_engineering.
split_merged_df
(merged_df:pandas.core.frame.DataFrame) → Dict[str, pandas.core.frame.DataFrame]¶ Separate a multi-indexed DataFrame into a dict mapping primary indexes in merged_df to DataFrames containing one fewer dimension than merged_df. This is the opposite of merge_dfs
- Parameters
- merged_df: pd.DataFrame
Multi-indexed DataFrame of the form returned by
merge_dfs()
to split into the separate DataFrames named by the primary indexes of merged_df
- Returns
- dfs: Dict
Mapping of dataset names to their DataFrame values. Keys in dfs will be a subset of {“train_data”, “train_inputs”, “train_targets”, “validation_data”, “validation_inputs”, “validation_targets”, “holdout_data”, “holdout_inputs”, “holdout_targets”, “test_inputs”} containing only those values that are also primary indexes in merged_df
-
hyperparameter_hunter.feature_engineering.
validate_dataset_names
(params:Tuple[str], stage:str) → List[str]¶ Produce the names of merged datasets in params and verify there are no duplicate references to any datasets in params
- Parameters
- params: Tuple[str]
Dataset names requested by a feature engineering step callable. Must be a subset of {“train_data”, “train_inputs”, “train_targets”, “validation_data”, “validation_inputs”, “validation_targets”, “holdout_data”, “holdout_inputs”, “holdout_targets”, “test_inputs”, “all_data”, “all_inputs”, “all_targets”, “non_train_data”, “non_train_inputs”, “non_train_targets”}
- stage: String in {“pre_cv”, “intra_cv”}
Feature engineering stage during which the dataset names in params are requested
- Returns
- List[str]
Names of merged datasets in params
- Raises
- ValueError
If requested params contain a duplicate reference to any dataset, either by way of merging/coupling or not
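Examples
A short sketch following the definitions above: “non_train_inputs” is the only merged name requested, so it should be the sole returned value, while an indirect duplicate raises (message format as shown in the EngineerStep examples below):
>>> validate_dataset_names(("train_inputs", "non_train_inputs"), "intra_cv")
['non_train_inputs']
>>> validate_dataset_names(("train_inputs", "all_inputs"), "intra_cv")  # doctest: +ELLIPSIS
Traceback (most recent call last):
ValueError: Requested params include duplicate references to `train_inputs` by way of: ...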
-
class
hyperparameter_hunter.feature_engineering.
EngineerStep
(f: Callable, stage=None, name=None, params=None, do_validate=False)¶ Bases:
object
Container for individual
FeatureEngineer
step functions
Compartmentalizes functions of singular engineer steps and allows for greater customization than a raw engineer step function
- Parameters
- f: Callable
Feature engineering step function that requests, modifies, and returns datasets params
Step functions should follow these guidelines:
Request as input a subset of the 11 data strings listed in params
Do whatever you want to the DataFrames given as input
Return new DataFrame values of the input parameters in same order as requested
If performing a task like target transformation, causing predictions to be transformed, it is often desirable to inverse-transform the predictions to be of the expected form. This can easily be done by returning an extra value from f (after the datasets) that is either a callable, or a transformer class that was fitted during the execution of f and implements an inverse_transform method. This is the only instance in which it is acceptable for f to return values that don’t mimic its input parameters. See the engineer function definition using SKLearn’s QuantileTransformer in the Examples section below for an actual inverse-transformation-compatible implementation
- stage: String in {“pre_cv”, “intra_cv”}, or None, default=None
Feature engineering stage during which the callable f will be given the datasets params to modify and return. If None, will be inferred based on params.
“pre_cv” functions are applied only once in the experiment: when it starts
“intra_cv” functions are reapplied for each fold in the cross-validation splits
If stage is left to be inferred, “pre_cv” will usually be selected. However, if any params (or parameters in the signature of f) are prefixed with “validation…” or “non_train…”, then stage will be inferred as “intra_cv”. See the Notes section below for suggestions on the stage to use for different functions
- name: String, or None, default=None
Identifier for the transformation applied by this engineering step. If None, f.__name__ will be used
- params: Tuple[str], or None, default=None
Dataset names requested by feature engineering step callable f. If None, will be inferred by parsing the signature of f. Must be a subset of the following 11 strings:
Input Data
“train_inputs”
“validation_inputs”
“holdout_inputs”
“test_inputs”
- “all_inputs”
("train_inputs" + ["validation_inputs"] + "holdout_inputs" + "test_inputs")
- “non_train_inputs”
(["validation_inputs"] + "holdout_inputs" + "test_inputs")
Target Data
“train_targets”
“validation_targets”
“holdout_targets”
“all_targets”
("train_targets" + ["validation_targets"] + "holdout_targets")
“non_train_targets”
(["validation_targets"] + "holdout_targets")
As an alternative to the above list, just remember that the first half of all parameter names should be one of {“train”, “validation”, “holdout”, “test”, “all”, “non_train”}, and the second half should be either “inputs” or “targets”. The only exception to this rule is “test_targets”, which doesn’t exist.
Inference of “validation” params is affected by stage. During the “pre_cv” stage, the validation dataset has not yet been created and is still a part of the train dataset. During the “intra_cv” stage, the validation dataset is created by removing a portion of the train dataset, and their values passed to f reflect this fact. This also means that the values of the merged (“all”/”non_train”-prefixed) datasets may or may not contain “validation” data depending on the stage; however, this is all handled internally, so you probably don’t need to worry about it.
params may not include multiple references to the same dataset, either directly or indirectly. This means (“train_inputs”, “train_inputs”) is invalid due to duplicate direct references. Less obviously, (“train_inputs”, “all_inputs”) is invalid because “all_inputs” includes “train_inputs”
- do_validate: Boolean, or “strict”, default=False
… Experimental… Whether to validate the datasets resulting from feature engineering steps. If True, hashes of the new datasets will be compared to those of the originals to ensure they were actually modified. Results will be logged. If do_validate = “strict”, an exception will be raised if any anomalies are found, rather than logging a message. If do_validate = False, no validation will be performed
See also
FeatureEngineer
The container for EngineerStep instances - EngineerSteps should always be provided to HyperparameterHunter through a FeatureEngineer
Categorical
Can be used during optimization to search through a group of EngineerSteps given as categories. The optional kwarg of Categorical designates a FeatureEngineer step that may be one of the EngineerSteps in categories, or may be omitted entirely
get_engineering_step_stage()
More information on stage inference and situations where overriding it may be prudent
Notes
stage: Generally, feature engineering conducted in the “pre_cv” stage should treat each sample/row as an independent entity. For example, steps like converting a string day of the week to one-hot encoded columns, or imputing missing values by replacement with -1 might be conducted “pre_cv”, since they are unlikely to introduce information leakage. Conversely, steps like scaling/normalization, whose results for the data in one row are affected by the data in other rows, should be performed “intra_cv” in order to recalculate the final values of the datasets for each cross validation split and avoid information leakage.
params: In the list of the 11 valid params strings, “test_inputs” is notably missing the “…_targets” counterpart accompanying the other datasets. The “targets” suffix is missing because test data targets are never given. Note that although “test_inputs” is still included in both “all_inputs” and “non_train_inputs”, its lack of a target column means that “all_targets” and “non_train_targets” may have different lengths than their “inputs”-suffixed counterparts
Examples
>>> from sklearn.preprocessing import StandardScaler, QuantileTransformer
>>> def s_scale(train_inputs, non_train_inputs):
...     s = StandardScaler()
...     train_inputs[train_inputs.columns] = s.fit_transform(train_inputs.values)
...     non_train_inputs[train_inputs.columns] = s.transform(non_train_inputs.values)
...     return train_inputs, non_train_inputs
>>> # Sensible parameter defaults inferred based on `f`
>>> es_0 = EngineerStep(s_scale)
>>> es_0.stage
'intra_cv'
>>> es_0.name
's_scale'
>>> es_0.params
('train_inputs', 'non_train_inputs')
>>> # Override `stage` if you want to fit your scaler on OOF data like a crazy person
>>> es_1 = EngineerStep(s_scale, stage="pre_cv")
>>> es_1.stage
'pre_cv'
Watch out for multiple requests to the same data
>>> es_2 = EngineerStep(s_scale, params=("train_inputs", "all_inputs"))
Traceback (most recent call last):
    File "feature_engineering.py", line ? in validate_dataset_names
ValueError: Requested params include duplicate references to `train_inputs` by way of:
   - ('all_inputs', 'train_inputs')
   - ('train_inputs',)
Each dataset may only be requested by a single param for each function
Error is the same if `(train_inputs, all_inputs)` is in the actual function signature
EngineerStep functions aren’t just limited to transformations. Make your own features!
>>> import numpy as np
>>> def sqr_sum(all_inputs):
...     all_inputs["square_sum"] = all_inputs.agg(
...         lambda row: np.sqrt(np.sum([np.square(_) for _ in row])), axis="columns"
...     )
...     return all_inputs
>>> es_3 = EngineerStep(sqr_sum)
>>> es_3.stage
'pre_cv'
>>> es_3.name
'sqr_sum'
>>> es_3.params
('all_inputs',)
Inverse-transformation Implementation:
>>> def q_transform(train_targets, non_train_targets):
...     t = QuantileTransformer(output_distribution="normal")
...     train_targets[train_targets.columns] = t.fit_transform(train_targets.values)
...     non_train_targets[train_targets.columns] = t.transform(non_train_targets.values)
...     return train_targets, non_train_targets, t
>>> # Note that `train_targets` and `non_train_targets` must still be returned in order,
>>> # but they are followed by `t`, an instance of `QuantileTransformer` we just fitted,
>>> # whose `inverse_transform` method will be called on predictions
>>> es_4 = EngineerStep(q_transform)
>>> es_4.stage
'intra_cv'
>>> es_4.name
'q_transform'
>>> es_4.params
('train_targets', 'non_train_targets')
>>> # `params` does not include any returned transformers - Only data requested as input
- Attributes
Methods
__call__
(self, \*\*datasets, …)Apply f to datasets to produce updated datasets.
get_comparison_attrs
(step_obj)Build a dict of critical EngineerStep attributes
get_datasets_for_f
(self, datasets)Produce a dict of DataFrames containing only the merged datasets and standard datasets requested in params.
get_key_data
(self)Produce a dict of critical attributes describing the EngineerStep instance for use by key-making classes
honorary_step_from_dict
(step_dict, dimension)Get an EngineerStep from dimension that is equal to its dict form, step_dict
inverse_transform
(self, data)Perform the inverse transformation for this engineer step (if it exists)
stringify
(self)Make a stringified representation of self, compatible with EngineerStep.__eq__()
-
inverse_transform
(self, data)¶ Perform the inverse transformation for this engineer step (if it exists)
- Parameters
- data: Array-like
Data to inverse transform with inversion or inversion.inverse_transform
- Returns
- Array-like
If inversion is None, return data unmodified. Else, return the result of inversion or inversion.inverse_transform, given data
-
get_datasets_for_f
(self, datasets:Dict[str, pandas.core.frame.DataFrame]) → Dict[str, pandas.core.frame.DataFrame]¶ Produce a dict of DataFrames containing only the merged datasets and standard datasets requested in
params
. In other words, add the requested merged datasets and remove unnecessary standard datasets
- Parameters
- datasets: DFDict
Original dict of datasets, containing all datasets provided to
EngineerStep.__call__()
, some of which may be superfluous, or may require additional processing to resolve merged/coupled datasets
- Returns
- DFDict
Updated version of datasets, in which unnecessary datasets have been filtered out, and the requested merged datasets have been added
-
get_key_data
(self) → dict¶ Produce a dict of critical attributes describing the
EngineerStep
instance for use by key-making classes
- Returns
- Dict
Important attributes describing this
EngineerStep
instance
-
property
f
¶ Feature engineering step callable that requests, modifies, and returns datasets
-
property
name
¶ Identifier for the transformation applied by this engineering step
-
property
params
¶ Dataset names requested by feature engineering step callable
f
. See documentation in EngineerStep.__init__()
for more information/restrictions
-
property
stage
¶ Feature engineering stage during which the EngineerStep will be executed
-
static
get_comparison_attrs
(step_obj:Union[_ForwardRef('EngineerStep'), dict]) → dict¶ Build a dict of critical
EngineerStep
attributes
- Parameters
- step_obj: EngineerStep, dict
Object for which critical
EngineerStep
attributes should be collected
- Returns
- attr_vals: Dict
Critical
EngineerStep
attributes. If step_obj does not have a necessary attribute (for EngineerStep) or a necessary key (for dict), its value in attr_vals will be a placeholder object. This is to facilitate comparison, while also ensuring missing values will always be considered unequal to other values
Examples
>>> def dummy_f(train_inputs, non_train_inputs):
...     return train_inputs, non_train_inputs
>>> es_0 = EngineerStep(dummy_f)
>>> EngineerStep.get_comparison_attrs(es_0)  # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
{'name': 'dummy_f',
 'f': <function dummy_f at ...>,
 'params': ('train_inputs', 'non_train_inputs'),
 'stage': 'intra_cv',
 'do_validate': False}
>>> EngineerStep.get_comparison_attrs(
...     dict(foo="hello", f=dummy_f, params=["all_inputs", "all_targets"], stage="pre_cv")
... )  # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
{'name': <object object at ...>,
 'f': <function dummy_f at ...>,
 'params': ('all_inputs', 'all_targets'),
 'stage': 'pre_cv',
 'do_validate': <object object at ...>}
-
stringify
(self) → str¶ Make a stringified representation of self, compatible with
EngineerStep.__eq__()
- Returns
- String
String describing all critical attributes of the
EngineerStep
instance. This value is not particularly human-friendly due to both its length and the fact that EngineerStep.f
is represented by its hash
Examples
>>> def dummy_f(train_inputs, non_train_inputs):
...     return train_inputs, non_train_inputs
>>> EngineerStep(dummy_f).stringify()  # doctest: +ELLIPSIS
"EngineerStep(dummy_f, ..., ('train_inputs', 'non_train_inputs'), intra_cv, False)"
>>> EngineerStep(dummy_f, stage="pre_cv").stringify()  # doctest: +ELLIPSIS
"EngineerStep(dummy_f, ..., ('train_inputs', 'non_train_inputs'), pre_cv, False)"
-
classmethod
honorary_step_from_dict
(step_dict:dict, dimension:hyperparameter_hunter.space.dimensions.Categorical)¶ Get an EngineerStep from dimension that is equal to its dict form, step_dict
- Parameters
- step_dict: Dict
Dict of form saved in Experiment description files for EngineerStep. Expected to have following keys, with values of the given types:
“name”: String
“f”: String (SHA256 hash)
“params”: List[str], or Tuple[str, …]
“stage”: String in {“pre_cv”, “intra_cv”}
“do_validate”: Boolean
- dimension: Categorical
Categorical instance expected to contain the EngineerStep equivalent of step_dict in its categories
- Returns
- EngineerStep
From dimension.categories if it is the EngineerStep equivalent of step_dict
- Raises
- ValueError
If dimension.categories does not contain an EngineerStep matching step_dict
-
class
hyperparameter_hunter.feature_engineering.
FeatureEngineer
(steps=None, do_validate=False, **datasets)¶ Bases:
object
Class to organize feature engineering step callables steps (
EngineerStep
instances) and the datasets that the steps request and return.
- Parameters
- steps: List, or None, default=None
List of arbitrary length, containing any of the following values:
EngineerStep
instance,
Function to provide as input to
EngineerStep
, or
Categorical
, with categories comprising a selection of the previous two steps values (optimization only)
The third value can only be used during optimization. The feature_engineer provided to
CVExperiment
, for example, may only contain the first two values. To search a space optionally including an EngineerStep, use the optional kwarg of Categorical.
See
EngineerStep
for information on properly formatted EngineerStep functions. Additional engineering steps may be added via add_step()
- do_validate: Boolean, or “strict”, default=False
… Experimental… Whether to validate the datasets resulting from feature engineering steps. If True, hashes of the new datasets will be compared to those of the originals to ensure they were actually modified. Results will be logged. If do_validate = “strict”, an exception will be raised if any anomalies are found, rather than logging a message. If do_validate = False, no validation will be performed
- **datasets: DFDict
This is not expected to be provided on initialization and is offered primarily for debugging/testing. Mapping of datasets necessary to perform feature engineering steps
See also
EngineerStep
For proper formatting of non-Categorical values of steps
Notes
If steps does include any instances of
hyperparameter_hunter.space.dimensions.Categorical
, this FeatureEngineer instance will not be usable by Experiments. It can only be used by Optimization Protocols. Furthermore, the FeatureEngineer that the Optimization Protocol actually ends up using will not pass identity checks against the original FeatureEngineer that contained Categorical steps
Examples
>>> import numpy as np
>>> from sklearn.preprocessing import StandardScaler, MinMaxScaler, QuantileTransformer
>>> # Define some engineer step functions to play with
>>> def s_scale(train_inputs, non_train_inputs):
...     s = StandardScaler()
...     train_inputs[train_inputs.columns] = s.fit_transform(train_inputs.values)
...     non_train_inputs[train_inputs.columns] = s.transform(non_train_inputs.values)
...     return train_inputs, non_train_inputs
>>> def mm_scale(train_inputs, non_train_inputs):
...     s = MinMaxScaler()
...     train_inputs[train_inputs.columns] = s.fit_transform(train_inputs.values)
...     non_train_inputs[train_inputs.columns] = s.transform(non_train_inputs.values)
...     return train_inputs, non_train_inputs
>>> def q_transform(train_targets, non_train_targets):
...     t = QuantileTransformer(output_distribution="normal")
...     train_targets[train_targets.columns] = t.fit_transform(train_targets.values)
...     non_train_targets[train_targets.columns] = t.transform(non_train_targets.values)
...     return train_targets, non_train_targets, t
>>> def sqr_sum(all_inputs):
...     all_inputs["square_sum"] = all_inputs.agg(
...         lambda row: np.sqrt(np.sum([np.square(_) for _ in row])), axis="columns"
...     )
...     return all_inputs
FeatureEngineer steps wrapped by `EngineerStep` == raw function steps - as long as the `EngineerStep` is using the default parameters
>>> # FeatureEngineer steps wrapped by `EngineerStep` == raw function steps
>>> # ... As long as the `EngineerStep` is using the default parameters
>>> fe_0 = FeatureEngineer([sqr_sum, s_scale])
>>> fe_1 = FeatureEngineer([EngineerStep(sqr_sum), EngineerStep(s_scale)])
>>> fe_0.steps == fe_1.steps
True
>>> fe_2 = FeatureEngineer([sqr_sum, EngineerStep(s_scale), q_transform])
`Categorical` can be used during optimization and placed anywhere in `steps`. `Categorical` can also handle either `EngineerStep` categories or raw functions. Use the `optional` kwarg of `Categorical` to test some questionable steps
>>> fe_3 = FeatureEngineer([sqr_sum, Categorical([s_scale, mm_scale]), q_transform])
>>> fe_4 = FeatureEngineer([Categorical([sqr_sum], optional=True), s_scale, q_transform])
>>> fe_5 = FeatureEngineer([
...     Categorical([sqr_sum], optional=True),
...     Categorical([EngineerStep(s_scale), mm_scale]),
...     q_transform
... ])
- Attributes
steps
Feature engineering steps to execute in sequence on
FeatureEngineer.__call__()
Methods
__call__
(self, stage, \*\*datasets, …)Execute all feature engineering steps in
steps
for stage, with datasets datasets as inputs
add_step
(self, step, …)Add an engineering step to
steps
to be executed with the other contents ofsteps
onFeatureEngineer.__call__()
get_key_data
(self)Produce a dict of critical attributes describing the
FeatureEngineer
instance for use by key-making classesinverse_transform
(self, data)Perform the inverse transformation for all engineer steps in
steps
in sequence on data-
inverse_transform
(self, data)¶ Perform the inverse transformation for all engineer steps in
steps
in sequence on data
- Parameters
- data: Array-like
Data to inverse transform with any inversions present in
steps
- Returns
- Array-like
Result of sequentially calling inverse transformations in
steps
on data. If any step hasEngineerStep.inversion
= None, data is unmodified for that step, and proceeds to next engineer step inversion
-
property
steps
¶ Feature engineering steps to execute in sequence on
FeatureEngineer.__call__()
-
get_key_data
(self) → dict¶ Produce a dict of critical attributes describing the
FeatureEngineer
instance for use by key-making classes- Returns
- Dict
Important attributes describing this
FeatureEngineer
instance
-
add_step
(self, step:Union[Callable, hyperparameter_hunter.space.dimensions.Categorical], stage:str=None, name:str=None, before:str=<class 'hyperparameter_hunter.feature_engineering.EMPTY_SENTINEL'>, after:str=<class 'hyperparameter_hunter.feature_engineering.EMPTY_SENTINEL'>, number:int=<class 'hyperparameter_hunter.feature_engineering.EMPTY_SENTINEL'>)¶ Add an engineering step to
steps
to be executed with the other contents ofsteps
onFeatureEngineer.__call__()
- Parameters
- step: Callable, or `EngineerStep`, or `Categorical`
If EngineerStep instance, will be added directly to
steps
. Otherwise, must be a feature engineering step callable that requests, modifies, and returns datasets, which will be used to instantiate aEngineerStep
to add tosteps
. If Categorical, categories should contain EngineerStep instances or callables- stage: String in {“pre_cv”, “intra_cv”}, or None, default=None
Feature engineering stage during which the callable step will be executed
- name: String, or None, default=None
Identifier for the transformation applied by this engineering step. If None and step is not an EngineerStep, will be inferred during
EngineerStep
instantiation- before: String, default=EMPTY_SENTINEL
… Experimental…
- after: String, default=EMPTY_SENTINEL
… Experimental…
- number: Int, default=EMPTY_SENTINEL
… Experimental…
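A hedged sketch of add_step, reusing the sqr_sum and q_transform step functions defined in the Examples above:
>>> fe = FeatureEngineer()
>>> fe.add_step(sqr_sum)  # Wrapped into an EngineerStep; stage inferred as "pre_cv"
>>> fe.add_step(q_transform, stage="intra_cv")  # Explicit stage bypasses inference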
-
hyperparameter_hunter.feature_engineering.
get_engineering_step_stage
(datasets:Tuple[str, ...]) → str¶ Determine the stage in which a feature engineering step that requests datasets as input should be executed
- Parameters
- datasets: Tuple[str]
Dataset names requested by a feature engineering step callable
- Returns
- stage: {“pre_cv”, “intra_cv”}
“pre_cv” if a step processing the given datasets should be executed in the pre-cross-validation stage. “intra_cv” if the step should be executed for each cross-validation split. If any of the elements in datasets is prefixed with “validation” or “non_train”, stage will be “intra_cv”. Otherwise, it will be “pre_cv”
Notes
Generally, feature engineering conducted in the “pre_cv” stage should treat each sample/row as an independent entity. For example, steps like converting a string day of the week to one-hot encoded columns, or imputing missing values by replacement with -1 might be conducted “pre_cv”, since they are unlikely to introduce information leakage. Conversely, steps like scaling/normalization, whose results for the data in one row are affected by the data in other rows, should be performed “intra_cv” in order to recalculate the final values of the datasets for each cross validation split and avoid information leakage
Technically, the inference of stage=”intra_cv” due to the existence of a “non_train”-prefixed value in datasets could unnecessarily force steps to be executed “intra_cv” if, for example, there is no validation data. However, this is safer than the alternative of executing these steps “pre_cv”, in which validation data would be a subset of train data, probably introducing information leakage. A simple workaround for this is to explicitly provide
EngineerStep
with the desired stage parameter to bypass this inference
Examples
>>> get_engineering_step_stage(("train_inputs", "validation_inputs", "holdout_inputs"))
'intra_cv'
>>> get_engineering_step_stage(("all_data",))
'pre_cv'
>>> get_engineering_step_stage(("all_inputs", "all_targets"))
'pre_cv'
>>> get_engineering_step_stage(("train_data", "non_train_data"))
'intra_cv'
-
class
hyperparameter_hunter.feature_engineering.
ParameterParser
¶ Bases:
ast.NodeVisitor
ast.NodeVisitor subclass that collects the arguments specified in the signature of a callable node, as well as the values returned by the callable, in the attributes args and returns, respectively
Methods
generic_visit
(self, node)Called if no explicit visitor function exists for a node.
visit
(self, node)Visit a node.
visit_Return
visit_arg
-
visit_arg
(self, node)¶
-
visit_Return
(self, node)¶
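A rough sketch of driving the visitor with the standard ast module; treating args and returns as lists of names is an assumption based on the description above:
>>> import ast, inspect
>>> def step_f(train_inputs, non_train_inputs):
...     return train_inputs, non_train_inputs
>>> parser = ParameterParser()
>>> parser.visit(ast.parse(inspect.getsource(step_f)))
>>> requested = parser.args   # Parameter names collected by visit_arg
>>> returned = parser.returns # Returned names collected by visit_Return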
-
-
hyperparameter_hunter.feature_engineering.
get_engineering_step_params
(f:<built-in function callable>) → Tuple[str]¶ Verify that callable f requests valid input parameters, and returns a tuple of the same parameters, with the assumption that the parameters are modified by f
- Parameters
- f: Callable
Feature engineering step function that requests, modifies, and returns datasets
- Returns
- Tuple
Argument/return value names declared by f
Examples
>>> def impute_negative_one(all_inputs):
...     all_inputs.fillna(-1, inplace=True)
...     return all_inputs
>>> get_engineering_step_params(impute_negative_one)
('all_inputs',)
>>> def standard_scale(train_inputs, non_train_inputs):
...     scaler = StandardScaler()
...     train_inputs[train_inputs.columns] = scaler.fit_transform(train_inputs.values)
...     non_train_inputs[train_inputs.columns] = scaler.transform(non_train_inputs.values)
...     return train_inputs, non_train_inputs
>>> get_engineering_step_params(standard_scale)
('train_inputs', 'non_train_inputs')
>>> def error_invalid_dataset(train_inputs, foo):
...     return train_inputs, foo
>>> get_engineering_step_params(error_invalid_dataset)
Traceback (most recent call last):
    File "feature_engineering.py", line ?, in get_engineering_step_params
ValueError: Invalid dataset name: 'foo'
-
hyperparameter_hunter.feature_engineering.
hash_datasets
(datasets:dict) → dict¶ Describe datasets with dicts of hashes for their values, column names, and column values
- Parameters
- datasets: Dict
Mapping of dataset names to pandas.DataFrame instances
- Returns
- hashes: Dict
Mapping with same keys as datasets, whose values are dicts returned from
_hash_dataset()
that provide hashes for each DataFrame and its column names/values
Examples
>>> df_x = pd.DataFrame(dict(a=[0, 1], b=[2, 3], c=[4, 5]))
>>> df_y = pd.DataFrame(dict(a=[0, 1], b=[6, 7], d=[8, 9]))
>>> hash_datasets(dict(x=df_x, y=df_y)) == dict(x=_hash_dataset(df_x), y=_hash_dataset(df_y))
True
hyperparameter_hunter.importer module¶
This module provides utilities to intercept external imports and load them using custom logic
Related¶
hyperparameter_hunter.__init__
Executes the import hooks to ensure assets are properly imported prior to starting any real work
hyperparameter_hunter.tracers
Defines tracing metaclasses applied by
hyperparameter_hunter.importer
to imports
-
class
hyperparameter_hunter.importer.
Interceptor
(module_name, custom_loader, asset_name=None)¶ Bases:
_frozen_importlib_external.PathFinder
Class to intercept loading of an external module in order to provide custom loading logic
- Parameters
- module_name: String
The path of the module, for which loading should be handled by custom_loader
- custom_loader: Descendant of `importlib.machinery.SourceFileLoader`
Should implement
exec_module()
, which should call its superclass’s exec_module()
, then perform the custom loading logic, and return module
Methods
find_module
(fullname[, path])find the module on sys.path or ‘path’ based on sys.path_hooks and sys.path_importer_cache.
find_spec
(self, full_name[, path, target])Perform custom loading logic if full_name ==
module_name
invalidate_caches
()Call the invalidate_caches() method on all path entry finders stored in sys.path_importer_caches (where implemented).
-
find_spec
(self, full_name, path=None, target=None)¶ Perform custom loading logic if full_name ==
module_name
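Interceptor behaves like any other sys.meta_path finder. The following sketch shows only the general installation pattern; the module path and loader pairing here are illustrative assumptions, not a prescription from the library (hook_keras_layer() below performs the real setup):
>>> import sys
>>> sys.meta_path.insert(0, Interceptor("keras.engine.base_layer", KerasLayerLoader))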
-
class
hyperparameter_hunter.importer.
KerasLayerLoader
(fullname, path)¶ Bases:
_frozen_importlib_external.SourceFileLoader
Cache the module name and the path to the file found by the finder.
Methods
create_module
(self, spec)Use default semantics for module creation.
exec_module
(self, module)Set module.Layer to a traced version of itself via
tracers.ArgumentTracer
get_code
(self, fullname)Concrete implementation of InspectLoader.get_code.
get_data
(self, path)Return the data from path as raw bytes.
get_filename
(self[, name])Return the path to the source file as found by the finder.
get_source
(self, fullname)Concrete implementation of InspectLoader.get_source.
is_package
(self, fullname)Concrete implementation of InspectLoader.is_package by checking if the path returned by get_filename has a filename of ‘__init__.py’.
load_module
(self[, name])Load a module from a file.
path_mtime
(self, path)Optional method that returns the modification time (an int) for the specified path, where path is a str.
path_stats
(self, path)Return the metadata for the path.
set_data
(self, path, data, \*[, _mode])Write bytes data to a file.
source_to_code
(self, data, path, \*[, _optimize])Return the code object compiled from source.
-
exec_module
(self, module)¶ Set module.Layer to a traced version of itself via
tracers.ArgumentTracer
-
-
hyperparameter_hunter.importer.
hook_keras_layer
()¶ If Keras has yet to be imported, modify the inheritance structure of its base Layer class to inject attributes that keep track of the parameters provided to each layer
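Because the hook works by intercepting the import itself, ordering matters. A minimal sketch:
>>> from hyperparameter_hunter.importer import hook_keras_layer
>>> hook_keras_layer()  # Must be called before Keras is first imported
>>> import keras  # The base Layer class is now traced via tracers.ArgumentTracer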
-
class
hyperparameter_hunter.importer.
KerasMultiInitializerLoader
(fullname, path)¶ Bases:
_frozen_importlib_external.SourceFileLoader
Cache the module name and the path to the file found by the finder.
Methods
create_module
(self, spec)Use default semantics for module creation.
exec_module
(self, module)Execute the module.
get_code
(self, fullname)Concrete implementation of InspectLoader.get_code.
get_data
(self, path)Return the data from path as raw bytes.
get_filename
(self[, name])Return the path to the source file as found by the finder.
get_source
(self, fullname)Concrete implementation of InspectLoader.get_source.
is_package
(self, fullname)Concrete implementation of InspectLoader.is_package by checking if the path returned by get_filename has a filename of ‘__init__.py’.
load_module
(self[, name])Load a module from a file.
path_mtime
(self, path)Optional method that returns the modification time (an int) for the specified path, where path is a str.
path_stats
(self, path)Return the metadata for the path.
set_data
(self, path, data, \*[, _mode])Write bytes data to a file.
source_to_code
(self, data, path, \*[, _optimize])Return the code object compiled from source.
-
exec_module
(self, module)¶ Execute the module.
-
-
hyperparameter_hunter.importer.
hook_keras_initializers
()¶
hyperparameter_hunter.leaderboards module¶
This module defines the Leaderboard classes that are saved to the ‘HyperparameterHunterAssets/Leaderboards’ subdirectory. It provides the ability to compare all Experiment results at a glance
Related¶
hyperparameter_hunter.recorders
This module initiates the saving of Experiment entries to Leaderboards
-
class
hyperparameter_hunter.leaderboards.
Leaderboard
(data=None)¶ Bases:
object
The Leaderboard class is used for reading, updating, and saving leaderboard files within the ‘HyperparameterHunterAssets/Leaderboards’ subdirectory
- Parameters
- data: pd.DataFrame, or None, default=None
The starting state of the Leaderboard. If None, an empty DataFrame is used
Methods
add_entry
(self, experiment, \*\*kwargs)Add an entry row for experiment to
data
from_path
(path[, assert_existence])Initialize a Leaderboard from a .csv path
save
(self, path, \*\*kwargs)Save the Leaderboard instance
sort
(self, by[, ascending])Sort the rows in
data
according to the values of a column-
classmethod
from_path
(path, assert_existence=False)¶ Initialize a Leaderboard from a .csv path
- Parameters
- path: str
The path of the file to read in as a DataFrame
- assert_existence: boolean, default=False
If False, and
pandas.read_csv()
raises FileNotFoundError, the Leaderboard will be initialized with None. Else the exception is raised normally
-
abstract
add_entry
(self, experiment, **kwargs)¶ Add an entry row for experiment to
data
- Parameters
- experiment: Instance of `experiments.BaseExperiment`
An instance of a completed Experiment from which to construct a Leaderboard entry
-
save
(self, path, **kwargs)¶ Save the Leaderboard instance
- Parameters
- path: str
The file to which the Leaderboard instance should be saved
- **kwargs: Dict
Additional arguments to supply to
pandas.DataFrame.to_csv()
-
sort
(self, by, ascending=False)¶ Sort the rows in
data
according to the values of a column- Parameters
- by: str, or list of str
The column name(s) by which to sort the rows of
data
- ascending: boolean, default=False
The direction in which to sort the rows of
data
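A hedged sketch tying the three methods together, using the GlobalLeaderboard subclass defined below; the .csv path and metric column name are illustrative, following the ‘HyperparameterHunterAssets/Leaderboards’ layout and the column format produced by evaluations_to_columns():
>>> lb = GlobalLeaderboard.from_path("HyperparameterHunterAssets/Leaderboards/GlobalLeaderboard.csv")
>>> lb.sort(by="oof_roc_auc_score", ascending=False)
>>> lb.save("HyperparameterHunterAssets/Leaderboards/GlobalLeaderboard.csv")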
-
class
hyperparameter_hunter.leaderboards.
GlobalLeaderboard
(data=None)¶ Bases:
hyperparameter_hunter.leaderboards.Leaderboard
The Leaderboard class is used for reading, updating, and saving leaderboard files within the ‘HyperparameterHunterAssets/Leaderboards’ subdirectory
- Parameters
- data: pd.DataFrame, or None, default=None
The starting state of the Leaderboard. If None, an empty DataFrame is used
Methods
add_entry
(self, experiment, \*\*kwargs)Add an entry row to
Leaderboard.data
(pandas.DataFrame).from_path
(path[, assert_existence])Initialize a Leaderboard from a .csv path
save
(self, path, \*\*kwargs)Save the Leaderboard instance
sort
(self, by[, ascending])Sort the rows in
data
according to the values of a column-
add_entry
(self, experiment, **kwargs)¶ Add an entry row to
Leaderboard.data
(pandas.DataFrame). This method also handles column conflicts to an extent- Parameters
- experiment: Instance of `experiments.BaseExperiment` descendant
An Experiment instance for which a leaderboard entry row should be added
- **kwargs: Dict
Extra keyword arguments
-
hyperparameter_hunter.leaderboards.
evaluations_to_columns
(evaluation:Dict[str, Union[collections.OrderedDict, NoneType]], decimals=10) → List[Tuple[str, numbers.Number]]¶ Convert the results of
metrics.ScoringMixIn.evaluate()
to a pd.DataFrame-ready format
- Parameters
- evaluation: Dict[str, OrderedDict]
The result of consecutive calls to
metrics.ScoringMixIn.evaluate()
for all given dataset types
- decimals: Int, default=10
Number of decimal places to which to round. If decimals is negative, it specifies the number of positions to the left of the decimal point
- Returns
- column_metrics: list of pairs
A pair for each data_type-metric combination, where the first item is the key, and the second is the metric value
Examples
>>> evaluations_to_columns({
...     'in_fold': None,
...     'holdout': OrderedDict([('roc_auc_score', 0.9856), ('f1_score', 0.9768)]),
...     'oof': OrderedDict([('roc_auc_score', 0.9634)])
... })
[('oof_roc_auc_score', 0.9634), ('holdout_roc_auc_score', 0.9856), ('holdout_f1_score', 0.9768)]
-
hyperparameter_hunter.leaderboards.
combine_column_order
(df_1, df_2, both_cols=None)¶ Determine the sort order for the combined columns of two DataFrames
- Parameters
- df_1: pd.DataFrame
The first DataFrame, whose columns will be sorted. Columns unique to df_1 will be sorted before those of df_2
- df_2: pd.DataFrame
The second DataFrame, whose columns will be sorted. Columns unique to df_2 will be sorted after those of df_1
- both_cols: list, or None, default=None
If list, the column names that should be common to both DataFrames and placed last in the sort order
- Returns
- combined_cols: list of strings
The result of combining and sorting column names from df_1, and df_2
Examples
>>> df_1 = pd.DataFrame(columns=['A', 'B', 'C', 'Common_1', 'Common_2'])
>>> df_2 = pd.DataFrame(columns=['A', 'D', 'E', 'Common_1', 'Common_2'])
>>> combine_column_order(df_1, df_2, both_cols=['Common_1', 'Common_2'])
['A', 'B', 'C', 'D', 'E', 'Common_1', 'Common_2']
>>> combine_column_order(df_1, df_2, both_cols=None)
['A', 'Common_1', 'Common_2', 'B', 'C', 'D', 'E']
hyperparameter_hunter.metrics module¶
This module defines hyperparameter_hunter.metrics.ScoringMixIn
which enables
hyperparameter_hunter.experiments.BaseExperiment
to score predictions and collect the
results of those evaluations
Related¶
hyperparameter_hunter.experiments
This module uses
hyperparameter_hunter.metrics.ScoringMixIn
as the only explicit parent class to hyperparameter_hunter.experiments.BaseExperiment
(that is, the only parent class that isn’t bestowed upon it by hyperparameter_hunter.experiment_core.ExperimentMeta
)
-
class
hyperparameter_hunter.metrics.
Metric
(name: str, metric_function: Union[callable, str, None] = None, direction: str = 'infer')¶ Bases:
object
Class to encapsulate all necessary information for identifying, calculating, and evaluating metrics results
- Parameters
- name: String
Identifying name of the metric. Should be unique relative to any other metric names that might be provided by the user
- metric_function: Callable, string, None, default=None
If callable, should expect inputs of form (target, prediction), and return a float. If string, will be treated as an attribute in
sklearn.metrics
. If None, name will be treated as an attribute insklearn.metrics
, the value of which will be retrieved and used as metric_function- direction: {“infer”, “max”, “min”}, default=”infer”
How to compare the result of metric_function relative to previous evaluations
“max”: Metric values should be maximized, and higher metric values are better than lower values; it should be used for measures of accuracy
“min”: Metric values should be minimized, and lower metric values are better than higher values; it should be used for measures of error or loss
“infer”: direction will be set to:
“min” if name (or metric_function’s name) contains “error” or “loss”
“max” if name contains neither of the aforementioned strings
Notes
direction = “infer” looks for “error”/”loss” in name first, then in the name of metric_function. This means that name can be an abbreviation/anything for error measures and direction will still be correctly inferred as long as the actual callable for metric_function has “error”/”loss” in its name. For example, direction = “min” is safely inferred when using “mae” for “mean_absolute_error” or “rmsle” for “root_mean_squared_logarithmic_error”. This functions as described whether metric_function is a string naming an SKLearn metric, or a callable whose name includes “error”/”loss”
Examples
>>> Metric("roc_auc_score") # doctest: +ELLIPSIS Metric(roc_auc_score, <function roc_auc_score at 0x...>, max) >>> Metric("roc_auc_score", sk_metrics.roc_auc_score) # doctest: +ELLIPSIS Metric(roc_auc_score, <function roc_auc_score at 0x...>, max) >>> Metric("my_f1_score", "f1_score") # doctest: +ELLIPSIS Metric(my_f1_score, <function f1_score at 0x...>, max) >>> Metric("hamming_loss", sk_metrics.hamming_loss) # doctest: +ELLIPSIS Metric(hamming_loss, <function hamming_loss at 0x...>, min)
Respect explicit `direction` even if it doesn’t make sense for the `metric_function`
>>> Metric("r2_score", sk_metrics.r2_score, direction="min") # doctest: +ELLIPSIS Metric(r2_score, <function r2_score at 0x...>, min)
Direction inference based on `metric_function` name, rather than `name` itself
>>> Metric("mae", "median_absolute_error") # doctest: +ELLIPSIS Metric(mae, <function median_absolute_error at 0x...>, min) >>> Metric("hl", sk_metrics.hamming_loss) # doctest: +ELLIPSIS Metric(hl, <function hamming_loss at 0x...>, min)
Methods
__call__
(self, target, prediction)Call self as a function.
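A small sketch of calling a Metric directly, with the (target, prediction) ordering documented above; the assumption is that __call__ simply delegates to metric_function:
>>> m = Metric("roc_auc_score")
>>> score = m([0, 1, 0, 1], [0.1, 0.9, 0.2, 0.8])  # 1.0 for this perfectly-ranked toy input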
-
hyperparameter_hunter.metrics.
format_metrics
(metrics:Union[Dict, List]) → Dict[str, hyperparameter_hunter.metrics.Metric]¶ Properly format iterable metrics to contain instances of
Metric
- Parameters
- metrics: Dict, List
Iterable describing the metrics to be recorded, along with a means to compute the value of each metric. Should be of one of the two following forms:
List Form:
[“<metric name>”, “<metric name>”, …]: Where each value of the list is a string that names an attribute in
sklearn.metrics
[Metric, Metric, …]: Where each value of the list is an instance of
Metric
[(<*args>), (<*args>), …]: Where each value of the list is a tuple of arguments that will be used to instantiate a
Metric
. Arguments given in tuples must be in order expected byMetric
Dict Form:
{“<metric name>”: <metric_function>, …}: Where each key is a name for the corresponding metric callable, which is used to compute the value of the metric
{“<metric name>”: (<metric_function>, <direction>), …}: Where each key is a name for the corresponding metric callable and direction, all of which are used to instantiate a
Metric
{“<metric name>”: “<sklearn metric name>”, …}: Where each key is a name for the metric, and each value is the name of the attribute in
sklearn.metrics
for which the corresponding key is an alias{“<metric name>”: None, …}: Where each key is the name of the attribute in
sklearn.metrics
{“<metric name>”: Metric, …}: Where each key names an instance of
Metric
. This is the internally-used format to which all other formats will be converted
Metric callable functions should expect inputs of form (target, prediction), and should return floats. See the documentation of
Metric
for information regarding expected parameters and types
- Returns
- metrics_dict: Dict
Cast of metrics to a dict, in which values are instances of
Metric
Examples
>>> format_metrics(["roc_auc_score", "f1_score"])  # doctest: +ELLIPSIS
{'roc_auc_score': Metric(roc_auc_score, <function roc_auc_score at 0x...>, max), 'f1_score': Metric(f1_score, <function f1_score at 0x...>, max)}
>>> format_metrics([Metric("log_loss"), Metric("r2_score", direction="min")])  # doctest: +ELLIPSIS
{'log_loss': Metric(log_loss, <function log_loss at 0x...>, min), 'r2_score': Metric(r2_score, <function r2_score at 0x...>, min)}
>>> format_metrics({"log_loss": Metric("log_loss"), "r2_score": Metric("r2_score", direction="min")})  # doctest: +ELLIPSIS
{'log_loss': Metric(log_loss, <function log_loss at 0x...>, min), 'r2_score': Metric(r2_score, <function r2_score at 0x...>, min)}
>>> format_metrics([("log_loss", None), ("my_r2_score", "r2_score", "min")])  # doctest: +ELLIPSIS
{'log_loss': Metric(log_loss, <function log_loss at 0x...>, min), 'my_r2_score': Metric(my_r2_score, <function r2_score at 0x...>, min)}
>>> format_metrics({"roc_auc": sk_metrics.roc_auc_score, "f1": sk_metrics.f1_score})  # doctest: +ELLIPSIS
{'roc_auc': Metric(roc_auc, <function roc_auc_score at 0x...>, max), 'f1': Metric(f1, <function f1_score at 0x...>, max)}
>>> format_metrics({"log_loss": (None, ), "my_r2_score": ("r2_score", "min")})  # doctest: +ELLIPSIS
{'log_loss': Metric(log_loss, <function log_loss at 0x...>, min), 'my_r2_score': Metric(my_r2_score, <function r2_score at 0x...>, min)}
>>> format_metrics({"roc_auc": "roc_auc_score", "f1": "f1_score"})  # doctest: +ELLIPSIS
{'roc_auc': Metric(roc_auc, <function roc_auc_score at 0x...>, max), 'f1': Metric(f1, <function f1_score at 0x...>, max)}
>>> format_metrics({"roc_auc_score": None, "f1_score": None})  # doctest: +ELLIPSIS
{'roc_auc_score': Metric(roc_auc_score, <function roc_auc_score at 0x...>, max), 'f1_score': Metric(f1_score, <function f1_score at 0x...>, max)}
-
hyperparameter_hunter.metrics.
get_formatted_target_metric
(target_metric:Union[tuple, str, NoneType], metrics:dict, default_dataset:str='oof') → Tuple[str, str]¶ Return a properly formatted target_metric tuple for use with navigating evaluation results
- Parameters
- target_metric: Tuple, String, or None
Path denoting metric to be used. If tuple, the first value should be in [‘oof’, ‘holdout’, ‘in_fold’], and the second value should be the name of a metric supplied in metrics. If str, should be one of the two values from the tuple form. Else, a value will be chosen
- metrics: Dict
Properly formatted metrics as produced by
metrics.format_metrics()
, in which keys are strings identifying metrics, and values are instances of metrics.Metric
. See the documentation of metrics.format_metrics()
for more information on different metrics formats
- default_dataset: {“oof”, “holdout”, “in_fold”}, default=”oof”
The default dataset type value to use if one is not provided
- Returns
- target_metric: Tuple
A formatted target_metric containing two strings: a dataset_type, followed by a metric name
Examples
>>> get_formatted_target_metric(('holdout', 'roc_auc_score'), format_metrics(['roc_auc_score', 'f1_score']))
('holdout', 'roc_auc_score')
>>> get_formatted_target_metric(('holdout',), format_metrics(['roc_auc_score', 'f1_score']))
('holdout', 'roc_auc_score')
>>> get_formatted_target_metric('holdout', format_metrics(['roc_auc_score', 'f1_score']))
('holdout', 'roc_auc_score')
>>> get_formatted_target_metric('holdout', format_metrics({'roc': 'roc_auc_score', 'f1': 'f1_score'}))
('holdout', 'roc')
>>> get_formatted_target_metric('roc_auc_score', format_metrics(['roc_auc_score', 'f1_score']))
('oof', 'roc_auc_score')
>>> get_formatted_target_metric(None, format_metrics(['f1_score', 'roc_auc_score']))
('oof', 'f1_score')
-
class
hyperparameter_hunter.metrics.
ScoringMixIn
(metrics, in_fold='all', oof='all', holdout='all', do_score=True)¶ Bases:
object
MixIn class to manage metrics to record for each dataset type, and perform evaluations
- Parameters
- metrics: Dict, List
Specifies all metrics to be used by their id keys, along with a means to compute the metric. If list, all values must be strings that are attributes in sklearn.metrics. If dict, key/value pairs must be of the form: (<id>, <callable/None/str sklearn.metrics attribute>), where “id” is a str name for the metric. Its corresponding value must be one of: 1) a callable to calculate the metric, 2) None if the “id” key is an attribute in sklearn.metrics and should be used to fetch a callable, 3) a string that is an attribute in sklearn.metrics and should be used to fetch a callable. Metric callable functions should expect inputs of form (target, prediction), and should return floats. See the sketch following the Notes section below
- in_fold: List of strings, None, default=<all ids in `metrics`>
Which metrics (from ids in metrics) should be recorded for in-fold data
- oof: List of strings, None, default=<all ids in `metrics`>
Which metrics (from ids in metrics) should be recorded for out-of-fold data
- holdout: List of strings, None, default=<all ids in `metrics`>
Which metrics (from ids in metrics) should be recorded for holdout data
- do_score: Boolean, default=True
This is experimental. If False, scores will be neither calculated nor recorded for the duration of the experiment
Notes
For each kwarg in [in_fold, oof, holdout], the following must be true: if the value of the kwarg is a list, its contents must be a subset of metrics
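A minimal sketch of the metrics formats described above (the toy target/prediction values are illustrative):

from sklearn.metrics import f1_score
from hyperparameter_hunter.metrics import ScoringMixIn

# Equivalent `metrics` formats: a list of `sklearn.metrics` attribute names,
# or a dict mapping <id> to <callable/None/str sklearn.metrics attribute>
scorer = ScoringMixIn(metrics=["roc_auc_score", "f1_score"])
scorer = ScoringMixIn(metrics={"roc_auc_score": None, "f1": f1_score})

# Apply all "oof" metrics to toy values; see `evaluate` below for details
result = scorer.evaluate("oof", target=[0, 1, 1, 0], prediction=[0, 1, 0, 1])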
Methods
evaluate(self, data_type, target, prediction)
Apply metric(s) to the given data to calculate the value of the prediction
-
evaluate
(self, data_type, target, prediction, return_list=False, dry_run=False)¶ Apply metric(s) to the given data to calculate the value of the prediction
- Parameters
- data_type: {“in_fold”, “oof”, “holdout”}
The type of dataset for which target and prediction arguments are being provided
- target: Array-like
True labels for the data. Should be same shape as prediction
- prediction: Array-like
Predicted labels for the data. Should be same shape as target
- return_list: Boolean, default=False
If True, return list of tuples instead of dict. See “Returns” section below for details
- dry_run: Boolean, default=False
If True, the value of last_evaluation_results will not be updated to include the returned _result. The core library callbacks operate under the assumption that last_evaluation_results will be updated as usual, so restrict usage to debugging or lambda_callback() implementations
- Returns
- _result: OrderedDict, or list
A dict whose keys are all metric keys supplied for data_type, and whose values are the results of each metric. If return_list is True, returns a list of tuples of: (<data_type metric str>, <metric result>)
Notes
The required types of target and prediction are entirely dependent on the metric callable’s expectations
-
hyperparameter_hunter.metrics.
get_clean_prediction
(target:Iterable, prediction:Iterable)¶ Create prediction that is of a form comparable to target
- Parameters
- target: Array-like
True labels for the data. Should be same shape as prediction
- prediction: Array-like
Predicted labels for the data. Should be same shape as target
- Returns
- prediction: Array-like
If target values are ints, and prediction values are not, the given predicted labels are clipped between the min and max of target, then rounded to the nearest integer. Else, the original predicted labels are returned
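For instance (a sketch; the toy arrays are illustrative):

from hyperparameter_hunter.metrics import get_clean_prediction

# Integer targets with continuous predictions: values are clipped to
# [min(target), max(target)] = [0, 1], then rounded to the nearest integer
clean = get_clean_prediction(target=[0, 1, 1, 0], prediction=[0.2, 0.8, 1.7, -0.4])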
-
hyperparameter_hunter.metrics.
classify_output
(target, prediction)¶ Force continuous prediction into the discrete, classified space of target. This is not an output/feature transformer akin to SKLearn’s discretization transformers. This function is intended for use in the very specific case of having a target that is classification-like (“binary”, “multiclass”, etc.), with prediction that resembles a “continuous” target, despite being made for target. The most common reason for this occurrence is that prediction is actually the division-averaged predictions collected along the course of a
CVExperiment
. In this case, the original model predictions should have been classification-like; however, due to disagreement in the division predictions, the resulting average predictions appear to be continuous
- Parameters
- target: Array-like
True labels for the data. Should be same shape as prediction
- prediction: Array-like
Predicted labels for the data, which may appear continuous. Should be same shape as target
- Returns
- numpy.array
prediction forced into the discrete space of target labels
Notes
Target types used by this function are defined by sklearn.utils.multiclass.type_of_target.
If a prediction value is exactly between two target values, it will assume the lower of the two values. For example, given a single prediction of 1.5 and unique labels of [0, 1, 2, 3], the value of that prediction will be 1, rather than 2
Examples
>>> import numpy as np >>> classify_output(np.array([0, 3, 1, 2]), [0.5, 1.51, 0.66, 4.9]) array([0, 2, 1, 3]) >>> classify_output(np.array([0, 1, 2, 3]), [0.5, 1.51, 0.66, 4.9]) array([0, 2, 1, 3]) >>> classify_output(np.array([0, 1, 1, 0]), [0.2, 0.8, 0.51, 0.49]) array([0, 1, 1, 0])
-
hyperparameter_hunter.metrics.
wrap_xgboost_metric
(metric, metric_name)¶ Create a function to use as the eval_metric kwarg for
xgboost.sklearn.XGBModel.fit()
- Parameters
- metric: Function
The function to calculate the value of metric, with signature: (target, prediction)
- metric_name: String
The name of the metric being evaluated
- Returns
- eval_metric: Function
The function to pass to XGBoost’s
fit()
, with signature: (prediction, target). It will return a tuple of (metric_name: str, metric_value: float)
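For example (a sketch; the `model`/data names in the comment are assumed to already exist):

from sklearn.metrics import mean_absolute_error
from hyperparameter_hunter.metrics import wrap_xgboost_metric

# Flip a (target, prediction) metric into XGBoost's (prediction, target)
# eval_metric form, returning a ("mae", <float>) tuple per evaluation
mae_eval = wrap_xgboost_metric(mean_absolute_error, "mae")

# Typically forwarded to `xgboost.sklearn.XGBModel.fit()`, e.g.:
#   model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], eval_metric=mae_eval)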
hyperparameter_hunter.models module¶
This module provides wrapper classes around the raw algorithms being executed to facilitate use
by hyperparameter_hunter.experiments.BaseExperiment
. The algorithms created by most
libraries can be handled by hyperparameter_hunter.models.Model
, but some need special
attention, hence KerasModel
, and XGBoostModel
. The model classes defined herein
handle algorithm instantiation, as well as fitting and predicting
Related¶
hyperparameter_hunter.experiments
This module is the primary user of the classes defined in
hyperparameter_hunter.models
hyperparameter_hunter.sentinels
This module defines the Sentinel classes that will be converted to the actual values they represent in
hyperparameter_hunter.models.Model.__init__()
-
hyperparameter_hunter.models.
load_model
(_)¶
-
hyperparameter_hunter.models.
model_selector
(model_initializer)¶ Selects the appropriate Model class to use for model_initializer
- Parameters
- model_initializer: callable
The callable used to create an instance of some algorithm
- Returns
Model
, or one of its children
Examples
>>> from keras.wrappers.scikit_learn import KerasClassifier, KerasRegressor >>> model_selector(KerasClassifier) == KerasModel True >>> model_selector(KerasRegressor) == KerasModel True >>> from sklearn.svm import SVC >>> model_selector(SVC) == Model True >>> model_selector(None) == Model True
-
class
hyperparameter_hunter.models.
Model
(model_initializer, initialization_params, extra_params, train_input=None, train_target=None, validation_input=None, validation_target=None, do_predict_proba=False, target_metric=None, metrics=None)¶ Bases:
object
Handles initialization, fitting, and prediction for provided algorithms. Consider documentation for children of
Model
to be identical to that ofModel
, except where noted- Parameters
- model_initializer: Class
Expected to implement at least the following methods: 1) __init__, to which initialization_params will usually be provided unless stated otherwise in a child class’s documentation - like KerasModel. 2) fit, to which train_input, and train_target will be provided, in addition to the contents of extra_params['fit'] in some child classes - like XGBoostModel. 3) predict, or predict_proba if applicable, which should accept any array-like input of shape: (<num_samples>, train_input.shape[1])
- initialization_params: Dict
A dict containing all arguments accepted by __init__() of the class model_initializer, unless stated otherwise in a child class’s documentation - like KerasModel. Arguments pertaining to random seeds will be ignored
- extra_params: Dict, default={}
A dict of special parameters that are passed to a model’s non-initialization methods in special cases (such as fit, predict, predict_proba, and score). extra_params are not used for all models. See the documentation for the appropriate descendant of models.Model for information about how it handles extra_params
- train_input: `pandas.DataFrame`
The model’s training input data
- train_target: `pandas.DataFrame`
The true labels corresponding to the rows of
train_input
- validation_input: `pandas.DataFrame`, or None
The model’s validation input data to evaluate performance during fitting
- validation_target: `pandas.DataFrame`, or None
The true labels corresponding to the rows of
validation_input
- do_predict_proba: Boolean, or int, default=False
If False, models.Model.fit() will call models.Model.model.predict()
If True, it will call models.Model.model.predict_proba(), and the values in all columns will be used as the actual prediction values
If do_predict_proba is an int, models.Model.fit() will call models.Model.model.predict_proba(), as is the case when do_predict_proba is True, but the int supplied as do_predict_proba declares the column index to use as the actual prediction values
For example, for a model to call the predict method, do_predict_proba=False (default). For a model to call the predict_proba method, and use all of the class probabilities, do_predict_proba=True. To call the predict_proba method, and use the class probabilities in the first column, do_predict_proba=0. To use the second column (index 1) of the result, do_predict_proba=1 - This often corresponds to the positive class’s probabilities in binary classification problems. To use the third column do_predict_proba=2, and so on (see the sketch following this parameter list)
See the notes for the do_predict_proba parameter in the documentation of
environment.Environment
for additional usage notes
- target_metric: Tuple
Used by some child classes (like
XGBoostModel
`) to provide validation data to `model.fit()`
- metrics: Dict
Used by some child classes (like
XGBoostModel
`) to provide validation data to `model.fit()`
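The sketch below illustrates these parameters. It is only illustrative: the LogisticRegression initializer and toy frames are assumptions, and Experiments normally construct Model internally, so the explicit initialize_model()/fit() calls simply mirror the Methods listed next:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from hyperparameter_hunter.models import Model

train_input = pd.DataFrame({"a": [0, 1, 2, 3], "b": [1, 0, 1, 0]})
train_target = pd.DataFrame({"target": [0, 1, 0, 1]})

# do_predict_proba=1 -> call `predict_proba` and keep only column index 1
# (usually the positive class in binary classification)
model = Model(
    LogisticRegression,
    initialization_params=dict(),
    extra_params=dict(),
    train_input=train_input,
    train_target=train_target,
    validation_input=train_input,    # toy reuse; normally a separate fold
    validation_target=train_target,
    do_predict_proba=1,
)
model.initialize_model()
model.fit()
prediction = model.predict(train_input)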
Methods
fit(self)
Train model according to extra_params['fit'] (if appropriate) on training data
initialize_model(self)
Create an instance of a model using model_initializer, with initialization_params as input
predict(self, input_data)
Generate model predictions for input_data
-
initialize_model
(self)¶ Create an instance of a model using
model_initializer
, with initialization_params
as input
-
fit
(self)¶ Train model according to
extra_params['fit']
(if appropriate) on training data
-
predict
(self, input_data)¶ Generate model predictions for input_data
- Parameters
- input_data: Array-like
Data containing the same number of features as were trained on, for which the model will predict output values
- Returns
- prediction: Array-like
Output predictions made by the model, using input_data
-
class
hyperparameter_hunter.models.
XGBoostModel
(model_initializer, initialization_params, extra_params, train_input=None, train_target=None, validation_input=None, validation_target=None, do_predict_proba=False, target_metric=None, metrics=None)¶ Bases:
hyperparameter_hunter.models.Model
A special Model class for handling XGBoost algorithms. Consider documentation to be identical to that of
Model
, except where noted- Parameters
- model_initializer: :class:`xgboost.sklearn.XGBClassifier`, or :class:`xgboost.sklearn.XGBRegressor`
See
Model
- initialization_params: See :class:`Model`
- extra_params: Dict, default={}
Useful keys: [‘fit’, ‘predict’]. If ‘fit’ is a key with a dict value, its contents will be provided to xgboost.sklearn.XGBModel.fit(), with the exception of the following: [‘X’, ‘y’]. If any of the aforementioned keys are in extra_params['fit'], or if extra_params['fit'] is provided, but is not a dict, an Exception will be raised. See the sketch following this parameter list
- train_input: See :class:`Model`
- train_target: See :class:`Model`
- validation_input: See :class:`Model`
- validation_target: See :class:`Model`
- do_predict_proba: See :class:`Model`
- target_metric: Tuple
Used to determine the ‘eval_metric’ argument to xgboost.sklearn.XGBModel.fit(). See the documentation for XGBoostModel.extra_params for more information
- metrics: See :class:`Model`
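An illustrative extra_params dict for an XGBoost model (a sketch; the particular fit kwargs are assumptions, not requirements):

# Contents of "fit" are forwarded to `xgboost.sklearn.XGBModel.fit()`.
# Including "X" or "y" here (or making "fit" a non-dict) raises an Exception
extra_params = dict(
    fit=dict(early_stopping_rounds=5, verbose=False),
)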
Methods
fit(self)
Train model according to extra_params['fit'] (if appropriate) on training data
initialize_model(self)
Create an instance of a model using model_initializer, with initialization_params as input
predict(self, input_data)
Generate model predictions for input_data
-
class
hyperparameter_hunter.models.
KerasModel
(model_initializer, initialization_params, extra_params, train_input=None, train_target=None, validation_input=None, validation_target=None, do_predict_proba=False, target_metric=None, metrics=None)¶ Bases:
hyperparameter_hunter.models.Model
A special Model class for handling Keras neural networks. Consider documentation to be identical to that of
Model
, except where noted- Parameters
- model_initializer: :class:`keras.wrappers.scikit_learn.KerasClassifier`, or :class:`keras.wrappers.scikit_learn.KerasRegressor`
Expected to implement at least the following methods: 1) __init__, to which initialization_params will usually be provided unless stated otherwise in a child class’s documentation - like KerasModel. 2) fit, to which train_input, and train_target will be provided, in addition to the contents of extra_params['fit'] in some child classes - like XGBoostModel. 3) predict, or predict_proba if applicable, which should accept any array-like input of shape: (<num_samples>, train_input.shape[1])
- initialization_params: Dict containing `build_fn`
A dictionary containing the single key: build_fn, which is a callable function that returns a compiled Keras model. See the illustrative build_fn sketch following this parameter list
- extra_params: Dict, default={}
The parameters expected to be passed to the extra methods of the compiled Keras model. Such methods include (but are not limited to) fit, predict, and predict_proba. Some of the common parameters given here include epochs, batch_size, and callbacks
- train_input: `pandas.DataFrame`
The model’s training input data
- train_target: `pandas.DataFrame`
The true labels corresponding to the rows of
train_input
- validation_input: `pandas.DataFrame`, or None
The model’s validation input data to evaluate performance during fitting
- validation_target: `pandas.DataFrame`, or None
The true labels corresponding to the rows of
validation_input
- do_predict_proba: Boolean, or int, default=False
If False, models.Model.fit() will call models.Model.model.predict()
If True, it will call models.Model.model.predict_proba(), and the values in all columns will be used as the actual prediction values
If int, models.Model.fit() will call models.Model.model.predict_proba(), as is the case when do_predict_proba is True, but the int supplied as do_predict_proba declares the column index to use as the actual prediction values
For example, for a model to call the predict method, do_predict_proba=False (default). For a model to call the predict_proba method, and use all of the class probabilities, do_predict_proba=True. To call the predict_proba method, and use the class probabilities in the first column, do_predict_proba=0. To use the second column (index 1) of the result, do_predict_proba=1 - This often corresponds to the positive class’s probabilities in binary classification problems. To use the third column do_predict_proba=2, and so on.
See the notes for the do_predict_proba parameter of Environment for additional usage notes
- target_metric: Tuple
Used by some child classes (like
XGBoostModel
`) to provide validation data to `model.fit()`
- metrics: Dict
Used by some child classes (like
XGBoostModel
`) to provide validation data to `model.fit()`
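An illustrative build_fn for initialization_params (a sketch; the layer sizes, optimizer, and the input_shape argument are assumptions, not mandated by the API):

from keras.models import Sequential
from keras.layers import Dense

def build_fn(input_shape=(10,)):
    # Return a compiled Keras model; `input_shape` feeds the first layer
    model = Sequential()
    model.add(Dense(32, activation="relu", input_shape=input_shape))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

initialization_params = dict(build_fn=build_fn)
extra_params = dict(epochs=10, batch_size=32)  # forwarded to methods like `fit`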
Methods
fit(self)
Train model according to extra_params['fit'] (if appropriate) on training data
get_input_shape(self[, get_dim])
Calculate the shape of the input that should be expected by the model
initialize_keras_neural_network(self)
Initialize Keras model wrapper (model_initializer) with initialization_params, extra_params, and validation_data if it can be found, as well as the input dimensions for the model
initialize_model(self)
Create an instance of a model using model_initializer, with initialization_params as input
predict(self, input_data)
Generate model predictions for input_data
validate_keras_params(self)
Ensure provided input parameters are properly formatted
-
initialize_model
(self)¶ Create an instance of a model using
model_initializer
, with initialization_params
as input
-
fit
(self)¶ Train model according to
extra_params['fit']
(if appropriate) on training data
-
get_input_shape
(self, get_dim=False)¶ Calculate the shape of the input that should be expected by the model
- Parameters
- get_dim: Boolean, default=False
If True, instead of returning an input_shape tuple, an input_dim scalar will be returned
- Returns
- Tuple, or scalar
If get_dim=False, an input_shape tuple. Else, an input_dim scalar
-
validate_keras_params
(self)¶ Ensure provided input parameters are properly formatted
-
initialize_keras_neural_network
(self)¶ Initialize Keras model wrapper (
model_initializer
) with initialization_params, extra_params
, and validation_data if it can be found, as well as the input dimensions for the model
hyperparameter_hunter.recorders module¶
This module handles recording and properly formatting the various result files requested for a completed Experiment. Coincidentally, if a particular result file was blacklisted by the active Environment, that is also handled here
Related¶
hyperparameter_hunter.experiments
This is the intended user of the contents of
hyperparameter_hunter.recorders
-
class
hyperparameter_hunter.recorders.
BaseRecorder
¶ Bases:
object
Base class for other classes that record various Experiment result files. Critical attributes of the descendants of :class:`recorders.BaseRecorder` are set here, enabling them to function properly
- Returns
- None
If
result_path
is None, which means the present result file was blacklisted by the active Environment
- Raises
- EnvironmentInactiveError
If settings.G.Env is None
- EnvironmentInvalidError
If any of the following occur: 1) settings.G.Env does not have an attribute named ‘result_paths’, 2) settings.G.Env.result_paths does not contain the current result_path_key, 3) settings.G.Env.current_task is None
- Attributes
required_attributes
Return attributes of the current Experiment that are necessary to properly record result.
result_path_key
Return key from environment.Environment.result_paths, corresponding to the target record
Methods
format_result(self)
Set BaseRecorder.result to the final result object to be saved by BaseRecorder.save_result()
save_result(self)
Save BaseRecorder.result to BaseRecorder.result_path, or elsewhere if special case
-
abstract property
result_path_key
¶ Return key from
environment.Environment.result_paths
, corresponding to the target record
-
abstract property
required_attributes
Return attributes of the current Experiment that are necessary to properly record result. Specifically, BaseRecorder fetches the attrs via settings.G.Env.current_task, which can also be regarded as environment.Environment.current_task, but this is an implementation detail. It is simpler to use experiments.BaseExperiment, and its appropriate descendants as a reference for acceptable values of required_attributes
-
abstract
format_result
(self)¶ Set
BaseRecorder.result
to the final result object to be saved by BaseRecorder.save_result()
-
abstract
save_result
(self)¶ Save
BaseRecorder.result
to BaseRecorder.result_path
, or elsewhere if special case
-
class
hyperparameter_hunter.recorders.
RecorderList
(file_blacklist=None, extra_recorders=None)¶ Bases:
object
Collection of
BaseRecorder
subclasses to facilitate executing group methods- Parameters
- file_blacklist: List, or None, default=None
If list, used to reject any elements of RecorderList.recorders whose BaseRecorder.result_path_key is in file_blacklist
- extra_recorders: List, None, default=None
If not None, may be a list whose values are tuples of (<recorders.BaseRecorder descendant>, <str result_path>). The result_path str should be a path relative to results_path, specifying the directory/file in which the product of the custom recorder will be saved. The contents of extra_recorders are appended to the list of default recorders and used to create/update result files for an Experiment. The contents of extra_recorders are blacklisted in the same way as normal recorders. That is, if file_blacklist contains the result_path_key of a recorder in extra_recorders, that recorder is blacklisted. See the custom recorder sketch following this parameter list
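A rough sketch of a custom recorder supplied via extra_recorders (the NotesRecorder name, its result_path_key, and its attribute usage are hypothetical; attributes named in required_attributes are assumed to be set by BaseRecorder, following the pattern of the built-in recorders below):

from hyperparameter_hunter.recorders import BaseRecorder

class NotesRecorder(BaseRecorder):  # hypothetical custom recorder
    result_path_key = "notes"
    required_attributes = ["experiment_id", "notes"]

    def format_result(self):
        # Attributes listed in `required_attributes` come from the Experiment
        self.result = "{}: {}".format(self.experiment_id, self.notes)

    def save_result(self):
        with open(self.result_path, "w") as f:
            f.write(self.result)

# Tuple form expected by `extra_recorders`; "Notes" is relative to `results_path`
extra_recorders = [(NotesRecorder, "Notes")]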
Methods
format_result(self)
Execute format_result() for all classes in recorders
save_result(self)
Execute save_result() for all classes in recorders
-
format_result
(self)¶ Execute format_result() for all classes in recorders
-
save_result
(self)¶ Execute save_result() for all classes in recorders
Notes
When iterating through recorders and calling save_result(), a check is performed for exit_code. Children classes of BaseRecorder are NOT expected to explicitly return a value in their save_result(). However, if a value is returned and exit_code == ‘break’, the result-saving loop will be broken, and no further results will be saved. In practice, this is only performed for the sake of DescriptionRecorder.save_result(), which has the additional quality of being able to prevent any other result files from being saved if the result of DescriptionRecorder.do_full_save() returns False when given the formatted DescriptionRecorder.result. This can be useful when there are storage constraints, because it ensures that essential data - including keys and the results of the experiment - are saved (to ensure the experiment is not duplicated, and to enable optimization protocol learning), while extra results like Predictions are not saved
-
class
hyperparameter_hunter.recorders.
DescriptionRecorder
¶ Bases:
hyperparameter_hunter.recorders.BaseRecorder
Recorder for the Experiment description file, saved as a .json file named after experiment_id. Critical attributes are set by :class:`recorders.BaseRecorder`, enabling the recorder to function properly
- Returns
- None
If
result_path
is None, which means the present result file was blacklisted by the active Environment
- Raises
- EnvironmentInactiveError
If settings.G.Env is None
- EnvironmentInvalidError
If any of the following occur: 1) settings.G.Env does not have an attribute named ‘result_paths’, 2) settings.G.Env.result_paths does not contain the current result_path_key, 3) settings.G.Env.current_task is None
Methods
format_result(self)
Format an OrderedDict containing the Experiment’s identifying attributes, results, hyperparameters used, and other stats or information that may be useful
save_result(self)
Save the Experiment description as a .json file, named after experiment_id.
-
result_path_key
= 'description'¶
-
required_attributes
= ['experiment_id', 'hyperparameter_key', 'cross_experiment_key', 'last_evaluation_results', 'stat_aggregates', 'source_script', 'notes', 'model_initializer', 'do_full_save', 'model', 'algorithm_name', 'module_name']¶
-
format_result
(self)¶ Format an OrderedDict containing the Experiment’s identifying attributes, results, hyperparameters used, and other stats or information that may be useful
-
save_result
(self)¶ Save the Experiment description as a .json file, named after
experiment_id
. Ifdo_full_save
is a callable and returns False when given the description object, the result recording loop will be broken, and the remaining result files will not be saved- Returns
- ‘break’
This string will be returned if
do_full_save
is a callable and returns False when given the description object. This is the signal forrecorders.RecorderList
to stop recording result files
-
class
hyperparameter_hunter.recorders.
HeartbeatRecorder
¶ Bases:
hyperparameter_hunter.recorders.BaseRecorder
Recorder that copies the global Heartbeat log to the results directory as a .log file named for experiment_id. Critical attributes are set by :class:`recorders.BaseRecorder`, enabling the recorder to function properly
- Returns
- None
If
result_path
is None, which means the present result file was blacklisted by the active Environment
- Raises
- EnvironmentInactiveError
If settings.G.Env is None
- EnvironmentInvalidError
If any of the following occur: 1) settings.G.Env does not have an attribute named ‘result_paths’, 2) settings.G.Env.result_paths does not contain the current result_path_key, 3) settings.G.Env.current_task is None
Methods
format_result(self)
Do nothing
save_result(self)
Copy global Heartbeat log to results dir as .log file named for experiment_id
-
result_path_key
= 'heartbeat'¶
-
required_attributes
= ['experiment_id']¶
-
format_result
(self)¶ Do nothing
-
save_result
(self)¶ Copy global Heartbeat log to results dir as .log file named for
experiment_id
-
class
hyperparameter_hunter.recorders.
PredictionsHoldoutRecorder
¶ Bases:
hyperparameter_hunter.recorders.BaseRecorder
Recorder for the Experiment’s holdout predictions, saved to a .csv file named after experiment_id. Critical attributes are set by :class:`recorders.BaseRecorder`, enabling the recorder to function properly
- Returns
- None
If
result_path
is None, which means the present result file was blacklisted by the active Environment
- Raises
- EnvironmentInactiveError
If settings.G.Env is None
- EnvironmentInvalidError
If any of the following occur: 1) settings.G.Env does not have an attribute named ‘result_paths’, 2) settings.G.Env.result_paths does not contain the current result_path_key, 3) settings.G.Env.current_task is None
Methods
format_result(self)
Format predictions according to the callable prediction_formatter
save_result(self)
Save holdout predictions to a .csv file, named after experiment_id
-
result_path_key
= 'predictions_holdout'¶
-
required_attributes
= ['data_holdout', 'holdout_dataset', 'experiment_id', 'prediction_formatter', 'target_column', 'id_column', 'to_csv_params']¶
-
format_result
(self)¶ Format predictions according to the callable
prediction_formatter
-
save_result
(self)¶ Save holdout predictions to a .csv file, named after
experiment_id
-
class
hyperparameter_hunter.recorders.
PredictionsOOFRecorder
¶ Bases:
hyperparameter_hunter.recorders.BaseRecorder
Recorder for the Experiment’s out-of-fold predictions, saved to a .csv file named after experiment_id. Critical attributes are set by :class:`recorders.BaseRecorder`, enabling the recorder to function properly
- Returns
- None
If
result_path
is None, which means the present result file was blacklisted by the active Environment
- Raises
- EnvironmentInactiveError
If settings.G.Env is None
- EnvironmentInvalidError
If any of the following occur: 1) settings.G.Env does not have an attribute named ‘result_paths’, 2) settings.G.Env.result_paths does not contain the current result_path_key, 3) settings.G.Env.current_task is None
Methods
format_result(self)
Format predictions according to the callable prediction_formatter
save_result(self)
Save out-of-fold predictions to a .csv file, named after experiment_id
-
result_path_key
= 'predictions_oof'¶
-
required_attributes
= ['data_oof', 'train_dataset', 'experiment_id', 'prediction_formatter', 'target_column', 'id_column', 'to_csv_params']¶
-
format_result
(self)¶ Format predictions according to the callable
prediction_formatter
-
save_result
(self)¶ Save out-of-fold predictions to a .csv file, named after
experiment_id
-
class
hyperparameter_hunter.recorders.
PredictionsTestRecorder
¶ Bases:
hyperparameter_hunter.recorders.BaseRecorder
Recorder for the Experiment’s test predictions, saved to a .csv file named after experiment_id. Critical attributes are set by :class:`recorders.BaseRecorder`, enabling the recorder to function properly
- Returns
- None
If
result_path
is None, which means the present result file was blacklisted by the active Environment
- Raises
- EnvironmentInactiveError
If settings.G.Env is None
- EnvironmentInvalidError
If any of the following occur: 1) settings.G.Env does not have an attribute named ‘result_paths’, 2) settings.G.Env.result_paths does not contain the current result_path_key, 3) settings.G.Env.current_task is None
Methods
format_result(self)
Format predictions according to the callable prediction_formatter
save_result(self)
Save test predictions to a .csv file, named after experiment_id
-
result_path_key
= 'predictions_test'¶
-
required_attributes
= ['data_test', 'test_dataset', 'experiment_id', 'prediction_formatter', 'target_column', 'id_column', 'to_csv_params']¶
-
format_result
(self)¶ Format predictions according to the callable
prediction_formatter
-
save_result
(self)¶ Save test predictions to a .csv file, named after
experiment_id
-
class
hyperparameter_hunter.recorders.
TestedKeyRecorder
¶ Bases:
hyperparameter_hunter.recorders.BaseRecorder
Recorder that saves the cross-experiment and hyperparameter keys, and updates their tested keys entries. Critical attributes are set by :class:`recorders.BaseRecorder`, enabling the recorder to function properly
- Returns
- None
If
result_path
is None, which means the present result file was blacklisted by the active Environment
- Raises
- EnvironmentInactiveError
If settings.G.Env is None
- EnvironmentInvalidError
If any of the following occur: 1) settings.G.Env does not have an attribute named ‘result_paths’, 2) settings.G.Env.result_paths does not contain the current result_path_key, 3) settings.G.Env.current_task is None
Methods
format_result(self)
Do nothing
save_result(self)
Save cross-experiment, and hyperparameter keys, and update their tested keys entries
-
result_path_key
= 'tested_keys'¶
-
required_attributes
= ['experiment_id', 'hyperparameter_key', 'cross_experiment_key']¶
-
format_result
(self)¶ Do nothing
-
save_result
(self)¶ Save cross-experiment, and hyperparameter keys, and update their tested keys entries
-
class
hyperparameter_hunter.recorders.
LeaderboardEntryRecorder
¶ Bases:
hyperparameter_hunter.recorders.BaseRecorder
Recorder that adds the current Experiment to the global leaderboard and saves the updated leaderboard file. Critical attributes are set by :class:`recorders.BaseRecorder`, enabling the recorder to function properly
- Returns
- None
If
result_path
is None, which means the present result file was blacklisted by the active Environment
- Raises
- EnvironmentInactiveError
If settings.G.Env is None
- EnvironmentInvalidError
If any of the following occur: 1) settings.G.Env does not have an attribute named ‘result_paths’, 2) settings.G.Env.result_paths does not contain the current result_path_key, 3) settings.G.Env.current_task is None
Methods
format_result(self)
Read existing global leaderboard, add current entry, then sort the updated leaderboard
save_result(self)
Save the updated leaderboard file
-
result_path_key
= 'tested_keys'¶
-
required_attributes
= ['result_paths', 'current_task', 'target_metric', 'metrics']¶
-
format_result
(self)¶ Read existing global leaderboard, add current entry, then sort the updated leaderboard
-
save_result
(self)¶ Save the updated leaderboard file
-
class
hyperparameter_hunter.recorders.
UnsortedIDLeaderboardRecorder
¶ Bases:
hyperparameter_hunter.recorders.BaseRecorder
Recorder that adds the current Experiment to the unsorted-ID leaderboard and saves the updated leaderboard file. Critical attributes are set by :class:`recorders.BaseRecorder`, enabling the recorder to function properly
- Returns
- None
If
result_path
is None, which means the present result file was blacklisted by the active Environment
- Raises
- EnvironmentInactiveError
If settings.G.Env is None
- EnvironmentInvalidError
If any of the following occur: 1) settings.G.Env does not have an attribute named ‘result_paths’, 2) settings.G.Env.result_paths does not contain the current result_path_key, 3) settings.G.Env.current_task is None
Methods
format_result(self)
Read existing global leaderboard, add current entry, then sort the updated leaderboard
save_result(self)
Save the updated leaderboard file
-
result_path_key
= 'unsorted_id_leaderboard'¶
-
required_attributes
= ['result_paths', 'current_task', 'target_metric', 'metrics']¶
-
format_result
(self)¶ Read existing global leaderboard, add current entry, then sort the updated leaderboard
-
save_result
(self)¶ Save the updated leaderboard file
-
class
hyperparameter_hunter.recorders.
YAMLDescriptionRecorder
¶ Bases:
hyperparameter_hunter.recorders.BaseRecorder
Recorder for the YAML-formatted Experiment description. Critical attributes are set by :class:`recorders.BaseRecorder`, enabling the recorder to function properly
- Returns
- None
If
result_path
is None, which means the present result file was blacklisted by the active Environment
- Raises
- EnvironmentInactiveError
If settings.G.Env is None
- EnvironmentInvalidError
If any of the following occur: 1) settings.G.Env does not have an attribute named ‘result_paths’, 2) settings.G.Env.result_paths does not contain the current result_path_key, 3) settings.G.Env.current_task is None
Methods
format_result(self)
Set BaseRecorder.result to the final result object to be saved by BaseRecorder.save_result()
save_result(self)
Save BaseRecorder.result to BaseRecorder.result_path, or elsewhere if special case
-
result_path_key
= 'yaml_description'¶
-
required_attributes
= ['result_paths', 'experiment_id']¶
-
format_result
(self)¶ Set
BaseRecorder.result
to the final result object to be saved by BaseRecorder.save_result()
-
save_result
(self)¶ Save
BaseRecorder.result
to BaseRecorder.result_path
, or elsewhere if special case
hyperparameter_hunter.reporting module¶
-
class
hyperparameter_hunter.reporting.
ReportingHandler
(heartbeat_path=None, float_format='{:.5f}', console_params=None, heartbeat_params=None, add_frame=False)¶ Bases:
object
Class in control of logging methods, log formatting, and initializing Experiment logging
- Parameters
- heartbeat_path: Str path, or None, default=None
If string and valid heartbeat path, logging messages will also be saved in this file
- float_format: String, default=’{:.5f}’
If not default, must be a valid formatting string for floating point values. If invalid, default will be used
- console_params: Dict, or None, default=None
Parameters passed to
_configure_console_handler()
- heartbeat_params: Dict, or None, default=None
Parameters passed to
_configure_heartbeat_handler()
- add_frame: Boolean, default=False
If True, whenever
log()
is called, the source of the call will be prepended to the content being logged
Methods
debug(self, content, **kwargs)
Placeholder method before proper initialization
log(self, content, **kwargs)
Placeholder method before proper initialization
warn(self, content, **kwargs)
Placeholder method before proper initialization
-
log
(self, content, **kwargs)¶ Placeholder method before proper initialization
-
debug
(self, content, **kwargs)¶ Placeholder method before proper initialization
-
warn
(self, content, **kwargs)¶ Placeholder method before proper initialization
-
hyperparameter_hunter.reporting.
clean_parameter_names
(parameter_names:list) → List[str]¶ Remove unnecessary prefixes or characters from the names of search space dimensions
- Parameters
- parameter_names: List
Names of the dimensions in a hyperparameter search Space object. Values are usually tuples
- Returns
- names: List[str]
Cleaned parameter_names, containing stringified values to facilitate logging
-
hyperparameter_hunter.reporting.
get_param_column_sizes
(space:list, names:List[str]) → List[int]¶ Determine maximum column sizes for displaying values of each hyperparameter in space
- Parameters
- space: List
Hyperparameter search space dimensions for the current Optimization Protocol
- names: List[str]
Cleaned hyperparameter dimension names
- Returns
- sizes: List[int]
Column sizes for each of the hyperparameters in names
-
class
hyperparameter_hunter.reporting.
OptimizationReporter
(space: list, verbose=1, show_experiment_id=8, do_maximize=True)¶ Bases:
object
A MixIn class for reporting the results of hyperparameter optimization rounds
- Parameters
- space: List
Hyperparameter search space dimensions for the current Optimization Protocol
- verbose: Int in [0, 1, 2], default=1
If 0, all but critical logging is silenced. If 1, normal logging is performed. If 2, detailed logging is performed
- show_experiment_id: Int, or Boolean, default=8
If True, the experiment_id will be printed in each result row. If False, it will not. If int, the first show_experiment_id-many characters of each experiment_id will be printed in each row
- do_maximize: Boolean, default=True
If False, smaller metric values will be considered preferred and will be highlighted to stand out. Else larger metric values will be treated as preferred
Methods
print_header(self, header, line)
Utility to perform actual printing of headers given formatted inputs
print_optimization_header(self)
Print a header signifying that Optimization rounds are starting
print_random_points_header(self)
Print a header signifying that random point evaluation rounds are starting
print_result(self, hyperparameters, evaluation)
Print a row containing the results of an Experiment just executed
print_saved_results_header(self)
Print a header signifying that saved Experiment results are being read
print_summary(self)
Print a summary of the results of hyperparameter optimization upon completion
reset_timer(self)
Set start_time, and last_round to the current time
-
print_saved_results_header
(self)¶ Print a header signifying that saved Experiment results are being read
-
print_random_points_header
(self)¶ Print a header signifying that random point evaluation rounds are starting
-
print_optimization_header
(self)¶ Print a header signifying that Optimization rounds are starting
-
print_header
(self, header, line)¶ Utility to perform actual printing of headers given formatted inputs
- Parameters
- header: String
Specifies the stage of optimization being entered, and the type of results to follow
- line: String
The underlining to follow header
-
print_result
(self, hyperparameters, evaluation, experiment_id=None)¶ Print a row containing the results of an Experiment just executed
- Parameters
- hyperparameters: List
List of hyperparameter values in the same order as
parameter_names
- evaluation: Float
An evaluation of the performance of hyperparameters
- experiment_id: Str, or None, default=None
If not None, should be a string that is the UUID of the Experiment
-
reset_timer
(self)¶ Set
start_time
, and last_round
to the current time
-
print_summary
(self)¶ Print a summary of the results of hyperparameter optimization upon completion
-
hyperparameter_hunter.reporting.
format_frame_source
(previous_frame, **kwargs)¶ Construct a string describing the location at which a call was made
- Parameters
- previous_frame: Frame
A frame depicting the location at which a call was made
- **kwargs: Dict
Any additional kwargs to supply to
reporting.stringify_frame_source()
- Returns
- The stringified frame source information of previous_frame
-
hyperparameter_hunter.reporting.
stringify_frame_source
(src_file, src_line_no, src_func, src_class, add_line_no=True, max_line_no_size=4, total_max_size=80)¶ Construct a string that neatly displays the location in the code at which a call was made
- Parameters
- src_file: Str
A filepath
- src_line_no: Int
The line number in src_file at which the call was made
- src_func: Str
The name of the function in src_file in which the call was made
- src_class: Str, or None
If not None, the class in src_file in which the call was made
- add_line_no: Boolean, default=True
If True, the line number will be included in the source_content result
- max_line_no_size: Int, default=4
Total number (including padding) of characters to be occupied by src_line_no. For example, if src_line_no=32, and max_line_no_size=4, src_line_no will be padded to become ‘32  ’ in order to occupy four characters
- total_max_size: Int, default=80
Total number (including padding) of characters to be occupied by the source_content result
- Returns
- source_content: Str
A formatted string containing the location in the code at which a call was made
Examples
>>> stringify_frame_source("reporting.py", 570, "stringify_frame_source", None) '570 - reporting.stringify_frame_source() ' >>> stringify_frame_source("reporting.py", 12, "bar", "Foo") '12 - reporting.Foo.bar() ' >>> stringify_frame_source("reporting.py", 12, "bar", "Foo", add_line_no=False) 'reporting.Foo.bar() ' >>> stringify_frame_source("reporting.py", 12, "bar", "Foo", total_max_size=60) '12 - reporting.Foo.bar() '
-
hyperparameter_hunter.reporting.
add_time_to_content
(content, add_time=False)¶ Construct a string containing the original content, in addition to the current time
- Parameters
- content: Str
The original string, to which the current time will be concatenated
- add_time: Boolean, default=False
If True, the current time will be concatenated onto the end of content
- Returns
- content: Str
Str containing original content, along with current time, and additional formatting
-
hyperparameter_hunter.reporting.
format_fold_run
(rep=None, fold=None, run=None, mode='concise')¶ Construct a string to display the repetition, fold, and run currently being executed
- Parameters
- rep: Int, or None, default=None
The repetition number currently being executed
- fold: Int, or None, default=None
The fold number currently being executed
- run: Int, or None, default=None
The run number currently being executed
- mode: {“concise”, “verbose”}, default=”concise”
If “concise”, the result will contain abbreviations for rep/fold/run
- Returns
- content: Str
A clean display of the current repetition/fold/run
Examples
>>> format_fold_run(rep=0, fold=3, run=2, mode="concise") 'R0-f3-r2' >>> format_fold_run(rep=0, fold=3, run=2, mode="verbose") 'Rep-Fold-Run: 0-3-2' >>> format_fold_run(rep=0, fold=3, run="*", mode="concise") 'R0-f3-r*' >>> format_fold_run(rep=0, fold=3, run=2, mode="foo") Traceback (most recent call last): File "reporting.py", line ?, in format_fold_run ValueError: Received invalid mode value: 'foo'
-
hyperparameter_hunter.reporting.
format_evaluation
(results, separator=' | ', float_format='{:.5f}')¶ Construct a string to neatly display the results of a model evaluation
- Parameters
- results: Dict
The results of a model evaluation, in which keys represent the dataset type evaluated, and values are dicts containing metrics as keys, and metric values as values
- separator: Str, default=’ | ‘
The string used to join all the metric values into a single string
- float_format: Str, default=’{:.5f}’
A python string float formatter, applied to floating metric values
- Returns
- content: Str
The model’s evaluation results
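For instance (a sketch; the metric values are illustrative):

from hyperparameter_hunter.reporting import format_evaluation

results = {"oof": {"roc_auc": 0.83211, "f1": 0.77042}}
# Floating metric values are formatted with `float_format`, then all metric
# strings are joined with `separator`
line = format_evaluation(results, separator=" | ", float_format="{:.5f}")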
hyperparameter_hunter.result_reader module¶
-
hyperparameter_hunter.result_reader.
finder_selector
(module_name)¶ Selects the appropriate
ResultFinder
to use for module_name- Parameters
- module_name: String
Module from whence the algorithm being used came
- Returns
- Uninitialized ResultFinder, or one of its descendants
Examples
>>> assert finder_selector("Keras") == KerasResultFinder >>> assert finder_selector("xgboost") == ResultFinder >>> assert finder_selector("lightgbm") == ResultFinder
-
hyperparameter_hunter.result_reader.
update_match_status
(target_attr='match_status') → callable¶ Build a decorator to apply to class instance methods to record inputs/outputs
- Parameters
- target_attr: String, default=”match_status”
Name of dict attribute in the class instance of the decorated method, in which the decorated method’s inputs and outputs should be recorded. This attribute should be predefined and documented by the class whose method is being decorated
- Returns
- Callable
Decorator that will save the decorated method’s inputs and outputs to the attribute dict named by target_attr. Decorator assumes that the method will receive at least three positional arguments: “exp_id”, “params”, and “score”. “exp_id” is used as the key added to target_attr, with a dict value, which includes the latter two positional arguments. Each time the decorator is invoked with an “exp_id”, an additional key is added to its dict that is the name of the decorated method, and whose value is the decorated method’s output
See also
ResultFinder
Decorates “does_match…” methods using update_match_status in order to keep a detailed record of the full pool of candidate Experiments in
ResultFinder.match_status
Examples
>>> class X: ... def __init__(self): ... self.match_status = dict() ... @update_match_status() ... def method_a(self, exp_id, params, score): ... return True ... @update_match_status() ... def method_b(self, exp_id, params, score): ... return False >>> x = X() >>> x.match_status {} >>> assert x.method_a("foo", None, 0.8) is True >>> x.match_status # doctest: +NORMALIZE_WHITESPACE {'foo': {'params': None, 'score': 0.8, 'method_a': True}} >>> assert x.method_b("foo", None, 0.8) is False >>> x.match_status # doctest: +NORMALIZE_WHITESPACE {'foo': {'params': None, 'score': 0.8, 'method_a': True, 'method_b': False}} >>> assert x.method_b("bar", "some stuff", 0.5) is False >>> x.match_status # doctest: +NORMALIZE_WHITESPACE {'foo': {'params': None, 'score': 0.8, 'method_a': True, 'method_b': False}, 'bar': {'params': 'some stuff', 'score': 0.5, 'method_b': False}}
-
hyperparameter_hunter.result_reader.
does_match_guidelines
(candidate_params:dict, space:hyperparameter_hunter.space.space_core.Space, template_params:dict, visitors=(), dims_to_ignore:List[tuple]=None) → bool¶ Check candidate compatibility with template guideline hyperparameters
- Parameters
- candidate_params: Dict
Candidate Experiment hyperparameters to be compared to template_params after processing
- space: Space
Hyperparameter search space constraints for the current template
- template_params: Dict
Template hyperparameters to which candidate_params will be compared after processing. Although the name of the function implies that these will all be guideline hyperparameters, this is not a requirement, as any non-guideline hyperparameters will be removed during processing by checking space.names
- visitors: Callable, or Tuple[callable] (optional)
Extra visit function(s) invoked when remap()-ing both template_params and candidate_params. Can be used to filter out unwanted values, or to pre-process selected values (and more) prior to performing the final compatibility check between the processed candidate_params and guidelines in template_params
- dims_to_ignore: List[tuple] (optional)
Paths to hyperparameter(s) that should be ignored when comparing candidate_params and template_params. By default, hyperparameters pertaining to verbosity and random states are ignored. Paths should be tuples of the form expected by get_path(). Additionally a path may start with None, which acts as a wildcard, matching any hyperparameters whose terminal path nodes match the value following None. For example, (None, "verbose") would match paths such as ("model_init_params", "a", "verbose") and ("model_extra_params", "b", 2, "verbose")
- Returns
- Boolean
True if the processed version of candidate_params is equal to the extracted and processed guidelines from template_params. Else, False
-
hyperparameter_hunter.result_reader.
validate_feature_engineer
(candidate:Union[dict, hyperparameter_hunter.feature_engineering.FeatureEngineer], template:hyperparameter_hunter.feature_engineering.FeatureEngineer) → Union[bool, dict, hyperparameter_hunter.feature_engineering.FeatureEngineer]¶ Check candidate “feature_engineer” compatibility with template and sanitize candidate. This is mostly a wrapper around
validate_fe_steps()
to ensure different inputs are handled properly and to return False, rather than raising IncompatibleCandidateError
- Parameters
- candidate: Dict, or FeatureEngineer
Candidate “feature_engineer” to compare to template. If compatible with template, a sanitized version of candidate will be returned (described below)
- template: FeatureEngineer
Template “feature_engineer” to which candidate will be compared after processing
- Returns
- Boolean, dict, or FeatureEngineer
False if candidate is deemed incompatible with template. Else, a sanitized candidate with reinitialized
EngineerStep
steps and withRejectedOptional
filling in missingCategorical
steps that were declared asoptional
by the template
-
hyperparameter_hunter.result_reader.
validate_fe_steps
(candidate:Union[list, hyperparameter_hunter.feature_engineering.FeatureEngineer], template:Union[list, hyperparameter_hunter.feature_engineering.FeatureEngineer]) → list¶ Check candidate “feature_engineer” steps compatibility with template and sanitize candidate
- Parameters
- candidate: List, or FeatureEngineer
Candidate “feature_engineer” steps to compare to template. If compatible with template, a sanitized version of candidate will be returned (described below)
- template: List, or FeatureEngineer
Template “feature_engineer” steps to which candidate will be compared. template is also used to sanitize candidate (described below)
- Returns
- List
If candidate is compatible with template, returns a list resembling candidate, with the following changes: 1) all step dicts in candidate are reinitialized to proper EngineerStep instances; and 2) wherever candidate was missing a step that was tagged as optional in template, RejectedOptional is added. In the end, if a list is returned, it is built from candidate, guaranteed to be the same length as template and guaranteed to contain only EngineerStep and RejectedOptional instances
- Raises
- IncompatibleCandidateError
If candidate is incompatible with template. candidate may be incompatible with template for any of the following reasons:
1. candidate has more steps than template
2. candidate has a step that differs from a concrete (non-Categorical) template step
3. candidate has a step that does not fit in a Categorical template step
4. candidate is missing a concrete step in template
5. candidate is missing a non-optional Categorical step in template
-
class
hyperparameter_hunter.result_reader.
ResultFinder
(algorithm_name, module_name, cross_experiment_key, target_metric, space, leaderboard_path, descriptions_dir, model_params, sort=None)¶ Bases:
object
Locate saved Experiments that are compatible with the given constraints
- Parameters
- algorithm_name: String
Name of the algorithm whose hyperparameters are being optimized
- module_name: String
Name of the module from whence the algorithm being used came
- cross_experiment_key: String
hyperparameter_hunter.environment.Environment.cross_experiment_key
produced by the current Environment- target_metric: Tuple
Path denoting the metric to be used. The first value should be one of {“oof”, “holdout”, “in_fold”}, and the second value should be the name of a metric supplied in
hyperparameter_hunter.environment.Environment.metrics_params
- space: Space
Instance of
Space
, defining hyperparameter search space constraints- leaderboard_path: String
Path to a leaderboard file, whose listed Experiments will be tested for compatibility
- descriptions_dir: String
Path to a directory containing the description files of saved Experiments
- model_params: Dict
All hyperparameters for the model, both concrete and choice. Common keys include “model_init_params” and “model_extra_params”, both of which can be pointers to dicts of hyperparameters. Additionally, “feature_engineer” may be included with an instance of
FeatureEngineer
- sort: {“target_asc”, “target_desc”, “chronological”, “reverse_chronological”}, or int
Experimental. How to sort the experiment results that fit within the given constraints:
“target_asc”: Sort from experiments with the lowest value for target_metric to those with the greatest
“target_desc”: Sort from experiments with the highest value for target_metric to those with the lowest
“chronological”: Sort from oldest experiments to newest
“reverse_chronological”: Sort from newest experiments to oldest
int: Random seed with which to shuffle experiments
See also
update_match_status()
Used to decorate “does_match…” methods in order to keep a detailed record of the full pool of candidate Experiments in match_status. Aside from being used to compile the list of finalist similar_experiments, match_status is helpful for debugging purposes, specifically figuring out which aspects of a candidate are incompatible with the template
- Attributes
- similar_experiments: List[Tuple[dict, Number, str]]
Candidate saved Experiment results that are fully compatible with the template hyperparameters. Each value is a tuple triple of (<hyperparameters>, <target_metric value>, <candidate experiment_id>). similar_experiments is composed of the “finalists” from
match_status
- match_status: Dict[str, dict]
Record of the hyperparameters and target_metric values for all discovered Experiments, keyed by values of
experiment_ids
. Each value is a dict containing two keys: “params” (hyperparameter dict), and “score” (target_metric value number). In addition to these two keys, a key may be added for every ResultFinder method decorated by update_match_status()
. The exact key will be the name of the method invoked (which will always start with “does_match”, followed by the name of the hyperparameter group being checked). The value for each “does_match…” key is the value returned by that method, whose truthiness dictates whether the candidate Experiment was a successful match to the template hyperparameters for that group. For example, a match_status entry for one Experiment could look like this:{ "params": <dict of hyperparameters for candidate>, "score": 0.42, # `target_metric` value for candidate Experiment "does_match_init_params_space": True, "does_match_init_params_guidelines": False, "does_match_extra_params_space": False, "does_match_extra_params_guidelines": True, "does_match_feature_engineer": <`FeatureEngineer`>, # Still truthy }
Note that “model_init_params” and “model_extra_params” both check the compatibility of “space” choices and concrete “guidelines” separately. Conversely, “feature_engineer” is checked in its entirety by the single
does_match_feature_engineer()
. Also note that “does_match…” values are not restricted to booleans. For instance, “does_match_feature_engineer” may be set to a reinitialized FeatureEngineer, which is still truthy, even though it’s not True. If all of the “does_match…” keys have truthy values, the candidate is a full match and is added to similar_experiments
Methods
find(self)
Execute full result-finding workflow, populating similar_experiments
-
property
experiment_ids
¶ Experiment IDs in the target Leaderboard that match
algorithm_name
andcross_experiment_key
- Returns
- List[str]
All saved Experiment IDs listed in the Leaderboard at
leaderboard_path
that match thealgorithm_name
andcross_experiment_key
of the template
-
property
mini_spaces
¶ Separate
space
into subspaces based onmodel_params
keys- Returns
- Dict[str, Space]
Dict of subspaces, wherein keys are all keys of
model_params
. Each key’s corresponding value is a filtered subspace, containing all the dimensions inspace
whose name tuples start with that key. Keys will usually be one of the core hyperparameter group names (“model_init_params”, “model_extra_params”, “feature_engineer”, “feature_selector”)
Examples
>>> from hyperparameter_hunter import Integer >>> def es_0(all_inputs): ... return all_inputs >>> def es_1(all_inputs): ... return all_inputs >>> def es_2(all_inputs): ... return all_inputs >>> s = Space([ ... Integer(900, 1500, name=("model_init_params", "max_iter")), ... Categorical(["svd", "cholesky", "lsgr"], name=("model_init_params", "solver")), ... Categorical([es_1, es_2], name=("feature_engineer", "steps", 1)), ... ]) >>> rf = ResultFinder( ... "a", "b", "c", ("oof", "d"), space=s, leaderboard_path="e", descriptions_dir="f", ... model_params=dict( ... model_init_params=dict( ... max_iter=s.dimensions[0], normalize=True, solver=s.dimensions[1], ... ), ... feature_engineer=FeatureEngineer([es_0, s.dimensions[2]]), ... ), ... ) >>> rf.mini_spaces # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE {'model_init_params': Space([Integer(low=900, high=1500), Categorical(categories=('svd', 'cholesky', 'lsgr'))]), 'feature_engineer': Space([Categorical(categories=(<function es_1 at ...>, <function es_2 at ...>))])}
-
find
(self)¶ Execute full result-finding workflow, populating
similar_experiments
See also
update_match_status()
Used to decorate “does_match…” methods in order to keep a detailed record of the full pool of candidate Experiments in match_status. Aside from being used to compile the list of finalist similar_experiments, match_status is helpful for debugging purposes, specifically figuring out which aspects of a candidate are incompatible with the template
does_match_feature_engineer()
Performs special functionality beyond that of the other “does_match…” methods, namely providing an updated “feature_engineer” value for compatible candidates to use. Specifics are documented in does_match_feature_engineer()
-
does_match_feature_engineer
(self, exp_id, params, score) → Union[bool, dict, hyperparameter_hunter.feature_engineering.FeatureEngineer]¶ Check candidate compatibility with feature_engineer template guidelines and space choices. This method is different from the other “does_match…” methods in two important aspects:
It checks both guidelines and choices in a single method
It returns an updated feature_engineer for compatible candidates, rather than True
- Parameters
- exp_id: String
Candidate Experiment ID
- params: Dict
Candidate “feature_engineer” to compare to the template in
model_params
. This should always be a dict, not an instance of FeatureEngineer; the same cannot be assumed of the template “feature_engineer” in model_params
- score: Number
Value of the candidate Experiment’s target metric
- Returns
- Boolean, dict, or FeatureEngineer
Expanding on the second difference noted in the description, False will still be returned if the candidate is deemed incompatible with the template (as is the case with the other “does_match…” methods). The return value differs with compatible candidates in order to provide a feature_engineer with reinitialized
EngineerStep
steps and to fill in missing Categorical
steps that were declared asoptional
by the template. This updated feature_engineer is the value that then gets included in the candidate’s similar_experiments
entry (assuming the candidate is a full match)
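A hedged illustration of consuming this tri-state return value (`rf`, the Experiment ID, and the candidate dict are hypothetical):

result = rf.does_match_feature_engineer("some-exp-id", candidate_feature_engineer, 0.42)
if result is False:
    pass  # candidate is incompatible with the template
else:
    # Truthy, but not necessarily `True`: may be a reinitialized `FeatureEngineer`
    # with missing `optional` steps filled in, destined for `similar_experiments`
    updated_feature_engineer = result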
-
does_match_init_params_space
(self, exp_id, params, score) → bool¶ Check candidate compatibility with model_init_params template space choices
- Parameters
- exp_id: String
Candidate Experiment ID
- params: Dict
Candidate “model_init_params” to compare to the template in
model_params
- score: Number
Value of the candidate Experiment’s target metric
- Returns
- Boolean
True if candidate params fit in model_init_params space choices. Else, False
-
does_match_init_params_guidelines
(self, exp_id, params, score, template_params=None) → bool¶ Check candidate compatibility with model_init_params template guidelines
- Parameters
- exp_id: String
Candidate Experiment ID
- params: Dict
Candidate “model_init_params” to compare to the template in
model_params
- score: Number
Value of the candidate Experiment’s target metric
- template_params: Dict (optional)
If given, used as the template hyperparameters against which to compare candidate params, rather than the standard guideline template of the “model_init_params” in
model_params
. This is used by does_match_init_params_guidelines_multi()
- Returns
- Boolean
True if candidate params match model_init_params guidelines. Else, False
Notes
Template hyperparameters are generally considered “guidelines” if they are declared as concrete values, rather than space choices present in
space
-
does_match_init_params_guidelines_multi
(self, exp_id, params, score, location) → bool¶ Check candidate compatibility with model_init_params template guidelines when a guideline hyperparameter is directly affected by another hyperparameter that is given as a space choice
- Parameters
- exp_id: String
Candidate Experiment ID
- params: Dict
Candidate “model_init_params” to compare to the template in
model_params
- score: Number
Value of the candidate Experiment’s target metric
- location: Tuple
Location of the hyperparameter space choice that affects the acceptable guideline values of a particular hyperparameter. In other words, this is the path of a hyperparameter, which, if changed, would change the expected default value of another hyperparameter
- Returns
- Boolean
True if candidate params match model_init_params guidelines. Else, False
Notes
This is used for Keras Experiments when the optimizer value in a model’s compile_params is given as a hyperparameter space choice. Each possible value of optimizer prescribes different default values for the optimizer_params argument, so special measures need to be taken to ensure the correct Experiments are declared to fit within the constraints
-
does_match_extra_params_space
(self, exp_id, params, score) → bool¶ Check candidate compatibility with model_extra_params template space choices
- Parameters
- exp_id: String
Candidate Experiment ID
- params: Dict
Candidate “model_extra_params” to compare to the template in
model_params
- score: Number
Value of the candidate Experiment’s target metric
- Returns
- Boolean
True if candidate params fit in model_extra_params space choices. Else, False
-
does_match_extra_params_guidelines
(self, exp_id, params, score) → bool¶ Check candidate guideline compatibility with model_extra_params template
- Parameters
- exp_id: String
Candidate Experiment ID
- params: Dict
Candidate “model_extra_params” to compare to the template in
model_params
- score: Number
Value of the candidate Experiment’s target metric
- Returns
- Boolean
True if candidate params match model_extra_params guidelines. Else, False
-
class
hyperparameter_hunter.result_reader.
KerasResultFinder
(algorithm_name, module_name, cross_experiment_key, target_metric, space, leaderboard_path, descriptions_dir, model_params, sort=None)¶ Bases:
hyperparameter_hunter.result_reader.ResultFinder
ResultFinder for locating saved Keras Experiments compatible with the given constraints
- Parameters
- algorithm_name: String
Name of the algorithm whose hyperparameters are being optimized
- module_name: String
Name of the module from whence the algorithm being used came
- cross_experiment_key: String
hyperparameter_hunter.environment.Environment.cross_experiment_key
produced by the current Environment
- target_metric: Tuple
Path denoting the metric to be used. The first value should be one of {“oof”, “holdout”, “in_fold”}, and the second value should be the name of a metric supplied in
hyperparameter_hunter.environment.Environment.metrics_params
- space: Space
Instance of
Space
, defining hyperparameter search space constraints
- leaderboard_path: String
Path to a leaderboard file, whose listed Experiments will be tested for compatibility
- descriptions_dir: String
Path to a directory containing the description files of saved Experiments
- model_params: Dict
Concrete hyperparameters for the model. Common keys include “model_init_params” and “model_extra_params”, both of which can be pointers to dicts of hyperparameters. Additionally, “feature_engineer” may be included with an instance of
FeatureEngineer
- sort: {“target_asc”, “target_desc”, “chronological”, “reverse_chronological”}, or int
(Experimental) How to sort the experiment results that fit within the given constraints
“target_asc”: Sort from experiments with the lowest value for target_metric to those with the greatest
“target_desc”: Sort from experiments with the highest value for target_metric to those with the lowest
“chronological”: Sort from oldest experiments to newest
“reverse_chronological”: Sort from newest experiments to oldest
int: Random seed with which to shuffle experiments
- Attributes
experiment_ids
Experiment IDs in the target Leaderboard that match
algorithm_name
and cross_experiment_key
mini_spaces
Separate
space
into subspaces based on model_params
keys
Methods
does_match_extra_params_guidelines
(self, …)Check candidate guideline compatibility with model_extra_params template
does_match_extra_params_space
(self, exp_id, …)Check candidate compatibility with model_extra_params template space choices
does_match_feature_engineer
(self, exp_id, …)Check candidate compatibility with feature_engineer template guidelines and space choices.
does_match_init_params_guidelines
(self, …)Check candidate compatibility with model_init_params template guidelines
does_match_init_params_guidelines_multi
(…)Check candidate compatibility with model_init_params template guidelines when a guideline hyperparameter is directly affected by another hyperparameter that is given as a space choice
does_match_init_params_space
(self, exp_id, …)Check candidate compatibility with model_init_params template space choices
find
(self)Execute full result-finding workflow, populating
similar_experiments
-
hyperparameter_hunter.result_reader.
has_experiment_result_file
(results_dir, experiment_id, result_type=None)¶ Check if the specified result files exist in results_dir for Experiment experiment_id
- Parameters
- results_dir: String
HyperparameterHunterAssets directory in which to search for Experiment result files
- experiment_id: String, or BaseExperiment
ID of the Experiment whose result files should be searched for in results_dir. If not string, should be an instance of a descendant of
BaseExperiment
with an “experiment_id” attribute
- result_type: List, or string (optional)
Result file types for which to check. Valid values include any subdirectory name that can be included in “HyperparameterHunterAssets/Experiments” by default: [“Descriptions”, “Heartbeats”, “PredictionsOOF”, “PredictionsHoldout”, “PredictionsTest”, “ScriptBackups”]. If string, should be one of the aforementioned strings, or “ALL” to use all of the result types. If list, should be a subset of the aforementioned list of valid values. Else, default is [“Descriptions”, “Heartbeats”, “PredictionsOOF”, “ScriptBackups”]. The returned boolean signifies whether ALL of the result_type files were found, not whether ANY were found
- Returns
- Boolean
True if all result files specified by result_type exist in results_dir for the Experiment specified by experiment_id. Else, False
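For instance, a hedged sketch (the results directory and Experiment ID below are hypothetical):

from hyperparameter_hunter.result_reader import has_experiment_result_file

# True only if BOTH the description and OOF prediction files exist for this Experiment
has_experiment_result_file(
    "HyperparameterHunterAssets",
    "0123abcd-4567-89ab-cdef-0123456789ab",
    result_type=["Descriptions", "PredictionsOOF"],
)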
hyperparameter_hunter.sentinels module¶
This module defines Sentinel objects that are used to represent data that is not yet available.
For example, hyperparameter_hunter.sentinels.DatasetSentinel
is used in
hyperparameter_hunter.environment.Environment
to enable a user to pass the fold validation
dataset as an argument on Experiment initialization. At the point that the sentinel is provided, the
training dataset has not yet been split into folds, which is why the Sentinel is necessary
Related¶
hyperparameter_hunter.environment
hyperparameter_hunter.environment.Environment
has the following properties that utilize hyperparameter_hunter.sentinels.DatasetSentinel
: [train_input, train_target, validation_input, validation_target, holdout_input, holdout_target]. These properties can be passed as arguments to Experiment or OptimizationProtocol initialization in order to provide the dataset to a Model’s fit call, for example
hyperparameter_hunter.experiments
This is one of the points at which one might want to use the Sentinels exposed by
hyperparameter_hunter.environment.Environment
, specifically as values in the model_init_params and model_extra_params arguments to a descendant of hyperparameter_hunter.experiments.BaseExperiment
hyperparameter_hunter.optimization.protocol_core
This is a second point at which one might use the Sentinels exposed by
hyperparameter_hunter.environment.Environment
. In this case, they could be provided as values in the model_init_params and model_extra_params arguments in a call to hyperparameter_hunter.optimization.protocol_core.BaseOptPro.forge_experiment()
, the structure of which intentionally mirrors that of hyperparameter_hunter.experiments.BaseExperiment.__init__()
hyperparameter_hunter.models
This is ultimately where Sentinel instances will be converted to the actual values that they represent via calls to
hyperparameter_hunter.sentinels.locate_sentinels()
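As a hedged, end-to-end sketch of this pattern (the dataset, model, and parameter values are illustrative, and xgboost is assumed to be installed):

import pandas as pd
from sklearn.datasets import load_breast_cancer
from xgboost import XGBClassifier
from hyperparameter_hunter import CVExperiment, Environment

# Build a small training DataFrame using the default "target" column name
data = load_breast_cancer()
train_df = pd.DataFrame(data.data, columns=data.feature_names)
train_df["target"] = data.target

env = Environment(
    train_dataset=train_df,
    results_path="HyperparameterHunterAssets",
    metrics=["roc_auc_score"],
    cv_type="KFold",
    cv_params=dict(n_splits=3),
)

# `env.validation_input` and `env.validation_target` are DatasetSentinels here; they
# are resolved to the current fold's validation data only after CV splitting occurs
experiment = CVExperiment(
    model_initializer=XGBClassifier,
    model_init_params=dict(n_estimators=100, subsample=0.5),
    model_extra_params=dict(
        fit=dict(eval_set=[(env.validation_input, env.validation_target)])
    ),
)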
-
class
hyperparameter_hunter.sentinels.
Sentinel
(*args, **kwargs)¶ Bases:
object
Base class for Sentinels representing data that is not yet available. Subclasses should call super().__init__() at the end of their __init__ methods
- Attributes
sentinel
Retrieve
Sentinel._sentinel
Methods
retrieve_by_sentinel
(self)Retrieve the actual object represented by the sentinel
-
property
sentinel
¶ Retrieve
Sentinel._sentinel
- Returns
- Str
The value of
Sentinel._sentinel
-
abstract
retrieve_by_sentinel
(self) → object¶ Retrieve the actual object represented by the sentinel
- Returns
- object
The object for which the sentinel was being used as a placeholder
-
hyperparameter_hunter.sentinels.
locate_sentinels
(parameters)¶ Produce a mirrored parameters dict, wherein Sentinel values are converted to the objects they represent
- Parameters
- parameters: Dict
Dict of parameters, which may contain nested Sentinel values
- Returns
- Dict
Mirror of parameters, except where a Sentinel was found, the value it represents is returned instead
-
class
hyperparameter_hunter.sentinels.
DatasetSentinel
(dataset_type, dataset_hash, cv_type=None, global_random_seed=None, random_seeds=None)¶ Bases:
hyperparameter_hunter.sentinels.Sentinel
Class to create sentinels representing dataset input/target values
- Parameters
- dataset_type: Str
Dataset type, suffixed with ‘_input’, or ‘_target’, for which a sentinel should be created. Acceptable values are as follows: [‘train_input’, ‘train_target’, ‘validation_input’, ‘validation_target’, ‘holdout_input’, ‘holdout_target’]
- dataset_hash: Str
The hash of the dataset for which a sentinel should be created that was generated while creating
hyperparameter_hunter.environment.Environment.cross_experiment_key
- cv_type: Str, or None, default=None
If None, dataset_type should be one of [‘holdout_input’, ‘holdout_target’]. Else, should be a string that is one of the following: 1) a string attribute of sklearn.model_selection._split, or 2) a hash produced while creating
hyperparameter_hunter.environment.Environment.cross_experiment_key
- global_random_seed: Int, or None, default=None
If None, dataset_type should be one of [‘holdout_input’, ‘holdout_target’]. If int, should be
hyperparameter_hunter.environment.Environment.global_random_seed
- random_seeds: List, or None, default=None
If None, dataset_type should be one of [‘holdout_input’, ‘holdout_target’]. If list, should be
hyperparameter_hunter.environment.Environment.random_seeds
- Attributes
sentinel
Retrieve
Sentinel._sentinel
Methods
retrieve_by_sentinel
(self)Retrieve the actual dataset represented by the sentinel
-
retrieve_by_sentinel
(self)¶ Retrieve the actual dataset represented by the sentinel
- Returns
- object
The dataset for which the sentinel was being used as a placeholder
hyperparameter_hunter.settings module¶
This module is the doorway for other modules to access the information set by the active
hyperparameter_hunter.environment.Environment
, and to access the appropriate logging
methods. Specifically, other modules will most often use hyperparameter_hunter.settings.G
to access the aforementioned information. Additionally, this module defines several variables to
assist in navigating the ‘HyperparameterHunterAssets’ directory structure
Related¶
hyperparameter_hunter.environment
This module sets
hyperparameter_hunter.settings.G.Env
to the active Environment instance, creating the primary gateway used by other modules to access the active Environment’s information
-
class
hyperparameter_hunter.settings.
G
¶ Bases:
object
This class defines global attributes that are set upon instantiation of
environment.Environment
. All attributes contained herein are class variables (not instance variables) because the expectation is for the attributes of this class to be set only once, then referenced by operations that may be executed after instantiating an environment.Environment
. This allows functions to be called or classes to be initialized without passing a reference to the currently active Environment, because they check the attributes of this class instead
- Attributes
- Env: None
This is set to “self” in
environment.Environment.__init__()
. This fact allows other modules to check if settings.G.Env
is None. If None, an environment.Environment
has not yet been instantiated. If not None, any attributes or methods of the instantiated Env may be called
- save_transformed_predictions: False
Declares format in which a model’s predictions should be saved, with regard to
feature_engineering.FeatureEngineer
transformations. If no transformation of the target variable takes place (either through feature_engineering.FeatureEngineer
, feature_engineering.EngineerStep
, or otherwise), then this setting can be ignored.

If save_transformed_predictions is True, and target transformation does occur, then experiment predictions are saved in the same form as the transformed target, which is the form returned directly by a fitted model’s predict method. For example, if target data is label-encoded, and an
feature_engineering.EngineerStep
is used to one-hot encode the target, then one-hot-encoded predictions will be saved.

Conversely, if save_transformed_predictions is False (default), and target transformation does occur, then experiment predictions are saved in the inverted form of the transformed target, which is the same form as the original target data. Continuing the example of label-encoded target data, and an
feature_engineering.EngineerStep
to one-hot encode the target, in this case, label-encoded predictions will be saved.
- priority_callbacks: Tuple
Intended for internal use only. The contents of this tuple are inserted at the front of an Experiment’s list of callback bases via
experiment_core.ExperimentMeta
, ahead of even the Experiment’s original base classes. This is used primarily for testing callbacks, but it can also be used if you absolutely need a callback to be placed before the Experiment’s other ancestors in its MRO
- log_: print
…
- debug_: print
…
- warn_: print
…
- import_hooks: List
…
- sentinel_registry: List
…
Methods
debug
(content, *args, **kwargs)Set in
environment.Environment.initialize_reporting()
to the updated version of reporting.ReportingHandler.debug()
debug_
(value, …[, sep, end, file, flush])Prints the values to a stream, or to sys.stdout by default.
log
(content, *args, **kwargs)Set in
environment.Environment.initialize_reporting()
to the updated version of reporting.ReportingHandler.log()
log_
(value, …[, sep, end, file, flush])Prints the values to a stream, or to sys.stdout by default.
reset_attributes
()Return the attributes of
settings.G
to their original values
warn
(content, *args, **kwargs)Set in
environment.Environment.initialize_reporting()
to the updated version of reporting.ReportingHandler.warn()
warn_
()Issue a warning, or maybe ignore it or raise an exception.
-
Env
= None¶
-
save_transformed_predictions
= False¶
-
priority_callbacks
= ()¶
-
static
log
(content, *args, **kwargs)¶ Set in
environment.Environment.initialize_reporting()
to the updated version of reporting.ReportingHandler.log()
-
static
debug
(content, *args, **kwargs)¶ Set in
environment.Environment.initialize_reporting()
to the updated version of reporting.ReportingHandler.debug()
-
static
warn
(content, *args, **kwargs)¶ Set in
environment.Environment.initialize_reporting()
to the updated version of reporting.ReportingHandler.warn()
-
log_
(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)¶ Prints the values to a stream, or to sys.stdout by default. Optional keyword arguments: file: a file-like object (stream); defaults to the current sys.stdout. sep: string inserted between values, default a space. end: string appended after the last value, default a newline. flush: whether to forcibly flush the stream.
-
debug_
(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)¶ Prints the values to a stream, or to sys.stdout by default. Optional keyword arguments: file: a file-like object (stream); defaults to the current sys.stdout. sep: string inserted between values, default a space. end: string appended after the last value, default a newline. flush: whether to forcibly flush the stream.
-
warn_
()¶ Issue a warning, or maybe ignore it or raise an exception.
-
import_hooks
= ['keras_layer', 'keras_initializer', 'keras_variance_scaling']¶
-
sentinel_registry
= []¶
-
classmethod
reset_attributes
()¶ Return the attributes of
settings.G
to their original values
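A hedged sketch of the consumer pattern described above (the guard and message are illustrative):

from hyperparameter_hunter.settings import G

# Modules check `G.Env` instead of receiving an Environment reference directly
if G.Env is None:
    raise RuntimeError("Activate an Environment before running Experiments")

# `G.log` is rebound during `Environment.initialize_reporting()`, so messages are
# routed through the active ReportingHandler
G.log("Environment is active; proceeding")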
hyperparameter_hunter.tracers module¶
This module defines metaclasses used to trace the parameters passed through operation-critical classes that are members of other libraries. These are only used in cases where it is impractical or impossible to effectively retrieve the arguments explicitly provided by a user, as well as the default arguments for the classes being traced. Generally, tracer metaclasses aim to add attributes to the traced class that collect its default values and explicitly provided arguments at class creation and on instance calls
Related¶
hyperparameter_hunter.importer
This module handles the interception of certain imports in order to inject the tracer metaclasses defined in
hyperparameter_hunter.tracers
into the inheritance structure of objects that need to be traced
-
class
hyperparameter_hunter.tracers.
ArgumentTracer
¶ Bases:
type
Metaclass to trace the default arguments and explicitly provided arguments of its descendants. It also has special provisions for instantiating dummy models if directed to
Methods
__call__
(cls, *args, **kwargs)Call self as a function.
mro
()Return a type’s method resolution order
-
class
hyperparameter_hunter.tracers.
LocationTracer
¶ Bases:
hyperparameter_hunter.tracers.ArgumentTracer
Metaclass to trace the origin of the call to initialize the descending class
Methods
__call__
(cls, *args, **kwargs)Call self as a function.
mro
()Return a type’s method resolution order
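As a rough sketch of the mechanism (the traced class below is hypothetical, and exactly which attributes ArgumentTracer records is an implementation detail of this module):

from hyperparameter_hunter.tracers import ArgumentTracer

class TracedModel(metaclass=ArgumentTracer):
    """Hypothetical stand-in for a third-party class that needs tracing"""

    def __init__(self, alpha=1.0, beta=None):
        self.alpha = alpha
        self.beta = beta

# On instantiation, the metaclass's __call__ intercepts the call, so both the
# declared defaults and the explicitly provided arguments can be recorded
model = TracedModel(alpha=0.5)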
Module contents¶
-
class
hyperparameter_hunter.
Environment
(train_dataset, environment_params_path=None, *, results_path=None, metrics=None, holdout_dataset=None, test_dataset=None, target_column=None, id_column=None, do_predict_proba=None, prediction_formatter=None, metrics_params=None, cv_type=None, runs=None, global_random_seed=None, random_seeds=None, random_seed_bounds=None, cv_params=None, verbose=None, file_blacklist=None, reporting_params=None, to_csv_params=None, do_full_save=None, experiment_callbacks=None, experiment_recorders=None, save_transformed_metrics=None)¶ Bases:
object
Class to organize the parameters that allow Experiments/OptPros to be fairly compared
Environment is the collective starting point for all of HyperparameterHunter’s biggest and best toys: Experiments and OptimizationProtocols. Without an Environment, neither of these will work.
The Environment is where we declare all the parameters that transcend traditional “hyperparameters”. It houses the stuff without which machine learning can’t even really start. Specifically, Environment cares about 1) the data used for fitting/predicting, 2) the cross-validation scheme used to split the data and fit models, and 3) how to evaluate the predictions made on that data. There are plenty of other goodies documented below, but the absolutely mission-critical parameters concerned with the above tasks are train_dataset, cv_type, cv_params, and metrics. Additionally, it’s important to provide results_path, so Experiment/OptPro results can be saved, which is kind of what HyperparameterHunter is all about
- Parameters
- train_dataset: Pandas.DataFrame, or str path
The training data for the experiment. Will be split into train/holdout data, if applicable, and train/validation data if cross-validation is to be performed. If str, will attempt to read file at path via
pandas.read_csv()
. For more information on which columns will be used during fitting/predicting, see the “Dataset columns” note in the “Notes” section below- environment_params_path: String path, or None, default=None
If not None and is valid .json filepath containing an object (dict), the file’s contents are treated as the default values for all keys that match any of the below kwargs used to initialize
Environment
- results_path: String path, or None, default=None
If valid directory path and the results directory has not yet been created, it will be created here. If this does not end with <ASSETS_DIRNAME>, it will be appended. If <ASSETS_DIRNAME> already exists at this path, new results will also be stored here. If None or invalid, results will not be stored
- metrics: Dict, List, or None, default=None
Iterable describing the metrics to be recorded, along with a means to compute the value of each metric. Should be of one of the two following forms:
List Form:
[“<metric name>”, “<metric name>”, …]: Where each value is a string that names an attribute in
sklearn.metrics
[Metric, Metric, …]: Where each value of the list is an instance of
metrics.Metric
[(<name>, <metric_function>, [<direction>]), (<*args>), …]: Where each value of the list is a tuple of arguments that will be used to instantiate a
metrics.Metric
. Arguments given in tuples must be in order expected by metrics.Metric
: (name, metric_function, direction)
Dict Form:
{“<metric name>”: <metric_function>, …}: Where each key is a name for the corresponding metric callable, which is used to compute the value of the metric
{“<metric name>”: (<metric_function>, <direction>), …}: Where each key is a name for the corresponding metric callable and direction, all of which are used to instantiate a
metrics.Metric
{“<metric name>”: “<sklearn metric name>”, …}: Where each key is a name for the metric, and each value is the name of the attribute in
sklearn.metrics
for which the corresponding key is an alias
{“<metric name>”: None, …}: Where each key is the name of the attribute in
sklearn.metrics
{“<metric name>”: Metric, …}: Where each key names an instance of
metrics.Metric
. This is the internally-used format to which all other formats will be converted
Metric callable functions should expect inputs of form (target, prediction), and should return floats. For a concrete metrics declaration, see the sketch under “Examples” below. See the documentation of
metrics.Metric
for information regarding expected parameters and types- holdout_dataset: Pandas.DataFrame, callable, str path, or None, default=None
If pd.DataFrame, this is the holdout dataset. If callable, expects a function that takes (self.train: DataFrame, self.target_column: str) as input and returns the new (self.train: DataFrame, self.holdout: DataFrame). If str, will attempt to read file at path via
pandas.read_csv()
. Else, there is no holdout set. For more information on which columns will be used during fitting/predicting, see the “Dataset columns” note in the “Notes” section below- test_dataset: Pandas.DataFrame, str path, or None, default=None
The testing data for the experiment. Structure should be identical to that of train_dataset, except its target_column column can be empty or non-existent, because test_dataset predictions will never be evaluated. If str, will attempt to read file at path via
pandas.read_csv()
. For more information on which columns will be used during fitting/predicting, see the “Dataset columns” note in the “Notes” section below- target_column: Str, or list, default=’target’
If str, denotes the column name in all provided datasets (except test) that contains the target output. If list, should be a list of strs designating multiple target columns. For example, in a multi-class classification dataset like UCI’s hand-written digits, target_column would be a list containing ten strings. In this example, the target_column data would be sparse, with a 1 to signify that a sample is a written example of a digit (0-9). For a working example, see ‘hyperparameter_hunter/examples/lib_keras_multi_classification_example.py’
- id_column: Str, or None, default=None
If not None, str denoting the column name in all provided datasets containing sample IDs
- do_predict_proba: Boolean, or int, default=False
If False,
models.Model.fit()
will call models.Model.model.predict()
If True, it will call
models.Model.model.predict_proba()
, and the values in all columns will be used as the actual prediction values

If do_predict_proba is an int,
models.Model.fit()
will call models.Model.model.predict_proba()
, as is the case when do_predict_proba is True, but the int supplied as do_predict_proba declares the column index to use as the actual prediction values

For example, for a model to call the predict method, do_predict_proba=False (default). For a model to call the predict_proba method, and use all of the class probabilities, do_predict_proba=True. To call the predict_proba method, and use the class probabilities in the first column, do_predict_proba=0. To use the second column (index 1) of the result, do_predict_proba=1 - this often corresponds to the positive class’s probabilities in binary classification problems. To use the third column, do_predict_proba=2, and so on
- prediction_formatter: Callable, or None, default=None
If callable, expected to have same signature as
utils.result_utils.format_predictions()
. That is, the callable will receive (raw_predictions: np.array, dataset_df: pd.DataFrame, target_column: str, id_column: str or None) as input and should return a properly formatted prediction DataFrame. The callable uses raw_predictions as the content, dataset_df to provide any id column, and target_column to identify the column in which to place raw_predictions- metrics_params: Dict, or None, default=dict()
Dictionary of extra parameters to provide to
metrics.ScoringMixIn.__init__()
. metrics must be provided either 1) as an input kwarg to Environment.__init__()
(see metrics), or 2) as a key in metrics_params, but not both. An Exception will be raised if both are given, or if neither is given- cv_type: Class or str, default=’KFold’
The class to define cross-validation splits. If str, it must be an attribute of sklearn.model_selection._split, and it must be a cross-validation class that inherits one of the following sklearn classes: BaseCrossValidator, or _RepeatedSplits. Valid str values include ‘KFold’, and ‘RepeatedKFold’, although there are many more. It must implement the following methods: [__init__, split]. If using a custom class, see the following tested sklearn classes for proper implementations: [KFold, StratifiedKFold, RepeatedKFold, RepeatedStratifiedKFold]. The arguments provided to
cv_type.__init__()
will be Environment.cv_params
, which should include the following: [‘n_splits’ <int>, ‘n_repeats’ <int> (if applicable)].
cv_type.split()
will receive the following arguments: [BaseExperiment.train_input_data
, BaseExperiment.train_target_data
]
- runs: Int, default=1
The number of times to fit a model within each fold to perform multiple-run-averaging with different random seeds
- global_random_seed: Int, default=32
The initial random seed used just before generating an Experiment’s random_seeds. This ensures consistency for random_seeds between Experiments, without having to explicitly provide it here
- random_seeds: None, or List, default=None
If None, random_seeds of the appropriate shape will be created automatically. Else, must be a list of ints of shape (cv_params[‘n_repeats’], cv_params[‘n_splits’], runs). If cv_params does not have the key n_repeats (because standard cross-validation is being used), the value will default to 1. See
experiments.BaseExperiment._random_seed_initializer()
for info on expected shape
- random_seed_bounds: List, default=[0, 100000]
A list containing two integers: the lower and upper bounds, respectively, for generating an Experiment’s random seeds in
experiments.BaseExperiment._random_seed_initializer()
. Generally, leave this kwarg alone
- cv_params: dict, or None, default=dict()
Parameters provided upon initialization of cv_type. Keys may be any args accepted by
cv_type.__init__()
. Number of fold splits must be provided via “n_splits”, and number of repeats (if applicable for cv_type) must be provided via “n_repeats”- verbose: Int, boolean, default=3
Verbosity of printing for any experiments performed while this Environment is active
Higher values indicate more frequent logging. Logs are still recorded in the heartbeat file regardless of verbosity level. verbose only dictates which logs are visible in the console. The following table illustrates which types of logging messages will be visible with each verbosity level:
| Verbosity | Keys/IDs | Final Score | Repetitions* | Folds | Runs* | Run Starts* | Result Files | Other |
|:---------:|:--------:|:-----------:|:------------:|:-----:|:-----:|:-----------:|:------------:|:-----:|
|     0     |          |             |              |       |       |             |              |       |
|     1     |   Yes    |     Yes     |              |       |       |             |              |       |
|     2     |   Yes    |     Yes     |     Yes      |  Yes  |       |             |              |       |
|     3     |   Yes    |     Yes     |     Yes      |  Yes  |  Yes  |             |              |       |
|     4     |   Yes    |     Yes     |     Yes      |  Yes  |  Yes  |     Yes     |     Yes      |  Yes  |
*: If such logging is deemed appropriate with the given cross-validation parameters. In other words, repetition/run logging will only be verbose if Environment was given more than one repetition/run, respectively
- file_blacklist: List of str, or None, or ‘ALL’, default=None
If list of str, the result files named within are not saved to their respective directory in “<ASSETS_DIRNAME>/Experiments”. If None, all result files are saved. If ‘ALL’, nothing at all will be saved for the Experiments. If the path of the file that initializes an Experiment does not end with a “.py” extension, the Experiment proceeds as if “script_backup” had been added to file_blacklist. This means that backup files will not be created for Jupyter notebooks (or any other non-“.py” files). For info on acceptable values, see
validate_file_blacklist()
- reporting_params: Dict, default=dict()
Parameters passed to initialize
reporting.ReportingHandler
- to_csv_params: Dict, default=dict()
Parameters passed to the calls to
pandas.frame.DataFrame.to_csv()
inrecorders
. In particular, this is where an Experiment’s final prediction files are saved, so the values here will affect the format of the .csv prediction files. Warning: If to_csv_params contains the key “path_or_buf”, it will be removed. Otherwise, all items are supplied directly to to_csv()
, including kwargs it might not be expecting if they are given
- do_full_save: None, or callable, default=utils.result_utils.default_do_full_save
If callable, expected to take an Experiment’s result description dict as input and return a boolean. If None, treated as a callable that returns True. This parameter is used by
recorders.DescriptionRecorder
to determine whether the Experiment result files following the description should also be created. If do_full_save returns False, result file-saving is stopped early, and only the description is saved. If do_full_save returns True, all files not in file_blacklist are saved normally. This allows you to skip creation of an Experiment’s predictions, logs, and heartbeats if its score does not meet some threshold you set, for example. do_full_save receives the Experiment description dict as input, so for help setting do_full_save, just look into one of your Experiment descriptions- experiment_callbacks: `LambdaCallback`, or list of `LambdaCallback` (optional)
Callbacks injected directly into Experiments, adding new functionality, or customizing existing processes. Should be a
LambdaCallback
or a list of such classes. LambdaCallback can be created using callbacks.bases.lambda_callback()
, which documents the options for creating callbacks. experiment_callbacks will be added to the MRO of the executed Experiment class by experiment_core.ExperimentMeta
at __call__ time, making experiment_callbacks new base classes of the Experiment. See callbacks.bases.lambda_callback()
for more information. Note that the Experiments conducted by OptPros will still benefit from experiment_callbacks. The presence of LambdaCallbacks will affect neither Environment keys, nor Experiment keys. In other words, for the purposes of Experiment matching/recording, all other factors being equal, an Experiment with experiment_callbacks is considered identical to an Experiment without, despite whatever custom functionality was added by the LambdaCallbacks- experiment_recorders: List, None, default=None
If not None, may be a list whose values are tuples of (<
recorders.BaseRecorder
descendant>, <str result_path>). The result_path str should be a path relative to results_path that specifies the directory/file in which the product of the custom recorder should be saved. The contents of experiment_recorders will be provided to recorders.RecorderList upon completion of an Experiment, and, if the subclassing documentation in recorders is followed properly, will create or update a result file for the just-executed Experiment- save_transformed_metrics: Boolean (optional)
Declares manner in which a model’s predictions should be evaluated through the provided metrics, with regard to target data transformations. This setting can be ignored if no transformation of the target variable takes place (either through
FeatureEngineer
, EngineerStep
, or otherwise).

The default value of save_transformed_metrics depends on the dtype of the target data in train_dataset. If all target columns are numeric, save_transformed_metrics=False, meaning metric evaluation should use the original/inverted targets and predictions. Else, if any target column is non-numeric, save_transformed_metrics=True, meaning evaluation should use the transformed targets and predictions, because most metrics require numeric inputs. This is described further in save_transformed_metrics. A more descriptive name for this may be “calculate_metrics_using_transformed_predictions”, but that’s a bit verbose, even by my standards
- Other Parameters
- cross_validation_type: …
Alias for cv_type
- cross_validation_params: …
Alias for cv_params
- metrics_map: …
Alias for metrics
- reporting_handler_params: …
Alias for reporting_params
- root_results_path: …
Alias for results_path
Notes
Dataset columns: In order to specify the columns to be used by the three dataset kwargs (train_dataset, holdout_dataset, test_dataset) during fitting and predicting, a few attributes can be used. On Environment initialization, the columns specified by the following kwargs will be separated from the rest of the dataset during training/predicting: 1) target_column, which names the column containing the target output labels for the input data; and 2) id_column, which (if given) represents the name of the column that contains identifying information for each data sample, and should otherwise have no relation to the actual data. Additionally, the feature_selector kwarg of the descendants of
hyperparameter_hunter.experiments.BaseExperiment
(like hyperparameter_hunter.experiments.CVExperiment
) is used to filter out columns of the given datasets prior to fitting. See its documentation for more information, but it can effectively be used to remove any columns from the datasets

Overriding default kwargs at environment_params_path: If you have any of the above kwargs specified in the .json file at environment_params_path (except environment_params_path, which will be ignored), you can override its value by passing it as a kwarg when initializing
Environment
. The contents at environment_params_path are only used when the matching kwarg supplied at initialization is None. See “/examples/environment_params_path_example.py” for details

The order of precedence for determining the value of each parameter is as follows, with items at the top having the highest priority, and deferring only to the items below if their own value is None:
1) kwargs passed directly to
Environment.__init__()
on initialization,
2) keys of the file at environment_params_path (if valid .json object),
3) keys of
hyperparameter_hunter.environment.Environment.DEFAULT_PARAMS
do_predict_proba: Because this parameter can be either a boolean or an integer, it is important to explicitly pass booleans rather than truthy or falsey values. Similarly, only pass integers if you intend for the value to be used as a column index. Do not pass 0 to mean False, or 1 to mean True
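Examples

A minimal sketch of declaring an Environment (the dataset and metric choices are illustrative, not prescribed by this API):

import pandas as pd
from sklearn.datasets import load_breast_cancer
from hyperparameter_hunter import Environment

data = load_breast_cancer()
train_df = pd.DataFrame(data.data, columns=data.feature_names)
train_df["target"] = data.target  # matches the default `target_column`

env = Environment(
    train_dataset=train_df,
    results_path="HyperparameterHunterAssets",
    # Dict form of `metrics`: each value names an attribute of `sklearn.metrics`
    metrics=dict(roc_auc="roc_auc_score"),
    cv_type="StratifiedKFold",
    cv_params=dict(n_splits=5, shuffle=True, random_state=32),
)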
- Attributes
- train_input: DatasetSentinel
Sentinel replaced with current train input data during Model fitting/predicting. Commonly given in the model_extra_params kwargs of
hyperparameter_hunter.experiments.BaseExperiment
orhyperparameter_hunter.optimization.protocol_core.BaseOptPro.forge_experiment()
for eval_set-like hyperparameters. Importantly, the actual value of this Sentinel is determined after performing cross-validation data splitting, and after executing FeatureEngineer
- train_target: DatasetSentinel
Like
train_input
, except for current train target data- validation_input: DatasetSentinel
Like
train_input
, except for current validation input data- validation_target: DatasetSentinel
Like
train_input
, except for current validation target data- holdout_input: DatasetSentinel
Like
train_input
, except for current holdout input data- holdout_target: DatasetSentinel
Like
train_input
, except for current holdout target data
Methods
environment_workflow
(self)Execute all methods required to validate the environment and run Experiments
format_result_paths
(self)Remove paths contained in file_blacklist, and format others to prepare for saving results
generate_cross_experiment_key
(self)Generate a key to describe the current Environment’s cross-experiment parameters
initialize_reporting
(self)Initialize reporting for the Environment and Experiments conducted during its lifetime
update_custom_environment_params
(self)Try to update null parameters from environment_params_path, or DEFAULT_PARAMS
validate_parameters
(self)Ensure the provided parameters are valid and properly formatted
-
DEFAULT_PARAMS
= {'cv_params': {}, 'cv_type': 'KFold', 'do_full_save': <function default_do_full_save>, 'do_predict_proba': False, 'environment_params_path': None, 'file_blacklist': None, 'global_random_seed': 32, 'id_column': None, 'metrics': None, 'metrics_params': {}, 'prediction_formatter': <function format_predictions>, 'random_seed_bounds': [0, 100000], 'random_seeds': None, 'reporting_params': {'console_params': None, 'float_format': '{:.5f}', 'heartbeat_params': None, 'heartbeat_path': None}, 'results_path': None, 'runs': 1, 'save_transformed_metrics': None, 'target_column': 'target', 'to_csv_params': {}, 'verbose': 3}¶
-
property
results_path
¶
-
property
target_column
¶
-
property
train_dataset
¶
-
property
test_dataset
¶
-
property
holdout_dataset
¶
-
property
file_blacklist
¶
-
property
cv_type
¶
-
property
to_csv_params
¶
-
property
cross_experiment_params
¶
-
property
experiment_callbacks
¶
-
property
save_transformed_metrics
¶ If save_transformed_metrics is True, and target transformation does occur, then experiment metrics are calculated using the transformed targets and predictions, which is the form returned directly by a fitted model’s predict method. For example, if target data is label-encoded, and an
feature_engineering.EngineerStep
is used to one-hot encode the target, then metrics functions will receive the following as input: (one-hot-encoded targets, one-hot-encoded predictions).

Conversely, if save_transformed_metrics is False, and target transformation does occur, then experiment metrics are calculated using the inverse of the transformed targets and predictions, which is the same form as the original target data. Continuing the example of label-encoded target data, and an
feature_engineering.EngineerStep
to one-hot encode the target, in this case, metrics functions will receive the following as input: (label-encoded targets, label-encoded predictions)
-
environment_workflow
(self)¶ Execute all methods required to validate the environment and run Experiments
-
validate_parameters
(self)¶ Ensure the provided parameters are valid and properly formatted
-
format_result_paths
(self)¶ Remove paths contained in file_blacklist, and format others to prepare for saving results
-
update_custom_environment_params
(self)¶ Try to update null parameters from environment_params_path, or DEFAULT_PARAMS
-
generate_cross_experiment_key
(self)¶ Generate a key to describe the current Environment’s cross-experiment parameters
-
initialize_reporting
(self)¶ Initialize reporting for the Environment and Experiments conducted during its lifetime
-
property
train_input
¶ Get a DatasetSentinel representing an Experiment’s fold_train_input
- Returns
- DatasetSentinel:
A Sentinel that will be converted to
hyperparameter_hunter.experiments.BaseExperiment.fold_train_input
upon Model initialization
-
property
train_target
¶ Get a DatasetSentinel representing an Experiment’s fold_train_target
- Returns
- DatasetSentinel:
A Sentinel that will be converted to
hyperparameter_hunter.experiments.BaseExperiment.fold_train_target
upon Model initialization
-
property
validation_input
¶ Get a DatasetSentinel representing an Experiment’s fold_validation_input
- Returns
- DatasetSentinel:
A Sentinel that will be converted to
hyperparameter_hunter.experiments.BaseExperiment.fold_validation_input
upon Model initialization
-
property
validation_target
¶ Get a DatasetSentinel representing an Experiment’s fold_validation_target
- Returns
- DatasetSentinel:
A Sentinel that will be converted to
hyperparameter_hunter.experiments.BaseExperiment.fold_validation_target
upon Model initialization
-
property
holdout_input
¶ Get a DatasetSentinel representing an Experiment’s holdout_input_data
- Returns
- DatasetSentinel:
A Sentinel that will be converted to
hyperparameter_hunter.experiments.BaseExperiment.holdout_input_data
upon Model initialization
-
property
holdout_target
¶ Get a DatasetSentinel representing an Experiment’s holdout_target_data
- Returns
- DatasetSentinel:
A Sentinel that will be converted to
hyperparameter_hunter.experiments.BaseExperiment.holdout_target_data
upon Model initialization
-
class
hyperparameter_hunter.
CVExperiment
(model_initializer, model_init_params=None, model_extra_params=None, feature_engineer=None, feature_selector=None, notes=None, do_raise_repeated=False, auto_start=True, target_metric=None, callbacks=None)¶ Bases:
hyperparameter_hunter.experiments.BaseCVExperiment
- Attributes
- source_script
Methods
cross_validation_workflow
(self)Execute workflow for cross-validation process, consisting of the following tasks: 1) Create train and validation split indices for all folds, 2) Iterate through folds, performing cv_fold_workflow for each, 3) Average accumulated predictions over fold splits, 4) Evaluate final predictions, 5) Format final predictions to prepare for saving
cv_fold_workflow
(self)Execute workflow for individual fold, consisting of the following tasks: 1) Execute overridden
on_fold_start()
tasks, 2) Perform cv_run_workflow for each run, 3) Execute overriddenon_fold_end()
taskscv_run_workflow
(self)Execute run workflow, consisting of: 1) Execute overridden
on_run_start()
tasks, 2) Initialize and fit Model, 3) Execute overriddenon_run_end()
tasksevaluate
(self, data_type, target, prediction)Apply metric(s) to the given data to calculate the value of the prediction
execute
(self)Execute the fitting protocol for the Experiment, comprising the following: instantiation of learners for each run, preprocessing of data as appropriate, training learners, making predictions, and evaluating and aggregating those predictions and other stats/metrics for later use
experiment_workflow
(self)Define the actual experiment process, including execution, result saving, and cleanup
on_exp_start
(self)Prepare data prior to executing fitting protocol (cross-validation), by 1) Initializing formal
datasets
attributes, 2) Invoking feature_engineer to perform “pre_cv”-stage preprocessing, and 3) Updating datasets to include their (transformed) counterparts in feature_engineeron_fold_start
(self)Override
on_fold_start()
tasks set byexperiment_core.ExperimentMeta
, consisting of: 1) Split train/validation data, 2) Make copies of holdout/test data for current fold (for feature engineering), 3) Log start, 4) Execute original taskson_run_start
(self)Override
on_run_start()
tasks organized byexperiment_core.ExperimentMeta
, consisting of: 1) Set random seed and update model parameters according to current seed, 2) Log run start, 3) Execute original taskspreparation_workflow
(self)Execute all tasks that must take place before the experiment is actually started.
-
source_script
= None¶
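A hedged usage sketch (assumes an Environment like the one shown above is already active, and that scikit-learn is installed):

from sklearn.ensemble import RandomForestClassifier
from hyperparameter_hunter import CVExperiment

# The active Environment supplies the data, CV scheme, and metrics; only the
# model-specific hyperparameters are declared here
experiment = CVExperiment(
    model_initializer=RandomForestClassifier,
    model_init_params=dict(n_estimators=200, max_depth=5),
    notes="Baseline random forest",
)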
-
class
hyperparameter_hunter.
BayesianOptPro
(target_metric=None, iterations=1, verbose=1, read_experiments=True, reporter_parameters=None, warn_on_re_ask=False, base_estimator='GP', n_initial_points=10, acquisition_function='gp_hedge', acquisition_optimizer='auto', random_state=32, acquisition_function_kwargs=None, acquisition_optimizer_kwargs=None, n_random_starts='DEPRECATED', callbacks=None, base_estimator_kwargs=None)¶ Bases:
hyperparameter_hunter.optimization.protocol_core.SKOptPro
Bayesian optimization with Gaussian Processes
- Attributes
search_space_size
The number of different hyperparameter permutations possible given the current
- source_script
Methods
forge_experiment
(self, model_initializer[, …])Define hyperparameter search scaffold for building Experiments during optimization
get_ready
(self)Prepare for optimization by finalizing hyperparameter space and identifying similar Experiments.
go
(self[, force_ready])Execute hyperparameter optimization, building an Experiment for each iteration
set_dimensions
(self)Locate given hyperparameters that are space choice declarations and add them to
dimensions
set_experiment_guidelines
(self, *args, …)Deprecated since version 3.0.0a2.
-
source_script
= None¶
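A hedged usage sketch (assumes an active Environment; the model and search bounds are illustrative):

from sklearn.ensemble import RandomForestClassifier
from hyperparameter_hunter import BayesianOptPro, Integer, Real

opt = BayesianOptPro(iterations=10, random_state=32)
opt.forge_experiment(
    model_initializer=RandomForestClassifier,
    model_init_params=dict(
        n_estimators=Integer(50, 300),  # space choice to be optimized
        max_features=Real(0.2, 0.9),    # space choice to be optimized
        max_depth=5,                    # concrete "guideline" value
    ),
)
opt.go()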
-
class
hyperparameter_hunter.
GradientBoostedRegressionTreeOptPro
(target_metric=None, iterations=1, verbose=1, read_experiments=True, reporter_parameters=None, warn_on_re_ask=False, base_estimator='GBRT', n_initial_points=10, acquisition_function='EI', acquisition_optimizer='sampling', random_state=32, acquisition_function_kwargs=None, acquisition_optimizer_kwargs=None, n_random_starts='DEPRECATED', callbacks=None, base_estimator_kwargs=None)¶ Bases:
hyperparameter_hunter.optimization.protocol_core.SKOptPro
Sequential optimization with gradient boosted regression trees
- Attributes
search_space_size
The number of different hyperparameter permutations possible given the current
- source_script
Methods
forge_experiment
(self, model_initializer[, …])Define hyperparameter search scaffold for building Experiments during optimization
get_ready
(self)Prepare for optimization by finalizing hyperparameter space and identifying similar Experiments.
go
(self[, force_ready])Execute hyperparameter optimization, building an Experiment for each iteration
set_dimensions
(self)Locate given hyperparameters that are space choice declarations and add them to
dimensions
set_experiment_guidelines
(self, *args, …)Deprecated since version 3.0.0a2.
-
source_script
= None¶
-
hyperparameter_hunter.
GBRT
¶ alias of
hyperparameter_hunter.optimization.backends.skopt.protocols.GradientBoostedRegressionTreeOptPro
-
class
hyperparameter_hunter.
RandomForestOptPro
(target_metric=None, iterations=1, verbose=1, read_experiments=True, reporter_parameters=None, warn_on_re_ask=False, base_estimator='RF', n_initial_points=10, acquisition_function='EI', acquisition_optimizer='sampling', random_state=32, acquisition_function_kwargs=None, acquisition_optimizer_kwargs=None, n_random_starts='DEPRECATED', callbacks=None, base_estimator_kwargs=None)¶ Bases:
hyperparameter_hunter.optimization.protocol_core.SKOptPro
Sequential optimization with random forest regressor decision trees
- Attributes
search_space_size
The number of different hyperparameter permutations possible given the current
- source_script
Methods
forge_experiment
(self, model_initializer[, …])Define hyperparameter search scaffold for building Experiments during optimization
get_ready
(self)Prepare for optimization by finalizing hyperparameter space and identifying similar Experiments.
go
(self[, force_ready])Execute hyperparameter optimization, building an Experiment for each iteration
set_dimensions
(self)Locate given hyperparameters that are space choice declarations and add them to
dimensions
set_experiment_guidelines
(self, *args, …)Deprecated since version 3.0.0a2.
-
source_script
= None¶
-
hyperparameter_hunter.
RF
¶ alias of
hyperparameter_hunter.optimization.backends.skopt.protocols.RandomForestOptPro
-
class
hyperparameter_hunter.
ExtraTreesOptPro
(target_metric=None, iterations=1, verbose=1, read_experiments=True, reporter_parameters=None, warn_on_re_ask=False, base_estimator='ET', n_initial_points=10, acquisition_function='EI', acquisition_optimizer='sampling', random_state=32, acquisition_function_kwargs=None, acquisition_optimizer_kwargs=None, n_random_starts='DEPRECATED', callbacks=None, base_estimator_kwargs=None)¶ Bases:
hyperparameter_hunter.optimization.protocol_core.SKOptPro
Sequential optimization with extra trees regressor decision trees
- Attributes
search_space_size
The number of different hyperparameter permutations possible given the current
- source_script
Methods
forge_experiment
(self, model_initializer[, …])Define hyperparameter search scaffold for building Experiments during optimization
get_ready
(self)Prepare for optimization by finalizing hyperparameter space and identifying similar Experiments.
go
(self[, force_ready])Execute hyperparameter optimization, building an Experiment for each iteration
set_dimensions
(self)Locate given hyperparameters that are space choice declarations and add them to
dimensions
set_experiment_guidelines
(self, *args, …)Deprecated since version 3.0.0a2.
-
source_script
= None¶
-
hyperparameter_hunter.
ET
¶ alias of
hyperparameter_hunter.optimization.backends.skopt.protocols.ExtraTreesOptPro
-
class
hyperparameter_hunter.
DummyOptPro
(target_metric=None, iterations=1, verbose=1, read_experiments=True, reporter_parameters=None, warn_on_re_ask=False, base_estimator='DUMMY', n_initial_points=10, acquisition_function='EI', acquisition_optimizer='sampling', random_state=32, acquisition_function_kwargs=None, acquisition_optimizer_kwargs=None, n_random_starts='DEPRECATED', callbacks=None, base_estimator_kwargs=None)¶ Bases:
hyperparameter_hunter.optimization.protocol_core.SKOptPro
Random search by uniform sampling
- Attributes
search_space_size
The number of different hyperparameter permutations possible given the current
- source_script
Methods
forge_experiment
(self, model_initializer[, …])Define hyperparameter search scaffold for building Experiments during optimization
get_ready
(self)Prepare for optimization by finalizing hyperparameter space and identifying similar Experiments.
go
(self[, force_ready])Execute hyperparameter optimization, building an Experiment for each iteration
set_dimensions
(self)Locate given hyperparameters that are space choice declarations and add them to
dimensions
set_experiment_guidelines
(self, *args, …)Deprecated since version 3.0.0a2.
-
source_script
= None¶
-
class
hyperparameter_hunter.
Real
(low, high, prior='uniform', transform='identity', name=None)¶ Bases:
hyperparameter_hunter.space.dimensions.NumericalDimension
Search space dimension that can assume any real value in a given range
- Parameters
- low: Float
Lower bound (inclusive)
- high: Float
Upper bound (inclusive)
- prior: {“uniform”, “log-uniform”}, default=”uniform”
Distribution to use when sampling random points for this dimension. If “uniform”, points are sampled uniformly between the lower and upper bounds. If “log-uniform”, points are sampled uniformly between log10(lower) and log10(upper)
- transform: {“identity”, “normalize”}, default=”identity”
Transformation to apply to the original space. If “identity”, the transformed space is the same as the original space. If “normalize”, the transformed space is scaled between 0 and 1
- name: String, tuple, or None, default=None
A name associated with the dimension
- Attributes
- distribution: rv_generic
See documentation of
_make_distribution()
or distribution()
- transform_: String
Original value passed through the transform kwarg - Because
transform()
exists
- transformer: Transformer
See documentation of
_make_transformer()
or transformer()
Methods
distance
(self, a, b)Calculate distance between two points in the dimension’s bounds
get_params
(self)Get dict of parameters used to initialize the Real, or their defaults
inverse_transform
(self, data_t)Inverse transform samples from the warped space back to the original space
rvs
(self[, n_samples, random_state])Draw random samples.
transform
(self, data)Transform samples from the original space into a warped space
-
inverse_transform
(self, data_t)¶ Inverse transform samples from the warped space back to the original space
- Parameters
- data_t: List
Samples to inverse transform. Should be of shape (<# samples>,
transformed_size
)
- Returns
- List
Samples transformed back to original space. Will be shape (<# samples>,
size
)
-
property
transformed_bounds
¶ Dimension bounds in the warped space
- Returns
- low: Float
0.0 if
transform_="normalize". If transform_="identity" and prior="uniform", then low
. Else log10(low)
- high: Float
1.0 if
transform_="normalize". If transform_="identity" and prior="uniform", then high
. Else log10(high)
-
get_params
(self) → dict¶ Get dict of parameters used to initialize the Real, or their defaults
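For instance, a small sampling sketch (the dimension below is illustrative):

>>> from hyperparameter_hunter import Real
>>> dim = Real(0.001, 0.5, prior="log-uniform", name=("model_init_params", "learning_rate"))
>>> samples = dim.rvs(n_samples=3, random_state=32)
>>> all(0.001 <= x <= 0.5 for x in samples)
True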
-
class
hyperparameter_hunter.
Integer
(low, high, transform='identity', name=None)¶ Bases:
hyperparameter_hunter.space.dimensions.NumericalDimension
Search space dimension that can assume any integer value in a given range
- Parameters
- low: Int
Lower bound (inclusive)
- high: Int
Upper bound (inclusive)
- transform: {“identity”, “normalize”}, default=”identity”
Transformation to apply to the original space. If “identity”, the transformed space is the same as the original space. If “normalize”, the transformed space is scaled between 0 and 1
- name: String, tuple, or None, default=None
A name associated with the dimension
- Attributes
- distribution: rv_generic
See documentation of _make_distribution() or distribution()
- transform_: String
Original value passed through the transform kwarg (trailing underscore because the transform() method already exists)
- transformer: Transformer
See documentation of _make_transformer() or transformer()
Methods
distance(self, a, b): Calculate distance between two points in the dimension’s bounds
get_params(self): Get dict of parameters used to initialize the Integer, or their defaults
inverse_transform(self, data_t): Inverse transform samples from the warped space back to the original space
rvs(self[, n_samples, random_state]): Draw random samples.
transform(self, data): Transform samples from the original space into a warped space
-
inverse_transform
(self, data_t)¶ Inverse transform samples from the warped space back to the original space
- Parameters
- data_t: List
Samples to inverse transform. Should be of shape (<# samples>,
transformed_size
)
- Returns
- List
Samples transformed back to original space. Will be shape (<# samples>,
size
)
-
property
transformed_bounds
¶ Dimension bounds in the warped space
- Returns
- low: Int
0 if transform_="normalize", else low
- high: Int
1 if transform_="normalize", else high
-
get_params
(self) → dict¶ Get dict of parameters used to initialize the Integer, or their defaults
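A similar hedged sketch for Integer; the distance value assumes the usual absolute-difference behavior for numerical dimensions:
>>> max_depth = Integer(2, 10, name="max_depth")
>>> max_depth.transformed_bounds  # default "identity" transform keeps the original bounds
(2, 10)
>>> max_depth.distance(3, 7)      # assumed |a - b| for numerical dimensions
4
>>> picks = max_depth.rvs(n_samples=5, random_state=32)  # five ints in [2, 10]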
-
class
hyperparameter_hunter.
Categorical
(categories: list, prior: list = None, transform='onehot', optional=False, name=None)¶ Bases:
hyperparameter_hunter.space.dimensions.Dimension
Search space dimension that can assume any categorical value in a given list
- Parameters
- categories: List
Sequence of possible categories of shape (n_categories,)
- prior: List, or None, default=None
If list, prior probabilities for each category of shape (categories,). By default all categories are equally likely
- transform: {“onehot”, “identity”}, default=”onehot”
Transformation to apply to the original space. If “identity”, the transformed space is the same as the original space. If “onehot”, the transformed space is a one-hot encoded representation of the original space
- optional: Boolean, default=False
Intended for use by FeatureEngineer when optimizing an EngineerStep. Specifically, this enables searching through a space in which an EngineerStep either may or may not be used. This is contrary to Categorical’s usual function of creating a space comprising multiple categories. When optional = True, the space created will represent any of the values in categories either being included in the entire FeatureEngineer process, or being skipped entirely. Internally, a value excluded by optional is represented by a sentinel value that signals it should be removed from the containing list, so optional will not work for choosing between a single value and None, for example
- name: String, tuple, or None, default=None
A name associated with the dimension
- Attributes
- categories: Tuple
Original value passed through the categories kwarg, cast to a tuple. If optional is True, then an instance of RejectedOptional will be appended to categories
- distribution: rv_generic
See documentation of _make_distribution() or distribution()
- optional: Boolean
Original value passed through the optional kwarg
- prior: List, or None
Original value passed through the prior kwarg
- prior_actual: List
Calculated prior value, initially equivalent to prior, but then set to a default array if None
- transform_: String
Original value passed through the transform kwarg (trailing underscore because the transform() method already exists)
- transformer: Transformer
See documentation of _make_transformer() or transformer()
Methods
distance(self, a, b): Calculate distance between two points in the dimension’s bounds
get_params(self): Get dict of parameters used to initialize the Categorical, or their defaults
inverse_transform(self, data_t): Inverse transform samples from the warped space back to the original space
rvs(self[, n_samples, random_state]): Draw random samples.
transform(self, data): Transform samples from the original space into a warped space
-
rvs
(self, n_samples=None, random_state=None)¶ Draw random samples. Samples are in the original (untransformed) space. They must be transformed before being passed to a model or minimizer via
transform()
- Parameters
- n_samples: Int (optional)
Number of samples to be drawn. If not given, a single sample will be returned
- random_state: Int, RandomState, or None, default=None
Set random state to something other than None for reproducible results
- Returns
- List
Randomly drawn samples from the original space
-
property
transformed_size
¶ Size of the transformed space for the dimension
- Returns
- Int
1 if transform_ == “identity”
1 if transform_ == “onehot” and length of categories is 1 or 2
Length of categories in all other cases
-
property
bounds
¶ Dimension bounds in the original space
- Returns
- Tuple
categories
-
property
transformed_bounds
¶ Dimension bounds in the warped space
- Returns
- Tuple, or list
If transformed_size == 1, then a tuple of (0.0, 1.0). Otherwise, returns a list containing transformed_size-many tuples of (0.0, 1.0)
Notes
transformed_size == 1 when the length of categories == 2, so if there are two items in categories, (0.0, 1.0) is returned. If there are three items in categories, [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)] is returned, and so on.
Because transformed_bounds uses transformed_size, it is affected by transform_. Specifically, the returns described above are for transform_ == “onehot” (default).
Examples
>>> Categorical(["a", "b"]).transformed_bounds
(0.0, 1.0)
>>> Categorical(["a", "b", "c"]).transformed_bounds
[(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]
>>> Categorical(["a", "b", "c", "d"]).transformed_bounds
[(0.0, 1.0), (0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]
-
distance
(self, a, b) → int¶ Calculate distance between two points in the dimension’s bounds
- Parameters
- a
First category
- b
Second category
- Returns
- Int
0 if a == b. Else 1 (because categories have no order)
-
get_params
(self) → dict¶ Get dict of parameters used to initialize the Categorical, or their defaults
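A short, hedged sketch of Categorical’s behavior, following the distance and transformed_size rules documented above:
>>> booster = Categorical(["gbtree", "dart"], name="booster")
>>> booster.distance("gbtree", "dart")  # categories are unordered: 0 if equal, else 1
1
>>> booster.transformed_size  # two categories -> one-hot still fits in a single column
1
>>> samples = booster.rvs(n_samples=2, random_state=32)  # drawn from the original categories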
-
hyperparameter_hunter.
lambda_callback
(on_exp_start=None, on_exp_end=None, on_rep_start=None, on_rep_end=None, on_fold_start=None, on_fold_end=None, on_run_start=None, on_run_end=None, agg_name=None, do_reshape_aggs=True, method_agg_keys=False, on_experiment_start=&lt;sentinel&gt;, on_experiment_end=&lt;sentinel&gt;, on_repetition_start=&lt;sentinel&gt;, on_repetition_end=&lt;sentinel&gt;)¶ Utility for creating custom callbacks to be declared by
Environment
and used by Experiments. The callable “on_<…>_<start/end>” parameters provided will receive as input whichever attributes of the Experiment are included in the signature of the given callable. If **kwargs is given in the callable’s signature, a dict of all of the Experiment’s attributes will be provided. This can be helpful for trying to figure out how to build a custom callback, but should not be used unless absolutely necessary. If the Experiment does not have an attribute specified in the callable’s signature, the following placeholder will be given: “INVALID KWARG”
- Parameters
- on_exp_start: Callable, or None, default=None
Callable that receives Experiment’s values for parameters in the signature at Experiment start
- on_exp_end: Callable, or None, default=None
Callable that receives Experiment’s values for parameters in the signature at Experiment end
- on_rep_start: Callable, or None, default=None
Callable that receives Experiment’s values for parameters in the signature at repetition start
- on_rep_end: Callable, or None, default=None
Callable that receives Experiment’s values for parameters in the signature at repetition end
- on_fold_start: Callable, or None, default=None
Callable that receives Experiment’s values for parameters in the signature at fold start
- on_fold_end: Callable, or None, default=None
Callable that receives Experiment’s values for parameters in the signature at fold end
- on_run_start: Callable, or None, default=None
Callable that receives Experiment’s values for parameters in the signature at run start
- on_run_end: Callable, or None, default=None
Callable that receives Experiment’s values for parameters in the signature at run end
- agg_name: Str, default=uuid.uuid4
This parameter is only used if the callables are behaving like AggregatorCallbacks by returning values (see the “Notes” section below for details on this). If the callables do return values, they will be stored under a key named (“_” + agg_name) in a dict in
hyperparameter_hunter.experiments.BaseExperiment.stat_aggregates
. The purpose of this parameter is to make it easier to understand an Experiment’s description file, as agg_name will default to a UUID if it is not given
- do_reshape_aggs: Boolean, default=True
Whether to reshape the aggregated values to reflect the nested repetitions/folds/runs structure used for other aggregated values. If False, lists of aggregated values are left in their original shapes. This parameter is only used if the callables are behaving like AggregatorCallbacks (see the “Notes” section below and agg_name for details on this)
- method_agg_keys: Boolean, default=False
If True, the aggregate keys for the items added to the dict at agg_name are equivalent to the names of the “on_<…>_<start/end>” pseudo-methods whose values are being aggregated. In other words, the pool of all possible aggregate keys goes from [“runs”, “folds”, “reps”, “final”] to the names of the eight “on_<…>_<start/end>” kwargs of
lambda_callback()
. See the “Notes” section below for further details and a rough outline
- on_experiment_start: …
Deprecated since version 3.0.0: Renamed to on_exp_start. Will be removed in 3.2.0
- on_experiment_end: …
Deprecated since version 3.0.0: Renamed to on_exp_end. Will be removed in 3.2.0
- on_repetition_start: …
Deprecated since version 3.0.0: Renamed to on_rep_start. Will be removed in 3.2.0
- on_repetition_end: …
Deprecated since version 3.0.0: Renamed to on_rep_end. Will be removed in 3.2.0
- Returns
- LambdaCallback: LambdaCallback
Uninitialized class, whose methods are the callables of the corresponding “on…” kwarg
Notes
For all of the “on_<…>_<start/end>” callables provided as input to lambda_callback, consider the following guidelines (for example function “f”, which can represent any of the callables):
All input parameters in the signature of “f” are attributes of the Experiment being executed
If “**kwargs” is a parameter, a dict of all the Experiment’s attributes will be provided
“f” will be treated as a method of a parent class of the Experiment
Take care when modifying attributes, as changes are reflected in the Experiment itself
If “f” returns something, it will automatically behave like an AggregatorCallback (see hyperparameter_hunter.callbacks.aggregators). Specifically, the following will occur:
A new key (named by agg_name if given, else a UUID) with a dict value is added to hyperparameter_hunter.experiments.BaseExperiment.stat_aggregates
This new dict can have up to four keys: “runs” (list), “folds” (list), “reps” (list), and “final” (object)
If “f” is an “on_run…” function, the returned value is appended to the “runs” list in the new dict
Similarly, if “f” is an “on_fold…” or “on_rep…” function, the returned value is appended to the “folds”, or “reps” list, respectively
If “f” is an “on_exp…” function, the “final” key in the new dict is set to the returned value
If values were aggregated in the aforementioned manner, the lists of collected values will be reshaped according to runs/folds/reps on Experiment end
The aggregated values will be saved in the Experiment’s description file
This is because
hyperparameter_hunter.experiments.BaseExperiment.stat_aggregates
is saved in its entirety
What follows is a rough outline of the structure produced when using an aggregator-like callback that automatically populates
experiments.BaseExperiment.stat_aggregates
with results of the functions used as arguments to lambda_callback():

BaseExperiment.stat_aggregates = dict(
    ...,
    <`agg_name`>=dict(
        <agg_key "runs"> = [...],
        <agg_key "folds"> = [...],
        <agg_key "reps"> = [...],
        <agg_key "final"> = object(),
        ...
    ),
    ...
)
In the above outline, the actual agg_keys included in the dict at agg_name depend on which “on_<…>_<start/end>” callables are behaving like aggregators. For example, if neither on_run_start nor on_run_end explicitly returns something, then the “runs” agg_key is not included in the agg_name dict. Similarly, if, for example, neither on_exp_start nor on_exp_end is provided, then the “final” agg_key is not included. If method_agg_keys=True, then the agg keys used in the dict are modified to be named after the method called. For example, if method_agg_keys=True and on_fold_start and on_fold_end are both callables returning values to be aggregated, then the agg_keys used for each will be “on_fold_start” and “on_fold_end”, respectively. In this example, if method_agg_keys=False (default) and do_reshape_aggs=False, then the single “folds” agg_key would contain the combined contents returned by both methods in the order in which they were returned
For examples using lambda_callback to create custom callbacks, see
hyperparameter_hunter.callbacks.recipes
Examples
>>> from hyperparameter_hunter.environment import Environment
>>> def printer_helper(_rep, _fold, _run, last_evaluation_results):
...     print(f"{_rep}.{_fold}.{_run} {last_evaluation_results}")
>>> my_lambda_callback = lambda_callback(
...     on_exp_end=printer_helper,
...     on_rep_end=printer_helper,
...     on_fold_end=printer_helper,
...     on_run_end=printer_helper,
... )
>>> # env = Environment(
>>> #     train_dataset="i am a dataset",
>>> #     results_path="path/to/HyperparameterHunterAssets",
>>> #     metrics=["roc_auc_score"],
>>> #     experiment_callbacks=[my_lambda_callback]
>>> # )
>>> # ... Now execute an Experiment, or an Optimization Protocol...
See
hyperparameter_hunter.examples.lambda_callback_example
for more information
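As a further hedged sketch of the aggregator behavior described in the Notes above (the collect_score callable and the agg_name value here are illustrative, not part of the library):
>>> def collect_score(last_evaluation_results):
...     return last_evaluation_results  # returned values are aggregated automatically
>>> score_collector = lambda_callback(
...     on_run_end=collect_score,     # returned values are appended to the "runs" list
...     on_exp_end=collect_score,     # returned value is stored under the "final" key
...     agg_name="collected_scores",  # stored at stat_aggregates["_collected_scores"]
... )
>>> # Provide `score_collector` via `Environment`'s `experiment_callbacks`, then run an
>>> # Experiment; its description file will include the aggregated values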
-
class
hyperparameter_hunter.
FeatureEngineer
(steps=None, do_validate=False, **datasets)¶ Bases:
object
Class to organize feature engineering step callables (steps, given as EngineerStep instances) and the datasets that the steps request and return.
- Parameters
- steps: List, or None, default=None
List of arbitrary length, containing any of the following values:
EngineerStep
instance,Function to provide as input to
EngineerStep
, orCategorical
, with categories comprising a selection of the previous two steps values (optimization only)
The third value can only be used during optimization. The feature_engineer provided to
CVExperiment
, for example, may only contain the first two values. To search a space optionally including an EngineerStep, use the optional kwarg ofCategorical
.See
EngineerStep
for information on properly formatted EngineerStep functions. Additional engineering steps may be added viaadd_step()
- do_validate: Boolean, or “strict”, default=False
… Experimental… Whether to validate the datasets resulting from feature engineering steps. If True, hashes of the new datasets will be compared to those of the originals to ensure they were actually modified. Results will be logged. If do_validate = “strict”, an exception will be raised if any anomalies are found, rather than logging a message. If do_validate = False, no validation will be performed
- **datasets: DFDict
This is not expected to be provided on initialization and is offered primarily for debugging/testing. Mapping of datasets necessary to perform feature engineering steps
See also
EngineerStep
For proper formatting of non-Categorical values of steps
Notes
If steps does include any instances of
hyperparameter_hunter.space.dimensions.Categorical
, this FeatureEngineer instance will not be usable by Experiments. It can only be used by Optimization Protocols. Furthermore, the FeatureEngineer that the Optimization Protocol actually ends up using will not pass identity checks against the original FeatureEngineer that contained Categorical steps
Examples
>>> import numpy as np
>>> from sklearn.preprocessing import StandardScaler, MinMaxScaler, QuantileTransformer
>>> # Define some engineer step functions to play with
>>> def s_scale(train_inputs, non_train_inputs):
...     s = StandardScaler()
...     train_inputs[train_inputs.columns] = s.fit_transform(train_inputs.values)
...     non_train_inputs[train_inputs.columns] = s.transform(non_train_inputs.values)
...     return train_inputs, non_train_inputs
>>> def mm_scale(train_inputs, non_train_inputs):
...     s = MinMaxScaler()
...     train_inputs[train_inputs.columns] = s.fit_transform(train_inputs.values)
...     non_train_inputs[train_inputs.columns] = s.transform(non_train_inputs.values)
...     return train_inputs, non_train_inputs
>>> def q_transform(train_targets, non_train_targets):
...     t = QuantileTransformer(output_distribution="normal")
...     train_targets[train_targets.columns] = t.fit_transform(train_targets.values)
...     non_train_targets[train_targets.columns] = t.transform(non_train_targets.values)
...     return train_targets, non_train_targets, t
>>> def sqr_sum(all_inputs):
...     all_inputs["square_sum"] = all_inputs.agg(
...         lambda row: np.sqrt(np.sum([np.square(_) for _ in row])), axis="columns"
...     )
...     return all_inputs
FeatureEngineer steps wrapped by `EngineerStep` == raw function steps - as long as the `EngineerStep` is using the default parameters
>>> # FeatureEngineer steps wrapped by `EngineerStep` == raw function steps
>>> # ... As long as the `EngineerStep` is using the default parameters
>>> fe_0 = FeatureEngineer([sqr_sum, s_scale])
>>> fe_1 = FeatureEngineer([EngineerStep(sqr_sum), EngineerStep(s_scale)])
>>> fe_0.steps == fe_1.steps
True
>>> fe_2 = FeatureEngineer([sqr_sum, EngineerStep(s_scale), q_transform])
`Categorical` can be used during optimization and placed anywhere in `steps`. `Categorical` can also handle either `EngineerStep` categories or raw functions. Use the `optional` kwarg of `Categorical` to test some questionable steps
>>> fe_3 = FeatureEngineer([sqr_sum, Categorical([s_scale, mm_scale]), q_transform])
>>> fe_4 = FeatureEngineer([Categorical([sqr_sum], optional=True), s_scale, q_transform])
>>> fe_5 = FeatureEngineer([
...     Categorical([sqr_sum], optional=True),
...     Categorical([EngineerStep(s_scale), mm_scale]),
...     q_transform
... ])
- Attributes
steps
Feature engineering steps to execute in sequence on
FeatureEngineer.__call__()
Methods
__call__(self, stage, **datasets, …): Execute all feature engineering steps in steps for stage, with the given datasets as inputs
add_step(self, step, …): Add an engineering step to steps to be executed with the other contents of steps on FeatureEngineer.__call__()
get_key_data(self): Produce a dict of critical attributes describing the FeatureEngineer instance for use by key-making classes
inverse_transform(self, data): Perform the inverse transformation for all engineer steps in steps in sequence on data
inverse_transform
(self, data)¶ Perform the inverse transformation for all engineer steps in
steps
in sequence on data- Parameters
- data: Array-like
Data to inverse transform with any inversions present in
steps
- Returns
- Array-like
Result of sequentially calling inverse transformations in steps on data. If any step has EngineerStep.inversion = None, data is unmodified for that step, and proceeds to the next engineer step inversion
-
property
steps
¶ Feature engineering steps to execute in sequence on
FeatureEngineer.__call__()
-
get_key_data
(self) → dict¶ Produce a dict of critical attributes describing the
FeatureEngineer
instance for use by key-making classes- Returns
- Dict
Important attributes describing this
FeatureEngineer
instance
-
add_step
(self, step: Union[Callable, hyperparameter_hunter.space.dimensions.Categorical], stage: str = None, name: str = None, before: str = EMPTY_SENTINEL, after: str = EMPTY_SENTINEL, number: int = EMPTY_SENTINEL)¶ Add an engineering step to
steps
to be executed with the other contents ofsteps
onFeatureEngineer.__call__()
- Parameters
- step: Callable, or `EngineerStep`, or `Categorical`
If EngineerStep instance, will be added directly to
steps
. Otherwise, must be a feature engineering step callable that requests, modifies, and returns datasets, which will be used to instantiate an EngineerStep to add to steps. If Categorical, categories should contain EngineerStep instances or callables
- stage: String in {“pre_cv”, “intra_cv”}, or None, default=None
Feature engineering stage during which the callable step will be executed
- name: String, or None, default=None
Identifier for the transformation applied by this engineering step. If None and step is not an EngineerStep, will be inferred during
EngineerStep
instantiation- before: String, default=EMPTY_SENTINEL
… Experimental…
- after: String, default=EMPTY_SENTINEL
… Experimental…
- number: String, default=EMPTY_SENTINEL
… Experimental…
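A hedged usage sketch of add_step, reusing the s_scale, q_transform, and sqr_sum functions defined in the class-level Examples above:
>>> fe = FeatureEngineer([s_scale])
>>> fe.add_step(q_transform)              # bare function is wrapped in an `EngineerStep`
>>> fe.add_step(sqr_sum, stage="pre_cv")  # declare the stage explicitly when needed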
-
class
hyperparameter_hunter.
EngineerStep
(f: Callable, stage=None, name=None, params=None, do_validate=False)¶ Bases:
object
Container for individual
FeatureEngineer
step functions
Compartmentalizes functions of singular engineer steps and allows for greater customization than a raw engineer step function
- Parameters
- f: Callable
Feature engineering step function that requests, modifies, and returns datasets (see params below)
Step functions should follow these guidelines:
Request as input a subset of the 11 data strings listed in params
Do whatever you want to the DataFrames given as input
Return new DataFrame values of the input parameters in same order as requested
If performing a task like target transformation, causing predictions to be transformed, it is often desirable to inverse-transform the predictions to be of the expected form. This can easily be done by returning an extra value from f (after the datasets) that is either a callable, or a transformer class that was fitted during the execution of f and implements an inverse_transform method. This is the only instance in which it is acceptable for f to return values that don’t mimic its input parameters. See the engineer function definition using SKLearn’s QuantileTransformer in the Examples section below for an actual inverse-transformation-compatible implementation
- stage: String in {“pre_cv”, “intra_cv”}, or None, default=None
Feature engineering stage during which the callable f will be given the datasets params to modify and return. If None, will be inferred based on params.
“pre_cv” functions are applied only once in the experiment: when it starts
“intra_cv” functions are reapplied for each fold in the cross-validation splits
If stage is left to be inferred, “pre_cv” will usually be selected. However, if any params (or parameters in the signature of f) are prefixed with “validation…” or “non_train…”, then stage will be inferred as “intra_cv”. See the Notes section below for suggestions on the stage to use for different functions
- name: String, or None, default=None
Identifier for the transformation applied by this engineering step. If None, f.__name__ will be used
- params: Tuple[str], or None, default=None
Dataset names requested by feature engineering step callable f. If None, will be inferred by parsing the signature of f. Must be a subset of the following 11 strings:
Input Data
“train_inputs”
“validation_inputs”
“holdout_inputs”
“test_inputs”
- “all_inputs”
("train_inputs" + ["validation_inputs"] + "holdout_inputs" + "test_inputs")
- “non_train_inputs”
(["validation_inputs"] + "holdout_inputs" + "test_inputs")
Target Data
“train_targets”
“validation_targets”
“holdout_targets”
“all_targets”
("train_targets" + ["validation_targets"] + "holdout_targets")
“non_train_targets”
(["validation_targets"] + "holdout_targets")
As an alternative to the above list, just remember that the first half of all parameter names should be one of {“train”, “validation”, “holdout”, “test”, “all”, “non_train”}, and the second half should be either “inputs” or “targets”. The only exception to this rule is “test_targets”, which doesn’t exist.
Inference of “validation” params is affected by stage. During the “pre_cv” stage, the validation dataset has not yet been created and is still a part of the train dataset. During the “intra_cv” stage, the validation dataset is created by removing a portion of the train dataset, and their values passed to f reflect this fact. This also means that the values of the merged (“all”/”non_train”-prefixed) datasets may or may not contain “validation” data depending on the stage; however, this is all handled internally, so you probably don’t need to worry about it.
params may not include multiple references to the same dataset, either directly or indirectly. This means (“train_inputs”, “train_inputs”) is invalid due to duplicate direct references. Less obviously, (“train_inputs”, “all_inputs”) is invalid because “all_inputs” includes “train_inputs”
- do_validate: Boolean, or “strict”, default=False
… Experimental… Whether to validate the datasets resulting from feature engineering steps. If True, hashes of the new datasets will be compared to those of the originals to ensure they were actually modified. Results will be logged. If do_validate = “strict”, an exception will be raised if any anomalies are found, rather than logging a message. If do_validate = False, no validation will be performed
See also
FeatureEngineer
The container for EngineerStep instances. EngineerSteps should always be provided to HyperparameterHunter through a FeatureEngineer
Categorical
Can be used during optimization to search through a group of EngineerSteps given as categories. The optional kwarg of Categorical designates a FeatureEngineer step that may be one of the EngineerSteps in categories, or may be omitted entirely
get_engineering_step_stage()
More information on stage inference and situations where overriding it may be prudent
Notes
stage: Generally, feature engineering conducted in the “pre_cv” stage should regard each sample/row as an independent entity. For example, steps like converting a string day of the week to one-hot encoded columns, or imputing missing values by replacement with -1, might be conducted “pre_cv”, since they are unlikely to introduce information leakage. Conversely, steps like scaling/normalization, whose results for the data in one row are affected by the data in other rows, should be performed “intra_cv” in order to recalculate the final values of the datasets for each cross-validation split and avoid information leakage.
params: In the list of the 11 valid params strings, “test_inputs” is notably missing the “…_targets” counterpart accompanying the other datasets. The “targets” suffix is missing because test data targets are never given. Note that although “test_inputs” is still included in both “all_inputs” and “non_train_inputs”, its lack of a target column means that “all_targets” and “non_train_targets” may have different lengths than their “inputs”-suffixed counterparts
Examples
>>> from sklearn.preprocessing import StandardScaler, QuantileTransformer
>>> def s_scale(train_inputs, non_train_inputs):
...     s = StandardScaler()
...     train_inputs[train_inputs.columns] = s.fit_transform(train_inputs.values)
...     non_train_inputs[train_inputs.columns] = s.transform(non_train_inputs.values)
...     return train_inputs, non_train_inputs
>>> # Sensible parameter defaults inferred based on `f`
>>> es_0 = EngineerStep(s_scale)
>>> es_0.stage
'intra_cv'
>>> es_0.name
's_scale'
>>> es_0.params
('train_inputs', 'non_train_inputs')
>>> # Override `stage` if you want to fit your scaler on OOF data like a crazy person
>>> es_1 = EngineerStep(s_scale, stage="pre_cv")
>>> es_1.stage
'pre_cv'
Watch out for multiple requests to the same data
>>> es_2 = EngineerStep(s_scale, params=("train_inputs", "all_inputs"))
Traceback (most recent call last):
    File "feature_engineering.py", line ? in validate_dataset_names
ValueError: Requested params include duplicate references to `train_inputs` by way of:
   - ('all_inputs', 'train_inputs')
   - ('train_inputs',)
Each dataset may only be requested by a single param for each function
Error is the same if `(train_inputs, all_inputs)` is in the actual function signature
EngineerStep functions aren’t just limited to transformations. Make your own features!
>>> import numpy as np
>>> def sqr_sum(all_inputs):
...     all_inputs["square_sum"] = all_inputs.agg(
...         lambda row: np.sqrt(np.sum([np.square(_) for _ in row])), axis="columns"
...     )
...     return all_inputs
>>> es_3 = EngineerStep(sqr_sum)
>>> es_3.stage
'pre_cv'
>>> es_3.name
'sqr_sum'
>>> es_3.params
('all_inputs',)
Inverse-transformation Implementation:
>>> def q_transform(train_targets, non_train_targets):
...     t = QuantileTransformer(output_distribution="normal")
...     train_targets[train_targets.columns] = t.fit_transform(train_targets.values)
...     non_train_targets[train_targets.columns] = t.transform(non_train_targets.values)
...     return train_targets, non_train_targets, t
>>> # Note that `train_targets` and `non_train_targets` must still be returned in order,
>>> # but they are followed by `t`, an instance of `QuantileTransformer` we just fitted,
>>> # whose `inverse_transform` method will be called on predictions
>>> es_4 = EngineerStep(q_transform)
>>> es_4.stage
'intra_cv'
>>> es_4.name
'q_transform'
>>> es_4.params
('train_targets', 'non_train_targets')
>>> # `params` does not include any returned transformers - Only data requested as input
- Attributes
Methods
__call__(self, **datasets, …): Apply f to datasets to produce updated datasets
get_comparison_attrs(step_obj: Union[EngineerStep, dict]): Build a dict of critical EngineerStep attributes
get_datasets_for_f(self, datasets, …): Produce a dict of DataFrames containing only the merged datasets and standard datasets requested in params
get_key_data(self): Produce a dict of critical attributes describing the EngineerStep instance for use by key-making classes
honorary_step_from_dict(step_dict, dimension): Get an EngineerStep from dimension that is equal to its dict form, step_dict
inverse_transform(self, data): Perform the inverse transformation for this engineer step (if it exists)
stringify(self): Make a stringified representation of self, compatible with EngineerStep.__eq__()
-
inverse_transform
(self, data)¶ Perform the inverse transformation for this engineer step (if it exists)
- Parameters
- data: Array-like
Data to inverse transform with
inversion
orinversion.inverse_transform
- Returns
- Array-like
If
inversion
is None, return data unmodified. Else, return the result ofinversion
orinversion.inverse_transform
, given data
-
get_datasets_for_f
(self, datasets:Dict[str, pandas.core.frame.DataFrame]) → Dict[str, pandas.core.frame.DataFrame]¶ Produce a dict of DataFrames containing only the merged datasets and standard datasets requested in
params
. In other words, add the requested merged datasets and remove unnecessary standard datasets- Parameters
- datasets: DFDict
Original dict of datasets, containing all datasets provided to
EngineerStep.__call__()
, some of which may be superfluous, or may require additional processing to resolve merged/coupled datasets
- Returns
- DFDict
Updated version of datasets, in which unnecessary datasets have been filtered out, and the requested merged datasets have been added
-
get_key_data
(self) → dict¶ Produce a dict of critical attributes describing the
EngineerStep
instance for use by key-making classes- Returns
- Dict
Important attributes describing this
EngineerStep
instance
-
property
f
¶ Feature engineering step callable that requests, modifies, and returns datasets
-
property
name
¶ Identifier for the transformation applied by this engineering step
-
property
params
¶ Dataset names requested by feature engineering step callable
f
. See documentation inEngineerStep.__init__()
for more information/restrictions
-
property
stage
¶ Feature engineering stage during which the EngineerStep will be executed
-
static
get_comparison_attrs
(step_obj: Union[EngineerStep, dict]) → dict¶ Build a dict of critical
EngineerStep
attributes- Parameters
- step_obj: EngineerStep, dict
Object for which critical
EngineerStep
attributes should be collected
- Returns
- attr_vals: Dict
Critical
EngineerStep
attributes. If step_obj does not have a necessary attribute (for EngineerStep) or a necessary key (for dict), its value in attr_vals will be a placeholder object. This is to facilitate comparison, while also ensuring missing values will always be considered unequal to other values
Examples
>>> def dummy_f(train_inputs, non_train_inputs):
...     return train_inputs, non_train_inputs
>>> es_0 = EngineerStep(dummy_f)
>>> EngineerStep.get_comparison_attrs(es_0)  # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
{'name': 'dummy_f',
 'f': <function dummy_f at ...>,
 'params': ('train_inputs', 'non_train_inputs'),
 'stage': 'intra_cv',
 'do_validate': False}
>>> EngineerStep.get_comparison_attrs(
...     dict(foo="hello", f=dummy_f, params=["all_inputs", "all_targets"], stage="pre_cv")
... )  # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
{'name': <object object at ...>,
 'f': <function dummy_f at ...>,
 'params': ('all_inputs', 'all_targets'),
 'stage': 'pre_cv',
 'do_validate': <object object at ...>}
-
stringify
(self) → str¶ Make a stringified representation of self, compatible with
EngineerStep.__eq__()
- Returns
- String
String describing all critical attributes of the
EngineerStep
instance. This value is not particularly human-friendly due to both its length and the fact thatEngineerStep.f
is represented by its hash
Examples
>>> def dummy_f(train_inputs, non_train_inputs):
...     return train_inputs, non_train_inputs
>>> EngineerStep(dummy_f).stringify()  # doctest: +ELLIPSIS
"EngineerStep(dummy_f, ..., ('train_inputs', 'non_train_inputs'), intra_cv, False)"
>>> EngineerStep(dummy_f, stage="pre_cv").stringify()  # doctest: +ELLIPSIS
"EngineerStep(dummy_f, ..., ('train_inputs', 'non_train_inputs'), pre_cv, False)"
-
classmethod
honorary_step_from_dict
(step_dict: dict, dimension: hyperparameter_hunter.space.dimensions.Categorical)¶ Get an EngineerStep from dimension that is equal to its dict form, step_dict
- Parameters
- step_dict: Dict
Dict of form saved in Experiment description files for EngineerStep. Expected to have following keys, with values of the given types:
“name”: String
“f”: String (SHA256 hash)
“params”: List[str], or Tuple[str, …]
“stage”: String in {“pre_cv”, “intra_cv”}
“do_validate”: Boolean
- dimension: Categorical
Categorical instance expected to contain the EngineerStep equivalent of step_dict in its categories
- Returns
- EngineerStep
From dimension.categories if it is the EngineerStep equivalent of step_dict
- Raises
- ValueError
If dimension.categories does not contain an EngineerStep matching step_dict
-
class
hyperparameter_hunter.
BayesianOptimization
(**kwargs)¶ Bases:
hyperparameter_hunter.optimization.backends.skopt.protocols.BayesianOptPro
Deprecated since version 3.0.0a2: Will be removed in 3.2.0. Renamed to BayesianOptPro
- Attributes
search_space_size: The number of different hyperparameter permutations possible given the current…
- source_script
Methods
forge_experiment(self, model_initializer[, …]): Define hyperparameter search scaffold for building Experiments during optimization
get_ready(self): Prepare for optimization by finalizing hyperparameter space and identifying similar Experiments.
go(self[, force_ready]): Execute hyperparameter optimization, building an Experiment for each iteration
set_dimensions(self): Locate given hyperparameters that are space choice declarations and add them to dimensions
set_experiment_guidelines(self, *args, …): Deprecated since version 3.0.0a2.
- source_script = None¶
-
class
hyperparameter_hunter.
GradientBoostedRegressionTreeOptimization
(**kwargs)¶ Bases:
hyperparameter_hunter.optimization.backends.skopt.protocols.GradientBoostedRegressionTreeOptPro
Deprecated since version 3.0.0a2: Will be removed in 3.2.0. Renamed to GradientBoostedRegressionTreeOptPro
- Attributes
search_space_size: The number of different hyperparameter permutations possible given the current…
- source_script
Methods
forge_experiment(self, model_initializer[, …]): Define hyperparameter search scaffold for building Experiments during optimization
get_ready(self): Prepare for optimization by finalizing hyperparameter space and identifying similar Experiments.
go(self[, force_ready]): Execute hyperparameter optimization, building an Experiment for each iteration
set_dimensions(self): Locate given hyperparameters that are space choice declarations and add them to dimensions
set_experiment_guidelines(self, *args, …): Deprecated since version 3.0.0a2.
- source_script = None¶
-
class
hyperparameter_hunter.
RandomForestOptimization
(**kwargs)¶ Bases:
hyperparameter_hunter.optimization.backends.skopt.protocols.RandomForestOptPro
Deprecated since version 3.0.0a2: Will be removed in 3.2.0. Renamed to RandomForestOptPro
- Attributes
search_space_size: The number of different hyperparameter permutations possible given the current…
- source_script
Methods
forge_experiment(self, model_initializer[, …]): Define hyperparameter search scaffold for building Experiments during optimization
get_ready(self): Prepare for optimization by finalizing hyperparameter space and identifying similar Experiments.
go(self[, force_ready]): Execute hyperparameter optimization, building an Experiment for each iteration
set_dimensions(self): Locate given hyperparameters that are space choice declarations and add them to dimensions
set_experiment_guidelines(self, *args, …): Deprecated since version 3.0.0a2.
- source_script = None¶
-
class
hyperparameter_hunter.
ExtraTreesOptimization
(**kwargs)¶ Bases:
hyperparameter_hunter.optimization.backends.skopt.protocols.ExtraTreesOptPro
Deprecated since version 3.0.0a2: Will be removed in 3.2.0. Renamed to ExtraTreesOptPro
- Attributes
search_space_size: The number of different hyperparameter permutations possible given the current…
- source_script
Methods
forge_experiment(self, model_initializer[, …]): Define hyperparameter search scaffold for building Experiments during optimization
get_ready(self): Prepare for optimization by finalizing hyperparameter space and identifying similar Experiments.
go(self[, force_ready]): Execute hyperparameter optimization, building an Experiment for each iteration
set_dimensions(self): Locate given hyperparameters that are space choice declarations and add them to dimensions
set_experiment_guidelines(self, *args, …): Deprecated since version 3.0.0a2.
- source_script = None¶
-
class
hyperparameter_hunter.
DummySearch
(**kwargs)¶ Bases:
hyperparameter_hunter.optimization.backends.skopt.protocols.DummyOptPro
Deprecated since version 3.0.0a2: Will be removed in 3.2.0. Renamed to DummyOptPro
- Attributes
search_space_size: The number of different hyperparameter permutations possible given the current…
- source_script
Methods
forge_experiment(self, model_initializer[, …]): Define hyperparameter search scaffold for building Experiments during optimization
get_ready(self): Prepare for optimization by finalizing hyperparameter space and identifying similar Experiments.
go(self[, force_ready]): Execute hyperparameter optimization, building an Experiment for each iteration
set_dimensions(self): Locate given hyperparameters that are space choice declarations and add them to dimensions
set_experiment_guidelines(self, *args, …): Deprecated since version 3.0.0a2.
- source_script = None¶
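All five classes above are thin, deprecated aliases for their renamed OptPro counterparts, so migration is a one-line rename. A minimal sketch, assuming the renamed protocols are importable from the package root as the deprecation notices indicate:
>>> # Before (deprecated since 3.0.0a2; scheduled for removal in 3.2.0):
>>> # opt = BayesianOptimization(iterations=10)
>>> # After (identical usage; only the name changed):
>>> from hyperparameter_hunter import BayesianOptPro
>>> opt = BayesianOptPro(iterations=10)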