HyperparameterHunter API Essentials
This section exposes the API for all the HyperparameterHunter functionality that will be necessary for most users.
Environment
Experimentation
Hyperparameter Space
class hyperparameter_hunter.space.dimensions.Real(low, high, prior='uniform', transform='identity', name=None)
Search space dimension that can assume any real value in a given range
- Parameters
- low: Float
Lower bound (inclusive)
- high: Float
Upper bound (inclusive)
- prior: {"uniform", "log-uniform"}, default="uniform"
Distribution to use when sampling random points for this dimension. If "uniform", points are sampled uniformly between the lower and upper bounds. If "log-uniform", points are sampled uniformly between log10(lower) and log10(upper)
- transform: {"identity", "normalize"}, default="identity"
Transformation to apply to the original space. If "identity", the transformed space is the same as the original space. If "normalize", the transformed space is scaled between 0 and 1
- name: String, tuple, or None, default=None
A name associated with the dimension
- Attributes
- distribution: rv_generic
See documentation of _make_distribution() or distribution()
- transform_: String
Original value passed through the transform kwarg. The trailing underscore is used because transform() already exists as a method
- transformer: Transformer
See documentation of _make_transformer() or transformer()
Methods
distance(a, b): Calculate distance between two points in the dimension's bounds
Get dict of parameters used to initialize the Real, or their defaults
inverse_transform(data_t): Inverse transform samples from the warped space back to the original space
rvs([n_samples, random_state]): Draw random samples
transform(data): Transform samples from the original space into a warped space
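As a rough illustration of the two prior options and the "normalize" transform described above, the following standalone sketch uses only the Python standard library. It does not call HyperparameterHunter: the helper names sample_real, normalize, and denormalize are hypothetical, and the library's own implementation may well differ.

```python
import math
import random


def sample_real(low, high, prior="uniform", rng=None):
    """Draw one value from [low, high] under the named prior (hypothetical helper)."""
    rng = rng or random.Random()
    if prior == "uniform":
        return rng.uniform(low, high)
    if prior == "log-uniform":
        # Sample the exponent uniformly between log10(low) and log10(high)
        exponent = rng.uniform(math.log10(low), math.log10(high))
        return 10 ** exponent
    raise ValueError(f"Unknown prior: {prior!r}")


def normalize(value, low, high):
    """Scale a value from [low, high] onto [0, 1]."""
    return (value - low) / (high - low)


def denormalize(value_t, low, high):
    """Map a value in [0, 1] back onto [low, high]."""
    return value_t * (high - low) + low


rng = random.Random(32)
uniform_draws = [sample_real(1e-4, 1.0, "uniform", rng) for _ in range(1000)]
log_draws = [sample_real(1e-4, 1.0, "log-uniform", rng) for _ in range(1000)]

# Share of draws below 1e-2, the midpoint of the range on a log10 scale.
# A log-uniform prior puts about half of its draws there; a uniform prior puts about 1%.
share_small_uniform = sum(d < 1e-2 for d in uniform_draws) / len(uniform_draws)
share_small_log = sum(d < 1e-2 for d in log_draws) / len(log_draws)
```

A log-uniform prior is the usual choice for values such as learning rates, where candidate values span several orders of magnitude.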
class hyperparameter_hunter.space.dimensions.Integer(low, high, transform='identity', name=None)
Search space dimension that can assume any integer value in a given range
- Parameters
- low: Int
Lower bound (inclusive)
- high: Int
Upper bound (inclusive)
- transform: {"identity", "normalize"}, default="identity"
Transformation to apply to the original space. If "identity", the transformed space is the same as the original space. If "normalize", the transformed space is scaled between 0 and 1
- name: String, tuple, or None, default=None
A name associated with the dimension
- Attributes
- distribution: rv_generic
See documentation of _make_distribution() or distribution()
- transform_: String
Original value passed through the transform kwarg. The trailing underscore is used because transform() already exists as a method
- transformer: Transformer
See documentation of _make_transformer() or transformer()
Methods
distance(a, b): Calculate distance between two points in the dimension's bounds
Get dict of parameters used to initialize the Integer, or their defaults
inverse_transform(data_t): Inverse transform samples from the warped space back to the original space
rvs([n_samples, random_state]): Draw random samples
transform(data): Transform samples from the original space into a warped space
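For an integer dimension, a "normalize"-style transform has to map back onto whole numbers when it is inverted. The sketch below shows one plausible way to do that, by rounding. The function names are hypothetical and the rounding rule is an assumption, not a statement of how HyperparameterHunter implements it.

```python
def normalize_int(value, low, high):
    """Scale an integer in [low, high] onto the interval [0, 1]."""
    return (value - low) / (high - low)


def denormalize_int(value_t, low, high):
    """Map a point in [0, 1] back to the nearest integer in [low, high] (assumed rounding rule)."""
    return int(round(value_t * (high - low) + low))


low, high = 2, 10
warped = [normalize_int(v, low, high) for v in range(low, high + 1)]
restored = [denormalize_int(t, low, high) for t in warped]

# A warped point that does not correspond exactly to an integer snaps to the closest one
snapped = denormalize_int(0.49, low, high)
```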
class hyperparameter_hunter.space.dimensions.Categorical(categories: list, prior: Optional[list] = None, transform='onehot', optional=False, name=None)
Search space dimension that can assume any categorical value in a given list
- Parameters
- categories: List
Sequence of possible categories of shape (n_categories,)
- prior: List, or None, default=None
If list, prior probabilities for each category of shape (n_categories,). By default all categories are equally likely
- transform: {"onehot", "identity"}, default="onehot"
Transformation to apply to the original space. If "identity", the transformed space is the same as the original space. If "onehot", the transformed space is a one-hot encoded representation of the original space
- optional: Boolean, default=False
Intended for use by FeatureEngineer when optimizing an EngineerStep. Specifically, this enables searching through a space in which an EngineerStep either may or may not be used. This is contrary to Categorical's usual function of creating a space comprising multiple categories. When optional = True, the space created will represent any of the values in categories either being included in the entire FeatureEngineer process, or being skipped entirely. Internally, a value excluded by optional is represented by a sentinel value that signals it should be removed from the containing list, so optional will not work for choosing between a single value and None, for example
- name: String, tuple, or None, default=None
A name associated with the dimension
- Attributes
- categories: Tuple
Original value passed through the categories kwarg, cast to a tuple. If optional is True, then an instance of RejectedOptional will be appended to categories
- distribution: rv_generic
See documentation of _make_distribution() or distribution()
- optional: Boolean
Original value passed through the optional kwarg
- prior: List, or None
Original value passed through the prior kwarg
- prior_actual: List
Calculated prior value, initially equivalent to prior, but then set to a default array if None
- transform_: String
Original value passed through the transform kwarg. The trailing underscore is used because transform() already exists as a method
- transformer: Transformer
See documentation of _make_transformer() or transformer()
Methods
distance(a, b): Calculate distance between two points in the dimension's bounds
Get dict of parameters used to initialize the Categorical, or their defaults
inverse_transform(data_t): Inverse transform samples from the warped space back to the original space
rvs([n_samples, random_state]): Draw random samples
transform(data): Transform samples from the original space into a warped space
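To make the "onehot" transform and the prior parameter more concrete, here is a minimal standard-library sketch. The helpers onehot, inverse_onehot, and sample_category are hypothetical names used only for illustration; they are not part of HyperparameterHunter.

```python
import random


def onehot(value, categories):
    """Encode a single category as a one-hot list aligned with `categories`."""
    return [1 if value == category else 0 for category in categories]


def inverse_onehot(encoded, categories):
    """Recover the category whose position holds the largest entry."""
    return categories[encoded.index(max(encoded))]


def sample_category(categories, prior=None, rng=None):
    """Draw one category, using equal probabilities when no prior is given."""
    rng = rng or random.Random()
    if prior is None:
        prior = [1 / len(categories)] * len(categories)
    return rng.choices(categories, weights=prior, k=1)[0]


categories = ("gbtree", "gblinear", "dart")
encoded = onehot("gblinear", categories)
decoded = inverse_onehot(encoded, categories)

# A prior that places all probability on one category always returns that category
rng = random.Random(0)
forced = {sample_category(categories, prior=[0, 0, 1], rng=rng) for _ in range(20)}
```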
Feature Engineering
class hyperparameter_hunter.feature_engineering.FeatureEngineer(steps=None, do_validate=False, **datasets: Dict[str, pandas.core.frame.DataFrame])
Class to organize feature engineering step callables steps (EngineerStep instances) and the datasets that the steps request and return
- Parameters
- steps: List, or None, default=None
List of arbitrary length, containing any of the following values:
1. EngineerStep instance,
2. Function to provide as input to EngineerStep, or
3. Categorical, with categories comprising a selection of the previous two steps values (optimization only)
The third value can only be used during optimization. The feature_engineer provided to CVExperiment, for example, may only contain the first two values. To search a space optionally including an EngineerStep, use the optional kwarg of Categorical.
See EngineerStep for information on properly formatted EngineerStep functions. Additional engineering steps may be added via add_step()
- do_validate: Boolean, or "strict", default=False
Experimental. Whether to validate the datasets resulting from feature engineering steps. If True, hashes of the new datasets will be compared to those of the originals to ensure they were actually modified. Results will be logged. If do_validate = "strict", an exception will be raised if any anomalies are found, rather than logging a message. If do_validate = False, no validation will be performed
- **datasets: DFDict
This is not expected to be provided on initialization and is offered primarily for debugging/testing. Mapping of datasets necessary to perform feature engineering steps
See also
EngineerStep
For proper formatting of non-Categorical values of steps
Notes
If steps does include any instances of hyperparameter_hunter.space.dimensions.Categorical, this FeatureEngineer instance will not be usable by Experiments. It can only be used by Optimization Protocols. Furthermore, the FeatureEngineer that the Optimization Protocol actually ends up using will not pass identity checks against the original FeatureEngineer that contained Categorical steps
Examples
>>> import numpy as np
>>> from sklearn.preprocessing import StandardScaler, MinMaxScaler, QuantileTransformer
>>> from hyperparameter_hunter import Categorical, EngineerStep, FeatureEngineer
>>> # Define some engineer step functions to play with
>>> def s_scale(train_inputs, non_train_inputs):
...     s = StandardScaler()
...     train_inputs[train_inputs.columns] = s.fit_transform(train_inputs.values)
...     non_train_inputs[train_inputs.columns] = s.transform(non_train_inputs.values)
...     return train_inputs, non_train_inputs
>>> def mm_scale(train_inputs, non_train_inputs):
...     s = MinMaxScaler()
...     train_inputs[train_inputs.columns] = s.fit_transform(train_inputs.values)
...     non_train_inputs[train_inputs.columns] = s.transform(non_train_inputs.values)
...     return train_inputs, non_train_inputs
>>> def q_transform(train_targets, non_train_targets):
...     t = QuantileTransformer(output_distribution="normal")
...     train_targets[train_targets.columns] = t.fit_transform(train_targets.values)
...     non_train_targets[train_targets.columns] = t.transform(non_train_targets.values)
...     return train_targets, non_train_targets, t
>>> def sqr_sum(all_inputs):
...     all_inputs["square_sum"] = all_inputs.agg(
...         lambda row: np.sqrt(np.sum([np.square(_) for _ in row])), axis="columns"
...     )
...     return all_inputs
FeatureEngineer steps wrapped by `EngineerStep` are equal to raw function steps, as long as the `EngineerStep` is using the default parameters
>>> fe_0 = FeatureEngineer([sqr_sum, s_scale])
>>> fe_1 = FeatureEngineer([EngineerStep(sqr_sum), EngineerStep(s_scale)])
>>> fe_0.steps == fe_1.steps
True
>>> fe_2 = FeatureEngineer([sqr_sum, EngineerStep(s_scale), q_transform])
`Categorical` can be used during optimization and placed anywhere in `steps`. `Categorical` can also handle either `EngineerStep` categories or raw functions. Use the `optional` kwarg of `Categorical` to test some questionable steps
>>> fe_3 = FeatureEngineer([sqr_sum, Categorical([s_scale, mm_scale]), q_transform])
>>> fe_4 = FeatureEngineer([Categorical([sqr_sum], optional=True), s_scale, q_transform])
>>> fe_5 = FeatureEngineer([
...     Categorical([sqr_sum], optional=True),
...     Categorical([EngineerStep(s_scale), mm_scale]),
...     q_transform
... ])
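It can help to picture which concrete step lists a mixed list of fixed and Categorical steps describes. The sketch below enumerates them with the standard library. Choice, _SKIPPED, and concrete_pipelines are hypothetical stand-ins written for this illustration (the strings stand in for step functions), and the sentinel handling only mirrors the behavior described for optional in broad terms; it is not the library's code.

```python
from dataclasses import dataclass
from itertools import product

_SKIPPED = object()  # Hypothetical sentinel marking an omitted optional step


@dataclass(frozen=True)
class Choice:
    """Hypothetical stand-in for a categorical step with an `optional` flag."""

    categories: tuple
    optional: bool = False

    def options(self):
        # An optional choice gains one extra option: leaving the step out entirely
        return self.categories + ((_SKIPPED,) if self.optional else ())


def concrete_pipelines(steps):
    """Yield each concrete list of steps that the search space describes."""
    per_slot = [s.options() if isinstance(s, Choice) else (s,) for s in steps]
    for combo in product(*per_slot):
        # Drop the sentinel so that skipped steps vanish from the resulting list
        yield [step for step in combo if step is not _SKIPPED]


# Same shape as `fe_5`: an optional step, a choice between two scalers, and a fixed step
steps = [
    Choice(("sqr_sum",), optional=True),
    Choice(("s_scale", "mm_scale")),
    "q_transform",
]
pipelines = list(concrete_pipelines(steps))
```

Under these assumptions, the three-element steps list expands to four candidate pipelines, two of which omit the optional step.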
class hyperparameter_hunter.feature_engineering.EngineerStep(f: Callable, stage=None, name=None, params=None, do_validate=False)
Container for individual FeatureEngineer step functions
Compartmentalizes functions of singular engineer steps and allows for greater customization than a raw engineer step function
- Parameters
- f: Callable
Feature engineering step function that requests, modifies, and returns the datasets named in params
Step functions should follow these guidelines:
1. Request as input a subset of the 11 data strings listed in params
2. Do whatever you want to the DataFrames given as input
3. Return new DataFrame values of the input parameters in the same order as requested
If performing a task like target transformation, causing predictions to be transformed, it is often desirable to inverse-transform the predictions to be of the expected form. This can easily be done by returning an extra value from f (after the datasets) that is either a callable, or a transformer class that was fitted during the execution of f and implements an inverse_transform method. This is the only instance in which it is acceptable for f to return values that don't mimic its input parameters. See the engineer function definition using SKLearn's QuantileTransformer in the Examples section below for an actual inverse-transformation-compatible implementation
- stage: String in {"pre_cv", "intra_cv"}, or None, default=None
Feature engineering stage during which the callable f will be given the datasets params to modify and return. If None, will be inferred based on params.
"pre_cv" functions are applied only once in the experiment: when it starts
"intra_cv" functions are reapplied for each fold in the cross-validation splits
If stage is left to be inferred, "pre_cv" will usually be selected. However, if any params (or parameters in the signature of f) are prefixed with "validation…" or "non_train…", then stage will be inferred as "intra_cv". See the Notes section below for suggestions on the stage to use for different functions
- name: String, or None, default=None
Identifier for the transformation applied by this engineering step. If None, f.__name__ will be used
- params: Tuple[str], or None, default=None
Dataset names requested by feature engineering step callable f. If None, will be inferred by parsing the signature of f. Must be a subset of the following 11 strings:
Input Data
"train_inputs"
"validation_inputs"
"holdout_inputs"
"test_inputs"
"all_inputs" ("train_inputs" + ["validation_inputs"] + "holdout_inputs" + "test_inputs")
"non_train_inputs" (["validation_inputs"] + "holdout_inputs" + "test_inputs")
Target Data
"train_targets"
"validation_targets"
"holdout_targets"
"all_targets" ("train_targets" + ["validation_targets"] + "holdout_targets")
"non_train_targets" (["validation_targets"] + "holdout_targets")
As an alternative to the above list, just remember that the first half of all parameter names should be one of {"train", "validation", "holdout", "test", "all", "non_train"}, and the second half should be either "inputs" or "targets". The only exception to this rule is "test_targets", which doesn't exist.
Inference of "validation" params is affected by stage. During the "pre_cv" stage, the validation dataset has not yet been created and is still a part of the train dataset. During the "intra_cv" stage, the validation dataset is created by removing a portion of the train dataset, and the values passed to f reflect this fact. This also means that the values of the merged ("all"/"non_train"-prefixed) datasets may or may not contain "validation" data depending on the stage; however, this is all handled internally, so you probably don't need to worry about it.
params may not include multiple references to the same dataset, either directly or indirectly. This means ("train_inputs", "train_inputs") is invalid due to duplicate direct references. Less obviously, ("train_inputs", "all_inputs") is invalid because "all_inputs" includes "train_inputs"
- do_validate: Boolean, or "strict", default=False
Experimental. Whether to validate the datasets resulting from feature engineering steps. If True, hashes of the new datasets will be compared to those of the originals to ensure they were actually modified. Results will be logged. If do_validate = "strict", an exception will be raised if any anomalies are found, rather than logging a message. If do_validate = False, no validation will be performed
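The rules for params given above (11 valid names, no dataset requested twice, directly or through a merged name) can be sketched in a few lines of plain Python. This is an illustrative approximation only: check_params and the MERGED table are hypothetical, and the sketch ignores stage by treating the bracketed "validation" members as always included.

```python
VALID_PARAMS = (
    "train_inputs", "validation_inputs", "holdout_inputs", "test_inputs",
    "all_inputs", "non_train_inputs",
    "train_targets", "validation_targets", "holdout_targets",
    "all_targets", "non_train_targets",
)

# Base datasets covered by each merged name (validation assumed present, regardless of stage)
MERGED = {
    "all_inputs": ("train_inputs", "validation_inputs", "holdout_inputs", "test_inputs"),
    "non_train_inputs": ("validation_inputs", "holdout_inputs", "test_inputs"),
    "all_targets": ("train_targets", "validation_targets", "holdout_targets"),
    "non_train_targets": ("validation_targets", "holdout_targets"),
}


def check_params(params):
    """Return `params` as a tuple, or raise ValueError on invalid or overlapping names."""
    requested_by = {}
    for name in params:
        if name not in VALID_PARAMS:
            raise ValueError(f"Invalid dataset name: {name!r}")
        # Expand merged names so that indirect duplicates are caught as well
        for base in MERGED.get(name, (name,)):
            if base in requested_by:
                raise ValueError(
                    f"{base!r} is requested by both {requested_by[base]!r} and {name!r}"
                )
            requested_by[base] = name
    return tuple(params)


ok = check_params(("train_inputs", "non_train_inputs"))
```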
See also
FeatureEngineer
The container for EngineerStep instances. EngineerSteps should always be provided to HyperparameterHunter through a FeatureEngineer
Categorical
Can be used during optimization to search through a group of EngineerSteps given as categories. The optional kwarg of Categorical designates a FeatureEngineer step that may be one of the EngineerSteps in categories, or may be omitted entirely
get_engineering_step_stage()
More information on stage inference and situations where overriding it may be prudent
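The stage inference rule stated for the stage parameter (default to "pre_cv" unless a requested dataset name starts with "validation" or "non_train") can be approximated as follows. The function infer_stage is a hypothetical illustration written for this page, not the library's get_engineering_step_stage().

```python
import inspect


def infer_stage(f, params=None):
    """Guess the stage for step function `f` from its requested dataset names."""
    # Fall back to the parameter names in the signature of `f` when `params` is not given
    names = params if params is not None else tuple(inspect.signature(f).parameters)
    if any(name.startswith(("validation", "non_train")) for name in names):
        return "intra_cv"
    return "pre_cv"


def s_scale(train_inputs, non_train_inputs):
    return train_inputs, non_train_inputs


def sqr_sum(all_inputs):
    return all_inputs


stage_scale = infer_stage(s_scale)  # Requests "non_train_inputs"
stage_sum = infer_stage(sqr_sum)  # Requests only "all_inputs"
```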
Notes
stage: Generally, feature engineering conducted in the "pre_cv" stage should treat each sample/row as an independent entity. For example, steps like converting a string day of the week to one-hot encoded columns, or imputing missing values by replacement with -1, might be conducted "pre_cv", since they are unlikely to introduce information leakage. Conversely, steps like scaling/normalization, whose results for the data in one row are affected by the data in other rows, should be performed "intra_cv" in order to recalculate the final values of the datasets for each cross-validation split and avoid information leakage.
params: In the list of the 11 valid params strings, "test_inputs" is notably missing the "…_targets" counterpart accompanying the other datasets. The "targets" suffix is missing because test data targets are never given. Note that although "test_inputs" is still included in both "all_inputs" and "non_train_inputs", its lack of a target column means that "all_targets" and "non_train_targets" may have different lengths than their "inputs"-suffixed counterparts
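The point of the stage note is that row-dependent statistics must be refit on the training portion of every split. The following standard-library sketch shows that pattern on a plain list of numbers. The helper names and the simple striding fold scheme are assumptions made for illustration and are unrelated to how HyperparameterHunter builds its folds.

```python
from statistics import mean, pstdev


def fit_standardizer(train_values):
    """Learn mean and standard deviation from training values only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # Guard against a zero spread
    return lambda values: [(v - mu) / sigma for v in values]


def scale_fold(train_values, validation_values):
    """Scale both parts of one split with statistics fitted on the training part."""
    scale = fit_standardizer(train_values)
    return scale(train_values), scale(validation_values)


values = [1.0, 2.0, 3.0, 4.0, 100.0, 6.0]
n_folds = 3
fold_results = []
for k in range(n_folds):
    # Every n_folds-th value (offset by k) is held out; the rest form the training part
    validation = values[k::n_folds]
    train = [v for i, v in enumerate(values) if i % n_folds != k]
    fold_results.append(scale_fold(train, validation))
```

Because the scaler is refit inside the loop, the outlier 100.0 influences the statistics only in folds where it belongs to the training part.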
Examples
>>> import numpy as np
>>> from sklearn.preprocessing import StandardScaler, QuantileTransformer
>>> from hyperparameter_hunter import EngineerStep
>>> def s_scale(train_inputs, non_train_inputs):
...     s = StandardScaler()
...     train_inputs[train_inputs.columns] = s.fit_transform(train_inputs.values)
...     non_train_inputs[train_inputs.columns] = s.transform(non_train_inputs.values)
...     return train_inputs, non_train_inputs
>>> # Sensible parameter defaults inferred based on `f`
>>> es_0 = EngineerStep(s_scale)
>>> es_0.stage
'intra_cv'
>>> es_0.name
's_scale'
>>> es_0.params
('train_inputs', 'non_train_inputs')
>>> # Override `stage` if you want to fit your scaler on OOF data like a crazy person
>>> es_1 = EngineerStep(s_scale, stage="pre_cv")
>>> es_1.stage
'pre_cv'
Watch out for multiple requests to the same data
>>> es_2 = EngineerStep(s_scale, params=("train_inputs", "all_inputs"))
Traceback (most recent call last):
    File "feature_engineering.py", line ? in validate_dataset_names
ValueError: Requested params include duplicate references to `train_inputs` by way of:
   - ('all_inputs', 'train_inputs')
   - ('train_inputs',)
Each dataset may only be requested by a single param for each function
The error is the same if `(train_inputs, all_inputs)` is in the actual function signature
EngineerStep functions aren't limited to transformations. Make your own features!
>>> def sqr_sum(all_inputs):
...     all_inputs["square_sum"] = all_inputs.agg(
...         lambda row: np.sqrt(np.sum([np.square(_) for _ in row])), axis="columns"
...     )
...     return all_inputs
>>> es_3 = EngineerStep(sqr_sum)
>>> es_3.stage
'pre_cv'
>>> es_3.name
'sqr_sum'
>>> es_3.params
('all_inputs',)
Inverse-transformation Implementation:
>>> def q_transform(train_targets, non_train_targets):
...     t = QuantileTransformer(output_distribution="normal")
...     train_targets[train_targets.columns] = t.fit_transform(train_targets.values)
...     non_train_targets[train_targets.columns] = t.transform(non_train_targets.values)
...     return train_targets, non_train_targets, t
>>> # Note that `train_targets` and `non_train_targets` must still be returned in order,
>>> # but they are followed by `t`, an instance of `QuantileTransformer` we just fitted,
>>> # whose `inverse_transform` method will be called on predictions
>>> es_4 = EngineerStep(q_transform)
>>> es_4.stage
'intra_cv'
>>> es_4.name
'q_transform'
>>> es_4.params
('train_targets', 'non_train_targets')
>>> # `params` does not include any returned transformers - Only data requested as input
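The description of f also allows the extra return value to be a bare callable rather than a fitted transformer. The sketch below illustrates that variant on plain lists instead of DataFrames, using a log1p/expm1 pair. The step function log_targets and the dispatch helper invert_predictions are hypothetical and only approximate how such an extra value might be applied to predictions.

```python
import math


def log_targets(train_targets, non_train_targets):
    """Step-style function: log1p-transform targets and return an inverse callable last."""
    transformed_train = [math.log1p(t) for t in train_targets]
    transformed_non_train = [math.log1p(t) for t in non_train_targets]

    def inverse(predictions):
        # Undo log1p so that predictions return to the original target scale
        return [math.expm1(p) for p in predictions]

    return transformed_train, transformed_non_train, inverse


def invert_predictions(extra, predictions):
    """Apply a bare callable or an object exposing `inverse_transform` (hypothetical helper)."""
    if hasattr(extra, "inverse_transform"):
        return extra.inverse_transform(predictions)
    return extra(predictions)


train_t, non_train_t, inverter = log_targets([0.0, 9.0, 99.0], [999.0])
# Pretend the model predicted the transformed targets exactly, then map them back
restored = invert_predictions(inverter, non_train_t)
```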
-
__init__
(f: Callable, stage=None, name=None, params=None, do_validate=False) Container for individual
FeatureEngineer
step functionsCompartmentalizes functions of singular engineer steps and allows for greater customization than a raw engineer step function
- Parameters
- f: Callable
Feature engineering step function that requests, modifies, and returns datasets params
Step functions should follow these guidelines:
Request as input a subset of the 11 data strings listed in params
Do whatever you want to the DataFrames given as input
Return new DataFrame values of the input parameters in same order as requested
If performing a task like target transformation, causing predictions to be transformed, it is often desirable to inverse-transform the predictions to be of the expected form. This can easily be done by returning an extra value from f (after the datasets) that is either a callable, or a transformer class that was fitted during the execution of f and implements an inverse_transform method. This is the only instance in which it is acceptable for f to return values that don’t mimic its input parameters. See the engineer function definition using SKLearn’s QuantileTransformer in the Examples section below for an actual inverse-transformation-compatible implementation
- stage: String in {“pre_cv”, “intra_cv”}, or None, default=None
Feature engineering stage during which the callable f will be given the datasets params to modify and return. If None, will be inferred based on params.
“pre_cv” functions are applied only once in the experiment: when it starts
“intra_cv” functions are reapplied for each fold in the cross-validation splits
If stage is left to be inferred, “pre_cv” will usually be selected. However, if any params (or parameters in the signature of f) are prefixed with “validation…” or “non_train…”, then stage will inferred as “intra_cv”. See the Notes section below for suggestions on the stage to use for different functions
- name: String, or None, default=None
Identifier for the transformation applied by this engineering step. If None, f.__name__ will be used
- params: Tuple[str], or None, default=None
Dataset names requested by feature engineering step callable f. If None, will be inferred by parsing the signature of f. Must be a subset of the following 11 strings:
Input Data
“train_inputs”
“validation_inputs”
“holdout_inputs”
“test_inputs”
- “all_inputs”
("train_inputs" + ["validation_inputs"] + "holdout_inputs" + "test_inputs")
- “non_train_inputs”
(["validation_inputs"] + "holdout_inputs" + "test_inputs")
Target Data
“train_targets”
“validation_targets”
“holdout_targets”
“all_targets”
("train_targets" + ["validation_targets"] + "holdout_targets")
“non_train_targets”
(["validation_targets"] + "holdout_targets")
As an alternative to the above list, just remember that the first half of all parameter names should be one of {“train”, “validation”, “holdout”, “test”, “all”, “non_train”}, and the second half should be either “inputs” or “targets”. The only exception to this rule is “test_targets”, which doesn’t exist.
Inference of “validation” params is affected by stage. During the “pre_cv” stage, the validation dataset has not yet been created and is still a part of the train dataset. During the “intra_cv” stage, the validation dataset is created by removing a portion of the train dataset, and their values passed to f reflect this fact. This also means that the values of the merged (“all”/”non_train”-prefixed) datasets may or may not contain “validation” data depending on the stage; however, this is all handled internally, so you probably don’t need to worry about it.
params may not include multiple references to the same dataset, either directly or indirectly. This means (“train_inputs”, “train_inputs”) is invalid due to duplicate direct references. Less obviously, (“train_inputs”, “all_inputs”) is invalid because “all_inputs” includes “train_inputs”
- do_validate: Boolean, or “strict”, default=False
… Experimental… Whether to validate the datasets resulting from feature engineering steps. If True, hashes of the new datasets will be compared to those of the originals to ensure they were actually modified. Results will be logged. If do_validate = “strict”, an exception will be raised if any anomalies are found, rather than logging a message. If do_validate = False, no validation will be performed
See also
FeatureEngineer
The container for EngineerStep instances - EngineerStep`s should always be provided to HyperparameterHunter through a `FeatureEngineer
Categorical
Can be used during optimization to search through a group of EngineerStep`s given as `categories. The optional kwarg of Categorical designates a FeatureEngineer step that may be one of the EngineerStep`s in `categories, or may be omitted entirely
get_engineering_step_stage()
More information on stage inference and situations where overriding it may be prudent
Notes
stage: Generally, feature engineering conducted in the “pre_cv” stage should regard each sample/row as independent entities. For example, steps like converting a string day of the week to one-hot encoded columns, or imputing missing values by replacement with -1 might be conducted “pre_cv”, since they are unlikely to introduce an information leakage. Conversely, steps like scaling/normalization, whose results for the data in one row are affected by the data in other rows should be performed “intra_cv” in order to recalculate the final values of the datasets for each cross validation split and avoid information leakage.
params: In the list of the 11 valid params strings, “test_inputs” is notably missing the “…_targets” counterpart accompanying the other datasets. The “targets” suffix is missing because test data targets are never given. Note that although “test_inputs” is still included in both “all_inputs” and “non_train_inputs”, its lack of a target column means that “all_targets” and “non_train_targets” may have different lengths than their “inputs”-suffixed counterparts
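The rule against duplicate dataset references can be sketched as follows. This is an illustrative stand-in, not the library's `validate_dataset_names`: the `MERGED_PREFIXES` table (including the assumption that the merged datasets also cover a “holdout” dataset) and the helpers `expand` and `find_duplicates` are hypothetical.

```python
# Assumed expansion table: which base datasets each merged prefix covers
MERGED_PREFIXES = {
    "all": ("train", "validation", "holdout", "test"),
    "non_train": ("validation", "holdout", "test"),
}


def expand(param):
    """Return the set of base dataset names that `param` refers to."""
    # e.g. "non_train_inputs" -> ("non_train", "inputs")
    prefix, suffix = param.rsplit("_", 1)
    bases = MERGED_PREFIXES.get(prefix, (prefix,))
    # Test targets are never given, so "test" is skipped for "targets" params
    return {f"{b}_{suffix}" for b in bases if not (b == "test" and suffix == "targets")}


def find_duplicates(params):
    """Return base dataset names requested by more than one entry of `params`."""
    seen, duplicates = set(), set()
    for param in params:
        for base in expand(param):
            (duplicates if base in seen else seen).add(base)
    return duplicates


direct = find_duplicates(("train_inputs", "train_inputs"))
indirect = find_duplicates(("train_inputs", "all_inputs"))
valid = find_duplicates(("train_inputs", "non_train_inputs"))
```

Here `direct` and `indirect` both contain "train_inputs", while `valid` is empty because “non_train_inputs” does not overlap with “train_inputs”.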
Examples
>>> from sklearn.preprocessing import StandardScaler, QuantileTransformer
>>> def s_scale(train_inputs, non_train_inputs):
...     s = StandardScaler()
...     train_inputs[train_inputs.columns] = s.fit_transform(train_inputs.values)
...     non_train_inputs[train_inputs.columns] = s.transform(non_train_inputs.values)
...     return train_inputs, non_train_inputs
>>> # Sensible parameter defaults inferred based on `f`
>>> es_0 = EngineerStep(s_scale)
>>> es_0.stage
'intra_cv'
>>> es_0.name
's_scale'
>>> es_0.params
('train_inputs', 'non_train_inputs')
>>> # Override `stage` if you want to fit your scaler on OOF data like a crazy person
>>> es_1 = EngineerStep(s_scale, stage="pre_cv")
>>> es_1.stage
'pre_cv'
Watch out for multiple requests to the same data
>>> es_2 = EngineerStep(s_scale, params=("train_inputs", "all_inputs"))
Traceback (most recent call last):
    File "feature_engineering.py", line ? in validate_dataset_names
ValueError: Requested params include duplicate references to `train_inputs` by way of:
   - ('all_inputs', 'train_inputs')
   - ('train_inputs',)
Each dataset may only be requested by a single param for each function
Error is the same if `(train_inputs, all_inputs)` is in the actual function signature
EngineerStep functions aren’t just limited to transformations. Make your own features!
>>> import numpy as np
>>> def sqr_sum(all_inputs):
...     all_inputs["square_sum"] = all_inputs.agg(
...         lambda row: np.sqrt(np.sum([np.square(_) for _ in row])), axis="columns"
...     )
...     return all_inputs
>>> es_3 = EngineerStep(sqr_sum)
>>> es_3.stage
'pre_cv'
>>> es_3.name
'sqr_sum'
>>> es_3.params
('all_inputs',)
Inverse-transformation Implementation:
>>> def q_transform(train_targets, non_train_targets):
...     t = QuantileTransformer(output_distribution="normal")
...     train_targets[train_targets.columns] = t.fit_transform(train_targets.values)
...     non_train_targets[train_targets.columns] = t.transform(non_train_targets.values)
...     return train_targets, non_train_targets, t
>>> # Note that `train_targets` and `non_train_targets` must still be returned in order,
>>> # but they are followed by `t`, an instance of `QuantileTransformer` we just fitted,
>>> # whose `inverse_transform` method will be called on predictions
>>> es_4 = EngineerStep(q_transform)
>>> es_4.stage
'intra_cv'
>>> es_4.name
'q_transform'
>>> es_4.params
('train_targets', 'non_train_targets')
>>> # `params` does not include any returned transformers - Only data requested as input
Extras¶
-
hyperparameter_hunter.callbacks.bases.
lambda_callback
(on_exp_start=None, on_exp_end=None, on_rep_start=None, on_rep_end=None, on_fold_start=None, on_fold_end=None, on_run_start=None, on_run_end=None, agg_name=None, do_reshape_aggs=True, method_agg_keys=False, on_experiment_start=<object object>, on_experiment_end=<object object>, on_repetition_start=<object object>, on_repetition_end=<object object>) Utility for creating custom callbacks to be declared by
Environment
and used by Experiments. The callable “on_<…>_<start/end>” parameters provided will receive as input whichever attributes of the Experiment are included in the signature of the given callable. If **kwargs is given in the callable’s signature, a dict of all of the Experiment’s attributes will be provided. This can be helpful for trying to figure out how to build a custom callback, but should not be used unless absolutely necessary. If the Experiment does not have an attribute specified in the callable’s signature, the following placeholder will be given: “INVALID KWARG”
- Parameters
- on_exp_start: Callable, or None, default=None
Callable that receives Experiment’s values for parameters in the signature at Experiment start
- on_exp_end: Callable, or None, default=None
Callable that receives Experiment’s values for parameters in the signature at Experiment end
- on_rep_start: Callable, or None, default=None
Callable that receives Experiment’s values for parameters in the signature at repetition start
- on_rep_end: Callable, or None, default=None
Callable that receives Experiment’s values for parameters in the signature at repetition end
- on_fold_start: Callable, or None, default=None
Callable that receives Experiment’s values for parameters in the signature at fold start
- on_fold_end: Callable, or None, default=None
Callable that receives Experiment’s values for parameters in the signature at fold end
- on_run_start: Callable, or None, default=None
Callable that receives Experiment’s values for parameters in the signature at run start
- on_run_end: Callable, or None, default=None
Callable that receives Experiment’s values for parameters in the signature at run end
- agg_name: Str, default=uuid.uuid4
This parameter is only used if the callables are behaving like AggregatorCallbacks by returning values (see the “Notes” section below for details on this). If the callables do return values, they will be stored under a key named (“_” + agg_name) in a dict in
hyperparameter_hunter.experiments.BaseExperiment.stat_aggregates
. The purpose of this parameter is to make it easier to understand an Experiment’s description file, as agg_name will default to a UUID if it is not given
- do_reshape_aggs: Boolean, default=True
Whether to reshape the aggregated values to reflect the nested repetitions/folds/runs structure used for other aggregated values. If False, lists of aggregated values are left in their original shapes. This parameter is only used if the callables are behaving like AggregatorCallbacks (see the “Notes” section below and agg_name for details on this)
- method_agg_keys: Boolean, default=False
If True, the aggregate keys for the items added to the dict at agg_name are equivalent to the names of the “on_<…>_<start/end>” pseudo-methods whose values are being aggregated. In other words, the pool of all possible aggregate keys goes from [“runs”, “folds”, “reps”, “final”] to the names of the eight “on_<…>_<start/end>” kwargs of
lambda_callback()
. See the “Notes” section below for further details and a rough outline
- on_experiment_start: …
Deprecated since version 3.0.0: Renamed to on_exp_start. Will be removed in 3.2.0
- on_experiment_end: …
Deprecated since version 3.0.0: Renamed to on_exp_end. Will be removed in 3.2.0
- on_repetition_start: …
Deprecated since version 3.0.0: Renamed to on_rep_start. Will be removed in 3.2.0
- on_repetition_end: …
Deprecated since version 3.0.0: Renamed to on_rep_end. Will be removed in 3.2.0
- Returns
- LambdaCallback:
LambdaCallback
Uninitialized class, whose methods are the callables of the corresponding “on…” kwarg
Notes
For all of the “on_<…>_<start/end>” callables provided as input to lambda_callback, consider the following guidelines (for example function “f”, which can represent any of the callables):
All input parameters in the signature of “f” are attributes of the Experiment being executed
If “**kwargs” is a parameter, a dict of all the Experiment’s attributes will be provided
“f” will be treated as a method of a parent class of the Experiment
Take care when modifying attributes, as changes are reflected in the Experiment itself
If “f” returns something, it will automatically behave like an AggregatorCallback (see
hyperparameter_hunter.callbacks.aggregators
). Specifically, the following will occur:
A new key (named by agg_name if given, else a UUID) with a dict value is added to
hyperparameter_hunter.experiments.BaseExperiment.stat_aggregates
This new dict can have up to four keys: “runs” (list), “folds” (list), “reps” (list), and “final” (object)
If “f” is an “on_run…” function, the returned value is appended to the “runs” list in the new dict
Similarly, if “f” is an “on_fold…” or “on_rep…” function, the returned value is appended to the “folds”, or “reps” list, respectively
If “f” is an “on_exp…” function, the “final” key in the new dict is set to the returned value
If values were aggregated in the aforementioned manner, the lists of collected values will be reshaped according to runs/folds/reps on Experiment end
The aggregated values will be saved in the Experiment’s description file
This is because
hyperparameter_hunter.experiments.BaseExperiment.stat_aggregates
is saved in its entirety
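The first two guidelines (inputs chosen by signature, and “**kwargs” receiving everything) can be illustrated with a rough sketch built on `inspect.signature`. It is not how HyperparameterHunter is necessarily implemented: `FakeExperiment` and `call_with_attributes` are hypothetical, and only the “INVALID KWARG” placeholder is taken from the description above.

```python
import inspect

PLACEHOLDER = "INVALID KWARG"  # Placeholder named in the documentation above


class FakeExperiment:
    """Hypothetical stand-in for an Experiment; only attribute names matter."""

    def __init__(self):
        self._rep, self._fold, self._run = 0, 1, 2


def call_with_attributes(f, experiment):
    """Call `f` with the experiment attributes named in its signature."""
    parameters = inspect.signature(f).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in parameters.values()):
        # `**kwargs` in the signature: hand over every attribute
        return f(**vars(experiment))
    # Otherwise pass only the requested names, with a placeholder for unknown ones
    kwargs = {name: getattr(experiment, name, PLACEHOLDER) for name in parameters}
    return f(**kwargs)


def describe(_rep, _fold, missing_attr):
    return (_rep, _fold, missing_attr)


def grab_all(**kwargs):
    return sorted(kwargs)


named = call_with_attributes(describe, FakeExperiment())
everything = call_with_attributes(grab_all, FakeExperiment())
```

In this sketch `named` is `(0, 1, "INVALID KWARG")`, since `missing_attr` is not an attribute of the stand-in experiment, and `everything` lists all three attribute names.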
What follows is a rough outline of the structure produced when using an aggregator-like callback that automatically populates
experiments.BaseExperiment.stat_aggregates
with results of the functions used as arguments to lambda_callback():
BaseExperiment.stat_aggregates = dict(
    ...,
    <`agg_name`>=dict(
        <agg_key "runs"> = [...],
        <agg_key "folds"> = [...],
        <agg_key "reps"> = [...],
        <agg_key "final"> = object(),
        ...
    ),
    ...
)
In the above outline, the actual agg_keys included in the dict at agg_name depend on which “on_<…>_<start/end>” callables are behaving like aggregators. For example, if neither on_run_start nor on_run_end explicitly returns something, then the “runs” agg_key is not included in the agg_name dict. Similarly, if, for example, neither on_exp_start nor on_exp_end is provided, then the “final” agg_key is not included. If method_agg_keys=True, then the agg keys used in the dict are modified to be named after the method called. For example, if method_agg_keys=True and on_fold_start and on_fold_end are both callables returning values to be aggregated, then the agg_keys used for each will be “on_fold_start” and “on_fold_end”, respectively. In this example, if method_agg_keys=False (default) and do_reshape_aggs=False, then the single “folds” agg_key would contain the combined contents returned by both methods in the order in which they were returned
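The effect of method_agg_keys on the keys of the agg_name dict can be sketched with a small stand-alone function. `aggregate` is hypothetical and ignores reshaping (as if do_reshape_aggs=False); storing “on_exp…” values as a single object when method_agg_keys=True is an assumption carried over from how “final” is described.

```python
def aggregate(returned, method_agg_keys=False):
    """Group (method_name, value) pairs into an `agg_name`-style dict."""
    level_keys = {"run": "runs", "fold": "folds", "rep": "reps", "exp": "final"}
    result = {}
    for method_name, value in returned:
        level = method_name.split("_")[1]  # "on_fold_start" -> "fold"
        key = method_name if method_agg_keys else level_keys[level]
        if level == "exp":
            result[key] = value  # Single object rather than a list (assumed)
        else:
            result.setdefault(key, []).append(value)
    return result


# Values returned by two fold-level callables, in call order
calls = [
    ("on_fold_start", "a"),
    ("on_fold_end", "b"),
    ("on_fold_start", "c"),
    ("on_fold_end", "d"),
]

default_keys = aggregate(calls)
method_keys = aggregate(calls, method_agg_keys=True)
```

With the default, both callables feed one “folds” list in the order the values were returned; with method_agg_keys=True, each callable gets its own key.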
For examples using lambda_callback to create custom callbacks, see
hyperparameter_hunter.callbacks.recipes
Examples
>>> from hyperparameter_hunter.environment import Environment
>>> from hyperparameter_hunter.callbacks.bases import lambda_callback
>>> def printer_helper(_rep, _fold, _run, last_evaluation_results):
...     print(f"{_rep}.{_fold}.{_run} {last_evaluation_results}")
>>> my_lambda_callback = lambda_callback(
...     on_exp_end=printer_helper,
...     on_rep_end=printer_helper,
...     on_fold_end=printer_helper,
...     on_run_end=printer_helper,
... )
... # env = Environment(
... #     train_dataset="i am a dataset",
... #     results_path="path/to/HyperparameterHunterAssets",
... #     metrics=["roc_auc_score"],
... #     experiment_callbacks=[my_lambda_callback]
... # )
... # ... Now execute an Experiment, or an Optimization Protocol...
See
hyperparameter_hunter.examples.lambda_callback_example
for more information