Reproducibility
Reproducibility is a key aspect of scientific research, and of machine learning in particular. MLOptimizer provides an input parameter, seed, that sets the random seed for:
The random number generator of the optimizer, which generates the initial population and the mutations
The random number generator of the model during training
The random number generator used to split the data
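To picture how a single seed can fix several random steps at once, here is a minimal toy sketch. It is not MLOptimizer's implementation: the function seeded_run, the integer "depth" candidates and the 70/30 split are all hypothetical, and the model's own training randomness is left out for brevity. It only uses Python's standard random module.

```python
import random


def seeded_run(seed, n_samples=10, population_size=4):
    """Toy illustration (hypothetical): one seed drives every random step."""
    rng = random.Random(seed)  # private generator, independent of global state

    # 1. Initial population: random candidate values for a made-up hyperparameter.
    population = [rng.randint(1, 20) for _ in range(population_size)]

    # 2. Mutation: nudge each candidate by -1, 0 or +1, never going below 1.
    mutated = [max(1, depth + rng.choice([-1, 0, 1])) for depth in population]

    # 3. Data split: shuffle the sample indices and hold out the last 30%.
    indices = list(range(n_samples))
    rng.shuffle(indices)
    cut = int(n_samples * 0.7)
    train_idx, test_idx = indices[:cut], indices[cut:]

    return population, mutated, train_idx, test_idx


same_a = seeded_run(seed=25)
same_b = seeded_run(seed=25)
print(same_a == same_b)  # identical seeds reproduce the whole run
```

Because the sketch keeps its own random.Random instance, other code drawing random numbers cannot disturb it. A library that relies on shared generators does not have that protection, which is why the order of operations can matter.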
The following example uses the seed parameter:
from sklearn.datasets import load_breast_cancer as dataset
from sklearn.tree import DecisionTreeClassifier
from mloptimizer.core import Optimizer
from mloptimizer.hyperparams import HyperparameterSpace
# Load the data through the alias imported above (load_breast_cancer as dataset)
X, y = dataset(return_X_y=True)
default_hyperparam_space = HyperparameterSpace.get_default_hyperparameter_space(DecisionTreeClassifier)
population = 2
generations = 2
seed = 25
distinct_seed = 2
# It is important to run the optimization
# right after the creation of the optimizer
optimizer1 = Optimizer(estimator_class=DecisionTreeClassifier, features=X, labels=y,
                       hyperparam_space=default_hyperparam_space, seed=seed)
result1 = optimizer1.optimize_clf(population_size=population,
                                  generations=generations)
# WARNING: if optimizer2 were created before optimizer1 had run its
# optimization, the results would be different
optimizer2 = Optimizer(estimator_class=DecisionTreeClassifier, features=X, labels=y,
                       hyperparam_space=default_hyperparam_space, seed=seed)
result2 = optimizer2.optimize_clf(population_size=population,
                                  generations=generations)
# Same setup with a different seed
optimizer3 = Optimizer(estimator_class=DecisionTreeClassifier, features=X, labels=y,
                       hyperparam_space=default_hyperparam_space, seed=distinct_seed)
result3 = optimizer3.optimize_clf(population_size=population,
                                  generations=generations)
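# Same seed: the two results should match
print(str(result1) == str(result2))
# Different seed: the results are expected to differ
print(str(result1) != str(result3))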
Warning
To ensure reproducibility, run the optimization immediately after creating the seeded optimizer, so that no other code uses a random number generator in the meantime.
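The effect described in this warning can be reproduced with Python's standard random module alone. The sketch below assumes that the seeded generator is shared global state; whether MLOptimizer seeds a global generator in exactly this way is an assumption, not something stated here. The helper draw_sequence is hypothetical.

```python
import random


def draw_sequence(n=3):
    """Draw n numbers from Python's global random generator."""
    return [random.random() for _ in range(n)]


# Seed, then draw immediately: this is the reference sequence.
random.seed(25)
reference = draw_sequence()

# Seed again and draw immediately: the sequence repeats.
random.seed(25)
repeated = draw_sequence()

# Seed again, but let unrelated code consume one number first.
random.seed(25)
random.random()  # stands in for any other code that touches the generator
shifted = draw_sequence()

print(reference == repeated)         # True
print(shifted[:2] == reference[1:])  # True: same stream, offset by one draw
```

The seed is the same in all three cases, yet the last sequence starts one draw later. Any randomness consumed between seeding and running shifts everything that follows in the same way.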