CatBoost optimization example#

A simple example showing hyperparameter optimization for CatBoost Classifier with genetic algorithms.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from catboost import CatBoostClassifier
import plotly
import os
from mloptimizer.interfaces import HyperparameterSpaceBuilder, GeneticSearch
from mloptimizer.application.reporting.plots import (plotly_search_space, plotly_logbook,
                                                     plot_logbook)
import matplotlib.pyplot as plt

Load and prepare the dataset#

print("Loading Breast Cancer dataset...")
data = load_breast_cancer()
X, y = data.data, data.target

print(f"Dataset shape: {X.shape}")
print(f"Number of classes: {len(set(y))}")

Loading Breast Cancer dataset...
Dataset shape: (569, 30)
Number of classes: 2

Split the data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Define the CatBoost hyperparameter space#

CatBoost has specific hyperparameters that differ from other gradient boosting libraries. We can use the default hyperparameter space or build a custom one.

Option 1: Load default space (recommended for quick start)

hyperparam_space = HyperparameterSpaceBuilder.get_default_space(
    estimator_class=CatBoostClassifier
)

# Option 2: Build custom space (uncomment to use)
# hyperparam_space = (HyperparameterSpaceBuilder()
#                     .add_int_param('depth', min_value=4, max_value=10)
#                     .add_float_param('learning_rate', min_value=1, max_value=10, scale=100)
#                     .add_int_param('iterations', min_value=50, max_value=200)
#                     .add_float_param('subsample', min_value=700, max_value=1000, scale=1000)
#                     .set_fixed_param('verbose', False)
#                     .build())

print("\nHyperparameter space configuration:")
print(f"Evolvable parameters: {list(hyperparam_space.evolvable_hyperparams.keys())}")
print(f"Fixed parameters: {list(hyperparam_space.fixed_hyperparams.keys())}")

Hyperparameter space configuration:
Evolvable parameters: ['learning_rate', 'depth', 'n_estimators', 'subsample', 'l2_leaf_reg', 'colsample_bylevel', 'random_strength']
Fixed parameters: ['auto_class_weights', 'bootstrap_type', 'allow_writing_files', 'verbose', 'thread_count']

Configure and run the genetic optimization#

Genetic Algorithm Configuration: - generations: Number of evolutionary iterations - population_size: Number of configurations per generation - n_elites: Number of best individuals preserved each generation - seed: Random seed for reproducibility Note: Small values for faster documentation builds. For production, increase to 20+ generations.

genetic_params = {
    'generations': 5,
    'population_size': 8,
    'n_elites': 2,
    'seed': 42,
    'use_parallel': False
}

opt = GeneticSearch(
    estimator_class=CatBoostClassifier,
    hyperparam_space=hyperparam_space,
    cv=3,
    scoring='accuracy',
    disable_file_output=False,
    **genetic_params
)

print("\nStarting CatBoost optimization...")
print(f"Generations: {opt.generations}")
print(f"Population size: {opt.population_size}")

# Run the optimization
opt.fit(X_train, y_train)

/home/docs/checkouts/readthedocs.org/user_builds/mloptimizer/checkouts/master/examples/plot_catboost_example.py:76: UserWarning: Some hyperparameters have very small integer ranges (< 10 distinct values): 'depth' (7 values: 4 to 10). Small ranges limit search granularity. Consider increasing the range or scale for float types.
  opt = GeneticSearch(

Starting CatBoost optimization...
Generations: 5
Population size: 8

Genetic execution:   0%|          | 0/6 [00:00<?, ?it/s, best fitness=?]
Genetic execution:  17%|█▋        | 1/6 [00:05<00:25,  5.20s/it, best fitness=0.974]
Genetic execution:  17%|█▋        | 1/6 [00:06<00:34,  6.86s/it, best fitness=0.976]
Genetic execution:  17%|█▋        | 1/6 [00:40<03:20, 40.19s/it, best fitness=0.978]
Genetic execution:  17%|█▋        | 1/6 [00:53<04:25, 53.03s/it, best fitness=0.978]
Genetic execution:  33%|███▎      | 2/6 [00:53<01:46, 26.51s/it, best fitness=0.978]
Genetic execution:  33%|███▎      | 2/6 [01:18<01:46, 26.51s/it, best fitness=0.98]
Genetic execution:  50%|█████     | 3/6 [01:22<01:23, 27.84s/it, best fitness=0.98]
Genetic execution:  67%|██████▋   | 4/6 [01:29<00:39, 19.86s/it, best fitness=0.98]
Genetic execution:  67%|██████▋   | 4/6 [01:32<00:39, 19.86s/it, best fitness=0.982]
Genetic execution:  83%|████████▎ | 5/6 [01:35<00:15, 15.01s/it, best fitness=0.982]
Genetic execution: 100%|██████████| 6/6 [01:42<00:00, 12.38s/it, best fitness=0.982]
Genetic execution: 100%|██████████| 6/6 [01:48<00:00, 18.15s/it, best fitness=0.982]
/home/docs/checkouts/readthedocs.org/user_builds/mloptimizer/envs/master/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
/home/docs/checkouts/readthedocs.org/user_builds/mloptimizer/envs/master/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):

GeneticSearch(cv=StratifiedKFold(n_splits=3, random_state=42, shuffle=True),
              estimator_class=<class 'catboost.core.CatBoostClassifier'>,
              generations=5,
              hyperparam_space=HyperparameterSpace(fixed_hyperparams={'auto_class_weights': 'Balanced', 'bootstrap_type': 'Bernoulli', 'allow_writing_files': False, 'verbose': False, 'thread_count': 1}, evolvable_hyperparams={'l...ram('n_estimators', 100, 500, 'int'), 'subsample': Hyperparam('subsample', 700, 1000, 'float', 1000), 'l2_leaf_reg': Hyperparam('l2_leaf_reg', 1, 10, 'int'), 'colsample_bylevel': Hyperparam('colsample_bylevel', 50, 100, 'float', 100), 'random_strength': Hyperparam('random_strength', 1, 10, 'int')}),
              n_elites=2, population_size=8, scoring='accuracy', seed=42,
              use_parallel=False)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Get results and evaluate

best_clf = opt.best_estimator_
y_pred = best_clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='binary')
recall = recall_score(y_test, y_pred, average='binary')
f1 = f1_score(y_test, y_pred, average='binary')

print(f"\nOptimization completed!")
print(f"Best hyperparameters: {opt.best_params_}")
print(f"\nTest performance:")
print(f"  Accuracy:  {test_accuracy:.4f}")
print(f"  Precision: {precision:.4f}")
print(f"  Recall:    {recall:.4f}")
print(f"  F1 Score:  {f1:.4f}")

Optimization completed!
Best hyperparameters: {'learning_rate': 0.02, 'depth': 4, 'l2_leaf_reg': 5, 'thread_count': 1, 'verbose': False, 'auto_class_weights': 'Balanced', 'random_strength': 8, 'allow_writing_files': False, 'bootstrap_type': 'Bernoulli', 'subsample': 0.835, 'n_estimators': 479, 'colsample_bylevel': 0.86, 'random_state': 42}

Test performance:
  Accuracy:  0.9649
  Precision: 0.9722
  Recall:    0.9722
  F1 Score:  0.9722

Generate visualizations

population_df = opt.populations_

# Search space visualization
top_params = ['depth', 'learning_rate', 'n_estimators', 'l2_leaf_reg', 'fitness']
df_filtered = population_df[top_params]
g_search_space = plotly_search_space(df_filtered, top_params)
g_search_space.update_layout(
    title="CatBoost Hyperparameter Search Space - Breast Cancer Dataset",
    autosize=True,
    width=None,
    height=650
)
plotly.io.show(g_search_space, config={'responsive': True})

Simple logbook visualization

g_logbook_s = plot_logbook(opt.logbook_)
# plt.show()  # Commented out for non-interactive environments

/home/docs/checkouts/readthedocs.org/user_builds/mloptimizer/envs/master/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
/home/docs/checkouts/readthedocs.org/user_builds/mloptimizer/envs/master/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):

Evolution logbook visualization

g_logbook = plotly_logbook(opt.logbook_, population_df)
g_logbook.update_layout(
    title="CatBoost Optimization Evolution - Breast Cancer Dataset",
    autosize=True,
    width=None,
    height=500
)
plotly.io.show(g_logbook, config={'responsive': True})

Analyze optimization results

print("\n=== Optimization Analysis ===")
print(f"Unique evaluations performed: {opt.n_trials_}")
print(f"Total individuals in population history: {len(population_df)}")
print(f"Optimization time: {opt.optimization_time_:.4f} seconds")
print(f"Time per evaluation: {opt.optimization_time_ / opt.n_trials_:.4f} seconds")
print(f"Generations completed: {opt.generations}")

final_gen = population_df[population_df['population'] == opt.generations]
initial_gen = population_df[population_df['population'] == 1]

final_avg_fitness = final_gen['fitness'].mean()
initial_avg_fitness = initial_gen['fitness'].mean()
improvement = final_avg_fitness - initial_avg_fitness

print(f"\nFitness progression:")
print(f"  Initial average fitness: {initial_avg_fitness:.4f}")
print(f"  Final average fitness:   {final_avg_fitness:.4f}")
print(f"  Average improvement:     {improvement:.4f}")

=== Optimization Analysis ===
Unique evaluations performed: 28
Total individuals in population history: 48
Optimization time: 109.3850 seconds
Time per evaluation: 3.9066 seconds
Generations completed: 5

Fitness progression:
  Initial average fitness: 0.9772
  Final average fitness:   0.9783
  Average improvement:     0.0011

Access generated files

print("\n=== Generated Files ===")
graphics_path = opt._optimizer_service.optimizer.tracker.graphics_path
results_path = opt._optimizer_service.optimizer.tracker.results_path

print(f"Graphics path: {graphics_path}")
if os.path.exists(graphics_path):
    print("  Graphics files:", [f for f in os.listdir(graphics_path) if f.endswith('.html')])

print(f"Results path: {results_path}")
if os.path.exists(results_path):
    print("  Results files:", [f for f in os.listdir(results_path) if f.endswith('.csv')])

=== Generated Files ===
Graphics path: ./20260406_230846_CatBoostClassifier/graphics
  Graphics files: ['logbook.html', 'search_space.html']
Results path: ./20260406_230846_CatBoostClassifier/results
  Results files: ['logbook.csv', 'populations.csv']

CatBoost-specific features#

CatBoost offers several unique features:

Automatic handling of categorical features: No need for manual encoding
Balanced class weights: Set via auto_class_weights=’Balanced’ (included in default space)
GPU acceleration: Add ‘task_type’: ‘GPU’ to fixed parameters
Feature importance: Access via best_clf.feature_importances_

Example of adding categorical feature support:

# If you have categorical features
cat_features = [0, 2, 4]  # Indices of categorical columns

hyperparam_space = (HyperparameterSpaceBuilder()
                    .add_int_param('depth', 4, 10)
                    .add_float_param('learning_rate', 1, 10, scale=100)
                    .set_fixed_param('cat_features', cat_features)
                    .set_fixed_param('verbose', False)
                    .build())

Total running time of the script: (1 minutes 51.535 seconds)

Gallery generated by Sphinx-Gallery