Mastering Monte Carlo: How to Simulate Your Way to Better Machine Learning Models


In the Monte Carlo method, the pi estimate is based on the proportion of “darts” that land inside the circle to the total number of darts thrown. The resulting estimated pi value is used to generate a circle. If the Monte Carlo estimate is inaccurate, the circle will again be the wrong size. The width of the gap between this estimated circle and the unit circle gives an indication of the accuracy of the Monte Carlo estimate.

However, because the Monte Carlo method generates more accurate estimates as the number of “darts” increases, the estimated circle should converge towards the unit circle as more “darts” are thrown. Therefore, while both methods show a gap when the estimate is inaccurate, this gap should decrease more consistently with the Monte Carlo method as the number of “darts” increases.

What makes Monte Carlo simulations so powerful is their ability to harness randomness to solve deterministic problems. By generating a large number of random scenarios and analyzing the results, we can estimate the probability of different outcomes, even for complex problems that would be difficult to solve analytically.

In the case of estimating pi, the Monte Carlo method allows us to make a very accurate estimate, even though we’re just throwing darts randomly. As discussed, the more darts we throw, the more accurate our estimate becomes. This is a demonstration of the law of large numbers, a fundamental concept in probability theory that states that the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer and closer as more trials are performed. Let’s see if this tends to be true for our six examples shown in Figures 2a-2f by plotting the number of darts thrown against the difference between Monte Carlo-estimated pi and real pi. In general, our graph (Figure 2g) should trend negative. Here’s the code to accomplish this:

# Calculate the differences between the real pi and the estimated pi
# Note: pi_estimates and num_darts_thrown are the lists collected while
# generating Figures 2a-2f earlier in the article
import math

import plotly.graph_objects as go
import plotly.io as pio

diff_pi = [abs(estimate - math.pi) for estimate in pi_estimates]

# Create the figure for the number of darts vs difference in pi plot (Figure 2g)
fig2g = go.Figure(data=go.Scatter(x=num_darts_thrown, y=diff_pi, mode='lines'))

# Add title and labels to the plot
fig2g.update_layout(
    title="Fig2g: Darts Thrown vs Difference in Estimated Pi",
    xaxis_title="Number of Darts Thrown",
    yaxis_title="Difference in Pi",
)

# Display the plot
fig2g.show()

# Save the plot as a png
pio.write_image(fig2g, "fig2g.png")

Note that, even with only 6 examples, the general pattern is as expected: the more darts thrown (the more scenarios), the smaller the difference between the estimated and real value, and thus the better the prediction.

Let’s say we throw 1,000,000 total darts, and allow ourselves 500 predictions. In other words, we will record the difference between the estimated and actual values of pi at 500 evenly spaced intervals throughout the simulation of 1,000,000 thrown darts. Rather than generate 500 extra figures, let’s skip straight to what we’re trying to confirm: that as more darts are thrown, the difference between our predicted value of pi and the real value shrinks. We’ll use a scatter plot (Figure 2h):

#500 Monte Carlo Scenarios; 1,000,000 darts thrown
import random
import math
import plotly.graph_objects as go
import plotly.io as pio

# Total number of darts to throw (1M)
num_darts = 1000000
darts_in_circle = 0

# Number of scenarios to record (500)
num_scenarios = 500
darts_per_scenario = num_darts // num_scenarios

# Lists to store the data for each scenario
darts_thrown_list = []
pi_diff_list = []

# We'll throw a number of darts
for i in range(num_darts):
    # Generate random x, y coordinates between -1 and 1
    x, y = random.uniform(-1, 1), random.uniform(-1, 1)

    # Check if the dart is inside the circle
    # A dart is inside the circle if the distance from the origin (0,0) is less than or equal to 1
    if math.sqrt(x**2 + y**2) <= 1:
        darts_in_circle += 1

    # If it's time to record a scenario
    if (i + 1) % darts_per_scenario == 0:
        # Estimate pi with the Monte Carlo method
        # The estimate is 4 times the number of darts in the circle divided by the total number of darts
        pi_estimate = 4 * darts_in_circle / (i + 1)

        # Record the number of darts thrown and the difference between the estimated and actual values of pi
        darts_thrown_list.append((i + 1) / 1000)  # Dividing by 1000 to display in thousands
        pi_diff_list.append(abs(pi_estimate - math.pi))

# Create a scatter plot of the data
fig2h = go.Figure(data=go.Scattergl(x=darts_thrown_list, y=pi_diff_list, mode='markers'))

# Update the layout of the plot
fig2h.update_layout(
    title="Fig2h: Difference between Estimated and Actual Pi vs. Number of Darts Thrown (in thousands)",
    xaxis_title="Number of Darts Thrown (in thousands)",
    yaxis_title="Difference between Estimated and Actual Pi",
)

# Display the plot
fig2h.show()

# Save the plot as a png
pio.write_image(fig2h, "fig2h.png")

You might be thinking to yourself at this point, “Monte Carlo is an interesting statistical tool, but how does it apply to machine learning?” The short answer is: in many ways. One of the many applications of Monte Carlo simulations in machine learning is in the realm of hyperparameter tuning.

Hyperparameters are the knobs and dials that we (the humans) adjust when setting up machine learning algorithms. They control aspects of the algorithm’s behavior that, crucially, aren’t learned from the data. For example, in a decision tree, the maximum depth of the tree is a hyperparameter. In a neural network, the learning rate and the number of hidden layers are hyperparameters.
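To make that concrete, here is a minimal sketch of setting hyperparameters by hand in scikit-learn (the library we will use later in this article); the specific values are arbitrary choices for illustration, not recommendations:

# A minimal sketch of setting hyperparameters by hand
# The specific values below are arbitrary, chosen only for illustration
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

# A decision tree whose maximum depth we fix at 5 before training
tree = DecisionTreeClassifier(max_depth=5)

# A small neural network with two hidden layers and a fixed initial learning rate
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), learning_rate_init=0.001)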

Choosing the right hyperparameters can make the difference between a model that performs poorly and one that performs excellently. But how do we know which hyperparameters to choose? This is where Monte Carlo simulations come in.

Traditionally, machine learning practitioners have used methods like grid search or random search to tune hyperparameters. These methods involve specifying a set of possible values for each hyperparameter, and then training and evaluating a model for every possible combination of hyperparameters. This can be computationally expensive and time-consuming, especially when there are many hyperparameters to tune or a large range of possible values each can take.
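To get a feel for how quickly this adds up, here is a small sketch that simply counts the fits a grid search would need; the hyperparameter grid below is made up purely for illustration:

# Count how many model fits an exhaustive grid search would require
# The grid below is hypothetical, invented only to show the combinatorics
from sklearn.model_selection import ParameterGrid

grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],  # 7 values
    'penalty': ['l1', 'l2'],                     # 2 values
    'solver': ['liblinear', 'saga'],             # 2 values
}

n_combinations = len(ParameterGrid(grid))
print(n_combinations)       # 7 * 2 * 2 = 28 combinations
print(n_combinations * 5)   # 140 model fits with 5-fold cross-validation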

Monte Carlo simulations offer a more efficient alternative. Instead of exhaustively searching through all possible combinations of hyperparameters, we can randomly sample from the space of hyperparameters according to some probability distribution. This allows us to explore the hyperparameter space more efficiently and find good combinations of hyperparameters faster.
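In its simplest form, that just means drawing hyperparameter values at random under a fixed budget of model fits, as in this small sketch (the ranges and the budget of 20 samples are illustrative assumptions):

# A minimal sketch of random (Monte Carlo-style) hyperparameter sampling
# The ranges and the budget of 20 samples are illustrative assumptions
import numpy as np

rng = np.random.default_rng(42)
sampled_configs = [
    {'C': 10 ** rng.uniform(-3, 3),          # log-uniform between 0.001 and 1000
     'penalty': rng.choice(['l1', 'l2'])}    # pick a penalty at random
    for _ in range(20)                       # fixed budget of 20 candidate models
]
print(sampled_configs[:3])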

In the next section, we’ll use a real dataset to demonstrate how to use Monte Carlo simulations for hyperparameter tuning in practice. Let’s get started!

The Heartbeat of Our Experiment: The Heart Disease Dataset

In the world of machine learning, data is the lifeblood that powers our models. For our exploration of Monte Carlo simulations in hyperparameter tuning, let’s look at a dataset that’s close to the heart — quite literally. The Heart Disease dataset (CC BY 4.0) from the UCI Machine Learning Repository is a collection of medical records from patients, some of whom have heart disease.

The dataset contains 14 attributes, including age, sex, chest pain type, resting blood pressure, cholesterol levels, fasting blood sugar, and others. The target variable is the presence of heart disease, making this a binary classification task. With a mix of categorical and numerical features, it’s an interesting dataset for demonstrating hyperparameter tuning.

First, let’s take a look at our dataset to get a sense of what we’ll be working with — always a good place to start.


#Load and view first few rows of dataset

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
import numpy as np
import plotly.graph_objects as go

# Load the dataset
# The dataset is available at the UCI Machine Learning Repository
# It's a dataset about heart disease and includes various patient measurements
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

# Define the column names for the dataframe
column_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# Load the dataset into a pandas dataframe
# We specify the column names and also tell pandas to treat '?' as NaN
df = pd.read_csv(url, names=column_names, na_values="?")

# Print the first few rows of the dataframe
# This gives us a quick overview of the data
print(df.head())

This shows us the first few rows of our dataset across all columns. If you’ve loaded the right CSV and named your columns as I have, your output will look like Figure 3.

Figure 3: First 4 rows of data from our dataset

Before we can use the Heart Disease dataset for hyperparameter tuning, we need to preprocess the data. This involves several steps:

  1. Handling missing values: Some records in the dataset have missing values. We’ll need to decide how to handle these, whether by deleting the records, filling in the missing values, or some other method.
  2. Encoding categorical variables: Many machine learning algorithms require input data to be numerical. We’ll need to convert categorical variables into a numerical format.
  3. Normalizing numerical features: Machine learning algorithms often perform better when numerical features are on a similar scale. We’ll apply normalization to adjust the scale of these features.

Let’s start by handling missing values. In our Heart Disease dataset, we have a few missing values in the ‘ca’ and ‘thal’ columns. We’ll fill these missing values with the median of the respective column. This is a common strategy for dealing with missing data, as it doesn’t drastically affect the distribution of the data.

Next, we’ll encode the categorical variables. In our dataset, the ‘cp’, ‘restecg’, ‘slope’, ‘ca’, and ‘thal’ columns are categorical. We’ll use label encoding to convert these categorical variables into numerical ones. Label encoding assigns each unique category in a column to a different integer.

Finally, we’ll normalize the numerical features. Normalization adjusts the scale of numerical features so that they all fall within a similar range. This can help improve the performance of many machine learning algorithms. We’ll use standard scaling for normalization, which transforms the data to have a mean of 0 and a standard deviation of 1.

Here’s the Python code that performs all of these preprocessing steps:

# Preprocess

# Import necessary libraries
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

# Identify missing values in the dataset
# This will print the number of missing values in each column
print(df.isnull().sum())

# Fill missing values with the median of the column
# The SimpleImputer class from sklearn provides basic strategies for imputing missing values
# We're using the 'median' strategy, which replaces missing values with the median of each column
imputer = SimpleImputer(strategy='median')

# Apply the imputer to the dataframe
# The result is a new dataframe where missing values have been filled in
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Print the first few rows of the filled dataframe
# This gives us a quick check to make sure the imputation worked correctly
print(df_filled.head())

# Identify categorical variables in the dataset
# These are variables stored with a non-numerical (object) dtype
# Note: in this particular dataset the categorical columns are already stored as numeric codes,
# so this may return no columns after imputation; the encoding loop below is kept for generality
categorical_vars = df_filled.select_dtypes(include='object').columns

# Encode categorical variables
# The LabelEncoder class from sklearn converts each unique string into a unique integer
encoder = LabelEncoder()
for var in categorical_vars:
    df_filled[var] = encoder.fit_transform(df_filled[var])

# Normalize numerical features
# The StandardScaler class from sklearn standardizes features by removing the mean and scaling to unit variance
scaler = StandardScaler()

# Apply the scaler to the dataframe
# The result is a new dataframe where numerical features have been normalized
df_normalized = pd.DataFrame(scaler.fit_transform(df_filled), columns=df_filled.columns)

# Print the first few rows of the normalized dataframe
# This gives us a quick check to make sure the normalization worked correctly
print(df_normalized.head())

The first print statement shows us the number of missing values in each column of the original dataset. In our case, the ‘ca’ and ‘thal’ columns had a few missing values.

The second print statement shows us the first few rows of the dataset after filling in the missing values. As discussed, we used the median of each column to fill in the missing values.

The third print statement shows us the first few rows of the dataset after encoding the categorical variables. After this step, all the variables in our dataset are numerical.

The final print statement shows us the first few rows of the dataset after normalizing the numerical features, so that each feature has a mean of 0 and a standard deviation of 1. After this step, all the numerical features in our dataset are on a similar scale. Check that your output resembles Figure 4:

Figure 4: Preprocessing Print Statements Output

After running this code, we have a preprocessed dataset that’s ready for modeling.

Now that we’ve preprocessed our data, we’re ready to implement a basic machine learning model. This will serve as our baseline model, which we’ll later try to improve through hyperparameter tuning.

We’ll use a simple logistic regression model for this task. Note that while it’s called “regression,” this is actually one of the most popular algorithms for binary classification problems, like the one we’re dealing with in the Heart Disease dataset. It’s a linear model that predicts the probability of the positive class.
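Concretely, logistic regression passes a weighted sum of the features through the logistic (sigmoid) function to turn it into a probability between 0 and 1. Here is a tiny sketch of that mapping; the weights are invented purely for illustration:

# The logistic (sigmoid) function squashes any real number into the range (0, 1)
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical weighted sum of two features plus a bias (weights invented for illustration)
z = 0.8 * 1.2 + (-0.5) * 0.3 + 0.1   # w1*x1 + w2*x2 + b
print(sigmoid(z))                     # predicted probability of the positive class, about 0.71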

After training our model, we’ll evaluate its performance using two common metrics: accuracy and ROC-AUC. Accuracy is the proportion of correct predictions out of all predictions, while ROC-AUC (Receiver Operating Characteristic — Area Under Curve) measures the trade-off between the true positive rate and the false positive rate.
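As a quick illustration of how these two metrics are computed with scikit-learn, on made-up labels and scores rather than our dataset:

# Toy illustration of accuracy and ROC-AUC on invented labels and scores (not the heart disease data)
from sklearn.metrics import accuracy_score, roc_auc_score

y_true   = [0, 0, 1, 1, 1, 0]
y_pred   = [0, 1, 1, 1, 0, 0]              # hard class predictions
y_scores = [0.2, 0.6, 0.9, 0.7, 0.4, 0.1]  # predicted probabilities of class 1

print(accuracy_score(y_true, y_pred))   # 4 of 6 predictions correct -> about 0.67
print(roc_auc_score(y_true, y_scores))  # 8 of 9 positive/negative pairs ranked correctly -> about 0.89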

But what does this have to do with Monte Carlo simulations? Well, machine learning models like logistic regression have several hyperparameters that can be tuned to improve performance. However, finding the best set of hyperparameters can be like searching for a needle in a haystack. This is where Monte Carlo simulations come in. By randomly sampling different sets of hyperparameters and evaluating their performance, we can estimate the probability distribution of good hyperparameters and make an educated guess about the best ones to use, much as throwing more random darts gave us increasingly accurate estimates of pi in our dart-throwing exercise.

Here’s the Python code that implements and evaluates a basic logistic regression model for our newly pre-processed data:

# Logistic Regression Model - Baseline

# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Replace the 'target' column in the normalized DataFrame with the original 'target' column
# This is done because the 'target' column was also normalized, which is not what we want
df_normalized['target'] = df['target']

# Binarize the 'target' column
# This is done because the original 'target' column contains values from 0 to 4
# We want to simplify the problem to a binary classification problem: heart disease or no heart disease
df_normalized['target'] = df_normalized['target'].apply(lambda x: 1 if x > 0 else 0)

# Split the data into training and test sets
# The 'target' column is our label, so we drop it from our features (X)
# We use a test size of 20%, meaning 80% of the data will be used for training and 20% for testing
X = df_normalized.drop('target', axis=1)
y = df_normalized['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Implement a basic logistic regression model
# Logistic Regression is a simple yet powerful linear model for binary classification problems
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
# The model has been trained, so we can now use it to make predictions on unseen data
y_pred = model.predict(X_test)

# Evaluate the model
# We use accuracy (the proportion of correct predictions) and ROC-AUC (a measure of how well the model distinguishes between classes) as our metrics
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)

# Print the performance metrics
# These give us an indication of how well our model is performing
print("Baseline Model " + f'Accuracy: {accuracy}')
print("Baseline Model " + f'ROC-AUC: {roc_auc}')

With an accuracy of 0.885 and an ROC-AUC score of 0.884, our basic logistic regression model has set a solid baseline for us to improve upon. These metrics indicate that our model is performing quite well at distinguishing between patients with and without heart disease. Let’s see if we can make it better.

In machine learning, a model’s performance can often be improved by tuning its hyperparameters. Hyperparameters are parameters that are not learned from the data, but are set prior to the start of the learning process. For example, in logistic regression, the regularization strength ‘C’ and the type of penalty ‘l1’ or ‘l2’ are hyperparameters.

Let’s perform hyperparameter tuning on our logistic regression model using grid search. We’ll tune the ‘C’ and ‘penalty’ hyperparameters, and we’ll use ROC-AUC as our scoring metric. Let’s see if we can beat our baseline model’s performance.

Now, let’s start with the Python code for this section.

# Grid Search

# Import necessary libraries
from sklearn.model_selection import GridSearchCV

# Define the hyperparameters and their values
# 'C' is the inverse of regularization strength (smaller values specify stronger regularization)
# 'penalty' specifies the norm used in the penalization (l1 or l2)
hyperparameters = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                   'penalty': ['l1', 'l2']}

# Implement grid search
# GridSearchCV is a method used to tune our model's hyperparameters
# We pass our model, the hyperparameters to tune, and the number of folds for cross-validation
# We're using ROC-AUC as our scoring metric
# Note: LogisticRegression's default 'lbfgs' solver only supports the 'l2' penalty, so
# (depending on your scikit-learn version) the 'l1' combinations will fail with a warning
# and be scored as NaN; pass solver='liblinear' if you want both penalties to be fitted
grid_search = GridSearchCV(LogisticRegression(), hyperparameters, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
# GridSearchCV has found the best hyperparameters for our model, so we print them out
best_params = grid_search.best_params_
print(f'Best hyperparameters: {best_params}')

# Evaluate the best model
# GridSearchCV also gives us the best model, so we can use it to make predictions and evaluate its performance
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)
accuracy_best = accuracy_score(y_test, y_pred_best)
roc_auc_best = roc_auc_score(y_test, y_pred_best)

# Print the performance metrics of the best model
# These give us an indication of how well our model is performing after hyperparameter tuning
print("Grid Search Method " + f'Accuracy of the best model: {accuracy_best}')
print("Grid Search Method " + f'ROC-AUC of the best model: {roc_auc_best}')

With the best hyperparameters found to be {‘C’: 0.1, ‘penalty’: ‘l2’}, the best model from our grid search achieves an accuracy of 0.852 and an ROC-AUC score of 0.853. Interestingly, this performance is slightly lower than our baseline model’s. This could be because the baseline model’s default hyperparameters were already well-suited to this particular dataset, or it could be a result of the randomness inherent in the train-test split. Regardless, it’s a valuable reminder that more complex models and techniques are not always better.

However, you might have also noticed that our grid search only explored a relatively small number of possible hyperparameter combinations. In practice, the number of hyperparameters and their potential values can be much larger, making grid search computationally expensive or even infeasible.

This is where the Monte Carlo method comes in. Let’s see whether this sampling-based approach improves on either the original baseline or the grid search-tuned model’s performance:

#Monte Carlo

# Import necessary libraries
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the range of hyperparameters
# 'C' is the inverse of regularization strength (smaller values specify stronger regularization)
# 'penalty' specifies the norm used in the penalization (l1 or l2)
C_range = np.logspace(-3, 3, 7)
penalty_options = ['l1', 'l2']

# Initialize variables to store the best score and hyperparameters
best_score = 0
best_hyperparams = None

# Perform the Monte Carlo simulation
# We're going to perform 1000 iterations. You can play with this number to see how the performance changes.
# Remember the Law of Large Numbers!
# (For simplicity we score each candidate on the test set here; in practice you would
# use a validation split or cross-validation for selection.)
for _ in range(1000):

    # Randomly select hyperparameters from the defined range
    C = np.random.choice(C_range)
    penalty = np.random.choice(penalty_options)

    # Create and evaluate the model with these hyperparameters
    # We're using 'liblinear' solver as it supports both L1 and L2 regularization
    model = LogisticRegression(C=C, penalty=penalty, solver='liblinear')
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Calculate the accuracy and ROC-AUC
    accuracy = accuracy_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred)

    # If this model's ROC-AUC is the best so far, store its score and hyperparameters
    if roc_auc > best_score:
        best_score = roc_auc
        best_hyperparams = {'C': C, 'penalty': penalty}

# Print the best score and hyperparameters
print("Monte Carlo Method " + f'Best ROC-AUC: {best_score}')
print("Monte Carlo Method " + f'Best hyperparameters: {best_hyperparams}')

# Train the model with the best hyperparameters
best_model = LogisticRegression(**best_hyperparams, solver='liblinear')
best_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = best_model.predict(X_test)

# Calculate and print the accuracy of the best model
accuracy = accuracy_score(y_test, y_pred)
print("Monte Carlo Method " + f'Accuracy of the best model: {accuracy}')

In the Monte Carlo method, we found that the best ROC-AUC score was 0.9014, with the best hyperparameters being {‘C’: 0.1, ‘penalty’: ‘l1’}. The accuracy of the best model was 0.9016.

Looks like Monte Carlo just pulled an ace from the deck — this is an improvement over both the baseline model and the model tuned using grid search. I encourage you to tweak the Python code to see how it impacts the performance, remembering the principles discussed. See if you can improve the grid search method by increasing the hyperparameter space, or compare the computation time to the Monte Carlo method. Increase and decrease the number of iterations for our Monte Carlo method to see how that impacts performance.
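If you want to compare computation time, one simple approach is to wrap each search in a timer, as in the rough sketch below; it assumes X_train, y_train, C_range, and penalty_options from the earlier code blocks are still in scope:

# Rough timing harness; assumes X_train, y_train, C_range, and penalty_options are already defined
import time
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

start = time.time()
GridSearchCV(LogisticRegression(solver='liblinear'),
             {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'penalty': ['l1', 'l2']},
             cv=5, scoring='roc_auc').fit(X_train, y_train)
print(f"Grid search: {time.time() - start:.2f} seconds")

start = time.time()
for _ in range(1000):
    C = np.random.choice(C_range)
    penalty = np.random.choice(penalty_options)
    LogisticRegression(C=C, penalty=penalty, solver='liblinear').fit(X_train, y_train)
print(f"1000 Monte Carlo fits: {time.time() - start:.2f} seconds")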

The Monte Carlo method, born from a game of solitaire, has undoubtedly reshaped the landscape of computational mathematics and data science. Its power lies in its simplicity and versatility, allowing us to tackle complex, high-dimensional problems with relative ease. From estimating the value of pi with a game of darts to tuning hyperparameters in machine learning models, Monte Carlo simulations have proven to be an invaluable tool in our data science arsenal.

In this article, we’ve journeyed from the origins of the Monte Carlo method, through its theoretical underpinnings, and into its practical applications in machine learning. We’ve seen how it can be used to optimize machine learning models, with a hands-on exploration of hyperparameter tuning using a real-world dataset. We’ve also compared it with other methods, demonstrating its efficiency and effectiveness.

But the story of Monte Carlo is far from over. As we continue to push the boundaries of machine learning and data science, the Monte Carlo method will undoubtedly continue to play a crucial role. Whether we’re developing sophisticated AI applications, making sense of complex data, or simply playing a game of solitaire, the Monte Carlo method is a testament to the power of simulation and approximation in solving complex problems.

As we move forward, let’s take a moment to appreciate the beauty of this method — a method that has its roots in a simple card game, yet has the power to drive some of the most advanced computations in the world. The Monte Carlo method truly is a high-stakes game of chance and complexity, and so far, it seems, the house always wins. So, keep shuffling the deck, keep playing your cards, and remember — in the game of data science, Monte Carlo could just be your ace in the hole.

Congratulations on making it to the end! We’ve journeyed through the world of probabilities, wrestled with complex models, and emerged with a newfound appreciation for the power of Monte Carlo simulations. We’ve seen them in action, simplifying intricate problems into manageable components, and even optimizing hyperparameters for machine learning tasks.

If you enjoy diving into the intricacies of ML problem-solving as much as I do, follow me on Medium and LinkedIn. Together, let’s navigate the AI labyrinth, one clever solution at a time.

Until our next statistical adventure, keep exploring, keep learning, and keep simulating! And in your data science and ML journey, may the odds be ever in your favor.

Note: All images, unless otherwise noted, are by the author.


