Predicting Sleep Efficiency

Published: Mar 10, 2025 • 11 min read

Python

Machine Learning

Click here to download the Jupyter Notebook file created for this project.

Introduction

Sleep plays an essential role in physical health and overall well-being, but many people struggle with poor sleep quality. The goal of this project is to analyze and predict sleep efficiency (the proportion of time actually spent asleep compared to time spent in bed) by building a classification model based on various lifestyle factors and sleep patterns.

This project seeks to answer:

  • Can sleep efficiency be accurately predicted using the other features in this dataset?
  • Which features most strongly influence sleep efficiency?

By addressing these questions, this project aims to reveal insights into the relationship between lifestyle and sleep quality, which could help individuals make changes that improve their sleep.


Overview of the Dataset

For this project, I’ll be working with the Sleep Efficiency Dataset from Kaggle, which captures sleep patterns and related behaviors among 452 subjects. The dataset includes demographic information, sleep duration, efficiency metrics, sleep-stage distribution, lifestyle factors such as caffeine and alcohol intake, smoking status, and exercise frequency. This detailed information will allow analysis and prediction of sleep efficiency.

| Feature | Description |
|---|---|
| ID | Unique identifier for each test subject |
| Age | Age of the subject (in years) |
| Gender | Gender of the subject |
| Bedtime | Time subject goes to bed |
| Wakeup time | Time subject wakes up |
| Sleep duration | Total sleep duration (hours) |
| Sleep efficiency | Proportion of time in bed actually spent asleep |
| REM sleep percentage | Percentage of sleep in REM (Rapid Eye Movement) stage |
| Deep sleep percentage | Percentage of sleep in deep sleep stage |
| Light sleep percentage | Percentage of sleep in light sleep stage |
| Awakenings | Number of awakenings during sleep |
| Caffeine consumption | Caffeine intake in 24 hours prior to bedtime |
| Alcohol consumption | Alcohol intake in 24 hours prior to bedtime |
| Smoking status | Indicates if subject is a smoker ("Yes"/"No") |
| Exercise frequency | Frequency of exercise (scale: 0 to 5) |

Pre-Processing

Before training a model, it's essential to prepare the dataset. Pre-processing ensures the data is clean and in a format suitable for machine learning algorithms.

An important first step is to check for and handle missing values:

# count missing values across the whole DataFrame
df.isnull().sum().sum()
>np.int64(65)

The result is 65, so the dataset contains 65 missing values that need to be accounted for. Because that is a small share of the data, the rows containing them are dropped.

df.dropna(inplace=True)
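
Before dropping rows, it can help to check which columns hold the missing values and how many rows would be lost. The following is a minimal sketch on a small made-up frame; the values are illustrative and are not taken from the Kaggle dataset:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Awakenings": [1.0, np.nan, 3.0, 0.0],
    "Caffeine consumption": [50.0, 0.0, np.nan, np.nan],
    "Age": [30, 41, 25, 52],
})

nulls_per_column = toy.isnull().sum()              # missing cells in each column
rows_with_nulls = toy.isnull().any(axis=1).sum()   # rows that dropna() would remove

print(nulls_per_column)
print("Rows affected:", rows_with_nulls)

cleaned = toy.dropna()  # returns a new frame; toy itself is unchanged
```

The number of missing cells and the number of affected rows can differ when a single row has several gaps, so it is worth looking at both before deciding to drop.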

Irrelevant features such as “ID” are removed. An arbitrary identifier does not provide meaningful information when training the model.

# remove irrelevant features
df = df.drop(columns=["ID"])

Next, we need to examine the data type of each feature. Features with the type object require conversion into numeric or categorical formats before the model is trained.

df.dtypes
| Feature | Data Type |
|---|---|
| ID | int64 |
| Age | int64 |
| Gender | object |
| Bedtime | object |
| Wakeup time | object |
| Sleep duration | float64 |
| Sleep efficiency | float64 |
| REM sleep percentage | int64 |
| Deep sleep percentage | int64 |
| Light sleep percentage | int64 |
| Awakenings | float64 |
| Caffeine consumption | float64 |
| Alcohol consumption | float64 |
| Smoking status | object |
| Exercise frequency | float64 |

The output of df.dtypes shows that four features need to be modified: Gender, Bedtime, Wakeup time, and Smoking status.

In this dataset, Gender takes only two values, 'Male' and 'Female', so it is converted into a binary format where Male = 1 and Female = 0:

df["Gender"] = df["Gender"].apply(lambda x: 1 if x == "Male" else 0)

Smoking Status also only has two values. It is converted to a binary format as well.

df["Smoking status"] = df["Smoking status"].apply(lambda x: 1 if x == "Yes" else 0)
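
One thing to keep in mind with the lambda approach is that any unexpected value (a typo, a different capitalization, a missing entry) is silently encoded as 0. As a hedged alternative, the sketch below uses an explicit mapping on made-up data, so that unmapped values become NaN and can be caught:

```python
import pandas as pd

# Toy data with the same two columns (values are made up)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Smoking status": ["Yes", "No", "No"],
})

# Explicit mappings: a value missing from the dict becomes NaN instead of 0
toy["Gender"] = toy["Gender"].map({"Male": 1, "Female": 0})
toy["Smoking status"] = toy["Smoking status"].map({"Yes": 1, "No": 0})

# Fail loudly if an unexpected category slipped through
assert toy[["Gender", "Smoking status"]].notna().all().all()

print(toy)
```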

Bedtime and Wakeup time originally included dates. The date component was removed and the time of day was converted into a float (21:30 becomes 21.5). A specific date adds complexity without contributing meaningfully to the prediction, since sleep patterns depend more on daily habits than on the calendar date on which those habits occurred.

# convert Bedtime and Wakeup time to datetime format
df["Bedtime"] = pd.to_datetime(df["Bedtime"], format="%Y-%m-%d %H:%M:%S")
df["Wakeup time"] = pd.to_datetime(df["Wakeup time"], format="%Y-%m-%d %H:%M:%S")

# remove the date component
df["Bedtime"] = df["Bedtime"].dt.time
df["Wakeup time"] = df["Wakeup time"].dt.time

# Convert time to float (21:30 equals 21.5)
df["Bedtime"] = df["Bedtime"].apply(lambda x: x.hour + x.minute / 60)
df["Wakeup time"] = df["Wakeup time"].apply(lambda x: x.hour + x.minute / 60)
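
The same conversion can also be written more compactly with the pandas .dt accessor, skipping the intermediate time objects. This is a small sketch on made-up timestamps, assuming the same "%Y-%m-%d %H:%M:%S" string format as above:

```python
import pandas as pd

# Made-up timestamps in the same string format as the dataset
toy = pd.DataFrame({"Bedtime": ["2021-03-06 21:30:00", "2021-12-05 01:00:00"]})

# Parse the strings, then read hour and minute directly through .dt
parsed = pd.to_datetime(toy["Bedtime"], format="%Y-%m-%d %H:%M:%S")
toy["Bedtime"] = parsed.dt.hour + parsed.dt.minute / 60

print(toy["Bedtime"].tolist())  # 21:30 -> 21.5, 01:00 -> 1.0
```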

Examining Correlation Between Features

To better understand feature relationships, a correlation matrix was generated to show how strongly each feature relates to the others. Unlike the previous dataset selected for this project (which was synthetic), this one shows strong correlations between several features. The strongest correlations are listed below and visualized using a custom D3.js component for clearer, more readable insights.

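import matplotlib.pyplot as plt
import seaborn as sns

# pairwise Pearson correlation between all (now numeric) features
correlation_matrix = df.corr()

plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap="inferno", fmt=".2f", linewidths=0.5)
plt.show()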
Correlation Matrix

| Rank | Feature 1 | Feature 2 | Correlation (abs) |
|---|---|---|---|
| 1 | Light sleep percentage | Deep sleep percentage | 0.97 |
| 2 | Light sleep percentage | Sleep efficiency | 0.82 |
| 3 | Deep sleep percentage | Sleep efficiency | 0.79 |
| 4 | Wakeup time | Bedtime | 0.77 |
| 5 | Awakenings | Sleep efficiency | 0.56 |
| 6 | Sleep duration | Wakeup time | 0.51 |
| 7 | Exercise frequency | Bedtime | 0.40 |
| 8 | Alcohol consumption | Sleep efficiency | 0.39 |
| 9 | Alcohol consumption | Light sleep percentage | 0.38 |
| 10 | Alcohol consumption | Deep sleep percentage | 0.36 |
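
A ranked list of feature pairs like this one can be derived from a correlation matrix by visiting each pair once and sorting by absolute value. The sketch below shows one way to do that on synthetic data; the column names a, b, and c are placeholders, and this is not necessarily how the table above was produced:

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic data: b is strongly (negatively) related to a, c is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": -a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()

# Visit each unordered pair once, which skips the diagonal and mirrored entries
pairs = [(f1, f2, abs(corr.loc[f1, f2])) for f1, f2 in combinations(corr.columns, 2)]

ranked = (
    pd.DataFrame(pairs, columns=["Feature 1", "Feature 2", "Correlation (abs)"])
    .sort_values("Correlation (abs)", ascending=False)
    .reset_index(drop=True)
)
print(ranked)
```

Because the ranking uses absolute values, it shows the strength of each relationship but not its direction; the signed values remain available in the correlation matrix itself.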

Training the Model

A decision tree was selected for modeling because of its ability to handle both numeric and categorical data without much pre-processing. Decision trees also show clearly how each feature affects the prediction, which makes the results of the model easier to understand and communicate. The target, sleep_eff_bucket, is a bucketed (categorical) version of Sleep efficiency.

from sklearn.model_selection import train_test_split
from sklearn import tree

X = df.drop(columns="sleep_eff_bucket")  # Features
y = df["sleep_eff_bucket"]               # Target variable

# 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

classifier = tree.DecisionTreeClassifier()
classifier.fit(X_train, y_train)

# Make predictions and score the model on the held-out test set
predictions = classifier.predict(X_test)
accuracy = classifier.score(X_test, y_test)

print("Initial Model Accuracy:", accuracy)
>Initial Model Accuracy: 0.75

This is a reasonable accuracy score for an initial model. Next, we will attempt to improve it.
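
The training code above uses a target column named sleep_eff_bucket, but the step that creates it does not appear in this write-up. One plausible way to build such a column is to bin the continuous Sleep efficiency values with pd.cut. The bin edges and labels below are assumptions chosen purely for illustration, not the values used in the project:

```python
import pandas as pd

# Toy data (values are made up)
toy = pd.DataFrame({
    "Age": [25, 40, 33, 58],
    "Sleep efficiency": [0.55, 0.72, 0.81, 0.93],
})

# Hypothetical bin edges and labels (not taken from the original notebook)
edges = [0.0, 0.7, 0.85, 1.0]
labels = ["low", "medium", "high"]

toy["sleep_eff_bucket"] = pd.cut(toy["Sleep efficiency"], bins=edges, labels=labels)

# Exclude the continuous column from the features so the classifier
# cannot simply read the answer from it
features = toy.drop(columns=["Sleep efficiency", "sleep_eff_bucket"])
target = toy["sleep_eff_bucket"]

print(toy)
```

Whatever bucketing is used, the number of buckets matters when reading the accuracy score: 0.75 means something different for two classes than for five.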

Improving the Model

To improve the model, we will first look at the distribution of each feature. This can help identify imbalances, outliers, or skewed data that may negatively impact model performance.


Distribution for Each Feature

The distribution for Bedtime is clustered at both edges of its range. Most bedtimes fall shortly before or after midnight, and midnight is where the 24-hour clock wraps around, making it both the highest and the lowest value. To correct this, the values are shifted by 12 hours so that midnight sits in the middle of the feature’s range.

df["Bedtime"] = (df["Bedtime"] + 12) % 24
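
A short worked example, using made-up bedtimes, shows what the shift does to values on either side of midnight:

```python
import pandas as pd

# Made-up bedtimes as fractional hours: 21:30, 23:00, 00:00, 01:30
bedtimes = pd.Series([21.5, 23.0, 0.0, 1.5])

# Shift by 12 hours and wrap around the 24-hour clock
shifted = (bedtimes + 12) % 24

print(shifted.tolist())  # [9.5, 11.0, 12.0, 13.5]
```

Before the shift, 23:00 and 01:30 are 21.5 units apart even though they are only 2.5 hours apart on the clock. After the shift they become 11.0 and 13.5, so their numeric distance matches the real one.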

Caffeine consumption is simplified to a binary format, where 0 means no caffeine was consumed and 1 means some caffeine was consumed:

df["Caffeine consumption"] = (df["Caffeine consumption"] > 0).astype(int)

The model was retrained on the adjusted dataset and gave an accuracy score of 0.79. This is unlikely to be a meaningful improvement over 0.75. Because the train/test split is random, a difference of this size most likely reflects which rows landed in the training and testing sets.
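
One way to check whether a change in accuracy is larger than the noise from the random split is k-fold cross-validation, which scores the model on several different splits. The sketch below uses scikit-learn's cross_val_score on a synthetic stand-in dataset; it was not part of the original analysis, and the numbers it prints say nothing about the sleep data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the sleep data (not the real dataset)
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)

# Train and score the same kind of model on 5 different train/test splits
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print("Fold accuracies:", [f"{s:.2f}" for s in scores])
print(f"Mean: {scores.mean():.3f}  Std: {scores.std():.3f}")
```

If the spread across folds is comparable to the gap between two model variants, that gap should not be read as an improvement.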

Simplifying the Dataset

Lastly, the dataset was simplified by removing one feature from each of the two most strongly correlated feature pairs (Wakeup time with Bedtime, and Deep sleep percentage with Light sleep percentage):

df = df.drop(columns="Wakeup time")
df = df.drop(columns="Deep sleep percentage")

This final model gave an accuracy score of 0.75.

Important Features

The final step is to look at the most important features, which indicate what most strongly influences ‘sleep efficiency’ in the model.

fi = classifier.feature_importances_          # feature importance array
fi = pd.Series(data=fi, index=X.columns)      # convert to pandas Series for plotting
fi.sort_values(ascending=False, inplace=True) # sort descending
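
The feature_importances_ attribute of a decision tree is impurity-based: it is computed from the training data and always sums to 1. As a cross-check, scikit-learn also offers permutation importance, which measures how much test accuracy drops when a column is shuffled. The following sketch compares the two on synthetic data with placeholder feature names; it is an additional technique, not something used in the original project:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with hypothetical feature names
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Impurity-based importances (what feature_importances_ reports)
impurity = pd.Series(clf.feature_importances_, index=X.columns)

# Permutation importances: mean accuracy drop on the test set per shuffled column
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
permutation = pd.Series(result.importances_mean, index=X.columns)

comparison = pd.DataFrame({"impurity": impurity, "permutation": permutation})
print(comparison.sort_values("impurity", ascending=False))
```

When both measures rank the same features at the top, that gives more confidence in conclusions drawn from the ranking.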

Conclusions

The two most important features influencing sleep efficiency are 'light sleep percentage' and 'awakenings'. Unfortunately, both are largely out of an individual's control. Age is the third most influential feature and is not something that can be changed. Bedtime is the next most influential, and it is something almost anyone can choose to change to improve their sleep efficiency.

Improving sleep efficiency involves both managing controllable lifestyle choices and understanding the uncontrollable ones. Making small changes in daily routines could lead to better sleep and overall health.