Predicting Sleep Efficiency

Click here to download the Jupyter Notebook file created for this project.

Introduction

Sleep plays an essential role in physical health and overall well-being, but many people struggle with poor sleep quality. The goal of this project is to analyze and predict sleep efficiency (the proportion of time actually spent asleep compared to time spent in bed) by building a classification model based on various lifestyle factors and sleep patterns.

This project seeks to answer:

Can sleep efficiency be accurately predicted using the other features in this dataset?
Which features most strongly influence sleep efficiency?

By addressing these questions, this project aims to reveal insights into the relationship between lifestyle and sleep quality, which could help individuals make changes that improve their sleep.

Overview of the Dataset

For this project, I’ll be working with the Sleep Efficiency Dataset from Kaggle, which captures sleep patterns and related behaviors among 452 subjects. The dataset includes demographic information, sleep duration, efficiency metrics, sleep-stage distribution, lifestyle factors such as caffeine and alcohol intake, smoking status, and exercise frequency. This detailed information will allow analysis and prediction of sleep efficiency.

Feature	Description
ID	Unique identifier for each test subject
Age	Age of the subject (in years)
Gender	Gender of the subject
Bedtime	Time subject goes to bed
Wakeup time	Time subject wakes up
Sleep duration	Total sleep duration (hours)
Sleep efficiency	Proportion of time in bed actually spent asleep
REM sleep percentage	Percentage of sleep in REM (Rapid Eye Movement) stage
Deep sleep percentage	Percentage of sleep in deep sleep stage
Light sleep percentage	Percentage of sleep in light sleep stage
Awakenings	Number of awakenings during sleep
Caffeine consumption	Caffeine intake in 24 hours prior to bedtime
Alcohol consumption	Alcohol intake in 24 hours prior to bedtime
Smoking status	Indicates if subject is a smoker ("Yes"/"No")
Exercise frequency	Frequency of exercise (scale: 0 to 5)

Pre-Processing

Before training a model, it's essential to prepare the dataset. Pre-processing ensures the data is clean and in a format suitable for machine learning algorithms.

An important first step is to check for and handle missing values:

# Count missing values across the whole DataFrame
df.isnull().sum().sum()

>np.int64(65)

The output is a NumPy integer with the value 65, so the dataset contains 65 missing values that need to be handled. Because that number is small relative to the size of the dataset, the rows containing them are dropped.

df.dropna(inplace=True)
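The total returned by `.sum().sum()` counts missing cells, not rows, so it can be useful to check how many rows `dropna()` will actually remove. The sketch below uses a small made-up DataFrame (the name `toy` and its values are illustrative assumptions, not the real dataset) to show the difference.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Awakenings": [0.0, np.nan, 2.0, 1.0],
    "Caffeine consumption": [50.0, np.nan, np.nan, 0.0],
    "Age": [25, 40, 31, 52],
})

nulls_per_column = toy.isnull().sum()                   # missing cells in each column
total_null_cells = int(nulls_per_column.sum())          # same number .sum().sum() reports
rows_with_nulls = int(toy.isnull().any(axis=1).sum())   # rows that dropna() would remove

cleaned = toy.dropna()  # returns a new frame; the original is left untouched

print(nulls_per_column)
print(total_null_cells, rows_with_nulls, len(cleaned))
```

Here three cells are missing but only two rows are affected, because one row has two missing values.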

Irrelevant features are removed next. “ID” is only a row identifier and provides no meaningful information for training the model.

# remove irrelevant features
df = df.drop(columns=["ID"])

Next, we need to examine the data type of each feature. Features with the type object require conversion into numeric or categorical formats before the model is trained.

df.dtypes

Category	Data Type
ID	int64
Age	int64
Gender	object
Bedtime	object
Wakeup time	object
Sleep duration	float64
Sleep efficiency	float64
REM sleep percentage	int64
Deep sleep percentage	int64
Light sleep percentage	int64
Awakenings	float64
Caffeine consumption	float64
Alcohol consumption	float64
Smoking status	object
Exercise frequency	float64

The output of df.dtypes shows that four features need to be converted: Gender, Bedtime, Wakeup time, and Smoking status.

In this dataset there are only two values for gender, 'Male' and 'Female', so the feature is converted into a binary format where Male = 1 and Female = 0:

df["Gender"] = df["Gender"].apply(lambda x: 1 if x == "Male" else 0)

Smoking status also has only two values ("Yes" and "No"), so it is converted to a binary format in the same way:

df["Smoking status"] = df["Smoking status"].apply(lambda x: 1 if x == "Yes" else 0)
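One thing to keep in mind with the `lambda ... else 0` pattern is that any unexpected value (a typo, different casing, a missing entry) is silently encoded as 0. As a hedged alternative, an explicit mapping with `Series.map` leaves unmapped values as NaN so they can be inspected. The toy data below is made up for illustration.

```python
import pandas as pd

# Made-up sample; note the stray lowercase "male".
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "male"],
    "Smoking status": ["Yes", "No", "No", "Yes"],
})

# Values missing from the dictionary become NaN instead of 0.
gender_encoded = toy["Gender"].map({"Male": 1, "Female": 0})
smoking_encoded = toy["Smoking status"].map({"Yes": 1, "No": 0})

# List the raw values that did not match any key in the mapping.
unexpected = toy.loc[gender_encoded.isna(), "Gender"].tolist()
print(unexpected)
```

If `unexpected` is empty, the two approaches give the same encoding.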

The Bedtime and Wakeup time features originally included dates. The date component was removed and the time of day was converted into a float (21:30 becomes 21.5). The specific date adds complexity without contributing meaningfully to the prediction, since sleep patterns depend more on daily habits than on the calendar date on which those habits occurred.

# Convert Bedtime and Wakeup time to datetime format
df["Bedtime"] = pd.to_datetime(df["Bedtime"], format="%Y-%m-%d %H:%M:%S")
df["Wakeup time"] = pd.to_datetime(df["Wakeup time"], format="%Y-%m-%d %H:%M:%S")

# remove the date component
df["Bedtime"] = df["Bedtime"].dt.time
df["Wakeup time"] = df["Wakeup time"].dt.time

# Convert time to float (21:30 equals 21.5)
df["Bedtime"] = df["Bedtime"].apply(lambda x: x.hour + x.minute / 60)
df["Wakeup time"] = df["Wakeup time"].apply(lambda x: x.hour + x.minute / 60)
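The same conversion can also be sketched with the vectorized `.dt` accessor, which skips the intermediate `datetime.time` objects. The timestamps below are made up for illustration and assume the same `%Y-%m-%d %H:%M:%S` layout as the parsing code above.

```python
import pandas as pd

# Made-up timestamps in the same layout the dataset is parsed with.
toy = pd.DataFrame({
    "Bedtime": [
        "2021-03-06 21:30:00",
        "2021-12-05 01:00:00",
        "2021-05-25 23:45:00",
    ]
})

bedtime = pd.to_datetime(toy["Bedtime"], format="%Y-%m-%d %H:%M:%S")

# Hour plus fractional minutes; the date part is simply never used.
bedtime_hours = bedtime.dt.hour + bedtime.dt.minute / 60
print(bedtime_hours.tolist())
```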

Examining Correlation Between Features

To better understand feature relationships, a correlation matrix was generated to show how strongly each feature relates to the others. Unlike the previous dataset selected for this project (which was synthetic), this one shows significant correlations between several features. The resulting correlation values are provided below and visualized using a custom D3.js component for clearer, more readable insights.

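import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlation between all (now numeric) features
correlation_matrix = df.corr()

# Annotated heatmap of the correlation matrix
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap="inferno", fmt=".2f", linewidths=0.5)
plt.show()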

Correlation Matrix

Rank	Feature 1	Feature 2	Correlation (abs)
1	Light sleep percentage	Deep sleep percentage	0.97
2	Light sleep percentage	Sleep efficiency	0.82
3	Deep sleep percentage	Sleep efficiency	0.79
4	Wakeup time	Bedtime	0.77
5	Awakenings	Sleep efficiency	0.56
6	Sleep duration	Wakeup time	0.51
7	Exercise frequency	Bedtime	0.40
8	Alcohol consumption	Sleep efficiency	0.39
9	Alcohol consumption	Light sleep percentage	0.38
10	Alcohol consumption	Deep sleep percentage	0.36
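A ranked list like the one above can be derived from a correlation matrix by keeping only the upper triangle (so each pair appears once and self-correlations are excluded) and sorting by absolute value. The helper name `rank_correlated_pairs` and the toy columns are illustrative assumptions; this is a sketch of one way to build such a table, not necessarily how the table above was produced.

```python
import numpy as np
import pandas as pd

def rank_correlated_pairs(frame, top_n=10):
    """Return the top_n feature pairs sorted by absolute correlation."""
    corr = frame.corr().abs()
    # True above the diagonal only: each pair once, no self-correlation.
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper_mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(top_n)

# Made-up data: b is an exact multiple of a, c is loosely related.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})

ranked = rank_correlated_pairs(toy)
print(ranked)
```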

Training the Model

A decision tree was selected for modeling because of its ability to handle both numeric and categorical data without much pre-processing. Decision trees also provide clear insight into how each feature affects the prediction, which makes the results of the model easier to understand and communicate.

from sklearn.model_selection import train_test_split
from sklearn import tree

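# sleep_eff_bucket is the binned sleep-efficiency label; its creation is not shown in this write-up
X = df.drop(columns="sleep_eff_bucket")  # Features
y = df["sleep_eff_bucket"]               # Target variable

# 80% train, 20% test (no random_state is set, so the split changes on every run)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape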

classifier = tree.DecisionTreeClassifier()
classifier.fit(X_train, y_train)

# Make predictions
predictions = classifier.predict(X_test)
accuracy = classifier.score(X_test, y_test)

print("Initial Model Accuracy:", accuracy)

>Initial Model Accuracy: 0.75

This is a reasonable accuracy score for an initial model. Next, we will attempt to improve it.
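The code above relies on a `sleep_eff_bucket` column whose construction is not shown. A minimal sketch of how such a target could be built is given here, assuming three buckets with cut points at 0.70 and 0.85; these thresholds, the bucket labels, and the synthetic data are all hypothetical and not taken from the project. The sketch also excludes the continuous "Sleep efficiency" column from the features, since leaving it in would let the model read the answer directly, and it fixes `random_state` and stratifies the split so that runs are comparable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the real dataset).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Awakenings": rng.integers(0, 5, size=n),
    "Light sleep percentage": rng.integers(15, 65, size=n),
})
toy["Sleep efficiency"] = (
    0.99 - 0.05 * toy["Awakenings"] - 0.004 * (toy["Light sleep percentage"] - 15)
).clip(0.5, 0.99)

# Hypothetical cut points: (0, 0.70], (0.70, 0.85], (0.85, 1.0].
toy["sleep_eff_bucket"] = pd.cut(
    toy["Sleep efficiency"],
    bins=[0.0, 0.70, 0.85, 1.0],
    labels=["low", "medium", "high"],
).astype(str)

# Drop the label and the continuous column it was derived from.
X = toy.drop(columns=["sleep_eff_bucket", "Sleep efficiency"])
y = toy["sleep_eff_bucket"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print("Accuracy on the synthetic data:", accuracy)
```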

Improving the Model

In order to improve the model we will first look at the distribution of each feature. This can help identify imbalances, outliers, or skewed data that may negatively impact model performance.

Distribution for Each Feature

The distribution for Bedtime is clustered at both ends of its range. This is because most bedtimes fall close to midnight, which is both the highest value (just under 24) and the lowest value (0). To correct this, the values are shifted by 12 hours so that midnight sits in the middle of the feature’s range.

df["Bedtime"] = (df["Bedtime"] + 12) % 24
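To see what the shift does, the sketch below applies it to a handful of made-up bedtimes on either side of midnight. Before the shift, 23.75 and 0.5 look almost 24 hours apart; afterwards they are 0.75 hours apart, which matches how close they are on the clock.

```python
import pandas as pd

# Made-up bedtimes in decimal hours, before and after midnight.
bedtime = pd.Series([21.0, 21.5, 23.75, 0.0, 0.5, 2.5])

# Same transformation as above: midnight (0.0) moves to 12.0.
shifted = (bedtime + 12) % 24

raw_spread = bedtime.max() - bedtime.min()
shifted_spread = shifted.max() - shifted.min()

print(shifted.tolist())
print(raw_spread, shifted_spread)
```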

Caffeine consumption is simplified to a binary format, where 0 means no caffeine was consumed and 1 means caffeine was consumed:

df["Caffeine consumption"] = (df["Caffeine consumption"] > 0).astype(int)

The model was retrained on the adjusted dataset and gave an accuracy score of 0.79. This is unlikely to be a meaningful improvement; because the train/test split is random, the difference most likely reflects which rows ended up in the training and testing sets.
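One way to judge whether a change of this size is noise is k-fold cross-validation, which reports the spread of accuracy across several different splits. The sketch below uses scikit-learn's `cross_val_score` on synthetic data from `make_classification`; the data and parameter choices are assumptions for illustration, not the project's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data standing in for the sleep dataset.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)

# Accuracy on each of 5 folds; every fold serves once as the test set.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X_toy, y_toy, cv=5)

print("Fold accuracies:", scores)
print("Mean:", scores.mean(), "Std:", scores.std())
```

If the gap between two versions of the model is smaller than the fold-to-fold standard deviation, it is hard to claim that one version is better.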

Simplifying the Dataset

Lastly, the dataset was simplified by removing one feature from each pair that showed a high correlation with each other:

# Drop one feature from each highly correlated pair
df = df.drop(columns=["Wakeup time", "Deep sleep percentage"])
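Candidates for removal can also be picked programmatically. The sketch below flags any column whose absolute correlation with an earlier column exceeds a threshold; the helper name `redundant_features`, the 0.75 threshold, and the toy values are assumptions for illustration. In practice the target column should be left out of the frame passed in.

```python
import numpy as np
import pandas as pd

def redundant_features(frame, threshold=0.75):
    """List columns highly correlated with a column that appears before them."""
    corr = frame.corr().abs()
    # Keep only entries above the diagonal so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Made-up values: Wakeup time closely tracks Bedtime, Age does not.
toy = pd.DataFrame({
    "Bedtime": [9.0, 10.0, 11.0, 12.0, 13.0],
    "Wakeup time": [5.0, 6.0, 7.5, 8.0, 9.0],
    "Age": [30, 25, 41, 22, 35],
})

to_drop = redundant_features(toy)
reduced = toy.drop(columns=to_drop)
print(to_drop)
```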

This final model gave an accuracy score of 0.75.

Important Features

The final step is to look at the most important features, which shows what the model relies on most when predicting ‘sleep efficiency.’

fi = classifier.feature_importances_         # feature importance array
fi = pd.Series(data=fi, index=X.columns)     # convert to a pandas Series for plotting
fi.sort_values(ascending=False, inplace=True)  # sort descending
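The `feature_importances_` attribute is computed from impurity reductions on the training data. As a hedged cross-check, scikit-learn's `permutation_importance` measures how much test accuracy drops when each feature is shuffled. The sketch below compares the two on synthetic data; the column names `feature_0` to `feature_5` and the data are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with hypothetical column names.
X_arr, y_toy = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Impurity-based importances (what feature_importances_ reports).
impurity = pd.Series(model.feature_importances_, index=X_toy.columns)
impurity = impurity.sort_values(ascending=False)

# Mean drop in test accuracy when each feature is shuffled.
perm = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
permuted = pd.Series(perm.importances_mean, index=X_toy.columns)
permuted = permuted.sort_values(ascending=False)

print(impurity)
print(permuted)
```

When both rankings agree on the top features, there is more reason to trust the ordering.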

Conclusions

The two most important features influencing sleep efficiency are 'light sleep percentage' and 'awakenings'. Unfortunately, these are largely out of an individual's control. Age is the third most influential feature and is not something that can be changed. Bedtime is the next most influential, and it is something that almost anyone can choose to change in order to improve their sleep efficiency.

Improving sleep efficiency involves both managing controllable lifestyle choices and understanding the uncontrollable ones. Making small changes in daily routines could lead to better sleep and overall health.