Published: Mar 10, 2025
Python
Machine Learning
Click here to download the Jupyter Notebook file created for this project.
Sleep plays an essential role in physical health and overall well-being, but many people struggle with poor sleep quality. The goal of this project is to analyze and predict sleep efficiency (the proportion of time actually spent asleep compared to time spent in bed) by building a classification model based on various lifestyle factors and sleep patterns.
This project seeks to answer:
By addressing these questions, this project aims to reveal insights into the relationship between lifestyle and sleep quality, which could help individuals make changes that improve their sleep.
For this project, I’ll be working with the Sleep Efficiency Dataset from Kaggle, which captures sleep patterns and related behaviors among 452 subjects. The dataset includes demographic information, sleep duration, efficiency metrics, sleep-stage distribution, lifestyle factors such as caffeine and alcohol intake, smoking status, and exercise frequency. This detailed information will allow analysis and prediction of sleep efficiency.
| Feature | Description |
|---|---|
| ID | Unique identifier for each test subject |
| Age | Age of the subject (in years) |
| Gender | Gender of the subject |
| Bedtime | Time subject goes to bed |
| Wakeup time | Time subject wakes up |
| Sleep duration | Total sleep duration (hours) |
| Sleep efficiency | Proportion of time in bed actually spent asleep |
| REM sleep percentage | Percentage of sleep in REM (Rapid Eye Movement) stage |
| Deep sleep percentage | Percentage of sleep in deep sleep stage |
| Light sleep percentage | Percentage of sleep in light sleep stage |
| Awakenings | Number of awakenings during sleep |
| Caffeine consumption | Caffeine intake in 24 hours prior to bedtime |
| Alcohol consumption | Alcohol intake in 24 hours prior to bedtime |
| Smoking status | Indicates if subject is a smoker ("Yes"/"No") |
| Exercise frequency | Frequency of exercise (scale: 0 to 5) |
Before training a model, it's essential to prepare the dataset. Pre-processing ensures the data is clean and in a format suitable for machine learning algorithms.
An important first step is to check for and handle missing values:
# count nulls
df.isnull().sum().sum()
>np.int64(65)
The output is a NumPy integer with the value 65, so the dataset contains 65 missing values that need to be accounted for. Since only 65 values are missing, the rows that contain them are dropped.
df.dropna(inplace=True)
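Note that the total above counts missing cells rather than rows, so the number of rows removed by dropna() can be smaller than the number of missing values. As a minimal sketch, assuming a small made-up frame (the values below are illustrative and not taken from the dataset), both counts can be checked like this:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Awakenings": [0.0, np.nan, 2.0, np.nan],
    "Caffeine consumption": [50.0, np.nan, np.nan, 0.0],
    "Age": [30, 41, 25, 58],
})

nulls_per_column = toy.isnull().sum()                   # missing cells in each column
total_nulls = int(nulls_per_column.sum())               # missing cells overall
rows_with_nulls = int(toy.isnull().any(axis=1).sum())   # rows that dropna() would remove

cleaned = toy.dropna()
print(nulls_per_column)
print(total_nulls, rows_with_nulls, len(cleaned))
```

Looking at the per-column counts first also shows whether the gaps are concentrated in one or two features.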
Irrelevant features such as “ID” are removed, since an identifier does not provide meaningful information when training the model.
# remove irrelevant features
df = df.drop(columns=["ID"])
Next, we need to examine the data type of each feature. Features with the type object require conversion into numeric or categorical formats before the model is trained.
df.dtypes
| Category | Data Type |
|---|---|
| ID | int64 |
| Age | int64 |
| Gender | object |
| Bedtime | object |
| Wakeup time | object |
| Sleep duration | float64 |
| Sleep efficiency | float64 |
| REM sleep percentage | int64 |
| Deep sleep percentage | int64 |
| Light sleep percentage | int64 |
| Awakenings | float64 |
| Caffeine consumption | float64 |
| Alcohol consumption | float64 |
| Smoking status | object |
| Exercise frequency | float64 |
The output of df.dtypes shows that four features need to be converted: Gender, Bedtime, Wakeup time, and Smoking status.
In this dataset, Gender has only two values, 'Male' and 'Female'. It is converted into a binary format, where Male = 1 and Female = 0:
df["Gender"] = df["Gender"].apply(lambda x: 1 if x == "Male" else 0)
Smoking status also has only two values ("Yes" and "No"), so it is converted to a binary format as well:
df["Smoking status"] = df["Smoking status"].apply(lambda x: 1 if x == "Yes" else 0)
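One caveat with the lambda approach is that any value other than "Male" or "Yes" (a typo, a different capitalization) is silently encoded as 0. A possible alternative, sketched here on a made-up frame with hypothetical dictionary names, is to use an explicit mapping so that unexpected labels surface as missing values instead:

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Smoking status": ["Yes", "No", "No"],
})

# Hypothetical helper dictionaries, not part of the original notebook
gender_codes = {"Male": 1, "Female": 0}
smoking_codes = {"Yes": 1, "No": 0}

toy["Gender"] = toy["Gender"].map(gender_codes)
toy["Smoking status"] = toy["Smoking status"].map(smoking_codes)

# map() leaves NaN for any label missing from the dictionary,
# so this check fails loudly if an unexpected value appears
assert toy[["Gender", "Smoking status"]].notna().all().all()

toy = toy.astype({"Gender": int, "Smoking status": int})
print(toy)
```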
The Bedtime and Wakeup time features originally included dates. The date component was removed and the time of day was converted into a float (21:30 becomes 21.5). The date does not meaningfully contribute to predicting sleep efficiency and only adds unnecessary complexity: sleep patterns depend more on daily habits than on the date on which those habits occurred.
# Convert Bedtime and Wakeup time to datetime format
df["Bedtime"] = pd.to_datetime(df["Bedtime"], format="%Y-%m-%d %H:%M:%S")
df["Wakeup time"] = pd.to_datetime(df["Wakeup time"], format="%Y-%m-%d %H:%M:%S")
# remove the date component
df["Bedtime"] = df["Bedtime"].dt.time
df["Wakeup time"] = df["Wakeup time"].dt.time
# Convert time to float (21:30 equals 21.5)
df["Bedtime"] = df["Bedtime"].apply(lambda x: x.hour + x.minute / 60)
df["Wakeup time"] = df["Wakeup time"].apply(lambda x: x.hour + x.minute / 60)
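The same conversion can also be written without the intermediate .dt.time step by using the vectorized .dt accessor directly. This is a minimal sketch on two made-up timestamps, assuming the same "%Y-%m-%d %H:%M:%S" format as above:

```python
import pandas as pd

# Two made-up timestamps in the same format as the dataset
toy = pd.DataFrame({"Bedtime": ["2021-03-06 21:30:00", "2021-12-05 01:00:00"]})

bedtime = pd.to_datetime(toy["Bedtime"], format="%Y-%m-%d %H:%M:%S")

# Hour plus fractional hour; the date part is simply never used
toy["Bedtime"] = bedtime.dt.hour + bedtime.dt.minute / 60

print(toy["Bedtime"].tolist())  # [21.5, 1.0]
```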
To better understand feature relationships, a correlation matrix was generated to show how strongly each feature relates to the others. Unlike the previous dataset selected for this project (which was synthetic), this one shows significant correlations between several features. The resulting correlation values are provided below and visualized using a custom D3.js component for clearer, more readable insights.
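import matplotlib.pyplot as plt
import seaborn as sns
# pairwise correlation between all (now numeric) features
correlation_matrix = df.corr()
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap="inferno", fmt=".2f", linewidths=0.5)
plt.show()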
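A ranked list of the strongest feature pairs can be derived from the correlation matrix itself. The sketch below shows one possible way to do it on a small made-up frame (the column names a, b, and c are placeholders); it is not necessarily how the table in this post was produced:

```python
import numpy as np
import pandas as pd

# Toy frame with placeholder columns (values are made up)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],   # exactly 2 * a
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

abs_corr = toy.corr().abs()

# Keep only the cells above the diagonal so each pair appears once
# and self-correlations are excluded
upper = np.triu(np.ones(abs_corr.shape, dtype=bool), k=1)

ranked = (
    abs_corr.where(upper)   # cells outside the upper triangle become NaN
    .stack()                # long format indexed by (feature 1, feature 2)
    .dropna()
    .sort_values(ascending=False)
)
print(ranked.head(10))
```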
| Rank | Feature 1 | Feature 2 | Correlation (abs) |
|---|---|---|---|
| 1 | Light sleep percentage | Deep sleep percentage | 0.97 |
| 2 | Light sleep percentage | Sleep efficiency | 0.82 |
| 3 | Deep sleep percentage | Sleep efficiency | 0.79 |
| 4 | Wakeup time | Bedtime | 0.77 |
| 5 | Awakenings | Sleep efficiency | 0.56 |
| 6 | Sleep duration | Wakeup time | 0.51 |
| 7 | Exercise frequency | Bedtime | 0.40 |
| 8 | Alcohol consumption | Sleep efficiency | 0.39 |
| 9 | Alcohol consumption | Light sleep percentage | 0.38 |
| 10 | Alcohol consumption | Deep sleep percentage | 0.36 |
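Because the goal is classification, the continuous Sleep efficiency value has to be turned into a categorical target. The modeling code in this post uses a column named sleep_eff_bucket, but the step that creates it is not shown. One possible way to build such a column is pd.cut; the cut points and labels below are hypothetical and chosen only for illustration:

```python
import pandas as pd

# Toy values standing in for the real Sleep efficiency column
toy = pd.DataFrame({"Sleep efficiency": [0.55, 0.72, 0.88, 0.95]})

# Hypothetical cut points and labels, not taken from the original notebook
bins = [0.0, 0.7, 0.85, 1.0]
labels = ["low", "medium", "high"]

# Intervals are right-inclusive by default: (0, 0.7], (0.7, 0.85], (0.85, 1.0]
toy["sleep_eff_bucket"] = pd.cut(toy["Sleep efficiency"], bins=bins, labels=labels)

# Drop the continuous column as well, so the target does not leak into the features
X = toy.drop(columns=["Sleep efficiency", "sleep_eff_bucket"])
y = toy["sleep_eff_bucket"]
print(y.tolist())  # ['low', 'medium', 'high', 'high']
```

If the continuous Sleep efficiency column were left among the features, a tree could recover the bucket from it directly, which would inflate the accuracy.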
A decision tree was selected for modeling because of its ability to handle both numeric and categorical data without much pre-processing. Decision trees also provide clear insights into how each feature affects the prediction, which makes the results of the model easier to understand and communicate.
from sklearn.model_selection import train_test_split
from sklearn import tree
X = df.drop(columns="sleep_eff_bucket") # Features
y = df["sleep_eff_bucket"] # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # 80% train, 20% test
X_train.shape, X_test.shape, y_train.shape, y_test.shape
classifier = tree.DecisionTreeClassifier()
classifier.fit(X_train, y_train)
# Make predictions
predictions = classifier.predict(X_test)
accuracy = classifier.score(X_test, y_test)
print("Initial Model Accuracy:", accuracy)
>Initial Model Accuracy: 0.75
This is a reasonable accuracy score for an initial model. Next, we will attempt to improve it.
In order to improve the model we will first look at the distribution of each feature. This can help identify imbalances, outliers, or skewed data that may negatively impact model performance.
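The distribution of Bedtime is clustered at both ends of its range. This happens because midnight, when most people are asleep, is both the highest and the lowest value on a 24-hour clock, so bedtimes just before and just after midnight end up far apart. To correct this, the values are shifted by 12 hours so that midnight sits in the middle of the feature’s range.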
df["Bedtime"] = (df["Bedtime"] + 12) % 24
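To see what the shift does, here is a small worked example on four made-up bedtimes. Times just before and just after midnight end up next to each other instead of at opposite ends of the scale:

```python
import pandas as pd

# Made-up bedtimes in hours on a 24-hour clock: 21:30, 23:00, 00:00, 01:30
bedtime = pd.Series([21.5, 23.0, 0.0, 1.5])

# Shift by 12 hours and wrap around, so midnight (0.0) maps to 12.0
shifted = (bedtime + 12) % 24

print(shifted.tolist())  # [9.5, 11.0, 12.0, 13.5]
```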
Caffeine consumption is simplified to a binary format, where 0 means no caffeine was consumed and 1 means some caffeine was consumed:
df["Caffeine consumption"] = (df["Caffeine consumption"] > 0).astype(int)
The model was trained again on the adjusted dataset and gave an accuracy score of 0.79. This is unlikely to be a significant improvement: train_test_split draws a new random split on every run, so the difference most likely just reflects changes in the training and testing data.
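One way to make comparisons like this less dependent on a single random split is k-fold cross-validation, which averages the score over several train/test partitions. This was not part of the original analysis; the sketch below uses synthetic stand-in data from make_classification (in the project, X and y from df would take its place) and fixes random_state for reproducibility:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, used only so the example runs on its own
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=4, n_classes=3, random_state=0
)

model = DecisionTreeClassifier(random_state=0)

# Accuracy on each of 5 folds; every sample is used for testing exactly once
scores = cross_val_score(model, X_demo, y_demo, cv=5)

print("Mean accuracy:", scores.mean())
print("Std across folds:", scores.std())
```

If the gap between two model variants is smaller than the spread across folds, it is hard to argue that one variant is really better.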
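Lastly, the dataset was simplified by removing one feature from each highly correlated pair: Wakeup time (correlated with Bedtime, 0.77) and Deep sleep percentage (correlated with Light sleep percentage, 0.97):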
df = df.drop(columns="Wakeup time")
df = df.drop(columns="Deep sleep percentage")
This final model gave an accuracy score of 0.75.
The final step is to look at the most important features, which shows what has the greatest influence on the model's prediction of sleep efficiency.
fi = classifier.feature_importances_  # feature importance array
fi = pd.Series(data=fi, index=X.columns)  # convert to pandas Series for plotting
fi.sort_values(ascending=False, inplace=True)  # sort descending
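The feature_importances_ attribute is impurity-based and computed on the training data. As a cross-check, scikit-learn also offers permutation importance, which measures how much the test score drops when a feature's values are shuffled. This is a different technique from the one used above, sketched here on synthetic stand-in data with placeholder feature names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with placeholder column names
X_arr, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out set and record the score drop
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)

perm = pd.Series(result.importances_mean, index=X_demo.columns)
perm = perm.sort_values(ascending=False)
print(perm)
```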
The two most important features influencing sleep efficiency are 'light sleep percentage' and 'awakenings'. Unfortunately, these are largely outside an individual's control. Age is the third most influential feature and is not something that can be changed. Bedtime is the next most influential, and it is something that most people can choose to change in order to improve their sleep efficiency.
Improving sleep efficiency involves both managing controllable lifestyle choices and understanding the uncontrollable ones. Making small changes in daily routines could lead to better sleep and overall health.