psg_project

CS4641 Summer 2022 Project - Sleep Stage Classification

Andrew Inman, Ashok Iyer, Bryce Smith, Josh Lee

Infographic

Introduction

Sleep is an important physiological process directly correlated with physical health, mental well-being, and chronic disease risk. Unfortunately, nearly 70 million Americans suffer from sleep disorders.¹ The most effective measurement of sleep quality to date is collecting polysomnography (PSG) data in a sleep laboratory and measuring the duration of sleep stages. However, sleep studies are expensive, time-consuming, and inaccessible to the majority of the population. Wearables have attempted to use heart rate data and machine learning algorithms to predict sleep stage, but suffer from low accuracy.² We intend to create a machine learning model for the automatic classification of sleep stages using a minimum viable subset of biosignals from PSG data. If successful, this model could inform the design of, and be deployed in, a simpler, more accessible sleep monitoring system that uses minimal sensors to accurately detect sleep stages.

Methodology

Dataset

Our data source is the CAP Sleep Database on PhysioNet.³ It contains PSG recordings for 108 individuals; each waveform has over 10 channels including EEGs (brain), EMGs (muscle), ECGs (heart), EOGs (eyes), and SAO2 (respiration) signals.⁴ From each voltage waveform we extracted numerical measurements taken every 1.9 milliseconds, generating a time series dataset. Additionally, for each individual, a text file provides labeled sleep stages every epoch (30 second interval) along with age, gender, and sleep disease information.

Due to significant variation in the exact signals recorded for each individual, we defined a common set of 12 signals found in most individuals: “Fp2-F4”, “F4-C4”, “C4-P4”, “P4-O2”, “C4-A1”, “ROC-LOC”, “EMG1-EMG2”, “ECG1-ECG2”, “DX1-DX2”, “SX1-SX2”, “PLETH”, and “SAO2”. The distribution of diseases in the 108 individuals is also very unbalanced, with only 16 “normal” individuals (no sleep disease), 2 individuals with Bruxism, 9 individuals with Insomnia, 5 individuals with Narcolepsy, 40 individuals with NFLE (Nocturnal Frontal Lobe Epilepsy), 10 individuals with PLM (Periodic Leg Movement), 22 individuals with RBD (REM Behavior Disorder), and 4 individuals with SDB (Sleep-Disordered Breathing). Additionally, the number of individuals with usable data varies significantly within each of these disease groups. There is significant variation in the types of signals recorded in the “normal” individuals, and only 4/16 (25%) of them have all signals in the common set. Meanwhile, a majority of the individuals with NFLE, RBD, or PLM have all signals in the common set available. In order to select an overall balanced subset of individuals such that each disease has significant (not necessarily equal) representation while also trying to capture as much data as possible, we limited the number of individuals selected from these three groups such that no disease group had over 3 times as many individuals as another.

In total, 36 individuals were selected, with 4 normal individuals, 4 individuals with insomnia, 10 individuals with NFLE, 9 individuals with PLM, and 9 individuals with RBD. Narcolepsy, Bruxism, and SDB were not considered due to a lack of available data. From here, we divided these individuals into training and testing subgroups. Our training subset had a total of 22 individuals, with 2 normal individuals, 2 individuals with insomnia, 6 individuals with NFLE, 6 individuals with PLM, and 6 individuals with RBD. Our testing subset had a total of 14 individuals, with 2 normal individuals, 2 individuals with insomnia, 4 individuals with NFLE, 3 individuals with PLM, and 3 individuals with RBD. Our unsupervised learning methods exclusively used training data; we only added testing data for use in our supervised learning methods.

After data preparation and feature extraction for these individuals, there were ~30,000 total data points (~20,000 training, ~10,000 testing) and ~47 features, most of which were engineered. The target values are discrete sleep stages (Wake, REM, NREM 1-4). An overview of the distribution of sleep stages for all individuals in the dataset is shown below.

Data Preparation

Our initial data cleaning involved converting each individual’s waveform data to numerical measurements and capturing a common set of ~10 signals. The original data for each individual had a sampling frequency of 512 Hz, meaning measurements were taken every 1.9 milliseconds. However, the sleep stage labels provided for each individual were taken every 30 seconds. Therefore, we needed a way to encapsulate our original data into 30 second chunks in order to align PSG data with sleep stages. Simply averaging measurements across each epoch is not the best way to encapsulate the data for each epoch; we implemented feature extraction methods based on previous research on effective features for each type of signal to get multiple features from each signal. Feature Extraction methods are detailed below.
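The alignment of 512 Hz samples with 30 second labels can be sketched as follows. This is a minimal illustration on synthetic data, assuming a single-channel recording held in a NumPy array; the helper name `split_into_epochs` is hypothetical and not taken from the project code.

```python
import numpy as np

FS_HZ = 512          # sampling frequency stated in the text
EPOCH_SECONDS = 30   # length of one labeled epoch

def split_into_epochs(signal, fs=FS_HZ, epoch_seconds=EPOCH_SECONDS):
    """Reshape a 1-D signal to (n_epochs, samples_per_epoch).

    Any trailing partial epoch is dropped (an assumption of this sketch).
    """
    samples_per_epoch = fs * epoch_seconds
    n_epochs = len(signal) // samples_per_epoch
    trimmed = np.asarray(signal[: n_epochs * samples_per_epoch])
    return trimmed.reshape(n_epochs, samples_per_epoch)

rng = np.random.default_rng(0)
recording = rng.normal(size=FS_HZ * 95)   # 95 s of synthetic samples
epochs = split_into_epochs(recording)
print(epochs.shape)  # (3, 15360)
```

Each row can then be passed to the per-signal feature extractors, so that one row of features corresponds to one sleep stage label.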

Due to physiological variability between PSG subjects, many of the recordings being taken years apart, and potential testing inconsistencies, such as electrode connection quality, there is high variability between the subjects’ baseline values. For example, the baseline EMG amplitude (background noise) for some subjects is significantly higher than others. Similarly, some subjects have a naturally higher heart rate than others. These differences in baselines are unique to each subject but persist through the entire recording. This was remedied by centering each individual’s data by subtracting the mean of each of their features before combining their data with the rest of the subjects’ data.
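A minimal sketch of per-subject centering, assuming the features sit in a 2-D array with one row per epoch and a parallel array of subject identifiers. The function name and the toy values are illustrative only.

```python
import numpy as np

def center_by_subject(features, subject_ids):
    """Subtract each subject's per-feature mean from that subject's rows."""
    features = np.asarray(features, dtype=float)
    subject_ids = np.asarray(subject_ids)
    centered = features.copy()
    for subject in np.unique(subject_ids):
        rows = subject_ids == subject
        centered[rows] -= features[rows].mean(axis=0)
    return centered

# Two synthetic subjects whose baselines differ by a constant offset.
features = np.array([[1.0, 60.0], [3.0, 62.0], [11.0, 80.0], [13.0, 84.0]])
subject_ids = np.array(["a", "a", "b", "b"])
centered = center_by_subject(features, subject_ids)
print(centered)
```

After centering, every subject's features have zero mean, so the combined dataset reflects within-night changes rather than between-subject offsets.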

Outliers were detected in the dataset using the Local Outlier Factor (LOF) method. This algorithm considers whether a point is an outlier among its nearest neighbors, as opposed to considering the point in relation to the entire dataset. Thus, extreme outliers due to recording errors are removed, but expected outliers, such as spikes in muscle activity, are not. For example, when heart beats are overrun by noise due to recording abnormalities, certain heart rate metrics, such as the low frequency change in heart rate, can spike, often to values multiple orders of magnitude higher than expected. Such outliers were detected and removed based on the LOF method. Other statistical outliers, such as those due to spikes in EMG (muscle) activity, are not removed using this method, which is advantageous because this type of outlier is a valid measurement that can be used to detect motion during sleep, often associated with the REM and Wake sleep stages.
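The LOF step might look like the sketch below, which uses scikit-learn's `LocalOutlierFactor` on synthetic data. The neighbor count and the default contamination setting are assumptions, since the text does not state the values used.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
# Five synthetic "recording error" points placed far from the bulk of the data.
X[:5] = [[50, 50], [-60, 40], [70, -80], [-90, -90], [100, 0]]

lof = LocalOutlierFactor(n_neighbors=20)   # n_neighbors is an assumed value
labels = lof.fit_predict(X)                # -1 marks outliers, 1 marks inliers
X_clean = X[labels == 1]
print(len(X), "->", len(X_clean))
```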

Finally, we applied robust scaling to our dataset using the interquartile range. We opted for robust scaling over standard scaling due to concerns regarding the effect of outliers on our dataset. The Box-Cox transformation was used to bring the distribution of each feature closer to a normal distribution. The results of the Box-Cox transformation applied to EOG Energy Content Band are shown below.

Before Box-Cox:

After Box-Cox:
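A rough sketch of both transformations on a synthetic, right-skewed feature is given below. Box-Cox requires strictly positive inputs, so this sketch transforms first and scales afterwards; the order used in the project, and any shift applied to make values positive, are not stated and are assumptions here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
feature = rng.lognormal(mean=0.0, sigma=1.0, size=5000)  # positive, right-skewed

# Box-Cox: lambda is estimated by maximum likelihood when not supplied.
transformed, lam = stats.boxcox(feature)

# Robust scaling: center on the median and divide by the interquartile range.
q1, median, q3 = np.percentile(transformed, [25, 50, 75])
scaled = (transformed - median) / (q3 - q1)

skew_before = stats.skew(feature)
skew_after = stats.skew(transformed)
print(f"lambda={lam:.3f}, skew before={skew_before:.2f}, after={skew_after:.2f}")
```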

Additionally, we used encoding methods to express sleep disease and sleep stage (target) in numerical form. For sleep disease, we employed “dummy” encoding; with five disease classes including “Normal”, we created four new binary variables that took value “1” if an individual had a certain disease and “0” if not, with a “Normal” individual having all four of these binary variables equal to 0. For sleep stage, we used “ordinal” encoding in which we simply assigned each sleep stage a numerical value; the “awake” stage was assigned value “0”, NREM stages 1-4 were assigned their respective stage numbers, and REM was assigned value “5”.
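The two encodings can be illustrated in a few lines. The label strings below ("W", "S1", "Normal", and so on) are placeholders chosen for this sketch and may not match the labels in the dataset files.

```python
# Ordinal encoding of the target: Wake = 0, NREM 1-4 = 1..4, REM = 5.
STAGE_CODES = {"W": 0, "S1": 1, "S2": 2, "S3": 3, "S4": 4, "REM": 5}

# Dummy encoding of disease: "Normal" is the reference class (all zeros).
DISEASE_COLUMNS = ["Insomnia", "NFLE", "PLM", "RBD"]

def encode_disease(disease):
    """Return four binary indicators, one per non-normal disease class."""
    return [int(disease == name) for name in DISEASE_COLUMNS]

def encode_stage(stage):
    """Map a sleep stage label to its ordinal code."""
    return STAGE_CODES[stage]

print(encode_disease("Normal"), encode_disease("PLM"), encode_stage("REM"))
```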

Feature Engineering

Feature extraction methods for each type of signal from the PSG data are described:

EEG: EEG (electroencephalogram) is a technique used to detect electrical activity in the brain. Manual sleep stage classification is largely dependent on the fraction of brain waves with specific frequencies (e.g., delta waves with a frequency of 1 - 4 Hz) and secondary time-domain features. In our dataset, available EEG signals differ slightly between individuals, but broadly follow the International 10-20 System. Extensive literature exists on useful EEG features, so a subset of suggested features was selected. First, the time-domain EEG signal was transformed into the frequency domain using Welch’s method (see image below), and the power in the frequency band of each brain wave was computed. Second, multiple entropy-based metrics (i.e., metrics conveying the amount of information given by a signal) were computed. Finally, several more sophisticated time-domain metrics (e.g., Petrosian fractal dimension) were calculated. In total, thirteen unique features were computed using the provided EEG signals. All EEG features were averaged across each individual’s EEG channels.
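The band power computation can be sketched with `scipy.signal.welch`. Only the delta band (1 - 4 Hz) is given in the text; the other band edges, the 4 second Welch segment, and the normalization to relative power are assumptions of this example.

```python
import numpy as np
from scipy.signal import welch

FS = 512
# Delta edges come from the text; the remaining edges are common conventions.
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 12), "beta": (12, 30)}

def relative_band_powers(epoch, fs=FS):
    """Estimate the PSD with Welch's method and return each band's share of power."""
    freqs, psd = welch(epoch, fs=fs, nperseg=4 * fs)
    resolution = freqs[1] - freqs[0]
    powers = {}
    for name, (low, high) in BANDS.items():
        in_band = (freqs >= low) & (freqs < high)
        powers[name] = psd[in_band].sum() * resolution
    total = sum(powers.values())
    return {name: power / total for name, power in powers.items()}

# Synthetic 30 s epoch dominated by a 2 Hz (delta) oscillation.
rng = np.random.default_rng(0)
t = np.arange(30 * FS) / FS
epoch = np.sin(2 * np.pi * 2 * t) + 0.1 * rng.normal(size=t.size)
powers = relative_band_powers(epoch)
print(max(powers, key=powers.get))  # delta
```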

ECG & PPG: ECG (electrocardiogram) and PPG (photoplethysmogram) are two methods used to record heart beats during the sleep studies. First, Python’s heartpy library was used to detect heart beats (see image below).⁵ Once heart beats were located, heart rate could be calculated. Beyond heart rate, an informative set of metrics consists of those that quantify variation in heart rate. The root mean square of the differences in time between adjacent heart beats (RMSSD) is one measure of heart rate variability, which is useful in our application because it can be meaningfully calculated over short time periods, such as 30 second epochs.⁶ Heart rate changes in the frequency domain, specifically “low frequency” changes (0.04-0.15 Hz) and “high frequency” changes (0.15-0.5 Hz), have been observed to vary with sleep stage, so these were also computed using the implementation in the heartpy library.⁷,⁸
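For reference, heart rate and RMSSD follow directly from beat times. The sketch below computes them with NumPy from hand-written beat times; it is a standalone illustration of the formulas and does not reproduce the heartpy pipeline used in the project.

```python
import numpy as np

def heart_rate_bpm(peak_times_s):
    """Mean heart rate from the intervals between successive beats."""
    rr_ms = np.diff(peak_times_s) * 1000.0
    return float(60000.0 / rr_ms.mean())

def rmssd_ms(peak_times_s):
    """Root mean square of successive differences between beat intervals."""
    rr_ms = np.diff(peak_times_s) * 1000.0
    return float(np.sqrt(np.mean(np.diff(rr_ms) ** 2)))

# Beat times in seconds; the intervals alternate between 1000 ms and 900 ms.
peak_times = np.array([0.0, 1.0, 1.9, 2.9, 3.8])
hr = heart_rate_bpm(peak_times)
rmssd = rmssd_ms(peak_times)
print(round(hr, 1), round(rmssd, 1))  # 63.2 100.0
```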

EMG: EMG (electromyography) is a method for measuring electrical activity of muscles. The main metric used to quantify the EMG activity was energy, calculated as the sum of squared differences between each point and the sample mean, divided by the number of samples.⁹ Progression into deeper stages of sleep is typically correlated with a decrease in muscle tone, which corresponds to a decrease in baseline EMG energy, but REM sleep is also associated with brief spikes in muscle activity (see image below).¹⁰ To capture these transient spikes in EMG energy that were “averaged out” over an entire 30 second epoch, a moving average with a five second window was applied over each second, and the average of the five highest windows was recorded within each 30 second epoch.¹¹
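One possible reading of the EMG features is sketched below: energy is computed per second, smoothed with a five second moving average stepped one second at a time, and the five largest windows are averaged. The exact windowing used in the project is not specified, so this interpretation and the function names are assumptions.

```python
import numpy as np

FS = 512

def emg_energy(samples):
    """Mean squared deviation from the sample mean."""
    samples = np.asarray(samples, dtype=float)
    return float(np.mean((samples - samples.mean()) ** 2))

def emg_peak_energy(epoch, fs=FS, window_s=5, top_k=5):
    """Average of the top_k highest moving-average energy windows in an epoch."""
    per_second = np.array([emg_energy(sec) for sec in np.reshape(epoch, (-1, fs))])
    kernel = np.ones(window_s) / window_s
    windows = np.convolve(per_second, kernel, mode="valid")
    return float(np.sort(windows)[-top_k:].mean())

# Synthetic epoch: low-level noise with a 2 s burst of muscle activity.
rng = np.random.default_rng(0)
epoch = rng.normal(scale=1.0, size=30 * FS)
epoch[10 * FS : 12 * FS] = rng.normal(scale=10.0, size=2 * FS)

overall = emg_energy(epoch)
peak = emg_peak_energy(epoch)
print(peak > overall)  # True
```

The whole-epoch energy dilutes the burst across 30 seconds, while the peak-window feature stays large, which is the behavior the text describes.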

EOG: EOG (electrooculography) is used to detect activity within the human eye. One study aimed at Human-Computer Interaction applications mentioned a few useful features that were extracted from EOG signals, including: Maximum Peak Amplitude, which measures the maximum positive amplitude, Maximum Valley Amplitude, which measures the maximum negative amplitude, Area Under Curve, which is a summation of the absolute values of amplitude under positive and negative curves, and Signal Variance.¹² All of these metrics were calculated within each epoch. Another study that focused specifically on sleep staging estimated the Power Spectrum for the EOG signal and calculated the Energy Content Band by integrating this function over the frequency range 0.35-0.5 Hz, where REM activity is concentrated.⁹ Using a Welch method to estimate the power spectrum, we calculated the Energy Content Band for each epoch.
SAO2: SAO2 (or SPO2) refers to a blood-oxygen saturation reading, which indicates the percentage of hemoglobin molecules that are saturated with oxygen. Readings can vary from 0% to 100%. Normal readings range from 94% to 100%. Literature suggests readings below 50% are artifacts. Related literature on sleep staging using oximetry data engineered features by taking the peaks of each time period and the percentage of time spent above a certain threshold.¹³ We followed suit with our data by taking the maximum of each epoch and the percentage of time spent above 70%, 80%, and 90% oxygen saturation by epoch. In addition, we included the average oxygen saturation of each epoch.
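The oximetry features described above reduce to a few array operations per epoch. The following sketch assumes the readings for one epoch are already in percent; the feature names are made up for illustration, and no artifact handling is included.

```python
import numpy as np

def sao2_features(epoch, thresholds=(70, 80, 90)):
    """Maximum, mean, and percentage of samples above each saturation threshold."""
    readings = np.asarray(epoch, dtype=float)
    features = {"max": float(readings.max()), "mean": float(readings.mean())}
    for threshold in thresholds:
        features[f"pct_above_{threshold}"] = float(100.0 * np.mean(readings > threshold))
    return features

# Toy epoch: two thirds of the samples at 95%, one third at 85%.
epoch = np.array([95.0] * 20 + [85.0] * 10)
features = sao2_features(epoch)
print(features)
```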

Feature Selection

After feature extraction, the correlation and mutual information methods were used to eliminate unnecessary features.

Correlation Method: Correlated features were detected and removed using the method proposed by Kuhn and Johnson.¹⁴ This method involves first calculating a correlation matrix for the data. Then, correlations are assessed pairwise. For any pair of features with a correlation above a set threshold (0.8 was used here), the feature in this pair with the larger average correlation between itself and every other feature was removed. This method eliminated 13 features, as shown in the results section.
Mutual Information Method: After using the correlation method, we calculated the normalized mutual information between each feature and the sleep stages (target values) and defined four feature sets: the top 5, 10, 20, and 30 features with the greatest normalized mutual information values. Ultimately, only the top 5 and top 10 sets have been used since they yield the strongest performance metrics.
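The correlation filter can be sketched as an iterative procedure: find the most correlated pair, drop the member with the larger mean absolute correlation, and repeat until no pair exceeds the threshold. This is a compact interpretation of the Kuhn and Johnson procedure written for illustration, not the project's code.

```python
import numpy as np

def drop_correlated(X, threshold=0.8):
    """Return the column indices kept after removing highly correlated features."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(corr, 0.0)
    keep = list(range(corr.shape[0]))
    while len(keep) > 1:
        sub = corr[np.ix_(keep, keep)]
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        if sub[i, j] <= threshold:
            break
        # Drop whichever member of the pair is more correlated with everything else.
        drop = i if sub[i].mean() >= sub[j].mean() else j
        keep.pop(drop)
    return keep

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.01 * rng.normal(size=200)   # near-duplicate of a
c = rng.normal(size=200)              # independent feature
kept = drop_correlated(np.column_stack([a, b, c]))
print(kept)
```

In this toy case one of the two near-duplicate columns is removed and the independent column survives.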

Dimensionality Reduction

After feature selection, two methods were employed to reduce the dimensionality of data - Principal Component Analysis (PCA) and T-Distributed Stochastic Neighbor Embedding (TSNE). Broadly, PCA linearly transforms combinations of features such that variance is maximized along each principal component (i.e., axis). TSNE is a more sophisticated dimensionality reduction technique that is able to account for nonlinear features in data. Both techniques were employed on the four feature groups (i.e., top 5, 10, 20, & 30 features) and were used to reduce to 1, 2, 3, and 4 components. Isomap, a non-linear, manifold-based dimensionality reduction algorithm was also applied to the dataset. This algorithm differs from PCA in that it does not assume that datapoints that are close together in Euclidean space are meaningfully similar. Because the method did not yield noticeably improved differentiation of sleep stages in the reduced feature space compared to PCA, these results were not used with the unsupervised learning algorithms.
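A brief sketch of the two main reductions with scikit-learn, run on random data standing in for a selected feature set. The component counts and the random seed are arbitrary choices for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))   # stand-in for the "top 10" feature set

pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)     # linear projection onto 3 principal components

tsne = TSNE(n_components=2, random_state=0)
X_tsne = tsne.fit_transform(X)   # nonlinear 2-D embedding

print(X_pca.shape, X_tsne.shape)
print(pca.explained_variance_ratio_)
```

Note that PCA learns a projection that can be reused on new data through `transform`, whereas scikit-learn's `TSNE` only offers `fit_transform`, so an embedding has to be computed for all points being compared at once.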

Unsupervised Learning

Following dimensionality reduction, we applied several unsupervised learning methods to our training data, including K-Means, GMM, and DBSCAN. To determine the quality of our clustering, we used the external measures of homogeneity, F1 score, normalized mutual information, Rand Statistic, and Fowlkes-Mallows measure. These external measures were selected because they quantify how well the clusters found by the unsupervised learning methods represent the known sleep stages (target values). Normalized mutual information was selected to quantify how much information we gather about our targets by knowing the unsupervised cluster assignments. The Rand Statistic was applied to capture the accuracy of the cluster assignments. F1 and Fowlkes-Mallows were both selected because they represent both precision and recall. In our application, false positives and false negatives have similar impact, so a balance of precision and recall is desired. In order to calculate these clustering metrics, the sleep stages were taken as the “ground-truth” assignments, and each cluster was assigned a sleep stage based on the sleep stage of the majority of points in that cluster. We defined the “predicted” label of a point as the sleep stage of the cluster that it was assigned to.
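The majority-vote labeling and the external metrics can be sketched as follows, using synthetic blobs in place of the sleep data. The weighted averaging for F1 is an assumption, as the text does not say how per-class scores were combined.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (f1_score, fowlkes_mallows_score, homogeneity_score,
                             normalized_mutual_info_score, rand_score)

def majority_vote_labels(clusters, stages):
    """Give every point the most common true stage within its cluster."""
    predicted = np.empty_like(stages)
    for cluster in np.unique(clusters):
        members = clusters == cluster
        values, counts = np.unique(stages[members], return_counts=True)
        predicted[members] = values[np.argmax(counts)]
    return predicted

X, stages = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=0.5, random_state=0)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
predicted = majority_vote_labels(clusters, stages)

scores = {
    "homogeneity": homogeneity_score(stages, clusters),
    "nmi": normalized_mutual_info_score(stages, clusters),
    "rand": rand_score(stages, clusters),
    "fowlkes_mallows": fowlkes_mallows_score(stages, clusters),
    "f1": f1_score(stages, predicted, average="weighted"),
}
print(scores)
```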

K-Means: K-Means was applied to our dataset via the sklearn implementation. As K-Means is notoriously sensitive to outliers, we expected suboptimal results. Thus, we explored methods similar to K-Means, such as K-Medians and K-Medoids, which are both more resistant to outliers.¹⁵ K-Means gave the baseline behavior, while K-Medoids was chosen because it is the most outlier-resistant due to the nature of its cluster center selection. We utilized the elbow method and found that 3 clusters was optimal for K-Means & K-Medoids (see image below). It should be noted that 6 clusters are expected for our dataset, as this would capture each stage of sleep. Thus, we ran K-Means & K-Medoids with both 3 and 6 clusters. Because the goal of our algorithm is to distinguish between all 6 sleep stages, the primary results shown here will be those achieved using 6 clusters. The Ancillary Results section shows K-Means applied with 3 clusters and a simplified interpretation of the sleep stages as Awake, NREM, and REM.

GMM: GMM was applied to our dataset via the sklearn implementation. Like K-Means, the most important parameter for GMM is the specified number of clusters. For our data, six clusters are used, corresponding to the 6 stages of sleep. The Ancillary Results section shows the performance of the algorithm when applied to only 3 clusters.
DBSCAN: DBSCAN was applied using the implementation in the sklearn package. The critical parameters to set for the algorithm are epsilon, or the maximum radius of a neighborhood around a point, and MinPts, the minimum number of points required to be in a point’s epsilon neighborhood for that point to be considered a core point. The starting value of MinPts was determined based on the dimensionality of the data being clustered, using the rule of thumb that in noisy datasets, a MinPts of 2×D (twice the number of dimensions) is often appropriate. Epsilon was estimated using the distance to the 4 nearest neighbors of each point (see image below). These distances were sorted and plotted, yielding a graph that shows a flat region followed by a sharp increase in distance to outliers. A starting value of epsilon was selected as a value in the flat region of this graph, and it was adjusted further in steps of 0.1 to increase the clustering metrics.
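The DBSCAN parameter heuristics can be sketched as below on synthetic blobs. In the project, epsilon was read off the k-distance plot by eye; taking the 90th percentile of the sorted distances is a stand-in chosen only so that the example runs without a plot.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

k = 4
# Request k + 1 neighbors because each point is returned as its own nearest neighbor.
distances, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
k_distances = np.sort(distances[:, -1])       # sorted distance to the 4th neighbor

eps = float(np.percentile(k_distances, 90))   # assumed stand-in for the visual choice
min_pts = 2 * X.shape[1]                      # rule of thumb: 2 x dimensionality

labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
n_clusters = len(set(labels) - {-1})          # label -1 marks noise points
print(eps, min_pts, n_clusters)
```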

Supervised Learning

Based on our unsupervised learning results and the imbalance in distribution of sleep stages, NREM stages 1-4 were consolidated into a single class before applying supervised learning algorithms. Many ML sleep staging studies, such as one by Satapathy¹⁶, build predictive models with a consolidated NREM class. This results in approximately 70% of datapoints having the target label ‘NREM’, which causes some algorithms to largely ignore the minority classes Wake and REM when making predictions in an attempt to optimize overall accuracy. Therefore, before running certain algorithms, undersampling and/or oversampling techniques were applied to our training data. This helps reduce artificial inflation of accuracy from data imbalances. Undersampling and oversampling techniques were applied after dimensionality reduction, and they were only applied to the training dataset. The supervised learning classification algorithms we applied include Naive Bayes, Logistic Regression, Random Forest, SVM, and LSTM Neural Network.

Undersampling: The undersampling technique used was the Neighborhood Cleaning Rule (NCR) as implemented in the Imbalanced Learn library. This technique assesses the nearest neighbors (3 neighbors were used) of each datapoint, and removes the data points for which all neighbors are not in the same class. In addition, NCR runs a 3 nearest neighbors classifier and removes data points not belonging to the predicted class.
Oversampling: The oversampling technique used was the Synthetic Minority Oversampling Technique (SMOTE). This method uses interpolation between data points in the same class to generate prototype data points for the class. This was used to add additional data points to the minority classes (Wake and REM) until they matched the number of points in the NREM class.
Naive Bayes: Gaussian Naive Bayes was applied using the implementation in sklearn. The best results were obtained by applying dimensionality reduction using a 5-component PCA and then isolating the 3rd, 4th, and 5th principal components. Following dimensionality reduction, one challenge in training an effective Naive Bayes classifier was overlapping data points from different classes. Even in regions that were primarily composed of a single sleep stage, there were often noisy data points from other sleep stages interspersed in these regions, so NCR undersampling was applied to clean these regions in the training dataset. The next challenge arose as a result of an imbalance between the relatively few REM data points and the more common Wake and NREM data points. SMOTE oversampling was applied following PCA to increase the number of the points in the minority classes.
Logistic Regression: Logistic regression was applied using the implementation in sklearn. The one-vs-rest technique was used to classify the data into three groups, and for each binary classification arising in the one-vs-rest comparisons, a probability threshold of 0.5 was used as the cutoff for predicting one class over another. As with the Naive Bayes classifier, prior to running the logistic regression, NCR undersampling and SMOTE oversampling were applied to the dimensionality reduced data.
Random Forest: After applying dimensionality reduction followed by undersampling to our training data using the Neighborhood Cleaning Rule, Random Forest was applied to our dataset via the sklearn implementation. By default, the function fits 100 decision trees, each on a bootstrap sample (random selection with replacement) of our training data, and expands each tree until every leaf node is pure. However, this expansion leads to major overfitting on our training data. One way to resolve this issue is through pruning, which eliminates certain branches of each tree at the cost of greater impurity of nodes. Adjusting the ccp_alpha parameter in the sklearn implementation is one method of pruning. Using the sklearn default of 100 trees, we can determine the optimal ccp_alpha by finding the lowest value that maximizes prediction accuracy in our testing data. The plots below show parameter tuning using our top 5 dataset after applying a two-dimensional TSNE algorithm.

Based on the plot above, the optimal ccp_alpha value is 0.0035. Random Forests of various sizes were fitted to our training data and their prediction accuracies on testing data were checked to determine an appropriate number of trees.

Based on the plot above, prediction accuracy stabilizes significantly as the size of a Random Forest increases. Since the prediction accuracy is mostly stable for Random Forests comprised of over 50 decision trees, a forest of 100 decision trees is appropriate.

Another parameter that is often tuned in Random Forest is the number of features used in each decision tree, but this was not found to be helpful in improving accuracy for the top 5 data after a two-dimensional TSNE, which is logical given the small dimensionality.
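The ccp_alpha sweep can be sketched as a simple loop over candidate values. The data here are synthetic, the candidate list is arbitrary apart from including 0.0035, and the held-out split stands in for the project's testing data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)
X_train, X_held, y_train, y_held = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)

alphas = [0.0, 0.001, 0.0035, 0.01, 0.03]   # candidate pruning strengths
scores = []
for alpha in alphas:
    forest = RandomForestClassifier(n_estimators=100, ccp_alpha=alpha, random_state=0)
    forest.fit(X_train, y_train)
    scores.append(forest.score(X_held, y_held))

# np.argmax returns the first maximum, i.e. the lowest alpha with the best accuracy.
best_alpha = alphas[int(np.argmax(scores))]
print(best_alpha, scores)
```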

SVM with Linear Kernel: Support vector machines are robust, supervised models commonly used for classification and regression of labeled data. A support vector machine with a linear kernel was applied to the dataset after PCA dimensionality reduction to 1 - 5 components. No undersampling or oversampling was used. Based on unsupervised learning results, this technique was only applied to PCA results from the Top 5 and Top 10 feature sets. Additionally, the same training and testing sets were used as in the preceding supervised learning techniques. Finally, accuracy and the F1 metric were compared across the numbers of PCA components and between the Top 5 and Top 10 feature sets. Furthermore, confusion matrices were used to compare performance.
SVM with RBF Kernel: A support vector machine with a radial basis function (RBF) kernel was applied to the dataset after TSNE dimensionality reduction to two dimensions. This was completed both with and without NCR undersampling and SMOTE oversampling. The parameters for SVM with the RBF kernel are a misclassification term, C, which allows for more misclassification at lower values, and gamma, the positive coefficient of the exponent in the RBF kernel function. In order to set these parameters, a grid search was performed. This involved testing every combination of these parameters over a specified range (C from 0.1 to 100 and gamma from 0.0001 to 1). For each parameter combination, a 5-fold cross validation was performed, and the combination achieving the best accuracy was selected. A heat map showing the accuracy achieved with each parameter combination is shown below. During this grid search, only the designated training data was used. The selected parameters were then used to apply the SVM model to the designated testing data, and slight adjustments were made manually to further improve accuracy.
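A grid search of this kind can be sketched with scikit-learn's `GridSearchCV`. The ranges match those given in the text, but the specific grid points and the synthetic data are assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [-6, 6]],
                  cluster_std=1.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.0001, 0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)   # only the training data is used for the search

# Mean cross-validated accuracy arranged as rows = C, columns = gamma (heat map data).
heat = search.cv_results_["mean_test_score"].reshape(
    len(param_grid["C"]), len(param_grid["gamma"]))

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```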

LSTM Neural Network: A long short-term memory (LSTM) network is a recurrent neural network that carries information from previously processed data forward, so that each prediction depends on earlier inputs as well as the current one. Epochs, learning rate, and the number of hidden layers were tuned to find the best-performing model. In addition, we experimented with adding different convolutional and pooling layers, as suggested by related literature. These changes failed to improve the model due to overfitting. Our final hyperparameters were as follows: a batch size of 1, 2 hidden layers, 1000 epochs, a learning rate of 0.001, and a ReLU activation function. Adam was used as our optimization algorithm. Our data was processed by TSNE with 2 components.
MLP Neural Network: After applying dimensionality reduction followed by undersampling to our training data using the Neighborhood Cleaning Rule, a Multi-Layer Perceptron (MLP) neural network was applied using the sklearn implementation. Experimenting with the various parameters of the function, such as activation function, number of hidden layers, number of neurons in each layer, type of weight optimization solver, and regularization term size, did not result in a consistent, noticeable improvement in accuracy or F1 score. As a result, the model fitted on the data used the default parameters provided in the sklearn function, which include a ReLU hidden layer activation function, 1 hidden layer with 100 neurons, and a regularization term of 0.0001. The weight optimization solver used was stochastic gradient descent based.

Results

Feature Engineering & Selection

The feature engineering and selection process discussed in the Methodology section was followed. As discussed, we selected the top 5, 10, 20, and 30 features and treated these as separate sets for the dimensionality reduction & unsupervised learning tasks. As an example, the image of the correlation matrix below shows the original 37 features (left) being reduced to a set of the 30 features with the lowest correlation and highest mutual information (right).

Figure 1: Correlation Heat Map Before and After Eliminating Highly Correlated Features

After removing highly correlated features, normalized mutual information with the target values (sleep stages) was calculated, and the most informative features were selected.

Figure 2: Remaining Features Sorted by Normalized Mutual Information Values