Why validation strategy matters, or how to get the perfect score with session fingerprints

While researching articles on EEG-based depression detection, we found a study [1] that reported 98% accuracy in detecting depression. This finding looked very promising, especially because we could apply the same method to our own dataset. Successfully reproducing that prediction quality in a depression detection model of our own would achieve half of our project’s research goals.

Table from [1]

The dataset used by Acharya, Subha, and colleagues [1] had the following attributes: it comprised 15 normal and 15 depressed subjects, all aged 20 to 50. Bipolar EEG signals were recorded from the left half (FP1-T3 channel pair) and the right half (FP2-T4 channel pair) of the brain. EEG was recorded from each subject for 5 minutes with eyes open and with eyes closed (resting state). The TD-BRAIN dataset we accessed has similar attributes, which allowed us to replicate that study. It includes 1,030 subjects (18 to 88 years old), of whom approximately 20% are diagnosed with major depressive disorder (MDD). EEG signals were recorded from 26 channels for 2 minutes with eyes open and with eyes closed (resting state).

As I wrote in a previous post, we took on the task of predicting sex first. Our hypothesis was that if we could reach an accuracy comparable to the 98% reported in the article above, but for sex prediction, we could then transfer the learned weights to a new model aimed at predicting depression. We trained the model, measured the quality of its predictions, and failed, reaching an accuracy of around 60%. To understand these results, we had to find the source of the low accuracy, and we considered whether it could come from using a different data validation strategy.

A common approach is to divide the available data into three sets: training, validation, and testing. Typically, 10-20% of the data goes to the validation and test sets, while the rest is used for training.
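As a minimal sketch of such a three-way split (the arrays and the 60/20/20 proportions below are purely illustrative, not our actual pipeline), two calls to scikit-learn's train_test_split are enough:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 samples with 10 features and binary labels.
X = np.random.randn(1000, 10)
y = np.random.randint(0, 2, size=1000)

# First cut off 60% for training, then split the remaining 40% in half,
# which yields 20% for validation and 20% for testing.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42)
```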

In one of the studies we analyzed, the authors describe the validation scheme they used [2]. We followed a broadly similar strategy with a 60/20/20 ratio. Initially, we suspected errors in our code or in the dataset as the reason for the difference in the final result. However, it turned out that the difference lies in how the dataset is built and split, and to see why, one has to look at the logical structure of the data:

Imagine how a study goes. Participants are asked to wash their hair shortly before the session and come to a clinic, where an assistant puts a cap with electrodes on their head. The assistant usually applies a special electrically conductive gel and performs checks according to the study protocol. If the cap is removed and put on again, the electrodes shift, which interferes with the results. Each time the cap is put on, the gel is applied slightly differently, which may change conduction. Other factors also come into play if a participant comes in for testing on a different day, such as hairstyle and mood. Numerous confounds can affect the data obtained in multiple sessions from the same person.

Within a session in the TD-BRAIN dataset there are two records, one taken with eyes open (EO) and one with eyes closed (EC). With eyes open, the participant must look at a point on the screen. With eyes closed, they are also instructed either to think of a specific thing or to try not to think about a specific thing. In short, the research we are conducting requires a deep understanding of the data.

Data structure can be explored at different conceptual levels; let’s focus on the segment and subject levels, where we will see a big difference. First, let’s clarify the terms. A segment is a small piece of a single session. In the neuroscience literature a segment is often called an “epoch”, but as a data scientist I’ll abstain from that term, because in our world “epoch” is very common and has a completely different meaning. For example, a segment could be a four-second part of a two-minute eyes-open record.
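To make the term concrete, here is a small sketch of cutting one record into segments. The sampling rate and segment length are assumptions chosen for illustration; only the 26-channel count comes from the dataset description above.

```python
import numpy as np

SFREQ = 500          # sampling rate in Hz (an assumption for illustration)
SEGMENT_SEC = 4      # segment length in seconds (also an assumption)
N_CHANNELS = 26      # channel count mentioned for TD-BRAIN above

# A fake two-minute eyes-open record: (channels, time samples).
record = np.random.randn(N_CHANNELS, 2 * 60 * SFREQ)

# Cut the record into non-overlapping 4-second segments.
samples_per_segment = SEGMENT_SEC * SFREQ
n_segments = record.shape[1] // samples_per_segment
segments = (record[:, :n_segments * samples_per_segment]
            .reshape(N_CHANNELS, n_segments, samples_per_segment)
            .transpose(1, 0, 2))   # (n_segments, channels, samples)

print(segments.shape)  # (30, 26, 2000) with these illustrative parameters
```

In practice segments may also overlap, which increases the count per record; the sketch uses non-overlapping windows for simplicity.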

When choosing a validation strategy, we had at least two options:

  • “Cross-segment split”: the data is split into train, validation, and test parts at the segment level. These parts can contain segments from the same session, with the same eye condition, of the same subject.
  • “Cross-subject split”: the data is split at the subject level rather than at the level of time within a session, so the training, validation, and test sets contain data from different participants.

First, we used the cross-subject split. We took all 1,030 subjects and distributed them among the training, validation, and test sets in a 60/20/20 proportion, with stratification by sex and age. All segments from all sessions of a given subject were assigned to the same subset as that subject. As I mentioned at the beginning, this approach produced a poor metric value.
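A rough sketch of that subject-level split, assuming a hypothetical metadata table (the column names, age bins, and random data below are illustrative, not the actual TD-BRAIN schema):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical subject table with 1030 entries; columns are illustrative.
rng = np.random.default_rng(0)
subjects = pd.DataFrame({
    "subject_id": np.arange(1030),
    "sex": rng.choice(["F", "M"], size=1030),
    "age": rng.integers(18, 89, size=1030),
})
subjects["age_bin"] = pd.cut(subjects["age"], bins=[17, 35, 55, 88])
strata = subjects["sex"].astype(str) + "_" + subjects["age_bin"].astype(str)

# Split the *subjects* 60/20/20, stratified by sex and age bin.
train_ids, tmp_ids = train_test_split(
    subjects["subject_id"], test_size=0.4, stratify=strata, random_state=42)
valid_ids, test_ids = train_test_split(
    tmp_ids, test_size=0.5, stratify=strata.loc[tmp_ids.index], random_state=42)

# Every segment inherits its subject's subset, e.g. for a hypothetical
# table with one row per segment:
segments = pd.DataFrame({"subject_id": rng.choice(1030, size=5000)})
train_segments = segments[segments["subject_id"].isin(train_ids)]
```

Because the split happens before individual segments are even considered, no subject can appear in more than one of the three sets.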

Then we applied the cross-segment split. We took all segments from the same subjects (approximately 254k segments from 1,094 sessions with 117 segments each) and randomly distributed them among the training, validation, and test sets in a 60/20/20 proportion. As you may have guessed, with the cross-segment split we achieved an accuracy of 99.99% with a small neural net!
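For contrast, here is a sketch of the segment-level shuffle, with a quick check of how many subjects end up on both sides of the split (the segment counts and subject assignments below are synthetic, not the real dataset index):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic segment index: one entry per segment, tagged with its subject.
rng = np.random.default_rng(0)
n_segments = 254_000                     # roughly the count quoted above
subject_of_segment = rng.integers(0, 1030, size=n_segments)

# Cross-segment split: shuffle segments with no regard for their subject.
seg_idx = np.arange(n_segments)
train_idx, tmp_idx = train_test_split(seg_idx, test_size=0.4, random_state=42)
valid_idx, test_idx = train_test_split(tmp_idx, test_size=0.5, random_state=42)

# Leakage check: how many subjects appear in both train and test?
overlap = np.intersect1d(subject_of_segment[train_idx],
                         subject_of_segment[test_idx])
print(f"{len(overlap)} of 1030 subjects appear in both train and test")
```

With a segment-level shuffle essentially every subject leaks into every subset, so the network only needs to recognize the session fingerprint it has already seen during training, which is exactly what the title alludes to.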

We drew the following conclusions. First, when selecting and analyzing articles with the aim of replicating results or validating a particular model, we need to pay close attention to the study methodology, and specifically to whether a cross-subject or cross-segment split was used. Our most important finding is that we cannot expect a high level of accuracy when replicating studies that used the cross-segment split. Furthermore, we consider the results interesting only when the cross-subject split is used. I’ll explain why we find them interesting and what significance this has for our project in the next post.

References

[1] Acharya, U. R., Oh, S. L., Hagiwara, Y., Tan, J. H., Adeli, H., & Subha, D. P. (2018). Automated EEG-based screening of depression using deep convolutional neural network. Computer Methods and Programs in Biomedicine, 161, 103–113. https://doi.org/10.1016/j.cmpb.2018.04.012

[2] Ay, B., Yildirim, O., Talo, M., Baloglu, U. B., Aydin, G., Puthankattil, S. D., & Acharya, U. R. (2019). Automated depression detection using deep representation and sequence learning with EEG signals. Journal of Medical Systems, 43(7), 205. https://doi.org/10.1007/s10916-019-1345-y
