Globally, depression affects over 300 million people at any given time and is the leading cause of disability. While different patients may benefit more from different therapies, there is no principled way for clinicians to predict individual patient responses or side effect profiles. Deep learning is a form of machine learning based on artificial neural networks and might be useful for generating a predictive model that could aid in clinical decision making. Such a model’s primary aims would be to help clinicians select the most effective treatment plans and mitigate adverse side effects, which would allow doctors to provide more personalized care to a larger number of patients. In this commentary, we discuss the need for the personalization of depression treatment and how a deep learning model could be used to construct a clinical decision aid.
Tags: deep learning, mental health, psychiatry, depression, personalized medicine, precision medicine
Major depression is a debilitating health condition that affects 11.1% of people over the course of their lives and is among the leading causes of Disability Adjusted Life Years (DALYs) lost globally [1,2]. At any one time, over 322 million people around the world struggle with depression. Depression causes significant suffering and entails high treatment and social costs [4,5]. While seeking professional help for depression is indeed a step towards recovery, mental health professionals and the patients they work with face another challenge: selecting the best treatment. Though a range of effective medications, therapies, and other treatments do exist, these are not equally effective for all patients. Some patients can spend years finding the right choice or combination from the dozens of available medications, multiple psychotherapies, and several neurostimulation techniques (e.g., repetitive Transcranial Magnetic Stimulation (rTMS), transcranial direct current stimulation (tDCS), and deep brain stimulation (DBS)). Currently, most patients and their physicians have little option but to take an educated “guess and check” approach to finding the right treatment. In the large Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study, only about one third of patients improved after the first step of treatment in the trial, with decreasing response rates after further steps. The decision about which treatment to try therefore has significant consequences. In addition, different patients develop varying side effects to the same medication in an unpredictable manner, which further complicates treatment choice.
As such, a research objective in this domain should be to develop an evidence-based approach to rapidly select the most effective treatment for any given patient, as early in their clinical course as possible, whilst minimizing side effects that lead to reduced quality of life or poor treatment adherence. Existing guidelines do categorize the large array of treatment options into first-, second-, and third-line treatments. However, clinicians lack a systematic, evidence-based tool for mental health conditions that helps them choose treatments personalized to each patient [7,8,9,10]. Indeed, the most important study of comparative treatment efficacy to date, a meta-analysis of twenty-one antidepressants by Cipriani et al., was unable to make any clear recommendations regarding the personalization of treatment. In this commentary, we discuss the possibility of using deep learning as a solution to this problem.
A clinical decision tool is a potential solution to the aforementioned problem. The tool would synthesize both existing and newly recorded data with the aim of producing treatment recommendations that optimize symptom remission while mitigating adverse side effects. This can be framed as a classification problem that could be solved through a machine learning approach. Machine learning is the use of algorithms to learn patterns in often large datasets, sometimes in order to make predictions about new pieces of data (e.g., learning about previous weather patterns to make predictions about future weather). These approaches can be as simple as classical linear or logistic regression, or as complex as multi-layered artificial neural networks, known as deep learning (DL). Artificial neural networks (ANNs) have existed for decades in the form of the archetypal multilayer perceptron. Due to limited computing capabilities and the already superior performance of standard statistical methods, research into artificial neural networks slowed for decades. However, recent advances have led to an explosion of applications using DL. DL works by passing information through several layers of weighted artificial neurons, producing increasingly abstract representations of relationships between variables in the original dataset. The problem that comes with deeper networks is that, while they can capture more complex relationships between features, the error signal of the gradient-based training procedure can either attenuate or magnify as it propagates through the layers, which can impede training. Superficial layers either learn very slowly, if at all (i.e., vanishing gradients), or much too quickly when compared to ‘deeper’ layers (i.e., exploding gradients). Several approaches, beyond the scope of this paper, have been developed to solve this problem.
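The attenuation and magnification of the error signal can be illustrated with a toy calculation. The following is a minimal sketch, not a realistic network: it assumes a chain of one-unit layers, takes "superficial" to mean the layers nearest the input, and uses hypothetical weight values chosen only to make the effect visible. By the chain rule, the gradient reaching a layer is a product of one factor per later layer, so factors below 1 shrink it and factors above 1 grow it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_gradients(weights, x=0.5, use_sigmoid=True):
    """Magnitude of d(output)/d(input of layer k) for a chain of one-unit layers.

    Layer k computes a_next = f(w_k * a_k), where f is a sigmoid or the identity.
    Index 0 of the returned list is the layer closest to the input.
    """
    # Forward pass: store every activation.
    activations = [x]
    for w in weights:
        z = w * activations[-1]
        activations.append(sigmoid(z) if use_sigmoid else z)
    # Backward pass: multiply one chain-rule factor per layer.
    grad = 1.0
    magnitudes = []
    for w, a_out in zip(reversed(weights), reversed(activations[1:])):
        local_slope = a_out * (1.0 - a_out) if use_sigmoid else 1.0
        grad *= w * local_slope
        magnitudes.append(abs(grad))
    magnitudes.reverse()
    return magnitudes

# Hypothetical 10-layer chains: sigmoid units with unit weights (vanishing)
# and linear units with weights of 2 (exploding).
vanishing = layer_gradients([1.0] * 10, use_sigmoid=True)
exploding = layer_gradients([2.0] * 10, use_sigmoid=False)
print("vanishing: first layer", vanishing[0], "last layer", vanishing[-1])
print("exploding: first layer", exploding[0], "last layer", exploding[-1])
```

Because the slope of a sigmoid never exceeds 0.25, ten such layers scale the gradient at the first layer by at most 0.25 to the tenth power, whereas the linear chain with weights of 2 multiplies it by 2 at every layer.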
Why use DL? The method has known drawbacks. It has a tendency to overfit, that is, to produce solutions that fit the training data but fail to generalize to other data. It also suffers from the black box problem: results provided by DL networks are difficult for humans to interpret, which is less of a problem with other approaches such as decision trees. We will discuss possible solutions to these problems below. However, besides having surpassed other methods in a variety of tasks, DL has two distinct advantages. Firstly, it remains robust in the face of noisy or incomplete datasets [18,19], which are common in psychiatry. Secondly, the failure of simple models to explain or predict psychiatric phenomena speaks to a complexity that DL may be able to capture through its increasingly abstract data representations.
Generally, large datasets are required to train deep networks. For prediction of response to depression treatment, these datasets could include information such as socio-demographic factors, symptom profiles, and previous response to treatment, as well as genetic, metabolic, endocrine, immunological, and neuroimaging data. Moreover, the training datasets must include valid outcome measures. Therefore, clinical trials and treatment research studies are of significant interest. Another potential source of data is clinics that employ measurement-based patient care or structure clinical data collection in a manner that facilitates machine learning for outcome prediction. Despite its established efficacy, measurement-based care is not routinely implemented in psychiatry, which limits the amount of available data. Other clinical data (e.g., extant medical records) may not be optimal for training because they are often incomplete and lack the rigorous outcome measures of controlled trial data or of data collected for machine learning applications.
Collecting sufficient data for DL is challenging. Producing data de novo that is conducive to machine learning requires extensive clinical partnerships and years of collection, a route being explored by the Alphabet-owned company Verily. Large, well-designed clinical trials are an alternative, but they can be prohibitively expensive. For example, the STAR*D trial cost 35 million USD and enrolled 4041 patients: a sizeable number, but below the tens of thousands of patients required for a machine learning application to make reliable predictions of a patient’s response to all available treatments. For datasets large enough for machine learning, this leaves data collected by industry and researchers, which is often difficult to access, though recent open data initiatives such as the National Institute of Mental Health Data Archive have facilitated this. The fact that little standardization exists between datasets means that data pooling for the purposes of machine learning remains a challenge.
Assuming that sufficient data could be collected for effective model training, a DL classifier would learn patterns corresponding to patient subgroups that report varying side effect profiles and responses to treatments. Ideally, the complete tool would allow clinicians to input a set of patient data and would output personalized treatment options, each with a predicted probability of treatment efficacy. The listed treatment options should include all therapies for depression supported by positive clinical trials, such as medication, psychotherapy, lifestyle interventions, and neuromodulation. Furthermore, the tool could continue to rely on both clinician input and self-reported patient data, with the aim of providing a clinical decision tool that complements the physician’s expertise. Figure 1 describes how such a tool could be integrated into the clinical workflow.
While there is ongoing effort to identify single biomarkers that could reliably and accurately predict response to treatment, none have emerged. A more fruitful strategy may be to define a panel of biomarkers and clinical questions, which, together, might provide more reliable and accurate predictions of treatment response when fed into a DL system as input features.
In order to capture clinical response patterns in a population of individuals with complex medical and personal histories and biological profiles, a predictive model must be trained on complex and heterogeneous data. Having a dataset of sufficient complexity is necessary to help the model generalize to real-life clinical populations; using unrepresentative data would be a source of bias. One DL approach to such a dataset would be to use a feed-forward stacked denoising autoencoder (SDA), which consists of a series of context-learning layers. The general architecture of an autoencoder is composed of an encoder, E, and a decoder, D, and is defined as follows: the dataset, x, is mapped to a learned latent representation E(x), and the decoder then maps E(x) back to a reconstruction D(E(x)) of the original x. In the denoising variant, the input is partially corrupted before encoding and the network is trained to reconstruct the uncorrupted x. The latent representation E(x) typically has fewer dimensions than the original input, meaning that the ANN must construct representations of the interactions between input features. Pairing an SDA with a discriminative model (such as another ANN) allows for greater depth (i.e., number of layers) than standard feedforward neural networks, resulting in higher-order representations of data and an appreciation of more complex relationships between variables. It should be noted that recent developments in other approaches, such as generative adversarial networks (GANs), are not necessarily suited to our purpose. This is because they are designed to create artificial data samples that remain representative of the training data. Training on artificially generated patient data might skew further analyses.
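The encoder and decoder mapping can be made concrete with a deliberately small example. The sketch below is an illustration under stated assumptions rather than the architecture proposed here: it trains a single-layer, linear denoising autoencoder (not a stacked one) in plain Python, on synthetic data in which four observed features are driven by two hidden factors. The masking-noise rate, learning rate, latent size, and all function names are hypothetical choices made for the example.

```python
import random

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def corrupt(x, rng, drop_prob=0.25):
    # Masking noise: each input is independently zeroed with probability drop_prob.
    return [0.0 if rng.random() < drop_prob else xi for xi in x]

def reconstruction_error(w_enc, w_dec, data):
    # Mean squared reconstruction error ||D(E(x)) - x||^2 on clean inputs.
    total = 0.0
    for x in data:
        x_hat = matvec(w_dec, matvec(w_enc, x))
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(data)

def train_denoising_autoencoder(data, latent_dim, epochs, lr=0.05, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    w_enc = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(latent_dim)]
    w_dec = [[rng.uniform(-0.1, 0.1) for _ in range(latent_dim)] for _ in range(dim)]
    for _ in range(epochs):
        for x in data:
            x_noisy = corrupt(x, rng)
            h = matvec(w_enc, x_noisy)               # E(corrupted x)
            x_hat = matvec(w_dec, h)                 # D(E(corrupted x))
            err = [a - b for a, b in zip(x_hat, x)]  # compared with the clean x
            # Gradients of 0.5 * ||x_hat - x||^2, computed before any update.
            grad_h = [sum(w_dec[i][j] * err[i] for i in range(dim))
                      for j in range(latent_dim)]
            for i in range(dim):
                for j in range(latent_dim):
                    w_dec[i][j] -= lr * err[i] * h[j]
            for j in range(latent_dim):
                for i in range(dim):
                    w_enc[j][i] -= lr * grad_h[j] * x_noisy[i]
    return w_enc, w_dec

# Synthetic data: four observed features driven by two hidden factors.
data_rng = random.Random(1)
data = []
for _ in range(50):
    a, b = data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)
    data.append([a, a, b, b])

untrained = train_denoising_autoencoder(data, latent_dim=2, epochs=0)
trained = train_denoising_autoencoder(data, latent_dim=2, epochs=200)
initial_error = reconstruction_error(*untrained, data)
final_error = reconstruction_error(*trained, data)
print("error before training:", initial_error)
print("error after training:", final_error)
```

Because the latent layer has only two units for four inputs, the network can reconstruct the data only by capturing the redundancy between features, which is the sense in which the latent code represents interactions between inputs. A real SDA would stack several such layers and use nonlinear activations.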
The training technique for an SDA involves two steps: unsupervised learning followed by supervised learning. Unsupervised learning provides the network with contextual information about the data by allowing it to autonomously explore latent groupings and dependencies; this trains the autoencoder layers mentioned above. Supervised learning asks the network to take inputs, run them through the previously trained autoencoder layers, and predict a specific output target, a process aided by the network’s underlying knowledge of the dataset acquired during unsupervised learning. Unsupervised learning may be effectively applied to unstructured data such as electronic medical records. Doing so might then improve supervised learning on more structured data with rigorous outcome measurement, like clinical trial data. In this way, both structured clinical data and unstructured electronic medical records may be processed and used to improve outcome prediction.
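The supervised step can be sketched in isolation. The example below assumes that the unsupervised step has already produced an encoder, and stands in for it with a fixed, hypothetical weight matrix (W_ENC) that simply averages two groups of features. A logistic output unit is then trained by gradient descent on the encoded inputs while the encoder stays frozen. The data, labels, and hyperparameters are invented for illustration, and in practice the encoder layers might also be fine-tuned.

```python
import math

# Hypothetical frozen encoder standing in for weights learned during
# unsupervised pretraining: it averages two groups of input features.
W_ENC = [[0.5, 0.5, 0.0, 0.0],
         [0.0, 0.0, 0.5, 0.5]]

def encode(x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W_ENC]

def predict_proba(h, weights, bias):
    logit = sum(w * hi for w, hi in zip(weights, h)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def log_loss(features, labels, weights, bias):
    total = 0.0
    for h, y in zip(features, labels):
        p = predict_proba(h, weights, bias)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def train_head(features, labels, steps=500, lr=0.5):
    """Full-batch gradient descent on a logistic output unit."""
    weights, bias = [0.0] * len(features[0]), 0.0
    n = len(labels)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for h, y in zip(features, labels):
            residual = predict_proba(h, weights, bias) - y
            grad_b += residual / n
            for j, hj in enumerate(h):
                grad_w[j] += residual * hj / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

# Invented labelled samples (1 = responder, 0 = non-responder).
X = [[1.0, 0.8, 0.1, 0.0], [0.9, 1.0, 0.2, 0.1], [0.7, 0.9, 0.0, 0.2],
     [0.1, 0.0, 1.0, 0.9], [0.2, 0.1, 0.8, 1.0], [0.0, 0.2, 0.9, 0.7]]
y = [1, 1, 1, 0, 0, 0]

H = [encode(x) for x in X]            # inputs passed through the frozen encoder
initial_loss = log_loss(H, y, [0.0, 0.0], 0.0)
weights, bias = train_head(H, y)
final_loss = log_loss(H, y, weights, bias)
predictions = [int(predict_proba(h, weights, bias) > 0.5) for h in H]
print(initial_loss, final_loss, predictions)
```

Keeping the encoder fixed makes the division of labour explicit: the representation comes from the unlabelled step, and only the small output layer depends on the scarce labelled outcomes.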
Patient outcomes would be assessed in terms of validated depression rating scales. Standard validation assessments could be used to critically examine all aspects of the classifier, including its vulnerabilities. These methods include computation of the sensitivity, specificity, positive predictive value, negative predictive value, accuracy, and the receiver operating characteristic (ROC) curve, along with the area under the ROC curve (AUC). To ensure model validity, k-fold cross-validation could also be implemented. In addition, dropout should be implemented during training to prevent overfitting and to improve generalizability. Dropout refers to the practice of randomly excluding artificial neurons during training to prevent over-reliance on certain nodes. Clinicians require interpretability in their decision-making tools. Interpretation of DL models has traditionally been difficult due to their increasingly abstract internal representations of information. Failing to provide easily interpretable results of the model’s decisions may compromise clinician trust and technology adoption. Nevertheless, certain tools for interpretability do exist and continue to improve as DL is applied to the clinical domain. For example, the authors of  used a DL model to “teach” a gradient boosted tree model to predict mortality in an ICU, which produced a model that benefits from the power of a DL model and the interpretability of a gradient boosted tree model.
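The validation measures named above have simple definitions that can be written out directly. The following sketch computes them in plain Python from binary labels and model scores, and generates index splits for k-fold cross-validation. The labels, scores, and 0.5 decision threshold are invented for illustration, and the functions assume that both classes are present and that at least one case is predicted in each class (otherwise a division by zero would occur).

```python
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "accuracy": (tp + tn) / len(y_true),
    }

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

# Invented example: 1 = remission, 0 = no remission.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
folds = list(k_fold_indices(10, 3))
print(metrics, auc)
```

On this toy input each of the five threshold-based measures equals 0.75 and the AUC equals 0.8125. In practice the records would usually be shuffled, and possibly stratified by outcome, before being split into folds.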
Several other interpretability methods exist, including validating the final model input features by relating them to existing literature and using receptive field analysis on all layers in the network to gain a sense of low-level feature groupings [32,33]. Moreover, saliency map generation may elucidate the features of an input data sample that contribute most significantly to the network’s prediction. Feature clusters of this kind may correspond to similarities between certain patient types extracted via t-Distributed Stochastic Neighbor Embedding (t-SNE). Some of the feature clusters identified may even spawn novel research avenues. Interpretability tools such as these can produce reports explaining the most salient features in the making of a given decision. This provides a level of detail familiar to mental health clinicians, similar to analyses of risk factors for different conditions, which cannot always definitively explain causal links.
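As an illustration of the saliency idea, the sketch below scores each input feature by the magnitude of the derivative of the model output with respect to that feature. It approximates the derivative with central finite differences rather than backpropagation, and it uses a tiny two-unit network whose weights, input values, and feature names are all hypothetical. The third feature is deliberately given zero weight so that its saliency should vanish.

```python
import math

# Hypothetical weights of a tiny network: 3 inputs, 2 tanh hidden units, 1 output.
W1 = [[2.0, 0.5, 0.0],
      [-1.5, 0.3, 0.0]]
B1 = [0.0, 0.0]
W2 = [1.0, -1.0]
B2 = 0.0

def model(x):
    """Predicted probability of response for one patient feature vector."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, B1)]
    logit = sum(w * h for w, h in zip(W2, hidden)) + B2
    return 1.0 / (1.0 + math.exp(-logit))

def saliency(x, predict, eps=1e-5):
    """Central finite-difference estimate of |d predict(x) / d x_i| per feature."""
    result = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        result.append(abs(predict(up) - predict(down)) / (2 * eps))
    return result

# Hypothetical feature names and values for a single patient.
feature_names = ["feature_a", "feature_b", "feature_c"]
patient = [0.2, -0.1, 0.9]

saliency_scores = saliency(patient, model)
ranking = sorted(zip(feature_names, saliency_scores), key=lambda item: -item[1])
print(ranking)
```

A report built from such a ranking would tell the clinician which inputs the prediction was most sensitive to for this particular patient, without implying that those inputs are causal.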
Targeting response to treatment in depression gives rise to several challenges. For example, there is significant debate about the nosology of depression: anxiety is often comorbid with depression but is not part of the diagnostic criteria for the disorder. To avoid the pitfalls associated with diagnostic validity when selecting the features to be used in a machine learning model, it is possible to adopt a “dimensional approach”, that is, to focus on symptoms and other patient features independently of the actual diagnosis (assuming, of course, that non-psychiatric causes of symptoms such as fatigue, e.g., hypothyroidism or anemia, have been ruled out or are not suspected). Another consideration is the temporally sensitive nature of response to treatment: predicting treatment response may require not only patient features at baseline but also observations made shortly after treatment initiation. This is especially relevant to psychotherapy, where the strength of the relationship with the therapist, and not the individual patient or treatment features, is most predictive of outcome. In this case, two machine learning models could be used. The first could predict whether the patient is likely to respond to psychotherapy based on data from patients who had a good relationship with their therapist (i.e., the ‘best case’ scenario). A few weeks into treatment, a second model would evaluate the actual patient-therapist relationship amongst other factors that predict response.
Should a predictive model withstand clinical validation, it would be among the first personalized medicine tools in mental health specifically designed for use by clinicians. This solution has the potential to reduce the disappointingly high rate of failure to reach remission, as seen in the STAR*D trial. It is difficult to ascertain the potential reduction of treatment failure rate prior to clinical testing. However, a recent machine learning approach using random forests has attained some success in identifying patients who have attempted suicide (AUC = 0.84, precision = 0.79, recall = 0.95) using electronic medical records. The authors of  use a technique called bootstrap optimism to reduce overfitting and find that their predictions became more accurate closer in time to the actual attempt. Although this model was only tested on previously collected data, the finding that these retroactive predictions became more accurate closer to the attempt suggests that the model was able to capture the patients’ temporal evolution. Given the similar nature of datasets used to predict suicide or response to depression treatment, we might reasonably expect 60-80% accuracy in predicting the most effective treatment using DL (see  for a previous approach to this using another machine learning technique). This accuracy would represent a significant improvement over the one-third success rate reported in Step 1 of STAR*D. For more complex or chronic cases, we might expect a lower success rate while still aiming to surpass STAR*D Step 2 and 3 remission rates (25% and 12-20%, respectively). These are evidently rough predictions and will need to be continually revised.
Given existing clinical psychiatry data, there are likely to be fewer training samples than in typical machine learning datasets [6,10]. Model training might prove problematic if there are too many predictors relative to the number of samples, and manually inputting many predictors would be time-consuming and cumbersome for clinicians. However, many of the features in a patient’s file or a clinical trial are not actually expected to influence the response to antidepressants, making it possible to narrow down the number of features fed into the model. This feature reduction could be approached via several routes. One method would be to ask a panel of experts to review the features in the dataset and to select only the most important predictors. This list of predictors can be compared with the results of an exhaustive literature review. One can further reduce the number of predictors by running many iterations of a model with certain input features omitted, to see which combination of predictors provides the most accurate results while remaining quick enough to enter that clinicians can incorporate the tool into their busy practices.
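The last route, rerunning a model with features omitted, can be sketched as a leave-one-feature-out ablation. To keep the example short and deterministic, the sketch swaps the DL model for a nearest-centroid classifier, and it uses a tiny invented dataset with one informative and one uninformative feature. The function names and data are hypothetical, and in practice accuracy would be estimated with cross-validation rather than a single split.

```python
def class_centroids(rows, labels, keep):
    """Mean of the kept feature columns for each class label."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(keep))
        for j, col in enumerate(keep):
            acc[j] += row[col]
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def accuracy(train, y_train, test, y_test, keep):
    """Held-out accuracy of a nearest-centroid classifier on the kept columns."""
    cents = class_centroids(train, y_train, keep)
    correct = 0
    for row, label in zip(test, y_test):
        point = [row[col] for col in keep]
        # Ties are resolved in favour of the smallest class label.
        pred = min(sorted(cents), key=lambda c: squared_distance(point, cents[c]))
        correct += int(pred == label)
    return correct / len(test)

def ablation_report(train, y_train, test, y_test, names):
    """Accuracy lost when each feature is omitted in turn."""
    all_cols = list(range(len(names)))
    baseline = accuracy(train, y_train, test, y_test, all_cols)
    drops = {}
    for col, name in enumerate(names):
        keep = [c for c in all_cols if c != col]
        drops[name] = baseline - accuracy(train, y_train, test, y_test, keep)
    return baseline, drops

# Invented data: column 0 tracks the label, column 1 does not.
names = ["informative", "uninformative"]
train = [[0.0, 5.0], [0.2, -5.0], [1.0, -5.0], [0.8, 5.0]]
y_train = [0, 0, 1, 1]
test = [[0.1, 3.0], [0.3, -3.0], [0.9, 3.0], [0.7, -3.0]]
y_test = [0, 0, 1, 1]

baseline, drops = ablation_report(train, y_train, test, y_test, names)
print(baseline, drops)
```

Features whose removal costs little or no accuracy are candidates for exclusion, which shortens the list that a clinician has to enter. With many features, the number of possible subsets grows quickly, so exhaustive search over combinations is usually replaced by greedy or staged elimination.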
Clinical validation is key to ensuring the safety and efficacy of a predictive model in the clinical setting. A predictive model should be subject to rigorous testing, including an open-label clinical trial to establish safety and effectiveness, and a randomized controlled trial to evaluate efficacy by comparing the predictive model to a usual practice control group and to a group using a model loaded with static, non-personalized suggestions derived from current clinical guidelines; for an example of these guidelines, see . It is important to compare both the static model and the ‘practice as usual’ groups to the AI-powered model group in order to assess the effect of clinicians using guideline-centered, measurement-based care, which on its own may improve quality of care and patient outcomes.
Physicians are generally not familiar with using AI technologies in their practice. Building physician and patient trust will be critical to the success of any clinical decision tool. For this reason, physicians, as well as patient representatives, should be involved in the design of such a product. Considering that a clinical decision tool must be incorporated into the medical workflow, the ultimate utility of these applications will depend in large part on their being user-friendly. Use of companion patient self-report applications might also help to reduce the time the clinician spends inputting patient information into the predictive model while providing rich data. Most importantly, however, a clinical decision tool should be conceptualized as a tool used to complement or augment physician capabilities, and not as a means to supplant or replace clinical judgment. As such, it is critical that the clinician actively engage with the application. Active clinician and patient engagement, coupled with seamless integration into the clinician’s workflow, will bring clinical medicine into the age of big data and AI-augmented care.
The authors received no funding for this research.
All authors are co-founders and shareholders of Aifred Health, the company engaging in the research described here.