A machine learning approach to integrate big data for precision medicine in acute myeloid leukemia

From Nature

Abstract: Cancers that appear pathologically similar often respond differently to the same drug regimens. Methods to better match patients to drugs are in high demand. We demonstrate a promising approach to identify robust molecular markers for targeted treatment of acute myeloid leukemia (AML) by introducing: (1) data from 30 AML patients, including genome-wide gene expression profiles and in vitro sensitivity to 160 chemotherapy drugs, and (2) a computational method to identify reliable gene expression markers for drug sensitivity by incorporating multi-omic prior information relevant to each gene’s potential to drive cancer. We show that our method outperforms several state-of-the-art approaches in identifying molecular markers replicated in validation data and predicting drug sensitivity accurately. Finally, we identify SMARCA4 as a marker and driver of sensitivity to the topoisomerase II inhibitors mitoxantrone and etoposide in AML by showing that cell lines transduced to have high SMARCA4 expression reveal dramatically increased sensitivity to these agents.

Discussion: Due to the small sample size and the potential confounding factors in the gene expression and the drug sensitivity data, standard methods to discover gene-drug associations usually fail to identify replicable signals. We present a new way to identify robust gene-drug associations by prioritizing genes based on the multi-dimensional information on each gene’s potential to drive cancer. We demonstrate that our method increases the chance that the identified gene-drug associations are replicated in validation data. This leads us to a short list of genes which are all attractive biomarkers for different classes of drugs. Our results—including the expression, drug sensitivity data, and association statistics from patient samples—have been made freely available to academic communities.

Our results suggest that high SMARCA4 expression could be a molecular marker for sensitivity to topoisomerase II inhibitors in AML cells. These results offer a potentially enormous impact to improve patient response. Mitoxantrone is an anthracycline, like daunorubicin or idarubicin, and one of the two component classes of drugs included in nearly all upfront AML treatment regimens. It is also included (the “M”) in the CLAG-M regimen [55], a triple-drug component upfront regimen now being studied as GCLAM [56]. Mitoxantrone and etoposide (also a topoisomerase II inhibitor) are two of the three drugs in the MEC regimen [57], used together with cytarabine, as a common regimen for relapsed/refractory AML. Many modern regimens are in clinical trials that add an investigational drug to the MEC backbone, for example, an antibody to CXCR4 (NCT01120457) or an E selectin inhibitor (NCT02306291) in combination, or decitabine priming preceding the MEC regimen [58]. Identifying a predictor of response to mitoxantrone based on clinically available biospecimens, such as leukemic blast gene expression measured prior to treatment, could potentially increase median survival rates for patients with high expression of SMARCA4 and indicate alternative therapies for patients with low SMARCA4 expression.

The AML patients used in our study were consecutively enrolled on a protocol to obtain laboratory samples for research. They were selected solely based on sufficient leukemia cell numbers. As the patient samples were consecutively obtained and not selected for any specific attribute, we postulated that they were representative of patients seen at a tertiary referral center and that the results would be relevant to a larger, more general clinical population. Moreover, since each of the data sets from which we collected prior information (driver features) contained many more than 30 samples (e.g., TCGA AML data), it would be highly likely that MERGE results would be more generalizable to larger clinical populations than the methods that retrieve results specifically based on the 30 AML samples. In fact, Fig. 2a, b implies higher generalizability of MERGE compared to alternative methods.

While we have genotype information on FLT3 and NPM1 and the cytogenetic risk category for most of the 30 patients, the current version of the MERGE framework did not take these features into account: our main focus sought to build a general framework that could address the high-dimensionality challenge (i.e., the number of samples being much smaller than the number of genes) and make efficient use of expression data to identify robust associations. However, to consolidate our findings, we performed a covariate analysis to confirm that the top-ranked gene-drug associations discovered by MERGE remained significant when the risk group/cytogenetic features were considered in the association analysis. We checked whether the gene-drug associations shown in the heat map in Fig. 6b (highlighted as red or green) were conserved when we added each of the following as an additional covariate to the linear model: (1) cytogenetic risk, (2) FLT3 mutation status, and (3) NPM1 mutation status. In Supplementary Fig. 9, each dot corresponds to a gene-drug pair, and each color to a different covariate. Most of the dots being closer to the diagonal indicates that the associations did not decrease significantly after adding the covariates. Moreover, of 357 dots, only eight were below the horizontal red line; this indicates that 98% of the gene-drug associations MERGE uncovered were still significant (p ≤ 0.05) after modeling the covariate.
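The covariate check described above can be illustrated with a small sketch. This is not the authors' code: it assumes an ordinary least-squares model of drug sensitivity on gene expression, uses synthetic data for 30 hypothetical patients, and the function name `gene_t_statistic` and the binary covariate are invented for illustration. It compares the t-statistic of the expression coefficient with and without the extra covariate, which is roughly what each dot in the described figure would compare.

```python
import numpy as np

def gene_t_statistic(expression, sensitivity, covariate=None):
    """Return (t-statistic, degrees of freedom) for the expression
    coefficient in an OLS fit of sensitivity, optionally adjusting
    for one additional covariate."""
    expression = np.asarray(expression, dtype=float)
    sensitivity = np.asarray(sensitivity, dtype=float)
    columns = [np.ones_like(expression), expression]  # intercept + gene
    if covariate is not None:
        columns.append(np.asarray(covariate, dtype=float))
    X = np.column_stack(columns)
    beta, _, _, _ = np.linalg.lstsq(X, sensitivity, rcond=None)
    residuals = sensitivity - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = residuals @ residuals / dof           # residual variance
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)     # coefficient covariance
    return beta[1] / np.sqrt(cov_beta[1, 1]), dof

# Synthetic data (assumption): 30 patients, one gene, one binary covariate
# standing in for something like a mutation status.
rng = np.random.default_rng(0)
n = 30
covariate = rng.integers(0, 2, size=n)
expression = rng.normal(size=n)
sensitivity = 1.5 * expression + 0.5 * covariate + rng.normal(scale=0.5, size=n)

t_without, dof_without = gene_t_statistic(expression, sensitivity)
t_with, dof_with = gene_t_statistic(expression, sensitivity, covariate)

# Fraction of gene-drug pairs reported as still significant: 8 of 357 were not.
fraction_significant = (357 - 8) / 357
print(f"t without covariate: {t_without:.2f} (dof={dof_without})")
print(f"t with covariate:    {t_with:.2f} (dof={dof_with})")
print(f"still significant:   {fraction_significant:.1%}")
```

To turn a t-statistic into a two-sided p-value, one option is `2 * scipy.stats.t.sf(abs(t), dof)`. The last line reproduces the arithmetic in the text: 349 of 357 pairs is about 97.8%, which rounds to the 98% quoted.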


Data Mining Sheds Light on Autism Cases

This is a highly relevant case of data mining applied to the biomedical field. Scientists at Rockefeller University conducted a study that used data mining techniques to generate insights into the causes of autism.


A Comparison of Machine Learning Techniques for Survival Prediction in Breast Cancer

An excellent study from BioDataMining that could be reproduced here in Brazil. One criticism I have of this work is that the attribute selection was, as Daniel Larose would say, somewhat black-box, and in particular the Genetic Algorithms approach should not perform as well as SVM (the authors' point is that the data had a reasonable dimensionality).

A comparison of machine learning techniques for survival prediction in breast cancer


BioDatamining Site

Recommended without any reservations.


Data Mining Applied to Predictive Neuroscience

An interesting article recently published in PLoS One by Georges Khazen, Sean Hill, Felix Schürmann, and Henry Markram of the Ecole Polytechnique Fédérale de Lausanne presents a notable advance in the use of data mining to predict patterns in brain structures. These predictions make it possible to map the anatomical structure and electrical properties of neuron types. From that mapping, the authors may in future extract rules for genetic patterns and thereby predict an individual's brain characteristics.

This research is an important advance over what is currently practiced and studied in data mining. By using ‘strats’ and then extracting or discovering these rules, it reduces the complexity of part of the paradigm for studying brain morphology: from ‘pieces of data’, the authors can predict the complement of ion channels present in neurons, and from that determine which genes are expressed according to the neuron's electrical behavior.

Combinatorial Expression Rules of Ion Channel Genes in Juvenile Rat (Rattus norvegicus) Neocortical Neurons



Genome Project Available

Great news for data mining addicts: Amazon has made the Genome project available. For those unfamiliar with it, the Genome project aims to map genetic material and, based on analysis of that material, to carry out studies that make it possible to predict a specific disease, as well as to track how such diseases develop in support of biomedical research.
