Assistant Professor, University of Idaho, Moscow, Idaho, United States
Recent years have seen many theoretical advances and new tools for investigating population divergence, size changes, and gene flow from sequence data. These methods often rely on computationally expensive likelihood calculations, which makes applying them to dozens of genome-scale samples challenging. Moreover, they often focus on one or a few characteristics of the data for the sake of computational and mathematical tractability.
Coinciding with the genomic revolution, advances in deep learning have brought dramatic improvements in the processing of image, video, and audio data. Deep learning is a class of algorithms that learn representations of data at multiple levels of abstraction and excel at tasks such as automated object identification in images and speech recognition. As a predictive tool, deep learning has some advantages over likelihood-based methods, including fast prediction times and the ability to consider arbitrarily complex models. Because of these desirable properties, deep learning is making inroads into population genetics.
Summary statistics such as allele frequency spectra have so far dominated the input to population genetics applications that leverage deep learning. Here we test whether deep learning can instead be applied directly to the patterns of variable sites in molecular sequence alignments to infer the best-fitting demographic scenario. We use sequence data simulated along coalescent histories under different demographic scenarios to train a neural network classifier. We find that this approach provides accurate inference while remaining computationally feasible for large datasets. We develop our analytical pipeline into DEMES (Deep lEarning for deMographic modEl Selection).
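
To illustrate what using variable sites directly, rather than summary statistics, might look like, the following is a minimal sketch of one possible way to turn an alignment into a fixed-width numeric matrix suitable as classifier input. The function name `variable_site_matrix`, the 0/1/-1 coding, and the padding scheme are assumptions made for illustration only; the abstract does not specify how DEMES encodes its input.

```python
from collections import Counter


def variable_site_matrix(alignment, n_sites=None):
    """Encode the variable sites of an alignment as a samples x sites matrix.

    Hypothetical coding (an assumption, not taken from the abstract):
      0  = most common called base in the column
      1  = any other called base
      -1 = missing data, gap, or padding
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")

    columns = []
    for site in range(length):
        column = [seq[site].upper() for seq in alignment]
        called = [base for base in column if base in "ACGT"]
        if len(set(called)) < 2:
            continue  # invariant (or entirely missing) site: skip it
        major = Counter(called).most_common(1)[0][0]
        columns.append(
            [-1 if base not in "ACGT" else int(base != major) for base in column]
        )

    if n_sites is not None:
        # Truncate or pad so every alignment yields the same input width.
        columns = columns[:n_sites]
        columns += [[-1] * len(alignment)] * (n_sites - len(columns))

    # Transpose from per-site columns to one row per sampled sequence.
    return [[col[i] for col in columns] for i in range(len(alignment))]


alignment = ["ACGTA", "ACGTT", "ATGTA", "ACGNA"]
matrix = variable_site_matrix(alignment, n_sites=3)
for row in matrix:
    print(row)
```

In a workflow like the one described above, matrices of this kind would be built from alignments simulated under each candidate demographic scenario and labelled with the scenario that generated them, and those labelled matrices would then serve as training data for the classifier.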