Attention-based neural networks for population genetics

Théophile Sanchez (ETH Zurich) [remote]

March 31, 2023

Abstract: Artificial neural networks (ANNs) have recently opened new perspectives for solving inference problems on high-dimensional data in numerous scientific fields, but it is still unclear which architectures are best suited to genomic data. Here, we present a new ANN architecture that integrates attention mechanisms to infer effective population size history from genomic data. Built upon our previous exchangeable architecture SPIDNA, MixAttSPIDNA adds attention layers that compute more expressive and complex features from combinations of haplotypes. The contribution of each haplotype to the features is learned automatically and depends on its content and its affinity with the other haplotypes. Likewise, we use this mechanism to automatically perform a voting scheme that aggregates predictions from different genomic regions. This new architecture outperforms approximate Bayesian computation and SPIDNA on simulations while relying directly on raw genetic data and remaining invariant to haplotype permutation in the input. As a proof of concept, we use this architecture to infer the effective population size history of 54 populations from the HGDP dataset (Bergström et al., 2020), and we compare our results to smc++ (Terhorst et al., 2017). This application highlights the ability of the network to handle data with a varying number of haplotypes and to quickly perform predictions for datasets that include numerous populations.
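To illustrate the general idea of attention over haplotypes, here is a minimal sketch of a permutation-invariant attention pooling step. It is not the MixAttSPIDNA implementation: the function name `attention_pool`, the feature dimension, and the random weight matrices (which stand in for learned parameters) are all assumptions made for this example. Each haplotype's weight depends on its content and on its similarity to the other haplotypes, and averaging over haplotypes afterwards makes the result independent of their order and of their number.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def attention_pool(features, w_query, w_key, w_value):
    """Hypothetical attention pooling over haplotypes.

    features: array of shape (n_haplotypes, d), one feature row per haplotype.
    Returns a vector that does not depend on the order of the rows.
    """
    q = features @ w_query                      # queries, one per haplotype
    k = features @ w_key                        # keys, one per haplotype
    v = features @ w_value                      # values, one per haplotype
    scores = q @ k.T / np.sqrt(k.shape[1])      # pairwise affinities
    weights = softmax(scores, axis=1)           # each row sums to 1
    mixed = weights @ v                         # permutation-equivariant features
    return mixed.mean(axis=0)                   # averaging gives invariance


# Random matrices stand in for learned weights (assumption for illustration).
rng = np.random.default_rng(0)
d = 4
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

haplotypes = rng.normal(size=(6, d))            # 6 haplotypes, toy features
summary = attention_pool(haplotypes, w_q, w_k, w_v)

# Shuffling the haplotypes leaves the summary unchanged.
shuffled = haplotypes[rng.permutation(6)]
same = np.allclose(summary, attention_pool(shuffled, w_q, w_k, w_v))

# A different number of haplotypes still yields a vector of the same size.
summary_10 = attention_pool(rng.normal(size=(10, d)), w_q, w_k, w_v)
```

The same weighting scheme could in principle be applied across genomic regions instead of haplotypes, which is one way to read the voting mechanism mentioned in the abstract.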