Reusability report: Compressing regulatory networks to vectors for interpreting gene expression and genetic variants (Yong Wang), Academy of Mathematics and Systems Science

Integrating multi-omics data to better interpret transcriptional control and reveal regulatory mechanisms is of fundamental importance. High-dimensional data are usually represented and modelled mathematically as a biological network in which nodes represent biological units and edges represent the interactions between those units. Recent progress in representation learning has demonstrated that heterogeneous networks with multiple types of nodes and links can be embedded in a low-dimensional vector space [1]. In particular, Cao et al. used a state-of-the-art embedding method, GEEK (‘Gene Expression Embedding frameworK’), to combine biological networks and omics data through the metapath concept, and produced interpretable biological knowledge about gene function, protein complexes, chromatin domains and replication timing.

To demonstrate the robustness and reusability of the embedding framework, we carried out two downstream tasks that are complementary to the GEEK study: (1) integrating the regulatory information embedded in the vectors generated by GEEK to regress gene expression levels in K562 cells using DeepExpression [3], and (2) incorporating an attention score based on GEEK embedding vectors to prioritize genetic variants for high-altitude adaptation around the EPAS1 region in human umbilical vein endothelial cells (HUVECs), variants that were also identified by vPECA (‘variants interpretation method by paired expression and chromatin accessibility’) in a previous publication [4]. Briefly, DeepExpression is a densely connected convolutional neural network that integrates DNA sequence information and enhancer–promoter interaction data to model gene expression, and vPECA is a variant interpretation method for identifying active selected regulatory elements (REs) and the associated regulatory network. Our objective is to evaluate the regulatory information in GEEK embedding vectors by investigating whether incorporating the vectors improves the performance of these methods. The results show that GEEK embedding vectors are informative for predicting gene expression and potentially useful for prioritizing genetic variants. Applications that use GEEK embedding vectors should be interpreted carefully, taking into account both the context-specific and the non-specific information that the vectors carry.

Publication:

- Nature Machine Intelligence, 3(7), 576-580 (2021).

Authors:

- Wanwen Zeng (Nankai University)

- Jingxue Xin (Stanford University)

- Rui Jiang (Tsinghua University)

- Yong Wang (Institute of Applied Mathematics, AMSS, Chinese Academy of Sciences)