We first searched the PubMed database for publications related to congenital heart disease (CHD) using the following query:
(congenital heart disease*[All Fields] OR heart defect*[All Fields] OR transposition of the great arteri*[All Fields] OR pulmonary atresia[All Fields] OR pulmonary artery atresia[All Fields] OR Anomalous pulmonary venous*[All Fields] OR Ebstein anomaly[All Fields] OR Epstein anomaly[All Fields]) AND (gene[Title/Abstract] AND (proteomics[Title/Abstract] OR expression[Title/Abstract] OR CNV[Title/Abstract] OR copy number variation[Title/Abstract] OR microarray*[Title/Abstract] OR microdel*[Title/Abstract] OR microdup*[Title/Abstract] OR rearrange*[Title/Abstract] OR linkage[Title/Abstract] OR associa*[Title/Abstract] OR scan[Title/Abstract] OR sequenc*[Title/Abstract])) AND ("1000/01/01"[Date - Publication]: "2020/01/10"[Date - Publication])
The abstracts of 2762 publications retrieved were scrutinized to remove irrelevant papers, and the remaining 1114 publications were systematically reviewed.
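The text does not state how the query was submitted. As a minimal sketch, assuming one wanted to rerun it programmatically through the NCBI E-utilities `esearch` endpoint, the request URL could be built as follows. The function name `build_esearch_url` and the default `retmax` are illustrative choices, not part of the original workflow, and no request is actually sent here.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=10000):
    """Build an esearch URL for a PubMed query string (hypothetical helper).

    `term` is the full query, including field tags such as [Title/Abstract];
    urlencode takes care of escaping brackets, slashes, and spaces.
    """
    params = {
        "db": "pubmed",       # search the PubMed database
        "term": term,         # the query text
        "retmax": retmax,     # maximum number of PMIDs to return
        "retmode": "json",    # ask for a JSON response
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Example with a short fragment of the query above.
print(build_esearch_url("gene[Title/Abstract]"))
```

Because PubMed is updated continuously, rerunning the query today would not necessarily return the same 2762 records, even with the same date limits.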
For each qualified study, items of research evidence were extracted. According to study strategy, we classified the evidence into six groups: “Genetic association”, “SNV/Indel”, “Expression”, “Linkage”, “CNV”, and “Other”. For example, “SNV/Indel” evidence comes from sequencing studies reporting single-nucleotide variants or insertion-deletion variants (SNVs/Indels) in patients, “CNV” evidence comes mainly from microarray studies reporting copy number variations (CNVs) in patients, and “Other” evidence comes mainly from functional analyses in cell culture and animal model studies. If a study used multiple experimental methods or different datasets, its evidence was divided into multiple independent items. For example, if an association study was performed in a discovery cohort and its significant findings were validated in a replication cohort, two pieces of “Genetic association” evidence were collected, one from each cohort. If a sequencing study identified SNVs/Indels in patients and then conducted functional analysis of the candidate gene in an animal model, we collected one piece of “SNV/Indel” evidence and one piece of “Other” evidence. Comprehensive meta-data were collected for each type of evidence. If a study reported variations in patients, we further collected the detailed phenotypes of the variation carriers; only carriers presenting with CHD were included in CHDbase. The collected information is described in detail in Table 1.
The disease name was standardized according to the 11th revision of the International Classification of Diseases (ICD-11) congenital cardiology terms. Transcript-based SNVs/Indels were annotated by Ensembl Variant Effect Predictor (VEP) and confirmed by Mutalyzer following the nomenclature recommendations of the Human Genome Variation Society (HGVS). Genomic coordinates are provided according to the human reference genome (GRCh37/hg19).
Table 1 Collected information for different types of evidence
 | | Genetic Association | SNV/Indel | Expression | Linkage | CNV | Other |
---|---|---|---|---|---|---|---|
Publication | First author | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Title | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
Year of publication | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
Journal | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
PubMed ID | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
Abstract | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
Summary | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
Population | Species | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Ethnicity | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
Country of origin | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
Number of families, cases and controls | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
Male/female ratio of cases and controls | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
Age of cases and controls (mean, sd, range) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
Diagnosis of cases | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
Other phenotype of cases | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
Animal/cell model | ✔ | ||||||
Study design | Study type | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Experimental method/platform | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
Statistical method | ✔ | ✔ | ✔ | ||||
Results | Variation/region/gene | Reported SNPs | Reported SNVs/Indels | Reported gene | Reported linkage regions | Reported CNVs | Reported gene/translocations |
Evaluation parameters | Allele and genotype distribution (Number, Frequency) | Number of independent carriers in cases and controls | Up or down regulated | Significant markers | Number of independent carriers in cases and controls | ||
Statistical significance and findings | P value, OR, RR, 95%CI for allele and/or genotype comparison | Pathogenicity classification based on authors’ conclusion | Fold change, P value | LOD, NPL, P value | Pathogenicity classification based on authors’ conclusion | Result summary | |
Associated CHD | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
Associated Syndrome | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
Sample | Family ID | ✔ | ✔ | ✔ | |||
Sample ID | ✔ | ✔ | ✔ | ||||
Gender | ✔ | ✔ | ✔ | ||||
Age | ✔ | ✔ | ✔ | ||||
Diagnosis | ✔ | ✔ | ✔ | ||||
Other phenotypes | ✔ | ✔ | ✔ | ||||
Transmission (de novo, paternal, maternal, familial) | ✔ | ✔ | ✔ | ||||
Zygosity (homozygous, heterozygous, mosaic) | ✔ | ✔ | ✔ |
In CHDbase, 1124 genes have been reported to increase susceptibility to CHD. For these CHD-related genes, we constructed an unweighted network based on the experimentally verified protein–protein interactions from the STRING database (https://string-db.org/). Two genes were connected by an edge only when their STRING experimental score was > 0. Degree, betweenness centrality, and eigenvector centrality were used to measure the importance of each gene in the network. To obtain a core gene set of high importance, we applied k-core decomposition to the network and obtained a k-core with 163 genes and 7925 edges. These 163 genes, which have the maximum value of k (k = 70), are considered the most centrally located nodes in the original network.
To help users better understand gene function, we annotated the genes using the public databases and datasets listed below and provide this information on the Gene Annotation Page.
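The edge rule and the k-core extraction can be illustrated on a toy network. The sketch below is not the authors' pipeline: the helper names `build_edges` and `main_core` and the node labels are hypothetical, and the decomposition is a plain "peel the lowest-degree node" implementation written for readability rather than speed.

```python
from collections import defaultdict


def build_edges(scored_pairs):
    """Keep a gene pair as an edge only when its experimental score is > 0."""
    return [(a, b) for a, b, score in scored_pairs if score > 0]


def main_core(edges):
    """Return (k_max, nodes) for the innermost k-core of an undirected graph."""
    # Build adjacency sets, ignoring self-loops and duplicate edges.
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    degree = {node: len(neighbours) for node, neighbours in adj.items()}
    remaining = set(adj)
    core_number = {}
    k = 0
    # Repeatedly remove the node with the smallest remaining degree; the
    # running maximum of those degrees is the core number of the removed node.
    while remaining:
        node = min(remaining, key=degree.__getitem__)
        k = max(k, degree[node])
        core_number[node] = k
        remaining.remove(node)
        for neighbour in adj[node]:
            if neighbour in remaining:
                degree[neighbour] -= 1

    k_max = max(core_number.values(), default=0)
    return k_max, {n for n, c in core_number.items() if c == k_max}


# Toy input: (gene_a, gene_b, experimental_score); labels are placeholders.
pairs = [
    ("A", "B", 900), ("A", "C", 500), ("A", "D", 300),
    ("B", "C", 100), ("B", "D", 200), ("C", "D", 700),
    ("A", "E", 50), ("E", "F", 10), ("F", "G", 0),
]
k_max, core_genes = main_core(build_edges(pairs))
print(k_max, sorted(core_genes))
```

In this toy graph the four mutually connected nodes form the innermost core, while the chain hanging off node A is peeled away first. In CHDbase the analogous innermost core is the 163-gene set at k = 70.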
Information for gene-related human phenotypic abnormalities and diseases was collected from Human Phenotype Ontology (HPO, Version: Jan 8, 2019), GWAS Catalog (downloaded from UCSC on May 29, 2020) and DisGeNET (v7.0).
Spatiotemporal gene expression levels were integrated from the latest gene expression data as follows: 1) RNA-Seq data generated by the genotype-tissue expression (GTEx) project for 31 human tissues, including adipose, adrenal gland, artery, bladder, brain, breast, cell lines, cervix, colon, esophagus, fallopian tube, heart, kidney, liver, lung, minor salivary gland, muscle, nerve, ovary, pancreas, pituitary, prostate, skin, small intestine, spleen, stomach, testis, thyroid, uterus, vagina, and whole blood; 2) RNA-Seq data published by Cardoso-Moreira et al. for seven human organs (cerebrum, cerebellum, heart, kidney, liver, ovary and testis) across multiple developmental time points; and 3) spatial subcellular expression data published by Asp et al. for the developing human heart at three developmental phases, namely, 4.5-5, 6.5 and 9 weeks post conception.
Post-translational modification (PTM) is the chemical modification of an amino acid in a protein or peptide after translation. We collected PTM data from UniProt (downloaded on May 29, 2020) and dbPTM (January 8, 2019).
GO terms were parsed from the NCBI gene2go.gz file (May 5, 2020) and cover three categories: biological process, molecular function, and cellular component.
The biological pathways in which genes are involved were integrated from four databases: Reactome (downloaded from UniProt, release 2020_06), BioCyc (v22.0), KEGG (downloaded from KOBAS, v3.0.3), and PANTHER (v3.6.4).
Molecular interactions were retrieved from the STRING (V11.0), OmniPath (April 15, 2020), BioGRID (V3.5.185) and human reference interactome (HuRI) databases. In the STRING dataset, the red node represents the quality score of the evidence, the color gradient of which is proportional to the quality: the darker the color, the higher the quality. The OmniPath interactions downloaded from the webserver (https://archive.omnipathdb.org) include not only omnipath data but also pathwayextra, kinaseextra, ligrecextra, and tfregulons data. The action, direction, and evidence of the interaction are provided. The BioGRID dataset contains protein, genetic, and chemical interactions (Homo sapiens). The HuRI dataset includes all protein–protein interactions identified in HI-I-05, HI-II-14, HuRI, Venkatesan-09, Yu-11, Yang-16, and test space screens-19.
To facilitate the identification of clinically relevant drugs, we also collected gene–drug interactions from DGIdb (v3.0), DrugCentral (v2020), and PharmGKB (Oct, 2012). The drug name, interaction type, evidence and reference are provided.
We provide comprehensive functional annotations for SNVs/Indels on the "Variation Annotation" page, including related gene, variant type, ID of other existing databases such as dbSNP and ClinVar, allele frequency in different populations of gnomAD, conservation scores calculated with phastCons, phyloP, and GERP++, and deleteriousness predicted by SIFT, PolyPhen, CADD, DANN, LRT, M-CAP, MetaLR, PrimateAI, fathmm-MKL, and fathmm-XF at transcript level.
We provide two search modes for users to retrieve data of interest from CHDbase. The usage of the basic and advanced search functions is described below.
Figure 1 Search box on the Home Page and top navigation bar
Table 2 Format of search terms for basic search mode
Term | Exact Search | Fuzzy Search |
---|---|---|
Gene Symbol | Official gene symbol, e.g. GATA4 | Partial official gene symbol, e.g. GATA |
Entrez Gene ID | [0-9], e.g. 2626 | Partial Entrez Gene ID, e.g. 26 |
Ensembl ID | [ENSG*], e.g. ENSG00000136574 | Partial Ensembl ID, e.g. ENSG00000136 |
Genome range | [chr*:start-end], e.g. chr9:133738277-133738277 | [chr*:start-end], e.g. chr9:133738277-133738277 |
HGVSc | [NM_*.*:c*], e.g. NM_005157.6:c.677A>G | Partial HGVSc expression, e.g. NM_005157.6:c.677A, but at least including transcript name with version and ":c", e.g. NM_005157.6:c |
HGVSp | [NP_*.*:p*], e.g. NP_001123517.1:p.Gly112Arg | Partial HGVSp expression, e.g. NP_001123517.1:p.Gly, but at least including protein name with version and ":p", e.g. NP_001123517.1:p |
dbSNP | [rs*], e.g. rs114390380 | Partial dbSNP ID, but at least starting with "rs" and including one digit, e.g. rs1 |
CHD/Syndrome | Full name or abbreviation, e.g. Mitral atresia or MA | Partial full name or abbreviation |
Literature | PubMed ID, e.g. 21815254 | Partial PubMed ID |
The basic search mode is provided on the Home Page and in the top navigation bar. The basic search engine recognizes gene symbols, Entrez IDs, Ensembl IDs, ranges of genomic coordinates, HGVS expressions at the cDNA level (HGVSc), HGVS expressions at the protein level (HGVSp), dbSNP IDs, full names or abbreviations of diseases, and PubMed IDs. The user can click on the circle at the end of the search box to choose the "Exact Search" or "Fuzzy Search" mode (Figure 1). In the "Exact Search" mode, users need to enter a complete query term in a valid format (Table 2). When searching for genes or variations, if CHDbase finds an exact match, the target Gene or Variation Evidence Page is returned; otherwise, the user is redirected to the Browse Page to retrieve the data of interest. When searching for diseases or literature, the query results are returned on the Browse Page, where users can further review the details.
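To make the "valid format" requirement concrete, the exact-search formats in Table 2 can be approximated with regular expressions. This is only a sketch: the patterns are our reading of the table, the function name `classify_exact_term` is hypothetical, and how CHDbase actually parses queries is not described in the text. A purely numeric term cannot be assigned to Entrez Gene ID or PubMed ID by format alone, so the sketch reports both.

```python
import re

# Patterns approximating the "Exact Search" column of Table 2 (illustrative).
EXACT_PATTERNS = {
    "Ensembl ID": re.compile(r"ENSG\d+"),
    "Genome range": re.compile(r"chr(\d+|X|Y|M|MT):\d+-\d+"),
    "HGVSc": re.compile(r"NM_\d+\.\d+:c\..+"),
    "HGVSp": re.compile(r"NP_\d+\.\d+:p\..+"),
    "dbSNP": re.compile(r"rs\d+"),
    "Entrez Gene ID or PubMed ID": re.compile(r"\d+"),
}


def classify_exact_term(term):
    """Return the Table 2 term type whose format the whole query matches."""
    term = term.strip()
    for label, pattern in EXACT_PATTERNS.items():
        if pattern.fullmatch(term):
            return label
    # Anything else is treated as free text: a gene symbol or a disease name.
    return "Gene symbol or CHD/syndrome name"


for query in ["GATA4", "2626", "ENSG00000136574", "chr9:133738277-133738277",
              "NM_005157.6:c.677A>G", "NP_001123517.1:p.Gly112Arg", "rs114390380"]:
    print(query, "->", classify_exact_term(query))
```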
In the "Fuzzy Search" mode, users only need to enter a partial query term (Table 2). For example, to search for variations on the NM_005157.6 transcript, a user can enter "NM_005157.6:c" as a fuzzy HGVSc query. Note that for a genome range query, if "chr9:1234567-2345678" is entered and "Exact Search" is selected, only variations mapped exactly onto the corresponding positions are returned. If "Fuzzy Search" is selected, variations with start position ≥ chr9:1234567 and end position ≤ chr9:2345678 are returned.
To help users retrieve data more flexibly, we also provide a table browser at the gene level and the variation level on the Browse Page. Users can click on the "Browse" menu and then choose the browse level (Figure 2). When browsing at the gene level, evidence for the gene of interest can be obtained (Evidence ID starting with "CHDGE"). When browsing at the variation level, evidence for the variation of interest is obtained (Evidence ID starting with "CHDVE").
On the Browse Page, users can click on the "Show/hide columns" button to show specific columns of interest, enter multiple query terms in the search boxes under the corresponding columns for advanced search, and click on the up/down arrows to the right of column names to sort the table or use the “Multiple sort” tool to sort it by multiple columns in a customized order (Figure 2). The crosslinks in the table browser let users view detailed information on the Gene or Variation Evidence Page. Finally, users can click on the "Export data" button to download the table in JSON or CSV format.
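The difference between the two modes for a genome range query can be sketched as follows. The `Variation` record, the function names, and the example coordinates are hypothetical and serve only to illustrate the matching rule stated above: exact search requires identical start and end positions, whereas fuzzy search returns every variation fully contained in the queried interval.

```python
import re
from typing import NamedTuple


class Variation(NamedTuple):
    """Minimal stand-in for a stored variation (hypothetical structure)."""
    name: str
    chrom: str
    start: int
    end: int


RANGE_RE = re.compile(r"(chr[^:]+):(\d+)-(\d+)")


def parse_range(query):
    """Split 'chr9:1234567-2345678' into ('chr9', 1234567, 2345678)."""
    match = RANGE_RE.fullmatch(query.strip())
    if match is None:
        raise ValueError(f"not a genome range: {query!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))


def search_range(variations, query, fuzzy=False):
    """Return variations matching the range exactly or, if fuzzy, inside it."""
    chrom, start, end = parse_range(query)
    hits = []
    for v in variations:
        if v.chrom != chrom:
            continue
        if fuzzy:
            matched = v.start >= start and v.end <= end
        else:
            matched = v.start == start and v.end == end
        if matched:
            hits.append(v)
    return hits


# Made-up records for illustration only.
records = [
    Variation("whole_interval", "chr9", 1234567, 2345678),
    Variation("point_inside", "chr9", 1500000, 1500000),
    Variation("overlaps_edge", "chr9", 2000000, 3000000),
    Variation("other_chrom", "chr1", 1500000, 1500000),
]
query = "chr9:1234567-2345678"
print([v.name for v in search_range(records, query)])
print([v.name for v in search_range(records, query, fuzzy=True)])
```

Under this rule a variation that only partially overlaps the queried interval is returned by neither mode.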
Figure 2 Browse Page with embedded toolbox for advanced search
CNVs and linkage regions associated with diseases typically cover numerous genes, making it difficult to identify the true causal genes. We thus do not provide an Evidence Page or Annotation Page for genes supported only by “CNV” and/or “Linkage” evidence.
Detailed information about evidence at the gene level and variation level is provided in a similar manner on the Gene and Variation Evidence Pages, respectively. Taking the Gene Evidence Page as an example, the evidence list is shown in the upper panel. The "Conclusion" column indicates whether the evidence supports the association of the gene with CHD. The “Associated CHD” and “Syndrome” columns show which CHD types and syndromes the gene has been linked to. “Associated CHD” generally integrates the diagnosed CHD and any accompanying CHD reported among the other phenotypes of patients, unless the authors clearly concluded that the gene is specifically related to certain CHD types. “Syndrome” is taken from the diagnosed syndromes of patients with CHD. If users click on the "Detail" button before the Evidence ID, the evidence details are shown in the lower panel.
Figure 3 Gene Evidence Page
To view functional annotations for genes, users can click on the “Gene Annotation” label at the top left. The "Basic gene information" section includes official symbols, aliases, full gene names, a functional summary, and crosslinks (blue buttons) to other databases, such as HGNC, NCBI Entrez Gene, Ensembl, GeneCards, OMIM, UniProt, Mouse Genome Informatics (MGI), and the Zebrafish Model Organism Database (ZFIN) (Figure 4). Furthermore, functional information and data, including gene–phenotype associations, spatiotemporal gene expression profiles, PTMs, GO terms, biological pathways in which the genes are involved, protein–protein interactions, and gene–drug interactions, are available to facilitate in-depth investigation (Figure 4).
Figure 4 Gene Annotation Page
On the Variation Page, users can also click on the “Variation Annotation” label at the top left to view the annotations for SNVs/Indels, including variant consequences, allele frequencies in different gnomAD populations, conservation scores, and predicted variant deleteriousness (Figure 5).
Figure 5 Variation Annotation Page
In CHDbase, the collected genotype–phenotype associations link to a total of ~150 CHD types and 160 related syndromes. The top 52 CHD types, ranked by the number of publications reporting them, are shown first in a bar plot (Figure 6). Besides the number of publications, the numbers of genes reported only in syndromic CHD, only in nonsyndromic CHD, and in both syndromic and nonsyndromic CHD are provided for each CHD type. Users can click on a bar to view the details on the Browse Page.
Figure 6 The top 52 CHD types ranked by the number of publications on CHD Type Page
For the 27 CHD types associated with at least ten genes in CHDbase, pairwise correlations were calculated using the Jaccard coefficient, a measure of similarity between the gene sets of two CHD types that ranges from 0 to 1. The number of overlapping genes, the number of genes for each CHD type, the Jaccard coefficient, the Jaccard distance, and the statistical significance are shown in the table (Figure 7). The value in the "Statistics" column is the centered Jaccard coefficient, which equals the Jaccard coefficient minus the unbiased estimate of its expectation. P values were calculated with the Jaccard test and corrected for multiple testing using the Benjamini–Hochberg false discovery rate procedure. Based on the pairwise Jaccard distance, which equals 1 minus the Jaccard coefficient, the 27 CHD types were classified into seven major groups using hierarchical clustering analysis. The matrix of pairwise Jaccard distances and the clustering result are shown in the heatmap (Figure 7).
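The basic quantities in this analysis are simple to compute. The sketch below shows the Jaccard coefficient, the Jaccard distance, and a standard Benjamini–Hochberg adjustment on toy inputs. It does not reproduce the centered coefficient, the Jaccard test, or the hierarchical clustering used for CHDbase. The function names, the placeholder gene labels (G1, G2, ...), and the example P values are all illustrative assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two gene sets (0 if both empty)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def pairwise_jaccard(gene_sets):
    """Compute overlap, coefficient, and distance for every pair of CHD types."""
    rows = []
    for x, y in combinations(sorted(gene_sets), 2):
        coefficient = jaccard(gene_sets[x], gene_sets[y])
        rows.append({
            "pair": (x, y),
            "overlap": len(set(gene_sets[x]) & set(gene_sets[y])),
            "jaccard": coefficient,
            "distance": 1.0 - coefficient,  # Jaccard distance
        })
    return rows


def benjamini_hochberg(pvalues):
    """Return BH-adjusted P values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Placeholder gene sets; the labels are not real CHDbase annotations.
toy_sets = {
    "TOF": {"G1", "G2", "G3"},
    "VSD": {"G2", "G3", "G4"},
    "ASD": {"G5"},
}
for row in pairwise_jaccard(toy_sets):
    print(row)
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

The resulting distance values could then be passed to any hierarchical clustering routine, with the linkage method being a further choice that the text does not specify.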
Figure 7 Pairwise correlations and classification of 27 CHD types on CHD Type Page
Finally, the list of 160 CHD-related syndromes with specific cardiac features, other clinical findings, and OMIM links is provided (Figure 8). These syndromes are categorized into four groups according to the frequency of CHD in the syndrome: very commonly associated (CHD frequency ≥ 50%), frequently associated (20% ≤ CHD frequency < 50%), occasionally associated (5% ≤ CHD frequency < 20%), and very occasionally associated (CHD frequency < 5%).
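The thresholds listed above amount to a simple binning rule, sketched here for clarity. The function name `chd_frequency_group` is hypothetical, and the input is assumed to be the CHD frequency expressed as a percentage.

```python
def chd_frequency_group(freq_percent):
    """Map the CHD frequency in a syndrome (in percent) to its group label."""
    if not 0 <= freq_percent <= 100:
        raise ValueError("frequency must be a percentage between 0 and 100")
    if freq_percent >= 50:
        return "very commonly associated"
    if freq_percent >= 20:
        return "frequently associated"
    if freq_percent >= 5:
        return "occasionally associated"
    return "very occasionally associated"


for value in (75, 50, 20, 19.9, 5, 1):
    print(value, "->", chd_frequency_group(value))
```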
Figure 8 List of 160 CHD-related syndromes on CHD Type Page
As described in the “CHD-related Gene Prioritization” section, we prioritized the 1124 CHD-related genes using a gene interaction network approach and extracted a core sub-network of 163 genes. To help users rank genes based on their own experience and preferences, the CHD-related genes are provided on the Gene Page together with meta-data, such as whether a gene belongs to the core gene set, its centrality scores, and the number of evidence items supporting its association with CHD (Figure 9). The expression profiles and functional enrichment of CHD-related genes are also shown.
Figure 9 1124 CHD-related genes with core gene identity, network centrality scores and number of supporting evidence items
A total of 1006 structural variations and 2585 SNVs/Indels are included in CHDbase. On the Variation Page, the histogram in the upper panel shows the number of variations grouped by different types as annotated by the Ensembl Variant Effect Predictor (VEP) (Figure 10). Users can click on the bar to view the details on the Browse Page. The table in the lower panel lists the number of variations grouped by gene and variation type (Figure 10). If one variation is annotated as different types on different transcripts, the variation number of each corresponding type is increased by one.
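The counting rule in the last sentence can be illustrated with a short sketch. The function name, the input layout, and the example identifiers are hypothetical. The sketch also assumes that a type occurring on several transcripts of the same variation is counted once for that variation, which the text implies but does not state explicitly.

```python
from collections import Counter


def count_variations_by_type(annotations):
    """Count variations per consequence type.

    `annotations` is an iterable of (variation_id, transcript_id, type)
    tuples, one per transcript-level annotation.
    """
    # Collapse to distinct (variation, type) pairs so that each variation
    # contributes at most one count to each type it is annotated with.
    pairs = {(variation, ctype) for variation, _transcript, ctype in annotations}
    return Counter(ctype for _variation, ctype in pairs)


# Made-up annotations: v1 is missense on two transcripts and intronic on one.
example = [
    ("v1", "tx1", "missense_variant"),
    ("v1", "tx2", "intron_variant"),
    ("v1", "tx3", "missense_variant"),
    ("v2", "tx1", "missense_variant"),
]
print(count_variations_by_type(example))
```

A consequence of this rule is that the per-type counts can sum to more than the total number of variations.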
Figure 10 Variation Page
A total of 1114 publications were manually curated into CHDbase. On the Source Page, the number of publications per year of publication is shown in a bar plot, and detailed information on these publications is provided in a table (Figure 11).
Figure 11 Source Page
Abbreviation | Disease |
---|---|
AA | Aortic valvar atresia |
APVC | Anomalous pulmonary venous connection |
AR | Congenital aortic regurgitation |
AS | Aortic stenosis |
ASD | Atrial septal defect |
ASI | Atrial situs inversus |
AVS | Congenital aortic valvar stenosis |
AVSD | Atrioventricular septal defect |
BAV | Bicuspid aortic valve |
BPV | Bicuspid pulmonary valve |
cc-TGA | Congenitally corrected transposition of the great arteries |
cAVJ | Common atrioventricular junction |
CHD | Congenital heart defect |
CoA | Coarctation of aorta |
CTD | Cardiac conotruncal defects |
DCM | Dilated cardiomyopathy |
DCRV | Double-chambered right ventricle |
DOLV | Double outlet left ventricle |
DORV | Double outlet right ventricle |
Dxc | Dextrocardia |
d-TGA | D-Transposition of the great arteries |
EA | Ebstein malformation of tricuspid valve |
ECD | Endocardial cushion defect |
HCM | Hypertrophic cardiomyopathy |
HLHS | Hypoplastic left heart syndrome |
HRHS | Hypoplastic right heart syndrome |
HTX | Heterotaxy syndrome |
IAA | Interrupted aortic arch |
LSLs | Left-sided lesions |
LSVC | Left superior caval vein |
LVNC | Left ventricular non-compaction cardiomyopathy |
LVOTO | Congenital left ventricular outflow tract obstruction |
MA | Mitral atresia |
MAPCAs | Major aortopulmonary collateral arteries |
MR | Congenital mitral regurgitation |
MVS | Congenital mitral valvar stenosis |
OA | Overriding aorta |
PA | Congenital pulmonary atresia |
PAH | Pulmonary arterial hypertension |
PA-IVS | Pulmonary atresia with intact ventricular septum |
PAPVC | Partial anomalous pulmonary venous connection |
PA-VSD | Pulmonary atresia with ventricular septal defect |
PCD | Primary ciliary dyskinesia |
PDA | Patent arterial duct |
POF | Patent oval foramen |
PS | Pulmonary stenosis |
PTA | Persistent truncus arteriosus |
PVS | Pulmonary valve stenosis |
RAA | Right aortic arch |
RAI | Right atrial isomerism |
RVOTO | Congenital right ventricular outflow tract obstruction |
SA | Single atrium |
SV | Single ventricle |
TA | Tricuspid atresia |
TAPVC | Total anomalous pulmonary venous connection |
TGA | Transposition of the great arteries |
TOF | Tetralogy of Fallot |
TR | Congenital tricuspid regurgitation |
VCAb | Vena cava abnormality |
VOTO | Congenital ventricle outflow tract obstruction |
VSD | Ventricular septal defect |