• PanGFR-HM is a resource for quickly estimating the gene content of human-associated microflora by building the pan genome of species, genera and more. It includes around 1300 complete microbial genomes (annotated proteomes) available for pan genomic analysis. The raw protein FASTA sequences from the HMP-DACC database are used after processing. Any of these strains can be selected for analysis at the desired taxonomic level (preferably species or genus level). The different modules of this resource (accessible from the homepage) provide different options for pan genomic and comparative analysis of these microbiota.
• Pan-TX and Pan-BS enable taxonomy-wise and body-site-wise strain selection, respectively, and produce exactly the same results. Pan-CA enables comparative analysis of 2 to 4 groups of selected strains and reports the group-wise presence or absence of genes (based on protein sequence orthology), along with COG- and KEGG-based presence or absence.
Pan-TX : Taxonomy wise pan genome estimation (Input steps)
This module enables taxonomy-wise selection of human microbiome strains. The steps for running this module are illustrated in Fig. 1.
Figure 1: Steps for running Pan-TX Module.
Please wait while the process is running. Larger queries take proportionately longer because a large amount of data is queried. All results appear on the same page, with further data download links in the respective panels.
Pan-BS : Bodysite wise pan genome estimation (Input steps)
This module enables body-site-wise selection of human microbiome strains. The steps for running this module are illustrated in Fig. 2.
Figure 2: Steps for running Pan-BS Module.
Everything works the same as in Pan-TX, except that strains are listed by body site so that the user can select strains from the body site of isolation of interest. Fig. 2 shows Enterococcus strains from the Gastrointestinal Tract. Lists of Enterococcus strains from other body sites (if any) can be seen by scrolling down the selection panel. Please wait while the process is running. Larger queries take proportionately longer because a large amount of data is queried. All results appear on the same page, with further data download links in the respective panels.
Pan-TX and Pan-BS Results
Pan-TX and Pan-BS produce exactly the same output. The pan genome analysis results are explained in this section. The list of strains selected in Fig. 1 in the Pan-TX section is used as the test dataset.
• SELECTED DATASET: The selected list of strains with their basic details is displayed at the top of the results page. The example is shown in the following Figure.
• PAN GENOME DISTRIBUTION: The pan genome distribution is given as text as well as pie charts, bar charts and box plots. The example is shown in the following Figure. The red boxes highlight the download options.
• PHYLOGENETIC RECONSTRUCTION: Two types of phylogenetic trees are reconstructed, one using the core set of genes and the other using the presence or absence of genes in the pan genome.
• STRAINWISE PAN GENOME STATISTICS: This section lists the core, accessory and unique gene families of each included strain, along with options to visualize and download the respective sequences for each strain. The example is shown in the following Figure.
• COG AND KEGG DISTRIBUTION: The COG and KEGG distributions for core, accessory and unique gene families are plotted as bar charts based on annotation counts. The example is shown in the following Figure.
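The core, accessory and unique categories reported above can be illustrated with a minimal sketch. It assumes the conventional pan genome definitions (core: present in all selected strains; unique: present in exactly one strain; accessory: everything in between) and a hypothetical in-memory presence/absence table. The function name, family IDs and strain names are illustrative and are not part of PanGFR-HM.

```python
from typing import Dict, List


def classify_gene_families(
    presence: Dict[str, Dict[str, bool]]
) -> Dict[str, List[str]]:
    """Split gene families into core, accessory and unique sets.

    `presence` maps a gene family ID to {strain name: present?}.
    This layout is an assumption made for illustration only.
    """
    result: Dict[str, List[str]] = {"core": [], "accessory": [], "unique": []}
    for family, by_strain in presence.items():
        n_strains = len(by_strain)
        n_present = sum(1 for present in by_strain.values() if present)
        if n_present == 0:
            continue  # family is absent from every selected strain
        if n_present == n_strains:
            result["core"].append(family)       # found in all strains
        elif n_present == 1:
            result["unique"].append(family)     # found in a single strain
        else:
            result["accessory"].append(family)  # found in some, not all
    return result


# Hypothetical three-strain example
presence = {
    "fam1": {"A": True, "B": True, "C": True},
    "fam2": {"A": True, "B": True, "C": False},
    "fam3": {"A": False, "B": False, "C": True},
}
classes = classify_gene_families(presence)
print(classes)
```

With a single selected strain every family present counts as core, which is one reason the analysis is most informative at species or genus level with several strains.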
Pan-CA : Comparative Pan Genome Analysis (Input steps)
This module enables comparative gene analysis of up to 4 groups of human microbiome strains. The steps for running this module are illustrated here.
Steps for running Pan-CA Module.
Please wait while the process is running. Larger queries take proportionately longer because a large amount of data is queried. All results appear on the same page, with further data download links in the respective panels.
Pan-CA Results
Pan-CA produces 3 kinds of comparative analyses. The pan genome analysis results are explained in this section. The list of strains shown in the Pan-CA input section is used as the test dataset.
• SELECTED DATASET: The selected list of strains with their basic details is displayed at the top of the results page. The example is shown in the following Figure. (Complete list contains 5 strains for each of the 3 groups as shown in Pan-CA Input Panel.)
• COMPARATIVE GENE DISTRIBUTION: The pan genome distribution is given as a Venn diagram. The example is shown in the following Figure. The red boxes highlight the download options. The blue box shows a sample result: 3067 core (shared) gene families among the three groups, where a family counts as shared if it is present in at least 1 strain per group. Further details show how many of these 3067 shared gene families were common to all 15 selected strains (Core), present in more than 1 strain per group (Accessory) or present in only 1 strain per group (Unique). Every detail is available in the downloaded XLS files.
• COMPARATIVE COG DISTRIBUTION: The comparative COG distribution is given as a Venn diagram. The comparison is based on the exclusive presence of the non-redundant set of COG identifiers (NCBI COG Database) found in the respective groups.
• COMPARATIVE KEGG DISTRIBUTION: The comparative KEGG distribution is given as a Venn diagram. The comparison is based on the exclusive presence of the non-redundant set of KEGG Orthology identifiers (KEGG Orthology Database) found in the respective groups.
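The group-wise comparison behind these Venn diagrams can be sketched as follows. This is only an illustration under stated assumptions: an identifier (gene family, COG or KEGG Orthology ID) is treated as present in a group if at least one strain of that group carries it, and each Venn region holds the identifiers found in exactly that combination of groups. The function names, strain names and group names are hypothetical and do not reflect the PanGFR-HM implementation.

```python
from typing import Dict, FrozenSet, List, Set


def group_identifiers(strain_ids: Dict[str, Set[str]], strains: List[str]) -> Set[str]:
    """Non-redundant identifiers present in at least one strain of a group."""
    return set().union(*(strain_ids[s] for s in strains))


def venn_regions(groups: Dict[str, Set[str]]) -> Dict[FrozenSet[str], Set[str]]:
    """Map each combination of groups to the identifiers exclusive to it."""
    regions: Dict[FrozenSet[str], Set[str]] = {}
    all_ids = set().union(*groups.values())
    for identifier in all_ids:
        # Groups in which this identifier occurs define its Venn region
        members = frozenset(name for name, ids in groups.items() if identifier in ids)
        regions.setdefault(members, set()).add(identifier)
    return regions


# Hypothetical COG annotations for three strains in two groups
strain_cogs = {
    "s1": {"COG0001", "COG0002"},
    "s2": {"COG0002", "COG0003"},
    "s3": {"COG0003", "COG0004"},
}
groups = {
    "G1": group_identifiers(strain_cogs, ["s1", "s2"]),
    "G2": group_identifiers(strain_cogs, ["s3"]),
}
regions = venn_regions(groups)
for members, ids in regions.items():
    print(sorted(members), sorted(ids))
```

The region keyed by all group names corresponds to the shared set in the centre of the diagram; regions keyed by a single group name hold identifiers exclusive to that group.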
BLAST Search
This module allows users to upload their own protein FASTA file and reports significant BLASTP hits against the desired strains from the PanGFR-HM dataset.
Steps for running BLAST Search: First upload your file or paste your sequences, then select 1 or more strains from the list to search against all their protein families. The analysis ID will be displayed. You can download your results via the Retrieve Earlier Results link. Once the BLAST search completes, the output is provided as a table of significant hits with the core, accessory and unique protein families, along with their KEGG and COG details.
Methodology
Design and development of PanGFR-HM involved multiple steps and rigorous processing. The overall workflow is depicted in the following flowchart.
Steps in Design and Development of PanGFR-HM
• ACQUISITION OF RAW DATA: The raw data were accessed from the public data repository of the Human Microbiome Project (HMP-DACC) as protein multi-FASTA (PEP) files for around 1300 strains in January 2017. The assembly levels were up to high-quality drafts for most of the strains, with optimum protein annotations. The sequences were filtered before proceeding, i.e. sequences with fewer than 50 amino acids were removed; however, very few such sequences were found.
• SEQUENCE CLUSTERING: The protein sequences from all the strains were clustered with USEARCH (Linux version), using several clustering identity cut-off levels in separate runs. The clusters (also referred to as gene families) generated from each run were then processed into a logical presence/absence matrix of proteins by the BPGA Pipeline along with in-house Perl scripts. Representative sequences of these gene families were then mapped to the KEGG and COG databases by the BPGA Pipeline to assign a protein function to each gene family. A large portion of these gene families could not be assigned any function and remain uncharacterized.
• SCHEMA AND DATABASE DESIGN: The gene presence/absence data for each gene family in each microbiome strain at the various sequence clustering levels were very large and difficult to handle. A database schema was therefore designed to organize the data in a structured form and to enable easy querying from the web interface, using the MySQL Database Engine (Community Version).
• DESIGN AND DEVELOPMENT OF WEB INTERFACE: The PanGFR-HM web interface is designed using HTML and JavaScript, while the queries and internal programs are coded in PHP. The user-selected strains, type of analysis and parameters are processed, and the respective gene presence/absence, KEGG, COG and other details are retrieved from the database. The information is then processed further to display results as tables or charts and is also written to downloadable files. The web interface enables easy querying and detailed analysis of the gene or functional repertoire of microbiome strains drawn from a huge database.
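Two of the data preparation steps described above, the 50 amino acid length filter and the conversion of clusters into a logical presence/absence matrix, can be sketched in a few lines. This is an illustration only: the actual pipeline used USEARCH, the BPGA Pipeline and in-house Perl scripts, whereas the sketch below is plain Python, and the `strain|protein` header format, function names and sample data are hypothetical assumptions.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

MIN_LENGTH = 50  # amino acids; shorter sequences are dropped


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_short(records: Iterable[Tuple[str, str]],
                 min_length: int = MIN_LENGTH) -> List[Tuple[str, str]]:
    """Keep only sequences with at least `min_length` residues."""
    return [(h, s) for h, s in records if len(s) >= min_length]


def presence_absence(clusters: Dict[str, List[str]],
                     strains: List[str]) -> Dict[str, Dict[str, bool]]:
    """Build a gene family x strain matrix from cluster memberships.

    Member IDs are assumed to look like 'strain|protein' (hypothetical).
    """
    matrix = {}
    for family, members in clusters.items():
        present = {member.split("|", 1)[0] for member in members}
        matrix[family] = {strain: strain in present for strain in strains}
    return matrix


# Hypothetical input: one 60-residue, one 3-residue, one 60-residue protein
fasta = [">strainA|p1", "M" * 60, ">strainA|p2", "MKV",
         ">strainB|p1", "M" * 30, "K" * 30]
kept = filter_short(parse_fasta(fasta))

clusters = {"fam1": ["strainA|p1", "strainB|p1"], "fam2": ["strainA|p3"]}
matrix = presence_absence(clusters, ["strainA", "strainB"])
print([h for h, _ in kept], matrix)
```

Storing the matrix as booleans per gene family and strain mirrors the "logical presence absence matrix" wording above; how PanGFR-HM lays this out in its MySQL schema is not described in this document.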
Dependencies and Other Tools
• PanGFR-HM is compatible with the latest versions of Mozilla Firefox, Google Chrome and Safari.
• PanGFR-HM uses external scripts, including Plotly, for plotting results as charts. The advantage of the Plotly scripts is that the user gets interactive charts with image download options and a cloud-based editing option for plots.
• Phylogenetic tree reconstruction needs MUSCLE.
• Venn diagrams and phylogeny images are constructed using D3 and venn.js scripts.