MetaFunc Docs

Documentation of the MetaFunc pipeline for metatranscriptomics/genomics analysis. The pipeline summarizes microbial taxa abundances and identified protein matches. The protein matches are then used to identify functional categories of the microbiome.

In addition, the pipeline can analyse host reads among the input to establish host gene expression and functional information and expression correlation with microbial taxa abundance.

Default usage

In microbiome studies, it is useful to know functional processes contributed by the microbiome as well as the identity of the microbes that are present in a sample. The Kaiju package uses protein matches to identify which microorganisms are present in a sequenced sample. Basing on Kaiju’s nr_euk database, this workflow takes this a step further by summarizing the identified microbes into abundance tables and binning the protein accessions into gene ontology functional annotations to give users a picture of what functions are taking place in the microbiome environment. The most important results are then summarized and presented in a web application so that a user is able to easily navigate and interrogate the results per sample or group/condition.

This workflow takes as input sequencing reads (any combinations of paired-end, single-end, or both) in fasta or fastq format, runs these through Kaiju, and parse the raw results through custom scripts to give the following results:

Taxonomy Information

  • Taxonomy ID : Read Count (per sample) table.

    Taxonomy IDs as rows and samples as columns

    Taxonomy IDs belonging to any species from a taxa specified in kaiju-taxonlistEuk.tsv. Only taxonomy IDs with an abundance >= 0.001% in >= 1 sample are included (see Taxonomy ID : Scaled Read Count (per sample) information below). Lineage1 of the TaxIDs are also included as columns.

  • Taxonomy ID : Scaled Read Count (per sample) table.

    Taxonomy IDs as rows and samples as columns

    Per sample, scaled counts are obtained by dividing read counts for each taxonomy ID by the total number of reads identified at the species level and multiplying this quotient by 100. Only taxonomy IDs with abundance >= 0.001% in >= 1 sample are included. Lineage1 of the TaxIDs are also included as columns.

  • Taxonomy ID : Average Scaled Reads (per group/condition) table (Optional).

    Taxonomy IDs as rows and groups/conditions as columns

    From the scaled read count table, the average of the scaled read counts matching to a taxonomy ID are obtained from samples belonging to a group. Lineage1 of the TaxIDs are also included as columns.

1Lineage includes Kingdom, Phylum, Class, Order, Family, Genus, Species

  • Differential abundance results from edgeR comparing species between groups (Optional).

    Differential abundance analysis of the species is carried out using edgeR. Rows are per <Taxon IDs>_<Taxon Names> with additional result columns from the exactTest with p-value adjustment using FDR. Outputs are sorted in an ascending manner based on FDR.

Functional Information

Note

For functional assessment, the pipeline uses proportional read counts. Per sample, Kaiju classifies reads by matching the six translated frames to protein sequences. Proportional read counts are obtained if a read matched to more than one protein by dividing 1 read count by the number of proteins it has matched to. A corresponding scaled read count is obtained by dividing the proportional read count by the total number of species-level reads and multiplying the result by 100. Only reads at the species level are considered.

  • Gene Ontologies : %Read Count (per sample) table.

    Gene Ontologies as rows and samples as columns

    Proteins are gathered per taxon of interest (e.g. Bacteria, Viruses) and the gene ontology (GO) annotations are obtained. Only proteins belonging to taxonomy IDs included in the taxonomy tables are considered. Proteins are binned into GO annotations by adding up their scaled read counts (see Note) per GO ID. Per namespace (Biological Process, Molecular Function, Cellular Component), the percentage (%) of a GO ID is obtained by dividing the scaled read counts covering a GO ID by the total number of scaled reads in that namespace, multiplied by 100.

  • Gene Ontologies : %Read Count (per group/condition) table (Optional).

    Gene ontologies as rows and groups/conditions as columns

    As above, but read counts and scaled read counts of a protein accession number are first obtained by averaging the counts among samples belonging to a group. (GO) annotation information is then obtained in the same manner, using a group/condition.

Additional Features

Host Information

Often, microbiomes are studied that belong to a host organism (e.g. human). The MetaFunc pipeline offers the option of mapping reads first to a host genome using STAR. Unmapped reads would be analysed by Kaiju to generate the results of Default usage. Results from host mapping can then undergo gene expression quantification, differential gene expression analysis (DGEA), and gene set enrichment analysis (GSEA).

Results with host mapping include the following:

  • Gene : Read Count (per sample) table.

    Gene IDs as rows and samples as columns.

  • Gene : Tags per million (per sample) table

    Gene IDs as rows and samples as columns.

  • Differential Gene Expression Analysis (DGEA) results from edgeR comparing genes between groups (Optional).

    DGEA on the genes are carried out using edgeR. Rows are gene IDs with additional result columns from the exactTest function with p-value adjustment using FDR. Outputs are sorted in an ascending manner based on FDR.

  • Gene Set Enrichment Analysis (GSEA) (Optional)

    Based on dgea results, would indicate what gene sets are enriched in differentially abundant and differentially depleted genes when ranked. Positive enrichment scores mean a particular gene set is overrepresented in the differetially abundant genes while negative enrichment scores mean they are overrepresented in differentially depleted genes.

Host-Microbiome Information

Host-microbiome relationships can also be obtained through spearman correlation analysis of differentially expressed genes (DEGs) and differentially abundanct (DA) microbes.

  • Spearman Correlation Analysis between top DGEA genes and top DA species (Optional)

    Spearman correlation analysis is performed between the top differentially expressed genes and top differentially abundant species in a pairwise manner. Results include the correlation score rho and p-value. A summary matrix of rho values between each gene and each species is also provided.

An illustration of the above results can be found in Details of pipeline results.