METHOD

Metagenome pipeline: from raw sequencing data to taxonomic and functional tables

Metagenomic Binning

Metagenomic binning is a computational process that groups the sequences (contigs) obtained from metagenomic data into bins, each of which approximates the genome of a single organism. This approach is critical for understanding the complex microbial communities found in various environments.

# $1 is the sample name used as the prefix of the clean read files
cleandata="$PWD"
source activate metawrap-env
mkdir ${1}_metawrap
cd ${1}_metawrap
ln -s ${cleandata}/${1}_1.cl.fq ${1}_1.fastq
ln -s ${cleandata}/${1}_2.cl.fq ${1}_2.fastq
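# Assembly step (left commented out as in the original). The later steps read ${1}_ASSEMBLY/final_assembly.fasta, so uncomment this line if that file does not exist yet.
#metawrap assembly -1 ${1}_1.fastq -2 ${1}_2.fastq -m 80 -t 8 --megahit -o ${1}_ASSEMBLY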
metawrap kraken -o ${1}_KRAKEN -t 8 ${1}_ASSEMBLY/final_assembly.fasta *.fastq
rm -rf ${1}_ASSEMBLY/megahit/intermediate_contigs
metawrap binning -o ${1}_INITIAL_BINNING -t 8 -a ${1}_ASSEMBLY/final_assembly.fasta \
--metabat2 --maxbin2 --concoct --run-checkm *fastq
metawrap bin_refinement -o ${1}_BIN_REFINEMENT -t 8 -A ${1}_INITIAL_BINNING/metabat2_bins/ \
-B ${1}_INITIAL_BINNING/maxbin2_bins/ -C ${1}_INITIAL_BINNING/concoct_bins/ -c 50 -x 10
metawrap blobology -a ${1}_ASSEMBLY/final_assembly.fasta -t 8 -o ${1}_BLOBOLOGY \
--bins ${1}_BIN_REFINEMENT/metaWRAP_bins *fastq
metawrap quant_bins -b ${1}_BIN_REFINEMENT/metaWRAP_bins -t 8 -o ${1}_QUANT_BINS \
-a ${1}_ASSEMBLY/final_assembly.fasta *fastq
metawrap reassemble_bins -o ${1}_BIN_REASSEMBLY -1 ${1}_1.fastq -2 ${1}_2.fastq -t 8 \
 -c 50 -x 10 -b ${1}_BIN_REFINEMENT/metaWRAP_bins
metawrap classify_bins -b ${1}_BIN_REASSEMBLY/reassembled_bins -o ${1}_BIN_CLASSIFICATION -t 8
metawrap annotate_bins -o ${1}_FUNCT_ANNOT -t 8 -b ${1}_BIN_REASSEMBLY/reassembled_bins/
source deactivate
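
The -c 50 -x 10 options above set a minimum completeness and a maximum contamination for the bins that are kept. As a minimal sketch of that filtering logic, the Python snippet below applies the same two thresholds to a small table. The column names (bin, completeness, contamination), the inclusive bounds, and the example values are assumptions made for illustration; they are not output from this pipeline.

```python
import io
import pandas as pd


def filter_bins(stats, min_completeness=50.0, max_contamination=10.0):
    """Keep bins that satisfy both quality thresholds (inclusive bounds assumed)."""
    keep = (stats["completeness"] >= min_completeness) & (
        stats["contamination"] <= max_contamination
    )
    return stats.loc[keep].reset_index(drop=True)


# Made-up example values for illustration only, not real results.
example = io.StringIO(
    "bin\tcompleteness\tcontamination\n"
    "bin.1\t95.2\t1.3\n"
    "bin.2\t48.0\t0.5\n"
    "bin.3\t72.4\t12.8\n"
    "bin.4\t60.1\t9.9\n"
)
stats = pd.read_csv(example, sep="\t")
good_bins = filter_bins(stats)
print(good_bins["bin"].tolist())
```

In this toy table, bin.2 fails the completeness threshold and bin.3 fails the contamination threshold, so only bin.1 and bin.4 remain.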

GTDB-Tk (Genome Taxonomy Database Toolkit)

GTDB-Tk is a bioinformatics toolkit designed for the classification and taxonomic annotation of bacterial and archaeal genomes. It is based on the Genome Taxonomy Database (GTDB), which provides a standardized framework for prokaryotic taxonomy using genome-based phylogenetic analyses. GTDB-Tk uses a set of curated reference genomes and a standardized set of phylogenetic markers to classify genomes at each taxonomic rank, from domain to species. The three commands below run the workflow step by step: identify marker genes, align them, and classify the genomes.

conda activate gtdbtk
time gtdbtk identify --genome_dir metawrap_70_5_bins/ --out_dir identify_output --cpus 6 --prefix F --debug -x .fa --write_single_copy_genes
time gtdbtk align --identify_dir identify_output/ --out_dir align_output --cpus 3 --prefix F --debug
time gtdbtk classify --genome_dir metawrap_70_5_bins/ --align_dir align_output/ --out_dir classify_output --cpus 3 --prefix F --debug -x .fa 
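
GTDB classifications are written as a semicolon-separated lineage string with rank prefixes (d__, p__, c__, o__, f__, g__, s__). The sketch below shows one way to split such a string into named ranks in Python. The function name parse_gtdb_lineage and the example lineage are illustrative assumptions, not part of GTDB-Tk.

```python
# Mapping from GTDB rank prefixes to rank names.
RANK_PREFIXES = {
    "d": "domain",
    "p": "phylum",
    "c": "class",
    "o": "order",
    "f": "family",
    "g": "genus",
    "s": "species",
}


def parse_gtdb_lineage(lineage):
    """Split a GTDB lineage string into a {rank: name} dict.

    Ranks with an empty name (e.g. 's__') are stored as None.
    """
    ranks = {}
    for field in lineage.split(";"):
        prefix, _, name = field.strip().partition("__")
        if prefix in RANK_PREFIXES:
            ranks[RANK_PREFIXES[prefix]] = name or None
    return ranks


# Illustrative lineage string; the species rank is left unassigned.
example = (
    "d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;"
    "o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__"
)
parsed = parse_gtdb_lineage(example)
print(parsed["genus"], parsed["species"])
```

Splitting the lineage this way makes it straightforward to aggregate bins at a chosen rank, for example counting bins per genus.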

Microbial Network Analysis (Python code)

Microbial network analysis is a powerful approach used to study the complex interactions within microbial communities. This method involves constructing networks where nodes represent microbial taxa (such as species, genera, or operational taxonomic units), and edges represent the interactions or associations between these taxa. These interactions can be based on various types of data, including co-occurrence patterns, gene expression correlations, metabolic dependencies, or environmental co-factors.

import pandas as pd
import numpy as np
import seaborn as sns            # plotting imports; not used in this excerpt
import matplotlib.pyplot as plt
from scipy import stats

# Genus abundance table; correlations are computed between its columns
df = pd.read_csv("genus-sum.txt.t", sep="\t", index_col=0)

def corr_sig(df):
    """Return the matrix of Pearson p-values for every pair of columns."""
    cols = df.columns.to_list()
    p_matrix = np.zeros(shape=(len(cols), len(cols)))  # diagonal stays 0
    for i, col in enumerate(cols):
        for j, col2 in enumerate(cols):
            if i == j:
                continue
            _, p = stats.pearsonr(df[col], df[col2])
            p_matrix[i, j] = p
    return p_matrix

p_values = corr_sig(df)
# True everywhere except the lower-triangle cells with p < 0.05
mask = np.invert(np.tril(p_values < 0.05))

ko = df.columns.tolist()
# Undefined p-values (NaN) are set to 1, i.e. treated as non-significant
ppp = pd.DataFrame(p_values, index=ko, columns=ko).fillna(1)
print(ppp)

# Pearson correlation matrix; undefined coefficients are set to 0
cordata = df.corr().fillna(0)
cordata.to_csv("tmp.cordata", sep="\t")  # correlation coefficients
ppp.to_csv("*****", sep="\t")  # p-values
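
The correlation and p-value matrices can be turned into a network by keeping one edge per pair of taxa that passes a significance and strength filter. The snippet below is a minimal sketch of that step. The function name build_edge_list and the |r| >= 0.6 cutoff are assumptions chosen for illustration; the p < 0.05 cutoff follows the mask used in the code above. The toy matrices are made up and are not results.

```python
import pandas as pd


def build_edge_list(corr, pvals, r_min=0.6, p_max=0.05):
    """Return one row per taxon pair with |r| >= r_min and p < p_max.

    Only the upper triangle is scanned, so each pair appears once and
    self-correlations on the diagonal are skipped.
    """
    taxa = corr.columns.to_list()
    rows = []
    for i in range(len(taxa)):
        for j in range(i + 1, len(taxa)):
            r = corr.iat[i, j]
            p = pvals.iat[i, j]
            if abs(r) >= r_min and p < p_max:
                rows.append((taxa[i], taxa[j], r, p))
    return pd.DataFrame(rows, columns=["source", "target", "r", "p"])


# Toy symmetric matrices for three hypothetical taxa (illustration only).
taxa = ["A", "B", "C"]
corr = pd.DataFrame(
    [[1.0, 0.9, -0.7], [0.9, 1.0, 0.2], [-0.7, 0.2, 1.0]],
    index=taxa, columns=taxa,
)
pvals = pd.DataFrame(
    [[0.0, 0.001, 0.2], [0.001, 0.0, 0.01], [0.2, 0.01, 0.0]],
    index=taxa, columns=taxa,
)
edges = build_edge_list(corr, pvals)
print(edges)
```

Here only the A-B pair is both strong and significant, so the edge list has a single row. A table in this source/target form can be loaded into common network tools for visualization.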
 

Functional Annotation and Various Data Comparisons

Functional annotation is a crucial step in understanding the roles and functions of genes within a genome, particularly in metagenomic studies where the goal is to decipher the functional potential of microbial communities. This process involves assigning functional information to genes or protein sequences by comparing them against known databases. One of the most common tools used for functional annotation is BLAST (Basic Local Alignment Search Tool), which compares nucleotide or protein sequences to sequence databases and identifies homologous sequences with similar functions.

When using BLAST for functional annotation, sequences from a newly sequenced genome or metagenome are aligned against well-established databases such as NCBI’s non-redundant (NR) protein database, UniProt, KEGG (Kyoto Encyclopedia of Genes and Genomes), and Pfam (Protein family database). The goal is to find high-scoring matches that indicate homology, suggesting similar functional roles. These matches help predict the biological functions of unknown genes based on the known functions of homologous genes in other organisms.
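
As a minimal sketch of how high-scoring matches might be selected, the Python snippet below reads BLAST tabular output (the standard 12-column -outfmt 6 layout) and keeps the best hit per query by bit score. The function name best_hits, the 1e-5 e-value cutoff, and the example rows are assumptions for illustration only.

```python
import io
import pandas as pd

# Default column order of BLAST tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(blast_table, max_evalue=1e-5):
    """Return the highest-bitscore hit per query with evalue <= max_evalue."""
    hits = pd.read_csv(blast_table, sep="\t", names=BLAST6_COLUMNS)
    hits = hits[hits["evalue"] <= max_evalue]
    hits = hits.sort_values(["qseqid", "bitscore"], ascending=[True, False])
    return hits.drop_duplicates("qseqid", keep="first").reset_index(drop=True)


# Made-up hits for two hypothetical genes (illustration only).
example = io.StringIO(
    "gene1\tprotA\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-50\t350.0\n"
    "gene1\tprotB\t60.0\t180\t72\t0\t1\t180\t5\t184\t1e-20\t150.0\n"
    "gene2\tprotC\t35.0\t100\t65\t0\t1\t100\t1\t100\t0.5\t30.0\n"
)
best = best_hits(example)
print(dict(zip(best["qseqid"], best["sseqid"])))
```

In this toy input, gene1 is assigned to protA (its higher-scoring hit), while gene2 has no hit below the e-value cutoff and is left unannotated.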

In addition to BLAST, various data comparison methods are employed to enhance the accuracy and depth of functional annotation. These methods include:

1. HMMER and Hidden Markov Models (HMMs): HMMER searches use profile HMMs, which are statistical models representing the sequence diversity of protein families. Databases like Pfam and TIGRFAMs provide HMM profiles for protein families and domains, allowing for more sensitive detection of distant homologs compared to BLAST.

2. KEGG Orthology (KO) Assignment: Mapping sequences to KO terms in the KEGG database helps in reconstructing metabolic pathways and understanding the functional capacities of a community, as each KO is associated with a specific biochemical reaction or pathway.

3. EggNOG and COG Analysis: EggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) and COG