Assembly analysis
In the presentation we will cover:
Overview of the MGnify annotation pipeline
Taxonomic assignment
Functional characterisation
Pathways/systems
Using the contig viewer
MGnify hands-on exercises
For this session we will look at some of the data and analyses that are available from MGnify. We will navigate the resource, try out different ways to search for interesting samples/studies, and then investigate the analysis results that are available for assemblies.
Browsing MGnify
From the MGnify front page (https://www.ebi.ac.uk/metagenomics/) you can see various options to browse the data. There are quick links to the data types we support (e.g. amplicon, assembly, metagenomes), as well as to a subset of the biomes that the data covers.
Click on the “wastewater” biome icon.
How many studies does MGnify hold that relate to wastewater?
How many samples does that relate to?
From the sample page, filter the rows for accession ERS1215575 and take a look at the available metadata.
Do you know the exact location of where the sample was taken?
What are the lat/long co-ordinates?
Follow the link to the BioSamples record. Can you find any more information about the location of the sample?
From the tabs in the header bar, select Text search, and then select Samples below the search box. There are a number of metadata fields available to allow you to filter for a sample of interest to you. Not all are relevant to all samples. Within the hierarchy of biomes, navigate to environmental>aquatic>lentic. You should see 92 samples. Now select the depth filter.
How many lentic samples have depth data associated with them?
Using the sliders, can you identify a sample from a lentic water system that was taken at a depth between 47 and 49 m?
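The depth filter can be thought of as a range check that silently excludes samples with no depth value, which is why the number of samples drops as soon as the filter is selected. Below is a minimal sketch of that idea. The sample records, accessions, and the `depth_m` field name are all hypothetical and are not taken from MGnify.

```python
# Hypothetical sample records: accessions, field names and depths are
# made up for illustration and do not come from MGnify.
samples = [
    {"accession": "SAMPLE_A", "biome": "lentic", "depth_m": 12.0},
    {"accession": "SAMPLE_B", "biome": "lentic", "depth_m": 48.0},
    {"accession": "SAMPLE_C", "biome": "lentic", "depth_m": None},  # depth not recorded
    {"accession": "SAMPLE_D", "biome": "lentic"},                   # depth field absent
]


def has_depth(sample):
    """True if the sample carries a depth value at all."""
    return sample.get("depth_m") is not None


def in_depth_range(sample, low, high):
    """True if the sample has a depth within [low, high] metres (inclusive)."""
    depth = sample.get("depth_m")
    return depth is not None and low <= depth <= high


# Samples that would remain once a depth filter is applied at all.
with_depth = [s for s in samples if has_depth(s)]

# Samples that would remain after narrowing the sliders to 47-49 m.
hits = [s["accession"] for s in samples if in_depth_range(s, 47, 49)]

print(len(with_depth), hits)
```

Whether the website treats the slider bounds as inclusive is an assumption here; the sketch uses inclusive bounds for simplicity.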
MGnify assembly analysis
Now we will look at some assembly data that has been analysed by MGnify.
Search for MGYS00003598, and go to this study page. This is a large study where MGnify have assembled the raw reads from an existing public study. The list of assemblies is shown at the bottom of the study page.
How many analyses are included in this study?
Click on the second analysis link in the list (MGYA00510849). Alternatively, you could search for this accession using the text search options. Have a look at the information within the Quality control tab.
How many contigs are included in this analysis?
What length is the longest contig in this dataset?
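The study and analysis accessions used above can also be turned into web addresses for programmatic access. The sketch below only builds the address and makes no network request. It assumes that the MGnify REST API lives under `https://www.ebi.ac.uk/metagenomics/api/v1`, that studies and analyses are exposed as `studies/<accession>` and `analyses/<accession>`, and that accessions have the form `MGYS` or `MGYA` followed by digits. Please check these assumptions against the current API documentation before relying on them.

```python
import re

# Assumption: base URL of the public MGnify REST API.
API_BASE = "https://www.ebi.ac.uk/metagenomics/api/v1"

# Assumption: mapping from accession prefix to API resource name.
RESOURCES = {"MGYS": "studies", "MGYA": "analyses"}

# Assumption: accessions are a four-letter prefix followed by digits,
# inferred from the two accessions used in this exercise.
ACCESSION = re.compile(r"(MGY[SA])(\d+)")


def api_url(accession):
    """Return the assumed API URL for a study (MGYS) or analysis (MGYA)."""
    match = ACCESSION.fullmatch(accession)
    if match is None:
        raise ValueError(f"unrecognised accession: {accession!r}")
    return f"{API_BASE}/{RESOURCES[match.group(1)]}/{accession}"


print(api_url("MGYS00003598"))
print(api_url("MGYA00510849"))
```

Sample accessions such as ERS1215575 are deliberately rejected by this helper, since the sketch only covers the two accession types used in this part of the exercise.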
Click on the Taxonomic analysis tab and examine the phylum composition in the graphs and the Krona plot.
What proportion of the total LSU rRNA predictions are eukaryotic?
Try switching between the other available graph views.
Which phylum contains the highest proportion of LSU rRNA predictions?
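The two questions above come down to simple proportions over the per-phylum prediction counts. As a worked illustration, the sketch below uses invented counts (they are not the values for MGYA00510849) to show how the eukaryotic share and the most abundant phylum could be derived from a table of counts.

```python
# Hypothetical LSU rRNA prediction counts keyed by (domain, phylum).
# These numbers are invented and do not describe MGYA00510849.
lsu_counts = {
    ("Bacteria", "Proteobacteria"): 60,
    ("Bacteria", "Bacteroidetes"): 25,
    ("Eukaryota", "Ascomycota"): 10,
    ("Archaea", "Euryarchaeota"): 5,
}

total = sum(lsu_counts.values())

# Sum the counts of every phylum that belongs to the Eukaryota domain.
eukaryotic = sum(
    count for (domain, _phylum), count in lsu_counts.items()
    if domain == "Eukaryota"
)
eukaryotic_fraction = eukaryotic / total

# The phylum with the largest count.
top_phylum = max(lsu_counts, key=lsu_counts.get)[1]

print(f"{eukaryotic_fraction:.1%} eukaryotic; top phylum: {top_phylum}")
```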
Click on the Functional analysis tab. The top part of this page is a sequence feature summary, which gives the number of contigs with predicted coding sequences (pCDS), the number of pCDS with InterPro matches, and so on.
How many predicted coding sequences (pCDS) are in the assembly?
How many pCDS have InterProScan hits?
Scroll down the page to the InterPro match summary section.
How many different InterPro entries are matched by the pCDS?
Why is this number different to the number of pCDS that have InterProScan hits?
Click on the GO Terms sub-tab. This shows a summary of the most common GO terms annotated to the pCDS, as both bar charts and pie charts.
What are the top 3 biological process terms predicted for the pCDS from this assembly?
Have a look at the information in the Pfam and KO (KEGG orthologue) sub-tabs.
Click on the Pathways/Systems tab. Have a look at the data reported in the 3 sub-tabs: KEGG Module, Genome Properties, and antiSMASH.
How many KEGG modules are reported for this assembly?
How many of these are 100% complete (i.e. all of the constituent KOs are found)?
How many Genome Properties in the category DNA handling are found within this assembly?
What is the most common class of biosynthetic gene cluster found in this assembly?
How many non-ribosomal peptide synthetase gene clusters are identified by antiSMASH in this assembly?
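The notion of a module being 100% complete, as described above (all of the constituent KOs are found), can be sketched as a set comparison. This is a deliberate simplification: it treats each module as a flat set of KOs, whereas real KEGG module definitions can allow alternative KOs for a step, so the pipeline's own completeness values may be calculated differently. The module names and their KO groupings below are hypothetical placeholders.

```python
def module_completeness(module_kos, found_kos):
    """Percentage of a module's constituent KOs that are present in found_kos.

    Simplified: the module is treated as a flat set of required KOs.
    """
    required = set(module_kos)
    if not required:
        raise ValueError("module has no constituent KOs")
    return 100.0 * len(required & set(found_kos)) / len(required)


# Hypothetical modules: the names and KO groupings are placeholders only.
modules = {
    "module_A": ["K00001", "K00002", "K00003", "K00004"],
    "module_B": ["K00005", "K00006"],
}

# Hypothetical set of KOs annotated in an assembly.
found = {"K00001", "K00002", "K00003", "K00005", "K00006"}

completeness = {
    name: module_completeness(kos, found) for name, kos in modules.items()
}
fully_complete = [name for name, pct in completeness.items() if pct == 100.0]

print(completeness, fully_complete)
```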
Click on the Contig Viewer tab. Load the data for the 4th contig in the list by clicking on the contig name (ERZ501066.4-NODE-4-length-276957-cov-33.799655). This contig will now be loaded into the viewer.
The longest pCDS in the contig appears to start at 202339. What protein is coded for?
Looking at the antiSMASH annotations, where within the contig do any transport-related genes fall?
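The contig name used above appears to pack several pieces of information into one string. The sketch below splits it into its parts. The pattern is inferred from this single name, and the field meanings (an ERZ accession, an index, an assembler node number, the contig length in bases, and a coverage value) are assumptions rather than a documented format.

```python
import re

# Assumption: pattern inferred from one contig name in this exercise.
CONTIG_NAME = re.compile(
    r"(?P<assembly>ERZ\d+)\.(?P<index>\d+)"
    r"-NODE-(?P<node>\d+)"
    r"-length-(?P<length>\d+)"
    r"-cov-(?P<coverage>\d+(?:\.\d+)?)"
)


def parse_contig_name(name):
    """Split a contig name into its assumed components."""
    match = CONTIG_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected contig name: {name!r}")
    return {
        "assembly": match["assembly"],
        "index": int(match["index"]),
        "node": int(match["node"]),
        "length": int(match["length"]),
        "coverage": float(match["coverage"]),
    }


info = parse_contig_name("ERZ501066.4-NODE-4-length-276957-cov-33.799655")

# Sanity check: a feature start position must fall within the contig.
pcds_start = 202339
start_is_on_contig = 1 <= pcds_start <= info["length"]

print(info, start_is_on_contig)
```

A check like the last one can be handy when reading coordinates off the viewer, since any position larger than the length in the name cannot belong to that contig.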
Zoom into that region to see the predicted regions in more detail. Have a look at the information about the various transport-related genes.
What region of the contig is predicted to code for a major facilitator transporter?
There are lots of different visualisation options available within the contig viewer. Take some time now to investigate the various options, and experiment by looking at a few different contigs and the annotations they contain.