Assembling data¶
Sequence quality control (quality/adapter trimming, host decontamination)
Assembly/co-assembly
Evaluating an assembly
Submitting primary assemblies to the ENA
Prerequisites¶
This tutorial requires Bandage which can be installed as per the github instructions for your system https://github.com/rrwick/Bandage#pre-built-binaries
For this tutorial you will need to make a working directory to store your data in:
mkdir -p ~/BiATA/session1/data
chmod -R 777 ~/BiATA
export DATADIR=~/BiATA/session1/data/session1
In this directory, download the tarball from http://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_courses/biata_2021/
cd ~/BiATA/session1/data/
wget -q http://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_courses/biata_2021/session1.tgz
tar -xzvf session1.tgz
Now makes sure that you have pulled the docker container
docker pull microbiomeinformatics/biata-qc-assembly:v2021
Finally, start the docker container in the following way:
xhost +
docker run --rm -it -e DISPLAY=$DISPLAY -v $DATADIR:/opt/data -v /tmp/.X11-unix:/tmp/.X11-unix:rw -e DISPLAY=unix$DISPLAY microbiomeinformatics/biata-qc-assembly:v2021
Note that if running Docker windows, WSL must be disabled by selecting “Turn Windows features or or off” when Docker starts.
Part 1 - Quality control and filtering of the raw sequence files¶
Learning Objectives - in the following exercises you will learn
how to assess the quality of short read sequences, identify the
presence of adapter sequences, and remove both adapters and low quality
sequences. You will also learn how to construct a reference database for
decontamination.
First go to your working area. The data that you downloaded
has been mounted in /opt/data in the docker container.
cd /opt/data
ls
Here you should see the same contents as you had from
downloading and uncompressing the session data. As we write into this
directory, we should be able to see this from inside the container, and
on the filesystem of the computer running this container. We will use
this to our advantage as we go through this practical. Unless stated
otherwise, all of the following commands should be executed in the
terminal running the Docker container.
Generate a directory of the fastqc results
cd /opt/data
mkdir fastqc_results
cd fastqc_results
fastqc ../oral_human_example_1_splitaa.fastq.gz
fastqc ../oral_human_example_2_splitaa.fastq.gz
mv ../*html .
mv ../*zip .
Now on your local computer, go to the browser, and
File -> Open File. Use the file navigator to select the following file
~/BiATA/session1/data/fastqc_results/oral_human_example_1_splitaa_fastqc.html
Spend some time looking at the ‘Per base sequence quality’.
For each position a BoxWhisker type plot is drawn. The
elements of the plot are as follows:
The central red line is the median value
The yellow box represents the inter-quartile range (25-75%)
The upper and lower whiskers represent the 10% and 90% points
The blue line represents the mean quality
The y-axis on the graph shows the quality scores. The higher the score the better the base call. The background of the graph divides the y axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red). The quality of calls on most platforms will degrade as the run progresses, so it is common to see base calls falling into the orange area towards the end of a read.
What does this tell you about your sequence data? Where do the
errors start?
In the pre-processed files we see two warnings, as shown on the left side of the report. Navigate to the “Per bases sequence content”
At around 15-19 nucleotides, the DNA composition becomes
very even but at the 5’ end of the sequence there are distinct
differences. Why do you think that is?
Open up the FastQC report corresponding to the reversed
reads.
Are there any significant differences between the forward
and reverse reads?
For more information on the FastQC report, please consult the ‘Documentation’ available from this site: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
We are currently only looking at two files but often we want
to look at many files. The tool multiqc aggregates the FastQC results
across many samples and creates a single report for easy comparison.
Here we will demonstrate the use of this tool
cd /opt/data
mkdir multiqc_results
multiqc fastqc_results -o multiqc_results
In this case, we provide the folder containing the fastqc results to multiqc and the -o allows us to set the output directory for this summarised report.
Now on your local computer, open the summary report from
MultiQC. To do so, go to your browser, and use File -> Open File. Use the
file navigator to select the following file
~/BiATA/session1/data//multiqc_results/multiqc_report.html
Scroll down through the report. The sequence quality
histograms show the following results from each file as two separate
lines. The ‘Status Checks’ show a matrix of which samples passed check
and which ones have problems.
What fraction of reads are duplicates?
So, far we have looked at the raw files and assessed their
content, but we have not done anything about removing
sequences with low quality scores or adapters. So, lets
start this process. The first step in the process is to make a database
relevant for decontaminating the sample. It is always good to routinely
screen for human DNA (which may come from the host and/or staff
performing the experiment). However, if the sample is say from mouse,
you would want to download the the mouse genome instead.
In the following exercise, we are going to use two “genomes”
already downloaded for you in the decontamination folder. To make this
tutorial quicker and smaller in terms of file sizes, we are going to use
PhiX (a common spike in) and just chromosome 10 from human.
cd /opt/data/decontamination
For the next step we need one file, so we want to merge the two different fasta files. This is simply done using the command line tool cat.
cat phix.fasta GRCh38_chr10.fasta > GRCh38_phix.fasta
Now we need to build a bowtie index for them:
bowtie2-build GRCh38_phix.fasta GRCh38_phix.index
It is possible to automatically download a pre-indexed human
genome in Bowtie2 format using the following command (but do not do this
now, as this will take a while to download):
kneaddata_database –download human_genome bowtie2
Now we are going to use the GRCh38_phix database and clean-up
our raw sequences. kneaddata is a helpful wrapper script for a number
of pre-processing tools, including Bowtie2 to screen out contaminant
sequences, and Trimmomatic to exclude low-quality sequences. We also
have written wrapper scripts to run these tools (see below), but using
kneaddata allows for more flexibility in options.
cd /opt/data/
mkdir clean
We now need to uncompress the fastq files.
gunzip -c oral_human_example_2_splitaa.fastq.gz > oral_human_example_2_splitaa.fastq
gunzip -c oral_human_example_1_splitaa.fastq.gz > oral_human_example_1_splitaa.fastq
kneaddata --remove-intermediate-output -t 2 --input oral_human_example_1_splitaa.fastq --input oral_human_example_2_splitaa.fastq --output /opt/data/clean --reference-db /opt/data/decontamination/GRCh38_phix.index --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" --bowtie2-options "--very-sensitive --dovetail" --remove-intermediate-output
* –input, Input FASTQ file. This option is given twice as we have paired-end data.
* –output, Output directory.
* –reference-db, Path to bowtie2 database for decontamination.
* -t, # Number of threads to use (2 in this case).
* –trimmomatic-options, Options for Trimmomatic to use, in quotations (“SLIDINGWINDOW:4:20 MINLEN:50” in this case). See the Trimmomatic website for more options.
* –bowtie2-options, Options for bowtie2 to use, in quotations. The options “–very-sensitive” and “–dovetail” set the alignment parameters to be very sensitive and sets cases where mates extend past each other to be concordant (i.e. they will be called as contaminants and be excluded).
* –remove-intermediate-output, Intermediate files, including large FASTQs, will be removed.
Kneaddata generates multiple outputs in the “clean” directory, containing different 4 different files for each read.
Using what you have learned previously, generate a fastqc
report for each of the oral_human_example_1_splitaa_kneaddata_paired
files. Do this within the clean directory.
cd /opt/data/clean
mkdir fastqc_final
<you construct the command>
Also generate a multiqc report and look at the sequence
quality historgrams.
cd /opt/data/clean
mkdir multiqc
<you construct the command>
View the multiQC report as before using your browser. You
should see something like this:
Open the previous MultiQC report and see have they have
improved?
Did sequences at the 5’ end become uniform? Why might that
be? Is there anything that suggests that adapter sequences were found?
To generate a summary file of how the sequence were
categorised by Kneaddata, run the following command.
cd /opt/data
kneaddata_read_count_table --input /opt/data/clean --output kneaddata_read_counts.txt
less kneaddata_read_counts.txt
What fraction of reads were deemed to be contaminating?
The reads have now be decontaminated any can be uploaded to
ENA, one of the INSDC members. It is beyond the scope of this course to
include a tutorial on how to submit to ENA, but there is additional
information available on how to do this in this Online Training guide
provided by EMBL-EBI
Part 2 - Assembly and Co-assembly¶
Learning Objectives - in the following exercises you will
learn how to perform a metagenomic assembly and to start some basic
analysis of the output. Subsequently, we will demonstrate the
application of co-assembly. Note, due to the complexity of metagenomics
assembly, we will only be investigating very simple example datasets as
these often take days of CPU time and 100s of GB of memory. Thus, do not
think that there is an issue with the assemblies.
Once you have quality filtered your sequencing reads (see Part 1 of this session), you may want to perform de novo assembly in addition to, or as an alternative to a read-based analyses. The first step is to assemble your sequences into contigs. There are many tools available for this, such as MetaVelvet, metaSPAdes, IDBA-UD, MegaHIT. We generally use metaSPAdes, as in most cases it yields the best contig size statistics (i.e. more continguous assembly) and has been shown to be able to capture high degrees of community diversity (Vollmers, et al. PLOS One 2017). However, you should consider the pros and cons of different assemblers, which not only includes the accuracy of the assembly, but also their computational overhead. Compare these factors to what you have available. For example, very diverse samples with a lot of sequence data uses a lot of memory with SPAdes. In the following practicals we will demonstrate the use of metaSPAdes on a small sample and the use of MegaHIT for performing co-assembly.
Using the sequences that you have previously QC-ed, run
metaspades. To make things faster, we are going to turn-off metaspades
own read error correction method, by specifying the command
–only-assembler.
cd /opt/data
mkdir assembly
metaspades.py \
-t 2 \
--only-assembler \
-m 10 \
-1 /opt/data/clean/oral_human_example_1_splitaa_kneaddata_paired_1.fastq \
-2 /opt/data/clean/oral_human_example_1_splitaa_kneaddata_paired_2.fastq \
-o /opt/data/assembly
This takes about 1 hour to complete.
Once this completes, we can investigate the assembly. The
first step is to simply look at the contigs.fasta file.
Now take the first 40 lines of the sequence and perform a blast search at NCBI (https://blast.ncbi.nlm.nih.gov/Blast.cgi, choose Nucleotide:Nucleotide from the set of options). Leave all other options as default on the search page. To select the first 40 lines of sequence perform the following:
head -41 contigs.fasta
Which species do you think this sequence may be coming from?
Does this make sense as a human oral bacteria? Are you surprised by this
result at all?
Now let us consider some statistics about the entire assembly
cd /opt/data/assembly
assembly_stats scaffolds.fasta
This will output two simple tables in JSON format, but it is
fairly simple to read. There is a section that corresponds to the
scaffolds in the assembly and a section that corresponds to the contigs.
What is the length of longest and shortest contigs?
What is the N50 of the assembly? Given that are input
sequences were ~150bp long paired-end sequences, what does this tell you
about the assembly?
N50 is a measure to describe the quality of assembled genomes
that are fragmented in contigs of different length. We can apply this
with some caution to metagenomes, where we can use it to crudely assess
the contig length that covers 50% of the total assembly. Essentially
the longer the better, but this only makes sense when thinking about
alike metagenomes. Note, N10 is the minimum contig length to cover 10
percent of the metagenome. N90 is the minimum contig length to cover 90
percent of the metagenome.
In addition to evaluating the contiguity the assemblies, we can
ask what fraction of the diversity in the samples was assembled. We can
answer this question by quantifying the number of reads that map to the
assembly. BWA expects that the read names in the forward and reverse reads
are the same so we will first remove the read identifiers and make sure that
they are ordered correctly.
sed 's/\/1//g' ../clean/oral_human_example_1_splitaa_kneaddata_paired_1.fastq > ../clean/oral_human_example_1_splitaa_kneaddata_paired_noidentifiers_1.fastq
sed 's/\/2//g' ../clean/oral_human_example_1_splitaa_kneaddata_paired_2.fastq > ../clean/oral_human_example_1_splitaa_kneaddata_paired_noidentifiers_2.fastq
repair.sh in=../clean/oral_human_example_1_splitaa_kneaddata_paired_noidentifiers_1.fastq in2=../clean/oral_human_example_1_splitaa_kneaddata_paired_noidentifiers_2.fastq out=../clean/oral_human_example_1_splitaa_kneaddata_paired_noidordered_1.fastq out2=../clean/oral_human_example_1_splitaa_kneaddata_paired_noidordered_2.fastq
To calculate the percent reads mapping to the assembly using the flagstat output generated in the previous step, calculate the number of primary alignments (mapped - secondary - supplementary). Then divide the number of primary alignments by the sum of forward and reverse reads to get the fraction of reads mapped.
bwa index scaffolds.fasta
bwa mem -t 2 scaffolds.fasta ../clean/oral_human_example_1_splitaa_kneaddata_paired_noidordered_1.fastq ../clean/oral_human_example_1_splitaa_kneaddata_paired_noidordered_2.fastq | samtools view -bS - | samtools sort -@ 2 -o oral_human_example_1_splitaa.sam -
samtools flagstat oral_human_example_1_splitaa.sam > oral_human_example_1_splitaa_flagstat.txt
To get the total number of reads in the forward read, run the command below and divide by 4. Repeat for the reverse read.
wc -l ../clean/oral_human_example_1_splitaa_kneaddata_paired_noidordered_1.fastq
wc -l ../clean/oral_human_example_1_splitaa_kneaddata_paired_noidordered_2.fastq
What percent of the reads were incorporated into the assembly?
What factors can affect the percent of reads mapping to the assembly?
Bandage (a Bioinformatics Application for Navigating De novo
Assembly Graphs Easily), is a program that creates interactive
visualisations of assembly graphs. They can be useful for finding
sections of the graph, such as rRNA, or to try to find parts of a
genome. Note, you can install Bandage on your local system - see the prerequisites section. With
Bandage, you can zoom and pan around the graph and search for sequences,
plus much more. The following guide allows you to look at the assembly
graph. Normally, I would recommend looking at the ‘
assembly_graph.fastg, but our assembly is quite fragmented, so we will
load up the assembly_graph_after_simplification.gfa.
In the the Bandage GUI perform the following
Select File->Load graph
Navigate to ~/BiATA/session1/data/session1/assembly and select on assembly_graph_after_simplification.gfa
Once loaded, you need to draw the graph. To do so, under the “Graph drawing” panel on the left side perform the following:
Set Scope to ‘Entire graph’
The click on Draw graph
Use the sliders in the main panel to move around and look at
each distinct part of the assembly graph.
Can you find any large, complex parts of the graph? If so,
what do they look like.
In this particular sample, we believe that strains related to
the species Rothia dentocariosa, a Gram-positive, round- to rod-shaped
bacteria that is part of the normal community of microbes residing in
the mouth and respiratory tract, should be present in our sample. While
this is a tiny dataset, lets try to see if there is evidence for this
genome. To do so, we will search the R. dentocariosa genome against
the assembly graph.
To do so, go to the “BLAST” panel on the left side of the GUI.
Step 1 - Select Create/view BLAST search, this will open a new window
Step 2 - select build Blast database
Step 3 - Load from FASTA file -> navigate to the genome folder /opt/data/genome and select GCA_000164695.fasta
Step 4 - modify the blast filters to 95% identity
Step 6 - run blast
Step 7 - close this window
To visualise just these hits, go back to “Graph drawing” panel.
Set Scope to ‘Around BLAST hits’
Set Distance 2
The click on Draw graph
You should then see something like this:
In the following steps of this exercise, we will look at
performing co-assembly of multiple datasets. Due to computational
limitations, we can only look a example datasets. However, the
principles are the same. We have also pre-calculated some assemblies for
you. In the co-assembly directory, there are already 2 assemblies. We
have a single paired-end assembly.
megahit -1 clean_other/oral_human_example_1_splitac_kneaddata_paired_1.fastq -2 clean_other/oral_human_example_1_splitac_kneaddata_paired_1.fastq -o coassembly/assembly1 -t 2 --k-list 23,51,77
Now run the assembly_stats on the contigs for this assembly.
cd /opt/data
assembly_stats coassembly/assembly1/final.contigs.fa
How do these differ to the ones you generated previously? What may account for these differences?
We have also generated the first coassembly using MegaHIT.
This was produced using the following command. To specify the files, we
put all of the forward file as a comma separated list, and all of the
reversed as a comma separated list, which should be ordered that same in
both, such that the mate pairs match up.
cd /opt/data
megahit -1 clean_other/oral_human_example_1_splitac_kneaddata_paired_1.fastq,clean_other/oral_human_example_1_splitab_kneaddata_paired_1.fastq -2 clean_other/oral_human_example_1_splitac_kneaddata_paired_1.fastq,clean_other/oral_human_example_1_splitab_kneaddata_paired_2.fastq -o coassembly/assembly2 -t 2 --k-list 23,51,77
Now perform another co-assembly:
megahit -1 clean_other/oral_human_example_1_splitab_kneaddata_paired_1.fastq,clean_other/oral_human_example_1_splitac_kneaddata_paired_1.fastq,clean/oral_human_example_1_splitaa_kneaddata_paired_1.fastq -2 clean_other/oral_human_example_1_splitab_kneaddata_paired_2.fastq,clean_other/oral_human_example_1_splitac_kneaddata_paired_2.fastq,clean/oral_human_example_1_splitaa_kneaddata_paired_2.fastq -o coassembly/assembly3 -t 2 --k-list 23,51,77
This takes about 20-30 minutes. Also, if you are using a
laptop, make sure that it does not go into standby mode.
You should now have three different assemblies, two provide
and one generated by yourselves. Now let us compare the assemblies.
cd /opt/data
assembly_stats coassembly/assembly1/final.contigs.fa
assembly_stats coassembly/assembly2/final.contigs.fa
assembly_stats coassembly/assembly3/final.contigs.fa
We only have contigs.fa from MegaHIT, so the contigs and
scaffold sections are the same.





