02/10/2018
Since I enjoyed my Summer Bioinformatics experience, I hoped to work with Amanda Clare again for my Masters dissertation, looking at new sequences from the same research group. My project involved soil from Aberystwyth, which was sequenced to give an output of ~2 million sequences. These sequences are made up of the bases ACGT.
Week 1
Acidobacteria is a phylum in the bacteria kingdom. It was only recognised in 2012, despite being among the most abundant and diverse phyla in soils on Earth. It has been observed in mines, soils, and metal-contaminated soils, which is of particular interest here, as there are metal mines near Aberystwyth that have contaminated the Rheidol streams, possibly resulting in metal-contaminated soils.
Week 2
Amanda and I discussed looking into a variety of tools to study the data and observe read count, quality, and time-yield plots, plus running BLAST to look at species. I used Kaiju and found that Acidobacteria was present, with a major portion classified as "unclassified" (not yet placed in a class group/subdivision). Furthermore, the GC content of the genomes is consistent within their subdivisions (class ranks in Acidobacteria); for example, the species in subdivision 3 will all have around the same GC content. Subdivisions are also dependent on pH, e.g. at a pH of 4, subdivisions 1, 2, 3, and 13 are more likely to appear.
Week 3
After meeting 3, we wanted to find a way to extract the sequences which Kaiju classified as Acidobacteria. Although Kaiju provides an output file with the sequence IDs, we couldn't tell which ones were Acidobacteria, because each seqID is paired only with a taxonID. This is where the idea of acidoseq came from: take a Kaiju output file and a list of Acidobacteria taxonomy IDs, and find the links between them.
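To give a rough idea of that linking step, here is a minimal sketch rather than the actual acidoseq code. It assumes Kaiju's default tab-separated output, where the first column is C or U (classified or unclassified), the second is the read ID and the third is the NCBI taxon ID. The function name, read names and taxon IDs below are made up for illustration.

```python
import csv
import io


def acidobacteria_read_ids(kaiju_lines, acido_taxon_ids):
    """Return the read IDs whose Kaiju taxon ID is in acido_taxon_ids."""
    matches = []
    for row in csv.reader(kaiju_lines, delimiter="\t"):
        # Assumed columns: status (C/U), read ID, NCBI taxon ID.
        if len(row) < 3 or row[0] != "C":
            continue  # skip unclassified or malformed lines
        if row[2] in acido_taxon_ids:
            matches.append(row[1])
    return matches


# Placeholder data: these are not real reads or real taxon IDs.
example_output = io.StringIO(
    "C\tread_1\t111\n"
    "U\tread_2\t0\n"
    "C\tread_3\t222\n"
    "C\tread_4\t111\n"
)
placeholder_acido_ids = {"111"}

print(acidobacteria_read_ids(example_output, placeholder_acido_ids))
```

With a real Kaiju output you would pass an open file handle instead of the `io.StringIO` example, and the set of taxon IDs would come from a list of Acidobacteria taxonomy IDs.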
Week 4
I downloaded the full and partial genomes and found that the GC content was somewhat consistent within the subdivisions. After the first week of my package being successful, we decided to expand it further by looking at the GC content of the sequences to see if we could plot the pattern of the subdivisions.
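For anyone unfamiliar with GC content, it is simply the percentage of bases in a sequence that are G or C. A minimal sketch of the calculation is below; the function name is my own for this post and is not necessarily how acidoseq does it, and ignoring any characters other than A, C, G and T is an assumption.

```python
def gc_content(sequence):
    """Return the percentage of G and C among the A/C/G/T bases."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at  # anything that is not A, C, G or T is ignored
    if total == 0:
        return 0.0
    return 100.0 * gc / total


print(gc_content("ACGT"))  # 2 of the 4 bases are G or C
```

Computed over the same bases, the AT content is just 100 minus the GC content.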
Week 5
We found that the BLAST job on the 2 million reads had taken a month to process 400,000 sequences. We thought about filtering the dataset down to 200,000 reads so that a new job would only take 2 weeks, but decided to look for an alternative instead. We found Blast2Go, which runs locally and looks at the genes in further detail.
Week 6
We decided to use Blast2Go to create a database of Acidobacteria genomes and ran a local BLAST to find the sequences which were identified as Acidobacteria. Regarding acidoseq, we were plotting the AT content to compare with the GC content (a high AT content can indicate less stable DNA).
Week 7
I added a feature to acidoseq that outputs the subdivisions whose sequences have a particular GC content.
Week 8
We started to consider looking into assembly: building up the sequences into larger ones.
Week 9
The assembly job with Miniasm was unsuccessful. Because soil is so diverse, the assembly didn't build up much larger sequences: the largest was 16,000 base pairs long.
Week 10
We started to look into command line options for acidoseq and into packaging it for pip.
Week 11
We filtered the data to a quality score of 12 and a read length of 2500, which left 89 reads. We decided to use Blast2Go to do a final run and look into the genes. We found Acidobacteria; however, due to lack of time we couldn't explore the genes further via the Gene Ontology. We finally made acidoseq into a package and made it available. For the next two weeks, my time was mostly focused on writing up my dissertation.
Week 12
During my final meeting, Amanda and I discussed corrections and she provided great feedback. Three days later, I submitted!
And so it is done! My Masters is completed and it feels great! After submission, I only had 4 days until the start of my PhD. I had such fun with this project that I made a Twitter bot, acidobot, that dispenses facts about Acidobacteria once a day!
Links that might be of interest:
Masters dissertation: the final copy is available to read
previous Bioinformatics blog post: this is the Summer 2017 experience
acidoseq: the GitHub repository
summary of acidoseq: a shorter version of the dissertation
acidobot: the discontinued Twitter bot
Tools mentioned in order:
BLAST
Kaiju
Blast2Go
Miniasm
Gene Ontology