Masters Dissertation

02/10/2018

Since I enjoyed my Summer Bioinformatics experience, I hoped to work with Amanda Clare again for my Masters dissertation, looking at new sequences produced by the same research group. My project involved soil from Aberystwyth, which was sequenced to give an output of ~2 million sequences. These sequences are made up of the bases A, C, G, and T.

Week 1
Acidobacteria is a phylum of bacteria. It was only recognised in 2012, despite being one of the most abundant and diverse phyla in Earth's soils. It has been observed in mines, soils, and metal-contaminated soils, which is of particular interest here because there are metal mines near Aberystwyth that have contaminated the Rheidol streams, possibly resulting in metal-contaminated soils.

Week 2
Amanda and I discussed trying a variety of tools to examine the data: read counts, quality, and time-yield plots, plus BLAST to look at which species were present. I used Kaiju and found that Acidobacteria was present, with a major portion classified as "unclassified" (not yet placed in a class group/subdivision). Furthermore, the GC content of the genomes is consistent within their subdivisions (the class ranks in Acidobacteria): for example, the species in subdivision 3 all have roughly the same GC content. Subdivisions also depend on pH, e.g. at a pH of 4, subdivisions 1, 2, 3, and 13 are more likely to appear.

Week 3
After meeting 3, we wanted to find a way to extract the sequences that Kaiju classified as Acidobacteria. Although Kaiju provides an output file with the sequence IDs, we couldn't tell which were Acidobacteria because each seqID is paired only with a taxonID rather than a name. This is where the idea for acidoseq came from: take a Kaiju output file and a list of Acidobacteria taxonomy IDs, and find the links between them.
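To make the idea concrete, here is a minimal sketch of that linking step. It is not the acidoseq code itself: the function name and the taxon IDs in the example are placeholders, and it assumes Kaiju's standard tab-separated output, where the first column is C or U (classified or unclassified), the second is the read name, and the third is the taxon ID.

```python
import io


def reads_for_taxa(kaiju_lines, wanted_taxon_ids):
    """Return the read IDs whose Kaiju taxon ID is in wanted_taxon_ids.

    kaiju_lines is any iterable of lines (an open file works).
    """
    wanted = {str(taxon_id) for taxon_id in wanted_taxon_ids}
    matches = []
    for line in kaiju_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip malformed lines and reads that Kaiju left unclassified.
        if len(fields) < 3 or fields[0] != "C":
            continue
        read_id, taxon_id = fields[1], fields[2]
        if taxon_id in wanted:
            matches.append(read_id)
    return matches


# Tiny in-memory example; the IDs below are placeholders for illustration.
example = io.StringIO("C\tread_1\t57723\nU\tread_2\t0\nC\tread_3\t562\n")
print(reads_for_taxa(example, {57723}))
```

With a real run, the in-memory example would be swapped for an open Kaiju output file and the full list of Acidobacteria taxonomy IDs.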

Week 4
I downloaded the full and partial genomes and found that the GC content was somewhat consistent within the subdivisions. After the first week of my package working successfully, we decided to expand it by looking at the GC content of the sequences to see whether we could plot the pattern of the subdivisions.
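For anyone unfamiliar with GC content, it is simply the percentage of bases in a sequence that are G or C. A rough sketch of the calculation follows; the function name is my own for illustration, and ignoring ambiguous bases such as N is an assumption rather than a description of what acidoseq does.

```python
def gc_content(sequence):
    """Return the GC content of a DNA sequence as a percentage.

    Only A, C, G and T are counted, so ambiguous bases (e.g. N) are ignored.
    """
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / counted


print(gc_content("ACGT"))  # 2 of the 4 bases are G or C
```

The AT content is then just 100 minus this value.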

Week 5
The BLAST job on the 2 million reads took a month to process just 400,000 sequences. We considered filtering the dataset down to 200,000 reads so that a new job would take only 2 weeks, but decided to look for an alternative instead. We found Blast2Go, which runs locally and looks at the genes in further detail.

Week 6
We decided to use Blast2Go to create a database of Acidobacteria genomes and ran a local BLAST to find the sequences identified as Acidobacteria. For acidoseq, we plotted the AT content to compare with the GC content (a high AT content can indicate less stable DNA).

Week 7
I added a feature to acidoseq that outputs the subdivisions whose sequences have a particular GC content.

Week 8
We started to consider looking into assembly: building up the sequences into larger ones.

Week 9
The assembly job with Miniasm was unsuccessful. Because soil is so diverse, the assembly didn't build up larger sequences; the largest was 16,000 base pairs long.

Week 10
We started to look into command-line options for acidoseq and into packaging it for pip.

Week 11
We filtered the data to a quality score of 12 and a read length of 2500, which left 89 reads. We decided to use Blast2Go to do a final run and look into the genes. We found Acidobacteria; however, due to lack of time we couldn't explore the genes further via the Gene Ontology. We finally made acidoseq into a package and made it available. For the next two weeks, my time was mostly spent writing up my dissertation.
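As an aside, the quality and length filter from this week could be sketched roughly as below. This is not the tool we actually used: it assumes both numbers are minimum thresholds, plain four-line FASTQ records with Phred+33 encoding, and a simple arithmetic mean of the per-base scores. All of the names are made up for illustration.

```python
import io


def mean_phred(quality_line, offset=33):
    """Arithmetic mean of the Phred scores encoded in a FASTQ quality line."""
    scores = [ord(char) - offset for char in quality_line]
    return sum(scores) / len(scores) if scores else 0.0


def filter_fastq(handle, min_quality=12, min_length=2500):
    """Return (read_id, sequence) pairs that pass both thresholds.

    Assumes each record is exactly four lines: header, sequence, '+', quality.
    """
    kept = []
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        if len(sequence) >= min_length and mean_phred(quality) >= min_quality:
            kept.append((header[1:], sequence))
    return kept


# Tiny in-memory example with a low length threshold so it fits on screen.
records = "@r1\nACGTAC\n+\nIIIIII\n@r2\nACG\n+\nIII\n@r3\nACGTAC\n+\n######\n"
print(filter_fastq(io.StringIO(records), min_quality=12, min_length=4))
```

Dedicated read-filtering tools may average quality differently (for example via error probabilities), so the exact number of surviving reads can vary with the method.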

Week 12
During my final meeting, Amanda and I discussed corrections and she provided great feedback. Three days later, I submitted!

And so it is done! My Masters is completed and it feels great! After submission, I only had 4 days until the start of my PhD. I had such fun with this project that I made a Twitter bot, acidobot, that dispenses facts about Acidobacteria once a day!

I would like to thank Amanda for a fun project and Arwyn Edwards and his team for the intellectual engagement.

Amanda Clare Arwyn Edwards

Links that might be of interest:
Masters dissertation: the final copy is available to read

previous Bioinformatics blog post: this is the Summer 2017 experience

acidoseq: the GitHub repository

summary of acidoseq: a shorter version of the dissertation

acidobot: the discontinued Twitter bot

Tools mentioned in order:
BLAST, Kaiju, Blast2Go, Miniasm, Gene Ontology