06/10/2017
I saw the position for a summer researcher in Bioinformatics and was interested: I thought it would be good experience and a chance to gain practical knowledge. I know a little about the basics of Bioinformatics: studying and analysing DNA sequences. I'm no expert in coding or in Bioinformatics specifically, but I'm more than willing to learn and teach myself.
Bioinformatics covers a wide range of topics: studying DNA, or other biological components of a person, e.g. blood biomarkers. This specific project looked into sequences.
"An analysis of current state of the art software on nanopore metagenomic data"
In preparation, a friend helped me install a Linux Mint partition on my laptop - this benefits my research (installing packages and source code of software) - I chose Mint as I've had experience with it.
Week 1
For my first day, my supervisor, Amanda Clare, suggested I read some papers about the research conducted by the team who collected these sequences. The research took place in a mine in South Wales, UK. DNA was extracted to study the Earth's subsurface, where there is no internet, to find bacteria. Two datasets exist: the research team said the second set of results is better quality, as improved protocols were used for DNA extraction.
The DNA was sequenced using new Nanopore technology: DNA goes in, an electric current runs through, proteins send signals creating a squiggle. The output sequences are ACGT codes. GC content in sequences is said to be more stable, and AT content is lower quality, especially if repetitive (possible error).
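To make the GC/AT idea concrete, here is a minimal Python sketch of how the base composition of a single read could be measured. It is only an illustration and not code from the project: the function names and the example read are made up.

```python
from collections import Counter


def base_composition(read):
    """Return the fraction of A, C, G and T in a read (case-insensitive)."""
    counts = Counter(read.upper())
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        # Empty read, or a read with no A/C/G/T characters at all.
        return {base: 0.0 for base in "ACGT"}
    return {base: counts[base] / total for base in "ACGT"}


def gc_content(read):
    """Fraction of the read made up of G and C."""
    composition = base_composition(read)
    return composition["G"] + composition["C"]


# Hypothetical read, invented for this example.
example_read = "ATTTTGCATTTA"
print(base_composition(example_read))
print(gc_content(example_read))
```

A read with a very low `gc_content` value is AT rich, which is the kind of read flagged above as possibly lower quality.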
The first tool to use was Goldilocks. I spent some time figuring out how to run it, since this was the first Python package I had to install. Goldilocks plots were interesting: we were able to see some low AT codes.
I also looked into Poretools and SAMtools for file converting. Poretools also created histograms, which I was able to use to recreate plots from the original paper, a personal success.
My project supervisor and I went to watch the nanopore MinION in action. That first week was emotionally draining: all these new terms I didn't understand, packages to install, and imposter syndrome. But I finally produced some plots and asked for help to understand them.
Week 2
I have found out that the longest reads (via Poretools) were potential errors and that the quality of the data is low overall. The research team left the location after 50 minutes but the run continued: the longest reads occurred after 50 minutes. However, the T-heavy reads are not the same as the longest reads, so I'll be testing whether the T-heavy reads also occurred after 50 minutes.
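As a rough illustration of comparing "T-heavy" reads with the longest reads, a sketch in Python could look like the following. The 0.5 threshold, the function names and the example reads are all assumptions made for this example; the project itself used Poretools and Goldilocks rather than this code.

```python
def t_fraction(read):
    """Fraction of the read that is the base T (case-insensitive)."""
    read = read.upper()
    return read.count("T") / len(read) if read else 0.0


def t_heavy_reads(reads, threshold=0.5):
    """Names of reads whose T fraction is at least `threshold`.

    `reads` maps a read name to its sequence. The threshold is an
    arbitrary choice for this sketch, not a value from the project.
    """
    return [name for name, seq in reads.items() if t_fraction(seq) >= threshold]


def longest_reads(reads, n):
    """Names of the `n` longest reads, longest first."""
    return sorted(reads, key=lambda name: len(reads[name]), reverse=True)[:n]


# Hypothetical reads, invented for this example.
reads = {
    "read_a": "TTTTTTTTAT",
    "read_b": "ACGTACGTACGT",
}

heavy = set(t_heavy_reads(reads))
longest = set(longest_reads(reads, 1))
print("T-heavy:", heavy)
print("Longest:", longest)
print("In both:", heavy & longest)
```

If the set of names in both groups is empty, the T-heavy reads and the longest reads are different reads, which matches what I saw in the data.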
Week 3
Interestingly, one T-heavy read is within the 50 mins; when using BLAST, the query cover is low, though results included fungus and bacteria, which was what the research team was looking for. On a side note: I used the Linux command uniq to check that there are no duplicate reads. I once again attempted to use Poretools to produce squiggles, and this yet again gave no results. However, I tried this on the dataset without the MUX reads (QC reads: not quality filtered) and the results were ever so slightly different (still no diagrams though). I noticed there was an issue opened for Poretools; apparently the fix is to avoid using the pip installer.
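One thing worth knowing about uniq is that it only compares adjacent lines, so the input normally needs sorting first. A Python sketch of the same duplicate check on FASTA-formatted reads is below; it assumes the reads are in FASTA format, and the parser, function names and example data are written for this illustration only.

```python
from collections import Counter


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def duplicate_sequences(text):
    """Map each sequence that appears more than once to its count."""
    counts = Counter(seq.upper() for _, seq in parse_fasta(text))
    return {seq: n for seq, n in counts.items() if n > 1}


# Hypothetical FASTA content, invented for this example.
example = """>read_1
ACGT
>read_2
TTTT
>read_3
ACGT
"""
print(duplicate_sequences(example))
```

An empty dictionary means no sequence occurs twice, which is the result I got from uniq.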
Week 4
At the beginning of the week, I met with Amanda and she saw the report I started: we agreed to look more into a comparison of the datasets: BP_v1 & BP_v2 (v2 is the better quality data). After discussing with one of the researchers, Andre Soares, I learned that the datasets were collected at different times of the year (BP_v1: Dec 2016, BP_v2: Apr 2017) and that the data is low quality because it came from field work.
I spent most of the week working on the report, improving it and including the analysis of the version 1 data. I need to BLAST both datasets fully, but to do so I need to be added to the AU network cluster, since these jobs would be too heavy for my personal laptop.
I BLASTed some random reads from both datasets: specifically the long reads which weren't T-heavy, plus some that were shorter - dataset 1 had nothing useful at all to report, however dataset 2 has bacteria: specifically bacteria that thrive in 40+ degrees - which is unusual as the mine was 15-20 degrees.
My supervisor and I discussed presenting my work through a poster to a Bioinformatics lab - I have translated the report into a poster.
Week 5
I was finally connected to the cluster and ran a BLAST on both datasets. The first set (BP_v1) had barely any results: scores (query cover/read lengths) were all low (less than 50) with random species that included peppers and piranha. However, the BP_v2 results were much better: results in ~800, from which we could observe that there were different types of bacteria within the subsurface.
I went into the lab with the creator of Goldilocks, Sam Nicholls, since they were doing some lab work for their PhD. I saw him run DNA through gel and extraction, then take some pictures under UV light, which I found out degrades DNA quality, so it needs to be quick - I also got the opportunity to watch the centrifuge (machine: vortex) run.
Week 6
This is the final wrapping up of this summer research project: I am to BLAST the datasets against each other, and look into the tool Pavian to create hierarchy tree diagrams from the results. I blasted BP_v1 against BP_v2 and vice versa: no similarities. However, BP_v1 against BP_v1 and BP_v2 against BP_v2 had reads that were very similar (90%+).
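For anyone wanting to pull out the "90%+" pairs themselves, here is a hedged Python sketch. It assumes the BLAST results were saved in the standard tabular format (`-outfmt 6`), whose first three columns are query ID, subject ID and percent identity; whether the project used that format is an assumption, and the function name and example rows are invented. It also skips self-hits, since a dataset BLASTed against itself will match every read to itself.

```python
def similar_pairs(tabular_text, min_identity=90.0):
    """Return (query, subject, identity) for hits at or above `min_identity`.

    Expects BLAST tabular output (outfmt 6): tab-separated columns starting
    with query ID, subject ID and percent identity. Hits of a read against
    itself are skipped.
    """
    pairs = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, identity = fields[0], fields[1], float(fields[2])
        if query != subject and identity >= min_identity:
            pairs.append((query, subject, identity))
    return pairs


# Hypothetical tabular rows, invented for this example.
example = "\n".join([
    "read_1\tread_1\t100.00\t500\t0\t0\t1\t500\t1\t500\t0.0\t924",
    "read_1\tread_7\t93.50\t400\t20\t6\t1\t400\t10\t409\t1e-150\t550",
    "read_2\tread_9\t78.20\t120\t22\t4\t5\t124\t30\t149\t2e-10\t60.2",
])
print(similar_pairs(example))
```

Percent identity alone can be misleading for short alignments, so the alignment length column (the fourth one) is worth checking alongside it.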
I had minor issues with Pavian as it only works with particular tool outputs - so I ran out of time and was not able to create any diagrams.
To complete the project, I presented my work at a BCS Mid Wales "Show & Tell" event. I only had 24 hours' notice but I enjoyed presenting! I also presented my poster to the Bioinformatics lab team and we discussed the results.
I would like to thank:
Amanda Clare
Andre Soares
Sam Nicholls
Some videos / links:
nanopore MinION watching it live
PCR prep a time-lapse
me and the poster at BCS Show & Tell
report on bioRxiv
poster final design
Tools mentioned in order:
Goldilocks
Poretools
SAMtools
BLAST
Pavian