
My first Bioinformatics experience

06/10/2017

I saw a position for a summer researcher in Bioinformatics and was interested: I thought it would be good experience and a chance to gain practical knowledge. I knew the basics of Bioinformatics, which is studying and analysing DNA sequences. I'm no expert in coding or in Bioinformatics specifically, but I'm more than willing to learn and teach myself.

Bioinformatics covers a wide range of topics: studying DNA, or other biological components of a person, e.g. blood biomarkers. This specific project looked at DNA sequences.

"An analysis of current state of the art software on nanopore metagenomic data"

In preparation, a friend helped me set up a Linux Mint partition on my laptop. This made the research easier (installing packages and building software from source code). I chose Mint as I've had experience with it.

Week 1
For my first day, my supervisor, Amanda Clare, suggested I read some papers about the kind of research conducted by the team who collected these sequences. The research took place in a mine in South Wales, UK. DNA was extracted to study the Earth's subsurface, where there is no internet, to find bacteria. Two datasets exist: the research team said the second set is better quality as improved protocols were used for DNA extraction.
The DNA was sequenced using new Nanopore technology: DNA goes in, an electric current runs through, proteins send signals creating a squiggle. The output sequences are made up of the codes A, C, G and T. GC-rich parts of a sequence are said to be more stable, while AT-rich parts are lower quality, especially if repetitive (possible error).
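
To make the GC/AT idea concrete, here is a minimal sketch of how the base composition of a single read can be measured. It is not how Goldilocks or Poretools do it; the function names and the example read are made up for illustration.

```python
from collections import Counter

def base_fractions(seq):
    """Return the fraction of A, C, G and T in a read (case-insensitive).

    Any other characters (e.g. N) are ignored when computing the total.
    """
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")
    if total == 0:
        return {b: 0.0 for b in "ACGT"}
    return {b: counts[b] / total for b in "ACGT"}

def gc_content(seq):
    """Return the combined fraction of G and C in a read."""
    fractions = base_fractions(seq)
    return fractions["G"] + fractions["C"]

# Hypothetical read, not taken from the mine datasets.
example_read = "ATTTTGCTTTAT"
print(base_fractions(example_read))
print(gc_content(example_read))
```

A read with a low GC value and a long repetitive run of A or T would be one to look at more closely.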

The first tool I used was Goldilocks. I spent some time figuring out how to run it, since this was the first Python package I had to install. The Goldilocks plots were interesting: we were able to see some low AT codes.
I also looked into Poretools and SAMtools for file conversion. Poretools also creates histograms, which I used to recreate plots from the original paper, a personal success.

My project supervisor and I went to watch the nanopore MinION in action. That first week was emotionally draining: all these new terms I didn't understand, packages to install, and imposter syndrome. But I finally produced some plots and asked for help to understand them.

Week 2
I found out that the longest reads (via Poretools) were potential errors and that the quality of the data is low overall. The research team left the location after 50 minutes but kept the run going: the longest reads occurred after 50 minutes. However, the T heavy reads are not the same as the longest reads, so I'll be testing whether the T heavy reads also came after 50 mins.
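
The comparison between T heavy reads and longest reads can be sketched roughly as below. This assumes the reads have been exported as FASTA text; the 0.5 threshold for "T heavy" and the function names are my own placeholders, not values from the project.

```python
def parse_fasta(text):
    """Yield (read_id, sequence) pairs from FASTA-formatted text."""
    read_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if read_id is not None:
                yield read_id, "".join(chunks)
            header = line[1:].split()
            read_id, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if read_id is not None:
        yield read_id, "".join(chunks)

def t_heavy_ids(reads, threshold=0.5):
    """IDs of reads whose fraction of T is at least `threshold` (assumed cut-off)."""
    return {
        rid for rid, seq in reads
        if seq and seq.upper().count("T") / len(seq) >= threshold
    }

def longest_ids(reads, n=10):
    """IDs of the `n` longest reads."""
    ranked = sorted(reads, key=lambda pair: len(pair[1]), reverse=True)
    return {rid for rid, _ in ranked[:n]}

# Made-up reads for illustration only.
fasta = ">r1\nTTTTTTAT\n>r2\nACGTACGTACGTACGT\n>r3\nACG\n"
reads = list(parse_fasta(fasta))
overlap = t_heavy_ids(reads) & longest_ids(reads, n=1)
print(overlap)
```

An empty overlap means the T heavy reads and the longest reads are different reads. Checking when each read was sequenced would need the timing metadata, which this sketch does not cover.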

Week 3
Interestingly, one T heavy read is within the first 50 mins; when I ran it through BLAST, the query cover was low, though the results included fungus and bacteria, which is what the research team was looking for. On a side note: I used the Linux command uniq to check that there are no duplicate reads. I once again attempted to use Poretools to produce squiggles, and this yet again gave no results. However, when I tried it on the data set without the MUX reads (QC reads: not quality filtered), the results were ever so slightly different (still no diagrams though). I noticed there was an issue opened for Poretools; apparently the fix is to avoid using the pip installer.
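
On the duplicate check: uniq only compares neighbouring lines, so the input normally has to be sorted first for it to catch every duplicate. An alternative that does not depend on ordering is to count the sequences directly. This is a small sketch with a made-up function name, assuming the reads are already loaded as a list of strings.

```python
from collections import Counter

def duplicate_sequences(sequences):
    """Return {sequence: count} for every sequence seen more than once.

    Comparison is case-insensitive; order of the input does not matter.
    """
    counts = Counter(seq.upper() for seq in sequences)
    return {seq: n for seq, n in counts.items() if n > 1}

# Made-up reads for illustration only.
print(duplicate_sequences(["ACGT", "TTTT", "acgt"]))
```

An empty dictionary means there are no exact duplicate reads.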

Week 4
At the beginning of the week, I met with Amanda and she saw the report I had started: we agreed to look more into a comparison of the datasets, BP_v1 & BP_v2 (v2 is the better quality data). In a discussion with one of the researchers, Andre Soares, I learned that the datasets were collected at different times of the year (BP_v1: Dec 2016, BP_v2: Apr 2017) and that the reason for the low quality data is that it was field work.
I spent most of the week working on the report, improving it and including the analysis of the version 1 data. I need to BLAST both datasets fully, but to do so I need to be added to the AU network cluster since these jobs would be too heavy for my personal laptop.

I BLASTed some random reads from both datasets: specifically the long reads which weren't T heavy, plus some that were shorter. Data set 1 had nothing useful at all to report, however data set 2 has bacteria: specifically bacteria that thrive at 40+ degrees, which is unusual as the mine was 15-20 degrees.
My supervisor and I discussed presenting my work as a poster to a Bioinformatics lab, so I have turned the report into a poster.

Week 5
I was finally connected to the cluster and ran BLAST on both datasets. The first set (BP_v1) had barely any results: scores (query cover/read lengths) were all low (less than 50), with random species that included peppers and piranha. However, the BP_v2 results were much better, in the ~800 range, and from them we could see that there were different types of bacteria within the subsurface.
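
Sorting through a large set of BLAST results by hand is slow, so here is a rough sketch of how hits could be filtered by score. It assumes the results were saved in BLAST+ tabular format (-outfmt 6) with the default 12 columns, which may not match how the project's results were stored, and it filters on the bit score column, which may not be the score discussed above. The cut-off of 50 and the function name are placeholders.

```python
# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

def best_hits(tabular_text, min_bitscore=50.0):
    """Return the highest-bitscore hit for each query read.

    Hits below `min_bitscore` (an assumed cut-off) are dropped, so a read
    with only weak hits does not appear in the result at all.
    """
    best = {}
    for line in tabular_text.splitlines():
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < len(BLAST6_COLUMNS):
            continue  # skip comment lines and malformed rows
        hit = dict(zip(BLAST6_COLUMNS, fields))
        score = float(hit["bitscore"])
        if score < min_bitscore:
            continue
        current = best.get(hit["qseqid"])
        if current is None or score > float(current["bitscore"]):
            best[hit["qseqid"]] = hit
    return best

# Made-up rows for illustration only; not real results from BP_v1 or BP_v2.
rows = (
    "read1\tsubjA\t91.2\t500\t40\t4\t1\t500\t10\t509\t1e-50\t800\n"
    "read1\tsubjB\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-5\t60\n"
    "read2\tsubjC\t75.0\t50\t12\t1\t1\t50\t1\t50\t0.5\t30\n"
)
for read_id, hit in best_hits(rows).items():
    print(read_id, hit["sseqid"], hit["pident"], hit["bitscore"])
```

The same idea works for filtering on percent identity instead, by comparing the pident column against a threshold.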

I went into the lab with the creator of Goldilocks, Sam Nicholls, who was doing some lab work for their PhD. I saw him run DNA through a gel and extraction, then take some pictures under UV light, which I found out degrades DNA quality so it needs to be quick. I also got the opportunity to watch the centrifuge (machine: vortex) run.

Week 6
This is the final wrap-up of this summer research project: I am to BLAST the datasets against each other, and look into the tool Pavian to create hierarchy tree diagrams from the results. I BLASTed BP_v1 against BP_v2 and vice versa: no similarities. However, BP_v1 against BP_v1 and BP_v2 against BP_v2 had reads that were very similar (90%+).
I had minor issues with Pavian as it only works with particular tool outputs, so I ran out of time and was not able to create any diagrams.

To wrap up the project, I presented my work at a BCS Mid Wales "Show & Tell" event. I only had 24 hours' notice but I enjoyed presenting! I also presented my poster to the Bioinformatics lab team and we discussed the results.

I would like to thank:
Amanda Clare, Andre Soares, Sam Nicholls

Some videos / links:

nanopore MinION watching it live

PCR prep a time-lapse

me and the poster at BCS Show & Tell

report on bioRxiv

poster final design

Tools mentioned in order:
Goldilocks, Poretools, SAMtools, BLAST, Pavian