Genome annotation and AMR detection#

You now have an assembled E. coli genome and can use bioinformatics analyses to catalogue the genes present in the genome and to predict antimicrobial resistance (AMR) genes. Understanding AMR is a key challenge in public health, and genome sequencing may be a useful tool for addressing it.

At this stage in your research project you are going to approach this work semi-independently. This workbook will guide you, but you will need to read the associated papers and each program’s help documentation.

1. Genome annotation#

At the moment you have a fasta file that was output by the genome assembly program Flye. It contains the E. coli genome and some other sequences. Make sure that you know which file this is, and that you have recorded which sequences it contains, their lengths, and any other information you have gathered.
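One way to get the sequence names and lengths is a short script. The following is a minimal sketch using only the Python standard library, not a required method. The function names `read_fasta` and `summarise` and the example records are invented for illustration.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarise(handle):
    """Return a list of (sequence id, length, GC fraction) tuples."""
    rows = []
    for header, seq in read_fasta(handle):
        # Use the first word of the header as the sequence id
        seq_id = header.split()[0] if header else "unnamed"
        gc = sum(seq.upper().count(base) for base in "GC")
        rows.append((seq_id, len(seq), gc / len(seq) if seq else 0.0))
    return rows


# Made-up records, only to show the shape of the output
example = io.StringIO(">contig_1 demo\nACGTACGT\nGGCC\n>contig_2\nATAT\n")
for seq_id, length, gc in summarise(example):
    print(f"{seq_id}\t{length}\t{gc:.2f}")
```

To run it on your own assembly, open the file and pass the handle to `summarise` in place of the made-up example.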

Genome annotation moves from a bare fasta sequence to a rich set of information about the sequence.

Question: what information do you want to know about the sequence?

What information can be added to the bare ACGT to make it more biologically relevant, and to reveal useful information? This is a general question, not one requiring specialist genomics knowledge. Discuss among yourselves and make notes.

REFERENCES

Stein L. Genome annotation: from sequence to biology. Nat Rev Genet. 2001;2: 493–503. doi:10.1038/35080529
Yandell M, Ence D. A beginner’s guide to eukaryotic genome annotation. Nat Rev Genet. 2012;13: 329–342. doi:10.1038/nrg3174

The approaches to annotating eukaryotic genomes and prokaryotic genomes are quite different. Why might this be?

You will see references to many bacterial genome annotation software packages. Two examples, Bakta and DFAST, are introduced in sections 1.1 and 1.2.

Discussion: how to annotate a genome sequence#

(optional) Prepare the genome file#

You have 5 sequences in the genome assembly of strain B and 6 sequences in the genome assembly of strain C. Do you want to analyse these together or as separate sequences?

There is not a ‘right’ answer here. Some programs will be happy with everything together, others less so. Some will be happy, but the annotation results will refer to everything together and might be less easy to interpret. Others might make much more sense when they are describing all the sequences rather than one at a time. My point here is to make you think about what you are submitting and to critically assess the outputs in that light. If you don’t like the result, you can do it the other way.

Many of you have already had a go at splitting a fasta file into individual sequences. There are many ways to do this, and a web search will turn up several. My personal favourite is the seqkit split2 command; read the documentation to find the right syntax.
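If you would rather see what a split involves, here is a minimal sketch in plain Python. It is an alternative to seqkit, not a description of how seqkit works. The function name `split_fasta` and the example records are invented, and it assumes each sequence id (the first word of the header) is unique, since a repeated id would overwrite an earlier file.

```python
import io
import tempfile
from pathlib import Path


def split_fasta(handle, out_dir):
    """Write each FASTA record in `handle` to its own file in `out_dir`.

    Returns the list of paths written, in input order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None
    try:
        for line in handle:
            if line.startswith(">"):
                if out is not None:
                    out.close()
                words = line[1:].split()
                seq_id = words[0] if words else f"record_{len(written) + 1}"
                # Keep file names safe: replace unusual characters with "_"
                safe = "".join(c if c.isalnum() or c in "._-" else "_" for c in seq_id)
                path = out_dir / f"{safe}.fasta"
                out = path.open("w")
                written.append(path)
            if out is not None:
                out.write(line)  # header and sequence lines are copied unchanged
    finally:
        if out is not None:
            out.close()
    return written


# Made-up two-record input, written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo = io.StringIO(">contig_1 length=8\nACGTACGT\n>contig_2\nGGCC\n")
    for path in split_fasta(demo, tmp):
        print(path.name)
```

Whichever tool you use, check that the number of output files matches the number of sequences you expected.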

1.1 BAKTA#

Bakta is a modern and powerful approach to bacterial genome annotation. Explore the Bakta annotation server, read the paper and the GitHub site, and make sure you understand what it is doing and why.

You can annotate your genome at the Bakta server. This might take an hour or more, so you could begin competency 1 (below) while you wait.

1.2 DFAST#

Again, please learn about DFAST as an approach to genome annotation. The DFAST Annotation Server is available for your job submission.

1.3 Genome annotation outputs#

Each server will provide you with output files. Take the time to retrieve information about these output file types and what information they contain.
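One quick way to get a feel for an annotation output is to count the feature types it contains. The sketch below assumes you have a GFF3 file among your outputs; GFF3 is a tab-separated text format with nine columns, where the third column holds the feature type. The function name `count_feature_types` and the example lines are invented for illustration and are not real annotation results.

```python
import io
from collections import Counter


def count_feature_types(handle):
    """Count the feature types (column 3) in a GFF3 file handle."""
    counts = Counter()
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature line
        counts[fields[2]] += 1
    return counts


# Made-up feature lines, only to show the shape of the input
example = io.StringIO(
    "##gff-version 3\n"
    "contig_1\tdemo\tCDS\t10\t400\t.\t+\t0\tID=gene_1\n"
    "contig_1\tdemo\tCDS\t500\t900\t.\t-\t0\tID=gene_2\n"
    "contig_1\tdemo\ttRNA\t1000\t1075\t.\t+\t.\tID=trna_1\n"
)
for feature_type, n in sorted(count_feature_types(example).items()):
    print(f"{feature_type}\t{n}")
```

Comparing these counts between two annotation tools run on the same assembly is a simple first check of how their outputs differ.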

At the very least you should back up the output files for later use.

One interesting thing you could do next is to load these files into a genome browser. A good one is IGV (the Integrative Genomics Viewer).

Competency 1 week D#

Pick one genome annotation software package. It doesn’t have to be the one you have used, though choosing the one you have used has advantages for writing your manuscript and understanding your data.

Describe briefly how a named software attempts to annotate components of the bacterial genome.

“Briefly” means less than a page. Most people should aim for about half a page; more is not better, and this is not an essay but a brief explanation of an approach. If you absolutely need a diagram, that is OK too.

Do not talk about eukaryotic genome analysis. Do not talk about the software being fast, low memory, or easy to install; this is not a review of the software. Instead, describe the general approach by which it annotates (writes a biological description of) a section of the genome assembly fasta.

BREAK HERE. Please wait before progressing

2. AMR analysis#

We are going to look for antimicrobial resistance genes using two software workflows:

abritAMR
CARD

2.1 abritAMR (optional, we probably won’t use this today)#

abritAMR is a bioinformatics workflow that uses several other tools to create an analysis and report on AMR genes. It is designed for use in healthcare situations.

Sherry et al. An ISO-certified genomics workflow for identification and surveillance of antimicrobial resistance. Nat Commun. 2023;14: 60. doi:10.1038/s41467-022-35713-4

The code and help documentation are available:

https://github.com/MDU-PHL/abritamr

2.2 CARD-RGI#

The Comprehensive Antibiotic Resistance Database (CARD) is a bioinformatics database focussed on resistance genes.

We are going to try to analyse our data using the Resistance Gene Identifier web server.

https://card.mcmaster.ca/analyze/rgi

There is a 20 Mb file size limit. How big are your assembly files? Are you going to analyse all 5-6 sequences together or separately?
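File sizes can be checked in a file manager, or with a few lines of Python as in this minimal sketch. It assumes “20 Mb” means 20 million bytes, which is the cautious reading of the limit. The names `LIMIT_BYTES` and `size_report` and the demo file are invented for illustration. As a rough guide, an uncompressed fasta file takes about one byte per base, plus headers and line breaks.

```python
import os
import tempfile

# Assumption: the server limit of "20 Mb" is read as 20 million bytes
LIMIT_BYTES = 20 * 1000 * 1000


def size_report(path, limit_bytes=LIMIT_BYTES):
    """Return (size in bytes, True if the file is within the limit)."""
    size = os.path.getsize(path)
    return size, size <= limit_bytes


# Demonstration on a small made-up file in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.fasta")
    with open(demo, "w", newline="\n") as fh:
        fh.write(">demo\n" + "ACGT" * 250 + "\n")
    size, ok = size_report(demo)
    print(f"{size} bytes, within limit: {ok}")
```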

Do you understand the analyses carried out by RGI? Can you describe the approach that it takes? If not then searching and reading will be required.

What output files have been produced and what do they contain?

Competency 2 Week D#

Produce a simple table of AMR genes present in your strains

Although something similar may have been produced by your analyses, you will need to simplify and explain it. Remove information not pertinent to your experiment and write a legend. Perhaps write 3-4 sentences on the methods used to produce the table. It is an exercise to demonstrate your generation of results, but also to familiarise you with different types of AMR.
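If your analysis produced a tab-delimited results file, the simplification can be scripted. The following is a minimal sketch, not a prescribed method. The column names in `KEEP` are assumptions about the headers in an RGI tabular output, so check them against the first line of your own file and edit the list to suit. The function name `simplify` and the example rows are invented placeholders, not real results.

```python
import csv
import io

# Assumed column names; check these against the header line of your own file
KEEP = ["Best_Hit_ARO", "Drug Class", "Resistance Mechanism", "Best_Identities"]


def simplify(handle, keep=KEEP):
    """Return the rows of a tab-delimited table, reduced to the `keep` columns
    and sorted by the first of them."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in keep if c not in (reader.fieldnames or [])]
    if missing:
        raise KeyError(f"columns not found in file: {missing}")
    rows = [{c: row[c] for c in keep} for row in reader]
    return sorted(rows, key=lambda r: r[keep[0]])


# Placeholder rows, only to show the shape of the input and output
example = io.StringIO(
    "Contig\tBest_Hit_ARO\tDrug Class\tResistance Mechanism\tBest_Identities\n"
    "contig_2\tgene_Y\tclass_2\tmechanism_2\t98.5\n"
    "contig_1\tgene_X\tclass_1\tmechanism_1\t100.0\n"
)
table = simplify(example)
print("\t".join(KEEP))
for row in table:
    print("\t".join(row[c] for c in KEEP))
```

A script only trims the table. The legend, the choice of which columns matter for your experiment, and the explanation of each resistance type still need to be written by you.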

Have you done a literature search yet to find out about the action of antibiotics, and AMR genes held by bacteria to defeat them? This understanding will be a key aspect of interpreting your data and writing it up.

BGS course book
