Methods

Overview
Source of organism data sets
Pre processing of sequencing traces
Aligning sequencing clones to a finished reference genome
Nucleotide clone coverage analysis
Gene clonability analysis
Unclonable intergenic region detection
Fosmid and plasmid independent coverage analysis
Creating PanDaTox blast databases
Collecting PanDaTox homologs for each gene
Blasting user sequences against PanDaTox
 


Overview

Our approach is based on a by-product of the whole-genome shotgun sequencing method (WGS),
by which most genomes have been sequenced to date.
WGS begins by randomly shearing multiple copies of the genome into overlapping fragments of DNA,
typically producing libraries of 3 kb, 8 kb and 40 kb, which are transformed into an E. coli cell
via vectors containing the cloned fragments.
Subsequently, the library inserts are sequenced, and overlapping sequences are then used for genome assembly.
(An example of the WGS approach at JGI (Doe Joint Genome Institute) can be found here).

However, such initial assemblies almost always contain gaps due to DNA fragments which are unclonable in bacteria.
It was discovered that these unclonable gaps are caused by genes that are toxic to E. coli.
The PanDaTox database uses this concept to detect toxic genes and intergenic regions in a diverse array of organisms.

By computationally mapping raw sequenced data of any organism’s genome to its “finished” reference genome,
we identify unclonable genes and intergenic regions, potentially toxic to E. coli.


Source of organism data sets

The toxicity pipeline requires the following information as input for each organism:

  • Raw sequencing reads
    The raw paired-end sequencing reads (mate-pairs) for each processed genome were downloaded from NCBI Trace Archive (http://www.ncbi.nlm.nih.gov/Traces/) in multi-fasta format, along with ancillary information describing trace information for each read.
    This information enabled determination of the read source, its mate pair, the sequencing method used to produce it, the read direction (whether it was sequenced from the 5` (forward) or 3` (reverse) end of the clone), and the cloning library size.

    The ancillary information was necessary to group together pairs of reads pertaining to the same clone, and to determine the correct clone location on the reference genome.
    In the absence of explicit mate pair information from the trace data, possible read pairs corresponding to opposite ends of the same insert were inferred computationally by clustering forward and reverse reads according to their template IDs and trace names. Insert sizes for reads lacking such information, were assumed to be the same as insert sizes for reads that belong to the same sequencing library, or sequencing plate.
  • The corresponding reference genome
    The corresponding finished genomes were downloaded from the NCBI RefSeq database (http://www.ncbi.nlm.nih.gov/RefSeq/), or NCBI GenBank database (http://www.ncbi.nlm.nih.gov/genbank/).
  • Additional genome annotations, if available
    Supplementary gene annotations, such as COG and TIGRFAM IDs, were downloaded from the IMG v3.2 database (http://img.jgi.doe.gov/cgi-bin/w/main.cgi), if existed.
  • The status of the genome sequencing project
    The status of the genome sequencing projects were retrieved from the Genomes Online Database (GOLD; http://www.genomesonline.org/).


Pre processing of sequencing traces

Each batch of sequencing traces was initially processed to discard traces that applied to non clone-based sequencing methods (such as 454, Illumina Solexa), traces that were considered obsolete (i.e have been replaced by an updated trace or have been withdrawn), and traces that lacked information regarding direction of read (forward or reverse corresponding to the 5' or 3' end of the clone, respectively).
The fasta sequences of the remaining clone based reads were trimmed according to clip information available in the Trace Archive, to remove vector and/or poor quality sequences.


Aligning sequencing clones to a finished reference genome

The trimmed read sequences were mapped to their corresponding reference genome using the NUCmer application from the MUMmer3.21 software package (http://mummer.sourceforge.net/), with the <-maxmatch> parameter, thus computing all maximal matches regardless of their uniqueness. The delta-filter application from the MUMmer3.21 software package was then used with the <–i> and <–l> parameters, to discard reads with less than 95% identity or less than 300 aligned bases to the reference genome, considering that reads produced by clone based sequencing methods, such as Sanger sequencing, vary between 600-1000 bp in length.

Based on read alignments, clone positions on the reference genome were inferred from the mapping of both mate reads (representing the 5` and 3` sequenced ends pertaining to the same clone). In ambiguous cases, where a read mapped to several positions on the reference genome, thus producing several possible clones with its mate read, the correct position was inferred by the NCBI ancillary trace information indicating the sequenced clone expected length.

 Clones were considered invalid and discarded from further consideration if both mate reads mapped to the same strand, to different replicons, or produced a clone size that did not coincide with the trace information indicating mean library insert size and standard deviation (clones were considered valid if they did not exceed a distance of 2.5 standard deviations from the defined library mean insert size).

 In the absence of insert size information, an estimation of the mean insert size and standard deviation was computed based on actual clone lengths produced by paired reads of the same annotated library or sequencing plate, which mapped uniquely to the reference genome to produce a valid clone.
If no such library or plate information existed, or if there were not enough uniquely mapped valid clones to assume normal distribution of clone lengths (N<50), all clones shorter than 50kb were considered as valid.

After discarding all invalid clones, there were cases in which some pairs of mate reads produced more than one valid clone. In such cases, one of the clones was selected randomly per each mate pair, and the remaining clones were discarded from further analysis.


Nucleotide clone coverage analysis

Coverage was determined for each nucleotide position in the reference genome, by counting the number of valid mapped clones that spanned that position (see Aligning Sequencing Clones to a Finished Refenece Genome). Replicons for which the average clone coverage depth was less than x 10 were considered as insufficiently covered, and did not participate in any further coverage analysis. Furthermore, replicons in which more than 3% of the sequence was not covered by a single clone were considered to be outliers, and were ignored as well.


Gene clonability analysis

Gene clonability, the ability of a gene to be cloned into E. coli, was assessed by the number of mapped clones which fully contained the gene (see Aligning Sequencing Clones to a Finished Reference Genome), compared to simulations of random gene clone coverage, and considering clonability of near-by genomics elements.

The possible values describing gene clonability are the following:

  • “unclonable” - indicating that the gene was not fully contained in any of the sequenced clones mapped to the reference genome, and this was not expected to happen by chance alone.
  • “decreased coverage” - indicating that the gene was contained by at least one sequenced clone mapped to the reference genome, but the number of such clones was significantly lower than what would be expected by chance.
  • “hitchhiker” - indicating that the gene showed lower clone coverage than expected, but this is thought to be a side-effect of the existence of toxic elements in its neighborhood.
  • “n/a” - indicating that the replicon on which the gene resides did not have sufficient clone coverage for quantifying gene clonability.
  • “normal” - indicating that the number of clones containing the gene was as expected, meaning that the gene was successfully cloned into E. coli.

Gene positions were defined according to the "gene" feature annotation in the reference genomes' GenBank files.

In order to define if a gene was covered by fewer clones than expected by chance, it was necessary to quantify the "unclonability" of a gene and assign it with a p-value. Such quantification would also allow comparisons between genes across replicons and genomes, which vary in coverage depths, making the number of covering clones alone irrelevant for comparison. Furthermore, long genes are less plausible to be fully covered by a single clone, thus their low clone coverage does not necessarily imply any biological effect.

Therefore, to assess the significance of low or zero-coverage of a gene, 100 coverage simulations were performed, by randomly mapping all valid sequencing clones to the reference genome, ignoring sequence alignment. The number of clones covering each gene was obtained per simulation, and the mean of random clone coverage was calculated per gene (N=100). A p-value for the actual gene clone coverage was calculated, relative to a cumulative Poisson distribution of the random coverage values (λ=mean). P-values for each gene were then corrected for multiple hypothesis testing using FDR correction (N=number of genes per genome).

Additional processing was then performed to identify "hitchhiker" genes. Hitchhiker genes are thought to be genes whose unclonability is likely to be due to a near-by genomic element, such as a neighbor gene, which is the core reason for the unclonability of sequencing clones spanning that area. Such hitchhiker genes were detected by searching for the local minimum of nucleotide clone-coverage in a window surrounding the gene in question. The window size used was the median of all clones mapped to the genome. If the lowest nucleotide clone-coverage value was not within the gene coordinates, the gene was marked as a "hitchhiker".

Eventually, each gene was assigned with an unclonability value ("unclonable", "decreased coverage", "hitchhiker" , "normal" , “n/a”) according to the number of clones fully containing it, the assigned p-value indicating the chance of seeing such coverage by chance, and the hitchihiker analysis. A gene was considered to have significantly low clone coverage if P < 0.01.


Unclonable intergenic region detection

Unclonable intergenic regions were detected by a naive search for a successive sequence of nucleotides not covered by any mapped sequencing clones. If such a sequence was found to span a region which was solely intergenic (i.e. had no overlap with any annotated gene sequence), it was considered as an unclonable intergenic region.


Fosmid and plasmid independent coverage analysis

It is quite common for sequencing centers to use two types of cloning vectors when creating shotgun sequencing libraries 30. The vector type may be either a plasmid, which can contain insert sizes of up to 20kb and appear in the transformed cell in 20-100 copies, or a fosmid, which contains larger insert sizes, typically of 35 kb, and appears in the host cell as a single copy.
The differentiation between these 2 vector types in terms of copy number and cloned insert size can be of use in analyzing gene toxicity to the host, which may depend on its expression level, on existence or absence of genes in near-by loci, etc.
Therefore, all clone coverage analysis for genes, nucleotides, and intergenic regions was performed for plasmid- and fosmid-derived clones independently, if data existed, and if clone coverage depth for each type of library was sufficient.
Due to absence of explicit library type information in trace data, all libraries containing insert sizes smaller than 20kb were considered plasmid libraries, and all libraries with insert sizes larger than 20 kb were considered fosmid libraries.


Creating PanDaTox blast databases

In order to allow users to search and compare any sequence against the sequences available in PanDaTox, three PanDaTox-derived databases were created using the formatdb software tool (version 2.2.20):
  • a protein database containing protein sequences of all the coding genes in PanDaTox (1,506,369 protein sequences).
  • a nucleotide database containing nucleotide sequences of all the genes in PanDaTox (1,566,994 gene sequences).
  • a nucleotide database containing nucleotide sequences of all the replicons comprising the genomes in PanDaTox (857 replicon sequences).


Collecting PanDaTox homologs for each gene

To allow a broader analysis of genes whose clonability is in question, homologous genes were collected for each of the genes in the PanDaTox database. The homolog search was performed for each of the genes in PanDaTox, by running the blastall softwear tool (version 2.2.20) against the relevant PanDaTox preformatted BLAST database, without masking of low complexity regions (-F F flag). Homologs for a protein coding gene were searched for in the protein sequence collection db, using blastp, and homologs for a non-coding gene were searched for in the gene nucleotide sequence collection, using blastn. The maximum alignments limit was defined as 1000, and the E-value cutoff was 0.01


Blasting user sequences against PanDaTox

A user sequence is blasted against a PanDaTox collection by running the blastall software tool (version 2.2.20) against the relevant PanDaTox preformatted BLAST database, without masking of low complexity regions (-F F flag)