Frequently Asked Questions

What is gene clonability?
How is gene clonability calculated?
What is the difference between a plasmid and a fosmid?
Which sources of information were used to build PanDaTox?
How are gene homologs defined?
How big is the PanDaTox collection?
How can one contribute to the PanDaTox database?

How come I see a gene defined as unclonable but when looking at its coverage plot I do not see a coverage gap?


What is gene clonability?

Gene clonability is defined by the number of clones fully covering a gene of interest,
compared to random gene clone coverage and to the clonability of nearby genomic elements.
The following values are used to define clonability:
Unclonable - the gene had zero clone coverage, i.e. it could not be cloned in E. coli in any of the attempts.
Decreased coverage - the gene showed significant coverage decrease, i.e. there were a few clones which
contained the gene, but the number of such clones was significantly lower than expected.
Hitchhiker - the gene had significantly decreased or zero clone coverage, but its low coverage may be due to
a nearby genomic element which resided on the same sequencing clones.
Normal - the gene showed no significant coverage decrease.


How is gene clonability calculated?

The process of calculating gene clonability is as follows:
  1. Count the number of sequencing clones which fully cover the gene of interest,
    according to alignment of WGS sequencing clones to a reference genome.
  2. Perform 100 simulations, in which we randomly map the sequencing clones to the reference genome,
    without taking sequence alignment into consideration.
  3. Count the number of clones fully covering the gene in each of the random simulations.
    The random gene clone coverage values follow a Poisson distribution.
  4. Calculate the p-value of the actual gene clone coverage relative to a cumulative Poisson
    distribution of the random coverage values.
  5. Use FDR correction for multiple hypothesis testing to correct the p-value, according to the number
    of genes in the given genome.
  6. If the corrected p-value ≥ 0.01, gene clonability is considered Normal.
  7. If the corrected p-value < 0.01, we check the nucleotide coverage of the gene's neighborhood:
    1. If the minimal nucleotide coverage is not within the gene limits, the gene is considered a Hitchhiker.
    2. If the minimal nucleotide coverage is within the gene limits, the gene is considered either Unclonable
      (if there are zero clones which fully contain the gene), or Decreased coverage (if there is at least
      one clone fully covering the gene).
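
The steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the PanDaTox implementation. It assumes a linear genome, clones given as half-open (start, end) intervals, a lower-tail Poisson p-value whose mean is the average of the simulated counts, and Benjamini-Hochberg as the FDR procedure (the text above does not name the procedure). All function names are hypothetical. The neighborhood check of step 7 is passed in as a boolean, because the size of the neighborhood is not specified here.

```python
import math
import random


def count_full_cover(clones, gene_start, gene_end):
    """Steps 1 and 3: count clones (start, end) that contain the whole gene."""
    return sum(1 for s, e in clones if s <= gene_start and e >= gene_end)


def simulate_mean_coverage(clone_lengths, genome_len, gene_start, gene_end,
                           n_sims=100, seed=0):
    """Steps 2-3: place clones at random positions and average the full-cover counts."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_sims):
        placed = []
        for length in clone_lengths:
            start = rng.randrange(0, genome_len - length + 1)
            placed.append((start, start + length))
        counts.append(count_full_cover(placed, gene_start, gene_end))
    return sum(counts) / n_sims


def poisson_cdf(k, lam):
    """Step 4: P(X <= k) for X ~ Poisson(lam), used as the lower-tail p-value."""
    return sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k + 1))


def benjamini_hochberg(pvals):
    """Step 5 (assumed procedure): Benjamini-Hochberg adjusted p-values."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def classify(corrected_p, full_clones, min_cov_inside_gene, alpha=0.01):
    """Steps 6-7: map a corrected p-value to a clonability label."""
    if corrected_p >= alpha:
        return "Normal"
    if not min_cov_inside_gene:
        return "Hitchhiker"
    return "Unclonable" if full_clones == 0 else "Decreased coverage"
```

In this sketch, the p-value for one gene would be poisson_cdf(observed_count, simulated_mean). The p-values of all genes in the genome are then passed together to benjamini_hochberg before classify is applied to each gene.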

       


What is the difference between a plasmid and a fosmid?

Plasmids and fosmids are different vector types used during the WGS sequencing process to
introduce foreign genomic sequences into E. coli cells.
Plasmid cloning vectors typically carry small inserts, such as 3 kb or 6 kb.
They are present in high copy number in the transformed cell (20-100 copies per cell).
Fosmid cloning vectors typically carry larger inserts, such as 35 kb.
They are present in a single copy in the transformed cell.


Which sources of information were used to build PanDaTox?

There are four main sources of information:
  1. GOLD: Genomes Online Database (GOLD) is a World Wide Web resource for comprehensive
    access to information regarding complete and ongoing genome projects.
    We used GOLD as a starting point for processing, to identify genomes whose sequencing projects
    have been completed and published.
    www.genomesonline.org/
     
  2. NCBI RefSeq: The Reference Sequence (RefSeq) collection provides a comprehensive, integrated,
    non-redundant, well-annotated set of sequences, which provide a stable reference for genome annotation.
    We use the NCBI RefSeq database as a source for reference genomes, i.e. fully sequenced
    and assembled genomes against which we map the raw sequencing reads and infer clone position.
    www.ncbi.nlm.nih.gov/refseq/
     
  3. NCBI Trace Archive: A repository containing raw sequence traces for Whole Genome Shotgun projects.
    The archive offers trace files, fasta files, quality scores, and ancillary data, which are used to infer clone
    coverage of a reference genome.
    http://nsdl.org/resource/2200/test.20061004111541306T
     
  4. IMG: The Integrated Microbial Genomes (IMG) system serves as a community resource for
    comparative analysis and annotation of all publicly available genomes from three domains of life.
    img.jgi.doe.gov/cgi-bin/w/main.cgi


How are gene homologs defined?

Please see our Methods page.


How big is the PanDaTox collection?

Please see our Statistics page.


How can one contribute to the PanDaTox database?

The PanDaTox database would like to expand its pool of experimentally validated unclonable genes and intergenic regions.
If you would like to share your experimental data or any insight regarding data presented by PanDaTox, please contact us.


How come I see a gene defined as unclonable but when looking at its coverage plot I do not see a coverage gap?

The coverage plot shows the number of clones covering each nucleotide in the genomic sequence, regardless of the underlying genomic element.
Gene clonability is defined by the number of sequencing clones which contain the whole gene.
If there are no clones which fully cover the gene, the gene will be defined as unclonable.
However, there may be clones that contain only a portion of the gene. These partially covering clones
explain why no gap appears in the nucleotide coverage plot.
See schematic illustration as an example:
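
The same point can also be shown with a small numeric sketch. The gene coordinates and the two clones below are made up for illustration, intervals are half-open (start, end), and the function names are hypothetical.

```python
def nucleotide_coverage(clones, start, end):
    """Per-position clone depth over the half-open interval [start, end)."""
    return [sum(1 for s, e in clones if s <= pos < e) for pos in range(start, end)]


def full_cover_count(clones, gene_start, gene_end):
    """Number of clones that contain the whole gene."""
    return sum(1 for s, e in clones if s <= gene_start and e >= gene_end)


gene = (100, 200)                  # hypothetical gene coordinates
clones = [(20, 150), (140, 260)]   # two clones, each covering only part of the gene

depth = nucleotide_coverage(clones, *gene)
print(min(depth))                      # 1: every nucleotide is covered, so no gap in the plot
print(full_cover_count(clones, *gene)) # 0: no clone contains the whole gene
```

Each nucleotide of the gene is covered by at least one clone, so the coverage plot shows no gap, yet the count of fully covering clones is zero.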