T-lex is a computational pipeline that detects presence and/or absence of annotated individual transposable elements (TEs) using next-generation sequencing (NGS) data. T-lex combines two distinct and complementary TE detection approaches.
The two TE detection approaches.
The results of the two detection approaches can be combined to obtain a more accurate TE call. Using data from multiple strains of the same population, T-lex ascertains TE presence/absence for each strain and also estimates the TE frequency in the population: it adds the number of strains in which the TE is 'present' to one-half the number of 'polymorphic' strains, and divides by the total number of strains for which T-lex returns data. Both TE detection approaches are based on mapping reads to the junctions of the TE insertions. Consequently, a high repeat density in the TE flanking regions or a TE mis-annotation may bias the T-lex calls. To overcome this, T-lex identifies TE insertions that are nested or located in high-repeat-density regions (see 'TE flanking sequence analysis' part) and detects putatively mis-annotated TE insertions (see 'Mis-annotated TE detection' part). Mis-annotation detection relies on identifying the 'true' TE breakpoint: T-lex uses the outputs of the 'absence' detection to recover the ancestral genomic sequence prior to the TE insertion and detects the Target Site Duplication (TSD), a trace of the TE insertion mechanism. If the TSD detection fails, the TE may be mis-annotated.
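The frequency estimate described above can be illustrated with a minimal Python sketch. It is not T-lex code: the function name and the 'absent' and 'no_data' labels are assumptions made for the example, and any call other than 'present', 'polymorphic' or 'absent' is treated as a strain for which no data was returned.

```python
def te_frequency(calls):
    """Estimate the population frequency of one TE from per-strain calls.

    calls: iterable of call labels, one per strain. 'present' counts as 1,
    'polymorphic' as 0.5 and 'absent' as 0. Any other label (hypothetical
    'no_data' here) is ignored, so the denominator is the number of strains
    with data.
    """
    weights = {"present": 1.0, "polymorphic": 0.5, "absent": 0.0}
    scores = [weights[c] for c in calls if c in weights]
    if not scores:
        return None  # no strain returned data for this TE
    return sum(scores) / len(scores)


# Two 'present', one 'polymorphic', one 'absent', one strain without data:
# (2 + 0.5 * 1) / 4 = 0.625
example_calls = ["present", "present", "polymorphic", "absent", "no_data"]
print(te_frequency(example_calls))
```

Returning None when no strain has data keeps "no estimate" distinct from a frequency of zero.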
2. RepeatMasker and libraries open-3.2.8 or greater (Smit 1996-2010; http://www.repeatmasker.org/RMDownload.html for the RepeatMasker program and http://www.girinst.org for the libraries).
3. Maq - Mapping and Assembly with Qualities - version 0.7.1 (Li et al. 2008; http://sourceforge.net/projects/maq/files/maq).
4. SHRIMP2 - SHort Read Mapping Package - Release 2.2.1/Oct. 31, 2011 (David et al. 2011; http://compbio.cs.toronto.edu/shrimp).
5. BLAT version 34 or greater (Kent 2002; http://www.soe.ucsc.edu/~kent).
Only for the Target Site Duplication (TSD) detection:
6. Phrap version 1.090518 ("Phil's Revised Assembly Program"; Green, 1999; http://www.phrap.org)
7. FastaGrep developed by the Department of Bioinformatics at the University of Tartu & Estonian Biocentre (http://bioinfo.ebc.ee)
TE list                     - the list of the TEs to be analyzed, with one TE identifier per line (e.g. 'FBti0019293' for a TE in Drosophila).
TE annotations        - the annotations of these TEs. Tab-delimited file with five columns:
TE name / location / start nucleotide position/ end nucleotide position / strand (+ or -),
e.g. 'FBti0019293 3R 405387 406627 - '.
reference genome     - the reference genomic sequences where TEs have been annotated in fasta format (e.g. '>3R XXXXX').
NGS data                    - the NGS data in official fastq format (http://en.wikipedia.org/wiki/Fastq).
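As an illustration of the TE list and TE annotation inputs described above, the following Python sketch parses five-column annotation lines and keeps the TEs named in a TE list. It is not part of T-lex: the names TEAnnotation, parse_te_annotation_line and select_annotations are hypothetical, and fields are split on any whitespace, on the assumption that the columns are separated by tabs or spaces.

```python
from typing import List, NamedTuple


class TEAnnotation(NamedTuple):
    name: str
    location: str
    start: int
    end: int
    strand: str


def parse_te_annotation_line(line: str) -> TEAnnotation:
    """Parse one annotation line: name, location, start, end, strand."""
    fields = line.split()  # assumption: tab- or space-separated columns
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
    name, location, start, end, strand = fields
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    start_pos, end_pos = int(start), int(end)
    if start_pos > end_pos:
        raise ValueError(f"start is after end for {name}")
    return TEAnnotation(name, location, start_pos, end_pos, strand)


def select_annotations(te_ids: List[str], lines: List[str]) -> List[TEAnnotation]:
    """Keep only the annotations whose TE name appears in the TE list."""
    wanted = {te_id.strip() for te_id in te_ids if te_id.strip()}
    records = [parse_te_annotation_line(l) for l in lines if l.strip()]
    return [r for r in records if r.name in wanted]


annotation = parse_te_annotation_line("FBti0019293\t3R\t405387\t406627\t-")
print(annotation.name, annotation.location, annotation.end - annotation.start + 1)
```

Validating the column count and the strand early makes a malformed annotation file fail with a clear message before any mapping is started.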
In order to handle multiple strains at the same time, T-lex requires a specific file organization for the input data. Even if you only want to run T-lex with one strain, the NGS data has to be stored in one directory in which each subdirectory holds the NGS data of a single strain, such as:
[input strain directory]/
[strain name 1]/
[strain name 1]_read.fastq
[strain name 2]/
[strain name 2]_read.fastq
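Before launching a run, the single-end layout shown above can be checked with a short Python sketch such as the following. It is only an illustration: the function name find_layout_problems is hypothetical, and the check covers the single-end naming ('[strain name]_read.fastq') only.

```python
from pathlib import Path


def find_layout_problems(input_dir):
    """Return a list of messages describing deviations from the layout
    [input strain directory]/[strain name]/[strain name]_read.fastq."""
    root = Path(input_dir)
    if not root.is_dir():
        return [f"input directory not found: {root}"]
    strain_dirs = sorted(p for p in root.iterdir() if p.is_dir())
    if not strain_dirs:
        return [f"no strain subdirectory in {root}"]
    problems = []
    for strain_dir in strain_dirs:
        # The fastq file must carry the name of its strain subdirectory.
        expected = strain_dir / f"{strain_dir.name}_read.fastq"
        if not expected.is_file():
            problems.append(f"missing {expected.name} in {strain_dir.name}/")
    return problems
```

An empty list means every strain subdirectory contains a fastq file named after the strain.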
To handle paired-end reads, separate the two reads of each pair into two fastq files ([strain name]_read1.fastq and [strain name]_read2.fastq). Each fastq file should be renamed to follow the pattern 'XXX_readX.fastq'.
[input strain directory]/
-O          text    project name
-noclean            keep the intermediate files
-h or -help         display this help
For the TE filtering step:
-noFilterTE  int    do not filter TEs
-s        string    name of the species studied ( for RepeatMasker, e.g. 'drosophila' )
-d        float     minimum repeat density at the flanking regions of the TEs ( default: 0.5 (50%) )
For the TE presence detection:
-q                  launch only the presence detection approach
-j           int    length of the junction sequences to extract in bp ( default: 1000 )
-b           int    length of the internal region of the TE in bp ( default: 60 )
-limp        int    minimum match length required with the TE sequence in bp ( default: 5 )
-id          int    minimum sequence identity required with the TE sequence in % ( default: 95 )
-minQ        int    minimum Phred quality score required to select a read ( default: 30 )
-processes   int    number of processes (used only for the NGS data reformatting; default: 1 )
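To make the roles of '-j' and '-b' more concrete, here is a minimal Python sketch of how the two junction sequences of a TE could be extracted. It is an interpretation, not T-lex code: it assumes that each junction is made of the flanking sequence ('-j' bp) plus the adjacent terminal part of the TE ('-b' bp), and that annotation coordinates are 1-based and inclusive. The function name is hypothetical.

```python
def junction_sequences(chrom_seq, start, end, junction=1000, internal=60):
    """Return (left_junction, right_junction) for a TE annotated at
    start..end (assumed 1-based, inclusive) on chrom_seq.

    Each junction is `junction` bp of flanking sequence joined to
    `internal` bp of the TE terminal region (assumed layout).
    """
    te_start = start - 1  # 0-based index of the first TE base
    te_end = end          # 0-based index just past the last TE base
    # Left side: flank upstream of the TE + first `internal` bp of the TE.
    left = chrom_seq[max(0, te_start - junction):min(te_end, te_start + internal)]
    # Right side: last `internal` bp of the TE + flank downstream of the TE.
    right = chrom_seq[max(te_start, te_end - internal):te_end + junction]
    return left, right


# Toy chromosome: 10 bp flank, 20 bp TE, 10 bp flank.
toy = "A" * 10 + "T" * 20 + "C" * 10
print(junction_sequences(toy, 11, 30, junction=4, internal=3))
# prints ('AAAATTT', 'TTTCCCC')
```

The clamping with max and min keeps the slices valid for TEs close to a sequence end or shorter than the internal region.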
For the TE absence detection:
-p                  launch only the absence detection approach
-f           int    length of the flanking sequences to extract and concatenate in bp ( default: 125 )
-v           int    minimum read length spanning the two TE sides in bp ( default: 15 )
-pairends string    paired-end (PE) mapping, 'yes' or 'no' ( default: 'no' )
-lima        int    minimum non-repeated region on each side of the sequence in bp ( default: 5 )
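The '-f' and '-v' options can be pictured with the following Python sketch, which builds the sequence expected when the TE is absent by concatenating the two flanking sequences, then tests whether a mapped read spans the junction between them. This is a sketch under assumptions, not T-lex code: coordinates are taken as 1-based and inclusive, '-v' is read as a minimum overlap on each side of the junction, and both function names are hypothetical.

```python
def absence_reference(chrom_seq, start, end, flank=125):
    """Concatenate the flanks of a TE annotated at start..end (assumed
    1-based, inclusive). Returns (sequence, junction_position), where
    junction_position is the 0-based index of the first base of the
    right flank in the concatenated sequence."""
    te_start = start - 1
    left = chrom_seq[max(0, te_start - flank):te_start]
    right = chrom_seq[end:end + flank]
    return left + right, len(left)


def read_spans_junction(read_start, read_end, junction_position, min_span=15):
    """True if a read aligned to [read_start, read_end) on the concatenated
    sequence covers at least min_span bp on each side of the junction."""
    left_overlap = junction_position - read_start
    right_overlap = read_end - junction_position
    return left_overlap >= min_span and right_overlap >= min_span


toy = "A" * 10 + "T" * 20 + "C" * 10
ref, junction_position = absence_reference(toy, 11, 30, flank=4)
print(ref, junction_position)                                    # AAAACCCC 4
print(read_spans_junction(0, 8, junction_position, min_span=2))  # True
print(read_spans_junction(3, 8, junction_position, min_span=2))  # False
```

A read that reaches only one base into the left flank, as in the last call, is not counted as evidence for the absence of the TE.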
To detect other TEs using the same NGS data or the same TEs using other NGS data, the options -binreads and -binref can be used to bypass the reformatting step of the input data and save computation time (see 'Running-time' part below). To do this, the option '-noclean' has to be added to the T-lex command line. In addition, the same project name has to be used. To combine all the TE detection results, specific options can be used:
-combRes      combine the presence/absence results from one strain
-combine      combine the presence/absence results from different strains
-freq         return the TE frequency based on the given strains
-pooled       return the TE frequency based on pooled data (To use with the option -freq)
-tsd          return the TSDs for the TE insertions detected as absent (use with -align and -p)
-align        return the multiple alignments
-combAll      combine the frequency estimates with the TE breakpoint analysis
Tflank_checking_[length of the flanking region analyzed].fasta.masked
Tflank_checking_[length of the flanking region analyzed].fasta.out
Tpoly_[length of the flanking region analyzed].fasta.polyAT
Using the option '-pooled', T-lex can also estimate the frequency of each TE in the population using pooled NGS data (Fiston-Lavier et al., 2011, in prep). This estimate is based on the number of reads supporting the absence and the presence of the TE insertions reported by T-lex. The frequency estimates are stored in the file called 'Tfreq_pooled'. This file contains five columns such as:
The 'Talign' directory is divided into two sub-directories: the 'presence_detection' and the 'absence_detection'. They contain respectively the alignments from the presence and the absence TE detection such as:
[TE name]_[TE side].contig_all: multiple alignment of the contigs.
[TE name]_absence.fasta: all the pairwise alignments of a TE.
The intermediate files can be kept (see option '-noclean'). The intermediate files from the two detection approaches are stored in the 'Tpresence' and 'Tabsence' directories. These two directories and the fasta files containing the reformatted reference sequences are stored in the T-lex output directory: 'tlex_[project]'.
If a TE is detected as 'present' but the contig sequence overlapping the TE sequence is short (small 'left_match_length' or 'right_match_length' value), we recommend visually inspecting the alignments from the presence detection approach using an alignment viewer such as Jalview (http://www.jalview.org/download.html). If terminal mismatches are detected, the T-lex result could be erroneous.
The 'read_number' value helps to evaluate the absence detection approach. An absence call supported by a small number of reads can be explained by low read coverage, whereas an extremely high number of reads can only be explained by repetitive features in the flanking sequences of the TE (e.g. a TE inserted into another TE). Note that if the TE filtering step is 'on', only segmental duplications encompassing TEs could explain such a high number of reads. We therefore recommend manually curating the TEs in this situation by blasting the genomic sequence corresponding to a potential duplication against the reference genomic sequences. Alternatively, available segmental duplication annotations can be used. Alignments of the reads spanning the two TE sides can also be used to check that T-lex is not misled by imperfect satellites or short poly-A tails not detected by RepeatMasker (shorter than 20 bp).
Fiston-Lavier AS, Carrigan M, Petrov DA and Gonzalez J. (2010) T-LEX: A program for fast and accurate assessment of transposable element presence using next-generation sequencing data. Nuc. Acids. Res. 2011 Mar 1;39(6):e36. Epub 2010 Dec 21
-Jurka,J., Kapitonov,V.V., Pavlicek,A., Klonowski,P., Kohany,O., Walichiewicz,J. (2005) Repbase update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 110:462-467.
-Li,H., Ruan,J., Durbin,R. (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18:1851-1858.
-Rumble,S.M., Lacroute,P., Dalca,A.V., Fiume,M., Sidow,A., Brudno,M. (2009) SHRiMP: Accurate Mapping of Short Color-space Reads. PLoS Comput. Biol. 5(5): e1000386. doi:10.1371/journal.pcbi.1000386.
-David M, Dzamba M, Lister D, Ilie L, Brudno M. (2011) SHRiMP2: sensitive yet practical SHort Read Mapping. Bioinformatics. Apr 1;27(7):1011-2. Epub 2011 Jan 28.
-Kent W.J. (2002) BLAT - the BLAST-like alignment tool. Genome Res. 12:656-664
-de la Bastide M, McCombie WR. (2007) Assembling genomic DNA sequences with PHRAP. Curr Protoc Bioinformatics. 2007 Mar;Chapter 11:Unit11.4.
Website designed by Anna-Sophie Fiston-Lavier
(asfiston at univ-montp2 dot fr)