T-lex2 manual
T-lex is a computational pipeline that detects the presence and/or absence of annotated individual transposable elements (TEs) using next-generation sequencing (NGS) data. T-lex combines two distinct and complementary TE detection approaches, and the results from the two approaches may be combined to get a more accurate TE call. Using data from multiple strains from the same population, T-lex ascertains TE presence/absence for each strain and also estimates the TE frequency in the population by adding the number of strains for which the TE is 'present' and one-half times the number of 'polymorphic' strains, and dividing by the total number of strains for which T-lex returns data.

Both TE detection approaches are based on the mapping of reads on the junctions of the TE insertions. Consequently, a high repeat density at the TE flanking regions or a TE mis-annotation may bias the T-lex calls. To overcome this, T-lex identifies the TE insertions that are nested or located in high-repeat density regions (see 'TE flanking sequence analysis' part), and detects putatively mis-annotated TE insertions (see 'Mis-annotated TE detection' part). The detection of a mis-annotation is based on the detection of the 'true' TE breakpoint. T-lex uses the outputs of the 'absence' detection to recover the ancestral genomic sequence prior to the TE insertion and detects the Target Site Duplication (TSD), a trace of the mechanism of TE insertion. If the TSD detection fails, the TE can be suspected to be mis-annotated.

How to run T-lex

- Prerequisites
1. Unix system with Perl version 5.10.0 or higher (http://www.perl.org/get.html).
2. RepeatMasker and libraries open-3.2.8 or greater (Smit 1996-2010; http://www.repeatmasker.org/RMDownload.html for the RepeatMasker program and http://www.girinst.org for the libraries).
3. Maq (Mapping and Assembly with Qualities) version 0.7.1 (Li et al. 2008; http://sourceforge.net/projects/maq/files/maq).
4. SHRiMP2 (SHort Read Mapping Package) Release 2.2.1/Oct. 31, 2011 (David et al. 2011; http://compbio.cs.toronto.edu/shrimp).
5. BLAT version 34 or greater (Kent 2002; http://www.soe.ucsc.edu/~kent).
Only for the Target Site Duplication (TSD) detection:
6. Phrap version 1.090518 ("Phil's Revised Assembly Program"; Green, 1999; http://www.phrap.org).
7. FastaGrep, developed by the Department of Bioinformatics at the University of Tartu & Estonian Biocentre (http://bioinfo.ebc.ee).

- Input data
T-lex requires at least four input data:
TE list             - the list of the TEs to be analyzed, with one TE identifier per line (e.g., for a TE in Drosophila, 'FBti0019293').
TE annotations      - the annotations of these TEs. Tabulated file with five columns:
                      TE name / location / start nucleotide position / end nucleotide position / strand (+ or -),
                      e.g. 'FBti0019293 3R 405387 406627 -'.
reference genome    - the reference genomic sequences on which the TEs have been annotated, in fasta format (e.g. '>3R XXXXX').
NGS data            - the NGS data in official fastq format (http://en.wikipedia.org/wiki/Fastq).

In order to handle multiple strains at the same time, T-lex requires a specific file organization for the input data. Even if you only want to run T-lex with one strain, the NGS data has to be stored in one directory in which each subdirectory holds the NGS data of a single strain, such as:
       [input strain directory]/
             [strain name 1]/
                   [strain name 1]_read.fastq
             [strain name 2]/
                   [strain name 2]_read.fastq
To handle paired-end reads, separate the reads from the same pair ([strain name]) into two fastq files ([strain name]_read1.fastq and [strain name]_read2.fastq). Each fastq file should be renamed to follow the pattern 'XXX_readX.fastq'.
       [input strain directory]/
             [strain name]/
                   [strain name]_read1.fastq
                   [strain name]_read2.fastq

- T-lex command
tlex-open-v2.pl [ Options ] [ -T TE list ] [ -M TE annotations ] [ -G reference genome ] [ -R NGS data ]

- T-lex options
       -A           int    maximum read length in the data set in bp ( default: 100 )
       -O          text    project name
       -noclean            keep the intermediate files
       -h or -help         display this help
       For the TE filtering step:
       -noFilterTE  int    do not filter TEs
       -s        string    name of the species studied ( for RepeatMasker, e.g. 'drosophila' )
       -d        float     minimum repeat density at the flanking regions of the TEs ( default: 0.5 (50%) )
       For the TE presence detection:
       -q                  launch only the presence detection approach
       -j           int    length of the junction sequences to extract in bp ( default: 1000 )
       -b           int    length of the internal region of the TE in bp ( default: 60 )
       -limp        int    minimum match length required with the TE sequence in bp ( default: 5 )
       -id          int    minimum sequence identity required with the TE sequence in % ( default: 95 )
       -minQ        int    minimum Phred quality score required to select a read ( default: 30 )
       -processes   int    number of processes ( used only for the NGS data reformatting; default: 1 )
       For the TE absence detection:
       -p                  launch only the absence detection approach
       -f           int    length of the flanking sequences to extract and concatenate in bp ( default: 125 )
       -v           int    minimum read length spanning the two TE sides in bp ( default: 15 )
       -pairends string    paired-end mapping, 'yes' or 'no' ( default: 'no' )
       -lima        int    minimum non-repeated region on each side of the sequence in bp ( default: 5 )

To detect other TEs using the same NGS data, or the same TEs using other NGS data, the options -binreads and -binref can be used to bypass the reformatting step of the input data and save computation time (see 'Running-time' part). To do this, the option '-noclean' has to be added to the T-lex command line, and the same project name has to be used.

To combine all the TE detection results, specific options can be used:
       -combRes      combine the presence/absence results from one strain
       -combine      combine the presence/absence results from different strains
       -freq         return the TE frequency based on the given strains
       -pooled       return the TE frequency based on pooled data ( to use with the option -freq )
       -tsd          return the TSDs for the TE insertions detected as absent ( use with -align and -p )
       -align        return the multiple alignments
       -combAll      combine the frequency estimates with the TE breakpoint analysis

Running-time
The average computational running time of the main T-lex2 steps was measured on a personal UNIX machine (1 processor, 2.33 GHz, 8 GB RAM) detecting the presence and/or absence of 100 TEs in a single Drosophila melanogaster strain (15X coverage, reads of 100 bp in length). Note that the number of annotated TEs given as input changes the running time of the mapping steps only slightly (e.g., 10 min more for 800 TEs).
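The input layout and command line described in 'How to run T-lex' can be sketched in Python as follows. This is only an illustrative helper, not part of T-lex: the function names find_strain_reads and build_tlex_command are hypothetical, and only options listed in this manual (-A, -O, -pairends, -T, -M, -G, -R) are used. How the script is actually invoked (for example through 'perl') depends on the local installation.

```python
from pathlib import Path


def find_strain_reads(strain_root):
    """Map each strain sub-directory to its sorted '[strain]_read*.fastq' files.

    Hypothetical helper: it only checks the directory layout described in the
    manual and does not read the fastq files themselves.
    """
    layout = {}
    for strain_dir in sorted(Path(strain_root).iterdir()):
        if strain_dir.is_dir():
            pattern = f"{strain_dir.name}_read*.fastq"
            layout[strain_dir.name] = sorted(p.name for p in strain_dir.glob(pattern))
    return layout


def build_tlex_command(te_list, te_annot, genome, strain_root,
                       project=None, max_read_len=100, paired=False):
    """Assemble an argument list for tlex-open-v2.pl from documented options."""
    cmd = ["tlex-open-v2.pl", "-A", str(max_read_len)]
    if project:
        cmd += ["-O", project]          # project name
    if paired:
        cmd += ["-pairends", "yes"]     # paired-end mapping is 'no' by default
    cmd += ["-T", te_list, "-M", te_annot, "-G", genome, "-R", strain_root]
    return cmd
```

A strain whose list of fastq files comes back empty does not follow the expected naming. The returned argument list could then be passed to subprocess.run once the layout has been checked.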
T-lex outputs
T-lex produces several output directories and files. The output is stored in a working directory named by default 'tlex_output' or 'tlex_[project name]'. By default, only the final results (the 'Tresults' file) and the data necessary for the manual curation (the 'Tanalysis' and 'Talign' sub-directories) are returned.

- TE flanking sequence analysis
The 'Tanalysis' sub-directory includes two RepeatMasker output files. One contains the submitted sequences in which the repeats have been masked (*.masked); the repeated regions are represented by 'N's. The other file contains the table summarizing the detected repeats (*.out). Tanalysis also includes the detection of longer poly-A tails, obtained by looking for stretches of 'A' or 'T' at the 3' end of the TE flanking sequence. The files are organized such as:
       Tanalysis/
           Tflank_checking_[length of the flanking region analyzed].fasta.masked
           Tflank_checking_[length of the flanking region analyzed].fasta.out
           Tpoly_[length of the flanking region analyzed].fasta.polyAT

- TE detection results
The 'Tresults' file contains the TE presence/absence detection results. Information about the read and repeat density at the flanking regions of each TE is also returned. This file is composed of 12 columns.
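As an illustration of the repeat density discussed for the flanking regions, the sketch below computes the fraction of 'N' positions in a RepeatMasker-masked sequence. It is a minimal example under stated assumptions: the manual does not specify how T-lex computes the density internally, the function names are hypothetical, and any 'N' already present in the unmasked reference would also be counted.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {sequence name: sequence} dictionary."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # keep the first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def repeat_density(masked_seq):
    """Fraction of positions masked as 'N' (assumed proxy for repeat density)."""
    if not masked_seq:
        return 0.0
    return masked_seq.upper().count("N") / len(masked_seq)
```

The resulting value can be compared with the threshold given to the option -d (0.5 by default) to list the flanking regions that deserve a closer look.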
- TE frequency estimates
Using the combination of the results, T-lex estimates the frequency of each TE in the population by adding the number of strains for which the TE is 'present' and one-half times the number of 'polymorphic' strains, and dividing by the total number of strains for which T-lex returns data. This step is included by default. The estimates are stored in the 'Tfreq' file, which contains 6 columns.
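The strain-based estimate can be written out as a short function, following the definition given in the introduction (strains called 'present' plus one-half of the strains called 'polymorphic', divided by the strains with data). The call labels and the function name are assumptions made for this example; in particular, any label other than 'present', 'polymorphic' or 'absent' is treated here as "no data".

```python
def te_frequency(calls):
    """Estimate a TE frequency from a list of per-strain calls.

    Assumed labels: 'present', 'polymorphic', 'absent'. Any other label is
    counted as a strain without data. Returns None when no strain has data.
    """
    present = sum(1 for c in calls if c == "present")
    polymorphic = sum(1 for c in calls if c == "polymorphic")
    absent = sum(1 for c in calls if c == "absent")
    informative = present + polymorphic + absent
    if informative == 0:
        return None
    return (present + 0.5 * polymorphic) / informative
```

For example, two 'present' strains, one 'polymorphic' strain, one 'absent' strain and one strain without data give (2 + 0.5) / 4 = 0.625.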
Using the option '-pooled', T-lex can also estimate the frequency of each TE in the population using pooled NGS data (Fiston-Lavier et al., 2011, in prep). This estimate is based on the number of reads supporting the absence and the presence of the TE insertions reported by T-lex. The frequency estimates are stored in the file called 'Tfreq_pooled', which contains five columns.
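The manual does not give the exact formula used with '-pooled'. One plausible reading, shown here purely as an assumption, is the share of reads supporting presence among all reads supporting either presence or absence. The function name and the formula are illustrative and are not confirmed by this manual.

```python
def pooled_te_frequency(presence_reads, absence_reads):
    """Read-count ratio used as an ASSUMED pooled frequency estimate.

    presence_reads: number of reads supporting the TE presence
    absence_reads:  number of reads supporting the TE absence
    Returns None when no read supports either call.
    """
    total = presence_reads + absence_reads
    if total == 0:
        return None
    return presence_reads / total
```

With 30 reads supporting presence and 10 supporting absence, this sketch returns 0.75.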
The 'Talign' directory is divided into two sub-directories: 'presence_detection' and 'absence_detection'. They contain the alignments from the presence and the absence TE detection, respectively, such as:
       Talign/
           presence_detection/
              [TE name]_[TE side].contig_all: multiple alignment of the contigs.
           absence_detection/
              [TE name]_absence.fasta: all the pairwise alignments of a TE.
The intermediate files can be kept (see option '-noclean'). The intermediate files from the two detection approaches are stored in the 'Tpresence' and 'Tabsence' directories. These two directories and the fasta files containing the reformatted reference sequences are stored in the T-lex output directory 'tlex_[project]'.

- TSD detection
Using the option '-tsd', T-lex also reports the putative TSDs for each TE insertion. T-lex parses the pairwise alignments of the selected reads against the T-lex 'absence' reference sequence (i.e., the result of the concatenation of the two TE flanking sides). The TSD detection module starts by assembling all the selected reads supporting the absence of the same TE insertion using the Phrap program. Phrap requires a minimum of three sequences to build a contig. If fewer than three reads are selected to support the 'absence' call, the reads are considered as contigs for the rest of the T-lex pipeline. Each contig (or read) is then re-aligned on the T-lex 'absence' reference sequence using BLAT. Note that for some alignments or TEs, no gap may be detected (1). If a gap close to the TE breakpoint and larger than two base pairs is detected, the motif in front of the gap is called the putative target site (PTS). Using the FastaGrep program, a fast tool to look for very short non-exact matches, T-lex looks for the copy of the PTS. By default, T-lex selects the matches with more than 80% of sequence identity. FastaGrep may fail to detect a match (2). If FastaGrep returns at least one match, the TSD is detected.
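The search for a copy of the PTS can be illustrated with a simple ungapped sliding-window comparison. This sketch is a stand-in written for this manual and is not the FastaGrep algorithm: the function names are hypothetical, gaps are not handled, and only the "more than 80% of sequence identity" rule is taken from the text.

```python
def identity(a, b):
    """Percent identity between two equal-length sequences (no gaps)."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def find_pts_copies(pts, sequence, min_identity=80.0):
    """Return (start, window, identity) for each window of `sequence` whose
    identity with the putative target site exceeds `min_identity` percent."""
    pts, sequence = pts.upper(), sequence.upper()
    k = len(pts)
    hits = []
    if k == 0:
        return hits
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        pid = identity(pts, window)
        if pid > min_identity:           # strictly more than the threshold
            hits.append((start, window, pid))
    return hits
```

An empty result corresponds to the case where no copy of the PTS is found. Note that for a 5 bp PTS only exact copies pass, since one mismatch gives exactly 80% identity.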
The PTS and its closest copy sequences are also returned (3). Depending on the number of contigs or reads, several TSDs can be returned for the same TE insertion. The results of the TSD detection process are stored in a file called 'TSDannot' in the 'tlex_[project]' directory. The option '-CombTSD' combines the TSD detection results with the TE presence/absence detection results from the 'Tresults' file. The returned file, called 'Tresults_TSD', has the TSD results as its last column.

Manual curation

- Mis-annotated TE detection
The TSD detection may fail for old TE insertions or when the boundaries of the TE insertions are not well defined. We may therefore expect the TSD detection to fail when the TE sequence is mis-annotated. Using the TSD detection outputs, two different alignment patterns can be expected as traces of mis-annotation. When the TE sequence is longer than the official annotation, part of the TE sequence is not annotated. This sequence should be observed in front of the gap located on the read sequence, and it can correspond to a poly-A tail region. When the TE sequence is shorter than the official annotation, part of the TE flanking sequence is mis-annotated as TE. This sequence should be identified in front of a gap located on the T-lex reference sequence. Based on that, we recommend to (i) identify the TEs flanked by poly(A) or poly(T) using the results from the 'Tresults' file, and (ii) then identify the TEs for which the TSD detection failed. For this list of putatively mis-annotated TEs, we recommend comparing the TSD results stored in the 'TSDannot' file with the multiple alignments from the 'absence' detection. Using the TE flanking sequences, you should be able to relocate the TE breakpoint on the alignments and then relocate the correct TSD when possible.

- TE detection manual curation
On top of the TE detection results, the 'Tresults' file also contains information about the read and repeat environment of each TE. This information can be used to assess the quality of the results.
For instance, a low read coverage and a high repeat density at the flanking sequences of a TE may prevent accurate TE detection. We thus recommend starting by manually curating the TEs in this configuration.

If a TE is detected as 'present' but the contig sequence overlapping the TE sequence is short (small 'left_match_length' or 'right_match_length' value), we recommend visually inspecting the alignments from the presence detection approach using an alignment viewer such as Jalview (http://www.jalview.org/download.html). If terminal mismatches are detected, the T-lex result could be erroneous.

The 'read_number' value allows the absence detection approach to be evaluated. An absence detection based on a small number of reads can be explained by a low read coverage, while the detection of an extreme number of reads can only be explained by the repetitive features of the flanking sequences of the TE (e.g. a TE inserted into another TE). Note that if the TE filtering step is 'on', only segmental duplications encompassing TEs could explain this high number of reads. Therefore, we recommend manually curating the TEs in this situation by blasting the genomic sequence corresponding to a potential duplication against the reference genomic sequences. Another way is to use available segmental duplication annotations. Alignments of the reads spanning the two TE sides can also be used to check that T-lex is not disturbed by imperfect satellites or by short poly-A tails not detected by RepeatMasker (shorter than 20 bp).

How to cite T-lex
Fiston-Lavier AS, Carrigan M, Petrov DA and Gonzalez J. (2010) T-LEX: A program for fast and accurate assessment of transposable element presence using next-generation sequencing data. Nucleic Acids Res. 2011 Mar 1;39(6):e36. Epub 2010 Dec 21.

References
  - Smit A.F.A., Hubley R., Green P. RepeatMasker, version Open 3.2.8. http://www.repeatmasker.org/.
  - Jurka J., Kapitonov V.V., Pavlicek A., Klonowski P., Kohany O., Walichiewicz J. (2005) Repbase update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 110:462-467.
  - Li H., Ruan J., Durbin R. (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18:1851-1858.
  - Rumble S.M., Lacroute P., Dalca A.V., Fiume M., Sidow A., Brudno M. (2009) SHRiMP: Accurate Mapping of Short Color-space Reads. PLoS Comput. Biol. 5(5):e1000386. doi:10.1371/journal.pcbi.1000386.
  - David M., Dzamba M., Lister D., Ilie L., Brudno M. (2011) SHRiMP2: sensitive yet practical SHort Read Mapping. Bioinformatics 27(7):1011-2. Epub 2011 Jan 28.
  - Kent W.J. (2002) BLAT - the BLAST-like alignment tool. Genome Res. 12:656-664.
  - de la Bastide M., McCombie W.R. (2007) Assembling genomic DNA sequences with PHRAP. Curr Protoc Bioinformatics. 2007 Mar;Chapter 11:Unit11.4.
Website designed by Anna-Sophie Fiston-Lavier (asfiston at univ-montp2 dot fr)