T-lex manual
T-lex is a computational pipeline that detects the presence and/or absence of annotated individual transposable elements (TEs) in sequenced genomes using NGS data. When given NGS data from multiple strains, T-lex returns a frequency estimate for each TE in the tested strains.

Because a high repeat density in the flanking regions of a TE can complicate read mapping, T-lex starts by identifying the nested TEs and the TEs located in highly repeated regions. These TEs can be removed from the list of TEs to detect. T-lex then uses two distinct approaches to detect the presence and/or absence of the TEs.

To detect the presence, T-lex attempts to find reads that overlap the junctions of each TE with its flanking regions. The TE junction sequences and the NGS data are converted into binary formats (bfa and bfq, respectively) by Maq (Li et al 2008). The reads mapped on the TE junction sequences are then used to build contigs. If at least one contig spans the TE sequence, the TE is defined as 'present' (see Figure below).

To detect the absence, T-lex attempts to find reads that span both flanking regions. The TE flanking regions are extracted and concatenated. The reads are then mapped on the concatenated flanking sequences of the TEs using SHRiMP (Rumble et al 2009). T-lex selects the reads spanning the two flanking regions of the TEs. If at least one read that does not fully correspond to repeated sequence maps correctly on both TE sides, T-lex classifies the TE as 'absent' (see Figure below).

Figure: The two TE detection approaches.

The TE detection results from the two approaches are then combined to arrive at the final conclusion. When given NGS data from multiple strains, T-lex can also estimate the frequency of each TE.

How to run T-lex

- Prerequisites
1. Unix system with Perl version 5.10.0 or higher (http://www.perl.org/get.html).
2. RepeatMasker and libraries open-3.2.8 or greater (http://www.repeatmasker.org/RMDownload.html for the RepeatMasker program and http://www.girinst.org for the libraries).
3. Maq (Mapping and Assembly with Qualities) version 0.7.0 or greater (http://sourceforge.net/projects/maq/files/maq).
4. SHRiMP (SHort Read Mapping Package) release 1.3.2 (http://compbio.cs.toronto.edu/shrimp).

- Input data
T-lex requires at least four inputs:
       TE list            - the list of the TEs to be analyzed, with one TE identifier per line (e.g. for a TE in Drosophila: 'FBti0019293').
       TE annotations     - the annotations of these TEs. Tabulated file with 4 columns:
                            TE name / location / start nucleotide position / end nucleotide position, e.g. 'FBti0019293 3R 405387 406627'.
       reference genome   - the reference genomic sequences in which the TEs have been annotated, in fasta format (e.g. '>3R XXXXX').
       NGS data           - the NGS data in standard fastq format (http://en.wikipedia.org/wiki/Fastq).

In order to handle multiple strains at the same time, T-lex requires a specific file organization for the input data. Even if you only want to run T-lex with one strain, the NGS data has to be stored in one directory in which each subdirectory holds the NGS data of a single strain, such as:
       [input strain directory]/
             [strain name 1]/
                   fastq files
             [strain name 2]/
                   fastq files
             etc...

- T-lex command
       tlex-open-v1.pl [ Options ] [ -T TE list ] [ -M TE annotations ] [ -G reference genome ] [ -R NGS data ]

- T-lex options
       -A           int     maximum read length in the data set in bp ( default = 100 )
       -O           text    project name
       -noclean             keep the intermediate files
       -h or -help          display this help

       For the TE filtering step:
       -noFilterTE  int     do not filter TEs
       -s           string  name of the species studied ( for RepeatMasker, e.g. 'drosophila' )
       -d           float   minimum repeat density at the flanking regions of the TEs ( default = 0.5 )

       For the TE presence detection:
       -q                   launch only the presence detection approach
       -j           int     length of the junction sequences to extract in bp ( default = 1000 )
       -b           int     length of the internal region of the TE in bp ( default = 20 )
       -limp        int     minimum match length required with the TE sequence in bp ( default = 5 )
       -id          int     minimum sequence identity required with the TE sequence in % ( default = 5 )
       -processes   int     number of processes ( used only for the NGS data reformatting; default = 1 )

       For the TE absence detection:
       -p                   launch only the absence detection approach
       -f           int     length of the flanking sequences to extract and concatenate in bp ( default = 100 )
       -v           int     minimum read length spanning the two TE sides in bp ( default = 15 )
       -lima        int     minimum non-repeated region on each side of the sequence in bp ( default = 5 )

To detect other TEs using the same NGS data, or the same TEs using other NGS data, the options -binreads and -binref can be used to bypass the reformatting step of the input data and save computation time (see the 'Running-time' part below). To do this, the option '-noclean' has to be added to the T-lex command line. In addition, the same project name has to be used.
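As an illustration of the input files and command line described above, the following Python sketch parses a TE annotation file, checks the per-strain directory layout, and assembles (without running) a T-lex command. It is not part of T-lex: the function names are hypothetical, the fastq files are assumed to end in '.fastq' or '.fq', the annotation columns are assumed to be separated by tabs or spaces, and the script is assumed to be invoked directly as 'tlex-open-v1.pl'. Only options listed in this manual are used.

```python
import os


def read_te_annotations(path):
    """Parse the 4-column TE annotation file into {name: (location, start, end)}."""
    annotations = {}
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            # Assumption: columns are separated by tabs or spaces.
            name, location, start, end = line.split()
            annotations[name] = (location, int(start), int(end))
    return annotations


def list_strains(ngs_dir):
    """Map each strain sub-directory of ngs_dir to its sorted fastq file names."""
    strains = {}
    for name in sorted(os.listdir(ngs_dir)):
        strain_path = os.path.join(ngs_dir, name)
        if not os.path.isdir(strain_path):
            continue  # only sub-directories are treated as strains
        # Assumption: fastq files end in .fastq or .fq.
        reads = sorted(f for f in os.listdir(strain_path)
                       if f.endswith((".fastq", ".fq")))
        if not reads:
            raise ValueError(f"strain directory {name!r} contains no fastq files")
        strains[name] = reads
    if not strains:
        raise ValueError("no strain sub-directories found")
    return strains


def build_tlex_command(te_list, te_annotations, genome, ngs_dir,
                       project=None, max_read_length=None):
    """Return the T-lex command as an argument list (options first)."""
    cmd = ["tlex-open-v1.pl"]
    if max_read_length is not None:
        cmd += ["-A", str(max_read_length)]
    if project is not None:
        cmd += ["-O", project]
    cmd += ["-T", te_list, "-M", te_annotations, "-G", genome, "-R", ngs_dir]
    return cmd
```

The returned list could be passed to 'subprocess.run' once the layout check succeeds. Building the command as a list rather than a single string avoids shell quoting problems with file names.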
To combine all the TE detection results, specific options can be used:
       -combRes             combine the presence/absence results from one strain
       -combine             combine the presence/absence results from different strains
       -freq                return the TE frequency based on the given strains
       -align               return the multiple alignments

Running-time

Using a personal UNIX machine (1 processor, 2.33 GHz, 8 GB RAM), we detected the presence and/or absence of 768 TEs in a Drosophila melanogaster strain (NGS data of 15X coverage, composed of reads of 100 bp in length) in less than 2 hours. The computational times for the main T-lex steps are:
T-lex outputs

T-lex produces several output directories and files. The output is stored in a working directory named by default 'tlex_output' or 'tlex_[project name]'. By default, only the final results (the 'Tresults' and 'Tfreq' files) and the data necessary for the manual curation (the 'Tfilter' and 'Talign' sub-directories) are returned.

- TE filtering step
The 'Tfilter' sub-directory includes two RepeatMasker output files. One contains the submitted sequences in which the repeats have been masked (*.masked); the repeated regions are represented by 'N's. The other contains the table summarizing the detected repeats (*.out). The files are organized such as:
       Tfilter/
           Tflank_checking_[length of the flanking region analyzed].fasta.masked
           Tflank_checking_[length of the flanking region analyzed].fasta.out

- TE detection results
The 'Tresults' file contains the TE presence/absence detection results. Information about the read and repeat density at the flanking regions of each TE is also returned. This file is composed of 12 columns described below:
- TE frequency estimates
Using the combined results, T-lex estimates the frequency of each TE in the population by dividing the number of strains in which the TE is present by the total number of strains for which T-lex returns data. This step is included by default. The 'Tfreq' file contains 6 columns such as:
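Independently of the 'Tfreq' column layout, the frequency definition given in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the call labels ('present', 'absent', and None for a strain without data) are hypothetical, and the labels actually written by T-lex may differ.

```python
def te_frequency(calls):
    """Estimate the frequency of one TE from per-strain calls.

    calls maps a strain name to 'present', 'absent', or None (no data).
    These labels are hypothetical placeholders for the T-lex results.
    """
    # Only strains for which data was returned enter the denominator.
    with_data = [call for call in calls.values() if call is not None]
    if not with_data:
        return None  # frequency is undefined without any data
    present = sum(1 for call in with_data if call == "present")
    return present / len(with_data)
```

For example, a TE called 'present' in two strains, 'absent' in one, and without data in a fourth gets a frequency of 2/3, because the strain without data is excluded from the denominator.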
The 'Talign' directory is divided into two sub-directories: 'presence_detection' and 'absence_detection'. They contain, respectively, the alignments from the presence and the absence TE detection, such as:
       Talign/
           presence_detection/
              [TE name]_[TE side].contig_all: multiple alignment of the contigs.
           absence_detection/
              [TE name]_absence.fasta: all the pairwise alignments of a TE.

The intermediate files can be kept (see option '-noclean'). The intermediate files from the two detection approaches are stored in the 'Tpresence' and 'Tabsence' directories. These two directories and the fasta files containing the reformatted reference sequences are stored in the T-lex output directory: 'tlex_[project]'.

- Manual curation
On top of the TE detection results, the 'Tresults' file also contains information about the read and repeat environment of each TE. This information can be used to assess the quality of the results. For instance, a low read coverage and a high repeat density at the flanking sequences of a TE may prevent accurate TE detection. We thus recommend starting by manually curating the TEs in this configuration.

If a TE is detected as 'present' but the contig sequence overlapping the TE sequence is short (small 'left_match_length' or 'right_match_length' value), we recommend visually inspecting the alignments from the presence detection approach using an alignment viewer such as Jalview (http://www.jalview.org/download.html). If terminal mismatches are detected, the T-lex result could be erroneous.

The 'read_number' value helps to evaluate the absence detection approach. An absence detection based on a small number of reads can be explained by a low read coverage, while the detection of an extreme number of reads can only be explained by the repetitive features of the flanking sequences of the TE (e.g. a TE inserted into another TE).
Note that if the TE filtering step is 'on', only segmental duplications encompassing TEs could explain this high number of reads. Therefore, we recommend manually curating the TEs in this situation by blasting the genomic sequence corresponding to a potential duplication against the reference genomic sequences. Another way is to use available segmental duplication annotations. Alignments of the reads spanning the two TE sides can also be used to check that T-lex is not disturbed by imperfect satellites or short poly-A tails not detected by RepeatMasker (shorter than 20 bp).

How to cite T-lex

Fiston-Lavier AS, Carrigan M, Petrov DA and Gonzalez J. (2010) T-LEX: A program for fast and accurate assessment of transposable element presence using next-generation sequencing data. Nucleic Acids Res. 2011 Mar 1;39(6):e36. Epub 2010 Dec 21.

References

Smit,A.F.A., Hubley,R., Green,P. RepeatMasker, version Open 3.2.8. http://www.repeatmasker.org/.
Jurka,J., Kapitonov,V.V., Pavlicek,A., Klonowski,P., Kohany,O., Walichiewicz,J. (2005) Repbase update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 110:462-467.
Li,H., Ruan,J., Durbin,R. (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18:1851-1858.
Rumble,S.M., Lacroute,P., Dalca,A.V., Fiume,M., Sidow,A., Brudno,M. (2009) SHRiMP: Accurate Mapping of Short Color-space Reads. PLoS Comput. Biol. 5(5): e1000386. doi:10.1371/journal.pcbi.1000386.
(afiston at stanford dot edu)