T-lex: A tool for fast and accurate assessment of transposable element presence in next-generation sequencing data
T-lex

T-lex manual


Stanford University





How to run T-lex
   Prerequisites
   Input data
   T-lex command
   T-lex options

Running-time

T-lex outputs
   TE filtering step
   TE detection results
   TE frequency estimates
   Manual curation

How to cite T-lex

References


T-lex is a computational pipeline to detect presence and/or absence of annotated individual transposable elements (TEs) in sequenced genomes using NGS data. When using NGS data from multiple strains, T-lex returns the frequency estimate for each TE in the tested strains.

Because a high repeat density in the flanking regions of a TE can complicate read mapping, T-lex starts by identifying nested TEs and TEs located in highly repetitive regions. These TEs can be removed from the list of TEs to detect. T-lex then uses two distinct approaches to detect the presence and/or absence of the remaining TEs.

To detect presence, T-lex looks for reads that overlap the junctions of each TE with its flanking regions. The TE junction sequences and the NGS data are first converted into binary formats (bfa and bfq, respectively) by Maq (Li et al. 2008). The reads that map to the TE junction sequences are then used to build contigs. If at least one contig spans the TE sequence, the TE is defined as 'present' (see Figure below).

To detect absence, T-lex looks for reads that span both flanking regions. The two flanking regions of each TE are extracted and concatenated, and the reads are mapped to these concatenated sequences using SHRiMP (Rumble et al. 2009). T-lex then selects the reads that span the two flanking regions. If at least one read that does not consist entirely of repeated sequence maps correctly on both sides of the TE, T-lex classifies the TE as 'absent' (see Figure below).
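The construction of the concatenated flanking sequence can be sketched as follows. This is only an illustration, not T-lex's actual code: the function names are hypothetical, the TE coordinates are assumed to be 1-based and inclusive, and the reading of the '-v' option as a minimum overhang on each side of the junction is an assumption.

```python
def concatenated_flanks(chrom_seq, te_start, te_end, flank_len=100):
    """Return (left flank + right flank, junction offset) for one TE.

    Assumption: te_start and te_end are 1-based, inclusive coordinates.
    flank_len mirrors the documented default of the -f option (100 bp).
    """
    left_begin = max(0, te_start - 1 - flank_len)
    left = chrom_seq[left_begin:te_start - 1]      # bases just before the TE
    right = chrom_seq[te_end:te_end + flank_len]   # bases just after the TE
    return left + right, len(left)


def spans_junction(read_start, read_end, junction, min_overhang=15):
    """True if a read aligned at [read_start, read_end) covers both flanks.

    Coordinates are 0-based, half-open positions on the concatenated
    sequence. min_overhang is a hypothetical per-side threshold.
    """
    return (junction - read_start >= min_overhang
            and read_end - junction >= min_overhang)
```

A read that maps across the junction with enough sequence on each side supports the absence of the TE; a read that barely touches one side does not.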



Figure: T-lex pipeline. The two TE detection approaches.


The results from the two approaches are then combined to reach a final call for each TE. When given NGS data from multiple strains, T-lex can also estimate the frequency of each TE.



How to run T-lex

- Prerequisites

1. Unix system with Perl version 5.10.0 or higher (http://www.perl.org/get.html).
2. RepeatMasker and libraries, version open-3.2.8 or greater (http://www.repeatmasker.org/RMDownload.html for the RepeatMasker program and http://www.girinst.org for the libraries).
3. Maq (Mapping and Assembly with Qualities) version 0.7.0 or greater (http://sourceforge.net/projects/maq/files/maq).
4. SHRiMP (SHort Read Mapping Package) release 1.3.2 (http://compbio.cs.toronto.edu/shrimp).

- Input data

T-lex requires at least four inputs:
TE list            - the list of TEs to analyze, with one TE identifier per line (e.g. 'FBti0019293' for a TE in Drosophila).
TE annotations     - the annotations of these TEs, as a tab-delimited file with 4 columns:
                     TE name / location / start nucleotide position / end nucleotide position, e.g. 'FBti0019293 3R 405387 406627'.
reference genome   - the reference genomic sequences on which the TEs have been annotated, in fasta format (e.g. '>3R XXXXX').
NGS data           - the NGS data in standard fastq format (http://en.wikipedia.org/wiki/Fastq).
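Before launching a run, it can help to check that the TE list and the TE annotations agree. The sketch below is a hypothetical helper, not part of T-lex. It assumes whitespace-separated annotation fields in the order given above and start <= end.

```python
from typing import NamedTuple


class TEAnnotation(NamedTuple):
    name: str
    location: str
    start: int
    end: int


def parse_annotation_line(line):
    """Parse one annotation line: TE name, location, start, end."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    name, location, start, end = fields
    start, end = int(start), int(end)
    if start > end:
        raise ValueError(f"start > end for {name}")
    return TEAnnotation(name, location, start, end)


def missing_annotations(te_ids, annotations):
    """Return the TE identifiers from the TE list that have no annotation."""
    annotated = {a.name for a in annotations}
    return [te for te in te_ids if te not in annotated]
```

For the example line above, `parse_annotation_line('FBti0019293 3R 405387 406627')` yields a record on '3R' from 405387 to 406627.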




To handle multiple strains at once, T-lex requires a specific directory layout for the input data. Even if you run T-lex on a single strain, the NGS data must be stored in one directory in which each subdirectory holds the fastq files of a single strain:

       [input strain directory]/
             [strain name 1]/
                   fastq files
             [strain name 2]/
                   fastq files
              etc...
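The layout above can be checked before a run with a short script. This is a sketch only: the function name is hypothetical, and the file extensions used to recognize fastq files ('.fastq' and '.fq') are assumptions, since the manual does not state which extensions T-lex expects.

```python
from pathlib import Path

# Assumed fastq file extensions; adjust to match your data.
FASTQ_SUFFIXES = (".fastq", ".fq")


def list_strains(input_dir):
    """Map each strain sub-directory name to its sorted fastq file names."""
    strains = {}
    for sub in sorted(Path(input_dir).iterdir()):
        if sub.is_dir():
            strains[sub.name] = sorted(
                f.name for f in sub.iterdir()
                if f.is_file() and f.name.endswith(FASTQ_SUFFIXES)
            )
    return strains
```

A strain that maps to an empty list has a sub-directory but no recognizable fastq files, which is worth fixing before starting T-lex.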


- T-lex command

tlex-open-v1.pl [ Options ] [ -T TE list ] [ -M TE annotations ] [ -G reference genome ] [ -R NGS data ]
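When T-lex is launched from a script, the command line can be assembled as a list of arguments. The sketch below uses only flags documented in this manual ('-T', '-M', '-G', '-R', '-O', '-noclean'). The function name and all file names are hypothetical, and whether the script is called directly or through 'perl' depends on your installation.

```python
import shlex


def build_tlex_command(te_list, te_annotations, genome, ngs_dir,
                       project=None, noclean=False):
    """Assemble a T-lex command line as an argument list."""
    cmd = ["tlex-open-v1.pl"]
    if project:
        cmd += ["-O", project]        # project name
    if noclean:
        cmd.append("-noclean")        # keep the intermediate files
    cmd += ["-T", te_list, "-M", te_annotations, "-G", genome, "-R", ngs_dir]
    return cmd


# Hypothetical file names, shown only to illustrate the call.
command = build_tlex_command("TE_list.txt", "TE_annotations.txt",
                             "genome.fasta", "strains", project="demo")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` as is, which avoids shell quoting problems with unusual file names.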

- T-lex options

       -A           int    maximum read length in the data set in bp ( default = 100 )
       -O          text    project name
       -noclean            keep the intermediate files
       -h or -help         display this help


       For the TE filtering step:
       -noFilterTE  int    do not filter TEs
       -s        string    name of the species studied ( for RepeatMasker, e.g. 'drosophila' )
       -d        float     minimum repeat density at the flanking regions of the TEs ( default = 0.5 )


       For the TE presence detection:
       -q                  launch only the presence detection approach
       -j           int    length of the junction sequences to extract in bp ( default = 1000 )
       -b           int    length of the internal region of the TE in bp ( default = 20 )
       -limp        int    minimum match length required with the TE sequence in bp ( default = 5)
       -id          int    minimum sequence identity required with the TE sequence in % ( default = 5)
       -processes   int    number of processes (used only for the NGS data reformatting; default = 1)


       For the TE absence detection:
       -p                  launch only the absence detection approach
       -f           int    length of the flanking sequences to extract and concatenate in bp ( default = 100 )
       -v           int    minimum read length spanning the two TE sides in bp ( default = 15 )
       -lima        int    minimum non-repeated region on each side of the sequence in bp ( default = 5 )


To detect other TEs using the same NGS data, or the same TEs using other NGS data, the options -binreads and -binref can be used to skip the reformatting of the input data and save computation time (see the 'Running-time' section below). For this to work, the option '-noclean' must be added to the T-lex command line and the same project name must be used. To combine the TE detection results, the following options can be used:

       -combRes           combine the presence/absence results from one strain
       -combine           combine the presence/absence results from different strains
       -freq              return the TE frequency based on the given strains
       -align             return the multiple alignments





Running-time

Using a personal UNIX machine (1 processor, 2.33 GHz, 8 GB RAM), we detected the presence and/or absence of 768 TEs in a Drosophila melanogaster strain (NGS data at 15X coverage, composed of 100 bp reads) in less than 2 hours. The computational times for the main T-lex steps are:

       T-lex step                                           Mean running time
       TE filtering                                         < 1 min
       Presence module: input formatting (fastq to bfq)     15 min
       Presence module: mapping + selection                 18 min
       Absence module: input formatting (fastq to fasta)    3 min
       Absence module: mapping + selection                  50 min
       Output preparation                                   2 min




T-lex outputs

T-lex produces several output directories and files. The output is stored in a working directory named 'tlex_output' by default, or 'tlex_[project name]' if a project name is given. By default, only the final results (the 'Tresults' and 'Tfreq' files) and the data needed for manual curation (the 'Tfilter' and 'Talign' sub-directories) are returned.

- TE filtering step

The 'Tfilter' sub-directory includes two RepeatMasker output files. One contains the submitted sequences in which the repeats have been masked (*.masked); the repeated regions are represented by 'N's. The other contains the table summarizing the detected repeats (*.out). The files are organized as follows:

       Tfilter/
          Tflank_checking_[length of the flanking region analyzed].fasta.masked
          Tflank_checking_[length of the flanking region analyzed].fasta.out
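Since repeats are masked with 'N's, the repeat density of each flanking sequence can be approximated from the '*.masked' file as the fraction of 'N' characters. The sketch below is a hypothetical helper, not T-lex's own computation. It assumes that every 'N' comes from masking (unsequenced 'N's in the reference would inflate the value) and that the 0.5 threshold corresponds to the default of the '-d' option.

```python
def masked_fraction_by_record(fasta_text):
    """Return {record name: fraction of 'N' bases} for masked fasta text."""
    fractions = {}
    name, n_count, length = None, 0, 0
    # The trailing ">" is a sentinel that flushes the last record.
    for line in fasta_text.splitlines() + [">"]:
        if line.startswith(">"):
            if name is not None and length > 0:
                fractions[name] = n_count / length
            header = line[1:].split()
            name = header[0] if header else None
            n_count, length = 0, 0
        else:
            seq = line.strip().upper()
            n_count += seq.count("N")
            length += len(seq)
    return fractions


def highly_repetitive(fractions, threshold=0.5):
    """Names of records whose masked fraction reaches the threshold."""
    return sorted(n for n, frac in fractions.items() if frac >= threshold)
```

The record names depend on how the flanking sequences are labeled in the fasta headers, which this sketch does not assume anything about.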



- TE detection results

The 'Tresults' file contains the TE presence/absence detection results. Information about the read coverage and repeat content at the flanking regions of each TE is also returned. This file is composed of the 14 columns described below:

       Column  Field                Description
       1       strain               name of the strain
       2       TE                   name of the TE
       3       presence detection   TE detection result from the presence detection approach
       4       absence detection    TE detection result from the absence detection approach
       5       combination          final result combining the results from the two approaches
       6       read_number          number of reads spanning the two TE flanking sequences
       7       left_match_length    length of the terminal match overlapping the TE sequence of the left contig. This value can be negative if the full TE sequence is missing
       8       left_match_id        sequence identity of the terminal match overlapping the TE sequence of the left contig. The number of mismatches/matches is reported in parentheses
       9       left_coverage        mean read coverage at the left flanking side of the TE
       10      left_repeat          name of the repeat located at the left side of the TE. If no repeat is detected = 'no repeat'
       11      right_match_length   length of the terminal match overlapping the TE sequence of the right contig. This value can be negative if the full TE sequence is missing
       12      right_match_id       sequence identity of the terminal match overlapping the TE sequence of the right contig. The number of mismatches/matches is reported in parentheses
       13      right_coverage       mean read coverage at the right flanking side of the TE
       14      right_repeat         name of the repeat located at the right side of the TE. If no repeat is detected = 'no repeat'
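For downstream analysis, the 'Tresults' file can be loaded into one record per line. The sketch below is hypothetical: it assumes the file is tab-delimited, that the fields appear in the order listed above, and that a header line, if present, starts with 'strain'. None of these points is stated in this manual, so check them against your own output.

```python
import csv

# Field names in the order listed in this manual.
TRESULTS_FIELDS = [
    "strain", "TE", "presence detection", "absence detection", "combination",
    "read_number", "left_match_length", "left_match_id", "left_coverage",
    "left_repeat", "right_match_length", "right_match_id", "right_coverage",
    "right_repeat",
]


def read_tresults(handle):
    """Read a tab-delimited Tresults file into a list of dicts."""
    rows = []
    for record in csv.reader(handle, delimiter="\t"):
        if not record or record[0] == "strain":
            continue  # skip empty lines and a possible header line
        if len(record) != len(TRESULTS_FIELDS):
            raise ValueError(
                f"expected {len(TRESULTS_FIELDS)} fields, got {len(record)}")
        rows.append(dict(zip(TRESULTS_FIELDS, record)))
    return rows
```

All values are kept as strings, because fields such as 'left_match_id' mix a number with a parenthesized count.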



- TE frequency estimates

Using the combined results, T-lex estimates the frequency of each TE in the population from the number of strains in which the TE is present or polymorphic, relative to the total number of strains for which T-lex returns a clear result (see the 'TEfrequency' field below). This step is included by default. The 'Tfreq' file contains 6 columns:

       Column  Field                Description
       1       TE                   name of the TE
       2       presence_results     number of strains for which T-lex classified the TE as 'present'
       3       polymorphic_results  number of strains for which T-lex classified the TE as 'polymorphic'
       4       absence_results      number of strains for which T-lex classified the TE as 'absent'
       5       total_results        number of strains for which T-lex returns a clear result
       6       TEfrequency          TE frequency, i.e. the number of strains for which the TE is 'present' plus one-half times the number of 'polymorphic' strains, divided by the total number of strains for which T-lex returns a clear result
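The 'TEfrequency' definition can be written as a short function. This is a sketch of the formula as described above, not T-lex's own code. It assumes that the total number of clear results equals the sum of the 'present', 'polymorphic' and 'absent' counts.

```python
def te_frequency(present, polymorphic, absent):
    """TE frequency: (present + 0.5 * polymorphic) / total clear results.

    Assumption: total_results = present + polymorphic + absent.
    Returns None when no strain gives a clear result.
    """
    total = present + polymorphic + absent
    if total == 0:
        return None
    return (present + 0.5 * polymorphic) / total
```

For example, a TE that is 'present' in 3 strains, 'polymorphic' in 2 and 'absent' in 5 has a frequency of (3 + 1) / 10 = 0.4.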


The 'Talign' directory is divided into two sub-directories, 'presence_detection' and 'absence_detection', which contain the alignments from the presence and the absence detection approaches, respectively:

       Talign/
          presence_detection/
             [TE name]_[TE side].contig_all: multiple alignment of the contigs.
          absence_detection/
             [TE name]_absence.fasta: all the pairwise alignments of a TE.



The intermediate files can be kept (see option '-noclean'). The intermediate files from the two detection approaches are stored in the 'Tpresence' and 'Tabsence' directories. These two directories and the fasta files containing the reformatted reference sequences are stored in the T-lex output directory: 'tlex_[project]'.

- Manual curation

In addition to the TE detection results, the 'Tresults' file contains information about the read coverage and repeat content around each TE. This information can be used to assess the quality of the results. For instance, low read coverage and high repeat density in the flanking sequences of a TE may prevent accurate TE detection. We therefore recommend starting manual curation with the TEs in this configuration.
If a TE is detected as 'present' but the contig sequence overlapping the TE sequence is short (small 'left_match_length' or 'right_match_length' value), we recommend visually inspecting the alignments from the presence detection approach using an alignment viewer such as Jalview (http://www.jalview.org/download.html). If terminal mismatches are detected, the T-lex result could be erroneous.
The 'read_number' value can be used to evaluate the absence detection approach. An absence call based on a small number of reads can be explained by low read coverage, while an extremely high number of reads can only be explained by repetitive features of the flanking sequences of the TE (e.g. a TE inserted into another TE). Note that if the TE filtering step is 'on', only segmental duplications encompassing TEs could explain such a high number of reads. We therefore recommend manually curating the TEs in this situation by blasting the genomic sequence corresponding to a potential duplication against the reference genomic sequences. Another option is to use available segmental duplication annotations. Alignments of the reads spanning the two TE sides can also be used to check that T-lex is not misled by imperfect satellites or short poly-A tails not detected by RepeatMasker (shorter than 20 bp).
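The curation criteria above can be turned into a simple pre-screen that lists which TEs to inspect first. The sketch below is hypothetical: the function name, the flag labels and all three thresholds are illustrative choices, not values recommended by T-lex. It also assumes that the coverage, match length and read number fields hold plain numbers.

```python
def curation_flags(row, min_coverage=5.0, min_match_length=10,
                   max_read_number=100):
    """Return the reasons why one Tresults record deserves manual curation.

    row maps Tresults field names to string values. All thresholds are
    hypothetical and should be tuned to the coverage of your data.
    """
    flags = []
    coverage = min(float(row["left_coverage"]), float(row["right_coverage"]))
    if coverage < min_coverage:
        flags.append("low_coverage")
    if row["left_repeat"] != "no repeat" or row["right_repeat"] != "no repeat":
        flags.append("repeat_in_flank")
    if row["presence detection"] == "present":
        shortest = min(int(row["left_match_length"]),
                       int(row["right_match_length"]))
        if shortest < min_match_length:
            flags.append("short_terminal_match")
    if int(row["read_number"]) > max_read_number:
        flags.append("high_read_number")
    return flags
```

A TE with an empty list of flags is not guaranteed to be correct, but TEs with several flags are natural starting points for visual inspection.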



How to cite T-lex

Fiston-Lavier AS, Carrigan M, Petrov DA and Gonzalez J. (2010) T-LEX: A program for fast and accurate assessment of transposable element presence using next-generation sequencing data. Nucleic Acids Res. 2011 Mar 1;39(6):e36. Epub 2010 Dec 21.



References

  Smit,A.F.A., Hubley,R., Green,P. RepeatMasker, version Open-3.2.8. http://www.repeatmasker.org/.
  Jurka,J., Kapitonov,V.V., Pavlicek,A., Klonowski,P., Kohany,O., Walichiewicz,J. (2005) Repbase Update, a database of eukaryotic repetitive elements. Cytogenet. Genome Res. 110:462-467.
  Li,H., Ruan,J., Durbin,R. (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18:1851-1858.
  Rumble,S.M., Lacroute,P., Dalca,A.V., Fiume,M., Sidow,A., Brudno,M. (2009) SHRiMP: Accurate Mapping of Short Color-space Reads. PLoS Comput. Biol. 5(5): e1000386. doi:10.1371/journal.pcbi.1000386.


T-lex
Website designed by Anna-Sophie Fiston-Lavier
(afiston at stanford dot edu)