T-lex: A tool for fast and accurate assessment of transposable element presence in next-generation sequencing data

T-lex2 manual


Stanford University





How to run T-lex
   Prerequisites
   Input data
   T-lex command
   T-lex options

Running-time

T-lex outputs
   TE flanking sequence analysis
   TE detection results
   TE frequency estimates
   TSD detection

Manual curation
   Mis-annotated TE detection
   TE detection manual curation

How to cite T-lex

References


T-lex is a computational pipeline that detects presence and/or absence of annotated individual transposable elements (TEs) using next-generation sequencing (NGS) data. T-lex combines two distinct and complementary TE detection approaches.

(A) The 'presence' detection approach looks for reads overlapping the flanking regions of the TE insertion. T-lex starts by extracting the TE junction sequences; the junction sequences and the NGS data are then converted into binary formats (bfa and bfq, respectively). Using the MAQ program (Li et al., 2008), the single reads are mapped onto the TE junction sequences. Reads are selected based on their Phred quality score (>30 by default). The selected reads are used to build contigs (one for each side of the TE). If at least one contig spans the TE sequence with a minimum match length (15 bp by default) and a minimum sequence identity (95% by default), the TE is called 'present'. The number of selected reads is also returned (see Figure below).
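
The decision rule at the end of the presence approach can be illustrated with a short Python sketch. This is not T-lex code: the `ContigMatch` type and the `is_present` function are hypothetical names, and the match-length threshold is passed explicitly rather than hard-coded.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ContigMatch:
    """Terminal match of one contig against the TE sequence (hypothetical type)."""
    side: str           # "left" or "right" TE junction
    match_length: int   # length in bp of the match overlapping the TE
    identity: float     # sequence identity of that match, in percent


def is_present(matches: Iterable[ContigMatch],
               min_match_length: int,
               min_identity: float = 95.0) -> bool:
    """Return True if at least one contig passes both thresholds."""
    return any(
        m.match_length >= min_match_length and m.identity >= min_identity
        for m in matches
    )
```

A contig on either side is enough for the call, which is why the check uses `any` rather than requiring both junctions to pass.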

(B) The 'absence' detection approach looks for reads matching on both sides of the TE insertion, i.e., the ancestral genomic sequence prior to the TE insertion. T-lex starts by extracting the two TE flanking sequences (100 bp by default) and concatenating them. The NGS data are converted into fasta. Using the SHRIMP2 program (David et al., 2011), the reads are then mapped as pairs (PE) and/or as single reads onto the concatenated flanking sequence. Differences between the reference and the reads indicate either polymorphisms or sequencing errors. While most mapping tools succeed at mapping reads from organisms with low polymorphism rates, they do not perform well on reads from highly polymorphic organisms. SHRIMP2 maps short reads to a genome even in the presence of a large amount of polymorphism, and it handles gaps. After the mapping, the reads from proper pairs (i.e., two reads on opposite strands) that both overlap the TE breakpoint are assembled. T-lex then selects the reads (and merged reads) spanning the two flanking regions of the TE (15 bp minimum match length by default). If at least one read that does not fully correspond to repeated sequence maps correctly on both TE sides (5 bp minimum non-repeated length by default), T-lex classifies the TE as 'absent' (see Figure below).
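
As an illustration of how the 'absence' reference is built, the sketch below concatenates the two flanks of a TE and checks whether an aligned read spans the breakpoint. It is a minimal sketch, not T-lex code: the function names are hypothetical, TE coordinates are assumed to be 1-based and inclusive, and "spanning" is assumed to mean at least `min_span` aligned bases on each side of the breakpoint.

```python
from typing import Tuple


def absence_reference(chrom_seq: str, te_start: int, te_end: int,
                      flank: int) -> Tuple[str, int]:
    """Concatenate the left and right flanks of a TE.

    te_start and te_end are assumed 1-based, inclusive.
    Returns the concatenated sequence and the breakpoint position in it
    (0-based index of the first base of the right flank).
    """
    left_begin = max(0, te_start - 1 - flank)
    left = chrom_seq[left_begin:te_start - 1]
    right = chrom_seq[te_end:te_end + flank]
    return left + right, len(left)


def spans_breakpoint(aln_start: int, aln_end: int,
                     breakpoint: int, min_span: int) -> bool:
    """True if the alignment [aln_start, aln_end) covers at least
    min_span bases on each side of the breakpoint."""
    return (breakpoint - aln_start >= min_span
            and aln_end - breakpoint >= min_span)
```

Flanks are truncated at the sequence ends, so a TE close to a chromosome end yields a shorter reference rather than an error.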



T-lex pipeline

The two TE detection approaches.


The TE detection results from the two detection approaches may be combined to get a more accurate TE call. Using data from multiple strains from the same population, T-lex ascertains TE presence/absence for each strain and also estimates the TE frequency in the population by adding the number of strains for which the TE is 'present' and one-half times the number of 'polymorphic' strains, and dividing by the total number of strains for which T-lex returns data.

Both TE detection approaches are based on the mapping of reads onto the junctions of the TE insertions. Consequently, a high repeat density in the TE flanking regions or a TE mis-annotation may bias the T-lex calls. To overcome this, T-lex identifies the TE insertions that are nested or located in high-repeat density regions (see 'TE flanking sequence analysis' part), and detects putatively mis-annotated TE insertions (see 'Mis-annotated TE detection' part). The detection of mis-annotation relies on the detection of the 'true' TE breakpoint. T-lex uses the outputs of the 'absence' detection to recover the ancestral genomic sequence prior to the TE insertion and detects the Target Site Duplication (TSD), a trace of the TE insertion mechanism. If the TSD detection fails, the TE can be suspected to be mis-annotated.



How to run T-lex


- Prerequisites

1. Unix system with Perl version 5.10.0 or higher (http://www.perl.org/get.html).
2. RepeatMasker and libraries open-3.2.8 or greater (Smit 1996-2010; http://www.repeatmasker.org/RMDownload.html for the RepeatMasker program and http://www.girinst.org for the libraries).
3. Maq - Mapping and Assembly with Qualities - version 0.7.1 (Li et al., 2008; http://sourceforge.net/projects/maq/files/maq).
4. SHRIMP2 - SHort Read Mapping Package - Release 2.2.1/Oct. 31, 2011 (David et al., 2011; http://compbio.cs.toronto.edu/shrimp).
5. BLAT version 34 or greater (Kent 2002; http://www.soe.ucsc.edu/~kent).

Only for the Target Site Duplication (TSD) detection:
6. Phrap version 1.090518 ("Phil's Revised Assembly Program"; Green, 1999; http://www.phrap.org).
7. FastaGrep, developed by the Department of Bioinformatics at the University of Tartu & Estonian Biocentre (http://bioinfo.ebc.ee).


- Input data

T-lex requires at least four inputs:
TE list              - the list of the TEs to be analyzed, with one TE identifier per line (e.g. 'FBti0019293' for a TE in Drosophila).
TE annotations       - the annotations of these TEs, as a tab-delimited file with five columns:
                       TE name / location / start nucleotide position / end nucleotide position / strand (+ or -),
                       e.g. 'FBti0019293 3R 405387 406627 -'.
reference genome     - the reference genomic sequences on which the TEs have been annotated, in fasta format (e.g. '>3R XXXXX').
NGS data             - the NGS data in standard fastq format (http://en.wikipedia.org/wiki/Fastq).
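
Before launching a run, it can be useful to validate the TE annotation file against the five-column layout described above. The following Python sketch is only an illustration: `TEAnnotation` and `parse_annotation_line` are hypothetical names, and the parser accepts any whitespace as separator so that both tab-delimited lines and the space-separated example parse.

```python
from typing import NamedTuple


class TEAnnotation(NamedTuple):
    name: str       # TE identifier, e.g. 'FBti0019293'
    location: str   # sequence name in the reference genome, e.g. '3R'
    start: int      # start nucleotide position
    end: int        # end nucleotide position
    strand: str     # '+' or '-'


def parse_annotation_line(line: str) -> TEAnnotation:
    """Parse one annotation line and check its five columns."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
    name, location, start, end, strand = fields
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    start_pos, end_pos = int(start), int(end)
    if start_pos > end_pos:
        raise ValueError(f"start {start_pos} is greater than end {end_pos}")
    return TEAnnotation(name, location, start_pos, end_pos, strand)
```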


In order to handle multiple strains at the same time, T-lex requires a specific file organization for the input data. Even if you want to run T-lex on only one strain, the NGS data must be stored in one directory in which each subdirectory holds the NGS data of a single strain:

       [input strain directory]/
             [strain name 1]/
                   [strain name 1]_read.fastq
             [strain name 2]/
                   [strain name 2]_read.fastq


To handle paired-end reads, separate the reads of each pair into two fastq files ([strain name]_read1.fastq and [strain name]_read2.fastq). Each fastq file must be renamed to follow the pattern 'XXX_readX.fastq':

       [input strain directory]/
               [strain name]/
                   [strain name]_read1.fastq
                   [strain name]_read2.fastq
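
The layout above can be checked before running T-lex. The sketch below walks an input strain directory and reports, for each strain, whether it holds single or paired-end fastq files named as described. It is an illustrative helper, not part of T-lex; the function name `list_strain_reads` is hypothetical.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def list_strain_reads(input_dir) -> Dict[str, Tuple[str, List[str]]]:
    """Map each strain sub-directory to ('single' | 'paired', [fastq names])."""
    layout = {}
    strain_dirs = sorted(p for p in Path(input_dir).iterdir() if p.is_dir())
    for strain_dir in strain_dirs:
        name = strain_dir.name
        single = strain_dir / f"{name}_read.fastq"
        pair = [strain_dir / f"{name}_read1.fastq",
                strain_dir / f"{name}_read2.fastq"]
        if all(p.is_file() for p in pair):
            layout[name] = ("paired", [p.name for p in pair])
        elif single.is_file():
            layout[name] = ("single", [single.name])
        else:
            raise FileNotFoundError(
                f"no correctly named fastq file found in {strain_dir}")
    return layout
```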


- T-lex command

tlex-open-v2.pl [ Options ] [ -T TE list ] [ -M TE annotations ] [ -G reference genome ] [ -R NGS data ]
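
As a hedged illustration, the command line can be assembled programmatically, for instance from a Python wrapper script. Only flags documented in this manual are used (-O, -pairends, -noclean, -T, -M, -G, -R); the function name and the file names in the example are hypothetical.

```python
import shlex
from typing import List, Optional


def tlex_command(te_list: str, annotations: str, genome: str, reads_dir: str,
                 project: Optional[str] = None,
                 paired_end: bool = False,
                 keep_intermediate: bool = False) -> List[str]:
    """Build the argument list for a T-lex run (options first, then inputs)."""
    argv = ["tlex-open-v2.pl"]
    if project:
        argv += ["-O", project]
    if paired_end:
        argv += ["-pairends", "yes"]
    if keep_intermediate:
        argv.append("-noclean")
    argv += ["-T", te_list, "-M", annotations, "-G", genome, "-R", reads_dir]
    return argv


# Example with made-up file names; prints the command without running it.
print(shlex.join(tlex_command("TE_list.txt", "TE_annotations.txt",
                              "genome.fasta", "strains", project="demo")))
```

The resulting list can be passed to `subprocess.run` once the prerequisites are installed and the script is on the PATH.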

- T-lex options

       -A           int    maximum read length in the data set in bp ( default = 100 )
       -O          text    project name
       -noclean            keep the intermediate files
       -h or -help         display this help


       For the TE filtering step:
       -noFilterTE  int    do not filter TEs
       -s        string    name of the species studied ( for RepeatMasker, e.g. 'drosophila' )
       -d        float     minimum repeat density at the flanking regions of the TEs ( default: 0.5 (50%) )


       For the TE presence detection:
       -q                  launch only the presence detection approach
       -j           int    length of the junction sequences to extract in bp ( default: 1000 )
       -b           int    length of the internal region of the TE in bp ( default: 60 )
       -limp        int    minimum match length required with the TE sequence in bp ( default: 5 )
       -id          int    minimum sequence identity required with the TE sequence in % ( default: 95 )
       -minQ        int    minimum Phred quality score required to select a read ( default: 30 )
       -processes   int    number of processes (used only for the NGS data reformatting; default: 1 )


       For the TE absence detection:
       -p                  launch only the absence detection approach
       -f           int    length of the flanking sequences to extract and concatenate in bp ( default: 125 )
       -v           int    minimum read length spanning the two TE sides in bp ( default: 15 )
       -pairends string    PE mapping at 'no' by default ('yes' or 'no')
       -lima        int    minimum non-repeated region on each side of the sequence in bp ( default: 5 )


To detect other TEs using the same NGS data or the same TEs using other NGS data, the options -binreads and -binref can be used to bypass the reformatting step of the input data and save computation time (see 'Running-time' part below). To do this, the option '-noclean' has to be added to the T-lex command line. In addition, the same project name has to be used. To combine all the TE detection results, specific options can be used:

       -combRes      combine the presence/absence results from one strain
       -combine      combine the presence/absence results from different strains
       -freq         return the TE frequency based on the given strains
       -pooled       return the TE frequency based on pooled data (To use with the option -freq)
       -tsd          return the TSDs for the TE insertions detected as absent (use with -align and -p)
       -align        return the multiple alignments
       -combAll      combine the frequency estimates with the TE breakpoint analysis





Running-time

Below is the average running time of the main T-lex2 steps on a personal UNIX machine (1 processor, 2.33 GHz, 8 GB RAM) when detecting the presence and/or absence of 100 TEs in a single Drosophila melanogaster strain (15X coverage, 100 bp reads). Note that the number of annotated TEs given as input changes the running time of the mapping steps only slightly (e.g., 10 min more for 800 TEs).

       T-lex steps                                          Mean running time (min)
       Analysis of the TE flanking sequences                < 1
       Presence module: input formatting (fastq to bfq)     15
       Presence module: mapping + selection                 20
       Absence module: input formatting (fastq to fasta)    5
       Absence module: mapping + selection                  60
       Combination of the TE detection results              < 1
       Frequency estimates                                  < 1
       Multiple alignments                                  15
       TSD detection                                        15




T-lex outputs

T-lex produces several output directories and files. The output is stored in a working directory named by default 'tlex_output', or 'tlex_[project name]' when a project name is given. By default, only the final results (the 'Tresults' file) and the data necessary for the manual curation (the 'Tanalysis' and 'Talign' sub-directories) are returned.

- TE flanking sequence analysis

The 'Tanalysis' sub-directory includes two RepeatMasker output files. One contains the submitted sequences in which the repeats have been masked (*.masked); the repeated regions are represented by 'N's. The other contains the table summarizing the detected repeats (*.out). Tanalysis also includes the detection of longer poly-A tails, obtained by looking for stretches of 'A' or 'T' at the 3' end of the TE flanking sequence. The files are organized as follows:

       Tanalysis/
          Tflank_checking_[length of the flanking region analyzed].fasta.masked
          Tflank_checking_[length of the flanking region analyzed].fasta.out
          Tpoly_[length of the flanking region analyzed].fasta.polyAT
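
To make the two checks concrete, the Python sketch below computes the fraction of masked bases in a flanking sequence and the longest run of 'A' or 'T'. It is an approximation under stated assumptions, not the T-lex implementation: the repeat density is assumed here to be the fraction of 'N' characters in the masked sequence (so 'N's already present in the reference would also count), and the function names are hypothetical.

```python
import re


def masked_fraction(masked_seq: str) -> float:
    """Fraction of bases replaced by 'N' in a RepeatMasker-masked sequence."""
    if not masked_seq:
        return 0.0
    return masked_seq.upper().count("N") / len(masked_seq)


def longest_poly_at(seq: str) -> int:
    """Length of the longest uninterrupted run of 'A' or of 'T'."""
    runs = re.findall(r"A+|T+", seq.upper())
    return max((len(run) for run in runs), default=0)
```

A flank whose masked fraction exceeds the value given with '-d' would be a candidate for filtering, and a long run reported by `longest_poly_at` points to a possible poly-A tail.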



- TE detection results

The 'Tresults' file contains the TE presence/absence detection results. Information about the read coverage and repeat density at the flanking regions of each TE is also returned. This file is composed of the 18 columns described below:

Column Field Description
1 strain name of the strain
2 TE name of the TE
3 TE absence detection TE detection result from the absence detection approach
4 TE presence detection TE detection result from the presence detection approach
5 TE detection conclusion final result combining the results from the two approaches
6 absence read number number of reads spanning the two TE flanking sequences
7 left match length length of the terminal match overlapping the TE sequence of the left contig. This value can be negative if the full TE sequence is missing
8 left match id sequence identity of the terminal match overlapping the TE sequence of the left contig. The number of mismatches/matches is reported in parenthesis
9 poly(A) detection on left TE side poly(A) detection result after analysis of the left TE side. If no poly(A) or poly(T) detected = 'no_polyAT'
10 left_coverage mean read coverage at the left flanking side of the TE
11 left_repeat name of the repeat located at the left side of the TE. If no repeat is detected = 'no repeat'
12 presence read number left number of reads spanning the TE left junction flanking sequences
13 right match length length of the terminal match overlapping the TE sequence of the right contig. This value can be negative if the full TE sequence is missing
14 right match id sequence identity of the terminal match overlapping the TE sequence of the right contig. The number of mismatches/matches is reported in parenthesis
15 poly(A) detection on right TE side poly(A) detection result after analysis of the right TE side. If no poly(A) or poly(T) detected = 'no_polyAT'
16 right_coverage mean read coverage at the right flanking side of the TE
17 right_repeat name of the repeat located at the right side of the TE. If no repeat is detected = 'no repeat'
18 presence read number right number of reads spanning the TE right junction flanking sequences
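
For downstream filtering, a 'Tresults' line can be turned into a named record. The sketch below assumes that the file is tab-delimited with exactly the 18 columns listed above and in that order; this is an assumption, as are the field identifiers, which are hypothetical short names derived from the table.

```python
from typing import Dict

# Hypothetical identifiers for the 18 columns, in the order of the table above.
TRESULTS_FIELDS = [
    "strain", "TE", "absence_detection", "presence_detection",
    "conclusion", "absence_read_number",
    "left_match_length", "left_match_id", "left_polyAT",
    "left_coverage", "left_repeat", "presence_read_number_left",
    "right_match_length", "right_match_id", "right_polyAT",
    "right_coverage", "right_repeat", "presence_read_number_right",
]


def parse_tresults_line(line: str) -> Dict[str, str]:
    """Split one tab-delimited line into a field-name -> value mapping."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(TRESULTS_FIELDS):
        raise ValueError(
            f"expected {len(TRESULTS_FIELDS)} columns, got {len(values)}")
    return dict(zip(TRESULTS_FIELDS, values))
```

Values are kept as strings because some fields mix numbers with labels such as 'no_polyAT' or 'no repeat'.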



- TE frequency estimates

Using the combination of the results, T-lex estimates the frequency of each TE in the population by adding the number of strains for which the TE is 'present' and one-half times the number of 'polymorphic' strains, and dividing by the total number of strains for which T-lex returns data. This step is included by default. The 'Tfreq' file contains six columns:

Column Field Description
1 TE TE name
2 presence results number of strains for which T-lex classified the TE as 'present'
3 polymorphic results number of strains for which T-lex classified the TE as 'polymorphic'
4 absence results number of strains for which T-lex classified the TE as 'absent'
5 total results number of strains for which T-lex returns a clear result
6 TEfrequency TE frequency estimate (i.e. adding the number of strains for which the TE is 'present' and one-half times the number of 'polymorphic' strains, and dividing by the total number of strains for which T-lex returns clear result)
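
The frequency estimate of column 6 can be reproduced from columns 2, 3 and 5. The function below is a minimal Python sketch of that formula; the function name is hypothetical, and returning `None` when no strain gives a clear result is a choice made here, not documented T-lex behavior.

```python
from typing import Optional


def te_frequency(present: int, polymorphic: int, total: int) -> Optional[float]:
    """(present + 0.5 * polymorphic) / total, or None if total is 0.

    total is the number of strains with a clear result (column 5).
    """
    if total == 0:
        return None
    return (present + 0.5 * polymorphic) / total
```

For example, a TE called 'present' in 6 strains and 'polymorphic' in 2 strains out of 10 strains with a clear result gets an estimate of (6 + 1) / 10 = 0.7.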


Using the option '-pooled', T-lex can also estimate the frequency of each TE in the population using pooled NGS data (Fiston-Lavier et al., 2011, in prep). This estimate is based on the number of reads supporting the absence and the presence of the TE insertions reported by T-lex. The frequency estimates are stored in the file called 'Tfreq_pooled'. This file contains five columns:

Column Field Description
1 TE Name of the TE
2 absence read number Number of reads providing evidence of absence
3 presence read number/left Number of reads providing evidence of presence on the left TE side
4 presence read number/right Number of reads providing evidence of presence on the right TE side
5 TE frequency TE frequency estimate (i.e. sum of the frequency estimates using each TE detection approach; see Fiston-Lavier et al. 2012, in prep)


The 'Talign' directory is divided into two sub-directories: the 'presence_detection' and the 'absence_detection'. They contain respectively the alignments from the presence and the absence TE detection such as:

       Talign/
          presence_detection/
             [TE name]_[TE side].contig_all: multiple alignment of the contigs.
          absence_detection/
             [TE name]_absence.fasta: all the pairwise alignments of a TE.



The intermediate files can be kept (see option '-noclean'). The intermediate files from the two detection approaches are stored in the 'Tpresence' and 'Tabsence' directories. These two directories and the fasta files containing the reformatted reference sequences are stored in the T-lex output directory: 'tlex_[project]'.

- TSD detection

Using the option '-tsd', T-lex also reports the putative TSDs for each TE insertion. T-lex parses the pairwise alignments of the selected reads against the T-lex 'absence' reference sequence (i.e., the result of the concatenation of the two TE flanking sides). The TSD detection module starts by assembling all the selected reads supporting the absence of the same TE insertion using the Phrap program. Phrap requires a minimum of three sequences to build a contig. If fewer than three reads are selected to support the 'absence' call, the reads themselves are treated as contigs for the rest of the T-lex pipeline. Each contig (or read) is then re-aligned on the T-lex 'absence' reference sequence using BLAT. Note that for some alignments or TEs, no gap may be detected (1).

If a gap close to the TE breakpoint and larger than two base pairs is detected, the motif in front of the gap is called the putative target site (PTS). Using the FastaGrep program, a fast tool to look for very short non-exact matches, T-lex looks for the copy of the PTS. By default, T-lex selects the matches with more than 80% sequence identity. FastaGrep may fail to detect a match (2). If FastaGrep returns at least one match, the TSD is detected, and the PTS and its closest copy sequences are returned (3). Depending on the number of contigs or reads, several TSDs can be returned for the same TE insertion.

The results of the TSD detection process are stored in a file called 'TSDannot' in the 'tlex_[project]' directory. The option '-CombTSD' combines the TSD detection results with the TE presence/absence detection results from the 'Tresults' file. The returned file, called 'Tresults_TSD', has the TSD results as its last column.
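
The search for a copy of the PTS can be approximated by a simple ungapped sliding-window comparison. The Python sketch below is a stand-in for illustration only; it does not reproduce the FastaGrep algorithm. The function name is hypothetical, and the identity threshold is applied strictly (more than 80%), following the wording above.

```python
from typing import List, Tuple


def find_pts_copies(pts: str, sequence: str,
                    min_identity: float = 0.8) -> List[Tuple[int, str, float]]:
    """Return (position, window, identity) for every window of len(pts)
    whose ungapped identity to the PTS is strictly above min_identity."""
    pts = pts.upper()
    sequence = sequence.upper()
    k = len(pts)
    hits = []
    if k == 0:
        return hits
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        matches = sum(a == b for a, b in zip(pts, window))
        identity = matches / k
        if identity > min_identity:
            hits.append((i, window, identity))
    return hits
```

When the searched sequence contains the PTS itself, the exact match is reported too, so a caller looking for the duplicated copy would discard the hit at the PTS position and keep the closest remaining one.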



Manual curation

- Mis-annotated TE detection

The TSD detection may fail for old TE insertions or when the boundaries of the TE insertions are not well defined. We may therefore expect the TSD detection to fail when the TE sequence is mis-annotated. Using the TSD detection outputs, two different alignment patterns can be expected as traces of mis-annotation. When the TE sequence is longer than the official annotation, part of the TE sequence is not annotated. This sequence should be observed in front of the gap located on the read sequence, and it can correspond to a poly-A tail region. When the TE sequence is shorter than the official annotation, part of the TE flanking sequence is wrongly annotated as TE. This sequence should be identified in front of a gap located on the T-lex reference sequence. Based on that, we recommend (i) identifying the TEs flanked by poly(A) or poly(T) using the results from the 'Tresults' file, and (ii) then identifying the TEs for which the TSD detection failed. For this list of putatively mis-annotated TEs, we recommend comparing the TSD results stored in the 'TSDannot' file with the multiple alignments from the 'absence' detection. Using the TE flanking sequences, you should be able to relocate the TE breakpoint on the alignments and then relocate the correct TSD when possible.

- TE detection manual curation

In addition to the TE detection results, the 'Tresults' file contains information about the read and repeat environment of each TE. This information can be used to assess the quality of the results. For instance, a low read coverage and a high repeat density at the flanking sequences of a TE may prevent accurate TE detection. We thus recommend starting by manually curating the TEs in this configuration.
If a TE is detected as 'present' but the contig sequence overlapping the TE sequence is short (small 'left_match_length' or 'right_match_length' value), we recommend visually inspecting the alignments from the presence detection approach using an alignment viewer such as Jalview (http://www.jalview.org/download.html). If terminal mismatches are detected, the T-lex result could be erroneous.
The 'read_number' value makes it possible to evaluate the absence detection approach. An absence call based on a small number of reads can be explained by a low read coverage, while an extreme number of reads can only be explained by the repetitive features of the flanking sequences of the TE (e.g. a TE inserted into another TE). Note that if the TE filtering step is 'on', only segmental duplications encompassing TEs could explain this high number of reads. Therefore, we recommend manually curating the TEs in this situation by blasting the genomic sequence corresponding to a potential duplication against the reference genomic sequences. Another way is to use available segmental duplication annotations. Alignments of the reads spanning the two TE sides can also be used to check that T-lex is not disturbed by imperfect satellites or by short poly-A tails not detected by RepeatMasker (shorter than 20 bp).



How to cite T-lex

Fiston-Lavier AS, Carrigan M, Petrov DA and Gonzalez J. (2010) T-LEX: A program for fast and accurate assessment of transposable element presence using next-generation sequencing data. Nuc. Acids. Res. 2011 Mar 1;39(6):e36. Epub 2010 Dec 21



References

  -RepeatMasker. version Open 3.2.8 A.F.A. Smit, R. Hubley & P. Green RepeatMasker at http://www.repeatmasker.org/.
  -Jurka,J., Kapitonov,V.V., Pavlicek,A., Klonowski,P., Kohany,O., Walichiewicz,J. (2005) Repbase update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 110:462-467.
  -Li,H., Ruan,J., Durbin,R. (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18:1851-1858.
  -Rumble,S.M., Lacroute,P., Dalca,A.V., Fiume,M., Sidow,A., Brudno,M. (2009) SHRiMP: Accurate Mapping of Short Color-space Reads. PLoS Comput. Biol. 5(5): e1000386. doi:10.1371/journal.pcbi.1000386.
  -David M, Dzamba M, Lister D, Ilie L, Brudno M. (2011) SHRiMP2: sensitive yet practical SHort Read Mapping. Bioinformatics. Apr 1;27(7):1011-2. Epub 2011 Jan 28.
  -Kent W.J. (2002) BLAT - the BLAST-like alignment tool. Genome Res. 12:656-664
  -de la Bastide M, McCombie WR. (2007) Assembling genomic DNA sequences with PHRAP. Curr Protoc Bioinformatics. 2007 Mar;Chapter 11:Unit11.4.


T-lex
Copyright 2011, 2012, 2013, 2014
Website designed by Anna-Sophie Fiston-Lavier
(asfiston at univ-montp2 dot fr)