Repeats Extractor

Repeats Extractor is a custom program that extracts repeats from a genome and forms SDR groups and alignments from them. It uses BLAST and extracts as much information as possible, removing duplicate and overlapping sequences, recursively aligning the repeats it finds, and grouping them according to the given parameters. Repeats Extractor is embedded in NLSAP-GUI and can be downloaded from the website. It is referred to as RE in this document.

Notes on RE:

RE reads sequences in FASTA format as of the release dated 08-09-2004; I shall add support for other sequence formats soon. It creates a database from the sequence using the formatdb utility from NCBI, then blasts the sequence against that database using blastall from NCBI. All further work is done from the BLAST output file.

The parameters are explained as follows:

Minimum Length Threshold: The minimum number of base pairs a match must have to be considered a repeat. The default value is 40, which means that any match shorter than 40 bp is removed. Note that the smaller this value, the more repeats are found, but lowering the threshold also increases the number of repeats that occur “by chance”. Avoid going below 20 for large sequences of around 600k; doing so slows the application, and the results may not be uniformly distributed.

Maximum Length Threshold: The maximum number of base pairs a match may have to be considered a repeat. The default value is 1000, which means that any match longer than 1000 bp is removed. Use this program specifically to find repeat groups shorter than 1000 bp if you do not have a big repeats mask file. If you go beyond 1000, make sure you supply the big repeats mask file; otherwise the results may contain redundant data. For details on how to make the big repeats mask file, see below.
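The two length thresholds above amount to a simple range filter on the BLAST matches. The following is a minimal sketch of that idea, not RE's actual code: the `Hit` record, its field names and the `filter_by_length` function are hypothetical, and the defaults are the 40 and 1000 bp values quoted above.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One BLAST match. Field names are hypothetical, not RE's own."""
    start: int
    end: int
    length: int  # match length in bp


def filter_by_length(hits: List[Hit], min_len: int = 40, max_len: int = 1000) -> List[Hit]:
    """Keep matches whose length lies within [min_len, max_len].

    Matches shorter than min_len or longer than max_len are dropped,
    mirroring the behaviour described for the two thresholds.
    """
    return [h for h in hits if min_len <= h.length <= max_len]


hits = [
    Hit(1, 39, 39),          # too short, removed
    Hit(100, 139, 40),       # exactly at the minimum, kept
    Hit(500, 1499, 1000),    # exactly at the maximum, kept
    Hit(2000, 3000, 1001),   # too long, removed
]
kept = filter_by_length(hits)
print([h.length for h in kept])
```

Both bounds are treated as inclusive here, since the text says only matches "less than" the minimum or "greater than" the maximum are removed.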
Identity Threshold: The minimum identity between any two sequences for them to be considered a repeat. The default value is 90. Lowering this value typically increases the number of repeats found. It is good practice to run the program with several identity thresholds to check whether a large number of repeats falls within a certain identity range. For example, if a genome has many repeats at 80-90% identity, all of them are missed when the program is run with a threshold of 90; equally, there may be nothing in the 80-90% range. It is therefore better to decide on the identity threshold after looking at the results.

Minimum Overlap: The percent overlap between any two matches reported by BLAST up to which they are still considered two different repeats. The greater this value, the greater the overlap allowed. Overlap is calculated as follows. Say that in a sequence “seq”, seqA, seqB and seqC are three matches reported by BLAST, with these coordinates:

seqA = 50-75
seqB = 140-165
seqC = 143-169

Then:

Overlap(seqB, seqC) = (seqB intersection seqC) / effective length = (143-165 = 22 bp) / 29 bp ≈ 0.76, i.e. about 76%
Overlap(seqA, seqB) = 0, since the two matches do not intersect.

Here the effective length of 29 bp is the region spanned by the two matches together (140-169). So, to keep the match seqC, the minimum overlap cutoff should not be set below 76; to treat seqC as redundant, set the cutoff lower.

Group Minimum Overlap: The percent overlap between any two sequences picked as repeats above which they are considered to belong to the same group. The greater this value, the smaller the groups and the greater their number; the smaller this value, the larger the groups and the fewer their number. (Note: do not confuse Minimum Overlap with Group Minimum Overlap; they do entirely different things.)
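The Minimum Overlap arithmetic in the seqA/seqB/seqC example can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: the function name `percent_overlap` is hypothetical, lengths are taken as end minus start (as in the worked example), and the "effective length" is assumed to be the region spanned by the two matches together.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start, end) coordinates of a BLAST match


def percent_overlap(a: Interval, b: Interval) -> float:
    """Percent overlap of two matches on the same sequence.

    Assumption: the denominator is the combined span of both matches,
    which gives 29 bp for 140-165 and 143-169.
    """
    intersection = min(a[1], b[1]) - max(a[0], b[0])
    if intersection <= 0:
        return 0.0  # the matches do not intersect
    combined_span = max(a[1], b[1]) - min(a[0], b[0])
    return 100.0 * intersection / combined_span


seq_a = (50, 75)
seq_b = (140, 165)
seq_c = (143, 169)

print(round(percent_overlap(seq_b, seq_c), 2))  # 22 / 29 * 100
print(percent_overlap(seq_a, seq_b))            # no intersection
```

With a cutoff below the value computed for seqB and seqC, one of the two matches would be treated as redundant; with a cutoff at or above it, both would be kept.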
Group overlap is calculated as follows. Say that in a sequence “seq”, seqA, seqB and seqC are three matches reported by BLAST, with these sequences:

seqA = atatatgagcaggacgcgacgat = [seqA] = Group1
seqB = atatatgagcaggacgcgacgat = [seqA] = Group1
seqC = gcatatatgagcaggacgcgacgatag = gc[seqA]ag = Group1, or a separate Group2? This is decided by the Group Minimum Overlap value.

Clearly seqA and seqB align perfectly and should belong to the same group.

groupOverlap(seqA, seqB) = Identity(aligned seqA, seqB) = 23/23 = 1 × 100 = 100
groupOverlap(seqB, seqC) = Identity(aligned seqB, seqC) = 23/27 ≈ 0.8519 × 100 ≈ 85.19

So, to group seqC with Group1, the group overlap cutoff should be 85 or lower; to split it off into a different group, set the cutoff higher.

(Note: the program needle from EMBOSS is used for these alignments; for more information on needle, see the EMBOSS help pages.)

Distribution Size: The width, in bp, of each length range in the distribution table. For example, if this value is 20 and the minimum length selected is 40, the table looks as follows:

40-59    number of repeats
60-79    number of repeats
80-99    number of repeats

Match Reward and Mismatch Penalty: These parameters are passed to BLAST. They are the reward for a nucleotide match and the penalty for a nucleotide mismatch. It is recommended to use the default values, especially when looking for repeats. For more information, see the BLAST help pages.

Mask Large Repeats: A four-column, tab-delimited file that lists the coordinates of the large repeats that Repeats Extractor should take into account. Example Mask File for Zea mays.

How does this file affect the results?

- A subsequence of a larger repeat IS NOT considered a repeat if it occurs only within copies of that larger repeat and nowhere else in the genome.
- A subsequence of a larger repeat IS considered a repeat if it also occurs somewhere else in the genome, outside the larger repeat.

Hence it is always recommended to supply the mask file for larger genomes (greater than 100k). This information can typically be obtained from dot-plot tools such as PipMaker. I am planning to provide a tool for creating this mask file soon.

Jim, if you think any information needs to be added, let me know. Also, please feel free to edit or comment on the above statements if they seem confusing.
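The group overlap calculation and the distribution table described earlier can both be sketched in a few lines. This is an illustration under assumptions, not RE's implementation: the function names are hypothetical, the aligned strings are written out by hand instead of being produced by needle, and identity is taken as identical columns divided by alignment length, which reproduces the 23/27 figure.

```python
from typing import List, Tuple


def group_overlap(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity of two already-aligned sequences of equal length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap)
    return 100.0 * matches / len(aligned_a)


def distribution_table(lengths: List[int], min_len: int = 40,
                       bin_size: int = 20) -> List[Tuple[str, int]]:
    """Count repeats per length range, starting at min_len, bin_size bp wide."""
    kept = [n for n in lengths if n >= min_len]
    if not kept:
        return []
    n_bins = (max(kept) - min_len) // bin_size + 1
    counts = [0] * n_bins
    for n in kept:
        counts[(n - min_len) // bin_size] += 1
    return [
        (f"{min_len + i * bin_size}-{min_len + (i + 1) * bin_size - 1}", c)
        for i, c in enumerate(counts)
    ]


# seqB aligned against seqC, with gaps where seqC has its extra flanking bases
aligned_b = "--atatatgagcaggacgcgacgat--"
aligned_c = "gcatatatgagcaggacgcgacgatag"
print(round(group_overlap(aligned_b, aligned_c), 2))  # 23 / 27 * 100

for label, count in distribution_table([40, 59, 60, 85, 99]):
    print(label, count)
```

The example lengths are made up for illustration; with a minimum length of 40 and a distribution size of 20 they fall into the 40-59, 60-79 and 80-99 ranges shown in the table above.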
| Maize
Mitochondrial Newton Lab Research 324 Tucker Hall Biological Sciences University of Missouri - Columbia |
|