[spctools-discuss] Re: fasta file length and match precision

Jimmy Eng Tue, 16 Jun 2009 10:47:10 -0700

Kris wrote:
> I actually found out that I left some data out of one of the fasta
> files. So, you can disregard the descriptions of my runs I posted
> before.
> 
> My question is probably more about SEQUEST behavior. But it also
> includes the behavior of PeptideProphet.
> 
> My question is basically: Will using a large database_file (in fasta
> format) affect the SEQUEST output or Peptide Prophet output in a
> positive or negative way?
> 
> In other words, will a large database of sequences give more or less
> valid peptide matches?

If you have good quality spectra, you're going to get correct/valid 
peptide matches irrespective of database size.  The larger the database, 
the larger the population of negative/incorrect peptides to match 
against, which increases your background scores.  This increases the 
chance that an incorrect peptide gets a higher score than the correct 
peptide match for a given spectrum.  This can happen when the spectrum 
isn't the greatest, and normally (hopefully) the resulting scores 
indicate that the match isn't significant.  Doubling the database size 
isn't going to have a tremendous effect on the resulting IDs.  
Increasing the database size 100-fold will.
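
To put rough numbers on the doubling vs. 100-fold point, here is a 
back-of-the-envelope sketch.  It is not how SEQUEST or PeptideProphet 
score anything; it just assumes each incorrect candidate independently 
has the same small chance p of outscoring the correct peptide.  The 
values of p and the baseline candidate count are made up for 
illustration.

```python
def chance_random_hit(p_single, n_candidates):
    """Probability that at least one of n_candidates incorrect peptides
    outscores the correct match, assuming each does so independently
    with probability p_single (a simplifying assumption)."""
    return 1.0 - (1.0 - p_single) ** n_candidates


# Hypothetical values, chosen only to illustrate the trend.
p = 1e-6          # chance a single incorrect candidate wins
base = 10_000     # candidates considered with the original database

for fold in (1, 2, 100):
    n = base * fold
    print(f"{fold:>3}x database ({n:>9,} candidates): "
          f"{chance_random_hit(p, n):.4f}")
```

Under these assumptions, doubling the database roughly doubles a chance 
that was small to begin with, while a 100-fold increase pushes it to 
better than even odds.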

There's no one-size-fits-all answer to your two questions above.  Does 
the larger database include protein sequences in your sample that the 
smaller database excludes?  If so, the larger database can give you more 
correct peptide matches just because you have the ability to match real 
sequences that aren't present in the smaller database.  If the larger 
database just includes more background sequence entries (that aren't 
present in your real sample), it's hard to see how you would get more 
peptide matches.  But how many valid matches you might lose, if any, is 
really an open-ended question that depends on how large 'large' is.

Going from tryptic searches to semi-tryptic searches will increase your 
search space (# peptide sequences considered) ~20-fold or more, which 
is similar to, but not quite the same as, using a larger database, but 
there are real benefits in the form of increased IDs.  Tools like 
PeptideProphet will make use of the tryptic termini info to help 
distinguish correct vs. incorrect IDs.  It's not directly related to 
your questions, but I just wanted to mention that there can be benefits 
to a larger search space.
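
To see where the extra search space comes from, here is a small sketch 
that counts candidate peptides for one protein by how many tryptic 
termini they have.  This is not SEQUEST's enumeration code; the 
cleavage rule (after K or R, not before P), the length window, and the 
lack of any missed-cleavage limit are all assumptions made for the 
example, and the function names are hypothetical.

```python
import re


def cleavage_sites(protein):
    """Positions where a tryptic terminus can fall: after K or R unless
    the next residue is P.  The protein start (0) and end (len) also
    count as valid termini."""
    sites = {0, len(protein)}
    for match in re.finditer(r"[KR](?!P)", protein):
        sites.add(match.end())
    return sites


def count_peptides(protein, min_len=6, max_len=30):
    """Return (fully_tryptic, semi_tryptic_only) candidate counts.

    Every substring within the length window is considered, with no
    limit on missed cleavages (an assumption for this sketch)."""
    sites = cleavage_sites(protein)
    fully = semi = 0
    for start in range(len(protein)):
        last_end = min(start + max_len, len(protein))
        for end in range(start + min_len, last_end + 1):
            tryptic_termini = (start in sites) + (end in sites)
            if tryptic_termini == 2:
                fully += 1
            elif tryptic_termini == 1:
                semi += 1
    return fully, semi


# The example sequence from the original post.
fully, semi = count_peptides("MVMNDANQAQITATFKTK")
print(f"tryptic search: {fully} candidates")
print(f"semi-tryptic search: {fully + semi} candidates")
```

For that short sequence the counts are 2 fully tryptic and 33 
additional semi-tryptic candidates.  The ratio for a real database 
depends on the proteins, the length limits, and the missed-cleavage 
setting.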

> -Kris
> 
> On Jun 15, 11:13 am, "Brian Pratt" <[email protected]> wrote:
>> Hi Kris,
>>
>> So is this a question about SEQUEST performance and behavior?  You might be
>> better off asking Thermo about that.  On the other hand there are lots of
>> SEQUEST users on this list so you might get an answer here too...
>>
>> Brian
>>
>> -----Original Message-----
>> From: [email protected]
>>
>> [mailto:[email protected]] On Behalf Of Kris
>> Sent: Monday, June 15, 2009 7:31 AM
>> To: spctools-discuss
>> Subject: [spctools-discuss] fasta file length and match precision
>>
>> Sorry if this is covered somewhere else, I was unable to find the
>> answer to this question.
>>
>> I am using TPP with SEQUEST to search for peptides in mass spectra.
>>
>> I had an original fastafile (specified in the parameters file by
>> "database_name=") that I have been using for all my database_search
>> runs. I just generated a new fasta file that included the possibility
>> of amino acids being cut off the N-terminus of the protein fragments.
>>
>> E.G.
>> If the original fasta file contained MVMNDANQAQITATFKTK
>>
>> The new fasta file contains
>> MVMNDANQAQITATFKTK
>> VMNDANQAQITATFKTK
>> MNDANQAQITATFKTK
>> NDANQAQITATFKTK
>> DANQAQITATFKTK
>> ANQAQITATFKTK
>> NQAQITATFKTK
>> QAQITATFKTK
>> AQITATFKTK
>>
>> These sequences were included because I was concerned that the
>> database_search would not find proteins where the N-terminus was
>> modified. (My work mainly concerns the N-termini of proteins).
>>
>> I ran some preliminary tests to see how this would affect the run
>> speed and results.
>>
>> Using the original fasta file, a run took 3.5 hours. Using the new
>> longer fasta file (which has all the sequences of the original file
>> and more), the same run took 2 hours.
>>
>> There were about 6000 peptides found when run with the original fasta
>> file and only about 2500 peptides found when run with the new longer
>> fasta file.
>>
>> Using the new fasta file, I found some peptide matches with
>> probability of 1 that had slightly lower probability (around 0.9) when
>> using the original fasta file.
>>
>> MY QUESTIONS:
>>
>> Does using a longer fasta file somehow cause a loss of precision when
>> searching for peptide matches in mass spec data?
>>
>> How does the fasta file length (and specifically adding the same
>> sequences with a shortened N-terminus) affect the database_searches
>> and more importantly the peptide matches output?
>>
>> Thanks,
>> Kris
