Kris wrote:
> I actually found out that I left some data out of one of the fasta
> files. So, you can disregard the descriptions of my runs I posted
> before.
>
> My question is probably more about SEQUEST behavior. But it also
> includes the behavior of PeptideProphet.
>
> My question is basically: Will using a large database_file (in fasta
> format) affect the SEQUEST output or Peptide Prophet output in a
> positive or negative way?
>
> In other words, will a large database of sequences give more or less
> valid peptide matches?

If you have good-quality spectra, you're going to get correct/valid peptide matches irrespective of database size. The larger the database, the larger the population of negative/incorrect peptides to match against, which increases your background scores. This increases the chance that an incorrect peptide gets a higher score than the correct peptide match for a given spectrum. This can happen when the spectrum isn't the greatest, and normally (hopefully) the resulting scores indicate that the match isn't significant. Doubling the database size isn't going to have a tremendous effect on the resulting IDs. Increasing the database size 100-fold will.

There's no one-size-fits-all answer to your two questions above. Does the larger database include protein sequences in your sample that the smaller database excludes? If so, the larger database can give you more correct peptide matches just because you have the ability to match real sequences that aren't present in the smaller database. If the larger database just includes more background sequence entries (that aren't present in your real sample), it's hard to see how you would get more peptide matches. But how many valid matches you might lose, if any, is really an open-ended question that depends on how large 'large' is.

Going from tryptic searches to semi-tryptic searches will increase your search space (# peptide sequences considered) ~20-fold or more, which is similar to, but not quite the same as, using a larger database, yet there are real benefits in increased IDs. Tools like PeptideProphet will make use of the tryptic termini info to help distinguish correct vs. incorrect IDs. It's not directly related to your questions, but I just wanted to mention that there can be benefits to a larger search space.

> -Kris
>
> On Jun 15, 11:13 am, "Brian Pratt" <[email protected]> wrote:
>> Hi Kris,
>>
>> So is this a question about SEQUEST performance and behavior? You might be
>> better off asking Thermo about that.
>> On the other hand there are lots of
>> SEQUEST users on this list so you might get an answer here too...
>>
>> Brian
>>
>> -----Original Message-----
>> From: [email protected]
>>
>> [mailto:[email protected]] On Behalf Of Kris
>> Sent: Monday, June 15, 2009 7:31 AM
>> To: spctools-discuss
>> Subject: [spctools-discuss] fasta file length and match precision
>>
>> Sorry if this is covered somewhere else, I was unable to find the
>> answer to this question.
>>
>> I am using TPP with SEQUEST to search for peptides in mass spectra.
>>
>> I had an original fastafile (specified in the parameters file by
>> "database_name=") that I have been using for all my database_search
>> runs. I just generated a new fasta file that included the possibility
>> of amino acids being cut off the N-terminus of the protein fragments.
>>
>> E.G.
>> If the original fasta file contained MVMNDANQAQITATFKTK
>>
>> The new fasta file contains
>> MVMNDANQAQITATFKTK
>> VMNDANQAQITATFKTK
>> MNDANQAQITATFKTK
>> NDANQAQITATFKTK
>> DANQAQITATFKTK
>> ANQAQITATFKTK
>> NQAQITATFKTK
>> QAQITATFKTK
>> AQITATFKTK
>>
>> These sequences were included because I was concerned that the
>> database_search would not find proteins where the N-terminus was
>> modified. (My work mainly concerns the N-termini of proteins).
>>
>> I ran some preliminary tests to see how this would affect the run
>> speed and results.
>>
>> Using the original fasta file, a run took 3.5 hours. Using the new
>> longer fasta file (which has all the sequences of the original file
>> and more), the same run took 2 hours.
>>
>> There were about 6000 peptides found when run with the original fasta
>> file and only about 2500 peptides found when run with the new longer
>> fasta file.
>>
>> Using the new fasta file, I found some peptide matches with
>> probability of 1 that had slightly lower probability (around 0.9) when
>> using the original fasta file.
>>
>> MY QUESTIONS:
>>
>> Does using a longer fasta file somehow cause a loss of precision when
>> searching for peptide matches in mass spec data?
>>
>> How does the fasta file length (and specifically adding the same
>> sequences with a shortened N-terminus) affect the database_searches
>> and more importantly the peptide matches output?
>>
>> Thanks,
>> Kris
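
As a rough illustration of the search-space point in the reply above, here is a minimal Python sketch. It is not part of TPP or SEQUEST; the function names `n_terminal_truncations` and `tryptic_peptides` and the 6-residue minimum length are assumptions made for this example. It builds the truncated entries from the example in the original message and counts the distinct fully tryptic candidates with no missed cleavages, using the simplified rule of cleaving after K or R except before P.

```python
def n_terminal_truncations(sequence, max_removed):
    """Return the sequence plus copies with 1..max_removed N-terminal residues removed."""
    limit = min(max_removed, len(sequence) - 1)
    return [sequence[i:] for i in range(limit + 1)]


def tryptic_peptides(sequence, min_length=6):
    """Fully tryptic peptides, no missed cleavages (hypothetical simplified rule:
    cleave after K or R unless the next residue is P)."""
    peptides = set()
    start = 0
    for i, residue in enumerate(sequence):
        at_end = i == len(sequence) - 1
        cleaves = residue in "KR" and not at_end and sequence[i + 1] != "P"
        if cleaves or at_end:
            # Keep only peptides long enough to be worth scoring.
            if i + 1 - start >= min_length:
                peptides.add(sequence[start:i + 1])
            start = i + 1
    return peptides


def candidate_peptides(database):
    """Distinct candidate peptides across all entries of a list of sequences."""
    candidates = set()
    for entry in database:
        candidates |= tryptic_peptides(entry)
    return candidates


protein = "MVMNDANQAQITATFKTK"  # example sequence from the original message
original_db = [protein]
expanded_db = n_terminal_truncations(protein, max_removed=8)

original_peptides = candidate_peptides(original_db)
expanded_peptides = candidate_peptides(expanded_db)

print(len(original_db), len(original_peptides))   # 1 entry, 1 candidate
print(len(expanded_db), len(expanded_peptides))   # 9 entries, 9 candidates
```

Under these assumptions, each truncated entry contributes one extra N-terminal candidate for this protein, so the number of sequences a spectrum can be matched against grows with every truncation added. How much that shifts real scores depends on the search engine's settings and on how large the whole database becomes.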
