Hello,

I'm working with some large protein databases derived from PacBio 
sequencing projects which contain quite a lot of redundancy (different 
isoforms and variants of the same gene / protein). I've noticed that the 
protein level identifications I get are heavily dependent on the search 
engine or combination of search engines I use. It concerns me, for example, 
when a protein identified rather confidently (e.g. tens of PSMs, >5 peptides) 
with one search engine doesn't appear in the list at all with another 
search engine - even as a subset protein in a group. 

To understand better how the particular nature of these databases is 
affecting the protein identifications I want to compare the search results 
in various ways. Before I invest time writing my own solutions, I wanted to 
know if anyone knows of any software that can compute metrics such as the 
following when comparing search results:

1) For each spectrum in the PSM searches, how many of the top N PSMs are 
shared between the searches?
2) Proportion of all lead proteins (top protein from a group) that are the 
same
3) Proportion of IDs that are the same when considering all group members
4) More complex measures of group similarity: e.g. the proportion of groups 
where >X% of members are grouped together in the other search, and how many 
of these equivalent groups contain the lead protein from the other group?
5) Other things you can think of that might be useful
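
To make metrics 1 to 4 concrete, here is a minimal Python sketch of one way they could be computed. It assumes the results have already been loaded into plain Python structures: each search's PSMs as a dict mapping a spectrum ID to a ranked list of peptides, and each search's protein groups as a list of lists with the lead protein first. The function names and the choice of Jaccard overlap are illustrative assumptions, not taken from any existing tool.

```python
from collections import Counter, defaultdict


def top_n_shared(psms_a, psms_b, n=5):
    """Metric 1: per spectrum, how many of the top-n peptides are shared.

    psms_a / psms_b: dict of spectrum ID -> peptides ordered by rank
    (hypothetical input layout). Only spectra present in both are compared.
    """
    return {
        s: len(set(psms_a[s][:n]) & set(psms_b[s][:n]))
        for s in psms_a.keys() & psms_b.keys()
    }


def jaccard(set_a, set_b):
    """Shared items divided by all items seen in either set."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def lead_overlap(groups_a, groups_b):
    """Metric 2: overlap of lead proteins (first member of each group)."""
    return jaccard({g[0] for g in groups_a if g}, {g[0] for g in groups_b if g})


def member_overlap(groups_a, groups_b):
    """Metric 3: overlap of IDs when all group members are considered."""
    return jaccard({p for g in groups_a for p in g}, {p for g in groups_b for p in g})


def matched_groups(groups_a, groups_b, min_frac=0.5):
    """Metric 4: fraction of groups in A with more than min_frac of their
    members together in a single group of B, plus how many of those matched
    A groups contain the lead protein of their B counterpart."""
    index_b = defaultdict(set)  # protein -> indices of B groups containing it
    for j, group in enumerate(groups_b):
        for protein in group:
            index_b[protein].add(j)

    matched = with_other_lead = 0
    for group in groups_a:
        members = set(group)
        hits = Counter(j for p in members for j in index_b.get(p, ()))
        if not hits:
            continue
        best_j, shared = hits.most_common(1)[0]
        if shared / len(members) > min_frac:
            matched += 1
            if groups_b[best_j][0] in members:
                with_other_lead += 1
    fraction = matched / len(groups_a) if groups_a else 0.0
    return fraction, with_other_lead
```

Metric 4 is asymmetric as written, so it would probably need to be run in both directions (A against B and B against A) to give the full picture.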

I saw one tool called compid 
(https://pubs.acs.org/doi/pdf/10.1021/pr100824w), but it only works with 
MASCOT and PARAGON results, and the download is broken anyway.

Are there any other solutions around?

To work with the data myself I'd need to convert the pep.xml and prot.xml 
data to tsv format. Obviously Petunia does this beautifully, but I'd rather 
be able to process multiple files on the Linux command line if possible. I 
saw an old post in this group about this issue and they suggested RpepXML 
(I don't know if it will work for prot.xml) or tppXMLparser (which isn't 
flexible in terms of the fields it can output). Are these the most current 
solutions to the problem or is there something more recent that I've missed?
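
As a fallback, the pep.xml side of the conversion could be sketched with only the Python standard library. This is a rough sketch, not an existing tool: it assumes the usual pepXML layout (`spectrum_query` elements containing `search_hit` elements with `hit_rank`, `peptide` and `protein` attributes, and an optional nested `peptideprophet_result` with a `probability` attribute). The function names and the chosen output columns are my own, and prot.xml would need a similar but separate walker.

```python
import csv
import sys
import xml.etree.ElementTree as ET

FIELDS = ["file", "spectrum", "hit_rank", "peptide", "protein", "probability"]


def local(tag):
    """Strip the XML namespace so the code does not depend on its exact URI."""
    return tag.rsplit("}", 1)[-1]


def pepxml_rows(source, label=""):
    """Yield one dict per search_hit, streaming so large files fit in memory.

    source: a file path or a binary file object.
    """
    spectrum = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        name = local(elem.tag)
        if event == "start" and name == "spectrum_query":
            spectrum = elem.get("spectrum")
        elif event == "end" and name == "search_hit":
            probability = ""
            for child in elem.iter():
                if local(child.tag) == "peptideprophet_result":
                    probability = child.get("probability", "")
            yield {
                "file": label,
                "spectrum": spectrum,
                "hit_rank": elem.get("hit_rank"),
                "peptide": elem.get("peptide"),
                "protein": elem.get("protein"),
                "probability": probability,
            }
        elif event == "end" and name == "spectrum_query":
            elem.clear()  # release the hits of a finished spectrum


def write_tsv(paths, out):
    """Write the hits from several pep.xml files as one tab-separated table."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for path in paths:
        for row in pepxml_rows(path, label=path):
            writer.writerow(row)


if __name__ == "__main__" and len(sys.argv) > 1:
    write_tsv(sys.argv[1:], sys.stdout)
```

Saved under a hypothetical name such as `pepxml2tsv.py`, it would be run as `python pepxml2tsv.py *.pep.xml > hits.tsv`.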

Any ideas / thoughts / help would be very much appreciated!!

Thanks,

Alastair
