Hello, I'm working with some large protein databases derived from PacBio sequencing projects, which contain quite a lot of redundancy (different isoforms and variants of the same gene / protein). I've noticed that the protein-level identifications I get are heavily dependent on the search engine or combination of search engines I use. It concerns me, for example, when a protein identified rather confidently (e.g. tens of PSMs, >5 peptides) with one search engine doesn't appear in the list at all with another search engine - even as a subset protein in a group.
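To make that concern concrete, here is a minimal sketch of the kind of check I mean. It assumes each search's protein groups have already been loaded as lists of accessions with the lead protein first; the function name, the data layout and the toy accessions are all hypothetical and do not come from any existing tool.

```python
from typing import List


def missing_lead_proteins(groups_a: List[List[str]],
                          groups_b: List[List[str]]) -> List[str]:
    """Return lead proteins of search A that appear in no group of search B.

    Assumed layout: each group is a list of accessions, lead protein first.
    """
    # Every accession reported by search B, whether lead or subset member.
    seen_in_b = {acc for group in groups_b for acc in group}
    # Keep the leads from A that B never reports in any role.
    return [group[0] for group in groups_a
            if group and group[0] not in seen_in_b]


# Toy data, invented for illustration only.
search_a = [["P1", "P1-2"], ["P2"], ["P3", "P3-iso2"]]
search_b = [["P1-2", "P1"], ["P3-iso2"]]

print(missing_lead_proteins(search_a, search_b))  # ['P2', 'P3']
```

Here P1 is not flagged because it is still present in search B as a non-lead group member, whereas P2 and P3 are absent from search B altogether.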
To better understand how the particular nature of these databases affects the protein identifications, I want to compare the search results in various ways. Before I invest time in writing my own solutions, I wanted to ask whether anyone knows of software that can compute metrics such as the following when comparing search results:

1) For each spectrum in the PSM searches, how many of the top N PSMs are shared between the searches?
2) Proportion of all lead proteins (top protein from a group) that are the same
3) Proportion of IDs that are the same when considering all group members
4) More complex measures of group similarity, e.g. the proportion of groups where >X% of members are grouped together in the other search, and how many of these equivalent groups contain the lead protein from the other group?
5) Other things you can think of that might be useful

I saw one tool called compid (https://pubs.acs.org/doi/pdf/10.1021/pr100824w), but it only works with MASCOT and PARAGON results and the download is broken anyway. Are there any other solutions around?

To work with the data myself I'd need to convert the pep.xml and prot.xml data to tsv format. Obviously Petunia does this beautifully, but I'd rather be able to process multiple files on the Linux command line if possible. I saw an old post in this group about this issue, which suggested RpepXML (I don't know if it will work for prot.xml) or tppXMLparser (which isn't flexible in terms of the fields it can output). Are these the most current solutions to the problem, or is there something more recent that I've missed?

Any ideas / thoughts / help would be very much appreciated!!

Thanks,
Alastair

--
You received this message because you are subscribed to the Google Groups "spctools-discuss" group.
Visit this group at https://groups.google.com/group/spctools-discuss.
