Re: [spctools-discuss] some questions: model failure, parametric vs semi-parametric, same search engine but different parameters, function of scripts

David Shteynberg Wed, 02 Jan 2019 13:18:21 -0800

Hello Alastair,

I will attempt to answer your queries below.

*1: Model failure*

When a model for a charge state fails (" Mixture model quality test failed
for charge (4+)") what happens to data from that sample and charge state?
Is it excluded from the rest of the pipeline?

The results from failed models get assigned a probability that is either
zero, when the result is not expected to be correct, or a negative value of
magnitude equal to the charge state of the PSM, when it can still be a
correct ID based on the results of a different charge state model.
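
As a rough illustration of the convention just described, here is a minimal Python sketch for interpreting a PeptideProphet probability read from a pepXML file. The function name and the category labels are hypothetical and are not part of the TPP; only the zero / negative-charge convention comes from the answer above.

```python
def interpret_probability(prob):
    """Classify a PeptideProphet probability.

    Hypothetical helper: the labels are illustrative, only the
    zero / negative-value convention is taken from the explanation above.
    """
    if prob > 0:
        # Ordinary case: the value is a probability of being correct.
        return ("modelled", None)
    if prob == 0:
        # Zero: the result is not expected to be correct.
        return ("not_expected_correct", None)
    # Negative: the charge-state model failed; the magnitude is the charge
    # state of the PSM, and the ID may still turn out to be correct.
    return ("failed_model", int(round(-prob)))


for value in (0.93, 0.0, -4.0):
    print(value, interpret_probability(value))
```

A PSM with a value of -4 would therefore be read as "4+ model failed, not scored by that model" rather than as a genuinely negative probability.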

*2: Parametric vs. semi-parametric modelling*

In many of my samples I find more proteins with non-parametric modelling
than with parametric modelling. Do you expect non-parametric modelling to
be less conservative, or is this very data-set dependent? Is the
semi-supervised / semi-parametric modelling normally preferred because it
makes fewer assumptions about the data?

In the FAQ page "What is CLEVEL and how do I use it?" there are nice plots
that I can imagine being helpful for making decisions about model
performance. Is there an easy way to produce these plots? If I copy the
data over to look at in the Petunia GUI on windows I can't find any such
plots, but I've seen them in another question in this discussion group
which makes me think I'm missing something...

The semi-parametric model usually gives a closer match to the underlying
data and is generally preferred as it seems to boost performance of the
classifier as you have also observed.  It does require your database to
include a portion of decoys that will be made available to PeptideProphet
for modelling.

*3. Combining search engine results run with different parameters*

There's supposed to be no assumption of orthogonality when combining the
results of multiple search engines in the TPP. So is it also acceptable to
run the same search engine with multiple parameters (eg with and without
certain variable modifications) and then combine these results? Is there no
danger that this will artificially inflate the probability of a protein,
because the search space is made to appear artificially small?

This is a good question, but I don't know if I have a good answer other than
to try it with a set of hidden decoys and see whether the false discovery
rates based on your hidden decoys are still accurate. If you combine the
results of multiple searches run with different parameters, you might have
to disable the sibling searches model (NONSS option).
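
The hidden-decoy check suggested here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the `(probability, is_decoy)` input format are hypothetical, the decoy database is assumed to be the same size as the target database (so each accepted decoy stands for roughly one false target hit), and the decoys are assumed to have been kept hidden from the models.

```python
def fdr_estimates(psms, min_prob):
    """Compare model-based and decoy-based FDR at a probability cutoff.

    psms: iterable of (probability, is_decoy) pairs (hypothetical format).
    Returns (model_fdr, decoy_fdr), both computed for the accepted targets.
    """
    accepted = [(p, d) for p, d in psms if p >= min_prob]
    target_probs = [p for p, d in accepted if not d]
    n_decoy = sum(1 for _, d in accepted if d)
    if not target_probs:
        return (0.0, 0.0)
    # Model-based estimate: expected fraction of incorrect target PSMs.
    model_fdr = sum(1.0 - p for p in target_probs) / len(target_probs)
    # Decoy-based estimate: assumes equal-sized target and decoy databases.
    decoy_fdr = n_decoy / len(target_probs)
    return (model_fdr, decoy_fdr)


example = [(0.99, False), (0.95, False), (0.90, True),
           (0.80, False), (0.20, False)]
model_fdr, decoy_fdr = fdr_estimates(example, min_prob=0.5)
print(f"model FDR {model_fdr:.3f}, decoy FDR {decoy_fdr:.3f}")
```

If the decoy-based value is consistently much higher than the model-based value after combining the searches, that would suggest the combined probabilities are inflated.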

*4. The function of some of the scripts...*

I've worked out how to run TPP on unix by running it on the windows GUI and
looking at the command list. However I'm really not clear about the precise
function of some of the programs and scripts. If I'm combining the results
of multiple search engines then I run the following programs. Below I give
a brief description of how I use it and what I think it does. Any
corrections or elaborations would be appreciated:

You should be able to run the commands on the commandline without any
options to generate detailed usage information for the command.

*InteractParser*
Here I combine different pep.xml files from technical replicates and set
the experiment and enzyme tags

Yes, this is the main function of InteractParser.  It can also fix problems
commonly found in some search engine native pepXML formats and add more
information for visualization with PepXMLViewer, such as retention time,
precursor intensity, etc...

*DatabaseParser*
Is this necessary? How does it alter the pepXML files?

It doesn't alter the pepXML and simply parses out the path to the database
written in the pepXML.

*RefreshParser*
I use this to make sure that all pepXML files are referencing the same
database. This is necessary because some search engines have different
database requirements - some generate the decoy database themselves while
other require it appended to the real database. So here I make sure all the
files reference the non-appended version - otherwise I presume there would
be problems when combining the results downstream when they derive from
'different' databases. *Is this use of RefreshParser necessary /
appropriate?*

This is an important tool that maps the peptides to all proteins in the
database.  If you search against one database, you can remap the peptides
to a different database.  If you are combining the results of searches
against different databases, one reasonable approach would be to create a
concatenated database of all the databases and use RefreshParser to remap
your PSMs against the database containing all sequences. If you are
combining the results of different search engines, I would not recommend
using different decoys (or search-engine-generated decoys), as the decoys
will differ between the searches and will not accurately represent false
matches to the target sequences, which can be the same between the
different databases.
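
Building the concatenated database mentioned above could look something like the following Python sketch. It is not a TPP tool: the function name is hypothetical, and dropping repeated entries by accession (the first whitespace-delimited token of the header line) is an assumption made for this example. The resulting file would then be the database passed to RefreshParser.

```python
def concatenate_fasta(input_paths, output_path):
    """Merge FASTA files, keeping the first entry seen for each accession.

    Hypothetical helper; de-duplication by accession is an assumption.
    Returns the number of entries written.
    """
    seen = set()
    written = 0
    with open(output_path, "w") as out:
        for path in input_paths:
            keep = False  # whether the current entry is being copied
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        fields = line[1:].split()
                        accession = fields[0] if fields else ""
                        keep = accession not in seen
                        if keep:
                            seen.add(accession)
                            written += 1
                    if keep:
                        # Guard against a missing final newline in a file.
                        out.write(line if line.endswith("\n") else line + "\n")
    return written
```

A call such as `concatenate_fasta(["a.fasta", "b.fasta"], "combined.fasta")` (file names are placeholders) writes each accession once, so target sequences shared between the input databases are not duplicated.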

*PeptideProphetParser*
Runs peptide prophet

Yes.

*ProphetModels.pl*
Does this alter the pepXML file?

Yes, because it runs tpp_models.pl under the hood.

*tpp_models.pl*
Does this alter the pepXML file?

Yes.

*InterProphetParser*
Runs iprophet, combining the different search results. I presume that this
uses the appropriate decoy tag for each file as provided to
PeptideProphetParser?

Yes it runs iProphet.  No, it doesn't care about or use decoys unless
enabled as an option.

*RefreshParser*
Seems to be necessary. Not quite sure why.

See above.

*ProteinProphet*
Runs protein prophet on the iprophet results

It runs ProteinProphet, but it doesn't require iProphet results.  The
original version of ProteinProphet was built for PeptideProphet result
processing.  When running it on iProphet results, enable the IPROPHET
commandline flag to use iProphet probabilities.

I hope this is helpful.  Let me know if other questions come up.

Cheers,
-David

On Tue, Dec 11, 2018 at 7:32 AM alastair.skeffington via spctools-discuss <
[email protected]> wrote:

> Hello,
>
> So I've got TPP working well on unix, processing the results of multiple
> search engine very efficiently. Thanks for the help that's got me this far.
> However I have a number of outstanding questions that I'd like to
> understand before I really trust my use of the pipeline. Hopefully someone
> can help. If I've missed something in the documentation or in the papers
> please let me know.
>
>
> *1: Model failure*
>
> When a model for a charge state fails (" Mixture model quality test failed
> for charge (4+)") what happens to data from that sample and charge state?
> Is it excluded from the rest of the pipeline?
>
> *2: Parametric vs. semi-parametric modelling*
>
> In many of my samples I find more proteins with non-parametric modelling
> than with parametric modelling. Do you expect non-parametric modelling to
> be less conservative, or is this very data-set dependent? Is the
> semi-supervised / semi-parametric modelling normally preferred  because it
> makes less assumptions of the data?
>
> In the FAQ page "What is CLEVEL and how do I use it?" there are nice plots
> that I can imagine being helpful for making decisions about model
> performance. Is there an easy way to produce these plots? If I copy the
> data over to look at in the Petunia GUI on windows I can't find any such
> plots, but I've seen them in another question in this discussion group
> which makes me think I'm missing something...
>
> *3. Combining search engine results run with different parameters*
>
> There's supposed to be no assumption of orthogonality when combining the
> results of multiple search engines in the TPP. So is it also acceptable to
> run the same search engine with multiple parameters (eg with and without
> certain variable modifications) and then combine these results? Is there no
> danger that this will artificially inflate the probability of a protein,
> because the search space is made to appear artificially small?
>
> *4. The function of some of the scripts...*
>
> I've worked out how to run TPP on unix by running it on the windows GUI
> and looking at the command list. However I'm really not clear about the
> precise function of some of the programs and scripts. If I'm combining the
> results of multiple search engines then I run the following programs. Below
> I give a brief description of how I use it and what I think it does. Any
> corrections or elaborations would be appreciated:
>
> *InteractParser*
> Here I combine different pep.xml files from technical replicates and set
> the experiment and enzyme tags
>
>
> *DatabaseParser*
> Is this necessary? How does it alter the pepXML files?
>
>
> *RefreshParser*
> I use this to make sure that all pepXML files are referencing the same
> database. This is necessary because some search engines have different
> database requirements - some generate the decoy database themselves while
> other require it appended to the real database. So here I make sure all the
> files reference the non-appended version - otherwise I presume there would
> be problems when combining the results downstream when they derive from
> 'different' databases. *Is this use of RefreshParser necessay /
> appropriate?*
>
>
> *PeptideProphetParser*
> Runs peptide prophet
>
>
> *ProphetModels.pl*
> Does this alter the pepXML file?
>
>
> *tpp_models.pl*
> Does this alter the pepXML file?
>
>
> *InterProphetParser*
> Runs iprophet, combining the different search results. I presume that this
> uses the appropriate decoy tag for each file as provided to
> PeptideProphetParser?
>
>
> *RefreshParser*
> Seems to be necessary. Not quite sure why.
>
> *ProteinProphet*
> Runs protein prophet on the iprophet results
>
>
>
> Thanks!
> Alastair
>
> --
> You received this message because you are subscribed to the Google Groups
> "spctools-discuss" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/spctools-discuss.
> For more options, visit https://groups.google.com/d/optout.
>
