Dear TPP developers and community,

I wanted to point out a potential error in how StPeter estimates protein mass (nanograms) in the proteome sample. As described in the paper <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5891225/>, the program normalizes the spectral index, dSI, by the protein length, L, and by the total spectral index of the sample, Sum(dSI), as in the formula below:

[image: image.png]

This is correct for estimating the relative copy number, or *mole fraction* (the fraction of the total number of protein molecules), of each protein. However, for nanograms, or *mass fraction* (the fraction of the total proteome mass), the normalization by L should be omitted.
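To spell out the distinction, here is a sketch of the two quantities in LaTeX notation. It assumes that the formula in the paper matches the verbal description above and that a protein's mass is proportional to its length L. The index i and the symbols x_i (mole fraction) and w_i (mass fraction) are introduced here for illustration only and are not taken from the paper.

```latex
% dSIN as described above: dSI normalized by length and by the total dSI
\begin{align*}
\mathrm{dSIN}_i &= \frac{\mathrm{dSI}_i}{L_i \sum_j \mathrm{dSI}_j} \\[4pt]
% Mole fraction: share of the total number of protein molecules
x_i &= \frac{\mathrm{dSIN}_i}{\sum_k \mathrm{dSIN}_k} \\[4pt]
% Mass fraction: copy number weighted by length (mass assumed proportional to L)
w_i &= \frac{\mathrm{dSIN}_i \, L_i}{\sum_k \mathrm{dSIN}_k \, L_k}
     = \frac{\mathrm{dSI}_i}{\sum_k \mathrm{dSI}_k}
\end{align*}
```

The last equality follows because dSIN_i * L_i = dSI_i / Sum(dSI), so the length factor cancels. Under these assumptions, the mass fraction is simply each protein's share of the total spectral index.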
I hope this makes sense. The mass abundance of each protein is proportional to *both* its length and its copy number; therefore, normalization by length should not be performed when estimating mass abundance. Unfortunately, as the StPeter paper says (and as I have verified in the output), for calculating the nanograms "each protein SIN is divided by the sum of all proteins’ SIN and multiplied by the protein load in nanograms". This effectively uses the mole fraction in place of the mass fraction, which is incorrect.

The authors (and other users) may not have noticed this error because it is inconsequential for tracking changes between different samples/conditions. However, it would be significant for consistency with other mass quantitation methods. As a consistency check, when StPeter's SIN output is used correctly to estimate mass fractions, i.e. dSIN * L / Sum(dSIN * L) is calculated instead of the above formula, the result is highly correlated with that of spectral counting, as expected, and as you can see in the example below:

[image: StPeter_PSMs_aer.png]

The method of mass fraction estimation using spectral counting is already established in the literature, for example in this paper <https://www.embopress.org/doi/full/10.15252/msb.20145697> (see the "Absolute protein quantitation" section): "The absolute abundance of a protein was calculated by dividing the total number of spectra of all peptides for that protein by the total number of 14N spectra in the sample." No normalization by protein length is done, because length has to be included in the *mass* abundance of a protein. The paper also verifies the consistency of this method with 15N-labeled relative quantitation (see their supplementary figure S9). I have also verified the agreement in my own relative quantitation experiments.

I would be interested in learning your thoughts on this.
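The corrected calculation described above, dSIN * L / Sum(dSIN * L), can be sketched in Python as follows. This is a minimal illustration, not StPeter's actual code: the function names, the toy input values, and the 1000 ng load are all hypothetical, and the sketch assumes the "SIn" column holds log2(dSIN) for each protein.

```python
from typing import Sequence


def mole_fractions(log2_dsin: Sequence[float]) -> list[float]:
    """Current-style normalization: dSIN / Sum(dSIN), i.e. a mole fraction."""
    dsin = [2.0 ** s for s in log2_dsin]  # undo the log2 (assumed SIn = log2 dSIN)
    total = sum(dsin)
    return [d / total for d in dsin]


def mass_fractions(log2_dsin: Sequence[float], lengths: Sequence[int]) -> list[float]:
    """Length-weighted normalization: dSIN * L / Sum(dSIN * L), i.e. a mass fraction."""
    if len(log2_dsin) != len(lengths):
        raise ValueError("log2_dsin and lengths must have the same number of entries")
    weighted = [2.0 ** s * length for s, length in zip(log2_dsin, lengths)]
    total = sum(weighted)
    return [w / total for w in weighted]


def to_nanograms(fractions: Sequence[float], load_ng: float) -> list[float]:
    """Scale fractions by the total protein load in nanograms."""
    return [f * load_ng for f in fractions]


# Hypothetical toy sample: two proteins with equal dSIN (equal copy number),
# one of length 100 and one of length 300, and a 1000 ng total load.
example_sin = [-10.0, -10.0]
example_lengths = [100, 300]

mole = mole_fractions(example_sin)                    # [0.5, 0.5]
mass = mass_fractions(example_sin, example_lengths)   # [0.25, 0.75]
current_ng = to_nanograms(mole, 1000.0)               # [500.0, 500.0]
corrected_ng = to_nanograms(mass, 1000.0)             # [250.0, 750.0]
```

In this toy case both proteins have the same copy number, so the mole-fraction version assigns them equal nanograms, while the length-weighted version assigns three times as much mass to the protein that is three times longer.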
For obtaining protein mass abundances (or mass fractions), StPeter's "SIn" output (which is log2[dSIN]) is currently usable in the way described above, but the "ng" output needs to be corrected in the source code. Optionally, the current "ng" calculation can also be re-labeled as "copy numbers" given a total load of copy numbers (instead of total nanograms) provided by the user, but that would probably be of less interest than nanograms. Please let me know what you think.

Thank you,
Farshad
