For anyone curious, by the way, in the Trapper output that Ludovic
posted, you can see the section:
<nativeScanRef coordinateType="Agilent">
  <coordinate name="scan" value="3560" />
</nativeScanRef>
for mzXML scan "1". I do remember working with Agilent to request that
they expose at least a unique single ID for each scan through their
API, as it would make things easier in the TPP world; Chee-Hong and I
came up with this nativeScanRef system, and it may be one of the
things that is optional but not required in the later mzXML schemas.
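
A minimal sketch (Python; illustrative only, not a full mzXML parser, with
namespace handling kept generic) of recovering that scan-coordinate mapping
from an mzXML file:

```python
# Sketch: map mzXML scan numbers back to the original Agilent scan
# coordinates via the optional nativeScanRef element (mzXML 3.0+).
# The element layout follows the snippet above.
import xml.etree.ElementTree as ET

def native_scan_map(source):
    """Return {mzXML scan num: {coordinate name: value}} from a path or file object."""
    local = lambda tag: tag.rsplit("}", 1)[-1]  # ignore any XML namespace prefix
    mapping = {}
    for _, elem in ET.iterparse(source):
        if local(elem.tag) != "scan":
            continue
        num = int(elem.get("num"))
        for ref in elem.iter():
            if local(ref.tag) == "nativeScanRef":
                mapping[num] = {c.get("name"): c.get("value")
                                for c in ref if local(c.tag) == "coordinate"}
        elem.clear()  # keep memory bounded on large files
    return mapping
```

For the snippet above, this would map mzXML scan 1 back to Agilent scan 3560.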
-Natalie
On Fri, Feb 11, 2011 at 11:23 AM, Natalie Tasman
<[email protected]> wrote:
> Hello Ludovic, Matt (and probably Joe, whose comments I'd appreciate),
>
> I have not been involved with MS data conversion for some time now,
> and certainly direct anyone to use msconvert and the ProteoWizard
> project if they are using the newer standard, mzML.
>
> However, that said, some users still have compelling uses for mzXML.
> My first recommendation is that you consider using mzML if possible;
> as Matt describes, the Agilent MassHunter format is one which
> internally identifies scans with a coordinate system rather than
> with sequential scan numbers. It is of course a good thing to
> preserve those full coordinates so that your scans can be traced
> back to the originals in the raw files.
>
> On to mzXML. As the SPC is the originator of the mzXML format, we
> have to take the TPP's requirements as defining 'correct' mzXML. For
> some background, first:
>
> mzXML was a very successful *format*, but not a standard. There was a
> need for a fast, open format in the proteomics software community, and
> mzXML worked well. It was defined as needed by the SPC/ISB and other
> developers in Zurich. As such, it mainly served the needs of the TPP
> and other related software, but was picked up by many other projects
> as well.
>
> As an academic-based format, it changed often, and was documented
> fairly well for an academic project, but not completely. The mzXML
> usage document (very difficult to find, by the way:
> http://sashimi.sourceforge.net/schema_revision/mzXML_2.0/Doc/mzXML_2.0_tutorial.pdf),
> on page 12, spells out the assumption that mzXML scan numbers must
> occur in consecutive ascending order: "scan num (required): the
> scan number for the current scan element. The values of the num must
> start from 1 and increase sequentially!"
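>
> As a minimal sketch (Python; the regex-based parsing is a deliberate
> simplification, not a real mzXML reader), that requirement amounts
> to this check:

```python
# Sketch: check the TPP-style requirement that mzXML scan numbers
# start at 1 and increase sequentially.
import re

def scans_are_sequential(mzxml_text):
    """True iff the scan num attributes are exactly 1, 2, 3, ..."""
    nums = [int(n) for n in re.findall(r'<scan\b[^>]*\bnum="(\d+)"', mzxml_text)]
    return nums == list(range(1, len(nums) + 1))
```

> Files that fail this check may still be schema-valid, but they can
> trip up tools that assume consecutive numbering.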
>
> For mzWiff and Trapper (and maybe others) this requires artificially
> renumbering the scans. In some of the later (3.0+) revisions of
> mzXML, Chee-Hong (the excellent developer who gave us mzWiff) and I
> came up with some additional separate fields in mzXML for storing
> the original scan coordinates. However, mzXML being somewhat of an
> ad-hoc format, I don't think that these requirements were ever
> *strongly* encoded into the mzXML schema.
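>
> The renumbering idea can be sketched like this (Python; illustrative
> names only, not the converters' actual code):

```python
# Sketch: the renumbering idea used by converters like mzWiff and
# Trapper: emit consecutive mzXML scan numbers while remembering the
# original scan coordinate so it can be written out as a nativeScanRef.
def renumber(original_scan_ids):
    """Map each original scan id to a consecutive mzXML scan num (1-based)."""
    return {orig: new for new, orig in enumerate(original_scan_ids, start=1)}
```

> The original coordinate kept as the key is what would be written
> back out in the nativeScanRef element.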
>
> So, in short, I believe that msconvert's reluctance to generate
> sequential, consecutive scan numbers may still result in *valid*
> mzXML by the schema (if mzXML even validates; I'm honestly not sure,
> as some of the optional elements were difficult to encode
> grammatically in the XML schema, if I recall correctly). BUT the
> TPP, and really most other SPC/Aebersold tools, honestly expect and
> require those sequential scan numbers.
>
> On the other hand, you can get past this dilemma by switching to
> mzML if you like, which is a *standard*, not just a format. This
> means it is thoroughly curated, tested, and documented by a
> consortium of interested parties. mzML files contain MUCH more
> detailed metadata and are better for archival purposes. And
> pwiz/msconvert is certainly the authoritative software suite for all
> things mzML.
>
> All this said, it has been some time since I have worked on the SPC
> converters in any detail, or on the mzXML/mzML formats/standards, so
> perhaps the TPP team has relaxed this sequential mzXML scan
> requirement; I hope they correct me if so (Joe?). But the last time
> this came up it was still a requirement, and I think it explains why
> Ludovic is having trouble.
>
> Hope this helps with some background and explanation.
>
> Best wishes to all,
>
> Natalie
>
>
>
> On Fri, Feb 11, 2011 at 8:58 AM, Matthew Chambers
> <[email protected]> wrote:
>> Hi Ludovic,
>>
>> On 2/11/2011 2:19 AM, lgillet wrote:
>>>
>>> Dear Matt,
>>>
>>> My problem (and others' from our lab) is that, with the current
>>> version of msconvert, you almost cannot do anything with the converted
>>> Agilent data. For example, MzXML2Search spits out a "segmentation
>>> fault" error message as soon as one scan number exceeds 27'219 (i.e.
>>> if scan>27'220 it crashes; this probably has something to do with a
>>> single/double integer issue?). Second, our Sequest server
>>> (Sage-Sorcerer) also crashes on those files (the number of .dta files
>>> created from the mzXML is again limited to a well-defined scan number
>>> limit, and therefore very few spectra are actually searched).
>>
>> If this is true of MzXML2Search then it's a bug; I thought it had
>> actually been fixed. Thermo Velos instruments easily exceed 30000
>> spectra, and if it's an LTQ, you can double that to 60000 (DTAs).
>>
>>
>>> I don't know if I am making myself clear, but here are my comments:
>>>
>>> 1) Could you verify why msconvert behaves differently from Trapper
>>> (while they supposedly use the same Agilent libraries) when
>>> exporting the scan numbers? (Trapper performs the correct
>>> conversion by preserving the same scan numbering as the raw file.)
>>
>>>
>>> I do not know what the Agilent API does or does not do
>>> to the data, but what I can tell you is that the scans are indeed
>>> *consecutively* numbered (from 1 to 5'000 or more) in the raw data
>>> when you browse them with the Agilent MassHunter Qual software. So my
>>> guess is that there might still be something fishy about msconvert
>>> here. My understanding was that the former converter (Trapper, from
>>> Natalie Tasman) actually relied on the same Agilent API as well!
>>> Maybe Natalie could comment on that. And since Trapper preserved
>>> the proper numbering of the scans as in the raw data, something might
>>> have changed upon switching to msconvert.
>>
>> I'll quote my post to the psidev-ms mailing list from 6/30/2009:
>>>
>>> In the MassHunter API there are two ways to uniquely address a spectrum:
>>> by
>>> "row number" or "scan id". Row number is essentially a 0-based index
>>> that refers to the spectra after the acquisition software has done
>>> something...perhaps internal merging? Scan id represents the ordinal
>>> number of acquisitions as they come off the instrument. So, at least on
>>> their (Q)TOF instruments, the rowNumber is very disparate from the
>>> scanId, but both of them are unique identifiers that can technically be
>>> used to refer to a native spectrum. The kink is that the MassHunter API
>>> only refers to the parent scan by its scan id and doesn't provide a way
>>> to directly translate a scan id to a row number - translation must be
>>> done indirectly by enumerating all the row numbers and building a
>>> mapping of scan id to row number. For this reason I would recommend that
>>> the nativeID format be defined as "scanId=xsd:nonNegativeInteger" but
>>> I'm open to comment on this!
>>
>> This explains why we adopted scanId as the nativeID despite it not
>> being consecutive. It was not a strong reason for choosing one over
>> the other, but ids being consecutive matters even less.
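>>
>> The indirect translation described above can be sketched as follows
>> (Python, against a hypothetical stand-in reader object; the real
>> MassHunter API calls are different, and this shows only the shape of
>> the workaround):

```python
# Sketch: build a scanId -> rowNumber map by enumerating every row,
# since (as described above) the API only translates rowNumber -> scanId.
# `reader` with .row_count() and .scan_id() is a hypothetical stand-in,
# not the actual MassHunter API.
def build_scan_id_map(reader):
    return {reader.scan_id(row): row for row in range(reader.row_count())}

def row_for_scan_id(reader, scan_id):
    """Translate a parent scan's scanId to its row number."""
    return build_scan_id_map(reader)[scan_id]
```

>> In practice the map would be built once per file, since enumerating
>> every row per lookup would be quadratic.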
>>
>> However, if it's true that it's impossible to find a scan in MassHunter with
>> the scanId, that's a major issue of which I was unaware! That's a pretty
>> compelling reason to switch to the row number, but we've never had to change
>> a nativeID format before. We'll have to discuss it with Agilent and the
>> PSI-MS working group.
>>
>>
>>> 2) If it's not possible for you to fix msconvert in that respect,
>>> would it be possible to provide an option in msconvert to renumber
>>> the scans consecutively from 1 to the end? I guess such an option
>>> may one day be useful for other people and other applications
>>> anyway.
>>
>> Yes, it's possible to implement this, but as I said above there is an
>> imminent problem with your pipeline if you can't support scan numbers over
>> 27219. I have no idea why that number would be a threshold; 32767 is the max
>> for a signed 16-bit integer and 65535 is the max for unsigned. This should
>> be an easy bug to fix too (just changing the scan number data type). If the
>> 16-bit integer problems are fixed, is the consecutive option still
>> necessary?
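>>
>> For what it's worth, the classic failure mode looks like this
>> (Python; a general demonstration of 16-bit wrap-around, not
>> MzXML2Search's actual code, and it does not by itself explain the
>> odd 27219 figure):

```python
# Sketch: what happens when a large scan number is squeezed through a
# signed 16-bit integer field, the classic cause of such thresholds.
import struct

def as_int16(scan_num):
    """Round-trip a scan number through a signed 16-bit slot."""
    return struct.unpack('<h', struct.pack('<H', scan_num & 0xFFFF))[0]
```

>> Anything above 32767 wraps negative, which is exactly the kind of
>> value that can send downstream code off into a crash.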
>>
>> Hope this helps,
>> -Matt
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "spctools-discuss" group.
>> To post to this group, send email to [email protected].
>> To unsubscribe from this group, send email to
>> [email protected].
>> For more options, visit this group at
>> http://groups.google.com/group/spctools-discuss?hl=en.
>>
>>
>