Hello Ludovic, Matt (and probably Joe, whose comments I'd appreciate),

I have not been involved with MS data conversion for some time now, and I would certainly direct anyone to use msconvert and the ProteoWizard project if they are using the newer standard, mzML.
That said, some users still have compelling uses for mzXML. My first recommendation is to think about using mzML if possible; as Matt describes, the Agilent MassHunter format is one that internally identifies scans with a coordinate system rather than with sequential scan numbers. It is of course a good thing to keep those full coordinates so that you can track your scans back to the originals in the raw files.

On to mzXML. As the SPC is the originator of the mzXML format, we have to take the TPP's requirements as defining 'correct' mzXML.

For some background, first: mzXML was a very successful *format*, but not a standard. There was a need for a fast, open format in the proteomics software community, and mzXML worked well. It was defined as needed by the SPC/ISB and other developers in Zurich. As such, it mainly served the needs of the TPP and other related software, but was picked up by many other projects as well. As an academic-based format, it often changed, and was documented fairly well for an academic project, but not completely.

The mzXML usage document (very difficult to find, by the way: http://sashimi.sourceforge.net/schema_revision/mzXML_2.0/Doc/mzXML_2.0_tutorial.pdf), on page 12, spells out the assumption that mzXML scan numbers must be consecutive and ascending:

"scan num (required): the scan number for the current scan element. The values of the num must start from 1 and increase sequentially!"

For mzWiff and Trapper (and maybe others) this requires artificially renumbering the scans. In some of the later (3.0+) revisions of mzXML, Chee-Hong (the excellent developer who gave us mzWiff) and I came up with some additional separate fields in mzXML for storing the original scan coordinates. However, mzXML being somewhat of an ad-hoc format, I don't think that these requirements were *strongly* encoded into the mzXML schema.
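To make the tutorial's requirement concrete, here is a minimal sketch of what "start from 1 and increase sequentially" means, and of renumbering while keeping a way back to the original ids. The function names and the sample ids are made up for illustration; this is not code from the TPP, Trapper, mzWiff, or msconvert.

```python
from typing import Dict, List, Tuple


def is_sequential(scan_nums: List[int]) -> bool:
    """True if the scan numbers are exactly 1, 2, 3, ... with no gaps."""
    return all(num == i for i, num in enumerate(scan_nums, start=1))


def renumber(original_ids: List[int]) -> Tuple[List[int], Dict[int, int]]:
    """Assign consecutive numbers in file order and keep a map back
    from each new number to the original (vendor) scan id."""
    new_nums = list(range(1, len(original_ids) + 1))
    back_map = dict(zip(new_nums, original_ids))
    return new_nums, back_map


# Hypothetical non-consecutive scan ids, as a vendor API might report them.
vendor_ids = [1047, 2210, 3391, 27220]
print(is_sequential(vendor_ids))      # False
new_nums, back_map = renumber(vendor_ids)
print(new_nums)                       # [1, 2, 3, 4]
print(back_map[4])                    # 27220
```

The back map plays the role of the extra fields mentioned above: the renumbered file satisfies the sequential assumption, while each scan can still be traced to its original coordinate.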
So in short, I believe that msconvert's reluctance to generate sequential consecutive scan numbers may still result in *valid* mzXML, by the schema (if mzXML even validates; I'm honestly not sure, since some of the optional elements were difficult to encode grammatically in the XML schema, if I recall). BUT the TPP, and most other SPC/Aebersold tools, really, honestly expect and require those sequential scan numbers.

On the other hand, you can get past this dilemma by switching to mzML if you like, which is a *standard*, not just a format. This means it is thoroughly curated, tested, and documented by a consortium of interested parties. mzML files contain MUCH more detailed metadata and are better for archival purposes. And pwiz/msconvert is certainly the authoritative software suite for all things mzML.

All this said, it has been some time since I have worked in any detail on the SPC converters or on the mzXML/mzML formats/standards, so perhaps the TPP team has relaxed this sequential mzXML scan requirement, and I hope they correct me if so (Joe?). But the last time this came up it was still a requirement, and I think it explains why Ludovic is having trouble.

Hope this helps with some background and explanation.

Best wishes to all,

Natalie

On Fri, Feb 11, 2011 at 8:58 AM, Matthew Chambers <[email protected]> wrote:
> Hi Ludovic,
>
> On 2/11/2011 2:19 AM, lgillet wrote:
>>
>> Dear Matt,
>>
>> My problem (and others' from our lab) is that, with the current
>> version of msconvert, you almost cannot do anything with the converted
>> Agilent data. For example, MzXML2Search splits out a "segmentation
>> fault" error message as soon as one scan number exceed 27'219 (i.e. if
>> scan>27'220 it crashes; this probably has something to do with single/
>> double integers stuff?).
>> Second, our Sequest server (Sage-Sorcerer)
>> also crashes on those files (the number of .dta files created from the
>> mzXML are again very much limited to a well defined scan number limit
>> and therefore very few spectra are actually searched).
>
> If this is true of MzXML2Search then it's a bug. I thought it was fixed
> actually. Thermo Velos instruments easily exceed 30000 spectra. And if it's
> LTQ, you double that to 60000 (DTAs).
>
>
>> I don't know if I make myself clear but here are my comments:
>>
>> 1) could you verify why msconvert is behaving differently than Trapper
>> (while they supposedly use the same Agilent libraries) when exporting
>> the scan numbers (Trapper performing the correct conversion by
>> conserving the same scan numbering as the raw file)
>
>>
>> I do not know what the Agilent API does or not
>> to the data, but what I can tell is that the scan are indeed
>> *consecutively* numbered (from 1 till 5'000 or more) in the raw data
>> when you browse them with the Agilent MassHunter Qual software. So my
>> guess is that there might still be something fishy about msconvert
>> here. My understanding was that the former converter (Trapper from
>> Natalie Tasman) was actually relying on the same Agilent API as well!
>> Maybe Natalie could comment on that. And since Trapper was conserving
>> the proper numbering of the scan as in the raw data, something might
>> have changed upon switching to msconvert.
>
> I'll quote my post to the psidev-ms mailing list from 6/30/2009:
>>
>> In the MassHunter API there are two ways to uniquely address a spectrum: by
>> "row number" or "scan id". Row number is essentially a 0-based index
>> that refers to the spectra after the acquisition software has done
>> something...perhaps internal merging? Scan id represents the ordinal
>> number of acquisitions as they come off the instrument.
>> So, at least on
>> their (Q)TOF instruments, the rowNumber is very disparate from the
>> scanId, but both of them are unique identifiers that can technically be
>> used to refer to a native spectrum. The kink is that the MassHunter API
>> only refers to the parent scan by its scan id and doesn't provide a way
>> to directly translate a scan id to a row number - translation must be
>> done indirectly by enumerating all the row numbers and building a
>> mapping of scan id to row number. For this reason I would recommend that
>> the nativeID format be defined as "scanId=xsd:nonNegativeInteger" but
>> I'm open to comment on this!
>
> This explains why we adopted scanId to be used as the nativeID despite it
> not being consecutive. It was not a strong reason for choosing one over the
> other, but ids being consecutive means even less.
>
> However, if it's true that it's impossible to find a scan in MassHunter with
> the scanId, that's a major issue of which I was unaware! That's a pretty
> compelling reason to switch to the row number, but we've never had to change
> a nativeID format before. We'll have to discuss it with Agilent and the
> PSI-MS working group.
>
>
>> 2) If that's not possible for you to fix msconvert in that respect,
>> would it be possible to provide an option in msconvert in order to
>> renumber the scan consecutively from 1 till the end. I guess such
>> option may anyway one day be useful for other people for other
>> applications.
>
> Yes, it's possible to implement this, but as I said above there is an
> imminent problem with your pipeline if you can't support scan numbers over
> 27219. I have no idea why that number would be a threshold; 32767 is the max
> for a signed 16-bit integer and 65535 is the max for unsigned. This should
> be an easy bug to fix too (just changing the scan number data type). If the
> 16-bit integer problems are fixed, is the consecutive option still
> necessary?
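As a quick check of the limits Matt quotes, here is a small sketch using Python's standard struct module to test which scan numbers fit in a 16-bit field. The helper name fits is made up for illustration, and nothing here is taken from MzXML2Search; it only shows that 27219/27220 is not where a plain 16-bit integer would overflow.

```python
import struct

INT16_MAX = 2**15 - 1     # 32767, largest signed 16-bit value
UINT16_MAX = 2**16 - 1    # 65535, largest unsigned 16-bit value


def fits(scan_num: int, fmt: str) -> bool:
    """True if scan_num can be packed with the given struct format code
    ("<h" = signed 16-bit, "<H" = unsigned 16-bit)."""
    try:
        struct.pack(fmt, scan_num)
        return True
    except struct.error:
        return False


# Columns: scan number, fits in signed 16-bit, fits in unsigned 16-bit.
for n in (27219, 27220, INT16_MAX, INT16_MAX + 1, UINT16_MAX, UINT16_MAX + 1):
    print(n, fits(n, "<h"), fits(n, "<H"))
```

Both 27219 and 27220 fit in either 16-bit type, so whatever limit the pipeline is hitting would seem to come from somewhere other than a simple 16-bit scan number.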
>
> Hope this helps,
> -Matt
>
> --
> You received this message because you are subscribed to the Google Groups
> "spctools-discuss" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/spctools-discuss?hl=en.
>
>
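For reference, the indirect scan id to row number translation described in Matt's quoted psidev-ms post can be sketched as below. This is only an illustration under assumptions: scan_id_for_row is a hypothetical stand-in for whatever call a vendor reader provides to get the scan id of a given row, and the sample ids are invented; no real MassHunter API names are used.

```python
from typing import Callable, Dict


def build_scan_id_index(
    row_count: int, scan_id_for_row: Callable[[int], int]
) -> Dict[int, int]:
    """Enumerate every 0-based row once and record scanId -> rowNumber."""
    index: Dict[int, int] = {}
    for row in range(row_count):
        scan_id = scan_id_for_row(row)
        if scan_id in index:
            # Scan ids are expected to be unique identifiers.
            raise ValueError(f"duplicate scan id {scan_id}")
        index[scan_id] = row
    return index


# Hypothetical data standing in for the vendor reader.
fake_scan_ids = [1047, 2210, 3391, 27220]
index = build_scan_id_index(len(fake_scan_ids), lambda row: fake_scan_ids[row])
print(index[3391])    # 2
```

Building the index costs one pass over all rows, after which a parent scan referenced by scan id can be looked up directly.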
