Well, I'll go ahead and modify the Mascot converter to emit proper XML for
protein names containing reserved XML characters, but it does sound like folks
would do well to make that <> / [] substitution upstream from the search engines.
The fact that the EBI IPI site does the substitution confirms my suspicion
that a number of tools might get munged up by this.
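
For illustration only, here is a minimal Python sketch of the escaping step
discussed in this thread. It is not the Mascot2XML code; the helper names
escape_attr and alternative_protein_tag are hypothetical, and the element is
cut down to the single protein attribute from Simon's example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def escape_attr(value):
    """Escape the five reserved XML characters for use in an attribute value."""
    # escape() replaces &, < and > by default; the extra map covers both quotes.
    return escape(value, {'"': "&quot;", "'": "&apos;"})


def alternative_protein_tag(protein):
    """Build a simplified, hypothetical <alternative_protein> element."""
    return '<alternative_protein protein="{}"/>'.format(escape_attr(protein))


# The protein name from the original report, with literal < and >.
tag = alternative_protein_tag("tr|Q9UMA4|IFN-<alpha>2")

# A conforming parser accepts the escaped tag and returns the original name.
parsed = ET.fromstring(tag)
```

The escaped attribute text no longer matches the FASTA header byte for byte,
but any XML parser hands back the original string, so only tools that read the
pepXML as raw text should be affected.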

Brian
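
A rough sketch of the upstream <> / [] substitution, assuming it is applied to
FASTA header lines before the database reaches the search engine. The function
name sanitize_fasta_line is hypothetical and this is not part of the TPP
distribution.

```python
def sanitize_fasta_line(line):
    """Replace < and > with [ and ] in a FASTA header line.

    The leading '>' that marks a header is kept; sequence lines are
    returned unchanged.
    """
    if not line.startswith(">"):
        return line
    body = line[1:].replace("<", "[").replace(">", "]")
    return ">" + body


# One of the IPI human v3.66 entries quoted below, written as a header line.
header = ">IPI00465120 Gene_Symbol=- 3<beta>-HSD <psi>1 protein"
cleaned = sanitize_fasta_line(header)
```

This only deals with the angle brackets. An & or a quote character in a header
would still need proper escaping when the XML is written.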

On Wed, Nov 11, 2009 at 12:17 PM, dctrud <[email protected]> wrote:

>
> Unfortunately the offending entries are present in commonly used
> public DBs. We recently bumped into exactly this problem, as there are
> 4 entries containing <xxxx> in the IPI human v3.66 fasta file:
>
> IPI00465120 Gene_Symbol=- 3<beta>-HSD <psi>1 protein
> IPI00816409 Gene_Symbol=- V<gamma>1 protein (Fragment)
> IPI00816761 Gene_Symbol=CREB1 <alpha>CREB-1 protein (Fragment)
> IPI00930475 Gene_Symbol=GUSB F<lambda>8 protein (Fragment)
>
> After hundreds of searches, a particular experiment happened to ID one
> of these proteins, causing problems with the tools. In the event I
> manually removed the problematic IDs as they were irrelevant for the
> experiment. We already re-write IPI headers after download of the
> FASTA, so will implement a substitution there if it crops up again.
> Should a substitution be added to the IPI retrieval utility scripts in
> the TPP distribution so that the problem doesn't show its face if
> they are being used?
>
> Interestingly, if you search on the EBI IPI site for these proteins
> the < > are substituted with [ ] , but the problematic characters are
> in the FASTA.
>
>
> http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-id+657mP1a3t7q+-e+[IPI:%27IPI00465120.3%27]+-qnum+1+-enum+1
>
> Cheers,
>
> DT
>
> On Nov 11, 7:52 pm, Brian Pratt <[email protected]> wrote:
> > Yes, one would want to escape everything properly - happily there's a
> > library call for that.  And certainly it's only right to emit valid XML.
> >
> > But I do think that it might be wisest to sidestep the whole mess - it's
> > valid FASTA but also unconventional (based on many years of TPP not
> > bumping into this), and even converted to valid XML I suspect it may
> > cause other problems downstream since it no longer exactly matches the
> > FASTA.  I suspect you're damned if you do and damned if you don't.
> >
> > Brian
> > On Wed, Nov 11, 2009 at 11:25 AM, Matthew Chambers <[email protected]> wrote:
> >
> > > What about the other reserved characters in XML that are valid in
> > > FASTA?
> > > "
> > > '
> > > &
> >
> > > Not escaping could also break downstream software - especially with &
> > > which should always begin an escape sequence. :(
> >
> > > -Matt
> >
> > > Brian Pratt wrote:
> > > > Granted, this is a defect - but that's still an unfortunate choice of
> > > > characters.  Even with the correction I can imagine this tripping
> > > > up other software downstream since the properly escaped XML would no
> > > > longer match the FASTA on a literal basis.  I don't suppose your
> > > > users could be induced to use { and } or [ and ] or ( and ) instead
> > > > of < and > ?
> >
> > > > Brian
> >
> > > > On Tue, Nov 10, 2009 at 9:43 PM, Simon Michnowicz <[email protected]> wrote:
> >
> > > >     Dear Group,
> >
> > > >     I would like to flag a possible bug in a TPP tool. (Sorry in
> > > >     advance if this is the wrong forum to report bugs.)
> >
> > > >     One of our users has reported issues with a TPP pepXML tool (he
> > > >     was using Mascot, so I assume he was using Mascot2XML.exe).
> >
> > > >     Our FASTA database has protein entries with special characters
> > > >     in them, e.g.
> >
> > > >     *IFN-<alpha>2*
> >
> > > >     *&*
> >
> > > >     *V<beta>14 *
> >
> > > >     This generated a pepXML file that was not valid XML, as the
> > > >     reserved characters were not escaped properly.
> >
> > > >     *<alternative_protein protein="tr|Q9UMA4|IFN-<alpha>2"
> > > >     num_tol_term="2" peptide_prev_aa="-" peptide_next_aa="S"/>*
> >
> > > >     regards
> >
> > > >     Simon Michnowicz
> > > >     Duty Programmer
> > > >     Australian Proteomics Computation Facility
> > > >     Ludwig Institute For Cancer Research
> > > >     Royal Melbourne Hospital,
> > > >     Victoria
> > > >     Tel: (+61 3) 9341 3155
> > > >     Fax: (+61 3) 9341 3104
> >
>

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"spctools-discuss" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to 
[email protected]
For more options, visit this group at 
http://groups.google.com/group/spctools-discuss?hl=en
-~----------~----~----~----~------~----~------~--~---
