Well, I'll go ahead and modify the Mascot converter to emit proper XML for proteins whose names contain reserved XML characters, but it does sound like folks would do well to make that <> to [] substitution upstream of the search engines. The fact that the EBI IPI site does the substitution confirms my suspicion that a number of tools might get munged up by this.
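For anyone who wants to try either approach in their own scripts, here is a minimal Python sketch. It is not the actual Mascot2XML change, and the helper names `bracket_angle_tags` and `alternative_protein_element` are made up for illustration. It assumes the only goal is (a) rewriting `<name>` as `[name]` in FASTA header lines before searching, and (b) escaping a protein name when it is written as an XML attribute value.

```python
import re
from xml.sax.saxutils import quoteattr

# Hypothetical helpers for illustration; not part of the TPP or Mascot2XML.
_ANGLE_TAG = re.compile(r"<([^<>]*)>")


def bracket_angle_tags(header: str) -> str:
    """Rewrite <name> as [name] in a FASTA header line.

    Only matched <...> pairs are touched, so the leading '>' that
    starts a FASTA header is left alone.
    """
    return _ANGLE_TAG.sub(r"[\1]", header)


def alternative_protein_element(protein: str) -> str:
    """Build a pepXML-style element with the protein name safely escaped."""
    # quoteattr adds the surrounding quotes and escapes &, < and >.
    return f"<alternative_protein protein={quoteattr(protein)}/>"


print(bracket_angle_tags(">IPI00465120 Gene_Symbol=- 3<beta>-HSD <psi>1 protein"))
# >IPI00465120 Gene_Symbol=- 3[beta]-HSD [psi]1 protein

print(alternative_protein_element("tr|Q9UMA4|IFN-<alpha>2"))
# <alternative_protein protein="tr|Q9UMA4|IFN-&lt;alpha&gt;2"/>
```

The second form is valid XML, but the escaped string no longer matches the FASTA header literally unless the reader unescapes it, which is why doing the first substitution upstream may be the less painful route.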
Brian

On Wed, Nov 11, 2009 at 12:17 PM, dctrud <[email protected]> wrote:
>
> Unfortunately the offending entries are present in commonly used
> public DBs. We recently bumped into exactly this problem, as there are
> 4 entries containing <xxxx> in the IPI human v3.66 fasta file:
>
> IPI00465120 Gene_Symbol=- 3<beta>-HSD <psi>1 protein
> IPI00816409 Gene_Symbol=- V<gamma>1 protein (Fragment)
> IPI00816761 Gene_Symbol=CREB1 <alpha>CREB-1 protein (Fragment)
> IPI00930475 Gene_Symbol=GUSB F<lambda>8 protein (Fragment)
>
> After hundreds of searches, a particular experiment happened to ID one
> of these proteins, causing problems with the tools. In the event I
> manually removed the problematic IDs as they were irrelevant for the
> experiment. We already re-write IPI headers after download of the
> FASTA, so will implement a substitution there if it crops up again.
> Should a substitution be added to the IPI retrieval utility scripts in
> the TPP distribution so that the problem doesn't show its face if
> they are being used?
>
> Interestingly, if you search on the EBI IPI site for these proteins
> the < > are substituted with [ ] , but the problematic characters are
> in the FASTA.
>
> http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-id+657mP1a3t7q+-e+[IPI:%27IPI00465120.3%27]+-qnum+1+-enum+1
>
> Cheers,
>
> DT
>
> On Nov 11, 7:52 pm, Brian Pratt <[email protected]> wrote:
> > Yes, one would want to escape everything properly - happily there's a
> > library call for that. And certainly it's only right to emit valid XML.
> >
> > But I do think that it might be wisest to sidestep the whole mess - it's
> > valid FASTA but also unconventional (based on many years of TPP not bumping
> > into this), and even converted to valid XML I suspect it may cause other
> > problems downstream since it no longer exactly matches the FASTA. I suspect
> > you're damned if you do and damned if you don't.
> >
> > Brian
> >
> > On Wed, Nov 11, 2009 at 11:25 AM, Matthew Chambers <[email protected]> wrote:
> > > What about the other reserved characters in XML that are valid in FASTA?
> > > "
> > > '
> > > &
> > >
> > > Not escaping could also break downstream software - especially with &
> > > which should always begin an escape sequence. :(
> > >
> > > -Matt
> > >
> > > Brian Pratt wrote:
> > > > Granted, this is a defect - but that's still an unfortunate choice of
> > > > characters. Even with the correction I can imagine this tripping
> > > > up other software downstream since the properly escaped XML would no
> > > > longer match the FASTA on a literal basis. I don't suppose your
> > > > users could be induced to use { and } or [ and ] or ( and ) instead of
> > > > < and > ?
> > > >
> > > > Brian
> > > >
> > > > On Tue, Nov 10, 2009 at 9:43 PM, Simon Michnowicz
> > > > <[email protected]> wrote:
> > > >
> > > >     Dear Group,
> > > >
> > > >     I would like to flag a possible bug in a TPP tool. (Sorry in
> > > >     advance if this is the wrong forum to report bugs).
> > > >
> > > >     One of our users has reported issues with a tpp pepXML tool (he
> > > >     was using Mascot so I assume he was using Mascot2XML.exe).
> > > >
> > > >     Our FASTA database has protein entries with special characters in
> > > >     them, i.e.
> > > >
> > > >     *IFN-<alpha>2*
> > > >
> > > >     *&*
> > > >
> > > >     *V<beta>14 *
> > > >
> > > >     This generated a pepXML file that was not valid xml, as the tags
> > > >     were not escaped properly.
> > > >
> > > >     *<alternative_protein protein="tr|Q9UMA4|IFN-<alpha>2"
> > > >     num_tol_term="2" peptide_prev_aa="-" peptide_next_aa="S"/>*
> > > >
> > > >     regards
> > > >
> > > >     Simon Michnowicz
> > > >     Duty Programmer
> > > >     Australian Proteomics Computation Facility
> > > >     Ludwig Institute For Cancer Research
> > > >     Royal Melbourne Hospital,
> > > >     Victoria
> > > >     Tel: (+61 3) 9341 3155
> > > >     Fax: (+61 3) 9341 3104

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "spctools-discuss" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/spctools-discuss?hl=en
-~----------~----~----~----~------~----~------~--~---
