Unfortunately, the offending entries are present in commonly used public databases. We recently ran into exactly this problem: there are 4 entries containing <xxxx> in the IPI human v3.66 FASTA file:
IPI00465120 Gene_Symbol=- 3<beta>-HSD <psi>1 protein
IPI00816409 Gene_Symbol=- V<gamma>1 protein (Fragment)
IPI00816761 Gene_Symbol=CREB1 <alpha>CREB-1 protein (Fragment)
IPI00930475 Gene_Symbol=GUSB F<lambda>8 protein (Fragment)

After hundreds of searches, a particular experiment happened to identify one of these proteins, which caused problems with the tools. In this case I manually removed the problematic IDs, as they were irrelevant to the experiment. We already rewrite IPI headers after downloading the FASTA, so we will implement a substitution there if it crops up again.

Should a substitution be added to the IPI retrieval utility scripts in the TPP distribution, so that the problem doesn't show its face when they are used?

Interestingly, if you search the EBI IPI site for these proteins, the < > are substituted with [ ], but the problematic characters are in the FASTA.

http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-id+657mP1a3t7q+-e+[IPI:%27IPI00465120.3%27]+-qnum+1+-enum+1

Cheers,

DT

On Nov 11, 7:52 pm, Brian Pratt <[email protected]> wrote:
> Yes, one would want to escape everything properly - happily there's a
> library call for that. And certainly it's only right to emit valid XML.
>
> But I do think that it might be wisest to sidestep the whole mess - it's
> valid FASTA but also unconventional (based on many years of TPP not bumping
> into this), and even converted to valid XML I suspect it may cause other
> problems downstream since it no longer exactly matches the FASTA. I suspect
> you're damned if you do and damned if you don't.
>
> Brian
>
> On Wed, Nov 11, 2009 at 11:25 AM, Matthew Chambers <[email protected]> wrote:
>
> > What about the other reserved characters in XML that are valid in FASTA?
> > "
> > '
> > &
> >
> > Not escaping could also break downstream software - especially with &
> > which should always begin an escape sequence. :(
> >
> > -Matt
> >
> > Brian Pratt wrote:
> > > Granted, this is a defect - but that's still an unfortunate choice of
> > > characters. Even with the correction I can imagine this tripping
> > > up other software downstream since the properly escaped XML would no
> > > longer match the FASTA on a literal basis. I don't suppose your
> > > users could be induced to use { and } or [ and ] or ( and ) instead of
> > > < and > ?
> > >
> > > Brian
> > >
> > > On Tue, Nov 10, 2009 at 9:43 PM, Simon Michnowicz <[email protected]> wrote:
> > >
> > > > Dear Group,
> > > >
> > > > I would like to flag a possible bug in a TPP tool. (Sorry in
> > > > advance if this is the wrong forum to report bugs.)
> > > >
> > > > One of our users has reported issues with a TPP pepXML tool (he
> > > > was using Mascot, so I assume he was using Mascot2XML.exe).
> > > >
> > > > Our FASTA database has protein entries with special characters in
> > > > them, i.e.
> > > >
> > > > *IFN-<alpha>2*
> > > >
> > > > *&*
> > > >
> > > > *V<beta>14 *
> > > >
> > > > This generated a pepXML file that was not valid XML, as the tags
> > > > were not escaped properly.
> > > >
> > > > *<alternative_protein protein="tr|Q9UMA4|IFN-<alpha>2"
> > > > num_tol_term="2" peptide_prev_aa="-" peptide_next_aa="S"/>*
> > > >
> > > > regards
> > > >
> > > > Simon Michnowicz
> > > > Duty Programmer
> > > > Australian Proteomics Computation Facility
> > > > Ludwig Institute For Cancer Research
> > > > Royal Melbourne Hospital,
> > > > Victoria
> > > > Tel: (+61 3) 9341 3155
> > > > Fax: (+61 3) 9341 3104
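
For anyone landing on this thread later, here is a minimal sketch in Python of the two workarounds discussed above: escaping the protein name when it is written as an XML attribute, and substituting [ ] for < > in the FASTA header before searching. The helper names `protein_attr` and `bracket_header` are made up for illustration, and this is not how Mascot2XML or the TPP retrieval scripts are actually implemented; it only uses the Python standard library.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Hypothetical helper: swap angle brackets for square brackets in a FASTA
# header, mirroring what the EBI site shows for these entries.
_ANGLE_TO_SQUARE = str.maketrans("<>", "[]")


def bracket_header(header: str) -> str:
    return header.translate(_ANGLE_TO_SQUARE)


# Hypothetical helper: return the name as a quoted XML attribute value.
# quoteattr escapes &, < and > and picks a safe quote character.
def protein_attr(name: str) -> str:
    return quoteattr(name)


name = "tr|Q9UMA4|IFN-<alpha>2"

# Option 1: escape on output. The element is now well-formed XML.
element_text = f"<alternative_protein protein={protein_attr(name)}/>"
print(element_text)

# A conforming XML parser hands back the original FASTA string.
parsed = ET.fromstring(element_text)
print(parsed.get("protein") == name)

# Option 2: sidestep the problem by rewriting the header itself.
print(bracket_header(name))
```

With option 1, tools that read the pepXML through a real XML parser should get the literal FASTA name back, but anything that compares the raw file text against the FASTA will still see a mismatch, which is the downstream risk Brian describes. Option 2 avoids that at the cost of a database that no longer matches the public download.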
