The problem could be caused by a bad character like " appearing in one
of your protein descriptions in the database and breaking the XML
parsing.  Can you search your fasta database for occurrences of " ?
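
A minimal sketch of such a search, in case it helps. The function names are made up for illustration, and scanning for < > & as well as " is an assumption on my part (they are the other characters that commonly break XML when left unescaped); it is not taken from the TPP code.

```python
# Characters that can break XML if written unescaped into an attribute.
# Only '"' was asked about; the others are an added assumption.
SUSPECT_CHARS = '"<>&'


def find_suspect_headers(lines, suspect=SUSPECT_CHARS):
    """Yield (line_number, header_line) for FASTA headers containing suspect characters."""
    for lineno, line in enumerate(lines, start=1):
        if line.startswith(">"):
            # Skip the leading '>' that marks a FASTA header before testing.
            description = line[1:]
            if any(ch in description for ch in suspect):
                yield lineno, line.rstrip("\n")


def scan_fasta(path, suspect=SUSPECT_CHARS):
    """Return the suspect header lines found in the FASTA file at `path`."""
    with open(path, encoding="utf-8", errors="replace") as fasta:
        return list(find_suspect_headers(fasta, suspect))
```

Calling `scan_fasta` on the path to your database (whatever that is on your system) returns a list of line numbers and headers to inspect; pass `suspect='"'` to look for double quotes only.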

-David

On Wed, Mar 31, 2010 at 1:33 PM, Brian Pratt <[email protected]> wrote:
> Huh, that's all supposed to just work, assuming you have
> -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE
> in your gcc compilation command.
>
> Although you'd think that would be moot in a 64 bit world.
>
> On Wed, Mar 31, 2010 at 1:20 PM, Dave Trudgian <[email protected]> wrote:
>>
>> Brian,
>>
>> It was built from 4.3.1 source on Ubuntu Server 9.04 64-bit, using gcc
>> 4.3.3-5ubuntu4.
>>
>> As mentioned, I think the problem is due to the index file... in the
>> interact.pep.xml.index I have lines like:
>>
>> -2069230164     -2069226665     48      EIAIIPSKKLR
>> EIAIIPSK134.11K134.11LRI    PI00622165      0       9.19630.19630.3 0.0000
>>
>> ... where the first two values are offsets in the .pep.xml for that
>> peptide, which have overflowed into signed 32-bit int 2s complement negative
>> values.
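
The wraparound described above can be reproduced with a little arithmetic. This is only an illustrative sketch: the helper names are hypothetical, and it assumes each offset wrapped exactly once, i.e. the true offsets are below 4 GiB, which fits a file just over the 2GB boundary.

```python
INT32_MODULUS = 2**32      # number of distinct 32-bit values
INT32_MAX = 2**31 - 1      # largest signed 32-bit value (2147483647)


def wrap_to_int32(offset):
    """Reinterpret the low 32 bits of a non-negative offset as signed two's complement."""
    low = offset % INT32_MODULUS
    return low - INT32_MODULUS if low > INT32_MAX else low


def unwrap_once(stored):
    """Undo a single wraparound, assuming the true offset is below 4 GiB."""
    return stored + INT32_MODULUS if stored < 0 else stored


first = unwrap_once(-2069230164)
second = unwrap_once(-2069226665)
print(first, second, second - first)  # 2225737132 2225740631 3499
```

Both recovered values sit just above 2147483648 (2 GiB), which is consistent with the offsets having been stored in a signed 32-bit integer.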
>>
>> I believe from looking at the code that when the index file exists then
>> the PepXMLViewer will use it rather than doing a full expat parse of the
>> file. Hence it's ending up with nonsense offsets for peptide information
>> which are likely causing the errors. I'll have a look tomorrow at where the
>> index is created, and what integer types are being used. I suppose if an
>> unsigned int were used, that would be good for up to 4GB, and a long int
>> would give 2GB on 32-bit or much more on 64-bit systems.
>>
>> DT
>>
>>
>>
>> On 31/03/2010 18:36, Brian Pratt wrote:
>>
>> Do you know how pepXMLViewer.cgi was built?  It's meant to support large
>> files...
>>
>> On Wed, Mar 31, 2010 at 9:54 AM, dctrud <[email protected]> wrote:
>>>
>>> Hi Brian,
>>>
>>> I thought about out of memory conditions, but am running on 64-bit
>>> linux, and have 32GB of RAM, plus whilst running the cgi is using only
>>> a very small fraction of that.
>>>
>>> Looked again and the file is *just* over the 2GB boundary, looks like
>>> you're right, which has pointed me to the index file, which shows the
>>> integer offset values have overflowed.
>>>
>>> Many Thanks,
>>>
>>> DT
>>>
>>>
>>> On 31 Mar, 17:08, Brian Pratt <[email protected]> wrote:
>>> > My guess would be that the parser is trying to fail gracefully on an
>>> > out of
>>> > memory condition - it "forgets" part of the stream then is confused
>>> > when it
>>> > hits an unmatched closing tag.
>>> >
>>> > But that's just a guess.  Could also be about crossing the dread 2GB
>>> > file
>>> > size threshold.
>>> >
>>> > It's almost certainly about largeness, though.
>>> > Brian
>>> >
>>> > On Wed, Mar 31, 2010 at 6:38 AM, dctrud <[email protected]> wrote:
>>> > > All,
>>> >
>>> > > I'm having trouble with PepXMLViewer.cgi (4.3.1) on some very
>>> > > large .pep.xml files. The cgi will exit with the error:
>>> >
>>> > > error with spreadsheet printing: XML parsing error: not well-formed
>>> > > (invalid token), at xml file line 6298020, column 17
>>> >
>>> > > This is for an export to Excel, but similar errors will also occur
>>> > > when filtering the dataset in the web interface.
>>> >
>>> > > I've checked that the interact.pep.xml file is well formed with a
>>> > > python script that uses expat to parse it (as per the cgi), and there
>>> > > are no problems. Line 6298020 is the following end tag, which isn't
>>> > > an
>>> > > invalid token:
>>> >
>>> > > </modification_info>
>>> >
>>> > > I've also checked that none of the protein descriptions in the file
>>> > > contain < > " characters which could mess up the parsing earlier. Am
>>> > > now out of ideas of what could be the cause, and wondering if anyone
>>> > > has seen this problem, or has any ideas?
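
For anyone wanting to repeat the well-formedness check mentioned above, a sketch along these lines should do. It is not the script used in the original post; `check_well_formed` is a hypothetical name, and the only library relied on is Python's standard `xml.parsers.expat` module.

```python
import xml.parsers.expat


def check_well_formed(path):
    """Parse the file at `path` with expat.

    Returns None if the document is well formed, otherwise a
    (line, column, message) tuple describing the first error.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        # ParseFile reads the file in chunks, so a multi-gigabyte
        # document is not loaded into memory in one go.
        with open(path, "rb") as handle:
            parser.ParseFile(handle)
    except xml.parsers.expat.ExpatError as err:
        return err.lineno, err.offset, xml.parsers.expat.ErrorString(err.code)
    return None
```

A result of `None` for interact.pep.xml would point away from the file itself and towards how the viewer is reading it.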
>>> >
>>> > > Many Thanks,
>>> >
>>> > > DT
>>> >
>>> > > --
>>> > > You received this message because you are subscribed to the Google
>>> > > Groups
>>> > > "spctools-discuss" group.
>>> > > To post to this group, send email to
>>> > > [email protected].
>>> > > To unsubscribe from this group, send email to
>>> > >
>>> > > [email protected]<spctools-discuss%[email protected]>
>>> > > .
>>> > > For more options, visit this group at
>>> > > http://groups.google.com/group/spctools-discuss?hl=en.
