David,
Have checked the .fasta for offending < > " characters, and there aren't
any present. As per my previous email, a separate expat-based parsing
script that checks the file is well-formed runs over it fine, so I am
fairly confident there aren't any problems in the XML file. The nonsense
offsets in the .pep.xml.index file look like the most obvious cause of
the hiccup.
Brian,
I just noticed that in PepXMLViewer/XMLNode.h offsets are defined as
follows:
int startOffset_;
int endOffset_;
.. and in PepXNode.h as:
int startOffset;
int endOffset;
I guess that this could be where the overflow is coming from. Most of the
other code uses 'long' types for offsets, which should be 64-bit when
compiled using g++ on 64-bit Linux, but I believe that 'int' types will
still be 32-bit.
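To make the int versus long point concrete, here is a minimal sketch that reports how wide a type is and the largest byte offset it can hold. The helper names widthInBits and maxOffset are made up for illustration and do not appear in the TPP source.

```cpp
#include <climits>
#include <limits>

// Hypothetical helper (not from the TPP code): width of an integer type
// in bits on whatever platform this is compiled for.
template <typename T>
constexpr int widthInBits() {
    return static_cast<int>(sizeof(T) * CHAR_BIT);
}

// Hypothetical helper: the largest byte offset a variable of type T can hold.
template <typename T>
constexpr unsigned long long maxOffset() {
    return static_cast<unsigned long long>(std::numeric_limits<T>::max());
}

// Where int is 32 bits wide, an int offset tops out at 2^31 - 1 bytes,
// i.e. one byte short of 2 GiB.
static_assert(widthInBits<int>() != 32 || maxOffset<int>() == 2147483647ULL,
              "a 32-bit signed int cannot address past 2 GiB");
```

With g++ on 64-bit Linux (the LP64 model), widthInBits<int>() is 32 and widthInBits<long>() is 64, so an offset member declared as int stops just short of 2 GiB while a long one does not.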
Also, the jumpParse method in PepXSAXHandler.cxx uses an int offset:
void SAXHandler::jumpParse(int offset) {
Getting late here, but tomorrow / this weekend I'll try changing int to
long on these remaining int offsets in the code and see if it does anything.
DT
On 31/03/2010 22:02, David Shteynberg wrote:
The problem could be caused by a bad character like " appearing in one
of your protein descriptions in the database and breaking the XML
parsing. Can you search your fasta database for occurrences of " ?
-David
On Wed, Mar 31, 2010 at 1:33 PM, Brian Pratt <[email protected]> wrote:
Huh, that's all supposed to just work, assuming you have
-D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE
in your gcc compilation command.
Although you'd think that would be moot in a 64 bit world.
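As a rough illustration of what those flags affect, the sketch below checks the width of off_t at compile time. It assumes a POSIX system; offsetBits and intCanHoldAnyOffset are made-up names, not part of the TPP code. The flags widen off_t and the file APIs that use it, but they do not change the width of any variable declared as plain int.

```cpp
// These macros must be defined before any system header is included.
// Normally they come from the compiler command line (-D...), so only
// define them here if they are not already set.
#ifndef _FILE_OFFSET_BITS
#define _FILE_OFFSET_BITS 64
#endif
#ifndef _LARGEFILE_SOURCE
#define _LARGEFILE_SOURCE
#endif

#include <sys/types.h>  // off_t (POSIX)
#include <climits>

// Fail the build if file offsets are not 64-bit.
static_assert(sizeof(off_t) * CHAR_BIT == 64,
              "off_t is not 64-bit; check the large-file compile flags");

// Hypothetical helper: width of off_t in bits.
constexpr int offsetBits() {
    return static_cast<int>(sizeof(off_t) * CHAR_BIT);
}

// Hypothetical helper: true only if a plain int is wide enough to store
// every value an off_t can take. The large-file flags do not alter this.
constexpr bool intCanHoldAnyOffset() {
    return sizeof(int) >= sizeof(off_t);
}
```

So even with large-file support enabled and working, an offset copied from an off_t or long into an int member can still be truncated.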
On Wed, Mar 31, 2010 at 1:20 PM, Dave Trudgian <[email protected]> wrote:
Brian,
It was built from 4.3.1 source on Ubuntu Server 9.04 64-bit, using gcc
4.3.3-5ubuntu4.
As mentioned, I think the problem is due to the index file... in the
interact.pep.xml.index I have lines like:
-2069230164 -2069226665 48 EIAIIPSKKLR
EIAIIPSK134.11K134.11LRI PI00622165 0 9.19630.19630.3 0.0000
... where the first two values are the start and end offsets in the
.pep.xml for that peptide, which have overflowed a signed 32-bit int and
wrapped around to negative two's complement values.
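A small sketch of the arithmetic, assuming each offset wrapped exactly once (i.e. the true position is below 4 GiB, which fits a file only just over 2GB). The function names wrapToInt32 and unwrapOnce are hypothetical and not taken from the TPP source.

```cpp
#include <cstdint>

// What storing a 64-bit offset in a 32-bit signed variable does in
// practice: keep the low 32 bits and read them as two's complement.
std::int32_t wrapToInt32(std::int64_t offset) {
    // Conversion to an unsigned type is well-defined (modulo 2^32).
    const std::uint32_t low = static_cast<std::uint32_t>(offset);
    if (low <= 0x7FFFFFFFu) {
        return static_cast<std::int32_t>(low);
    }
    // Top bit set: the signed reading is low - 2^32, which is negative.
    return static_cast<std::int32_t>(static_cast<std::int64_t>(low) - 4294967296LL);
}

// Undo a single wrap. ASSUMPTION: the true offset was below 4 GiB, so a
// negative stored value means exactly 2^32 was lost.
std::int64_t unwrapOnce(std::int32_t stored) {
    const std::int64_t wide = stored;
    return wide < 0 ? wide + 4294967296LL : wide;
}
```

Under that assumption, the two values in the index line above would correspond to byte positions 2225737132 and 2225740631, both a little past the 2147483647 limit of a signed 32-bit int, and 3499 bytes apart just as the negative values are.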
I believe from looking at the code that when the index file exists then
the PepXMLViewer will use it rather than doing a full expat parse of the
file. Hence it's ending up with nonsense offsets for peptide information
which are likely causing the errors. I'll have a look tomorrow at where the
index is created, and what integer types are being used. I suppose if an
unsigned int were used then that'd be good for up to 4GB, and a long int
would give 2GB on 32-bit (being signed) or much more on 64-bit systems.
DT
On 31/03/2010 18:36, Brian Pratt wrote:
Do you know how pepXMLViewer.cgi was built? It's meant to support large
files...
On Wed, Mar 31, 2010 at 9:54 AM, dctrud <[email protected]> wrote:
Hi Brian,
I thought about out-of-memory conditions, but I am running on 64-bit
Linux with 32GB of RAM, and while running, the cgi uses only a very
small fraction of that.
I looked again and the file is *just* over the 2GB boundary, so it looks
like you're right. That pointed me to the index file, which shows that
the integer offset values have overflowed.
Many Thanks,
DT
On 31 Mar, 17:08, Brian Pratt <[email protected]> wrote:
My guess would be that the parser is trying to fail gracefully on an
out of memory condition - it "forgets" part of the stream then is
confused when it hits an unmatched closing tag.
But that's just a guess. Could also be about crossing the dread 2GB
file size threshold.
It's almost certainly about largeness, though.
Brian
On Wed, Mar 31, 2010 at 6:38 AM, dctrud <[email protected]> wrote:
All,
I'm having trouble with PepXMLViewer.cgi (4.3.1) on some very
large .pep.xml files. The cgi will exit with the error:
error with spreadsheet printing: XML parsing error: not well-formed
(invalid token), at xml file line 6298020, column 17
This is for an export to Excel, but similar errors will also occur
when filtering the dataset in the web interface.
I've checked that the interact.pep.xml file is well formed with a
python script that uses expat to parse it (as per the cgi), and there
are no problems. Line 6298020 is the following end tag, which isn't
an invalid token:
</modification_info>
I've also checked that none of the protein descriptions in the file
contain < > " characters which could mess up the parsing earlier. Am
now out of ideas of what could be the cause, and wondering if anyone
has seen this problem, or has any ideas?
Many Thanks,
DT
--
You received this message because you are subscribed to the Google Groups
"spctools-discuss" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to
[email protected].
For more options, visit this group at
http://groups.google.com/group/spctools-discuss?hl=en.