Debugging this with a stand-alone Tika would certainly make things easier. There may be an issue in Tika or even in the parser implementation itself.
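To reproduce the problem outside Nutch, one way is to run the tika-app jar directly against one of the problem files (the jar version, heap size, and file name below are only examples; adjust them to your setup):

```shell
# Run Tika stand-alone against one of the problem files.
# Jar version, heap size, and file name are placeholders.
java -Xmx2g -jar tika-app-1.1.jar --text large-file.xlsx > out.txt
```

If this crashes with the same OOM, the problem is in Tika/POI rather than in the Nutch integration.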

On Wed, 11 Apr 2012 09:37:04 -0700 (PDT), "[email protected]" <[email protected]> wrote:
I'm running Nutch on large xlsx files (100-150 MB) and am running into problems and
questions in the parsing phase.
In some of the tests I use the Tika parser, and in others I use my own simple
parser, which does nothing more than call POI's XSSFEventBasedExcelExtractor.
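A minimal sketch of the kind of parser described above, built directly on POI's event-based (streaming) extractor; the class name and the command-line argument are illustrative, not the poster's actual code:

```java
import org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor;

// Hypothetical 3-4 line parser around POI's streaming extractor,
// as described in the message above. Takes the xlsx path as args[0].
public class SimpleXlsxParser {
    public static void main(String[] args) throws Exception {
        XSSFEventBasedExcelExtractor extractor =
                new XSSFEventBasedExcelExtractor(args[0]);
        String text = extractor.getText();
        System.out.println(text.length() + " characters extracted");
    }
}
```

The event-based extractor reads the sheet XML as a SAX stream instead of building the whole workbook model in memory, which is why it is the natural choice for files this large.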

I tried running it with a 2 GB heap and with a 4 GB heap; either way the result
is the same:
When parsing with Tika, it almost always crashes with either an "out of memory"
exception or "GC overhead limit exceeded".
When running the same configuration with my own parser, the success ratio is better; it still crashes some of the time, but it seems to be about 50-50.

1. The first question is: why is there such a difference between the two parsers? Sure, my parser is only 3-4 lines of code, but Tika's parser,
beneath all the abstractions and factories, does almost the same thing with
XSSFEventBasedExcelExtractor.

2. I noticed that even different runs of Tika's parser behave differently.
In one try I get an OOM after 10 minutes; in another I get it after two
hours!
When my parser succeeds, it usually takes about 30 minutes.
Any thoughts on this?

3. Bottom line: how can I get things to work if I do need to parse
large xlsx files?




--
View this message in context:
http://lucene.472066.n3.nabble.com/Having-trouble-running-nutch-on-large-xlsx-files-tp3903078p3903078.html
Sent from the Nutch - User mailing list archive at Nabble.com.
