Debugging this with a stand-alone Tika would certainly make things easier. There may be an issue in Tika or even in the parser implementation itself.
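To reproduce the problem outside Nutch, one way is to run the tika-app jar directly against one of the problem files (the jar version, heap size, and file name below are only examples; adjust them to your setup):

```shell
# Run Tika stand-alone against one of the problem files.
# Jar version, heap size, and file name are placeholders.
java -Xmx2g -jar tika-app-1.1.jar --text large-file.xlsx > out.txt
```

If this crashes with the same OOM, the problem is in Tika/POI rather than in the Nutch integration.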

On Wed, 11 Apr 2012 09:37:04 -0700 (PDT), "[email protected]" <[email protected]> wrote:
I'm running Nutch on large xlsx files (100-150 MB) and am running into problems and
questions in the parsing phase.
In some of the tests I use the Tika parser, and in others I use my own simple
parser, which does nothing more than call POI's XSSFEventBasedExcelExtractor.
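A minimal sketch of the kind of parser described above, built directly on POI's event-based (streaming) extractor; the class name and the command-line argument are illustrative, not the poster's actual code:

```java
import org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor;

// Hypothetical 3-4 line parser around POI's streaming extractor,
// as described in the message above. Takes the xlsx path as args[0].
public class SimpleXlsxParser {
    public static void main(String[] args) throws Exception {
        XSSFEventBasedExcelExtractor extractor =
                new XSSFEventBasedExcelExtractor(args[0]);
        String text = extractor.getText();
        System.out.println(text.length() + " characters extracted");
    }
}
```

The event-based extractor reads the sheet XML as a SAX stream instead of building the whole workbook model in memory, which is why it is the natural choice for files this large.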

I tried running it with a 2 GB heap and with a 4 GB heap; either way the result
is the same:
When parsing with Tika, it almost always crashes with either an "out of memory"
exception or "GC overhead limit exceeded".
When running the same configuration with my own parser, the success ratio is better; it still crashes some of the time, but it seems to be about 50-50.

1. The first question is: why is there such a difference between the two parsers? Sure, my parser is only 3-4 lines of code, but Tika's parser,
beneath all the abstractions and factories, does almost the same thing with
XSSFEventBasedExcelExtractor.

2. I noticed that even different runs of Tika's parser behave differently.
In one try I get an OOM after 10 minutes; in another I get it after two
hours!
When my parser succeeds, it usually takes about 30 minutes.
Any thoughts on this?

3. Bottom line: how can I get things to work if I do need to parse
large xlsx files?




--
View this message in context:
http://lucene.472066.n3.nabble.com/Having-trouble-running-nutch-on-large-xlsx-files-tp3903078p3903078.html
Sent from the Nutch - User mailing list archive at Nabble.com.
