Debugging this with a stand-alone Tika would certainly make things
easier. There may be an issue in Tika or even in the parser
implementation itself.
On Wed, 11 Apr 2012 09:37:04 -0700 (PDT), "[email protected]"
<[email protected]> wrote:
I'm running Nutch on large xlsx files (100-150 MB) and am hitting
problems and questions in the parsing phase.
In some of the tests I use the Tika parser, and in some I use my own
simple parser, which does nothing but call POI's
XSSFEventBasedExcelExtractor.
I tried running with a 2 GB heap and with a 4 GB heap; either way the
result is: when parsing with Tika, it crashes almost always with
either an "out of memory" exception or "GC overhead limit exceeded";
when running the same config with my own parser, it has a better
success ratio — it still crashes some of the time, but it seems like
it's 50-50.
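For reference, a parser of the kind described — a few lines around POI's XSSFEventBasedExcelExtractor — might look like the following sketch. The class name and path handling are illustrative, and it assumes the POI ooxml jars are on the classpath:

```java
import org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor;

public class SimpleXlsxParser {
    public static void main(String[] args) throws Exception {
        // Event-based extraction streams the sheet XML via SAX instead of
        // building the full workbook object model, which keeps heap usage
        // lower than the DOM-based XSSF extractor.
        XSSFEventBasedExcelExtractor extractor =
                new XSSFEventBasedExcelExtractor(args[0]);
        System.out.println(extractor.getText());
    }
}
```

Even with the streaming extractor, getText() still accumulates the whole extracted text in memory, which on a 100-150 MB xlsx can itself be a large allocation.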
1. The first question is: why is there such a difference between the
two parsers? Sure, my parser is only 3-4 lines of code, but Tika's
parser, beneath all the abstractions and factories, does almost the
same thing with XSSFEventBasedExcelExtractor.
2. I noticed that even different runs of Tika's parser behave
differently. In one try I get an OOM after 10 minutes; in another I
get it after two hours! When my parser succeeds, it usually takes
about 30 minutes. Any thoughts on this?
3. Bottom line: how can I get things to work if I do need to parse
large xlsx files?
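Not a complete answer, but two knobs commonly tried when chasing this kind of OOM, sketched below. The heap value and jar name are illustrative; NUTCH_HEAPSIZE is the variable read by bin/nutch, and the -XX flag is a standard HotSpot option:

```shell
# Raise the heap available to the Nutch scripts (value is in MB,
# read by bin/nutch; 4000 here is illustrative, not a recommendation).
export NUTCH_HEAPSIZE=4000

# For a stand-alone Tika test of the same file, the equivalent plain
# JVM flags: -Xmx grows the heap, and -XX:-UseGCOverheadLimit turns
# "GC overhead limit exceeded" back into a plain OutOfMemoryError so
# the run fails at the real allocation site instead of in the GC check.
java -Xmx4g -XX:-UseGCOverheadLimit -jar tika-app.jar --text large.xlsx
```

Reproducing the crash with stand-alone Tika this way, as suggested at the top of the thread, also isolates whether the memory is going to the parser or to Nutch's own machinery.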
--
View this message in context:
http://lucene.472066.n3.nabble.com/Having-trouble-running-nutch-on-large-xlsx-files-tp3903078p3903078.html
Sent from the Nutch - User mailing list archive at Nabble.com.