I'm still stuck with this one.

Some notes:
1. I've noticed that the simple old Nutch text parser handles large text
files much better than Tika.
Tika tries to apply DOM builders and XML content builders even to a simple
text file; I can't see the use of that.
Somewhere in the middle it takes a lot of time to extract the text from the
file. Using a simple toString() to get the text works better in my use
case (large files).
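To illustrate the idea (not Nutch's actual parser code, just a sketch): for content that is already plain text, the bytes can be decoded directly instead of going through a DOM/XHTML handler pipeline. The class and method names here are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class PlainTextExtract {
    // Hypothetical sketch: for plain-text content, skip the DOM/XML
    // content-builder machinery and decode the raw bytes directly.
    static String extractText(byte[] content) {
        return new String(content, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = "hello large file".getBytes(StandardCharsets.UTF_8);
        System.out.println(extractText(raw)); // prints "hello large file"
    }
}
```

This avoids building any intermediate document tree, which is where the time seemed to go on large files.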
2. After solving the text-parser issue, I'm still having problems with very
large HTML files.
I haven't managed to solve this one yet, so in the meantime I identified
the "problematic" files after a few runs, and I'm simply ignoring them for
now.
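The "ignore them" workaround can be done generically with a size threshold, so the problematic files never reach the parser. The threshold value and method name here are hypothetical, not a Nutch setting.

```java
public class SkipLarge {
    // Hypothetical cutoff; tune to whatever size starts causing hangs.
    static final long MAX_BYTES = 1_000_000;

    // Decide up front whether a document is worth handing to the parser.
    static boolean shouldParse(long contentLength) {
        return contentLength <= MAX_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(shouldParse(500_000));   // true
        System.out.println(shouldParse(5_000_000)); // false
    }
}
```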
3. I still haven't found a solution for the main, core problem: a parser
thread cannot be interrupted or stopped while a parserSegment job is
running on a cluster in distributed mode.
The job may be long finished, but the process for that task can still be
running and consuming CPU and memory.
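The usual pattern for bounding a runaway task is to run it on its own thread with a deadline, along these lines (a generic sketch, not Nutch's actual code). Note the catch: cancel(true) only delivers an interrupt, so a parser that never checks its interrupt status keeps running anyway, which is exactly the behavior described here.

```java
import java.util.concurrent.*;

public class ParseWithTimeout {
    // Hypothetical wrapper: run a parse task and give up after a deadline.
    static String runWithTimeout(Callable<String> task, long timeoutMs) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> result = pool.submit(task);
        try {
            return result.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Only sends an interrupt; a non-cooperative parser ignores it
            // and keeps holding CPU and memory.
            result.cancel(true);
            return "timed out";
        } catch (Exception e) {
            return "failed";
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A fast task finishes; a slow one is abandoned after 100 ms.
        System.out.println(runWithTimeout(() -> "parsed", 1000));
        System.out.println(runWithTimeout(() -> {
            Thread.sleep(5000);
            return "late";
        }, 100));
    }
}
```

Here Thread.sleep() responds to the interrupt, so the slow task actually stops; a parse loop that never polls Thread.interrupted() would not.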
--
View this message in context: 
http://lucene.472066.n3.nabble.com/parse-hangs-when-trying-to-parse-large-files-tp3998771p4000741.html
Sent from the Nutch - User mailing list archive at Nabble.com.