I'm still stuck with this one. Some notes:

1. I've noticed that the simple old Nutch text parser handles large text files much better than Tika. Tika tries to apply DOM builders and XML content handlers to a plain text file; I can't see what the use is. Somewhere in the middle it spends a lot of time extracting the text from the file. Using a simple toString() to get the text seems to work better in my use case (large files).

2. After I solved the text-parser issue, I'm still having problems with very large HTML files. I haven't managed to solve this one yet, so in the meantime I identified the "problematic" files after a few runs and am just ignoring them for now.

3. I still haven't found a solution for the main, core problem: a parser thread cannot be interrupted or stopped while the ParseSegment job is running on a cluster in distributed mode. The job may have long since finished, but that task's process can still run and consume CPU and memory.
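For point 1, the idea in a minimal sketch: decode the raw fetched bytes straight to a string instead of routing a text/plain document through an XML/DOM-building parser. The class and method names here are hypothetical, not the actual Nutch plugin API:

```java
import java.nio.charset.StandardCharsets;

public class PlainTextExtractor {
    // Decode the raw fetched bytes directly; no DOM, no content handlers.
    // Assumes UTF-8 here -- a real implementation would sniff the charset
    // from the HTTP headers or the content metadata instead.
    public static String extractText(byte[] rawContent) {
        return new String(rawContent, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = "a very large plain text file".getBytes(StandardCharsets.UTF_8);
        System.out.println(extractText(raw));
    }
}
```

This does essentially what the old text parser did: one pass over the bytes, no intermediate tree, which is why it behaves so much better on large files.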
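For point 2, one blunt alternative to hand-maintaining a skip list is to cap how much content Nutch keeps per document, so oversized HTML never reaches the parser at full size. If I remember the property names right (check nutch-default.xml for your version), something like this in nutch-site.xml:

```xml
<!-- Truncate fetched content beyond ~1 MB so huge pages never reach
     the parser at full size. Property names as I recall them; verify
     against nutch-default.xml in your version. -->
<property>
  <name>http.content.limit</name>
  <value>1048576</value>
</property>
<property>
  <name>file.content.limit</name>
  <value>1048576</value>
</property>
```

Truncated HTML is usually still parseable (the parsers are tolerant of an abrupt end of input), but you obviously lose whatever text falls beyond the limit.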
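On point 3, the underlying issue is general Java behavior, not Nutch-specific: Future.cancel(true) only sets the interrupt flag, and a CPU-bound parser that never checks it will keep running after the job is done. A sketch of the pattern (a per-document time budget plus a cooperative cancellation check; parse() here is a stand-in, not real Nutch code):

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ParseWithTimeout {
    // Stand-in for a long parse loop. The Thread.interrupted() check is
    // the part that matters: without a cooperative check like this,
    // cancel(true) has no effect and the thread runs on, eating CPU.
    static String parse(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            if (Thread.interrupted()) {            // cooperative cancellation point
                throw new CancellationException("parse cancelled");
            }
            out.append(Character.toLowerCase(input.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> f = pool.submit(() -> parse("SOME LARGE DOCUMENT"));
        try {
            // Hard time budget per document.
            System.out.println(f.get(30, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            f.cancel(true); // interrupts; only works if parse() cooperates
        } finally {
            pool.shutdownNow();
        }
    }
}
```

If I recall correctly, newer Nutch versions expose a parser.timeout property that applies a similar per-document budget, so upgrading may be worth checking before rolling your own wrapper.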
-- View this message in context: http://lucene.472066.n3.nabble.com/parse-hangs-when-trying-to-parse-large-files-tp3998771p4000741.html Sent from the Nutch - User mailing list archive at Nabble.com.

