A heads-up for anybody else using Tika to parse a wide range of PDFs:
I've run into several hangs caused by this issue:
https://issues.apache.org/jira/browse/PDFBOX-541
It's been fixed in PDFBox trunk, from what I can see, but not in the
0.8-incubating jar that Tika is currently using.
I don't see snapshot builds of PDFBox in the Apache Maven repo, so for
now I'm going to build from trunk and override the Tika dependency.
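
For anyone wanting to do the same, here is a rough sketch of what the
override might look like in a project's pom.xml. It assumes the trunk build
installs a jar into the local Maven repository, that Tika pulls in PDFBox
through the tika-parsers artifact, and that the coordinates are
org.apache.tika:tika-parsers and org.apache.pdfbox:pdfbox. Both version
values are placeholders, so check them against your own setup.

```xml
<!-- Sketch only: coordinates and versions are assumptions, not taken from the post. -->
<properties>
  <!-- Placeholder: the Tika release your project already uses. -->
  <tika.version>YOUR_TIKA_VERSION</tika.version>
  <!-- Placeholder: whatever version your local PDFBox trunk build installs. -->
  <pdfbox.trunk.version>YOUR_LOCAL_PDFBOX_VERSION</pdfbox.trunk.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>${tika.version}</version>
    <exclusions>
      <!-- Drop the PDFBox jar that Tika would otherwise bring in transitively. -->
      <exclusion>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
      </exclusion>
    </exclusions>
  </dependency>

  <!-- Declare the locally built PDFBox directly so it is the one on the classpath. -->
  <dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>${pdfbox.trunk.version}</version>
  </dependency>
</dependencies>
```

Declaring PDFBox as a direct dependency should be enough on its own, since
Maven prefers the nearest declaration, but the exclusion makes the intent
explicit. Running "mvn dependency:tree" afterwards shows which PDFBox
version actually got resolved.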
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g