Thanks, I guess I'll refrain from using extra flags on the command line.
I think most of the files Tika fails to parse are failing because of
TesseractOCR timeouts.
If I run:
java -jar ~/tika-1.18/tika-app/target/tika-app-1.18.jar -J -t -i /mailin/ -o /mailout/
Many of my messages end up with a timeout like this where the
X-TIKA:content field should be:
X-TIKA:EXCEPTION:embedded_exception":"org.apache.tika.exception.TikaException:
TesseractOCRParser timeout\n\tat
org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:560)\n\tat
org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:432)\n\tat
org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:286)\n\tat
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat
org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)\n\tat
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat
org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:318)\n\tat
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)\n\tat
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)\n\tat
org.apache.tika.parser.mail.MailContentHandler.handleEmbedded(MailContentHandler.java:283)\n\tat
org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:228)\n\tat
org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)\n\tat
org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:100)\n\tat
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat
org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)\n\tat
org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:159)\n\tat
org.apache.tika.batch.FileResourceConsumer.parse(FileResourceConsumer.java:406)\n\tat
org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:104)\n\tat
org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:181)\n\tat
org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:115)\n\tat
org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:50)\n\tat
java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat
java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat
java.lang.Thread.run(Thread.java:748)\nCaused by:
java.util.concurrent.TimeoutException\n\tat
java.util.concurrent.FutureTask.get(FutureTask.java:205)\n\tat
org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:549)\n\t... 32 more\n",
"X-TIKA:digest:MD5":"79171517bfedab52b24bd1691a5ff544","X-TIKA:embedded_resource_path":"/CastleBrooks Ulana.jpg
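To get a rough count of how many outputs are affected, something like the following could scan the -J output directory. This is only a sketch: it assumes each non-empty .json file holds a JSON list of metadata objects (one per container or attachment) and that any key starting with "X-TIKA:EXCEPTION:" carries a stack trace string like the one above. The function name find_ocr_timeouts is my own, not part of Tika.

```python
import json
from pathlib import Path

# Assumptions taken from the output pasted above, not from Tika documentation.
EXCEPTION_PREFIX = "X-TIKA:EXCEPTION:"
TIMEOUT_MARKER = "TesseractOCRParser timeout"


def find_ocr_timeouts(out_dir):
    """Map each output file to the embedded resources that hit an OCR timeout."""
    hits = {}
    for path in sorted(Path(out_dir).rglob("*.json")):
        if path.stat().st_size == 0:
            continue  # 0-byte outputs are a separate problem
        try:
            records = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            continue  # skip files that are not valid JSON
        if not isinstance(records, list):
            continue
        timed_out = []
        for record in records:
            if not isinstance(record, dict):
                continue
            for key, value in record.items():
                if key.startswith(EXCEPTION_PREFIX) and TIMEOUT_MARKER in str(value):
                    # Records without a resource path are treated as the container.
                    timed_out.append(
                        record.get("X-TIKA:embedded_resource_path", "<container>")
                    )
                    break
        if timed_out:
            hits[str(path)] = timed_out
    return hits
```

Calling find_ocr_timeouts("/mailout/") would then give a dictionary keyed by output file, which makes it easy to see whether the timeouts cluster on a few large images.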
Sometimes Tika won't put any content at all in the output; it just
leaves a 0-byte filename.eml.json. That happens when I run:
java -Xmx12g -Xms12g -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider -Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true -jar ~/tika-1.18/tika-app/target/tika-app-1.18.jar -J -t -i /mailin/ -o /mailout/
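To collect the inputs behind those 0-byte outputs for a re-run, a small helper along these lines might do. It is a sketch that assumes the output tree mirrors the input tree and that each output is named after its input with ".json" appended, as in filename.eml.json; empty_outputs is a made-up name.

```python
from pathlib import Path


def empty_outputs(out_dir, in_dir):
    """Return (input_path, output_path) pairs for every 0-byte .json output."""
    out_root, in_root = Path(out_dir), Path(in_dir)
    pairs = []
    for out_path in sorted(out_root.rglob("*.json")):
        if out_path.is_file() and out_path.stat().st_size == 0:
            relative = out_path.relative_to(out_root)
            # Dropping the trailing ".json" gives the assumed input name.
            pairs.append((in_root / relative.with_suffix(""), out_path))
    return pairs
```

For example, empty_outputs("/mailout/", "/mailin/") would list the .eml files worth retrying one at a time.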
Sometimes Tika processing also just grinds to a halt with an
illegalIOexception.
TL;DR -
I'm running on 24 CPU cores with 64 GB of RAM and SSDs.
With a directory of 100,000 .eml files (many with attachments), is there
a recommended way to parallelize or batch-parse them reliably?
On 08/28/2018 07:54 AM, Tim Allison wrote:
Hi Jake,
In reverse order...
1) command flags: right, sorry, we've only implemented text/metadata
extraction via batch-mode (triggered by -i and -o). The -z option
currently only operates one file at a time.
2) "Even though I use -J, I'm not seeing the results of OCR on the
attachments" ... when you type 'tesseract' at the command line, does
that kick off tesseract, or is it not on your path? Do you have a
custom installation? If you run tika-app.jar -J against a single file
with an attachment that should be OCR'd, what values are you getting
for X-Parsed-By? To help isolate whether tesseract is being called
at all, try running standalone tika-app.jar -J against, e.g.
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/testOCR.docx
or
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/testOCR.pdf
3) "I'm also not seeing anything extracted from PDFs" -- are the PDFs
image only or do they actually contain text? If image only, once we
figure out whether tesseract is being called at all, that might solve
the problem, but also see:
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR for
how to use a tika-config to turn on the extraction/OCR'ing of inline
images in PDFs.
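The two checks in 2) could be scripted roughly as below. Treat it as a sketch under assumptions: it takes the text of a -J result, assumed to be a JSON list of metadata objects, and reads the "X-Parsed-By" key, whose value may be a single string or a list. The function names are hypothetical.

```python
import json
import shutil


def tesseract_on_path():
    """Return the full path of the tesseract executable, or None if not on PATH."""
    return shutil.which("tesseract")


def parsers_used(tika_json_text):
    """Return the X-Parsed-By values for each record of a -J result."""
    result = []
    for record in json.loads(tika_json_text):
        value = record.get("X-Parsed-By", [])
        # Normalize a lone string to a one-element list.
        result.append([value] if isinstance(value, str) else list(value))
    return result


def ocr_was_called(tika_json_text):
    """True if any record lists TesseractOCRParser among its parsers."""
    return any(
        "TesseractOCRParser" in name
        for names in parsers_used(tika_json_text)
        for name in names
    )
```

If tesseract_on_path() returns None, or ocr_was_called() is False for the testOCR files, then tesseract is probably not being invoked at all.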
On Mon, Aug 27, 2018 at 3:56 PM Jake Burns wrote:
I'm trying to parse a directory full of .eml files (and many have attachments).
Even though I use -J, I'm not seeing the results of OCR on the attachments. I'm
also not seeing anything extracted from PDFs. Finally, tika-app is not
recognizing a bunch of command flags.
I'm running Ubuntu 18.04 and have openjdk-8 (1.181) installed with the latest
Maven (3.5.4).
I've also got libtesseract-dev and tesseract-ocr-all installed on my
machine.
I downloaded Tika 1.18 and ran mvn clean install. The build completes fine and
I see the tika-app jar at ~/tika-1.18/tika-app/target/tika-app-1.18.jar
I am able to run java -jar ~pathto/tika-app-1.18.jar -J -i /mydirectoryoffiles/
-o /mytikaoutput/ and it works alright.
I am not able to pass any other flags to Tika, though, for example -r.
I'm not able to pass -z to extract attachments either.
I get stuff like this:
"INFO about to start driver
BatchProcess:No config file set via -bc, relying on tika-app-batch-config.xml
or default-tika-batch-config.xml
INFO BatchProcess: org.apache.commons.cli.UnrecognizedOptionException: Unrecognized
option: -z"
Can anyone tell me how I can parse a directory of .eml files and extract the
data from their attachments?