> With a directory of 100,000 .eml files (many with attachments), is there a recommended way to parallelize or do batch parsing reliably?
If the -J -t options get you what you need with tika-app in batch mode, it is already running in parallel. You can set the number of threads with -numConsumers. At some point, even on a decent-sized box, you'll become I/O bound because Tika, for some file formats, creates quite a few temp files. If you are limited to a single machine but have several SSDs, you could read from one, write to another, and use a third for java.io.tmpdir.

> A lot of my messages will have timeouts

As of Tika 1.19-SNAPSHOT (not yet released), you can control tesseract timeouts with, e.g.:
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/org/apache/tika/config/TIKA-2705-tesseract.xml

> Sometimes tika won't put any content at all in the output. It will just be
> filename.eml.json of 0 bytes, that happens when I run:

It is expected that tika-batch will create 0-byte .json files if something catastrophic happened during processing: oom, permanent hang, etc. You can check the logs to see what went catastrophically wrong.

> java -Xmx12g -Xms12g -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
> -Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true -jar
> ~/tika-1.18/tika-app/target/tika-app-1.18.jar -J -t -i /mailin/ -o /mailout/

It would be helpful to get the exact IO errors.

The system properties you are setting go to the parent process, which only monitors the child process; it is the child process that does the heavy lifting/actual parsing. To set the system props for the child process, prefix them with -J, as in:

java -Dlog4j.configuration=file:log4j_driver.xml -jar tika-app.jar -JXX:-OmitStackTraceInFastThrow -JXmx6g -JDlog4j.configuration=file:log4j.xml -bc tika-batch-config-basic-test.xml -i /data2/docs/ -o /data4/batch_runs/tika_1_19-poi4d -numConsumers 10 -c tika_config.xml

Sidenote: definitely include -JXX:-OmitStackTraceInFastThrow to make sure that you're getting complete stacktraces.
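For a quick first look at how a batch run went, something along these lines might help. This is only a rough sketch and not part of Tika: the function names are made up for illustration, and it assumes the output directory holds the .json files written by -J -t and that OCR timeouts show up as the literal text "TesseractOCRParser timeout", as in your example output.

```shell
#!/bin/sh
# Rough triage helpers for a tika-batch output directory.
# NOTE: these helpers are hypothetical, not something Tika ships.

# Print every 0-byte .json file under the given directory. These are the
# outputs left behind by catastrophic failures (oom, permanent hang, etc.).
list_empty_json() {
    find "$1" -type f -name '*.json' -size 0
}

# Count the .json files under the given directory that contain the
# literal string "TesseractOCRParser timeout" at least once.
count_ocr_timeouts() {
    find "$1" -type f -name '*.json' \
        -exec grep -l 'TesseractOCRParser timeout' {} + | wc -l | tr -d ' '
}
```

After loading the functions into your shell, "list_empty_json /mailout/ | wc -l" gives the number of catastrophic failures and "count_ocr_timeouts /mailout/" the number of files with at least one OCR timeout. It only greps for one message, so it is no substitute for a proper exception breakdown.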
> I think the majority of files tika doesn't parse is due to tesseractOCR
> timeouts.

To see how many exceptions and of what types, consider running tika-eval in 'profile' mode. This will work well given that you're already using the -J option. See: https://wiki.apache.org/tika/TikaEval

On Wed, Aug 29, 2018 at 9:57 AM Jake Burns wrote:
>
> Thanks, I guess I'll refrain from using extra flags on the command line.
>
> I think the majority of files tika doesn't parse is due to tesseractOCR
> timeouts.
>
> If I run:
>
> java -jar ~/tika-1.18/tika-app/target/tika-app-1.18.jar -J -t -i /mailin/ -o /mailout/
>
> A lot of my messages will have timeouts like this where the X-TIKA:content
> object should be.:
>
> X-TIKA:EXCEPTION:embedded_exception":"org.apache.tika.exception.TikaException: TesseractOCRParser timeout\n\tat
> org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:560)\n\tat
> org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:432)\n\tat
> org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:286)\n\tat
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat
> org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)\n\tat
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat
> org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:318)\n\tat
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)\n\tat
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)\n\tat
> org.apache.tika.parser.mail.MailContentHandler.handleEmbedded(MailContentHandler.java:283)\n\tat
> org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:228)\n\tat
> org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)\n\tat
> org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:100)\n\tat
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat
> org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)\n\tat
> org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:159)\n\tat
> org.apache.tika.batch.FileResourceConsumer.parse(FileResourceConsumer.java:406)\n\tat
> org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:104)\n\tat
> org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:181)\n\tat
> org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:115)\n\tat
> org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:50)\n\tat
> java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat
> java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat
> java.lang.Thread.run(Thread.java:748)\nCaused by: java.util.concurrent.TimeoutException\n\tat
> java.util.concurrent.FutureTask.get(FutureTask.java:205)\n\tat
> org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:549)\n\t... 32 more\n","X-TIKA:digest:MD5":"79171517bfedab52b24bd1691a5ff544","X-TIKA:embedded_resource_path":"/CastleBrooks Ulana.jpg
>
> Sometimes tika won't put any content at all in the output. It will just be
> filename.eml.json of 0 bytes, that happens when I run:
>
> java -Xmx12g -Xms12g -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
> -Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true -jar
> ~/tika-1.18/tika-app/target/tika-app-1.18.jar -J -t -i /mailin/ -o /mailout/
>
> Sometimes the tika processing just grinds to a halt with illegalIOexception
> too.
>
> TL;DR -
>
> I'm running 24 CPU Cores with 64 GB of RAM on SSDs.
>
> With a directory of 100,000 .eml files (many with attachments), is there a
> recommended way to parallelize or do batch parsing reliably?
>
> On 08/28/2018 07:54 AM, Tim Allison wrote:
>
> Hi Jake,
> In reverse order...
>
> 1) command flags: right, sorry, we've only implemented text/metadata
> extraction via batch-mode (triggered by -i and -o). The -z option
> currently only operates one file at a time.
>
> 2) "Even though I use -J, I'm not seeing the results of OCR on the
> attachments" ... when you type 'tesseract' at the command line, does
> that kickoff tesseract, or is it not on your path...do you have a
> custom installation? If you run tika-app.jar -J against a single file
> with an attachment that should be OCR'd, what values are you getting
> for X-ParsedBy.... to help isolate whether tesseract is being called
> at all, try running standalone tika-app.jar -J against, e.g.
> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/testOCR.docx
> or
> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/testOCR.pdf
>
> 3) "I'm also not seeing anything extracted from PDFs" -- are the PDF's
> image only or do they actually contain text?
> If image only, once we figure out whether tesseract is being called at
> all, that might solve the problem, but also see:
> https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR for
> how to use a tika-config to turn on the extraction/OCR'ing of inline
> images in PDFs.
>
> On Mon, Aug 27, 2018 at 3:56 PM Jake Burns wrote:
>
> I'm trying to parse a directory full of .eml files (and many have
> attachments). Even though I use -J, I'm not seeing the results of OCR on the
> attachments. I'm also not seeing anything extracted from PDFs. Finally,
> tika-app is not recognizing a bunch of command flags.
>
> I'm running ubuntu 18.04 and have openjdk-8 (1.181) installed with the latest
> maven (3.5.4).
> I've also got the libtesseract-dev and tesseract-OCR-all installed on my
> machine.
>
> I downloaded Tika 1.18 and ran mvn clean install. The build completes fine
> and I see the tika-app jar at ~/tika-1.18/tika-app-target/tika-app-1.18.jar
>
> I am able to run java -jar ~pathto/tika-app-1.18.jar -J -i
> /mydirectoryoffiles/ -o /mytikaoutput/ and it works alright.
>
> I am not able to pass any other flags to tika though. for example -r.
> I'm not able to pass -z to extract attachments either.
>
> I get stuff like this:
> "INFO about to start driver
> BatchProcess:No config file set via -bc, relying on tika-app-batch-config.xml
> or default-tika-batch-config.xml
> INFO BatchProcess: org.apache.commons.cli.UnrecognizedOptionException:
> Unrecognized option: -z"
>
> Can anyone tell me how I can parse a directory of .eml files and extract the
> data from their attachments?
