Since yesterday I have also noticed that files in my temp directory aren't being cleaned up. All of these files contain Tesseract output.
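A minimal sketch for listing the leftovers, assuming they follow the same apache-tika-*.tmp naming as the ImageMagick temp file in the log below. That naming is an assumption for the Tesseract output files, and the class and method names here are hypothetical, not part of Tika:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: lists files left behind in a directory.
class TempLeftovers {

    // Returns the regular files directly inside dir whose names match the glob.
    static List<Path> find(Path dir, String glob) throws IOException {
        List<Path> matches = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    matches.add(p);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Paths.get(System.getProperty("java.io.tmpdir"));
        // Assumption: leftovers share the apache-tika-*.tmp pattern seen in the log.
        for (Path p : find(tmp, "apache-tika-*.tmp")) {
            System.out.println(p + "  " + Files.size(p) + " bytes");
        }
    }
}
```

Running this before and after a parse would show whether each call leaves a new file behind.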

From: Peter Kronenberg
Sent: Wednesday, February 10, 2021 12:35 PM
To: [email protected]
Subject: Error calling ImageMagick

I think yesterday's code introduced a bug. The temporary file that is created for ImageMagick is not there.

[main] INFO org.apache.tika.parser.ocr.TesseractOCRParser - Tesseract is installed and is being invoked. This can add greatly to processing time. If you do not want tesseract to be applied to your files see: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
magick: no images found for operation `-resize' at CLI arg 9 @ error/operation.c/CLIOption/5361.
[main] WARN org.apache.tika.parser.ocr.TesseractOCRParser - ImageMagick failed (commandline: [magick, -density, 300, -depth, 4, -colorspace, gray, -filter, triangle, -resize, 200%, C:\Users\PETERK~1\AppData\Local\Temp\apache-tika-3889844060604687745.tmp, C:\Users\PETERK~1\AppData\Local\Temp\apache-tika-3889844060604687745.tmp])
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
    at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404)
    at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166)
    at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:153)
    at org.apache.tika.parser.ocr.ImagePreprocessor.process(ImagePreprocessor.java:121)
    at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:280)
    at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:248)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.parser.image.AbstractImageParser.parse(AbstractImageParser.java:94)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.torchai.ImageMagick.parse(ImageMagick.java:43)
    at org.torchai.ImageMagick.main(ImageMagick.java:56)

Text: MARLEY was dead, to begin with. There is no doubt whatever about that. The register of his burial was signed by the clergyman, the clerk, the undertaker, and the chief mourner. Scrooge signed it. And Scrooge's name was good upon 'Change, for anything he chose to put his hand to.

Here's the code:

    public static String parse(String file) throws TikaException, SAXException, IOException {
        final AutoDetectParser parser = new AutoDetectParser(new TikaConfig());
        final ParseContext parseContext = new ParseContext();
        final TesseractOCRConfig tessConfig = new TesseractOCRConfig();
        parseContext.set(AutoDetectParser.class, parser);
        parseContext.set(TesseractOCRConfig.class, tessConfig);
        tessConfig.setEnableImageProcessing(true);
        ContentHandler contentHandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        try (TikaInputStream stream = TikaInputStream.get(new BufferedInputStream(new FileInputStream(file)))) {
            parser.parse(stream, contentHandler, metadata, parseContext);
        }
        return contentHandler.toString();
    }
