Are you still seeing Tesseract txt files piling up in your temp directory?
I'm not able to reproduce this on Windows, Linux, or Mac.
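This shouldn't be the cause of the problem, but instead of this:
try (TikaInputStream stream = TikaInputStream.get(
        new BufferedInputStream(new FileInputStream(file)))) {
it is more efficient to hand Tika the file itself:
try (TikaInputStream stream = TikaInputStream.get(file)) {
(That assumes file is a File or Path; with the String in your code it would
be Paths.get(file).) That way Tika can read the original file directly and
doesn't have to spool the stream to a temp file first when a parser needs a
file on disk.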
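
On the temp files: if it helps to confirm whether they are still accumulating, here is a minimal sketch that lists leftovers in the JVM's temp directory after a run. It uses only the JDK. The class name TempFileCheck and the helper findLeftovers are made up for illustration, and the "apache-tika-*" glob is an assumption based on the file names in your log; adjust it if the Tesseract output files are named differently on your machine.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika.
class TempFileCheck {

    /** Returns the entries directly inside dir whose file name matches the glob. */
    public static List<Path> findLeftovers(Path dir, String glob) throws IOException {
        List<Path> matches = new ArrayList<>();
        // newDirectoryStream applies the glob to each entry's file name
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir, glob)) {
            for (Path entry : entries) {
                matches.add(entry);
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Paths.get(System.getProperty("java.io.tmpdir"));
        // Assumption: leftovers share the "apache-tika-" prefix seen in the log
        List<Path> leftovers = findLeftovers(tmp, "apache-tika-*");
        System.out.println(leftovers.size() + " matching file(s) in " + tmp);
        for (Path p : leftovers) {
            System.out.println("  " + p.getFileName());
        }
    }
}
```

Running it once before and once after a parse should show whether the count grows with each call.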
On Wed, Feb 10, 2021 at 2:23 PM Peter Kronenberg
<[email protected]> wrote:
>
> I have also noticed since yesterday that there are files in my temp directory
> that aren’t being cleaned up. All of these files contain the output of
> Tesseract
>
> From: Peter Kronenberg
> Sent: Wednesday, February 10, 2021 12:35 PM
> To: [email protected]
> Subject: Error calling ImageMagick
>
> I think yesterday’s code introduced a bug. The temporary file that is
> created for ImageMagick is not there.
>
> [main] INFO org.apache.tika.parser.ocr.TesseractOCRParser - Tesseract is
> installed and is being invoked. This can add greatly to processing time. If
> you do not want tesseract to be applied to your files see:
> https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
>
> magick: no images found for operation `-resize' at CLI arg 9 @
> error/operation.c/CLIOption/5361.
>
> [main] WARN org.apache.tika.parser.ocr.TesseractOCRParser - ImageMagick
> failed (commandline: [magick, -density, 300, -depth, 4, -colorspace, gray,
> -filter, triangle, -resize, 200%,
> C:\Users\PETERK~1\AppData\Local\Temp\apache-tika-3889844060604687745.tmp,
> C:\Users\PETERK~1\AppData\Local\Temp\apache-tika-3889844060604687745.tmp])
>
> org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
>     at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404)
>     at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166)
>     at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:153)
>     at org.apache.tika.parser.ocr.ImagePreprocessor.process(ImagePreprocessor.java:121)
>     at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:280)
>     at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:248)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at org.apache.tika.parser.image.AbstractImageParser.parse(AbstractImageParser.java:94)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at org.torchai.ImageMagick.parse(ImageMagick.java:43)
>     at org.torchai.ImageMagick.main(ImageMagick.java:56)
>
> Text: MARLEY was dead, to begin with. There is no doubt whatever about
> that. The register of his burial was signed by the clergyman, the clerk,
> the undertaker, and the chief mourner. Scrooge signed it. And
> Scrooge’s name was good upon ’Change, for anything he chose to put
> his hand to.
>
> Here’s the code:
>
> public static String parse(String file) throws TikaException, SAXException, IOException {
>     final AutoDetectParser parser = new AutoDetectParser(new TikaConfig());
>     final ParseContext parseContext = new ParseContext();
>     final TesseractOCRConfig tessConfig = new TesseractOCRConfig();
>     parseContext.set(AutoDetectParser.class, parser);
>     parseContext.set(TesseractOCRConfig.class, tessConfig);
>     tessConfig.setEnableImageProcessing(true);
>
>     ContentHandler contentHandler = new BodyContentHandler();
>     Metadata metadata = new Metadata();
>
>     try (TikaInputStream stream = TikaInputStream.get(
>             new BufferedInputStream(new FileInputStream(file)))) {
>         parser.parse(stream, contentHandler, metadata, parseContext);
>     }
>     return contentHandler.toString();
> }
>