Since yesterday I have also noticed files in my temp directory that aren't 
being cleaned up.  All of them contain Tesseract output.
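
In case it helps reproduce this, here is a minimal sketch for listing the leftover files. It assumes they share the `apache-tika-` prefix seen in the log in my earlier message below; I have not confirmed the exact naming, so adjust the glob if your files look different. `TempFileCheck` and `listLeftovers` are names made up for this illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class TempFileCheck {

    // Returns the entries directly under dir whose file name matches the glob,
    // sorted by path.
    static List<Path> listLeftovers(Path dir, String glob) throws IOException {
        List<Path> found = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) {
                found.add(p);
            }
        }
        Collections.sort(found);
        return found;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Paths.get(System.getProperty("java.io.tmpdir"));
        // Assumption: leftover files start with "apache-tika-".
        for (Path p : listLeftovers(tmp, "apache-tika-*")) {
            System.out.println(p + " (" + Files.size(p) + " bytes)");
        }
    }
}
```

Running this before and after a parse shows whether the number of matching files grows.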


From: Peter Kronenberg
Sent: Wednesday, February 10, 2021 12:35 PM
To: [email protected]
Subject: Error calling ImageMagick

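I think yesterday's code change introduced a bug: the temporary file created 
for ImageMagick does not exist when ImageMagick runs.  Note that the command 
line in the log passes the same path as both input and output.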


[main] INFO org.apache.tika.parser.ocr.TesseractOCRParser - Tesseract is installed and is being invoked. This can add greatly to processing time.  If you do not want tesseract to be applied to your files see: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
magick: no images found for operation `-resize' at CLI arg 9 @ error/operation.c/CLIOption/5361.
[main] WARN org.apache.tika.parser.ocr.TesseractOCRParser - ImageMagick failed (commandline: [magick, -density, 300, -depth, 4, -colorspace, gray, -filter, triangle, -resize, 200%, C:\Users\PETERK~1\AppData\Local\Temp\apache-tika-3889844060604687745.tmp, C:\Users\PETERK~1\AppData\Local\Temp\apache-tika-3889844060604687745.tmp])
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
    at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404)
    at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166)
    at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:153)
    at org.apache.tika.parser.ocr.ImagePreprocessor.process(ImagePreprocessor.java:121)
    at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:280)
    at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:248)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.parser.image.AbstractImageParser.parse(AbstractImageParser.java:94)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.torchai.ImageMagick.parse(ImageMagick.java:43)
    at org.torchai.ImageMagick.main(ImageMagick.java:56)
Text: MARLEY was dead, to begin with. There is no doubt whatever about
that. The register of his burial was signed by the clergyman, the clerk,
the undertaker, and the chief mourner. Scrooge signed it. And
Scrooge's name was good upon 'Change, for anything he chose to put
his hand to.


Here's the code:

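import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Method from org.torchai.ImageMagick (see the last frames of the stack trace).
public static String parse(String file) throws TikaException, SAXException, IOException {

    final AutoDetectParser parser = new AutoDetectParser(new TikaConfig());
    final ParseContext parseContext = new ParseContext();

    // Register the parser and the OCR configuration with the parse context.
    final TesseractOCRConfig tessConfig = new TesseractOCRConfig();
    parseContext.set(AutoDetectParser.class, parser);
    parseContext.set(TesseractOCRConfig.class, tessConfig);

    // Enabling image processing is what triggers the ImageMagick call.
    tessConfig.setEnableImageProcessing(true);

    ContentHandler contentHandler = new BodyContentHandler();
    Metadata metadata = new Metadata();

    // try-with-resources closes the stream once parsing has finished.
    try (TikaInputStream stream = TikaInputStream.get(
            new BufferedInputStream(new FileInputStream(file)))) {
        parser.parse(stream, contentHandler, metadata, parseContext);
    }

    return contentHandler.toString();
}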
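
To tie leftover temp files to a single run of the method above, one option is to snapshot the temp directory before and after the call and compare. This is only a sketch: `TempDirDiff`, `snapshot`, and `leftBehind` are hypothetical helper names, and other processes writing to the same directory at the same time would show up as false positives.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TempDirDiff {

    // Returns the names of the entries currently present directly under dir.
    static Set<String> snapshot(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.map(p -> p.getFileName().toString())
                          .collect(Collectors.toCollection(TreeSet::new));
        }
    }

    // Runs the action, then returns names that exist afterwards but did not before.
    static Set<String> leftBehind(Path dir, Runnable action) throws IOException {
        Set<String> before = snapshot(dir);
        action.run();
        Set<String> after = snapshot(dir);
        after.removeAll(before);
        return after;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Paths.get(System.getProperty("java.io.tmpdir"));
        Set<String> leaked = leftBehind(tmp, () -> {
            // Placeholder: call parse(...) here, wrapping its checked
            // exceptions in a RuntimeException.
        });
        System.out.println("Left behind: " + leaked);
    }
}
```

Anything printed after a parse that has already returned is a file that was created during the run and not cleaned up.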
