No, not seeing that anymore. I thought it might have been related to the ImageMagick thing, because they both seemed to have to do with temp files. But obviously, that wasn't really the case. So not sure what was causing that, but I don't see it anymore.
And thanks for the coding hint. Wasn't sure if TikaInputStream automatically did the buffering.

-----Original Message-----
From: Tim Allison <[email protected]>
Sent: Thursday, February 11, 2021 4:43 PM
To: [email protected]
Subject: Re: Error calling ImageMagick

Are you still seeing tesseract txt files piling up? I'm not able to reproduce this on windows/linux/mac.

This shouldn't cause a problem, but this:

try (TikaInputStream stream = TikaInputStream.get(new BufferedInputStream(new FileInputStream(file)))) {

is more efficient if you do this:

try (TikaInputStream stream = TikaInputStream.get(file)) {

On Wed, Feb 10, 2021 at 2:23 PM Peter Kronenberg <[email protected]> wrote:
>
> I have also noticed since yesterday that there are files in my temp
> directory that aren’t being cleaned up. All of these files contain
> the output of Tesseract
>
> From: Peter Kronenberg
> Sent: Wednesday, February 10, 2021 12:35 PM
> To: [email protected]
> Subject: Error calling ImageMagick
>
> I think yesterday’s code introduced a bug. The temporary file that is
> created for ImageMagick is not there.
>
> [main] INFO org.apache.tika.parser.ocr.TesseractOCRParser - Tesseract
> is installed and is being invoked. This can add greatly to processing
> time. If you do not want tesseract to be applied to your files see:
> https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
>
> magick: no images found for operation `-resize' at CLI arg 9 @
> error/operation.c/CLIOption/5361.
>
> [main] WARN org.apache.tika.parser.ocr.TesseractOCRParser -
> ImageMagick failed (commandline: [magick, -density, 300, -depth, 4,
> -colorspace, gray, -filter, triangle, -resize, 200%,
> C:\Users\PETERK~1\AppData\Local\Temp\apache-tika-3889844060604687745.tmp,
> C:\Users\PETERK~1\AppData\Local\Temp\apache-tika-3889844060604687745.tmp])
>
> org.apache.commons.exec.ExecuteException: Process exited with an
> error: 1 (Exit value: 1)
>     at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404)
>     at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166)
>     at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:153)
>     at org.apache.tika.parser.ocr.ImagePreprocessor.process(ImagePreprocessor.java:121)
>     at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:280)
>     at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:248)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at org.apache.tika.parser.image.AbstractImageParser.parse(AbstractImageParser.java:94)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at org.torchai.ImageMagick.parse(ImageMagick.java:43)
>     at org.torchai.ImageMagick.main(ImageMagick.java:56)
>
> Text: MARLEY was dead, to begin with. There is no doubt whatever about
> that. The register of his burial was signed by the clergyman, the clerk,
> the undertaker, and the chief mourner. Scrooge signed it. And
> Scrooge’s name was good upon ’Change, for anything he chose to put
> his hand to.
>
> Here’s the code:
>
>     public static String parse(String file) throws TikaException, SAXException, IOException {
>         final AutoDetectParser parser = new AutoDetectParser(new TikaConfig());
>         final ParseContext parseContext = new ParseContext();
>         final TesseractOCRConfig tessConfig = new TesseractOCRConfig();
>         parseContext.set(AutoDetectParser.class, parser);
>         parseContext.set(TesseractOCRConfig.class, tessConfig);
>         tessConfig.setEnableImageProcessing(true);
>         ContentHandler contentHandler = new BodyContentHandler();
>         Metadata metadata = new Metadata();
>
>         try (TikaInputStream stream = TikaInputStream.get(new BufferedInputStream(new FileInputStream(file)))) {
>             parser.parse(stream, contentHandler, metadata, parseContext);
>         }
>
>         return contentHandler.toString();
>     }
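
For anyone who wants to check whether temp files are still piling up after a run, here is a minimal sketch that lists leftover files in the JVM temp directory. It uses only the JDK, not Tika. The class name TempFileCheck and the method listLeftovers are made up for illustration. The "apache-tika-" prefix is taken from the ImageMagick command line in the log above; the files holding Tesseract output may be named differently, so treat the prefix as an assumption and adjust it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika.
final class TempFileCheck {

    // Returns the regular files directly inside dir whose names start with prefix.
    // The prefix is used inside a glob, so it should not contain glob
    // metacharacters such as '*', '?', '[' or '{'.
    static List<Path> listLeftovers(Path dir, String prefix) {
        List<Path> found = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir, prefix + "*")) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    found.add(entry);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        // java.io.tmpdir is the directory the JVM uses for temp files.
        Path tmp = Paths.get(System.getProperty("java.io.tmpdir"));
        // Assumed prefix, copied from the file name in the log above.
        for (Path leftover : listLeftovers(tmp, "apache-tika-")) {
            System.out.println(leftover);
        }
    }
}
```

Running it once before and once after a parse gives a rough idea of whether anything was left behind. Files that belong to a parse still in progress will show up too, so it is best run while nothing else is using Tika.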
