What are your dependencies? Which parsers are registered in your AutoDetectParser?
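
One quick way to answer the dependency question is to check whether the relevant classes are visible on the test classpath at all. The sketch below is only an illustration: the `ClasspathCheck` class and its `isLoadable` helper are hypothetical names, not part of Tika, and it assumes the OCR parser lives at `org.apache.tika.parser.ocr.TesseractOCRParser` (the full name behind the `o.a.t.p.o.TesseractOCRParser` logger in the logs below). It uses only the JDK, so it runs whether or not Tika is present. A class being loadable does not by itself prove that AutoDetectParser routes PNG files to it.

```java
// Hypothetical diagnostic helper, not part of Tika.
public class ClasspathCheck {

    // Returns true if the named class can be found by this class's loader.
    // The class is located but not initialized (second argument is false).
    public static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class names; adjust to the Tika version in use.
        String[] candidates = {
            "org.apache.tika.parser.AutoDetectParser",
            "org.apache.tika.parser.ocr.TesseractOCRParser"
        };
        for (String name : candidates) {
            System.out.println(name + " loadable: " + isLoadable(name));
        }
    }
}
```

Running this from the same module as the failing test (so it sees the same classpath) shows whether each class resolves. If a class is reported as not loadable, the dependency that provides it is probably missing from the build.
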
On Thu, Sep 1, 2022 at 4:38 AM David Pilato <[email protected]> wrote:
> Hey team
>
>
> I'm wondering what's wrong with my config.
> I'm running this very basic piece of code:
>
> @Test
> public void testTika() throws TikaException, IOException, SAXException {
>     BodyContentHandler handler = new BodyContentHandler(new WriteOutContentHandler(1000));
>     new AutoDetectParser().parse(getBinaryContent("test-ocr.png"), handler, new Metadata(), new ParseContext());
>     System.out.println("handler = " + handler);
> }
>
>
> Here are my logs:
>
> 16:31:13,089 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
> 16:31:13,560 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
> 16:31:13,564 DEBUG [o.a.t.p.o.TesseractOCRParser] ImageMagick does not appear to be installed (commandline: convert)
> 16:31:13,591 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
> 16:31:13,595 DEBUG [o.a.t.p.o.TesseractOCRParser] ImageMagick does not appear to be installed (commandline: convert)
> handler =
>
>
> The content is not extracted although Tesseract is detected.
>
> When I run Tesseract manually:
>
> tesseract test-ocr.png tess.out
> cat tess.out.txt
>
> I'm getting:
>
> This file contains some words.
>
> tesseract --version gives
>
> tesseract 5.2.0
> leptonica-1.82.0
> libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.3) : libpng 1.6.37 : libtiff 4.4.0 : zlib 1.2.11 : libwebp 1.2.4 : libopenjp2 2.5.0
> Found AVX2
> Found AVX
> Found FMA
> Found SSE4.1
> Found libarchive 3.6.1 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.2
> Found libcurl/7.79.1 SecureTransport (LibreSSL/3.3.6) zlib/1.2.11 nghttp2/1.45.1
>
>
> What am I missing here?
>
>
>
> David
>