What are your dependencies? Which parsers are in AutoDetectParser?
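
One quick way to narrow this down is to check whether the relevant parser classes are visible on the test classpath at all. Below is a minimal sketch using only the JDK. The class name `ClasspathCheck` and the helper `isOnClasspath` are made up for illustration. The Tika class names are the ones I would expect for the core, OCR and image parsers, so adjust them if your version differs. Finding a class is necessary but not sufficient: it does not prove that AutoDetectParser actually selects that parser for PNG input.

```java
import java.util.List;

// Hypothetical diagnostic helper, not part of Tika.
final class ClasspathCheck {

    // Returns true if the named class can be located by this class's loader.
    // The class is not initialized, so no static initializers are run.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class names; verify them against the Tika version in use.
        List<String> candidates = List.of(
                "org.apache.tika.parser.AutoDetectParser",
                "org.apache.tika.parser.ocr.TesseractOCRParser",
                "org.apache.tika.parser.image.ImageParser");
        for (String name : candidates) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it with the same classpath as the failing test. Anything reported as MISSING points at a dependency that is not being pulled in.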

On Thu, Sep 1, 2022 at 4:38 AM David Pilato <[email protected]> wrote:

> Hey team
>
>
> I'm wondering what's wrong with my config.
> I'm running this very basic piece of code:
>
> @Test
> public void testTika() throws TikaException, IOException, SAXException {
>     BodyContentHandler handler = new BodyContentHandler(new WriteOutContentHandler(1000));
>     new AutoDetectParser().parse(getBinaryContent("test-ocr.png"), handler, new Metadata(), new ParseContext());
>     System.out.println("handler = " + handler);
> }
>
>
> Here are my logs:
>
> 16:31:13,089 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
> 16:31:13,560 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
> 16:31:13,564 DEBUG [o.a.t.p.o.TesseractOCRParser] ImageMagick does not appear to be installed (commandline: convert)
> 16:31:13,591 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
> 16:31:13,595 DEBUG [o.a.t.p.o.TesseractOCRParser] ImageMagick does not appear to be installed (commandline: convert)
> handler =
>
>
> The content is not extracted although Tesseract is detected.
>
> When I run Tesseract manually:
>
> tesseract test-ocr.png tess.out
> cat tess.out.txt
>
> I'm getting:
>
> This file contains some words.
>
> tesseract --version gives
>
> tesseract 5.2.0
>  leptonica-1.82.0
>  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.3) : libpng 1.6.37 : libtiff 4.4.0 : zlib 1.2.11 : libwebp 1.2.4 : libopenjp2 2.5.0
>  Found AVX2
>  Found AVX
>  Found FMA
>  Found SSE4.1
>  Found libarchive 3.6.1 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.2
>  Found libcurl/7.79.1 SecureTransport (LibreSSL/3.3.6) zlib/1.2.11 nghttp2/1.45.1
>
>
> What am I missing here?
>
>
>
> David
>
