And, what is recorded in the X-Tika-ParsedBy value in the metadata object?

On Thu, Sep 1, 2022 at 5:36 AM Tim Allison <[email protected]> wrote:

> What are your dependencies? Which parsers are in AutoDetectParser?
>
> On Thu, Sep 1, 2022 at 4:38 AM David Pilato <[email protected]> wrote:
>
>> Hey team
>>
>> I'm wondering what's wrong with my config.
>> I'm running this very basic piece of code:
>>
>> @Test
>> public void testTika() throws TikaException, IOException, SAXException {
>>     BodyContentHandler handler = new BodyContentHandler(new WriteOutContentHandler(1000));
>>     new AutoDetectParser().parse(getBinaryContent("test-ocr.png"), handler, new Metadata(), new ParseContext());
>>     System.out.println("handler = " + handler);
>> }
>>
>> Here are my logs:
>>
>> 16:31:13,089 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
>> 16:31:13,560 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
>> 16:31:13,564 DEBUG [o.a.t.p.o.TesseractOCRParser] ImageMagick does not appear to be installed (commandline: convert)
>> 16:31:13,591 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
>> 16:31:13,595 DEBUG [o.a.t.p.o.TesseractOCRParser] ImageMagick does not appear to be installed (commandline: convert)
>> handler =
>>
>> The content is not extracted although Tesseract is detected.
>>
>> When I run Tesseract manually:
>>
>> tesseract test-ocr.png tess.out
>> cat tess.out.txt
>>
>> I'm getting:
>>
>> This file contains some words.
>>
>> tesseract --version gives
>>
>> tesseract 5.2.0
>>  leptonica-1.82.0
>>   libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.3) : libpng 1.6.37 : libtiff 4.4.0 : zlib 1.2.11 : libwebp 1.2.4 : libopenjp2 2.5.0
>>  Found AVX2
>>  Found AVX
>>  Found FMA
>>  Found SSE4.1
>>  Found libarchive 3.6.1 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.2
>>  Found libcurl/7.79.1 SecureTransport (LibreSSL/3.3.6) zlib/1.2.11 nghttp2/1.45.1
>>
>> What I'm missing here?
>>
>> David
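
For anyone following along, here is a minimal sketch of one way to answer the parsed-by question. It keeps a reference to the Metadata object instead of discarding it, and prints every entry after the parse. Dumping all entries avoids guessing the exact key name, which may differ between Tika versions. The class name ParsedByCheck, the dump helper and the command-line argument are hypothetical and only there for illustration; the sketch assumes Tika core and a parsers package are on the classpath.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical helper class, not part of Tika.
public class ParsedByCheck {

    // Formats every metadata entry as "name = [values]", one per line, sorted by name.
    static String dump(Metadata metadata) {
        StringBuilder sb = new StringBuilder();
        String[] names = metadata.names();
        Arrays.sort(names);
        for (String name : names) {
            sb.append(name)
              .append(" = ")
              .append(Arrays.toString(metadata.getValues(name)))
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: ParsedByCheck <file>");
            return;
        }
        // Keep the Metadata instance so it can be inspected after parsing.
        Metadata metadata = new Metadata();
        BodyContentHandler handler = new BodyContentHandler(1000);
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
        }
        // The parsed-by entry, if present, lists the parsers that handled the file.
        System.out.print(dump(metadata));
        System.out.println("handler = " + handler);
    }
}
```

If the OCR parser's class name does not show up among the parsed-by values, that would suggest the image never reached it, which ties back to the question about dependencies.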
