Hey team,
I'm wondering what's wrong with my config.
I'm running this very basic piece of code:
@Test
public void testTika() throws TikaException, IOException, SAXException {
    BodyContentHandler handler = new BodyContentHandler(new WriteOutContentHandler(1000));
    new AutoDetectParser().parse(getBinaryContent("test-ocr.png"), handler, new Metadata(), new ParseContext());
    System.out.println("handler = " + handler);
}
Here are my logs:
16:31:13,089 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
16:31:13,560 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
16:31:13,564 DEBUG [o.a.t.p.o.TesseractOCRParser] ImageMagick does not appear to be installed (commandline: convert)
16:31:13,591 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
16:31:13,595 DEBUG [o.a.t.p.o.TesseractOCRParser] ImageMagick does not appear to be installed (commandline: convert)
handler =
No content is extracted (the handler is empty), even though Tesseract is detected.
When I run Tesseract manually:
tesseract test-ocr.png tess.out
cat tess.out.txt
I'm getting:
This file contains some words.
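
One way to narrow this down is to repeat the manual check above from inside the JVM, to confirm that the Java process sees the same tesseract binary as the shell. The following is only a minimal sketch: the class and method names (TesseractCliCheck, buildCommand, outputFile, runOcr) are hypothetical, it assumes Java 11+ and that tesseract is on the PATH of the Java process, and it does not go through Tika at all.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of Tika: runs the tesseract CLI directly.
final class TesseractCliCheck {

    // Builds the same command as the manual run: tesseract <image> <outputBase>
    static List<String> buildCommand(Path image, Path outputBase) {
        return List.of("tesseract", image.toString(), outputBase.toString());
    }

    // Tesseract appends ".txt" to the output base name (tess.out -> tess.out.txt).
    static Path outputFile(Path outputBase) {
        return outputBase.resolveSibling(outputBase.getFileName() + ".txt");
    }

    // Runs the command from inside the JVM and returns the recognized text.
    static String runOcr(Path image, Path outputBase)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, outputBase))
                .inheritIO() // show tesseract's own messages on the console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return Files.readString(outputFile(outputBase), StandardCharsets.UTF_8);
    }
}
```

Calling TesseractCliCheck.runOcr(Path.of("test-ocr.png"), Path.of("tess.out")) from the same test class would show whether the text comes back when Tika is bypassed. If it does, the problem is more likely in how the image is handed to the parser than in the Tesseract installation.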
Running tesseract --version gives:
tesseract 5.2.0
leptonica-1.82.0
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.3) : libpng 1.6.37 : libtiff 4.4.0 : zlib 1.2.11 : libwebp 1.2.4 : libopenjp2 2.5.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found libarchive 3.6.1 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.2
Found libcurl/7.79.1 SecureTransport (LibreSSL/3.3.6) zlib/1.2.11 nghttp2/1.45.1
What am I missing here?
David