And what is recorded in the X-Tika-ParsedBy value in the metadata object?

On Thu, Sep 1, 2022 at 5:36 AM Tim Allison <[email protected]> wrote:

> What are your dependencies? Which parsers are in AutoDetectParser?
>
> On Thu, Sep 1, 2022 at 4:38 AM David Pilato <[email protected]> wrote:
>
>> Hey team
>>
>>
>> I'm wondering what's wrong with my config.
>> I'm running this very basic piece of code:
>>
>> @Test
>> public void testTika() throws TikaException, IOException, SAXException {
>>     BodyContentHandler handler =
>>             new BodyContentHandler(new WriteOutContentHandler(1000));
>>     new AutoDetectParser().parse(
>>             getBinaryContent("test-ocr.png"), handler, new Metadata(), new ParseContext());
>>     System.out.println("handler = " + handler);
>> }
>>
>>
>> Here are my logs:
>>
>> 16:31:13,089 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
>> 16:31:13,560 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
>> 16:31:13,564 DEBUG [o.a.t.p.o.TesseractOCRParser] ImageMagick does not appear to be installed (commandline: convert)
>> 16:31:13,591 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
>> 16:31:13,595 DEBUG [o.a.t.p.o.TesseractOCRParser] ImageMagick does not appear to be installed (commandline: convert)
>> handler =
>>
>>
>> The content is not extracted although Tesseract is detected.
>>
>> When I run Tesseract manually:
>>
>> tesseract test-ocr.png tess.out
>> cat tess.out.txt
>>
>> I'm getting:
>>
>> This file contains some words.
>>
>> tesseract --version gives
>>
>> tesseract 5.2.0
>>  leptonica-1.82.0
>>  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.3) : libpng 1.6.37 : libtiff 4.4.0 : zlib 1.2.11 : libwebp 1.2.4 : libopenjp2 2.5.0
>>  Found AVX2
>>  Found AVX
>>  Found FMA
>>  Found SSE4.1
>>  Found libarchive 3.6.1 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.2
>>  Found libcurl/7.79.1 SecureTransport (LibreSSL/3.3.6) zlib/1.2.11 nghttp2/1.45.1
>>
>>
>> What am I missing here?
>>
>>
>>
>> David
>>
>
