RE: Tesseract OCR text extraction issue on Debian Bullseye

Sandeep Kulkarni Thu, 05 May 2022 05:14:41 -0700

Hi Tim,

I have used your Dockerfile and built a docker image. It is working perfectly 
fine on my setup.


I have also tried various other options for tika-app and changed the Tika 
config, and those work fine too. I have tried both setting an explicit 
tesseractPath and leaving it unset; both work.

sandeep@SK-UBUNTU:~/tika-docker-play$ docker run -it tika-docker
DEBUG [main] 11:48:48,569 org.apache.tika.parser.external.ExternalParser exit 
value for /usr/bin/tesseract: 1
DEBUG [main] 11:48:48,570 org.apache.tika.parser.ocr.TesseractOCRParser 
hasTesseract (path: [/usr/bin/tesseract]): true
...
<Logs about ImageMagick missing>
...
DEBUG [main] 11:48:48,576 org.apache.tika.parser.ocr.TesseractOCRParser 
ImageMagick does not appear to be installed (commandline: convert)
INFO  [main] 11:48:49,694 org.apache.tika.parser.ocr.TesseractOCRParser 
Tesseract is installed and is being invoked. This can add greatly to processing 
time.  If you do not want tesseract to be applied to your files see: 
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
DEBUG [main] 11:48:49,702 org.apache.tika.parser.ocr.TesseractOCRParser 
Tesseract command: /usr/bin/tesseract /tmp/apache-tika-3898071413980732955.tmp 
/tmp/apache-tika-1645794913082823345.tmp --psm 1 -l eng -c page_separator= -c 
preserve_interword_spaces=0 txt
DEBUG [Thread-5] 11:48:50,274 org.apache.tika.parser.ocr.TesseractOCRParser
DEBUG [Thread-6] 11:48:50,275 org.apache.tika.parser.ocr.TesseractOCRParser 
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
...
...
<title>Presentation1</title>
</head>
<body><div class="page"><p/>
<div class="ocr">Happy New Year 2003!
</div>

I have even tried the same commands on the Ubuntu VMs that show the issue, 
and they worked fine there too. So something is definitely wrong with our 
application (or its configuration).
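
Since the same commands succeed from a shell, one way to narrow this down 
might be to launch the binary from inside the application's own JVM, so that 
the user, environment, and working directory match what Tika sees there. The 
following is only a sketch using plain JDK classes. It is not Tika's actual 
detection logic, and the names TesseractProbe, resolve, and probe are made up 
for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Tika: checks whether a binary can be
// started from the current JVM and reports its exit code.
class TesseractProbe {

    // Joins a configured directory such as "/usr/bin/" with a binary name.
    // A null or empty directory means "rely on PATH".
    static String resolve(String dir, String binary) {
        if (dir == null || dir.isEmpty()) {
            return binary;
        }
        if (dir.endsWith("/") || dir.endsWith("\\")) {
            return dir + binary;
        }
        return dir + File.separator + binary;
    }

    // Runs the command and returns its exit code. Returns -1 as a sentinel
    // if the process could not be run (for example, the file is missing or
    // not executable) or if the calling thread was interrupted.
    static int probe(String... command) {
        try {
            Process p = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .start();
            // Drain combined stdout/stderr so the child cannot block on a full pipe.
            try (InputStream in = p.getInputStream()) {
                byte[] buf = new byte[4096];
                while (in.read(buf) != -1) {
                    // output is discarded
                }
            }
            return p.waitFor();
        } catch (IOException e) {
            System.err.println("Failed to run " + command[0] + ": " + e.getMessage());
            return -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        String exe = resolve("/usr/bin/", "tesseract");
        System.out.println(exe + " -> exit " + probe(exe, "-v"));
    }
}
```

If this reports -1 when run the same way our application is launched, but a 
normal exit code when run from an interactive shell, the difference would 
point at the launch environment rather than at Tika or Tesseract.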

As I am still facing the issue, I will try to investigate further on my own. 
Thanks a lot for your help.

Regards,
Sandeep Kulkarni

-----Original Message-----
From: Sandeep Kulkarni <sandeep.kulkar...@veritas.com> 
Sent: Thursday, May 5, 2022 10:08 AM
To: user@tika.apache.org; talli...@apache.org
Subject: Re: Tesseract OCR text extraction issue on Debian Bullseye

Hi Tim,

Yes, I will make use of this repo for replicating the problem and let you know 
my observations. Thanks for the help.

Regards,
Sandeep Kulkarni

-----Original Message-----
From: Tim Allison <talli...@apache.org>
Sent: Thursday, May 5, 2022 2:18 AM
To: user@tika.apache.org
Subject: [External] Re: Tesseract OCR text extraction issue on Debian Bullseye

I created a very small repo with a version of tika-app that has log level set 
for debug:
https://github.com/tballison/tika-addons/tree/main/tika-docker-play

I'm not able to replicate your problem.  To be clear, I do believe you are 
seeing it!

Can you work with that repo and see if you can get it to fail?  Maybe add a 
stripped down version of your tika-config.xml?
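
On the Turkish-locale question quoted below (TIKA-1526): a quick way to rule 
it out might be to print the JVM's default locale and check whether case 
conversion under it differs from locale-independent conversion. This is a 
plain-JDK sketch, not Tika code. The class name LocaleCheck and the sample 
string are only illustrative. The underlying fact is that in a Turkish locale 
a lowercase "i" uppercases to a dotted capital "İ" (U+0130), which can break 
code that compares case-converted ASCII names.

```java
import java.util.Locale;

// Hypothetical diagnostic, not part of Tika.
class LocaleCheck {

    // Returns true if upper-casing s under the given locale differs from
    // locale-independent (Locale.ROOT) upper-casing.
    static boolean caseMappingDiffers(String s, Locale locale) {
        return !s.toUpperCase(locale).equals(s.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Locale def = Locale.getDefault();
        System.out.println("default locale: " + def);
        // The sample path contains an 'i', which is the character affected
        // by Turkish case rules.
        System.out.println("case mapping differs: "
                + caseMappingDiffers("/usr/bin/tesseract", def));
    }
}
```

If the second line prints false, the default locale's case rules are probably 
not a factor for this problem.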

On Wed, May 4, 2022 at 4:02 PM Tim Allison <talli...@apache.org> wrote:
>
> I just added extra debugging to the ExternalParser to figure out why 
> it doesn't find tesseract.  Are you able to add the logging to a local 
> build of Tika or use a snapshot?
>
> You aren't running in a Turkish locale, by chance? See: TIKA-1526
>
> On Wed, May 4, 2022 at 9:32 AM Sandeep Kulkarni 
> <sandeep.kulkar...@veritas.com> wrote:
> >
> > Hi Luís,
> >
> >
> >
> > Yes, I double-checked that today and the tesseract dictionaries are 
> > present. I am able to extract text from an image from the command 
> > line within the docker container.
> >
> >
> >
> > I have given examples in reply to Tim.
> >
> >
> >
> > Regards,
> >
> > Sandeep Kulkarni
> >
> >
> >
> > From: Luís Filipe Nassif <lfcnas...@gmail.com>
> > Sent: Tuesday, May 3, 2022 5:40 PM
> > To: user@tika.apache.org
> > Subject: [External] Re: Tesseract OCR text extraction issue on 
> > Debian Bullseye
> >
> >
> >
> > Just some guesses: were the new tesseract dictionaries installed? Were you 
> > able to OCR an image from the command line with the newer tesseract?
> >
> >
> >
> > On Tue, May 3, 2022 at 02:17, Sandeep Kulkarni 
> > <sandeep.kulkar...@veritas.com> wrote:
> >
> > Hi,
> >
> >
> >
> > Ours is a Java-based application which uses Tika via AutoDetectParser. We 
> > initialize TesseractOCRConfig with tesseractPath and tessdataPath (and a 
> > few more parameters) and set it in the context before invoking ParsingReader.
> >
> >
> >
> > I am currently using Tika 2.2.1 with Tesseract OCR 4.0.0 (the default 
> > version for this distro) on the Debian Buster based Docker image 
> > openjdk:8u312-jre-buster. Things work as expected, and I am able to get 
> > text extracted from images.
> >
> >
> >
> > We are now trying to upgrade Tesseract and have started facing some issues. 
> > We tried to move to the Debian Bullseye based openjdk:8u332-jre-bullseye 
> > image with Tesseract 4.1.1 (the default version for this distro), and text 
> > extraction from images stopped working. We have not changed anything else 
> > in the configuration for Tika and Tesseract.
> >
> >
> >
> > With debug logging enabled for TesseractOCRParser, I can see that 
> > hasTesseract now returns false, i.e. tesseract is no longer found at 
> > /usr/bin/tesseract.
> >
> >
> >
> > 2022-05-02 10:55:26,053 DEBUG [TesseractOCRParser] hasTesseract
> > (path: [/usr/bin/tesseract]): false
> >
> >
> >
> > Because of this, Tesseract OCR does not get invoked. If I take a look at 
> > the path where the Tesseract binary is installed, I can see it at 
> > /usr/bin/tesseract itself.
> >
> >
> >
> > root@vic:/# which tesseract
> >
> > /usr/bin/tesseract
> >
> > root@vic # tesseract -v
> >
> > tesseract 4.1.1
> >
> > leptonica-1.79.0
> >
> >   libgif 5.1.9 : libjpeg 6b (libjpeg-turbo 2.0.6) : libpng 1.6.37 : 
> > libtiff 4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.4.0
> >
> > Found AVX2
> >
> > Found AVX
> >
> > Found FMA
> >
> > Found SSE
> >
> > Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8
> > liblz4/1.9.3 libzstd/1.4.8
> >
> >
> >
> > Earlier, it was working and produced the logs below:
> >
> >
> >
> > 2022-05-02 09:55:08,275 DEBUG [TesseractOCRParser] hasTesseract
> > (path: [/usr/bin/tesseract]): true
> >
> > 2022-05-02 09:55:08,450 INFO  [Tika Parser-1] [TesseractOCRParser] 
> > Tesseract is installed and is being invoked. This can add greatly to 
> > processing time.  If you do not want tesseract to be applied to your 
> > files see:
> > https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
> >
> > 2022-05-02 09:55:08,451 DEBUG [Tika Parser-1] [TesseractOCRParser] 
> > Tesseract command: /usr/bin/tesseract 
> > /tmp/apache-tika-1769393331829017331.tmp
> > /tmp/apache-tika-8424718595969554950.tmp --psm 1 -l eng -c 
> > page_separator= -c preserve_interword_spaces=0 txt
> >
> > 2022-05-02 09:55:09,222 DEBUG [Thread-29] [TesseractOCRParser]
> >
> > 2022-05-02 09:55:09,222 DEBUG [Thread-30] [TesseractOCRParser] 
> > Tesseract Open Source OCR Engine v4.0.0 with Leptonica
> >
> >
> >
> > We use the following Tesseract OCR settings (both before and now).
> >
> >
> >
> > tesseractPath=/usr/bin/
> >
> > tessdataPath=/usr/share/tesseract-ocr/4.00/tessdata/
> >
> >
> >
> > We are also facing the same issue with Ubuntu based VMs that we upgraded 
> > from 16.04 to 20.04 recently.
> >
> >
> >
> > Finally, we use a simple 'apt install tesseract-ocr' command to install 
> > Tesseract OCR while building the docker image as well as on the Ubuntu VMs. 
> > As Ubuntu is based on Debian, it is possible that the two issues are related.
> >
> >
> >
> > FYI, we are not facing this issue on Windows with Tesseract OCR 4.0.0, 
> > 4.1.0, or 5.0.1. There we install the Tesseract OCR builds available at 
> > https://github.com/UB-Mannheim/tesseract/wiki
> > and the paths for the tesseract binary and tessdata are as below:
> >
> >
> >
> > tesseractPath=C:\Program Files\Tesseract-OCR\ 
> > tessdataPath=C:\Program Files\Tesseract-OCR\tessdata\
> >
> >
> >
> > Any help would be appreciated. I also wanted to ask whether there is a 
> > compatibility matrix of supported Tesseract OCR versions for Tika. We 
> > also plan to move to 5.x in the near future.
> >
> >
> >
> > Regards,
> >
> > Sandeep Kulkarni
> >
> >
