[tesseract-ocr] Issue with OCR rapidity on short texts

Quentin MIGNOT Fri, 10 Sep 2021 10:41:29 -0700

Hello everyone,

I am using the Tesseract OCR library in a professional program where speed
is quite important. We receive pictures (movie subtitles) containing
characters that we need to decode (one possible treatment among many
others). However, when we try to decode longer subtitles or subtitles in
Chinese, the library takes too much time. I could use some help to see
whether there is something to do or configure to improve detection speed,
knowing that I am working on quite powerful servers with a lot of cores.

I wrote a little program to help with my testing:
<https://pastebin.com/8tsrY5Gf>. It loads Tesseract, performs character
detection on a picture, then displays the result with the time it took.
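
For readers without access to the pastebin, the same measurement can be
approximated from outside the library. This is only a sketch and not the
linked program: it assumes the `tesseract` command-line tool is installed
and on the PATH, and `run_tesseract_cli` and `timed` are hypothetical helper
names. Timing the CLI also includes process start-up and model loading, so
the numbers will not match an in-process measurement.

```python
import subprocess
import time


def run_tesseract_cli(image_path, lang="chi_tra", oem=1, tessdata_dir=None):
    """Run the tesseract CLI on one image and return the decoded text.

    oem=0 selects the legacy engine, oem=1 the LSTM engine.
    """
    cmd = ["tesseract", image_path, "stdout", "-l", lang, "--oem", str(oem)]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", tessdata_dir]
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return done.stdout.strip()


def timed(recognize, image_path):
    """Return (text, seconds) for a single call of recognize(image_path)."""
    start = time.perf_counter()
    text = recognize(image_path)
    return text, time.perf_counter() - start
```

A call such as `timed(run_tesseract_cli, "pngs/test.png")` would then give
the decoded text and the wall-clock time for one picture.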

I tried it with the three datasets I found in the GitHub repos for the
Chinese language:
https://github.com/tesseract-ocr/tessdata
https://github.com/tesseract-ocr/tessdata_best
https://github.com/tesseract-ocr/tessdata_fast
The first works with both OEM_TESSERACT_ONLY and OEM_LSTM_ONLY modes; the
two others only work with OEM_LSTM_ONLY.

Here are the results I get with this input picture, which is one of the
"difficult" cases we have:
[image: 262646303.png]

In the first case (normal traineddata + OEM_TESSERACT), the detected text
is correct enough, but it takes too long:
./testTesseract traineddata/legacy chi_tra Tesseract pngs/test.png
OCR loaded
Picture pngs/test.png loaded : w = 438, h = 160, d = 32. Took 4.1389s,
decoded : "銅鐵人，通過 我得考慮看看".

In the second case (normal traineddata + OEM_LSTM), the detected text is
not as good, but it is *much* faster:
./testTesseract traineddata/legacy chi_tra LSTM pngs/test.png
OCR loaded
Picture pngs/test.png loaded : w = 438, h = 160, d = 32. Took 0.348507s,
decoded : "鍘 鐵 人 ， 通 過 我 得 考 慧 看 看".
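
To put a number on "not as good", the two decoded strings above can be
compared character by character once the spaces are removed. This is a
minimal sketch: it measures how far the LSTM output is from the legacy
output, not from a verified ground truth, and `strip_spaces` and
`edit_distance` are hypothetical helper names.

```python
def strip_spaces(text):
    """Drop all whitespace so spacing differences are not counted."""
    return "".join(text.split())


def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


legacy = strip_spaces("銅鐵人，通過 我得考慮看看")
lstm = strip_spaces("鍘 鐵 人 ， 通 過 我 得 考 慧 看 看")
print(edit_distance(legacy, lstm), len(legacy))  # prints "2 12"
```

So the LSTM result differs from the legacy result in 2 of 12 characters
(銅/鍘 and 慮/慧).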

In the third case (fast traineddata), the result is catastrophic
<https://d.justpo.st/media/images/2017/01/10/it-says-here-youre-extremely-fast-at-maths-whats-30-x-17-47-thats-not-even-close-yeah-but-it-was-quick-1484096551.jpg>
(but fast):
./testTesseract traineddata/fast chi_tra LSTM pngs/test.png
OCR loaded
Picture pngs/test.png loaded : w = 438, h = 160, d = 32. Took 0.150667s,
decoded : "5 折才2起起二".

In the fourth case (best traineddata), the result is quite bad too:
./testTesseract traineddata/best chi_tra LSTM pngs/test.png
Error opening data file traineddata/best/chi_tra_vert.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your
"tessdata" directory.
Failed loading language 'chi_tra_vert'
OCR loaded
Picture pngs/test.png loaded : w = 438, h = 160, d = 32. Took 0.379917s,
decoded : "全 說 喲 基 E 選 5 所 才 下 二 起 生".
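
The log above suggests that the "best" directory has no
chi_tra_vert.traineddata, which the chi_tra model appears to request. A
small check of which files are actually present can rule that out. This is
a sketch only; `missing_traineddata` is a hypothetical helper, and the
directory and language names are simply the ones from the commands above.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, languages):
    """Return the languages with no <lang>.traineddata file in tessdata_dir."""
    base = Path(tessdata_dir)
    return [lang for lang in languages
            if not (base / f"{lang}.traineddata").is_file()]
```

For example, `missing_traineddata("traineddata/best", ["chi_tra",
"chi_tra_vert"])` lists whichever of the two files is absent.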

My questions are :

- Is there a reason why the "best" and "fast" training sets perform so
poorly ? Maybe I configured something wrong ?
- Does Tesseract have a feature for multithreading (I suppose it does
not, as I did not find any reference to it online) ?
- For the "normal" training set, Tesseract mode is correct but slow.
LSTM mode is less correct but much faster. Is there a way to have something
in the middle, with another training set or a specific configuration ?
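
On the multithreading question: whatever Tesseract does internally,
independent subtitle pictures can be recognised concurrently from the
calling side, which is one way to use many cores. A minimal sketch,
assuming the `tesseract` command-line tool is on the PATH;
`recognize_with_cli` and `ocr_many` are hypothetical names, not part of
Tesseract. Threads are enough here because each one only waits on a
separate tesseract process.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def recognize_with_cli(image_path):
    """Decode one image with the tesseract CLI (LSTM engine, chi_tra)."""
    cmd = ["tesseract", image_path, "stdout", "-l", "chi_tra", "--oem", "1"]
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return done.stdout.strip()


def ocr_many(image_paths, recognize=recognize_with_cli, workers=4):
    """Recognise several images concurrently; returns {path: text}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(recognize, image_paths))
    return dict(zip(image_paths, texts))
```

This improves throughput over a batch of pictures; it does not shorten the
time for a single picture.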

One of the things I tried was using OpenMP, which is supposed to improve
multithreading with Tesseract (which makes me believe there is some sort of
multithreading). I recompiled Tesseract and linked it again with our
program, but did not see any difference in our OCR performance. It is still
possible that we failed to configure the OMP_THREAD_LIMIT variable
correctly, given the complexity of our build system. My program is compiled
with OpenMP support, but changing the variable in my shell does not seem to
change anything... Is there a way to check if it is correctly detected ?
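
One rough check, sketched below under assumptions: builds of the
`tesseract` command-line tool that were compiled with OpenMP print a
"Found OpenMP" line in their `--version` output, and OMP_THREAD_LIMIT is
read from the environment of the process being started. This only tests
the CLI binary, not a library linked into another program, and the helper
names are hypothetical.

```python
import os
import subprocess


def env_with_thread_limit(limit, base_env=None):
    """Copy an environment and set OMP_THREAD_LIMIT for a child process."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = str(limit)
    return env


def reports_openmp(version_text):
    """True if the version banner mentions OpenMP."""
    return "openmp" in version_text.lower()


def cli_version_text():
    """Return the text printed by `tesseract --version` (stdout + stderr)."""
    done = subprocess.run(["tesseract", "--version"],
                          capture_output=True, text=True)
    return done.stdout + done.stderr
```

Running the same picture through
`subprocess.run(cmd, env=env_with_thread_limit(1))` and again without the
limit, and comparing the timings, would show whether the variable has any
effect on that binary.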

I'd be very glad to get a little bit of help here. My team and I are about
to switch from our "Legacy" OEM_TESSERACT mode to OEM_LSTM, as the change
is quite easy and it is considerably faster. However, if there is a better
solution, please tell me :)

Have a nice weekend !
