Hello everyone,

I am using Tesseract OCR library in a professional program where speed is 
quite important. We receive pictures (movie subtitles) containing 
characters that we need to decode (one possible treatment among many 
others). However, we have issues when we try to decode longer subtitles or 
subtitles in chinese language, the library takes too much time. I could use 
some help to see if there is something to do or configure to improve 
detection speed, knowing I am working on quite powerfull servers, with a 
lot of cores.

I wrote a little program to help with my testing. 
<https://pastebin.com/8tsrY5Gf> It loads Tesseract, and perform character 
detection on a picture, then displays the result with the time it took.

I tried it with the three datasets I found on github repo for chinese 
language : 
https://github.com/tesseract-ocr/tessdata
https://github.com/tesseract-ocr/tessdata_best
https://github.com/tesseract-ocr/tessdata_fast
The first works with OEM_TESSERACT_ONLY and OEM_LSTM_ONLY modes, the two 
others only work wil OEM_LSTM_ONLY.

Here are the result I get with this input picture, which is one of the 
"difficult" cases we have:
 [image: 262646303.png]

In the first case (normal traineddata + OEM_TESSERACT), the detected text 
is correct enough, but the time it took is too high :
./testTesseract traineddata/legacy chi_tra Tesseract pngs/test.png 
OCR loaded
Picture pngs/test.png loaded : w = 438, h = 160, d = 32. Took 4.1389s, 
decoded : "銅鐵人,通過 我得考慮看看".

In the second case (normal traineddata + OEM_LSTM), the detected text is 
not as good but *much* faster.
./testTesseract traineddata/legacy chi_tra LSTM pngs/test.png
OCR loaded
Picture pngs/test.png loaded : w = 438, h = 160, d = 32. Took 0.348507s, 
decoded : "鍘 鐵 人 , 通 過 我 得 考 慧 看 看".

In the third case (fast traineddata), the result is catastrophic 
<https://d.justpo.st/media/images/2017/01/10/it-says-here-youre-extremely-fast-at-maths-whats-30-x-17-47-thats-not-even-close-yeah-but-it-was-quick-1484096551.jpg>(but
 
fast)
./testTesseract traineddata/fast chi_tra LSTM pngs/test.png 
OCR loaded
Picture pngs/test.png loaded : w = 438, h = 160, d = 32. Took 0.150667s, 
decoded : "5 折才2起起二".

In the fourth case (best traineddata), the result is quite bad too:
./testTesseract traineddata/best chi_tra LSTM pngs/test.png
Error opening data file traineddata/best/chi_tra_vert.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your 
"tessdata" directory.
Failed loading language 'chi_tra_vert'
OCR loaded
Picture pngs/test.png loaded : w = 438, h = 160, d = 32. Took 0.379917s, 
decoded : "全 說 喲 基 E 選 5 所 才 下 二 起 生".

My questions are :

   - Is there a reason why the "best" and "fast" training sets perform so 
   poorly ? Maybe I configured something wrong ?
   - Does Tesseract has a feature for multithreading (I suppose it does 
   not, as I did not find any reference to it online) ?
   - For the "normal" training set, Tesseract mode is correct but slow. 
   LSTM mode is less correct but much faster. Is there a way to have something 
   in the middle, with another training set or a specific configuration ?

One of the things I tried was using openmp, which is supposed to improve 
multithreading with Tesseract (which makes me believe there is some sort of 
multithreading). I recompiled Tesseract, linked it again with our program 
but did not see any difference with our OCR performances. It is still 
possible that we failed to configure the omp_thread_limit variable 
correctly, given the complexity of out build system. My program is compiled 
with openmp support, but changing the variable in my shell does not seem to 
change anything... Is there a way to check if it is correctly detected ?

I'd be very glad to get a little bit of help here. My team and I are about 
to change our "Legacy" OEM_TESSERACT mode for OEM_LSTM, as it is quite easy 
to change and quite faster. However if there is a better solution, please 
tell me :)

Have a nice weekend !

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/1d8050ae-cca1-4410-bfe0-c27cb732120cn%40googlegroups.com.

Reply via email to