Re: [tesseract-ocr] Experiment with Thai language

sanparith marukatat Fri, 31 Aug 2018 02:11:55 -0700

Thanks :)

On Friday, August 31, 2018 at 3:29:21 PM UTC+7, shree wrote:
>
> A few points to note:
>
> 1. The langdata repo has training data for 3.04. Please use the 
> langdata_lstm repo for LSTM training data.
>
> 2. To train from existing models, you need to use traineddata files from 
> the tessdata_best repo.
>
> 3. Use the tesstrain.sh script to create the starter traineddata file to be 
> used for training.
>
> 4. Build the latest beta.4 code from GitHub and use that.
>
> On Fri 31 Aug, 2018, 1:17 PM sanparith marukatat, <[email protected]> wrote:
>
>> Hi everyone,
>>
>> I have been playing with Tesseract for Thai language for a while. The 
>> performance of the default LSTM model is good. However, I would like to 
>> know if I can further improve it.
>>
>> First I tried to retrain the model but ran into problems. I also tried 
>> to replace the top layer, again without success. I think this is due to 
>> the unicharset (but I am not sure; I forgot the error messages). So I 
>> ended up training the model from scratch. I now have a working model, but 
>> it cannot reach the same performance as the default model. Please give me 
>> some advice on how to improve its accuracy.
>>
>> Here is how I did it.
>> I used common Thai fonts (Tahoma, Sarabun, Angsana, Browallia, Cordia, 
>> Dillenia, Iris) together with fonts arbitrarily picked from 
>> http://www.thaisignmaker.com/korkhorkore/?catalog/all/-/date/1
>> In total, 65 fonts were selected to train the new model.
>>
>> I downloaded the Thai training text, i.e. 'tha.training_text', from 
>> https://github.com/tesseract-ocr/langdata/blob/master/tha/tha.training_text
>> I observed that much of the text in this file is gibberish. I think that 
>> the default model was built from this text file, so I used it as well.
>>
>> I used 'text2image' to generate training data, varying 3 exposures 
>> (-1, 0, 1), 2 conditions (normal, degraded), and 2 resolutions (300 and 
>> 400 dpi). From 'tha.training_text' and the 65 fonts, I obtained 900,000+ 
>> lines to train the model.
>>
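For reference, the rendering grid described above can be enumerated with a few lines of Python. This is only a sketch: the font names are placeholders, `render_jobs` is a hypothetical helper rather than part of the Tesseract tooling, and nothing here calls text2image. It just counts the font/exposure/condition/dpi combinations.

```python
from itertools import product

EXPOSURES = (-1, 0, 1)
CONDITIONS = ("normal", "degraded")
DPIS = (300, 400)

def render_jobs(fonts):
    """Return one (font, exposure, condition, dpi) tuple per rendering pass."""
    return list(product(fonts, EXPOSURES, CONDITIONS, DPIS))

# Placeholder names standing in for the 65 fonts mentioned in the post.
fonts = [f"font_{i:02d}" for i in range(65)]
jobs = render_jobs(fonts)
print(len(jobs))  # 65 * 3 * 2 * 2 = 780
```

If every pass contributed equally, 900,000+ lines over 780 passes would be about 1,150 lines per pass; the post does not say how the lines were actually distributed.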
>> I downloaded 'tha.traineddata' from 
>> https://github.com/tesseract-ocr/tessdata
>> I observed that 'tha.traineddata' contains two unicharsets, i.e. 
>> 'tha.unicharset' and 'tha.lstm-unicharset'. As I am interested in the LSTM 
>> model, I replaced 'tha.lstm-unicharset' with a new unicharset generated 
>> from the box files using 'unicharset_extractor'.
>> Note that the help message of 'unicharset_extractor' says:
>> ...
>> Where mode means:
>>  1=combine graphemes (use for Latin and other simple scripts)
>>  2=split graphemes (use for Indic/Khmer/Myanmar)
>>  3=pure unicode (use for Arabic/Hebrew/Thai/Tibetan)
>>
>> However, Thai has split graphemes such as "ะ", "แ", "ำ", "ญ", and "ฐ", 
>> so I called unicharset_extractor with "--norm_mode 2" instead of 3. I am 
>> not sure whether this is the correct setting for norm_mode.
>>
>> Then I used 'combine_tessdata' to replace 'tha.lstm-unicharset' in 
>> 'tha.traineddata'.
>>
>> I trained the model using 'lstmtraining --traineddata tha.traineddata 
>> --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c150]" ...'.
>> I believe this means that I construct a NN with:
>> - input shape (1,36,36,1), i.e. batch size=1 (ignored), bitmap size 36x36 
>> and 1 channel (grayscale)
>> - Convolution with tanh of size 3x3, 16 filters
>> - Maxpooling 3x3
>> - LSTM forward in y-direction, with the output summarized into 48 values
>> - LSTM forward in x-direction with 96 outputs
>> - LSTM backward in x-direction with 96 outputs
>> - LSTM forward in x-direction with 256 outputs
>> - Output sequence of 150-dim vectors using softmax+CTC.
>> I copied the model from somewhere on the Internet and modified it. I 
>> still don't know what 'summarize' in LSTM actually means 
>> (https://github.com/tesseract-ocr/tesseract/wiki/VGSLSpecs).
>>
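A rough way to sanity-check a reading of the net_spec is to tokenize it and print a description per layer. The sketch below is not Tesseract code; `describe` and `parse` are hypothetical helpers, and the descriptions reflect my reading of the VGSLSpecs page, so treat them as assumptions. Two points from that page seem relevant here: the input is given as batch,height,width,depth with 0 meaning variable size, so 1,36,0,1 would be height 36 with variable width rather than a 36x36 bitmap; and the 's' in Lfys48 appears to mean that only the final step along that dimension is kept, collapsing it to a single element.

```python
import re

SPEC = "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c150]"

def describe(token):
    """Give a rough, human-readable reading of one VGSL token."""
    if re.fullmatch(r"\d+,\d+,\d+,\d+", token):
        batch, height, width, depth = (int(v) for v in token.split(","))
        h = "variable" if height == 0 else str(height)
        w = "variable" if width == 0 else str(width)
        return f"input: batch={batch}, height={h}, width={w}, depth={depth}"
    m = re.fullmatch(r"C([a-z])(\d+),(\d+),(\d+)", token)
    if m:
        return f"conv {m[2]}x{m[3]}, {m[4]} outputs, nonlinearity '{m[1]}'"
    m = re.fullmatch(r"Mp(\d+),(\d+)", token)
    if m:
        return f"maxpool {m[1]}x{m[2]}"
    m = re.fullmatch(r"L([frb])([xy])(s?)(\d+)", token)
    if m:
        direction = {"f": "forward", "r": "reverse", "b": "bidirectional"}[m[1]]
        tail = ", summarized (only the final step is kept)" if m[3] else ""
        return f"LSTM {direction} along {m[2]}, {m[4]} outputs{tail}"
    m = re.fullmatch(r"O(\d)([a-z])(\d+)", token)
    if m:
        return f"output: {m[1]}-d, type '{m[2]}', {m[3]} classes"
    return f"unrecognized: {token}"

def parse(spec):
    """Split a bracketed VGSL spec on whitespace and describe each token."""
    return [describe(t) for t in spec.strip("[]").split()]

for line in parse(SPEC):
    print(line)
```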
>> During training I observed many warning messages such as:
>> Encoding of string failed! Failure bytes: ffffffe0 ffffffb8 ffffff84 
>> ffffffe0 ffffffb8 ffffffb8 ffffffe0 ffffffb8 ffffffa2 20 ffffffe0 ffffffb9 
>> ffffff80 ffffffe0 ffffffb8 ffffff94 ffffffe0 ffffffb8 ffffffb5 ffffffe0 
>> ffffffb8 ffffffa2 20 ffffffe0 ffffffb8 ffffffa3 ffffffe0 ffffffb8 ffffffb0 
>> ffffffe0 ffffffb8 ffffff9a ffffffe0 ffffffb8 ffffff9a ffffffe0 ffffffb9 
>> ffffff91 ffffffe0 ffffffb9 ffffff99 20 37 37 20 ffffffe0 ffffffb9 ffffff81 
>> ffffffe0 ffffffb8 ffffffa5 ffffffe0 ffffffb8 ffffffb0 ffffffe0 ffffffb8 
>> ffffffa1 ffffffe0 ffffffb8 ffffffb5 2e 22 20 ffffffe0 ffffffb8 ffffffa1 
>> ffffffe0 ffffffb8 ffffffb4 ffffffe0 ffffffb9 ffffff80 ffffffe0 ffffffb8 
>> ffffffa1 ffffffe0 ffffffb8 ffffffb7 ffffffe0 ffffffb8 ffffffad ffffffe0 
>> ffffffb8 ffffff87
>> Can't encode transcription: 'คุย เดีย ระบบ๑๙ 77 และมี." มิเมือง' in 
>> language ''
>>
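The "Failure bytes" dump can be turned back into text, which helps when looking for the character that could not be encoded. The ffffff prefix on most tokens looks like sign extension of single bytes (an assumption on my part), so masking each token to its low 8 bits and decoding as UTF-8 recovers the string. The sketch below uses only the first ten tokens of the dump above; `decode_failure_bytes` is a hypothetical helper, not a Tesseract tool.

```python
import unicodedata

# First ten tokens of the "Failure bytes" dump quoted above.
DUMP = ("ffffffe0 ffffffb8 ffffff84 ffffffe0 ffffffb8 ffffffb8 "
        "ffffffe0 ffffffb8 ffffffa2 20")

def decode_failure_bytes(dump):
    """Mask each sign-extended token to one byte and decode as UTF-8."""
    raw = bytes(int(tok, 16) & 0xFF for tok in dump.split())
    return raw.decode("utf-8")

text = decode_failure_bytes(DUMP)
for ch in text:
    # Print the code point and Unicode name of every decoded character.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}")
```

The decoded prefix matches the start of the transcription shown in the warning. As far as I know, this warning usually means that at least one character in the line has no entry in the unicharset, so comparing the decoded code points against the unicharset is a reasonable next step.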
>> I don't know what causes this kind of warning or how to solve it, so I 
>> just continued the training.
>>
>> I trained the model for 10M iterations and obtained 
>> 'newtha.lstm_checkpoint', which I converted to 'newtha.traineddata' using 
>> 'lstmtraining --stop_training --continue_from newtha.lstm_checkpoint 
>> --traineddata tha.traineddata --model_output newtha.traineddata'.
>> Then I put 'newtha.traineddata' in '/usr/local/share/tessdata/' and called 
>> it with 'tesseract -l newtha ...'.
>>
>> I tested this model on images captured with a smartphone. The 
>> character-level accuracy is about 80%, while the default model gives about 
>> 95%. During the test, I also observed that the new model sometimes 
>> strangely fails to recognize text that seems easy, as shown below.
>>
>> [image: Screen Shot 2561-08-31 at 11.19.53.png]
>>
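The post does not say how the character-level accuracy figures were measured. One common definition, shown here purely as an assumption, is 1 minus the edit distance between reference and OCR output divided by the reference length. `edit_distance` and `char_accuracy` are hypothetical helper names, and the comparison is per Unicode code point, so a Thai base character and its combining vowel or tone mark count separately.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def char_accuracy(ref, hyp):
    """Return 1 - distance/len(ref), clamped at 0; empty ref is a special case."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))

print(char_accuracy("abcd", "abxd"))
```

Scoring both models with the same function on the same ground-truth lines keeps the 80% and 95% figures comparable.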
>> What should I do next to improve the accuracy? Should I try changing 
>> the structure of the LSTM model, training on text with real meaning, or 
>> adding more fonts and other degradations such as Gaussian blur or 
>> salt-and-pepper noise?
>>
>> Any suggestions are welcome and appreciated.
>> Thank you,
>> Sanparith
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/78a0624a-c9ca-43c1-bd64-077bf0301e8b%40googlegroups.com
>> For more options, visit https://groups.google.com/d/optout.
>>
>
