Thanks :)

On Friday, August 31, 2018 at 3:29:21 PM UTC+7, shree wrote:
>
> A few points to note:
>
> 1. The langdata repo has training data for 3.04. Please use the
> langdata_lstm repo for training data for LSTM training.
>
> 2. To train from existing models, you need to use traineddata files from
> the tessdata_best repo.
>
> 3. Use the tesstrain.sh script to create the starter traineddata file to
> be used for training.
>
> 4. Build the latest beta.4 code from github and use that.
>
> On Fri 31 Aug, 2018, 1:17 PM sanparith marukatat, <[email protected]> wrote:
>
>> Hi everyone,
>>
>> I have been playing with Tesseract for the Thai language for a while. The
>> performance of the default LSTM model is good. However, I would like to
>> know if I can improve it further.
>>
>> First I tried to retrain the model but ran into problems. I also tried
>> to replace the top layer, without success either. I think that this is
>> due to the unicharset (but I am not sure, I forgot the error messages).
>> So I ended up training the model from scratch. Now I have a working model
>> but I cannot reach the same performance as the default model. Please give
>> some advice on how to improve the accuracy of the model.
>>
>> Here is how I did it.
>> I used common Thai fonts (Tahoma, Sarabun, Angsana, Browallia, Cordia,
>> Dillenia, Iris) along with fonts arbitrarily picked from
>> http://www.thaisignmaker.com/korkhorkore/?catalog/all/-/date/1
>> In total, 65 fonts were selected to train the new model.
>>
>> I downloaded the Thai training text, i.e. 'tha.training_text', from
>> https://github.com/tesseract-ocr/langdata/blob/master/tha/tha.training_text
>> I observed that a lot of the text in this file is gibberish. I think that
>> the default model was built from this text file, so I used it as well.
>>
>> I used 'text2image' to generate training data by varying 3 exposures
>> (-1,0,1), 2 conditions (normal, degraded), and 2 dpi (300, 400). From
>> 'tha.training_text' and 65 fonts, I obtained 900,000+ lines to train the
>> model.
>>
>> I downloaded 'tha.traineddata' from
>> https://github.com/tesseract-ocr/tessdata
>> I observed that 'tha.traineddata' contains two unicharsets, i.e.
>> 'tha.unicharset' and 'tha.lstm-unicharset'. As I am interested in the LSTM
>> model, I replaced 'tha.lstm-unicharset' with the new unicharset generated
>> from box files using 'unicharset_extractor'.
>> Note that the help message of 'unicharset_extractor' says:
>> ...
>> Where mode means:
>> 1=combine graphemes (use for Latin and other simple scripts)
>> 2=split graphemes (use for Indic/Khmer/Myanmar)
>> 3=pure unicode (use for Arabic/Hebrew/Thai/Tibetan)
>>
>> However, Thai has split graphemes such as "ะ", "แ", "ำ", "ญ", and "ฐ",
>> so I called unicharset_extractor with "--norm_mode 2" instead of 3. I am
>> not sure if this is the correct setting for norm_mode.
>>
>> Then I used 'combine_tessdata' to replace 'tha.lstm-unicharset' in
>> 'tha.traineddata'.
>>
>> I trained the model using 'lstmtraining --traineddata tha.traineddata
>> --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c150]' ...'
>> I believe this means that I construct a NN with:
>> - input shape (1,36,0,1), i.e. batch size=1 (ignored), image height 36,
>> variable width (the 0), and 1 channel (grayscale)
>> - Convolution with tanh of size 3x3, 16 filters
>> - Maxpooling 3x3
>> - LSTM forward in y-direction, with the output summarized into 48 values
>> - LSTM forward in x-direction with 96 outputs
>> - LSTM backward in x-direction with 96 outputs
>> - LSTM forward in x-direction with 256 outputs
>> - Output sequence of 150-dim vectors using softmax+CTC.
>> I copied the model from somewhere on the Internet and modified it. I
>> still don't know what 'summarize' in LSTM actually means.
>> (https://github.com/tesseract-ocr/tesseract/wiki/VGSLSpecs)
>>
>> During the training I observed lots of warning messages such as
>> Encoding of string failed! Failure bytes: ffffffe0 ffffffb8 ffffff84
>> ffffffe0 ffffffb8 ffffffb8 ffffffe0 ffffffb8 ffffffa2 20 ffffffe0 ffffffb9
>> ffffff80 ffffffe0 ffffffb8 ffffff94 ffffffe0 ffffffb8 ffffffb5 ffffffe0
>> ffffffb8 ffffffa2 20 ffffffe0 ffffffb8 ffffffa3 ffffffe0 ffffffb8 ffffffb0
>> ffffffe0 ffffffb8 ffffff9a ffffffe0 ffffffb8 ffffff9a ffffffe0 ffffffb9
>> ffffff91 ffffffe0 ffffffb9 ffffff99 20 37 37 20 ffffffe0 ffffffb9 ffffff81
>> ffffffe0 ffffffb8 ffffffa5 ffffffe0 ffffffb8 ffffffb0 ffffffe0 ffffffb8
>> ffffffa1 ffffffe0 ffffffb8 ffffffb5 2e 22 20 ffffffe0 ffffffb8 ffffffa1
>> ffffffe0 ffffffb8 ffffffb4 ffffffe0 ffffffb9 ffffff80 ffffffe0 ffffffb8
>> ffffffa1 ffffffe0 ffffffb8 ffffffb7 ffffffe0 ffffffb8 ffffffad ffffffe0
>> ffffffb8 ffffff87
>> Can't encode transcription: 'คุย เดีย ระบบ๑๙ 77 และมี." มิเมือง' in
>> language ''
>>
>> I don't know what causes this kind of warning or how to solve it, so I
>> just continued the training.
>>
>> I trained the model for 10M iterations and obtained
>> 'newtha.lstm_checkpoint', which I converted to 'newtha.traineddata' using
>> 'lstmtraining --stop_training --continue_from newtha.lstm_checkpoint
>> --traineddata tha.traineddata --model_output newtha.traineddata'.
>> Then I put 'newtha.traineddata' in '/usr/local/share/tessdata/' and called
>> it with 'tesseract -l newtha ...'.
>>
>> I tested this model on images captured with a smartphone. The
>> character-level accuracy is about 80%, while the default model gives about
>> 95% accuracy. During the test, I also observed that the new model
>> sometimes strangely failed to recognize text that seems easy, as shown
>> below.
>>
>> [image: Screen Shot 2561-08-31 at 11.19.53.png]
>>
>> What should I do next to improve the accuracy? Should I try changing
>> the structure of the LSTM model, training on text with real meaning, or
>> adding more fonts and other degradations such as Gaussian blur or
>> salt-and-pepper noise?
>>
>> Any suggestions are welcome and appreciated.
>> Thank you,
>> Sanparith
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/78a0624a-c9ca-43c1-bd64-077bf0301e8b%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>
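
For reference, the net spec quoted above can be broken down token by token. The sketch below is only a small illustrative reader for the layer codes that appear in that one spec, following my reading of the VGSLSpecs wiki page; it is not Tesseract's own parser, and the function names (describe_input, describe_layer, describe_spec) are made up for this example. As far as I understand the wiki, a 0 in the input spec marks a variable dimension, and the 's' in Lfys48 means only the final step along y is kept, which collapses that dimension.

```python
import re

# Letter codes as I read them on the VGSLSpecs wiki page (illustrative subset).
ACTIVATIONS = {"s": "sigmoid", "t": "tanh", "r": "relu", "l": "linear", "m": "softmax"}
DIRECTIONS = {"f": "forward", "r": "reverse", "b": "bidirectional"}
OUTPUTS = {"l": "logistic", "s": "softmax", "c": "softmax with CTC"}


def describe_input(token):
    """Describe the input token 'batch,height,width,depth'; 0 means variable."""
    batch, height, width, depth = (int(v) for v in token.split(","))
    dims = {"batch": batch, "height": height, "width": width, "depth": depth}
    parts = [f"{k}={'variable' if v == 0 else v}" for k, v in dims.items()]
    return "input: " + ", ".join(parts)


def describe_layer(token):
    """Describe one layer token; only the codes used in the quoted spec."""
    m = re.fullmatch(r"C([strlm])(\d+),(\d+),(\d+)", token)
    if m:
        act, y, x, d = m.groups()
        return f"conv {y}x{x}, {d} outputs, {ACTIVATIONS[act]}"
    m = re.fullmatch(r"Mp(\d+),(\d+)", token)
    if m:
        return f"maxpool {m.group(1)}x{m.group(2)}"
    m = re.fullmatch(r"L([frb])([xy])(s?)(\d+)", token)
    if m:
        direction, axis, summarize, n = m.groups()
        text = f"LSTM {DIRECTIONS[direction]} along {axis}, {n} outputs"
        if summarize:
            text += f", summarized (only the final {axis} step is kept)"
        return text
    m = re.fullmatch(r"O([012])([lsc])(\d+)", token)
    if m:
        dim, kind, n = m.groups()
        return f"{dim}-d output, {OUTPUTS[kind]}, {n} classes"
    return f"unrecognized token: {token}"


def describe_spec(spec):
    """Split a bracketed spec string and describe each token in order."""
    tokens = spec.strip().strip("[]").split()
    return [describe_input(tokens[0])] + [describe_layer(t) for t in tokens[1:]]


if __name__ == "__main__":
    spec = "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c150]"
    for line in describe_spec(spec):
        print(line)
```

Reading the spec this way also makes it easy to compare against another spec before changing the network structure.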
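
The "Failure bytes" dump quoted above looks like UTF-8 bytes printed as sign-extended 32-bit hex values (ffffffe0 for 0xe0). That is an assumption based on the pattern, not something confirmed from the Tesseract source. Under that assumption, a minimal sketch for turning such a dump back into text and listing its code points is below; decode_failure_bytes is a hypothetical helper name, and the sample is just the first ten values from the quoted warning. Listing the code points may help when checking which characters are missing from the unicharset, which is one possible cause of this warning.

```python
def decode_failure_bytes(dump):
    """Decode a whitespace-separated hex dump like 'ffffffe0 ffffffb8 20'.

    Assumes each token is one byte that was printed sign-extended,
    so only the low 8 bits of each value are kept.
    """
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    return raw.decode("utf-8")


if __name__ == "__main__":
    # First ten values of the dump from the warning quoted above.
    sample = ("ffffffe0 ffffffb8 ffffff84 ffffffe0 ffffffb8 ffffffb8 "
              "ffffffe0 ffffffb8 ffffffa2 20")
    text = decode_failure_bytes(sample)
    print(repr(text))
    # Print each character's code point for comparison with the unicharset.
    for ch in text:
        print(f"U+{ord(ch):04X}")
```

The decoded sample matches the start of the transcription in the "Can't encode transcription" line, which supports the reading that the two messages describe the same string.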

