[tesseract-ocr] Fine-tuning LSTM for Japanese

Akira Hayakawa Sun, 28 May 2017 08:21:59 -0700

I am new to Tesseract. My aim is to use it to analyze Japanese documents. 
My plan is to start from an existing model and fine-tune it on new words 
that weren't recognized correctly.


I am reading the Wiki and have some questions.

1)

In 
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune

you pass --training_text to tesstrain.sh:

training/tesstrain.sh \
  --fonts_dir /usr/share/fonts \
  --training_text ../langdata/ara/ara.training_text \
  --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --lang ara \
  --linedata_only \
  --noextract_font_properties \
  --exposures "0" \
  --fontlist "Arial" \
  --output_dir ~/tesstutorial/aratest


but in https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

you don't. Why?

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain

My understanding is:

1. tesstrain.sh uses the text2image command internally to render the text 
as images in various fonts, with some reshaping applied.
2. --linedata_only splits the training text into lines and makes an image 
for each line.
3. --langdata_dir is essential but --training_text isn't. If --training_text 
isn't given, the default $lang/$lang.training_text is used.

Am I correct?
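
To make point 3 concrete, here is a minimal sketch of the fallback as I understand it. It is not taken from tesstrain.sh; the helper name resolve_training_text and its arguments are hypothetical and only illustrate the lookup I am assuming.

```shell
#!/bin/sh
# Sketch of my understanding only; this is not the actual tesstrain.sh code.
# resolve_training_text is a hypothetical helper name.
resolve_training_text() {
    lang="$1"            # e.g. ara
    langdata_dir="$2"    # e.g. ../langdata
    training_text="$3"   # value of --training_text, may be empty

    # Fall back to <langdata_dir>/<lang>/<lang>.training_text when unset.
    if [ -z "$training_text" ]; then
        training_text="$langdata_dir/$lang/$lang.training_text"
    fi
    printf '%s\n' "$training_text"
}

resolve_training_text ara ../langdata ""                    # ../langdata/ara/ara.training_text
resolve_training_text ara ../langdata ./my.training_text    # ./my.training_text
```

If this is right, the first wiki command passes the same path that the default would have produced anyway.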

2)

In the above example, I don't understand why --tessdata_dir is needed, 
because it seems irrelevant to generating training data.

3)

In 
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune

It says the reader should lay out the projects like this:

./langdata
./langdata/eng
./langdata/ara
./tessdata
./tesseract
./tesseract/tessdata
./tesseract/tessdata/configs/
./tesseract/training
etc

 
and all the following examples are run from the tesseract directory. In 
that case I think the examples should pass ../tessdata as --tessdata_dir 
rather than ./tessdata. In other words, the examples should be fixed.
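
The difference between the two paths can be checked with a small sketch that recreates the layout above in a scratch directory (created with mktemp, so nothing real is touched) and resolves both paths from inside the tesseract directory. It only shows where each relative path points; it does not say which one the wiki intends.

```shell
#!/bin/sh
# Recreate the directory layout from the wiki in a scratch directory.
root=$(mktemp -d)
mkdir -p "$root/langdata/eng" "$root/langdata/ara" \
         "$root/tessdata" \
         "$root/tesseract/tessdata/configs" "$root/tesseract/training"

# The examples are run from inside the tesseract directory.
cd "$root/tesseract" || exit 1

inner=$(cd ./tessdata && pwd)    # resolves to <root>/tesseract/tessdata
outer=$(cd ../tessdata && pwd)   # resolves to <root>/tessdata

echo "./tessdata  -> ${inner#"$root"}"
echo "../tessdata -> ${outer#"$root"}"

# Clean up the scratch directory.
cd / && rm -rf "$root"
```

So ./tessdata is the directory inside the tesseract checkout (the one with configs/), while ../tessdata is the separate top-level tessdata project.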

4) 

In 
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune

combine_tessdata -e ../tessdata/ara.traineddata \
  ~/tesstutorial/aratuned_from_ara/ara.lstm


This is described as extracting the existing LSTM model for Arabic from 
tessdata, but how does that work?
Does combine_tessdata extract the LSTM model because the second argument 
has the .lstm extension?
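
To spell out what I am guessing, here is a minimal sketch of picking a component name from the output file's extension. This is only my hypothesis about the behaviour, not combine_tessdata's actual code, and component_from_output is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch of my guess only; not combine_tessdata's actual implementation.
# component_from_output is a hypothetical helper name.
component_from_output() {
    # Strip everything up to and including the last dot in the path.
    printf '%s\n' "${1##*.}"
}

component_from_output ~/tesstutorial/aratuned_from_ara/ara.lstm   # prints "lstm"
```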

Another question: why is the LSTM model bundled into the traineddata file? 
I think the traineddata file contains both the legacy model and the LSTM 
model, and I wonder why they aren't separate files. Even if the user only 
uses LSTM, are both models loaded? (If loading is lightweight, that might 
be OK.)
