Re: [tesseract-ocr] Fine-tuning LSTM for Japanese

ShreeDevi Kumar Sun, 28 May 2017 11:15:19 -0700

Please see inline replies:

On Sun, May 28, 2017 at 4:53 PM, Akira Hayakawa <[email protected]> wrote:


> I am new to Tesseract. My aim is to use it to analyze Japanese
> documents. My idea is to start from an existing model and fine-tune it
> on new words that weren't correctly recognized.
>
> I am reading the Wiki and have some questions.
>
> 1)
>
> In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune
>
> you pass --training_text to tesstrain.sh
>
> training/tesstrain.sh \
>   --fonts_dir /usr/share/fonts \
>   --training_text ../langdata/ara/ara.training_text \
>   --langdata_dir ../langdata \
>   --tessdata_dir ./tessdata \
>   --lang ara \
>   --linedata_only \
>   --noextract_font_properties \
>   --exposures "0" \
>   --fontlist "Arial" \
>   --output_dir ~/tesstutorial/aratest
>
>
> but
>
> In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>
> You don't. Why?
>
> training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>   --noextract_font_properties --langdata_dir ../langdata \
>   --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain
>
> My understanding is
>
> 1. tesstrain.sh uses the text2image command internally to generate images
> that are rendered in various fonts and reshaped.
> 2. --linedata_only splits the training text into lines and makes an image
> for each line.
> 3. --langdata_dir is essential but --training_text isn't. If --training_text
> isn't specified, it uses the default $lang/$lang.training_text.
>
> Am I correct?
>

Yes, you are correct.
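
The fallback in your point 3 can be sketched as a small shell function. This is only an illustration of the behaviour described above, not the actual code in tesstrain.sh; the function name resolve_training_text is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper illustrating the default described above.
# Usage: resolve_training_text LANGDATA_DIR LANG [TRAINING_TEXT]
resolve_training_text() {
    langdata_dir=$1
    lang=$2
    training_text=${3:-}
    if [ -n "$training_text" ]; then
        # An explicitly given training text takes precedence.
        printf '%s\n' "$training_text"
    else
        # Otherwise fall back to <langdata_dir>/<lang>/<lang>.training_text.
        printf '%s\n' "$langdata_dir/$lang/$lang.training_text"
    fi
}

# No explicit training text: the per-language default is used.
resolve_training_text ../langdata jpn
# Explicit training text: it is returned unchanged.
resolve_training_text ../langdata ara ../langdata/ara/ara.training_text
```

So for Japanese you would only need --training_text if you want something other than the default file under langdata/jpn.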

>
> 2)
>
> In the above example, I don't understand why it takes --tessdata_dir,
> because it seems irrelevant to making training data.
>

Tesseract needs the eng and osd traineddata files during initialization. The
location can also be specified via the TESSDATA_PREFIX environment variable.
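
A quick way to confirm that a directory has what is needed is a pre-flight check like the one below. It is a minimal sketch: the function name missing_traineddata is hypothetical, and it only tests that the two files exist. Whether TESSDATA_PREFIX should name the tessdata directory itself or its parent has differed between Tesseract versions, so check the documentation for your build before exporting it.

```shell
#!/bin/sh
# Hypothetical pre-flight check: print the required traineddata files
# that are missing from the given tessdata directory.
missing_traineddata() {
    tessdata_dir=$1
    for lang in eng osd; do
        if [ ! -f "$tessdata_dir/$lang.traineddata" ]; then
            printf '%s\n' "$lang.traineddata"
        fi
    done
}

# Prints nothing when both eng.traineddata and osd.traineddata are present.
missing_traineddata ./tessdata
```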

>
> 3)
>
> In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune
>
> It says the reader should lay out the projects like this
>
> ./langdata
> ./langdata/eng
> ./langdata/ara
> ./tessdata
> ./tesseract
> ./tesseract/tessdata
> ./tesseract/tessdata/configs/
> ./tesseract/training
> etc
>
>
That will be the directory structure if you clone the tesseract,
langdata and tessdata repositories.

Cloning the whole tessdata repo (over 1 GB) is not recommended; you can
download just the traineddata files for the languages you need.

>
> and all the following examples are run from the tesseract directory. So I
> think the examples should pass ../tessdata as --tessdata_dir instead of
> ./tessdata. I mean the examples should be fixed.
>
>
./tessdata (in tesseract repo) does not have any traineddata files to
begin with.

You can change the directories to match your directory configuration.



> 4)
>
> In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune
>
> combine_tessdata -e ../tessdata/ara.traineddata \
>   ~/tesstutorial/aratuned_from_ara/ara.lstm
>
>
> This is explained as extracting the existing LSTM model for Arabic from
> tessdata, but how?
> Does the combine_tessdata command extract the LSTM model because the
> extension of the second parameter is .lstm?
>

Yes.
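
To make the idea concrete, here is a rough sketch of how a component name can be read off the output path. It only illustrates the "suffix selects the component" convention; it is not how combine_tessdata is implemented, and the function name component_suffix is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the suffix after the last dot of the file name,
# which is what identifies the component to extract.
component_suffix() {
    base=${1##*/}              # drop any leading directories
    printf '%s\n' "${base##*.}"  # keep what follows the last dot
}

component_suffix "$HOME/tesstutorial/aratuned_from_ara/ara.lstm"
```

For the path in the wiki example this yields lstm, so the LSTM model is the component that gets written out.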

>
> Another question here is why the LSTM model is bundled into the traineddata. I
> think the traineddata file mixes the legacy trained model and the LSTM model,
> and I am wondering why they aren't separated. Even if the user only uses LSTM,
> are both trained models read? (Is it lightweight? Then it might be OK.)
>

The 4.0 code is in the alpha stage of testing and supports both the legacy
engine and the new LSTM engine, so the traineddata file contains both models.

You can use combine_tessdata to keep only the LSTM model in the traineddata.
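
One possible way to do that is to unpack the traineddata, keep only the LSTM-related component files, and recombine them. The sketch below covers only the selection step. The helper name lstm_components is hypothetical, and the rule that LSTM components are the ones whose suffix starts with "lstm" is an assumption; list the components in your own file to confirm before relying on it. The paths in the comments are examples only.

```shell
#!/bin/sh
# Hypothetical helper: read component file names on stdin and print only
# those whose suffix starts with "lstm" (an assumed naming convention).
lstm_components() {
    prefix=$1
    while IFS= read -r f; do
        case $f in
            "$prefix".lstm*) printf '%s\n' "$f" ;;
        esac
    done
}

# Demo with example file names; no Tesseract tools are needed for this part.
printf '%s\n' jpn.unicharset jpn.lstm jpn.lstm-word-dawg jpn.inttemp |
    lstm_components jpn

# Possible surrounding steps (require combine_tessdata; not run here):
#   combine_tessdata -u ../tessdata/jpn.traineddata /tmp/jpn_all/jpn.
#   ... copy the files selected by lstm_components into /tmp/jpn_lstm/ ...
#   combine_tessdata /tmp/jpn_lstm/jpn.
```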



-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduVqHs9HeBisZm2ikPBN8tnbbaqYrpjg0U0pG6%3DqDYAnDQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
