I am new to Tesseract. My aim is to use it to analyze Japanese documents. My plan is to start from an existing model and fine-tune it on new words that were not recognized correctly.
I am reading the wiki and have some questions.

1) In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune you pass --training_text to tesstrain.sh:

    training/tesstrain.sh \
      --fonts_dir /usr/share/fonts \
      --training_text ../langdata/ara/ara.training_text \
      --langdata_dir ../langdata \
      --tessdata_dir ./tessdata \
      --lang ara \
      --linedata_only \
      --noextract_font_properties \
      --exposures "0" \
      --fontlist "Arial" \
      --output_dir ~/tesstutorial/aratest

but in https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 you don't. Why?

    training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
      --noextract_font_properties --langdata_dir ../langdata \
      --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain

My understanding is:

1. tesstrain.sh uses the text2image command internally to generate images that are rendered in various fonts and reshaped.
2. --linedata_only splits the training text into lines and makes an image for each line.
3. --langdata_dir is essential but --training_text isn't. If --training_text isn't given, it uses the default $lang/$lang.training_text.

Am I correct?

2) In the above example, I don't understand why it takes --tessdata_dir, because that seems irrelevant to making training data.

3) https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune says the reader should lay out the projects like this:

    ./langdata
    ./langdata/eng
    ./langdata/ara
    ./tessdata
    ./tesseract
    ./tesseract/tessdata
    ./tesseract/tessdata/configs/
    ./tesseract/training
    etc.

and all the following examples are run from the tesseract directory. So I think the examples should take ../tessdata as --tessdata_dir, not ./tessdata. I mean the examples should be fixed.
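To make question 3 concrete, here is a minimal sketch that recreates the layout in a scratch directory and prints where ./tessdata and ../tessdata point when the current directory is tesseract. The scratch location and the variable names (root, dot_tessdata, dotdot_tessdata) are my own for illustration; they are not from the wiki.

```shell
#!/bin/sh
# Recreate the directory layout from the wiki in a scratch directory.
# (The scratch directory is an assumption, used only for illustration.)
root=$(mktemp -d)
mkdir -p "$root/langdata/eng" "$root/langdata/ara" \
         "$root/tessdata" \
         "$root/tesseract/tessdata/configs" \
         "$root/tesseract/training"

# The wiki examples are run from inside the tesseract directory.
cd "$root/tesseract" || exit 1

# Resolve both candidate values of --tessdata_dir to absolute paths.
dot_tessdata=$(cd ./tessdata && pwd -P)
dotdot_tessdata=$(cd ../tessdata && pwd -P)

echo "./tessdata  -> $dot_tessdata"
echo "../tessdata -> $dotdot_tessdata"
```

Both directories exist in this layout, so ./tessdata is a valid path from the tesseract directory; it just refers to tesseract/tessdata (the one containing configs) rather than the top-level tessdata.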
4) In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune

    combine_tessdata -e ../tessdata/ara.traineddata \
      ~/tesstutorial/aratuned_from_ara/ara.lstm

This is explained as extracting the existing LSTM model for Arabic from tessdata, but how does that work? Does combine_tessdata extract the LSTM model because the extension of the second parameter is .lstm?

Another question here: why is the LSTM model bundled into the traineddata file? I think the traineddata file mixes the legacy trained model and the LSTM model, and I am wondering why they aren't separated. Even if the user only uses LSTM, are both trained models read? (If that is lightweight, it might be OK.)

