Re: [tesseract-ocr] Train Tesseract 4.0 LSTM based on images

ShreeDevi Kumar Wed, 12 Apr 2017 03:35:15 -0700

Arabic was never trained with the legacy Tesseract engine, and I doubt you
will get any improvement over the existing traineddata using cube or LSTM.


You are free to experiment and see what you come up with.

I have pointed to the bash scripts for training. Please refer to them for
the correct process.
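
As a rough illustration of the key difference (a sketch only, not copied
from tesstrain.sh): the script passes a different config name to tesseract
in the feature-extraction step, "lstm.train" for LSTM training (producing
.lstmf files) and "box.train" for the legacy engine (producing .tr files).
The function name feature_extract_cmd is hypothetical. It only prints the
command rather than running it, so it does not need tesseract installed.
The image name is the one from your example, and I am assuming the matching
.box file sits next to the image.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the feature-extraction command for each engine.
# feature_extract_cmd is a hypothetical helper, not part of tesstrain.sh.

feature_extract_cmd() {
  local engine="$1"   # "lstm" or "legacy"
  local image="$2"    # e.g. ara.arial.exp4.tif
  local base="${image%.*}"   # strip the extension to get the output base

  case "$engine" in
    lstm)
      # LSTM training config; tesstrain.sh expects .lstmf output
      printf 'tesseract %s %s lstm.train\n' "$image" "$base"
      ;;
    legacy)
      # Legacy engine training config; tesstrain.sh expects .tr output
      printf 'tesseract %s %s box.train\n' "$image" "$base"
      ;;
    *)
      printf 'unknown engine: %s\n' "$engine" >&2
      return 1
      ;;
  esac
}

feature_extract_cmd lstm ara.arial.exp4.tif
feature_extract_cmd legacy ara.arial.exp4.tif
```

Only the first step changes in this sketch. The later legacy steps
(cntraining, shapeclustering, mftraining) apply only to the box.train
path, as the else branch of the script shows.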

- excuse the brevity, sent from mobile

On 12-Apr-2017 4:00 PM, <[email protected]> wrote:

> Hello Shree, thank you for your valuable reply. Are there any changes I
> need to make to the steps below? Please suggest changes to the commands
> below, which are for Tesseract 3.0.
>
> tesseract ara.arial.exp4.tif ara.arial.exp4 nobatch box.train
> unicharset_extractor ara.arial.exp4.box
> echo "arial 0 0 1 0 0" > font_properties # give Tesseract information about the font
> mftraining -F font_properties -U unicharset -O ara.unicharset ara.arial.exp4.tr
> shapeclustering -F unicharset ara.arial.exp4.tr
> cntraining ara.arial.exp4.tr
>
> mv inttemp ara.inttemp
> mv normproto ara.normproto
> mv pffmtable ara.pffmtable
> mv shapetable ara.shapetable
> combine_tessdata ara.
>
>
> Please suggest changes for the above steps. I plan to publish a rigorous
> explanatory tutorial once I have an overview of all the steps.
> Thank you.
>
>
> On Wednesday, April 12, 2017 at 3:38:11 PM UTC+5:30, shree wrote:
>>
>> see https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain.sh
>>
>>
>> if ((LINEDATA)); then
>>   phase_E_extract_features "lstm.train" 8 "lstmf"
>>   make__lstmdata
>> else
>>   phase_E_extract_features "box.train" 8 "tr"
>>   phase_C_cluster_prototypes "${TRAINING_DIR}/${LANG_CODE}.normproto"
>>   if [[ "${ENABLE_SHAPE_CLUSTERING}" == "y" ]]; then
>>       phase_S_cluster_shapes
>>   fi
>>   phase_M_cluster_microfeatures
>>   phase_B_generate_ambiguities
>>   make__traineddata
>> fi
>>
>> --------------------
>>
>> lstm.train is for LSTM training.
>>
>> box.train is for training the legacy Tesseract 3.0 engine.
>>
>> Please note that the current master code is for alpha testing of 4.0 LSTM
>> and will most probably drop support for the legacy engine.
>>
>> If you want to train for the legacy Tesseract engine, please use the
>> 3.05 branch of the GitHub repo.
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/70a9d13b-a28b-4e6f-9c78-ec1c41361d96%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

