Hi Dennis,

1. Copy the 4.0-format box/tiff pairs to the langdata/$lang directory, or to
any other folder of your choice (adjust the cp paths in step 2 to match).

2. Modify tesstrain.sh to copy these files to your /tmp directory. The
excerpt below shows where the lines need to be added (they are marked #shree):


source "$(dirname $0)/tesstrain_utils.sh"

ARGV=("$@")
parse_flags

mkdir -p ${TRAINING_DIR}
tlog "\n=== Starting training for language '${LANG_CODE}'"

# copy box tiff pairs from langdata/lang directory #shree
cp ./langdata/${LANG_CODE}/*.tif "${TRAINING_DIR}/"  #shree
cp ./langdata/${LANG_CODE}/*.box "${TRAINING_DIR}/"  #shree
ls -l "${TRAINING_DIR}/"    #shree

source "$(dirname $0)/language-specific.sh"
set_lang_specific_parameters ${LANG_CODE}

3. Run tesstrain.sh with at least one font and a sample training text, in
addition to the provided box/tiff pairs.
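
The cp lines in step 2 copy every .tif and every .box file, whether or not
each image actually has a matching box file. As a rough sketch of a more
defensive variant, the function below copies only complete pairs and warns
about the rest. The name stage_pairs is hypothetical and is not part of
tesstrain.sh; I am also assuming that a pair means the same base name with
the extensions .tif and .box.

```shell
# Hypothetical helper, not part of tesstrain.sh.
# stage_pairs SRC_DIR DEST_DIR
# Copies each SRC_DIR/*.tif that has a same-named .box file, together with
# that .box file, into DEST_DIR. Images without a box file are skipped with
# a warning on stderr. Prints the number of pairs copied on stdout.
stage_pairs() {
  local src="$1" dest="$2" count=0 tif box
  mkdir -p "$dest"
  for tif in "$src"/*.tif; do
    [ -e "$tif" ] || continue        # the glob matched nothing
    box="${tif%.tif}.box"            # assumed naming: foo.tif <-> foo.box
    if [ -f "$box" ]; then
      cp "$tif" "$box" "$dest"/
      count=$((count + 1))
    else
      echo "warning: no box file for $tif, skipping" >&2
    fi
  done
  echo "$count"
}
```

Inside tesstrain.sh, the three #shree lines could then be replaced by a call
such as stage_pairs "./langdata/${LANG_CODE}" "${TRAINING_DIR}", placed at
the same position in the script.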








ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sun, Apr 15, 2018 at 12:36 PM, <denniscf...@berkeley.edu> wrote:

> Hi shree,
>
> Thanks for your reply. Is there any option to use tesstrain.sh in
> tesseract 4.0 to generate the traineddata and lstm files using the image
> and boxfiles? Or do I still have to go through the process as listed in the
> Tesseract 3.0 instructions? In which case, I would be able to generate the
> traineddata file (and the unicharset file, I think), but not the lstm file.
> How can I generate this lstm file? Is there a tool I can use?
>
> Thanks again,
> Dennis
>
> On Friday, April 13, 2018 at 5:19:47 AM UTC-7, shree wrote:
>>
>> Training Tesseract 4.0 from images is not officially supported. Different
>> people have had success in doing LSTM training with box/tiff pairs, but it
>> requires hacks/programming on their part to create 4.0.0 compatible box
>> files.
>>
>> tesstrain.sh creates box/tiff files in the /tmp directory; these are used
>> to create the lstmf files for LSTM training. tesstrain.sh can create a 3.0x
>> compatible traineddata or a 4.0.0 compatible starter traineddata, depending
>> on the options chosen. For 4.0.0, this starter traineddata along with the
>> lstmf files is used for LSTM training.
>>
>> The format of traineddata files for 3.0x and 4.0.0 is different.
>>
>> For the different components of a traineddata file, see
>>
>> https://github.com/tesseract-ocr/tesseract/blob/master/doc/combine_tessdata.1.asc
>>
>> For creating 4.0 compatible box files see
>>
>> https://github.com/tesseract-ocr/langdata/issues/83#issuecomment-375247341
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#training-tesseract-lstm-engine
>>
>> Please note that all these are unsupported options.
>>
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Fri, Apr 13, 2018 at 12:09 PM, <denni...@berkeley.edu> wrote:
>>
>>> Hi all,
>>>
>>> I read in a different post that training Tesseract 4.0 from images is
>>> not supported. Is this true? I have been able to successfully train
>>> Tesseract 4.0 so far using font data. When using tesstrain.sh, the script
>>> creates a number of files, including an lstmf file alongside the usual
>>> traineddata file (and there are some others, like unicharset). I was
>>> wondering if it is possible to use the traineddata generation from image
>>> and boxfile described in the Tesseract 3.0 training instructions to create
>>> these training files to train Tesseract 4.0. The Tesseract 3.0 instructions
>>> already produce a traineddata file; how can I generate the lstmf file (and
>>> the others), if it is possible?
>>>
>>> Thank you,
>>> Dennis
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To post to this group, send email to tesser...@googlegroups.com.
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/bc664de6-5386-45b3-ae4d-70ac5338938c%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
