Please take a look at tesstrain_utils.sh and language-specific.sh in the
training directory for more details about how training works.

As mentioned before, training with box/tiff pairs is not supported.
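
Because the modified script quoted further down copies every .tif and .box file from langdata/$lang, it may help to confirm first that each image has a box file with the same base name. The following is only a sketch: the function name find_unpaired_tifs is hypothetical and is not part of tesstrain.sh or tesstrain_utils.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the Tesseract training scripts):
# print the name of every .tif in a directory that has no .box file
# with the same base name.
find_unpaired_tifs() {
  local dir="$1" tif
  for tif in "$dir"/*.tif; do
    [ -e "$tif" ] || continue                      # glob matched nothing
    [ -e "${tif%.tif}.box" ] || printf '%s\n' "${tif##*/}"
  done
}

# Example (the path is an assumption):
# find_unpaired_tifs ./langdata/eng
```

An empty result means every image found has a matching box file; it says nothing about whether the box files are in the 4.0 format.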



On Mon 16 Apr, 2018, 8:19 AM, <denniscf...@berkeley.edu> wrote:

> Hi Shree,
>
> Thanks for your help, I was able to successfully train with the boxfiles.
> Is it possible to not provide any font data at all? Theoretically, if I was
> training for a document that did not have any font data available on the
> web, what would I do then?
> In tesstrain.sh, after I copy the box tiff pairs into /tmp like you said,
> does the script still generate box-tiff pairs using font data? It seems
> that the lines that say
>
> phase_I_generate_image 8
> phase_UP_generate_unicharset
>
> serve this function. Is the script still relying on training data
> generated by font data? Sorry, I'm not clear on the entire process that
> tesstrain.sh uses.
>
> Thanks once again,
> Dennis
>
> On Sunday, April 15, 2018 at 1:55:16 AM UTC-7, shree wrote:
>>
>> Hi Dennis,
>>
>> 1. Copy 4.0 format box/tiff pairs to langdata/$lang directory or any
>> other folder of your choice.
>>
>> 2. Modify tesstrain.sh to copy these files to your /tmp directory - see
>> following for where the lines need to be added
>>
>>
>> source "$(dirname $0)/tesstrain_utils.sh"
>>
>> ARGV=("$@")
>> parse_flags
>>
>> mkdir -p ${TRAINING_DIR}
>> tlog "\n=== Starting training for language '${LANG_CODE}'"
>>
>> # copy box tiff pairs from langdata/lang directory #shree
>> cp ./langdata/${LANG_CODE}/*.tif "${TRAINING_DIR}/"  #shree
>> cp ./langdata/${LANG_CODE}/*.box "${TRAINING_DIR}/"  #shree
>> ls -l "${TRAINING_DIR}/"    #shree
>>
>> source "$(dirname $0)/language-specific.sh"
>> set_lang_specific_parameters ${LANG_CODE}
>>
>> 3. Run tesstrain.sh with at least one font and a sample training text, in
>> addition to the provided box/tiff pairs.
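
Step 3 above could look roughly like the following. This is a sketch under assumptions, not a tested recipe: the flags are the ones tesstrain.sh accepts in the 4.0 training scripts, but the font name, every path, and the function name build_tesstrain_cmd are placeholders. The function only prints the command so it can be reviewed before anything is run.

```shell
#!/usr/bin/env bash
# Sketch: assemble a tesstrain.sh command line for LSTM (4.0) training data.
# All paths and the font name below are placeholders, not values from this thread.
build_tesstrain_cmd() {
  local lang="$1" font="$2" text="$3"
  # --linedata_only asks for lstmf line data instead of 3.0x-style training.
  local cmd=(
    training/tesstrain.sh
    --lang "$lang"
    --linedata_only
    --noextract_font_properties
    --langdata_dir ./langdata
    --tessdata_dir ./tessdata
    --fontlist "$font"
    --training_text "$text"
    --output_dir ./train_output
  )
  # Print with shell quoting so the line can be pasted back into a terminal.
  printf '%q ' "${cmd[@]}"
  printf '\n'
}

# Print the command for review; it is not executed here.
build_tesstrain_cmd eng "Arial" ./langdata/eng/eng.training_text
```

Running from the directory that contains langdata matters here, because the cp lines added in step 2 use the relative path ./langdata/${LANG_CODE}.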
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Sun, Apr 15, 2018 at 12:36 PM, <denni...@berkeley.edu> wrote:
>>
>>> Hi shree,
>>>
>>> Thanks for your reply. Is there any option to use tesstrain.sh in
>>> tesseract 4.0 to generate the traineddata and lstm files using the image
>>> and boxfiles? Or do I still have to go through the process as listed in the
>>> Tesseract 3.0 instructions? In which case, I would be able to generate the
>>> traineddata file (and the unicharset file, I think), but not the lstm file.
>>> How can I generate this lstm file? Is there a tool I can use?
>>>
>>> Thanks again,
>>> Dennis
>>>
>>> On Friday, April 13, 2018 at 5:19:47 AM UTC-7, shree wrote:
>>>>
>>>> Training Tesseract 4.0 from images is not officially supported.
>>>> Different people have had success in doing LSTM training with box/tiff
>>>> pairs, but it requires hacks/programming on their part to create 4.0.0
>>>> compatible box files.
>>>>
>>>> tesstrain.sh creates box/tiff files in the /tmp directory; these are
>>>> used to create the lstmf files for LSTM training. tesstrain.sh can create a
>>>> 3.0x compatible traineddata or a 4.0.0 compatible starter traineddata,
>>>> depending on the options chosen. For 4.0.0, this starter traineddata,
>>>> along with the lstmf files, is used for LSTM training.
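
As a rough illustration of that last step, lstmtraining takes the starter traineddata plus a text file listing the lstmf files, one path per line. The helper below writes such a list. Its name, write_lstmf_list, and all paths are assumptions for illustration; as far as I know tesstrain.sh already writes a similar list itself when run with --linedata_only, so the helper matters mainly when lstmf files come from somewhere else. The lstmtraining call is left as a comment because it needs a Tesseract 4 build, and its --net_spec value is the one from the training wiki's from-scratch example.

```shell
#!/usr/bin/env bash
# Sketch: write the paths of all .lstmf files in a directory to a list file,
# one per line, for use with lstmtraining --train_listfile.
# Function name and paths are assumptions for illustration.
write_lstmf_list() {
  local dir="$1" listfile="$2" f
  : > "$listfile"                       # create or truncate the list file
  for f in "$dir"/*.lstmf; do
    [ -e "$f" ] || continue             # glob matched nothing
    printf '%s\n' "$f" >> "$listfile"
  done
}

# Possible follow-up (not executed here; paths are placeholders):
# write_lstmf_list ./train_output ./train_output/eng.training_files.txt
# lstmtraining \
#   --traineddata ./train_output/eng/eng.traineddata \
#   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
#   --train_listfile ./train_output/eng.training_files.txt \
#   --model_output ./lstm_output/base \
#   --max_iterations 5000
```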
>>>>
>>>> The format of traineddata files for 3.0x and 4.0.0 is different.
>>>>
>>>> For the different components of a traineddata file, see
>>>>
>>>>
>>>> https://github.com/tesseract-ocr/tesseract/blob/master/doc/combine_tessdata.1.asc
>>>>
>>>> For creating 4.0 compatible box files see
>>>>
>>>>
>>>> https://github.com/tesseract-ocr/langdata/issues/83#issuecomment-375247341
>>>>
>>>>
>>>> https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#training-tesseract-lstm-engine
>>>>
>>>> Please note that all these are unsupported options.
>>>>
>>>>
>>>> ShreeDevi
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>> On Fri, Apr 13, 2018 at 12:09 PM, <denni...@berkeley.edu> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I read in a different post that training Tesseract 4.0 from images is
>>>>> not supported, is this true? I have been able to successfully train
>>>>> Tesseract 4.0 so far using font data. When using tesstrain.sh, the script
>>>>> creates a number of files, including an lstmf file alongside the usual
>>>>> traineddata file (and there are some others like unicharset). I was
>>>>> wondering if it is possible to use the traineddata generation from image
>>>>> and boxfile described in the Tesseract 3.0 training instructions to create
>>>>> these training files to train Tesseract 4.0. Tesseract 3.0 instructions
>>>>> already produce a traineddata file, how can I generate the lstmf file (and
>>>>> the others) if it is possible?
>>>>>
>>>>> Thank you,
>>>>> Dennis
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/bc664de6-5386-45b3-ae4d-70ac5338938c%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/bc664de6-5386-45b3-ae4d-70ac5338938c%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>
>>
>

