Hi Shree,
Thanks for your help. I was able to train successfully with the box files.
Is it possible to not provide any font data at all? Hypothetically, if I were
training for a document for which no font data is available on the
web, what would I do then?
In tesstrain.sh, after I copy the box/tiff pairs into /tmp as you described,
does the script still generate box/tiff pairs from font data? It seems
that the lines
phase_I_generate_image 8
phase_UP_generate_unicharset
serve this function. Does the script still rely on training data generated
from fonts? Sorry, I'm not clear on the entire process that tesstrain.sh
uses.
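For anyone following the copy step, a minimal sketch of a check that every .tif in a directory has a matching .box could look like the following. The function name list_unpaired_tifs and its directory argument are hypothetical and not part of tesstrain.sh; the only assumption is that each pair shares a base name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of tesstrain.sh): print every .tif in a
# directory that has no .box file with the same base name.
list_unpaired_tifs() {
  local dir="$1" tif
  for tif in "$dir"/*.tif; do
    # When the glob matches nothing, the literal pattern comes back; skip it.
    [ -e "$tif" ] || continue
    # Strip the .tif suffix and look for the sibling .box file.
    [ -e "${tif%.tif}.box" ] || printf '%s\n' "$tif"
  done
}
```

Calling list_unpaired_tifs ./langdata/eng before starting tesstrain.sh prints nothing when every image has its box file; eng here is only an example language code.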
Thanks once again,
Dennis
On Sunday, April 15, 2018 at 1:55:16 AM UTC-7, shree wrote:
>
> Hi Dennis,
>
> 1. Copy 4.0 format box/tiff pairs to langdata/$lang directory or any other
> folder of your choice.
>
> 2. Modify tesstrain.sh to copy these files to your /tmp directory. The
> lines marked #shree below show where the additions go:
>
>
> source "$(dirname $0)/tesstrain_utils.sh"
>
> ARGV=("$@")
> parse_flags
>
> mkdir -p ${TRAINING_DIR}
> tlog "\n=== Starting training for language '${LANG_CODE}'"
>
> # copy box tiff pairs from langdata/lang directory #shree
> cp ./langdata/${LANG_CODE}/*.tif "${TRAINING_DIR}/" #shree
> cp ./langdata/${LANG_CODE}/*.box "${TRAINING_DIR}/" #shree
> ls -l "${TRAINING_DIR}/" #shree
>
> source "$(dirname $0)/language-specific.sh"
> set_lang_specific_parameters ${LANG_CODE}
>
> 3. Run tesstrain.sh with at least one font and a sample training text to
> use, in addition to the provided box/tiff pairs.
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Sun, Apr 15, 2018 at 12:36 PM, <[email protected]> wrote:
>
>> Hi shree,
>>
>> Thanks for your reply. Is there any option to use tesstrain.sh in
>> Tesseract 4.0 to generate the traineddata and lstm files using the image
>> and box files? Or do I still have to go through the process as listed in the
>> Tesseract 3.0 instructions? In which case, I would be able to generate the
>> traineddata file (and the unicharset file, I think), but not the lstm file.
>> How can I generate this lstm file? Is there a tool I can use?
>>
>> Thanks again,
>> Dennis
>>
>> On Friday, April 13, 2018 at 5:19:47 AM UTC-7, shree wrote:
>>>
>>> Training Tesseract 4.0 from images is not officially supported. Different
>>> people have had success doing LSTM training with box/tiff pairs, but it
>>> requires hacks/programming on their part to create 4.0.0-compatible box
>>> files.
>>>
>>> tesstrain.sh creates box/tiff files in the /tmp directory; these are
>>> used to create the lstmf files for LSTM training. tesstrain.sh can create a
>>> 3.0x-compatible traineddata or a 4.0.0-compatible starter traineddata,
>>> depending on the options chosen. For 4.0.0, this starter traineddata
>>> along with the lstmf files is used for LSTM training.
>>>
>>> The format of traineddata files for 3.0x and 4.0.0 is different.
>>>
>>> For the different components of a traineddata file, see
>>>
>>>
>>> https://github.com/tesseract-ocr/tesseract/blob/master/doc/combine_tessdata.1.asc
>>>
>>> For creating 4.0 compatible box files see
>>>
>>>
>>> https://github.com/tesseract-ocr/langdata/issues/83#issuecomment-375247341
>>>
>>>
>>> https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#training-tesseract-lstm-engine
>>>
>>> Please note that all these are unsupported options.
>>>
>>>
>>> ShreeDevi
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Fri, Apr 13, 2018 at 12:09 PM, <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I read in a different post that training Tesseract 4.0 from images is
>>>> not supported, is this true? I have been able to successfully train
>>>> Tesseract 4.0 so far using font data. When using tesstrain.sh, the script
>>>> creates a number of files, including an lstmf file alongside the usual
>>>> traineddata file (and there are some others like unicharset). I was
>>>> wondering if it is possible to use the traineddata generation from image
>>>> and boxfile described in the Tesseract 3.0 training instructions to create
>>>> these training files to train Tesseract 4.0. The Tesseract 3.0 instructions
>>>> already produce a traineddata file; how can I generate the lstmf file (and
>>>> the others), if that is possible?
>>>>
>>>> Thank you,
>>>> Dennis
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/bc664de6-5386-45b3-ae4d-70ac5338938c%40googlegroups.com
>>>>
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/bc664de6-5386-45b3-ae4d-70ac5338938c%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>
>