Re: [tesseract-ocr] Tesseract 4 for old languages

ShreeDevi Kumar Tue, 12 Jun 2018 08:34:46 -0700

Please see the project https://github.com/OCR-D/ocrd-train


It supports training Tesseract from line images and matching ground
truth text.
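
For the situation described in the quoted messages (lines already cut into
images, plus one text file with a transcription per line), preparing the
image/text pairs could be sketched roughly as follows. This is only an
illustration under assumptions: it assumes the tool wants each NAME.tif line
image to sit next to a single-line NAME.gt.txt transcription (please check
the ocrd-train README for the exact folder and naming rules), and the function
name write_ground_truth and the sorted-filename pairing are hypothetical, not
part of ocrd-train.

```python
from pathlib import Path


def write_ground_truth(image_dir, transcription_file):
    """Write one NAME.gt.txt file next to each NAME.tif line image.

    Assumption: sorting the image names gives the same order as the
    lines in transcription_file (e.g. line_0001.tif, line_0002.tif).
    """
    image_dir = Path(image_dir)
    images = sorted(image_dir.glob("*.tif"))
    text = Path(transcription_file).read_text(encoding="utf-8")
    # Drop blank lines so empty or trailing lines do not count as text lines.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    # Refuse to guess when the counts disagree; a silent misalignment
    # would pair images with the wrong transcriptions.
    if len(images) != len(lines):
        raise ValueError(
            f"{len(images)} images but {len(lines)} transcription lines"
        )

    written = []
    for image, line in zip(images, lines):
        # line_0001.tif -> line_0001.gt.txt (hypothetical example names)
        gt_path = image.with_name(image.stem + ".gt.txt")
        gt_path.write_text(line + "\n", encoding="utf-8")
        written.append(gt_path)
    return written
```

Calling write_ground_truth("lines", "page.txt") (both names are placeholders)
would then leave one .gt.txt file beside each line image. Stopping on a count
mismatch is deliberate: it is easier to fix the transcription file than to
find shifted pairs after training.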


On Tue, Jun 12, 2018 at 8:19 PM <[email protected]> wrote:

> Same question here. I see that the documentation on training Tesseract 4
> makes some reference to manuscripts:
>
>     As with base Tesseract, there is a choice between rendering synthetic
> training data from fonts, or labeling some pre-existing images (like
> ancient manuscripts for example).
>
> So, if I understand correctly, there is no support yet for training with
> labelled pre-existing images? The concept of a font does not make sense
> for manuscripts, and all we can use in this case is pairs of images
> and text (transcriptions).
>
> Best,
> Jean-Baptiste Camps
>
> On Monday, 12 March 2018 at 10:59:41 UTC+1, shree wrote:
>>
>> >I have an image and a text file with the line content for each line of
>> manuscript text. The doc says what to do, but not how.
>>
>> >I first thought I'd need image/box file pairs, but it seems that was for
>> Tesseract 3 and is now irrelevant...
>>
>> Tesseract 4.0.0-beta.1 does not officially support LSTM training from
>> box/tif pairs.
>>
>> Instead, it uses box/tif pairs generated by the synthetic training data
>> pipeline, which renders a training_text in a set of fonts, to make the
>> lstmf files that lstmtraining consumes.
>>
>> langdata refers to the langdata repository under the tesseract-ocr
>> organization on GitHub. The files in it have not been updated for 4.0.0.
>>
>>
>>
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Mon, Mar 12, 2018 at 2:00 PM, ShreeDevi Kumar <[email protected]>
>> wrote:
>>
>>> Please try Tesseract 4.0.0-beta.1 with languages such as
>>>
>>> *enm* (English, Middle (1100-1500))
>>>
>>> and
>>>
>>> the Fraktur script.
>>>
>>> Also, look at the following project from a few years back
>>>
>>> http://emop.tamu.edu/outcomes/Franken-Plus
>>>
>>> ShreeDevi
>>>
>>> On Mon, Mar 12, 2018 at 4:32 AM, Guillaume Desforges <[email protected]>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> I want to try using Tesseract 4 for old manuscript languages ("The Song
>>>> of Roland" and such).
>>>>
>>>> I have looked at
>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>>> but the steps are very unclear.
>>>>
>>>> I have an image and a text file with the line content for each line of
>>>> manuscript text. The doc says what to do, but not how.
>>>>
>>>> I first thought I'd need image/box file pairs, but it seems that was for
>>>> Tesseract 3 and is now irrelevant...
>>>>
>>>> So I guess my starting point is here:
>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#tutorial-guide-to-lstmtraining
>>>>
>>>>> There is no tool to create the lstm-recoder directly. Instead there is
>>>>> a new tool, combine_lang_model, which takes as input an
>>>>> input_unicharset and script_dir (script_dir points to the langdata
>>>>> directory) and optional word list files. It creates the lstm-recoder
>>>>> from the input_unicharset and creates all the dawgs, if wordlists are
>>>>> provided, putting everything together into a traineddata file.
>>>>
>>>>
>>>> I don't really get this part. How do I make input_unicharset? What
>>>> is langdata?
>>>>
>>>> Thanks
>>>>
>>>> Guillaume Desforges
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/fe1d68a2-76ce-4005-98ea-672710365517%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/fe1d68a2-76ce-4005-98ea-672710365517%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>>

