Please also see http://doc-creator.labri.fr/
which makes it easy to create synthetic data similar to manuscript pages.

On Tue, Jun 12, 2018 at 9:03 PM ShreeDevi Kumar <[email protected]> wrote:

> Please see the project https://github.com/OCR-D/ocrd-train
>
> It has support for training tesseract if you provide line images and
> matching ground truth text.
>
> On Tue, Jun 12, 2018 at 8:19 PM <[email protected]> wrote:
>
>> Same question here. I see that the documentation on training Tesseract 4
>> makes some reference to manuscripts:
>>
>>> As with base Tesseract, there is a choice between rendering synthetic
>>> training data from fonts, or labeling some pre-existing images (like
>>> ancient manuscripts for example).
>>
>> So, if I understand correctly, there is no support yet for training with
>> labelled pre-existing images? The concept of a font does not make sense
>> for manuscripts, and what we can use in this case is just pairs of
>> images and text (transcriptions).
>>
>> Best,
>> Jean-Baptiste Camps
>>
>> On Monday, March 12, 2018 at 10:59:41 UTC+1, shree wrote:
>>>
>>> > I have an image and a text file with the line content for each line
>>> > of manuscript text. The doc says what to do, but not how.
>>>
>>> > I first thought I'd need img/box file pairs, but it seems that was
>>> > for Tesseract 3 and is now irrelevant...
>>>
>>> Tesseract 4.0.0beta.1 does not officially support LSTM training from
>>> box/tif pairs.
>>>
>>> It uses box/tif pairs generated by the synthetic training data
>>> generation pipeline from a training_text and a set of fonts to make
>>> the lstmf files that are used by lstmtraining.
>>>
>>> langdata refers to the langdata repository under the tesseract-ocr
>>> GitHub organization.
>>> The files in it have not been updated for 4.0.0.
>>>
>>> ShreeDevi
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Mon, Mar 12, 2018 at 2:00 PM, ShreeDevi Kumar <[email protected]> wrote:
>>>
>>>> Please try tesseract 4.0.0beta.1 with languages such as
>>>>
>>>> *enm* (English, Middle (1100-1500))
>>>>
>>>> and
>>>>
>>>> Fraktur script.
>>>>
>>>> Also, look at the following project from a few years back:
>>>>
>>>> http://emop.tamu.edu/outcomes/Franken-Plus
>>>>
>>>> ShreeDevi
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>> On Mon, Mar 12, 2018 at 4:32 AM, Guillaume Desforges <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I want to try using Tesseract 4 for old manuscript languages ("The
>>>>> Song of Roland" and such).
>>>>>
>>>>> I have looked at
>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>>>> but the steps are very unclear.
>>>>>
>>>>> I have an image and a text file with the line content for each line
>>>>> of manuscript text. The doc says what to do, but not how.
>>>>>
>>>>> I first thought I'd need img/box file pairs, but it seems that was
>>>>> for Tesseract 3 and is now irrelevant...
>>>>>
>>>>> So I guess my starting point is here:
>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#tutorial-guide-to-lstmtraining
>>>>>
>>>>>> There is no tool to create the lstm-recoder directly. Instead there
>>>>>> is a new tool, combine_lang_model, which takes as input an
>>>>>> input_unicharset and script_dir (script_dir points to the langdata
>>>>>> directory) and optional word list files. It creates the lstm-recoder
>>>>>> from the input_unicharset and creates all the dawgs, if wordlists
>>>>>> are provided, putting everything together into a traineddata file.
>>>>>
>>>>> I don't really get this part.
>>>>> How do I make input_unicharset? What is langdata?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Guillaume Desforges
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/fe1d68a2-76ce-4005-98ea-672710365517%40googlegroups.com.
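To sketch what the quoted wiki passage is asking for: input_unicharset is essentially the inventory of characters that occur in your ground-truth text (the TrainingTesseract-4.00 wiki's unicharset_extractor tool builds it from that text), and langdata is a checkout of the tesseract-ocr/langdata repository mentioned above. A rough Python illustration of where that character inventory comes from; the real tool also records per-character properties, and the ground-truth lines below are hypothetical:

```python
# Sketch: collect the set of characters appearing in ground-truth
# transcriptions. The real unicharset_extractor tool records extra
# per-character properties; this only shows where the inventory comes from.

def collect_charset(gt_lines):
    """Return the sorted set of non-space characters in the transcriptions."""
    chars = set()
    for line in gt_lines:
        chars.update(ch for ch in line if not ch.isspace())
    return sorted(chars)

if __name__ == "__main__":
    # Hypothetical ground-truth lines for two manuscript line images.
    gt = ["Carles li reis, nostre emperere magnes,",
          "set anz tuz pleins ad estet en Espaigne,"]
    print(collect_charset(gt))
```

The resulting unicharset, together with the langdata directory, is then what the quoted passage feeds to combine_lang_model to produce a starter traineddata file.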
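The ocrd-train approach mentioned at the top pairs each single-line image with a matching ground-truth text file and generates a per-line box file from the pair. A minimal sketch, assuming the WordStr box convention such line-based pipelines use; the exact format should be verified against the project itself, and the file names and dimensions here are hypothetical:

```python
# Sketch: emit a WordStr-style box file for one text-line image, as
# line-based LSTM training pipelines (e.g. ocrd-train) do. The whole line
# is one "WordStr" record covering the full image, followed by a tab
# record that marks the end of the line.

def wordstr_box(text, width, height):
    """Return box-file content for a single line image of the given size."""
    return ("WordStr 0 0 %d %d 0 #%s\n" % (width, height, text)
            + "\t 0 0 %d %d 0\n" % (width, height))

if __name__ == "__main__":
    # Hypothetical pair: line_0001.tif (1200x64 px) and its transcription
    # from line_0001.gt.txt.
    print(wordstr_box("set anz tuz pleins ad estet en Espaigne", 1200, 64))
```

With such tif/box pairs in place, the usual pipeline compiles them into lstmf files that lstmtraining consumes, as described in the thread above.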

