Please see the project https://github.com/OCR-D/ocrd-train
It has support for training Tesseract if you provide line images and matching ground truth text.

On Tue, Jun 12, 2018 at 8:19 PM <jbca...@hotmail.com> wrote:

> Same question here. I see that the documentation on training Tesseract 4
> makes some reference to manuscripts:
>
> As with base Tesseract, there is a choice between rendering synthetic
> training data from fonts, or labeling some pre-existing images (like
> ancient manuscripts for example).
>
> So, if I understand correctly, there is no support yet for training with
> labelled pre-existing images? The concept of a font does not make sense
> with manuscripts, and what we can use in this case is just pairs of images
> and text (transcriptions).
>
> Best,
> Jean-Baptiste Camps
>
> On Monday, March 12, 2018 at 10:59:41 UTC+1, shree wrote:
>>
>> > I have an image and a text file with the line content for each line of
>> > manuscript text. The doc says what to do, but not how.
>>
>> > I first thought I'd need img/box files pairs, but it seems it was for
>> > Tesseract 3 and is now irrelevant...
>>
>> Tesseract 4.0.0beta.1 does not officially support LSTM training from
>> box/tif pairs.
>>
>> It uses box/tif pairs generated using the synthetic training data
>> generation pipeline using a training_text and set of fonts, for making the
>> lstmf files that are used by lstmtraining.
>>
>> langdata refers to the langdata repository under tesseract-ocr github
>> repo.
>> The files in it have not been updated for 4.0.0
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Mon, Mar 12, 2018 at 2:00 PM, ShreeDevi Kumar <shree...@gmail.com>
>> wrote:
>>
>>> Please try tesseract 4.0.0beta.1 with languages such as
>>>
>>> *enm* (English, Middle (1100-1500))
>>>
>>> and
>>>
>>> Fraktur script
>>>
>>> Also, look at the following project from a few years back
>>>
>>> http://emop.tamu.edu/outcomes/Franken-Plus
>>>
>>> ShreeDevi
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Mon, Mar 12, 2018 at 4:32 AM, Guillaume Desforges <ace...@gmail.com>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> I want to try using Tesseract 4 for old manuscript languages ("The Song
>>>> of Roland" and such).
>>>>
>>>> I have looked at
>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>>> but the steps are very unclear.
>>>>
>>>> I have an image and a text file with the line content for each line of
>>>> manuscript text. The doc says what to do, but not how.
>>>>
>>>> I first thought I'd need img/box files pairs, but it seems it was for
>>>> Tesseract 3 and is now irrelevant...
>>>>
>>>> So I guess my starting point is here:
>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#tutorial-guide-to-lstmtraining
>>>>
>>>>> There is no tool to create the lstm-recoder directly. Instead there is
>>>>> a new tool, combine_lang_model which takes as input an
>>>>> input_unicharset and script_dir (script_dir points to the langdata
>>>>> directory)
>>>>> and optional word list files. It creates the lstm-recoder from the
>>>>> input_unicharset and creates all the dawgs, if wordlists are
>>>>> provided, putting everything together into a traineddata file.
>>>>
>>>> I don't really get this part. How do I make input_unicharset? What
>>>> is langdata?
>>>>
>>>> Thanks
>>>>
>>>> Guillaume Desforges
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to tesseract-oc...@googlegroups.com.
>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/fe1d68a2-76ce-4005-98ea-672710365517%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
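
P.S. For the case discussed in this thread (line images plus a text file with one transcription per line), here is a minimal sketch of how the pairs could be prepared for ocrd-train. It rests on assumptions: as far as I recall from the project README, each line image should be a `.tif` with a transcription of the same base name ending in `.gt.txt`, placed under `data/ground-truth`, but please check the README before relying on that. The function name `write_ground_truth` and the file names are hypothetical, and cutting the page image into line images is not covered here.

```python
from pathlib import Path


def write_ground_truth(line_images, transcript_path, encoding="utf-8"):
    """Write one <name>.gt.txt file next to each line image.

    line_images: paths of the line images, in reading order.
    transcript_path: text file with one transcription per line, same order.
    Returns the list of ground truth files that were written.
    """
    raw = Path(transcript_path).read_text(encoding=encoding).splitlines()
    # Skip blank lines so that a trailing newline or a spacer line
    # in the transcript does not shift the pairing.
    lines = [ln.strip() for ln in raw if ln.strip()]
    images = [Path(p) for p in line_images]

    if len(images) != len(lines):
        raise ValueError(
            f"{len(images)} line images but {len(lines)} transcription lines"
        )

    written = []
    for image, text in zip(images, lines):
        # page_001.tif -> page_001.gt.txt (assumed naming convention)
        gt_path = image.with_name(image.stem + ".gt.txt")
        gt_path.write_text(text + "\n", encoding=encoding)
        written.append(gt_path)
    return written


if __name__ == "__main__":
    # Hypothetical layout: line images already cropped into data/ground-truth,
    # with the page transcription kept in roland_page1.txt.
    gt_dir = Path("data/ground-truth")
    transcript = Path("roland_page1.txt")
    if gt_dir.is_dir() and transcript.is_file():
        images = sorted(gt_dir.glob("roland_page1_*.tif"))
        for path in write_ground_truth(images, transcript):
            print("wrote", path)
```

The length check is deliberate: if the number of line images and transcription lines differ, it seems safer to stop than to train on misaligned pairs.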