>I have an image and a text file with the line content for each line of manuscript text. The doc says what to do, but not how.
>I first thought I'd need img/box file pairs, but it seems that was for Tesseract 3 and is now irrelevant...

Tesseract 4.0.0beta.1 does not officially support LSTM training from user-supplied box/tif pairs. Instead, it uses box/tif pairs generated by the synthetic training data pipeline, from a training_text and a set of fonts, to make the lstmf files that are used by lstmtraining.

langdata refers to the langdata repository under the tesseract-ocr GitHub organization. The files in it have not been updated for 4.0.0.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Mar 12, 2018 at 2:00 PM, ShreeDevi Kumar <shreesh...@gmail.com> wrote:

> Please try tesseract 4.0.0beta.1 with languages such as
>
> *enm* (English, Middle (1100-1500))
>
> and
>
> Fraktur script
>
> Also, look at the following project from a few years back
>
> http://emop.tamu.edu/outcomes/Franken-Plus
>
> ShreeDevi
>
> On Mon, Mar 12, 2018 at 4:32 AM, Guillaume Desforges <aceu...@gmail.com> wrote:
>
>> Hi
>>
>> I want to try using Tesseract 4 for old manuscript languages ("The Song
>> of Roland" and such).
>>
>> I have looked at
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>> but the steps are very unclear.
>>
>> I have an image and a text file with the line content for each line of
>> manuscript text. The doc says what to do, but not how.
>>
>> I first thought I'd need img/box files pairs, but it seems it was for
>> Tesseract 3 and is now irrelevant...
>>
>> So I guess my starting point is here:
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#tutorial-guide-to-lstmtraining
>>
>>> There is no tool to create the lstm-recoder directly. Instead there is a
>>> new tool, combine_lang_model which takes as input an input_unicharset
>>> and script_dir (script_dir points to the langdata directory) and
>>> optional word list files. It creates the lstm-recoder from the
>>> input_unicharset and creates all the dawgs, if wordlists are provided,
>>> putting everything together into a traineddata file.
>>
>> I don't really get this part. How do I make input_unicharset? What is
>> langdata?
>>
>> Thanks
>>
>> Guillaume Desforges

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWm7Oa%3DBzQq1XF%3DR%3DoFrhTrz0qroHp6001Zax00uMTg2g%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
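
On the "How do I make input_unicharset?" question: the input_unicharset handed to combine_lang_model is, roughly, the set of characters the model should be able to output. The real file has its own format and is produced by Tesseract's own tooling, so the following is only a minimal sketch, assuming you just want to see which characters occur in your line transcriptions before going further. The function name and the sample lines are hypothetical, and the output is a plain character count, not a unicharset file.

```python
import unicodedata
from collections import Counter


def collect_characters(lines):
    """Count each distinct non-whitespace character in the transcription lines.

    This is NOT the Tesseract unicharset format; it only lists the
    characters a unicharset for this text would need to cover.
    """
    counts = Counter()
    for line in lines:
        # Normalize to NFC so a letter plus combining mark is counted
        # in its composed form where one exists.
        normalized = unicodedata.normalize("NFC", line)
        counts.update(ch for ch in normalized if not ch.isspace())
    return counts


# Illustrative sample lines standing in for a manuscript transcription file.
sample_lines = [
    "Carles li reis, nostre emperere magnes,",
    "Set anz tuz pleins ad estet en Espaigne:",
]

char_counts = collect_characters(sample_lines)
for ch, n in sorted(char_counts.items()):
    print(repr(ch), n)
```

Characters that turn up here but are rare or unexpected (scribal abbreviations, unusual punctuation) are worth checking in the transcription before any training data is generated.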