Please share a couple of scanned pages for testing. You may be able to use the existing traineddata files: for English and Russian with -l eng+rus, or for English and Hindi with -l eng+hin.
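As a rough sketch of the suggestion above, the language combinations can be passed to the tesseract command line from a small Python helper. The helper names (build_tesseract_cmd, run_ocr) and the file names are made up for illustration, and it assumes the tesseract executable is on PATH with the eng, rus and hin traineddata files installed.

```python
import shutil
import subprocess


def build_tesseract_cmd(image_path, output_base, languages):
    """Build a tesseract command line for one page (hypothetical helper)."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Tesseract accepts several language models joined with '+', e.g. eng+rus.
    lang_spec = "+".join(languages)
    return ["tesseract", image_path, output_base, "-l", lang_spec]


def run_ocr(image_path, output_base, languages):
    """Run tesseract on one page and return the path of the text output."""
    # Assumption: tesseract is installed and reachable through PATH.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_cmd(image_path, output_base, languages), check=True)
    # By default tesseract writes plain text to <output_base>.txt.
    return output_base + ".txt"


if __name__ == "__main__":
    # Hypothetical file names; replace with your own scanned pages.
    print(" ".join(build_tesseract_cmd("page_en_ru.png", "page_en_ru", ["eng", "rus"])))
    print(" ".join(build_tesseract_cmd("page_en_hi.png", "page_en_hi", ["eng", "hin"])))
```

Running the script only prints the two command lines; calling run_ocr with the same arguments would actually invoke tesseract on the page.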
For text with diacritics you can try -l script/Latin. This will give you an idea of the current state. You can plan training after that.

On Wed, 20 Feb 2019, 10:20 Alexander Gribanov <[email protected]> wrote:

> Hello!
>
> Just found tesseract and it seems a very great and powerful instrument,
> but as we say in Russia, equipment in the hands of a fool is scrap metal...
>
> So please, if somebody would be kind enough to help me with step-by-step advice:
> 1. What to do
> 2. What to read/watch
> 3. Take a look at the result and give me a hint where to go next
>
> My subject actually is that I have a lot of scanned (and many not yet scanned) books in mixed languages,
> like English, Russian, Hindi, Bengali, sometimes with diacritic symbols, etc...
> For most of them, I have no idea whether the fonts they were printed with are available...
>
> But I'm ready to select some letters, words, etc. on the image the first time,
> then tell the program which letter from the image corresponds to which Unicode char
> (not sure what it is correctly called).
> So this way it may be possible to create the missing fonts.
>
> So as I understood, training the neural network is a kind of spiral process:
> 1. We have an image
> 2. We tell the network which part of the image is a symbol and what that symbol is (character code).
> This becomes training material
> 3. The network, based on the first small experience (let's say 1 page), tries to recognize the 2nd page
> 4. We verify and correct if needed. It becomes more training material
>
> And so on, so steps 3-4 repeat until the whole book is recognized.
> Sometimes step 2 will be invoked for new characters or patterns, etc.
>
> So I think this should be enough to understand my level on the subject and my goal,
> so I request, please, if anybody would like to help me to establish the process
> to recognize many rare books, to be able to search and navigate among
> tons of scriptures which will be lost and buried by time...
>
> Thank You all very much,
> best regards, Alexander
>
> --
> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/f4d5673a-31f4-4c2b-91f2-6cb843943a41%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

--
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWesTyi9j94c%2BSjEB3ssS83VMp9oFame%3DP7Kp_930s-ZA%40mail.gmail.com.

