See https://github.com/OCR-D/ocrd-train/issues/7
You can use the utilities listed there to create line-level images from page images. Make matching ground truth text files, and train.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, May 30, 2018 at 4:27 PM, Ramast Magdy wrote:

> 1. collect utf-8 text in Coptic (DONE)
>
> 2. Find Coptic unicode fonts, if you can find one similar to the
> typewriter font used in books it will make training easier
>
> I tried but couldn't find such a font. There are not that many Coptic fonts
> to begin with.
> Can't I just extract a few samples of each letter from the old books?
>
> 3. train a model with these and then finetune it with line images and
> matching ground truth
>
> I think I got this one.
> After extracting sample letters, arrange them randomly into separate lines
> (an image for each line) and provide the text in a file with a similar name.
>
> That's a good idea, but since I am trying to train for reading old books,
> how can I account for things like slight page tilt during scanning, for
> example?
> Also, while at it, is there a tool I could use to split book pages into
> separate lines so that I can give it as part of training (along with its
> text, of course)?
>
> On 05/30/2018 12:44 PM, ShreeDevi Kumar wrote:
>
> I am trying a test training for Coptic for tess4, will let you know where
> to access traineddata.
>
> You can train using utf-8 text and unicode Coptic fonts.
>
> 1. collect utf-8 text in Coptic
> 2. Find Coptic unicode fonts, if you can find one similar to the
> typewriter font used in books it will make training easier
> 3. train a model with these and then finetune it with line images and
> matching ground truth
>
> ShreeDevi
>
> On Wed, May 30, 2018 at 4:09 PM, Ramast Magdy wrote:
>
>> Thank you ShreeDevi for both moheb's link and the one below.
>> The current one uses Tesseract 3 and according to the author:
>> "Recognition quality of Coptic texts containing old fonts will be very
>> poor, depending on the trained data."
>>
>> I will get in contact with him to see if we can use the other link you
>> provided
>> https://github.com/OCR-D/ocrd-train
>> to train Tesseract 4.00
>>
>> Thank you very much
>>
>> On 05/30/2018 06:31 AM, ShreeDevi Kumar wrote:
>>
>> See http://www.moheb.de/ocr.html
>>
>> It provides a traineddata file for Coptic for use with tesseract version
>> 3.
>>
>> ShreeDevi
>>
>> On Tue, May 29, 2018 at 9:57 PM, <[email protected]> wrote:
>>
>>> Hi,
>>> I belong to a group who study an old Egyptian writing system called
>>> "Coptic".
>>> It's based mostly on Greek (with some variation).
>>>
>>> The vast majority of books in Coptic were written during the last century
>>> and mostly used the same [typewriter] font.
>>> Here is a sample picture:
>>> https://imgur.com/a/ILRw6vm
>>> And sample book:
>>> https://archive.org/download/pistissophiaopu00petegoog
>>>
>>> We need to add Coptic to the languages supported by Tesseract but are not
>>> sure how.
>>> I tried following this document
>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>> but it's very difficult to understand.
>>>
>>> We need someone to help us with the initial setup so that we can dedicate
>>> our manpower to training the system.
>>> We are a non-profit group, so we are hoping for free help, but we would
>>> also consider paid help since the alternative is hundreds of hours of
>>> manual labor to digitize just a few books.
>>>
>>> Thanks everyone for contributing to this awesome project

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduXpsfKeUuPjG1Pk-UpKUt6N-793vq2WucZkSXgOvHvoTw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
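
A rough sketch of the "make matching ground truth text files" step from the top of this thread, in case it helps. It assumes the naming convention I understand ocrd-train to expect (each line image NAME.tif has a single-line transcription NAME.gt.txt in the same folder). The function name check_ground_truth and the report keys are made up for illustration; nothing here is part of ocrd-train or Tesseract.

```python
from pathlib import Path


def check_ground_truth(gt_dir):
    """Report line images and transcriptions that lack a partner file.

    Assumption: files follow the NAME.tif / NAME.gt.txt pairing.
    This helper is hypothetical and not part of ocrd-train.
    """
    gt_dir = Path(gt_dir)
    # Base names of all line images and all transcriptions in the folder.
    images = {p.name[:-len(".tif")] for p in gt_dir.glob("*.tif")}
    texts = {p.name[:-len(".gt.txt")] for p in gt_dir.glob("*.gt.txt")}

    # A transcription should hold exactly one line of UTF-8 text.
    not_single_line = []
    for stem in sorted(texts):
        content = (gt_dir / f"{stem}.gt.txt").read_text(encoding="utf-8")
        if len(content.strip().splitlines()) != 1:
            not_single_line.append(stem)

    return {
        "missing_text": sorted(images - texts),    # image without transcription
        "missing_image": sorted(texts - images),   # transcription without image
        "not_single_line": not_single_line,        # empty or multi-line text
    }
```

Running check_ground_truth("data/ground-truth") before training (the folder name is also an assumption) should return three empty lists if every line image has exactly one single-line transcription.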

