Re: [tesseract-ocr] Help for training tesseract to recognize a new (dead) language

ShreeDevi Kumar Wed, 30 May 2018 04:01:35 -0700

See https://github.com/OCR-D/ocrd-train/issues/7


You can use the utilities listed there to create line-level images from
page images. Then make matching ground truth text files and train.
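
As a rough illustration of the "matching ground truth" step, here is a
minimal Python sketch that lists line images without a transcription and
transcriptions without an image. It assumes one .tif image per line with a
.gt.txt transcription of the same base name in a single folder (the layout
ocrd-train appears to expect; check its README). The function name
find_unpaired and the default suffixes are made up for this example.

```python
from pathlib import Path


def find_unpaired(ground_truth_dir, image_suffix=".tif", text_suffix=".gt.txt"):
    """Return (images_without_text, texts_without_image) as sorted base names.

    Hypothetical helper: the suffixes are assumptions about the training
    layout, not something read from the training tool itself.
    """
    root = Path(ground_truth_dir)
    names = [p.name for p in root.iterdir() if p.is_file()]

    # Strip the suffix so "line1.tif" and "line1.gt.txt" both map to "line1".
    images = {n[: -len(image_suffix)] for n in names if n.endswith(image_suffix)}
    texts = {n[: -len(text_suffix)] for n in names if n.endswith(text_suffix)}

    return sorted(images - texts), sorted(texts - images)
```

Calling find_unpaired("path/to/line-images") (a placeholder path) returns
two lists. Both should be empty before training starts.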

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, May 30, 2018 at 4:27 PM, Ramast Magdy <[email protected]> wrote:

> 1. collect utf-8 text in Coptic (DONE)
> 2. Find Coptic unicode fonts, if you can find one similar to the
> typewriter font used in books it will make training easier
> I tried but couldn't find such a font. There are not that many Coptic fonts
> to begin with.
> Can't I just extract a few samples of each letter from the old books?
>
> 3. train a model with these and then finetune it with line images and
> matching ground truth
> I think I got this one.
> After extracting sample letters, arrange them randomly into separate lines
> (one image per line) and provide the text in a file with a similar name.
>
> That's a good idea but since I am trying to train for reading old books,
> how can I account for things like slight page tilt during scanning for
> example?
> Also, while at it, is there a tool I could use to split book pages into
> separate lines so that I can use them as part of training (along with their
> text, of course)?
>
>
>
> On 05/30/2018 12:44 PM, ShreeDevi Kumar wrote:
>
> I am trying a test training for Coptic for tess4, and will let you know
> where to access the traineddata.
>
> You can train using UTF-8 text and Unicode Coptic fonts.
>
> 1. collect utf-8 text in Coptic
> 2. Find Coptic unicode fonts, if you can find one similar to the
> typewriter font used in books it will make training easier
> 3. train a model with these and then finetune it with line images and
> matching ground truth
>
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Wed, May 30, 2018 at 4:09 PM, Ramast Magdy <[email protected]>
> wrote:
>
>> Thank you ShreeDevi for both moheb's link and the one below.
>> The current one uses Tesseract 3 and according to the author:
>> "Recognition quality of Coptic texts containing old fonts will be very
>> poor, depending on the trained data."
>>
>> I will get in contact with him to see if we can use the other link you
>> provided
>> https://github.com/OCR-D/ocrd-train
>> To train Tesseract 4.00
>>
>> Thank you very much
>>
>>
>> On 05/30/2018 06:31 AM, ShreeDevi Kumar wrote:
>>
>> See http://www.moheb.de/ocr.html
>>
>> It provides a traineddata file for Coptic for use with tesseract version
>> 3.
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Tue, May 29, 2018 at 9:57 PM, <[email protected]> wrote:
>>
>>> Hi,
>>> I belong to a group who study an old Egyptian writing system called
>>> "Coptic".
>>> It's based mostly on Greek (with some variation).
>>>
>>> The great majority of books written in Coptic date from the last century
>>> and mostly use the same [typewriter] font.
>>> Here is a sample picture:
>>> https://imgur.com/a/ILRw6vm
>>> And sample book:
>>> https://archive.org/download/pistissophiaopu00petegoog
>>>
>>> We need to add Coptic to the languages supported by Tesseract but are not
>>> sure how.
>>> I tried following this document:
>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>> but it's very difficult to understand.
>>>
>>> We need someone to help us with the initial setup so that we can dedicate
>>> our manpower to training the system.
>>> We are a non-profit group, so we are hoping for free help, but we would
>>> also consider paid help since the alternative is hundreds of hours of
>>> manual labor to digitize just a few books.
>>>
>>> Thanks everyone for contributing to this awesome project
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/08869d08-8b3a-4390-be79-fa811c78c0ca%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>
>>
>>
>
>
