Re: [tesseract-ocr] Help for training tesseract to recognize a new (dead) language

Ramast Magdy Wed, 30 May 2018 04:02:07 -0700

Perfect, that is really helpful.
Hope you are having an awesome day :)


On 05/30/2018 01:00 PM, ShreeDevi Kumar wrote:

See https://github.com/OCR-D/ocrd-train/issues/7

You can use the utilities listed there for creating line-level images from page images. Make matching ground truth text files, and train.
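A rough sketch of what "line-level images plus matching ground truth" can look like in practice. This is not one of the utilities from the linked issue; it is a plain horizontal projection profile over an already binarized page (1 = ink, 0 = background), and the function names are made up for illustration. The `.gt.txt` naming follows my understanding of the ocrd-train convention, so please check it against that project's README.

```python
from typing import List, Tuple


def find_text_lines(page: List[List[int]],
                    min_ink: int = 1,
                    min_height: int = 2) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges, bottom exclusive, that contain ink.

    `page` is a binarized page image as rows of 0/1 pixels.
    Rows with fewer than `min_ink` ink pixels count as blank, and
    bands shorter than `min_height` rows are dropped as noise.
    """
    bands = []
    start = None
    for y, row in enumerate(page):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a text band begins here
        elif not has_ink and start is not None:
            if y - start >= min_height:
                bands.append((start, y))   # band ended on the previous row
            start = None
    # Close a band that runs to the bottom edge of the page.
    if start is not None and len(page) - start >= min_height:
        bands.append((start, len(page)))
    return bands


def ground_truth_name(image_name: str) -> str:
    """Map 'page1_line003.tif' to 'page1_line003.gt.txt' (assumed naming)."""
    stem = image_name.rsplit(".", 1)[0]
    return stem + ".gt.txt"
```

Each returned band can then be cropped out of the page and saved as its own line image, with the transcription typed into the file that `ground_truth_name` gives. Real scans usually need binarization and deskewing before a projection profile works well.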


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, May 30, 2018 at 4:27 PM, Ramast Magdy <[email protected]<mailto:[email protected]>> wrote:


    1. collect utf-8 text in Coptic (DONE)
    2. Find Coptic unicode fonts, if you can find one similar to the
    typewriter font used in books it will make training easier
    I tried but couldn't find such a font. There are not that many
    Coptic fonts to begin with.
    Can't I just extract a few samples of each letter from the old books?

    3. train a model with these and then finetune it with line images
    and matching ground truth
    I think I got this one.
    After extracting sample letters, arrange them randomly into
    separate lines (one image for each line) and provide the text in a
    file with a matching name.

    That's a good idea but since I am trying to train for reading old
    books, how can I account for things like slight page tilt during
    scanning for example?
    Also, while at it, is there a tool I could use to split book pages
    into separate lines so that I can use them as part of training
    (along with their text, of course)?
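On the page tilt question, one common approach (an assumption on my part, not something prescribed in this thread) is to add slightly rotated copies of each line image to the training set, each paired with the same ground truth text. The sketch below rotates a binarized bitmap about its centre with nearest-neighbour sampling using only the standard library; the function names and the default angles are hypothetical, and the sign convention of the angle is arbitrary.

```python
import math
from typing import Dict, List, Sequence


def rotate_bitmap(img: List[List[int]], degrees: float) -> List[List[int]]:
    """Rotate a 0/1 bitmap about its centre, keeping the original size.

    Uses inverse mapping with nearest-neighbour sampling; pixels whose
    source falls outside the input become background (0).
    """
    h = len(img)
    w = len(img[0]) if h else 0
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    a = math.radians(degrees)
    cos_a, sin_a = math.cos(a), math.sin(a)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # For each output pixel, look up the source pixel it comes from.
            dx, dy = x - cx, y - cy
            sx = cos_a * dx + sin_a * dy + cx
            sy = -sin_a * dx + cos_a * dy + cy
            ix, iy = int(round(sx)), int(round(sy))
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = img[iy][ix]
    return out


def tilted_variants(img: List[List[int]],
                    angles: Sequence[float] = (-2.0, -1.0, 1.0, 2.0)
                    ) -> Dict[float, List[List[int]]]:
    """Return one rotated copy of `img` per angle (in degrees)."""
    return {angle: rotate_bitmap(img, angle) for angle in angles}
```

In a real pipeline an image library would do the rotation with proper interpolation; the point here is only that every tilted copy reuses the transcription of the original line.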



    On 05/30/2018 12:44 PM, ShreeDevi Kumar wrote:

    I am trying a test training for Coptic for tess4, will let you
    know where to access traineddata.

    You can train using UTF-8 text and Unicode Coptic fonts.

    1. collect utf-8 text in Coptic
    2. Find Coptic unicode fonts, if you can find one similar to the
    typewriter font used in books it will make training easier
    3. train a model with these and then finetune it with line images
    and matching ground truth
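For step 1, it may help to check how well the collected text covers the Coptic alphabet before training. The sketch below is only an illustration with made-up function names: it counts characters in the Unicode Coptic block (U+2C80 to U+2CFF) and the Coptic letters in the Greek and Coptic block (U+03E2 to U+03EF), and lists the ones that occur rarely. Combining marks such as supralinear strokes are not counted here.

```python
from collections import Counter
from typing import List

# Code point ranges that hold Coptic letters and symbols.
COPTIC_RANGES = ((0x03E2, 0x03EF), (0x2C80, 0x2CFF))


def is_coptic(ch: str) -> bool:
    """True if the single character `ch` lies in one of the Coptic ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in COPTIC_RANGES)


def coptic_char_counts(text: str) -> Counter:
    """Count how often each Coptic character occurs in `text`."""
    return Counter(ch for ch in text if is_coptic(ch))


def rare_chars(text: str, min_count: int = 5) -> List[str]:
    """Coptic characters that occur at least once but fewer than `min_count` times."""
    counts = coptic_char_counts(text)
    return sorted(ch for ch, n in counts.items() if n < min_count)
```

Characters reported by `rare_chars` are candidates for adding more sample text. Characters that never occur at all will not show up, so comparing the keys of `coptic_char_counts` against the expected alphabet is a useful second check.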


    ShreeDevi
    ____________________________________________________________
    भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

    On Wed, May 30, 2018 at 4:09 PM, Ramast Magdy
    <[email protected] <mailto:[email protected]>> wrote:

        Thank you ShreeDevi for both moheb's link and the one below.
        The current one uses Tesseract 3 and according to the author:
        "Recognition quality of Coptic texts containing old fonts
        will be very poor, depending on the trained data."

        I will get in contact with him to see if we can use the other
        link you provided,
        https://github.com/OCR-D/ocrd-train
        to train Tesseract 4.00.

        Thank you very much


        On 05/30/2018 06:31 AM, ShreeDevi Kumar wrote:

        See http://www.moheb.de/ocr.html

        It provides a traineddata file for Coptic for use with
        tesseract version 3.

        ShreeDevi
        ____________________________________________________________
        भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

        On Tue, May 29, 2018 at 9:57 PM, <[email protected]
        <mailto:[email protected]>> wrote:

            Hi,
            I belong to a group who study an old Egyptian writing
            system called "Coptic".
            It's based mostly on Greek (with some variation).

            The great majority of books written in Coptic were produced
            during the last century, mostly in the same [typewriter] font.
            Here is a sample picture:
            https://imgur.com/a/ILRw6vm
            And sample book:
            https://archive.org/download/pistissophiaopu00petegoog

            We need to add Coptic to the languages supported by
            Tesseract but are not sure how.
            I tried following this document
            https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
            but it's very difficult to understand.

            We need someone to help us with the initial setup so that
            we can dedicate our manpower to training the system.
            We are a non-profit group so we are hoping for free help,
            but we would also consider paid help since the
            alternative is hundreds of hours of manual labor to
            digitize just a few books.

            Thanks everyone for contributing to this awesome project

            --
            You received this message because you are subscribed to
            the Google Groups "tesseract-ocr" group.
            To unsubscribe from this group and stop receiving emails
            from it, send an email to [email protected].
            To post to this group, send email to [email protected].
            Visit this group at
            https://groups.google.com/group/tesseract-ocr.
            To view this discussion on the web visit
            https://groups.google.com/d/msgid/tesseract-ocr/08869d08-8b3a-4390-be79-fa811c78c0ca%40googlegroups.com.
            For more options, visit
            https://groups.google.com/d/optout.



