Re: [tesseract-ocr] Help for training tesseract to recognize a new (dead) language

2018-05-30 Thread ShreeDevi Kumar
See https://github.com/OCR-D/ocrd-train/issues/7

You can use the utilities listed there to create line-level images from
page images. Make matching ground truth text files, and train.
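
A minimal sketch of the "matching ground truth" check, assuming each line
image sits next to a text file with the same base name. The `.tif` and
`.gt.txt` extensions, the folder layout, and the function name
`find_unmatched_lines` are assumptions for illustration; adjust them to
whatever your training setup expects.

```python
from pathlib import Path


def find_unmatched_lines(folder, image_ext=".tif", text_ext=".gt.txt"):
    """Return line images without a ground-truth file, and vice versa.

    The extensions are assumptions; change them to match your setup.
    """
    folder = Path(folder)
    # Base names, e.g. "page1_line3.tif" -> "page1_line3"
    images = {p.name[: -len(image_ext)] for p in folder.glob(f"*{image_ext}")}
    texts = {p.name[: -len(text_ext)] for p in folder.glob(f"*{text_ext}")}
    missing_text = sorted(images - texts)   # images with no transcription
    missing_image = sorted(texts - images)  # transcriptions with no image
    return missing_text, missing_image
```

Running this before training makes it easy to spot line images that were cut
out but never transcribed.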

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, May 30, 2018 at 4:27 PM, Ramast Magdy wrote:

>> 1. collect utf-8 text in Coptic
> (DONE)
>> 2. Find Coptic unicode fonts, if you can find one similar to the
>> typewriter font used in books it will make training easier
> I tried but couldn't find such a font. There are not that many Coptic fonts
> to begin with.
> Can't I just extract a few samples of each letter from the old books?
>
>> 3. train a model with these and then finetune it with line images and
>> matching ground truth
> I think I got this one.
> After extracting sample letters, arrange them randomly into separate lines
> (one image per line) and provide the text in a file with a similar name.
>
> That's a good idea, but since I am trying to train for reading old books,
> how can I account for things like slight page tilt during scanning, for
> example?
> Also, while at it, is there a tool I could use to split book pages into
> separate lines so that I can use them as part of training (along with their
> text, of course)?
>
>
>
> On 05/30/2018 12:44 PM, ShreeDevi Kumar wrote:
>
> I am trying a test training for coptic for tess4, will let you know where
> to access traineddata.
>
> You can train using UTF-8 text and Unicode Coptic fonts.
>
> 1. collect utf-8 text in Coptic
> 2. Find Coptic unicode fonts, if you can find one similar to the
> typewriter font used in books it will make training easier
> 3. train a model with these and then finetune it with line images and
> matching ground truth
>
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Wed, May 30, 2018 at 4:09 PM, Ramast Magdy 
> wrote:
>
>> Thank you ShreeDevi for both moheb's link and the one below.
>> The current one uses Tesseract 3 and according to the author:
>> "Recognition quality of Coptic texts containing old fonts will be very
>> poor, depending on the trained data."
>>
>> I will get in contact with him to see if we can use the other link you
>> provided
>> https://github.com/OCR-D/ocrd-train
>> To train Tesseract 4.00
>>
>> Thank you very much
>>
>>
>> On 05/30/2018 06:31 AM, ShreeDevi Kumar wrote:
>>
>> See http://www.moheb.de/ocr.html
>>
>> It provides a traineddata file for Coptic for use with tesseract version
>> 3.
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Tue, May 29, 2018 at 9:57 PM,  wrote:
>>
>>> Hi,
>>> I belong to a group who study an old Egyptian writing system called
>>> "Coptic".
>>> It's based mostly on Greek (with some variation).
>>>
>>> The great majority of books in Coptic were written during the last
>>> century, mostly in the same [typewriter] font.
>>> Here is a sample picture:
>>> https://imgur.com/a/ILRw6vm
>>> And sample book:
>>> https://archive.org/download/pistissophiaopu00petegoog
>>>
>>> We need to add Coptic to the languages supported by Tesseract but are
>>> not sure how.
>>> I tried following this document
>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>> but it's very difficult to understand.
>>>
>>> We need someone to help us with the initial setup so that we can dedicate
>>> our manpower to training the system.
>>> We are a nonprofit group, so we are hoping for free help, but we would
>>> also consider paid help, since the alternative is hundreds of hours of
>>> manual labor to digitize just a few books.
>>>
>>> Thanks everyone for contributing to this awesome project
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/08869d08-8b3a-4390-be79-fa811c78c0ca%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>

Re: [tesseract-ocr] Help for training tesseract to recognize a new (dead) language

2018-05-30 Thread Ramast Magdy

> 1. collect utf-8 text in Coptic

(DONE)

> 2. Find Coptic unicode fonts, if you can find one similar to the
> typewriter font used in books it will make training easier

I tried but couldn't find such a font. There are not that many Coptic
fonts to begin with.

Can't I just extract a few samples of each letter from the old books?

> 3. train a model with these and then finetune it with line images and
> matching ground truth

I think I got this one.
After extracting sample letters, arrange them randomly into separate
lines (one image per line) and provide the text in a file with a similar
name.


That's a good idea but since I am trying to train for reading old books, 
how can I account for things like slight page tilt during scanning for 
example?
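
One common way to deal with tilt (not something proposed in this thread) is
to estimate the skew angle of each scan and rotate the page back before
cutting it into lines. The sketch below is a toy version of that idea using
a projection profile on a binary image given as a list of rows (1 = ink).
The function name `estimate_skew` and its parameters are hypothetical, and
loading a real scan into this form is left out.

```python
import math


def estimate_skew(rows, max_deg=3.0, step_deg=0.25):
    """Estimate page tilt in degrees from a binary image (1 = ink).

    Positive angles mean text lines drift downward toward the right.
    For each candidate angle the ink pixels are sheared vertically and the
    row histogram is scored; the histogram is sharpest when the shear
    cancels the tilt.
    """
    height = len(rows)
    ink = [(x, y) for y, row in enumerate(rows) for x, v in enumerate(row) if v]
    steps = int(round(max_deg / step_deg))
    best_angle, best_score = 0.0, -1
    # Try small angles first so that ties favor the smallest correction.
    for i in sorted(range(-steps, steps + 1), key=abs):
        angle = i * step_deg
        slope = math.tan(math.radians(angle))
        hist = [0] * height
        for x, y in ink:
            shifted = int(round(y - x * slope))
            if 0 <= shifted < height:
                hist[shifted] += 1
        score = sum(c * c for c in hist)  # peaky histogram -> high score
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

The search range and step are arbitrary choices here; a real scan would
need binarization first and probably a finer step.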
Also, while at it, is there a tool I could use to split book pages into
separate lines so that I can use them as part of training (along with
their text, of course)?
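
To illustrate what splitting a page into lines involves, here is a toy
sketch based on a horizontal projection profile: rows that contain ink are
grouped into runs, and each run is treated as one text line. It assumes a
deskewed, binarized page given as a list of rows (1 = ink); the function
name `split_lines` and its thresholds are made up for this example, and it
is not a substitute for a real segmentation tool.

```python
def split_lines(rows, min_ink=1, min_height=2):
    """Return (top, bottom) row ranges of text lines in a binary page image.

    rows: list of pixel rows, 1 = ink, 0 = background.
    A line is a run of consecutive rows that each contain at least `min_ink`
    ink pixels; runs shorter than `min_height` rows are treated as noise.
    """
    lines = []
    start = None
    for y, row in enumerate(rows):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a line begins
        elif not has_ink and start is not None:
            if y - start >= min_height:
                lines.append((start, y))   # bottom index is exclusive
            start = None
    if start is not None and len(rows) - start >= min_height:
        lines.append((start, len(rows)))   # line touching the bottom edge
    return lines
```

Each returned range can then be cropped from the page image and saved as
one line image alongside its transcription.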




Re: [tesseract-ocr] Help for training tesseract to recognize a new (dead) language

2018-05-30 Thread ShreeDevi Kumar
> The current one uses Tesseract 3

Tesseract 3.0x uses different traineddata formats depending on the
version (3.02 vs. 3.04, etc.).

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com




Re: [tesseract-ocr] Help for training tesseract to recognize a new (dead) language

2018-05-30 Thread ShreeDevi Kumar
I am trying a test training for Coptic for tess4 and will let you know where
to access the traineddata.

You can train using UTF-8 text and Unicode Coptic fonts.

1. collect utf-8 text in Coptic
2. Find Coptic unicode fonts, if you can find one similar to the typewriter
font used in books it will make training easier
3. train a model with these and then finetune it with line images and
matching ground truth
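
As a small aid for step 1, the sketch below counts which characters in the
collected text fall inside the Unicode ranges for Coptic letters (the Coptic
block U+2C80..U+2CFF and the Coptic letters U+03E2..U+03EF in the Greek and
Coptic block) and which do not. The function names are hypothetical and the
check is only a sanity test, not part of any Tesseract tooling.

```python
from collections import Counter

# Coptic letters live in two Unicode ranges: the Coptic block and the
# Coptic letters inside the "Greek and Coptic" block.
COPTIC_RANGES = ((0x2C80, 0x2CFF), (0x03E2, 0x03EF))


def is_coptic(ch):
    """True if the character is in one of the Coptic letter ranges."""
    return any(lo <= ord(ch) <= hi for lo, hi in COPTIC_RANGES)


def character_report(text):
    """Count Coptic characters and every other non-whitespace character."""
    coptic = Counter(ch for ch in text if is_coptic(ch))
    # Punctuation, digits, Latin/Greek letters and combining marks land here.
    other = Counter(
        ch for ch in text if not is_coptic(ch) and not ch.isspace()
    )
    return coptic, other
```

A large `other` count may mean the text is not actually encoded as Unicode
Coptic, which would be worth fixing before generating training data from it.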


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com




Re: [tesseract-ocr] Help for training tesseract to recognize a new (dead) language

2018-05-30 Thread Ramast Magdy

Thank you ShreeDevi for both moheb's link and the one below.
The current one uses Tesseract 3 and according to the author:
"Recognition quality of Coptic texts containing old fonts will be very 
poor, depending on the trained data."


I will get in contact with him to see if we can use the other link you
provided, https://github.com/OCR-D/ocrd-train, to train Tesseract 4.00.

Thank you very much





[tesseract-ocr] Re: German "Straße" is often "StraBe" (tesseract 4.0)

2018-05-30 Thread Thomas Güttler
I found the root of the problem: the correct option is "-l deu", but I used
"-l=deu".

See https://stackoverflow.com/a/50601305/633961

And I created an issue: 
https://github.com/tesseract-ocr/tesseract/issues/1616
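
For anyone calling tesseract from a script, a minimal sketch of passing the
language correctly: the option and its value are two separate arguments.
The helper names and the file name "scan.png" are made up for this example,
and `run_ocr` assumes a `tesseract` binary with the `deu` traineddata is
installed and on the PATH.

```python
import subprocess


def build_tesseract_cmd(image_path, lang="deu"):
    """Build the argument list for the tesseract command line.

    The language option and its value are separate arguments
    ("-l", "deu"), not a single "-l=deu".
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


def run_ocr(image_path, lang="deu"):
    """Run tesseract and return the recognized text as a string."""
    result = subprocess.run(
        build_tesseract_cmd(image_path, lang),
        capture_output=True,
        encoding="utf-8",  # so that characters such as "ß" decode correctly
        check=True,
    )
    return result.stdout
```

For example, `build_tesseract_cmd("scan.png")` yields the equivalent of
`tesseract scan.png stdout -l deu`.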


On Thursday, 24 May 2018 13:35:55 UTC+2, Thomas Güttler wrote:
>
> I use tesseract 4.0 via docker (tesseractshadow/tesseract4re)
>
> Very often tesseract detects "StraBe" instead of "Straße".
>
> Yes, I use -l=deu
>
> The word "Straße" is very common in German. It means "street".
>
> Since "StraBe" makes no sense I would like to improve this.
>
> What do you suggest?
>
>
>



[tesseract-ocr] how make a .traineddata of combination of two language (arabic+english)

2018-05-30 Thread nick
Hi,

Is it possible to make a new .traineddata that supports two languages
(Arabic + English)? If so, how?

Thanks
