Re: [tesseract-ocr] Advice on training for Old Amharic texts

Menelik Berhan Sun, 14 Jan 2024 05:06:38 -0800

Yes, I'm in Addis.
My PC is not that powerful either, but I could find a couple of good
desktop PCs for the training.


It would be my pleasure to meet in person. I have some questions about the
training process that I'll ask when we meet.

I'm free almost all day after 10 a.m. EAT (ketewatu arat seat, i.e. four
o'clock in the morning by local Ethiopian time).
Let me know a time and place that is convenient for you.

Thanks

On Sun, Jan 14, 2024 at 3:22 PM Dellu Bw <[email protected]> wrote:

> Hi Menelik, are you in Addis?
> I have figured out most of the workings of Tesseract. I got really stuck
> because of the power outages and my underpowered PC. I think we can train
> all of the Ethiopic-script languages (Ge'ez, Amharic, Tigrinya and the
> others) in one sweep. I have about 8 GB of data for training Amharic, but
> my PC just cannot handle it. We can meet in person and generate (collect)
> more data to include the other Ethiopic languages and train on it.
> (Sorry, I am writing on my phone.)
>
> On Sun, Jan 14, 2024, 3:14 PM Dellu Bw <[email protected]> wrote:
>
>> Most of the guides written for version 4 actually work for version 5. The
>> changes are minimal. It is better to stay on version 5 because it seems
>> to perform better. Are you using Linux?
>>
>> On Sat, Jan 13, 2024, 4:08 PM Menelik Berhan <[email protected]>
>> wrote:
>>
>>> Thanks for your swift reply. It would be my pleasure to collaborate with
>>> you.
>>>
>>> I've noticed that there are extensive guides and tutorials on training
>>> Tesseract 4.x, so I was considering switching to version 4.x.
>>> What would be the trade-off if I used Tesseract 4.x instead of 5.x?
>>>
>>> Thanks for your time!!!
>>>
>>>
>>> On Saturday, January 13, 2024 at 12:49:36 PM UTC+3 [email protected]
>>> wrote:
>>>
>>>> I spent some time trying to improve the default Amharic model. The
>>>> default model has a couple of characters missing. As I have noted in many
>>>> posts in this forum, training by removing the top layer is the best method
>>>> to introduce new characters.
>>>>
>>>> But I really struggled because the training was degrading the base
>>>> (default) model. I am also short of processing power.
>>>> Tesseract 5.3 also has some flaws that make it hard to use in
>>>> developing countries (because of the power outages).
>>>>
>>>> Dear Menelik, we might need to join hands on this.
>>>>
>>>> On Sat, Jan 13, 2024, 11:21 AM Menelik Berhan <[email protected]>
>>>> wrote:
>>>>
>>>>> *Background*
>>>>> I'm trying to use tesseract 5.3.3 on scanned old books written in
>>>>> Amharic (which uses Ethiopic script).
>>>>>
>>>>> *Major Shortcomings of amh.traineddata from tesseract*
>>>>>
>>>>> *Difference in type of Ethiopic script:* old Amharic texts contain
>>>>> Ethiopic script characters that are not included in the unicharset of
>>>>> amh.traineddata.
>>>>>
>>>>> *Difference in punctuation styles:* the old texts use some punctuation
>>>>> marks not used in modern Amharic. For some marks that are still used in
>>>>> modern Amharic, the old texts follow a different spacing pattern: they
>>>>> always put a space between a punctuation character and both the
>>>>> preceding and the following word, whereas modern texts put no space
>>>>> between a punctuation character and the preceding word.
>>>>>
>>>>> *Very narrow training_text & wordlist (based on
>>>>> tesseract/langdata_lstm)*
>>>>> The amh.training_text & amh.wordlist text files used by tesseract (the
>>>>> ones from langdata_lstm) are very small. To give you an idea: for
>>>>> tir.traineddata (another language which uses Ethiopic script), the
>>>>> tir.training_text from langdata_lstm has more than 400,000 lines, while
>>>>> the amh.training_text has only around 400 lines.
>>>>>
>>>>> *Other challenges*
>>>>>
>>>>>    - The old Amharic books use a font that is no longer in use (or
>>>>>    available).
>>>>>    - The old Amharic books contain many Ge'ez words (Ge'ez is a
>>>>>    liturgical language, like Latin, which uses Ethiopic script).
>>>>>    - The old Amharic books mostly use Ge'ez numerals, while modern
>>>>>    Amharic texts use Arabic numerals.
>>>>>
>>>>> *WHAT I'VE DONE SO FAR*
>>>>> As an experiment I've tried to fine tune amh.traineddata_best (using
>>>>> `make training`) with close to 300 line images & texts (from sample pages
>>>>> of some old Amharic books) and using files from langdata_lstm (for 10,000
>>>>> iterations).
>>>>>
>>>>> The resulting traineddata has a very satisfactory improvement in
>>>>> addressing some of the challenges mentioned above, especially those
>>>>> regarding punctuation chars.
>>>>>
>>>>> But it still fails to solve the problems I have with some characters
>>>>> (the ones not present in the unicharset of amh.traineddata) and fails for
>>>>> almost all Ge'ez numerals (even though the training sample pages contain
>>>>> many Ge'ez numerals).
>>>>>
>>>>> *WHAT I'M PLANNING TO DO*
>>>>> First, I want to train tesseract with large training_text & wordlist
>>>>> files, and also a complete unicharset file.
>>>>> Then I want to fine-tune the resulting traineddata on sample line images
>>>>> from the old books.
>>>>>
>>>>> *QUESTIONS (for now. I'll definitely add more questions later)*
>>>>> Is there another path I should take that would get me to where I want?
>>>>>
>>>>> *Regarding training tesseract with large training_text & wordlist
>>>>> files, and also a complete unicharset file:*
>>>>>
>>>>>    - How should I prepare the training_text & wordlist files? (What
>>>>>    should the text files contain?)
>>>>>    - How should I prepare the unicharset file, and how do I pass it to
>>>>>    the `make training` command?
>>>>>
>>>>>
>>>>> *Regarding generating a text, image(tif) and box file from
>>>>> training_text:*
>>>>>
>>>>> I've looked up Python scripts to do this job, but I have questions about
>>>>> the proper values for these text2image parameters:
>>>>> --font (what criteria should I use to select the list of fonts),
>>>>> --leading, --xsize, --ysize, --char_spacing, --exposure,
>>>>> --unicharset_file and --margin.
>>>>>
>>>>> I've noticed from tesstrain repo for tesseract 5 that the line images
>>>>> are tightly cropped (with minimal margin around text line). Is the same
>>>>> property (minimal margins) required/desired of the line images generated
>>>>> using text2image from the training_text?
>>>>>
>>>>> *THANKS FOR YOUR TIME !!!*
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/9bda9bc4-b07a-491b-b8fc-fbb25b54c368n%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/9bda9bc4-b07a-491b-b8fc-fbb25b54c368n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>
>>>
>
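
On the unicharset problem described in the original post (characters and
Ge'ez numerals that the model never outputs), here is a minimal Python sketch,
not something posted in the thread, for listing which characters of the
ground-truth text are absent from a unicharset. It assumes the unicharset is a
text file whose first line is the entry count and whose following lines each
start with the character itself, and that every Ethiopic entry is a single
code point. The function names are hypothetical, not part of tesseract or
tesstrain.

```python
from collections import Counter


def load_unicharset_chars(lines):
    """Return the first whitespace-delimited field of every entry line.

    Assumption: lines[0] is the entry count, and each later line begins
    with the character it describes.
    """
    entries = set()
    for line in lines[1:]:
        fields = line.split()
        if fields:
            entries.add(fields[0])
    return entries


def missing_characters(text, known):
    """Count non-whitespace characters of `text` that are not in `known`."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return {ch: n for ch, n in counts.items() if ch not in known}


if __name__ == "__main__":
    # Tiny made-up unicharset and ground-truth line, for illustration only.
    sample_unicharset = ["3", "NULL 0 ...", "ሀ 1 ...", "ለ 1 ..."]
    known = load_unicharset_chars(sample_unicharset)
    for ch, n in missing_characters("ሀለ ፩፩", known).items():
        print(f"U+{ord(ch):04X} {ch} appears {n} time(s) but is not in the unicharset")
```

Running something like this over all the line transcriptions should show
whether the Ge'ez numerals (U+1369 to U+137C) and the older letter forms are
missing from the unicharset, which is the case where fine-tuning alone cannot
produce them.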

