[tesseract-ocr] Re: Announcement: Python package pytesstrain (Tesseract training helpers)

Wincent Balin Mon, 03 Feb 2020 12:37:33 -0800

Hi Shree,

I am glad you already find the package useful :-)


As to your question: I did not use the ocr-evaluation tools, only the 
language_metrics utility. So, regrettably, I cannot help you here. But 
maybe you could try the same utility too?
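
For an independent sanity check on text from the supplementary planes (such as 
Modi or cuneiform), the two metrics can be sketched in a few lines of plain 
Python. This is a minimal sketch, not the implementation used by pytesstrain 
or by ocrevalutf8; the function names levenshtein, cer and wer are 
hypothetical. It relies only on the fact that a Python 3 str is a sequence of 
Unicode code points, so a character such as U+11600 counts as one symbol.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insert/delete/substitute, cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate, counted per Unicode code point."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(list(reference), list(hypothesis)) / len(reference)


def wer(reference, hypothesis):
    """Word error rate, with words split on whitespace."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


# Illustrative strings built from code points in the Modi block
# (U+11600..U+1165F); they are not real words.
reference = "\U00011600\U00011601 \U00011602"
hypothesis = "\U00011600\U00011602 \U00011602"

print(len(reference))              # code points, not UTF-16 code units
print(cer(reference, hypothesis))
print(wer(reference, hypothesis))
```

If these numbers look plausible while another tool reports odd results for the 
same pair of files, the difference is likely in how that tool splits the text 
into characters.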

By the way, I added a create_ground_truth utility to the package. It creates 
.gt.txt files as well as the associated .tif files for every specified font. 
I think it could be useful for anyone who does not have a ground truth 
collection yet.

Kind regards,

Wincent


On Wednesday, 29 January 2020 at 06:47:01 UTC+1, shree wrote:
>
> Hi Wincent,
>
> Thank you for sharing these tools. I find create-dictdata to be very 
> useful.
>
> I wanted to know if you have modified any ocr-evaluation tools to handle 
> the high Unicode range, such as that used for the Akkadian language.
>
> I was testing with the Modi script (range: U+11600..U+1165F, 96 code 
> points) and found that `ocrevalutf8 accuracy` does not work well for it. 
> Any suggestions?
>
> Shree
>
> On Sunday, January 5, 2020 at 2:22:50 AM UTC+5:30, Wincent Balin wrote:
>>
>> Hi all,
>>
>> I would like to announce pytesstrain, a collection of Tesseract training 
>> tools, as well as the underlying library. The tools were created while 
>> training Tesseract to recognise the Akkadian language (stay tuned for more 
>> posts!), to solve the problems that emerged in the process.
>>
>> You can install it with `pip install pytesstrain`.
>>
>> The PyPI page for the package is https://pypi.org/project/pytesstrain/. 
>> The GitHub project page is https://github.com/wincentbalin/pytesstrain.
>>
>> This package contains tools to create dictionary data (wordlist, bigram 
>> and unigram lists, etc.), rewrap lines in text files to a specified 
>> length, collect the most frequent recognition errors and write them to a 
>> unicharambigs file, and compute recognition metrics (WER and CER). It 
>> also contains the run_test() function, which creates an image file from 
>> a given string and then performs OCR on it, as well as its parallelised 
>> version, run_tests(). Both can be used in future tools.
>>
>> Feedback, suggestions, etc. would be most welcome.
>>
>> Yours truly,
>>
>> Wincent
>>
>>

