Re: [tesseract-ocr] Outreach from the Wikisource community

Jim O'Regan Tue, 12 Aug 2014 14:13:42 -0700

On 12 August 2014 17:25, Nick White <[email protected]> wrote:
> Dear Wikisourcerers,
>
> It's good to hear from you. Wikisource is awesome, as far as I am
> concerned.
>
>> One of the most serious issues was raised by the Belarusian community
>> which uses 2 different scripts with no commercial OCR support. This
>> means that the volunteers have to type each word manually. We wondered
>> if it would be possible to train Tesseract to recognize these old texts
>> using the text that has been already typed.
>
> Actually, Tesseract should already have support for Russian and
> Belarusian "out of the box"; see the 'rus' and 'bel' training data.
>


'bel' contains Cyrillic; there is also a Latin script ('Łacinka') for
Belarusian. (Russian is widely spoken in Belarus, but Russian texts
would be added to the Russian Wikisource.)

The question I'd have for the Belarusian Wikisourcers is: can the two
scripts be treated as having an exact mapping? (It doesn't need to be
1:1; I'm aware that, e.g., 'нь' maps to 'ń'.) I ask because, as I
remember it, there's very little text in Łacinka, and adapting Cyrillic
material could be useful.
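
To make the "exact but not 1:1" idea concrete, here is a minimal Python
sketch of a longest-match transliterator. Only the 'нь' -> 'ń' pair comes
from this thread; the other table entries, the names CYR_TO_LAT and
transliterate, and the lowercase-only handling are assumptions for
illustration. A real Cyrillic-to-Łacinka converter would need a complete
table and probably context-dependent rules, which is exactly what the
question above is asking about.

```python
# Deliberately incomplete, illustrative table (an assumption, not a
# full Łacinka mapping). Only 'нь' -> 'ń' is taken from the thread.
CYR_TO_LAT = {
    "нь": "ń",  # two Cyrillic letters map to one Latin letter
    "а": "a",
    "к": "k",
    "н": "n",
    "о": "o",
    "с": "s",
}


def transliterate(text, table=CYR_TO_LAT):
    """Replace the longest matching key at each position.

    Characters with no table entry (including uppercase letters,
    which this sketch does not handle) are copied through unchanged.
    """
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so 'нь' wins over 'н'.
        for size in range(max_len, 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += len(chunk)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

With this table, transliterate("конь") gives "koń". Trying the longest
key first is what lets a many-to-one pair like 'нь' coexist with the
single-letter entry for 'н'.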

> One thing that wikisource could potentially do for us would be
> provide loads of proofread, freely reusable "ground truth" data to
> test Tesseract with. Are there programmatic ways of getting at the
> data, for example downloading all page images and corresponding text
> that is marked as green, for a specific language / script?

They're all added to a category, so that part should be pretty easy.
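
As a rough sketch of what "pretty easy" could look like, the MediaWiki
API can list the members of a category with action=query and
list=categorymembers. The endpoint URL, the User-Agent string, and the
helper names below are assumptions, and the actual category name for
validated pages would have to be looked up on the wiki in question. This
only lists page titles; fetching the page images and text would be a
further step.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for the Belarusian Wikisource; adjust per wiki.
API_URL = "https://be.wikisource.org/w/api.php"


def build_query(category, cont=None, limit=500):
    """Build a categorymembers query URL for one batch of results."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": str(limit),
        "format": "json",
    }
    if cont:
        # Continuation token returned by the previous batch.
        params["cmcontinue"] = cont
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_titles(response):
    """Return (titles, continuation token or None) from a decoded reply."""
    members = response.get("query", {}).get("categorymembers", [])
    titles = [member["title"] for member in members]
    cont = response.get("continue", {}).get("cmcontinue")
    return titles, cont


def list_category(category):
    """Fetch every title in a category, following continuation tokens."""
    titles, cont = [], None
    while True:
        request = urllib.request.Request(
            build_query(category, cont),
            # Hypothetical User-Agent; use one that identifies your tool.
            headers={"User-Agent": "ground-truth-sketch/0.1"},
        )
        with urllib.request.urlopen(request) as resp:
            data = json.load(resp)
        batch, cont = parse_titles(data)
        titles.extend(batch)
        if cont is None:
            return titles
```

Calling list_category with the name of the relevant category would
return the page titles in it, which could then be used to fetch each
page's text and scan.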

-- 
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you

