PS: I forwarded Jim's message to one of the Belarusian Wikisourcers
On Tue, Aug 12, 2014 at 11:12 PM, Jim O'Regan <[email protected]> wrote:

> On 12 August 2014 17:25, Nick White <[email protected]> wrote:
> > Dear Wikisourcerers,
> >
> > It's good to hear from you. Wikisource is awesome, as far as I am
> > concerned.
> >
> >> One of the most serious issues was raised by the Belarusian community
> >> which uses 2 different scripts with no commercial OCR support. This
> >> means that the volunteers have to type each word manually. We wondered
> >> if it would be possible to train Tesseract to recognize these old texts
> >> using the text that has been already typed.
> >
> > Actually, Tesseract should already have support for Russian and
> > Belarussian "out of the box"; see the 'rus' and 'bel' training data.
> >
>
> 'bel' contains Cyrillic; there is also a Latin script ('Łacinka') for
> Belarusian. (Russian is widely spoken in Belarus, but Russian texts
> would be added to the Russian Wikisource).
>
> The question I'd have for the Belarusian Wikisourcers is: can they be
> treated as having an exact mapping? (It doesn't need to be 1:1, I'm
> aware that, e.g., 'нь' maps to 'ń'). I ask because, as I remember it,
> there's very little text in Łacinka, and adapting Cyrillic material
> could be useful.
>
> > One thing that wikisource could potentially do for us would be
> > provide loads of proofread, freely reusable "ground truth" data to
> > test Tesseract with. Are there programatic ways of getting at the
> > data, for example downloading all page images and corresponding text
> > that is marked as green, for a specific language / script?
>
> They're all added to a category, so that part should be pretty easy.
>
> --
> <Sefam> Are any of the mentors around?
> <jimregan> yes, they're the ones trolling you

--
Etiamsi omnes, ego non
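On the question of getting at the proofread pages programmatically: since they are all added to a category, one way to list them is the MediaWiki API's `list=categorymembers` query. The sketch below only builds the request URL and does not fetch anything. The API endpoint and the category title are placeholders I made up for illustration, not the real names used on the Belarusian Wikisource, and the helper name `category_members_url` is hypothetical.

```python
from urllib.parse import urlencode


def category_members_url(api_url, category, limit=500, cont=None):
    """Build a MediaWiki API URL that lists the pages in one category.

    api_url  -- the wiki's api.php endpoint (placeholder in the example below)
    category -- full category title, including the "Category:" prefix
    limit    -- page size requested via cmlimit
    cont     -- value of "cmcontinue" from a previous response, if any
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": limit,
        "format": "json",
    }
    if cont:
        # Large categories are paged; pass the token back to get the next batch.
        params["cmcontinue"] = cont
    return api_url + "?" + urlencode(params)


# Assumed endpoint and category name, for illustration only.
url = category_members_url("https://example.org/w/api.php", "Category:Example")
print(url)
```

Each response would give page titles; the page images and proofread text would still need to be fetched separately for every title.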

