Re: [tesseract-ocr] Outreach from the Wikisource community

Nick White Tue, 12 Aug 2014 09:26:30 -0700

Dear Wikisourcerers,

It's good to hear from you. Wikisource is awesome, as far as I am 
concerned.


> One of the most serious issues was raised by the Belarusian community
> which uses 2 different scripts with no commercial OCR support. This
> means that the volunteers have to type each word manually. We wondered
> if it would be possible to train Tesseract to recognize these old texts
> using the text that has been already typed.

Actually, Tesseract should already have support for Russian and
Belarusian "out of the box"; see the 'rus' and 'bel' training data.

> We would like to know if you would be interested in exploring
> collaboration possibilities. I imagine that with your guidance we
> could prepare training data

The first thing to do would be to take a look at the results you get 
from Tesseract with the rus and bel training sets already available, 
and let us know if they aren't appropriate.
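
For anyone who wants to script that first test, here is a minimal
sketch. It assumes the tesseract binary is on your PATH and that the
bel (or rus) training data is installed; the function names and the
file name page.png are only illustrative, not anything Tesseract
itself provides.

```python
import subprocess
from pathlib import Path


def build_command(image, output_base, lang="bel"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def run_ocr(image, output_base, lang="bel"):
    """Run Tesseract on one page image and return the recognised text.

    Tesseract writes its plain-text result to OUTPUTBASE.txt.
    """
    subprocess.run(build_command(image, output_base, lang), check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Calling run_ocr("page.png", "page", lang="bel") on a hypothetical scan
would leave the text in page.txt and return it as a string, so the
same page can be tried with 'rus' and the results compared by eye.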

> not only in different languages, but also from different time 
> periods, scripts, etc.

As to training for specific scripts, time periods, etc., in theory
that is super cool, but in practice one training set should probably
be able to cover more or less everything (except very different
scripts, like Fraktur). That has been my experience with training
Ancient Greek (for which I have been interested in recognising
printing from a variety of time periods).

So give Tesseract a whirl, and if it isn't appropriate, or doesn't 
work for specific scripts, let us know and we can try to figure out 
a plan.

> At the moment it is not very clear how to achieve this.

My plan is to rewrite the training documentation very soon, so 
things should hopefully become clearer on that front.

One thing that Wikisource could potentially do for us would be to
provide loads of proofread, freely reusable "ground truth" data to
test Tesseract with. Are there programmatic ways of getting at the
data, for example downloading all page images and corresponding text
that is marked as green, for a specific language / script?
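
To give an idea of how such ground truth could be used, here is a
rough sketch of one common way to score OCR output against a proofread
page: the character error rate, based on edit distance. This is only
an illustration under my own assumptions, not Tesseract's own test
tooling, and the function names are made up for the example.

```python
def levenshtein(a, b):
    """Return the edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, ocr_output):
    """Edit distance divided by the length of the proofread text."""
    if not ground_truth:
        raise ValueError("ground truth text must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)
```

With proofread text and OCR output for the same page loaded as
strings, character_error_rate(proofread, ocr) gives a single number
per page that can be compared across training sets.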

Thanks for getting in touch!

Nick
