On 25/01/12 10:23, Federico Leva (Nemo) wrote:
> OCRs generally work by finding lines of text on a page, splitting the
> lines into letters, then recognizing each letter separately. So, an OCR
> would know, for each letter of the recognized text, what its bounding
> box on the page is.
>
> However, to my knowledge there is not a single OCR that exports this
> data, nor is there a standard format for it. If an open source OCR
> could be modified to do this, then it would be easy to inject data
> retrieved from captchas back into OCR-ed text. And it could be used
> for so much more :)
> I don't understand, what data are you talking about?

If you know the bounding box of the image of the word you are sending
to the captcha, how the OCR read that word, and how the users have
corrected it via the captcha, it should be easy to move the corrected
word back into the OCR output.

> DjVu is an open format and can store character mappings, which is what
> the wikicaptcha proof of concept is based on. There's also
> https://en.wikipedia.org/wiki/HOCR and IA uses some proprietary ABBYY
> xml format which AFAIK can be somehow read and converted to hOCR.

I have to say I didn't know about these developments :) (I knew about
ABBYY's format, but it's, as you said, proprietary.)
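For illustration, hOCR stores each recognized word together with its bounding box in the `title` attribute of a span (e.g. `bbox 100 120 180 140`), which is exactly the data needed to match a captcha-corrected word back to its place on the page. A minimal sketch of reading those boxes with Python's standard library (the class name `ocrx_word` and the `title` syntax follow the hOCR convention; the sample fragment itself is made up):

```python
from html.parser import HTMLParser

class HocrWords(HTMLParser):
    """Collect (text, bbox) pairs from hOCR word spans (class ocrx_word)."""
    def __init__(self):
        super().__init__()
        self.words = []    # list of (text, (x0, y0, x1, y1))
        self._bbox = None  # bbox of the word span currently being read

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "ocrx_word" in a.get("class", ""):
            # The title attribute looks like: "bbox 100 120 180 140; x_wconf 87"
            for part in a.get("title", "").split(";"):
                fields = part.split()
                if fields and fields[0] == "bbox":
                    self._bbox = tuple(int(v) for v in fields[1:5])

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None

# A made-up hOCR fragment, just to show the shape of the data:
sample = ('<span class="ocrx_word" title="bbox 100 120 180 140; x_wconf 87">'
          'word</span>')
p = HocrWords()
p.feed(sample)
print(p.words)  # [('word', (100, 120, 180, 140))]
```

With word text and box in hand, a corrected reading from a captcha could simply overwrite the text of the word whose box matches the image that was sent out.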
The real problem is character-level training data, which could be used
for subsequent OCRs. I doubt we can do much here, because everyone uses
ABBYY, and even Tesseract users don't seem to share such data in any way.
It is a pity, for as I said many things could be done with this data.
For example, it would be possible to read the same text with multiple
different OCRs and quickly find errors: if a text is read the same by
all of them, it is likely correct, and if it is read differently, then
at least one reading is certainly wrong. It would also be possible to
use this data to retrain an OCR, to develop new OCRs, and so on.
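The cross-checking idea above can be sketched in a few lines; the word lists here are hypothetical stand-ins for the aligned, word-by-word output of two OCR runs over the same page:

```python
def flag_disagreements(reading_a, reading_b):
    """Compare two aligned word-by-word OCR readings of the same text.
    Words read identically are probably correct; where the readings
    differ, at least one OCR is wrong, so flag the word for review."""
    flags = []
    for i, (a, b) in enumerate(zip(reading_a, reading_b)):
        if a != b:
            flags.append((i, a, b))
    return flags

# Hypothetical readings of the same line by two different OCR engines:
ocr1 = ["the", "quick", "brovvn", "fox"]
ocr2 = ["the", "quick", "brown", "fox"]
print(flag_disagreements(ocr1, ocr2))  # [(2, 'brovvn', 'brown')]
```

The flagged positions would be exactly the words worth sending out as captchas, since the agreeing words need no human attention.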
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l