Re: [Wikitech-l] Fwd: wikicaptcha on GitHub

Tei Thu, 19 Jan 2012 03:04:03 -0800

On 19 January 2012 11:19, Cristian Consonni <[email protected]> wrote:
> 2012/1/15 Nikola Smolenski <[email protected]>:
>> On Wednesday 11 January 2012 18:19:14 Cristian Consonni wrote:
>> However, to my knowledge there is not a single OCR that exports this data,
>> nor is there a standard format for it. If an open source OCR could be modified
>> to do this, then it would be easy to inject data retrieved from captchas back
>> into OCR-ed text. And it could be used for so much more :)
>
> I know of (but am not proficient with) at least two open source
> OCR programs:
> * OCRopus[1a][1b], by the German Research Center for Artificial
> Intelligence, sponsored by Google
> * Tesseract[2a][2b], started by HP back in 1995, now Google-sponsored
> (yeah, this one too!) [note: as far as I know OCRopus used Tesseract
> as an engine for OCR]
> * GOCR/JOCR
>
> I think much can be done.
>
> Cristian


More related tools, from the DocumentCloud project:

Raw Engine  => Tools
http://documentcloud.github.com/docsplit/

Tools => Human Documents
https://github.com/documentcloud/document-viewer

Human Documents => Beautiful viewers
http://www.pbs.org/newshour/rundown/documents/mark-twain-concerning-the-interview.html
http://www.commercialappeal.com/withers-exposed/pages-from-foia-reveal-withers-as-informant/#document/p2/a2431

Using Tesseract alone is "too much work". Tesseract wants TIFF files in
a particular format and at a particular DPI. Humans want something easy
to use: perhaps click on an image and get the text directly under the
mouse pointer, as text that can be copied and pasted.
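
To give an idea of the "too much work" part, here is a rough sketch of the
preparation step in Python. It assumes ImageMagick's convert and the
tesseract command-line tool are installed and on the PATH. The function
names (build_ocr_commands, run_ocr), the "-ocr" file suffix and the 300 DPI
default are my own choices for illustration, not something Tesseract or
docsplit prescribes.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_commands(image_path, dpi=300):
    """Return (convert_cmd, tesseract_cmd) as argument lists.

    Hypothetical helper: it only builds the command lines, so it can be
    inspected without either tool being installed.
    """
    src = Path(image_path)
    tiff = src.with_name(src.stem + "-ocr.tif")
    out_base = src.with_name(src.stem + "-ocr")  # tesseract appends .txt itself

    convert_cmd = [
        "convert", str(src),
        "-units", "PixelsPerInch",
        "-density", str(dpi),      # assumed value; tags the resolution, no resampling
        "-colorspace", "Gray",     # grayscale input
        "-compress", "none",       # uncompressed TIFF
        str(tiff),
    ]
    tesseract_cmd = ["tesseract", str(tiff), str(out_base)]
    return convert_cmd, tesseract_cmd


def run_ocr(image_path, dpi=300):
    """Run both steps and return the path of the text file tesseract writes."""
    for tool in ("convert", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    for cmd in build_ocr_commands(image_path, dpi):
        subprocess.run(cmd, check=True)
    src = Path(image_path)
    return src.with_name(src.stem + "-ocr.txt")
```

Calling run_ocr("scan.png") would write scan-ocr.tif and scan-ocr.txt next
to the input. Even so, this only yields plain text; it does nothing for the
"click on the image" use case, which is what the viewer tools above are for.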

-- 
--
ℱin del ℳensaje.

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
