On 19 January 2012 11:19, Cristian Consonni <[email protected]> wrote:
> 2012/1/15 Nikola Smolenski <[email protected]>:
>> On Wednesday 11 January 2012 18:19:14 Cristian Consonni wrote:
>> However, to my knowledge there is not a single OCR that exports this data,
>> nor is there a standard format for it. If an open source OCR could be
>> modified to do this, then it would be easy to inject data retrieved from
>> captchas back into OCR-ed text. And it could be used for so much more :)
>
> I know of (but am not proficient in using) at least two open source
> OCR programs:
> * OCRopus[1a][1b], by the German Research Center for Artificial
>   Intelligence, sponsored by Google
> * Tesseract[2a][2b], started by HP back in 1995, now Google-sponsored
>   (yeah, this one too!) [note: as far as I know OCRopus used Tesseract
>   as an engine for OCR]
> * GOCR/JOCR
>
> I think much can be done.
>
> Cristian

More related tools, from the DocumentCloud project.

Raw engine => Tools
http://documentcloud.github.com/docsplit/

Tools => Human documents
https://github.com/documentcloud/document-viewer

Human documents => Beautiful viewers
http://www.pbs.org/newshour/rundown/documents/mark-twain-concerning-the-interview.html
http://www.commercialappeal.com/withers-exposed/pages-from-foia-reveal-withers-as-informant/#document/p2/a2431

Using Tesseract alone is "too much work". Tesseract wants TIFF files in a particular format and DPI. Humans want things in an easy-to-use format: perhaps click on an image and get the text directly under the mouse pointer, as text that can be copied and pasted.
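
To make the "too much work" point concrete, here is a minimal sketch of the two steps needed before Tesseract produces any text: converting the scan to a TIFF at a fixed resolution, then running the OCR. It only builds the command lines and does not run them. The helper name `build_ocr_commands`, the `ocr_out` directory, the use of ImageMagick's `convert` for the conversion, and the 300 DPI value are all assumptions for illustration, not something stated in this thread.

```python
import shlex
from pathlib import Path

# Assumption: 300 DPI is a commonly suggested scan resolution; the post
# only says Tesseract wants "a particular format and DPI".
ASSUMED_DPI = 300


def build_ocr_commands(source, workdir="ocr_out", dpi=ASSUMED_DPI, lang="eng"):
    """Return two argv lists: an ImageMagick conversion, then a Tesseract run.

    Hypothetical helper: it only assembles the commands, so it can be
    inspected without either tool being installed.
    """
    source = Path(source)
    tiff = Path(workdir) / (source.stem + ".tif")
    # Tesseract takes an output *base* name and appends ".txt" itself.
    out_base = Path(workdir) / source.stem

    convert_cmd = [
        "convert",
        "-density", str(dpi),   # resolution used when reading the input
        str(source),
        "-colorspace", "Gray",  # grayscale TIFF
        "-alpha", "off",        # drop any transparency channel
        str(tiff),
    ]
    tesseract_cmd = ["tesseract", str(tiff), str(out_base), "-l", lang]
    return convert_cmd, tesseract_cmd


if __name__ == "__main__":
    # Print the commands one would run by hand for a file named scan.png.
    for cmd in build_ocr_commands("scan.png"):
        print(shlex.join(cmd))
```

Even this small case needs a second tool and a handful of flags before any text comes out, which is the gap that wrappers such as docsplit aim to close.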
