Nice! I will wait for the client though, thx. Where will the source images
be stored? Labs or Commons? It would be nice if you could somehow make a
client that builds a DjVu file locally from the page image and the OCR text,
which you could clean up before putting it into the DjVu file. Now it just
Hello again.
So, I've set up an OpenOCR instance on Labs that's available for use as a
service. Just call it and point to an image. Example:
curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract"}'
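For anyone who would rather script the call than use curl, here is a minimal Python sketch of the same request. The service address is not given in this message, so OCR_ENDPOINT below is a hypothetical placeholder, and the helper names build_ocr_request and run_ocr are made up for illustration. The only things taken from the example above are the POST method, the JSON content type, and the img_url and engine fields.

```python
import json
import urllib.request

# Hypothetical placeholder: the thread does not give the service address.
OCR_ENDPOINT = "http://ocr.example.org/ocr"


def build_ocr_request(img_url, engine="tesseract", endpoint=OCR_ENDPOINT):
    """Build (but do not send) a POST request carrying the JSON payload."""
    payload = json.dumps({"img_url": img_url, "engine": engine}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_ocr(img_url, engine="tesseract", endpoint=OCR_ENDPOINT, timeout=60):
    """Send the request and return the response body decoded as UTF-8 text."""
    request = build_ocr_request(img_url, engine, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only inspect the request here; call run_ocr() once a real endpoint is known.
    req = build_ocr_request("http://bit.ly/ocrimage")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Building the request separately from sending it makes it possible to check the payload without touching the network. What the service returns is not described here, so run_ocr just hands back the raw response text.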
I explored ABBYY gx files, the full XML output from the ABBYY OCR engine
running at Internet Archive, and I've been astonished by the amount of data
they contain - they are stored at the XCA_Extended detail level (as documented at
http://www.abbyy-developers.com/en:tech:features:xml ).
Something that
On Sat, Jul 11, 2015 at 8:44 AM, Nicolas VIGNERON
vigneron.nico...@gmail.com wrote:
Hi,
I'm not a techie, so I'm not sure I know what OCR-as-a-service is, but you
should ask Tpt and Phe, who have OCR stuff on Tool Labs (to find out what is
behind tools like
On Sat, Jul 11, 2015 at 9:59 AM, Andrea Zanni zanni.andre...@gmail.com
wrote:
uh, that sounds very interesting.
Right now, we mainly use OCR from djvu from Internet Archive (that means
ABBYY Finereader, which is very nice).
Yes, the output is generally good. But as far as I can tell, the
OCR is available via a JavaScript script. A number of Wikisources have it enabled as
a gadget, though I cannot speak for all the wikis. I presume that depends on
the languages available in the OCR.
The script is noted at
https://wikisource.org/wiki/Wikisource:Shared_Scripts
Regards, Billinghurst
On Sun, Jul
Very, very interesting. I can't help you, as my skills are very limited, but
I'm very interested in this and I hope that my interest will be widely
shared.
Alex
2015-07-11 12:04 GMT+02:00 Asaf Bartov abar...@wikimedia.org:
Hi.
Speaking of Wikisource software, do we already have any instance
uh, that sounds very interesting.
Right now, we mainly use OCR from djvu from Internet Archive (that means
ABBYY Finereader, which is very nice).
But ideally we could think of a customizable OCR software that gets
trained language by language: that would be extremely useful for
Wikisources.
(i