Hello again.
So, I've set up an OpenOCR instance on Labs that's available for use as a
service. Just call it and point to an image. Example:
curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract"}' http://openocr.wmflabs.org/ocr
should yield:
"You can create local variables for the pipelines within the template by
prefixing the variable name with a “$" sign. Variable names have to be
composed of alphanumeric characters and the underscore. In the example
below I have used a few variations that work for variable names."
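For anyone who'd rather script this than use curl, here's a minimal Python sketch of the same call (same endpoint and JSON fields as above; the function names `build_ocr_request` and `ocr_image` are just mine, not part of the service):

```python
# Minimal client sketch for the OpenOCR service described above.
# Assumes the endpoint and JSON fields from the curl example:
# POST {"img_url": ..., "engine": "tesseract"} to /ocr.
import json
import urllib.request

OCR_ENDPOINT = "http://openocr.wmflabs.org/ocr"

def build_ocr_request(img_url, engine="tesseract"):
    """Build the JSON POST request for the OpenOCR service."""
    payload = json.dumps({"img_url": img_url, "engine": engine}).encode("utf-8")
    return urllib.request.Request(
        OCR_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ocr_image(img_url, engine="tesseract"):
    """Send the request and return the recognized text as a string."""
    with urllib.request.urlopen(build_ocr_request(img_url, engine)) as resp:
        return resp.read().decode("utf-8")
```

Calling `ocr_image("http://bit.ly/ocrimage")` should return the same text as the curl example, network permitting.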
If we see evidence of abuse, we might have to protect it with API keys, but
for now, let's AGF. :)
I'm working on something that would be a client of this service, but don't
have a demo yet. Stay tuned! :)
A.
On Sun, Jul 12, 2015 at 3:27 PM, Alex Brollo <[email protected]> wrote:
> I explored the abbyy gz files, the full XML output from the ABBYY OCR
> engine running at the Internet Archive, and I've been astonished by the
> amount of data they contain - they are stored at XCA_Extended detail (as
> documented at http://www.abbyy-developers.com/en:tech:features:xml ).
>
> This is something the best Wikisource developers should explore: comparing
> that data with the little that makes it into the mapped text layer of djvu
> files is impressive, and should be inspiring.
>
> But these are static data coming from a standard setting... nothing like a
> service with simple, shared, deep-learning features for difficult and
> ancient texts. I tried the "ancient italian" tesseract dictionary, with
> very poor results.
>
> So Asaf, I can't wait for good news from you. :-)
>
> Alex
>
> 2015-07-12 12:50 GMT+02:00 Andrea Zanni <[email protected]>:
>
>>
>>
>> On Sun, Jul 12, 2015 at 11:25 AM, Asaf Bartov <[email protected]>
>> wrote:
>>
>>> On Sat, Jul 11, 2015 at 9:59 AM, Andrea Zanni <[email protected]>
>>> wrote:
>>>
>>>> uh, that sounds very interesting.
>>>> Right now, we mainly use the OCR from the djvu files from the Internet
>>>> Archive (that means ABBYY FineReader, which is very nice).
>>>>
>>>
>>> Yes, the output is generally good. But as far as I can tell, the
>>> archive's Open Library API does not offer a way to retrieve the OCR output
>>> programmatically, and certainly not for an arbitrary page rather than the
>>> whole item. What I'm working on requires the ability to OCR a single page
>>> on demand.
>>>
>> True.
>> I've recently met Giovanni, a new (Italian) guy who's now working with
>> Internet Archive and Open Library.
>> We discussed a number of possible partnerships/projects; this is
>> definitely one to bring up.
>>
>> But if we manage to do it directly in the Wikimedia world it's even
>> better.
>>
>> Aubrey
>>
>>
>>>
>>> _______________________________________________
>>> Wikisource-l mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
>>>
>>>
>>
>
--
Asaf Bartov
Wikimedia Foundation <http://www.wikimediafoundation.org>
Imagine a world in which every single human being can freely share in the
sum of all knowledge. Help us make it a reality!
https://donate.wikimedia.org