I explored the abbyy.gz files, the full XML output of the ABBYY OCR engine
running at Internet Archive, and I was astonished by the amount of data
they contain. They are stored at the XCA_Extended detail level (as documented at
http://www.abbyy-developers.com/en:tech:features:xml ).
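
To give an idea of the kind of data involved, here is a minimal Python sketch
that groups per-character records into words, each with a bounding box and its
lowest character confidence. The sample XML is hand-written for illustration,
not taken from a real Internet Archive item; the namespace URI and the names
charParams, l, t, r, b, wordStart and charConfidence are assumptions based on
my reading of the ABBYY schema, and words_with_confidence is a hypothetical
helper, so check everything against an actual file.

```python
import xml.etree.ElementTree as ET

# Hand-written sample imitating ABBYY XML output (assumed structure and
# attribute names; the namespace URI is an assumption as well).
SAMPLE = (
    '<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml">'
    '<page width="1000" height="1500"><block blockType="Text"><text><par>'
    '<line l="10" t="20" r="150" b="50"><formatting lang="ItalianStandard">'
    '<charParams l="10" t="20" r="30" b="50" wordStart="true" charConfidence="95">C</charParams>'
    '<charParams l="30" t="20" r="50" b="50" wordStart="false" charConfidence="90">h</charParams>'
    '<charParams l="50" t="20" r="70" b="50" wordStart="false" charConfidence="40">e</charParams>'
    '<charParams l="70" t="20" r="90" b="50" wordStart="false" charConfidence="100"> </charParams>'
    '<charParams l="90" t="20" r="110" b="50" wordStart="true" charConfidence="88">b</charParams>'
    '<charParams l="110" t="20" r="130" b="50" wordStart="false" charConfidence="92">e</charParams>'
    '<charParams l="130" t="20" r="150" b="50" wordStart="false" charConfidence="35">l</charParams>'
    '</formatting></line></par></text></block></page></document>'
)


def local(tag):
    """Drop the '{namespace}' prefix so the code does not depend on the URI."""
    return tag.rsplit("}", 1)[-1]


def words_with_confidence(xml_text):
    """Return (word, (l, t, r, b), lowest character confidence) per word."""
    root = ET.fromstring(xml_text)
    words = []
    for line in (e for e in root.iter() if local(e.tag) == "line"):
        current = None
        for cp in (e for e in line.iter() if local(e.tag) == "charParams"):
            ch = cp.text or ""
            if not ch.strip():
                # Whitespace ends the current word.
                current = None
                continue
            box = [int(cp.get(k, "0")) for k in ("l", "t", "r", "b")]
            conf = int(cp.get("charConfidence", "0"))
            if current is None or cp.get("wordStart") == "true":
                current = {"text": ch, "box": box, "conf": conf}
                words.append(current)
            else:
                cb = current["box"]
                current["text"] += ch
                # Grow the word box to cover this character.
                current["box"] = [min(cb[0], box[0]), min(cb[1], box[1]),
                                  max(cb[2], box[2]), max(cb[3], box[3])]
                current["conf"] = min(current["conf"], conf)
    return [(w["text"], tuple(w["box"]), w["conf"]) for w in words]


if __name__ == "__main__":
    for word, box, conf in words_with_confidence(SAMPLE):
        print(word, box, conf)
```

A proofreading tool could use the lowest confidence per word to highlight
doubtful words on the page scan.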

This is something Wikisource's best developers should explore: comparing those
data with the small amount of data mapped into the text layer of djvu files is
impressive and should be inspiring.

But they are static data produced with a standard setting, nothing like
a service with simple, shared, deep-learning features for difficult and
ancient texts. I tried the "ancient Italian" Tesseract dictionary, with very
poor results.

So Asaf, I can't wait for good news from you. :-)

Alex

2015-07-12 12:50 GMT+02:00 Andrea Zanni <[email protected]>:

>
>
> On Sun, Jul 12, 2015 at 11:25 AM, Asaf Bartov <[email protected]>
> wrote:
>
>> On Sat, Jul 11, 2015 at 9:59 AM, Andrea Zanni <[email protected]>
>> wrote:
>>
>>> uh, that sounds very interesting.
>>> Right now, we mainly use OCR from djvu from Internet Archive (that means
>>> ABBYY FineReader, which is very nice).
>>>
>>
>> Yes, the output is generally good.  But as far as I can tell, the
>> archive's Open Library API does not offer a way to retrieve the OCR output
>> programmatically, and certainly not for an arbitrary page rather than the
>> whole item.  What I'm working on requires the ability to OCR a single page
>> on demand.
>>
> True.
> I've recently met Giovanni, a new (Italian) guy who's now working with
> Internet Archive and Open Library.
> We discussed a number of possible partnerships/projects; this is
> definitely one to bring up.
>
> But if we manage to do it directly in the Wikimedia world it's even
> better.
>
> Aubrey
>
>
>>
>> _______________________________________________
>> Wikisource-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
>>
>>
