I explored the abbyy.gz files, the full XML output of the ABBYY OCR engine running at Internet Archive, and I've been astonished by the amount of data they contain - they are stored at the XCA_Extended detail level (as documented at http://www.abbyy-developers.com/en:tech:features:xml ).
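To give an idea of what is in there, here is a minimal sketch of reading per-character data out of such a file. The sample XML is made up, and the element and attribute names (charParams, l/t/r/b, charConfidence) and the namespace are my assumptions about the ABBYY schema, so check them against a real file before relying on this.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the general shape of ABBYY FineReader XML.
# Namespace, element and attribute names are assumptions, not copied from a real file.
SAMPLE = """<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml">
  <page width="1000" height="1500">
    <block blockType="Text">
      <text><par><line baseline="120" l="100" t="90" r="160" b="125">
        <formatting>
          <charParams l="100" t="90" r="120" b="125" charConfidence="98">c</charParams>
          <charParams l="122" t="90" r="140" b="125" charConfidence="41">i</charParams>
          <charParams l="142" t="90" r="160" b="125" charConfidence="87">o</charParams>
        </formatting>
      </line></par></text>
    </block>
  </page>
</document>"""


def local_name(tag):
    """Strip the '{namespace}' prefix so the code does not depend on the schema version."""
    return tag.rsplit("}", 1)[-1]


def read_chars(xml_text):
    """Return one dict per recognised character: text, bounding box, confidence."""
    chars = []
    for el in ET.fromstring(xml_text).iter():
        if local_name(el.tag) != "charParams":
            continue
        conf = el.get("charConfidence")
        chars.append({
            "char": el.text or "",
            "box": tuple(int(el.get(k)) for k in ("l", "t", "r", "b")),
            "confidence": int(conf) if conf is not None else None,
        })
    return chars


def low_confidence(chars, threshold=50):
    """Characters whose confidence is below a (hypothetical) review threshold."""
    return [c for c in chars
            if c["confidence"] is not None and c["confidence"] < threshold]


if __name__ == "__main__":
    for c in low_confidence(read_chars(SAMPLE)):
        print(c["char"], c["box"], c["confidence"])
```

Something like low_confidence() is the kind of thing a proofreading tool could use to highlight doubtful characters, which the djvu text layer alone does not allow.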
This is something that Wikisource's best developers should explore; comparing those data with the little bit of data in the mapped text layer of djvu files is impressive and should be inspiring.

But they are static data coming from a standard setting... nothing similar to a service with simple, shared, deep learning features for difficult and ancient texts. I tried the "ancient Italian" Tesseract dictionary with very poor results.

So Asaf, I can't wait for good news from you. :-)

Alex

2015-07-12 12:50 GMT+02:00 Andrea Zanni <[email protected]>:

> On Sun, Jul 12, 2015 at 11:25 AM, Asaf Bartov <[email protected]> wrote:
>
>> On Sat, Jul 11, 2015 at 9:59 AM, Andrea Zanni <[email protected]> wrote:
>>
>>> uh, that sounds very interesting.
>>> Right now, we mainly use OCR from djvu from Internet Archive (that means
>>> ABBYY Finereader, which is very nice).
>>
>> Yes, the output is generally good. But as far as I can tell, the
>> archive's Open Library API does not offer a way to retrieve the OCR output
>> programmatically, and certainly not for an arbitrary page rather than the
>> whole item. What I'm working on requires the ability to OCR a single page
>> on demand.
>
> True.
> I've recently met Giovanni, a new (Italian) guy who's now working with
> Internet Archive and Open Library.
> We discussed a number of possible partnerships/projects; this is
> definitely one to bring up.
>
> But if we manage to do it directly in the Wikimedia world it's even
> better.
>
> Aubrey
>
>> _______________________________________________
>> Wikisource-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
