Here is a link for digging into abbyy xml: http://www.abbyy-developers.com/en:tech:features:xml
It's very exciting, and far less esoteric than it seems at first glance. Perhaps abbyy xml could be used as the main source of usable OCR data in the proofreading procedure (an abbyy.gz file is listed for every OCR-ed Internet Archive book, and it is possible to extract the OCR with python routines: take a look at http://it.wikisource.org/wiki/Indice:Fisiologia_del_matrimonio.djvu, a test book where pages 17-30 come straight from the abbyy.xml file).

Alex

2013/6/15 Alex Brollo <[email protected]>

> I got it. o_O
>
> No need for regex, lxml, pyquery or XSLT... the simplest python parsing
> routines can understand abbyy xml and extract both the text and
> information about the text.
>
> The goal was to get, with python, both plain text (the same produced by
> the wikisource server when creating a new page from a djvu text layer)
> and some html formatting, in a format usable by VisualEditor; and if you
> take a look at http://it.wikipedia.org/wiki/Utente:Alex_brollo/Sandbox,
> you'll see in red only the words where the wordPenalty parameter is
> greater than 0 in the abbyy xml source file.
>
> Alex brollo (from it.wikisource)
>
>
> 2013/6/14 Alex Brollo <[email protected]>
>
>> IA gives abbyy xml files too (as .gz files); I opened one of them after a
>> suggestion from Phe, and I'm dreaming about extracting anything useful to
>> help proofreading. The only "small" problem is that I barely know what
>> xml is, only that it is similar to html in its (well-formed) structure
>> and that something called XSLT exists. :-(
>>
>> Is any of you working on abbyy xml files with a "little bit" more
>> skill?
>>
>> Alex brollo
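
For anyone who wants to try the same thing, here is a minimal sketch of the approach described in this thread: read the abbyy xml with plain python (standard library only) and produce both plain text and a little html in which words with wordPenalty greater than 0 are shown in red. It is not the routine actually used on it.wikisource. The element names page, line and charParams and the wordStart attribute are assumptions about the abbyy xml layout, and the function names and the file name in the usage note are made up for illustration.

```python
import gzip
import html
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_words(line):
    """Yield (word, penalty) pairs for one <line> element.

    Assumes one <charParams> element per character, with optional
    wordStart and wordPenalty attributes.
    """
    chars, penalty = [], 0
    for el in line.iter():
        if local(el.tag) != "charParams":
            continue
        ch = el.text or " "  # an empty element is treated as a space
        starts_word = el.get("wordStart") == "true"
        # A space or an explicit word start closes the current word.
        if ch.isspace() or starts_word:
            if chars:
                yield "".join(chars), penalty
            chars, penalty = [], 0
        if ch.isspace():
            continue
        chars.append(ch)
        penalty = max(penalty, int(el.get("wordPenalty") or 0))
    if chars:
        yield "".join(chars), penalty


def page_to_text_and_html(page):
    """Return (plain_text, html_markup) for one <page> element."""
    plain_lines, html_lines = [], []
    for el in page.iter():
        if local(el.tag) != "line":
            continue
        words = list(iter_words(el))
        plain_lines.append(" ".join(w for w, _ in words))
        html_lines.append(" ".join(
            '<span style="color:red">%s</span>' % html.escape(w)
            if p > 0 else html.escape(w)
            for w, p in words))
    return "\n".join(plain_lines), "<br/>\n".join(html_lines)


def parse_abbyy(source):
    """Parse a whole file (path or file object); one tuple per page."""
    root = ET.parse(source).getroot()
    return [page_to_text_and_html(p)
            for p in root.iter() if local(p.tag) == "page"]
```

Usage would look something like `with gzip.open("book_abbyy.gz", "rb") as f: pages = parse_abbyy(f)`, where the file name is hypothetical. This loads the whole document into memory; for a very large book, `ET.iterparse` would probably be the better choice.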
_______________________________________________
Wikisource-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikisource-l
