Here is a link for digging into ABBYY XML:
http://www.abbyy-developers.com/en:tech:features:xml

It's very exciting, and far less esoteric than it seems at first glance.
Perhaps ABBYY XML could be used as the main source of usable OCR data in
the proofreading procedure (an abbyy.gz file is listed for every OCR-ed
Internet Archive book, and the OCR can be extracted with Python routines:
take a look at http://it.wikisource.org/wiki/Indice:Fisiologia_del_matrimonio.djvu,
a test book where pages 17-30 come straight from the abbyy.xml file).
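
To give an idea of how little code this takes, here is a minimal sketch in
Python 3. It is not the routine used for the test book: it relies on the
standard-library xml.etree.ElementTree parser, and it assumes the file
follows the usual ABBYY layout, where each <line> element contains one
<charParams> element per character and the characters of a word carry a
wordPenalty attribute. The function names and the file name in the usage
note are made up for illustration.

```python
import gzip
import io
import xml.etree.ElementTree as ET
from html import escape


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_lines(source):
    """Yield each OCR line as a list of (word, highest wordPenalty) pairs.

    Assumption: one <charParams> element per character, with whitespace
    characters separating the words of a line.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "line":
            continue
        words, chars, penalty = [], [], 0
        for node in elem.iter():
            if local_name(node.tag) != "charParams":
                continue
            char = node.text or " "
            if char.isspace():
                # A space closes the word collected so far.
                if chars:
                    words.append(("".join(chars), penalty))
                chars, penalty = [], 0
            else:
                chars.append(char)
                # A missing wordPenalty attribute is treated as 0.
                penalty = max(penalty, int(node.get("wordPenalty", "0")))
        if chars:
            words.append(("".join(chars), penalty))
        yield words
        elem.clear()  # free the memory used by lines already processed


def parse_abbyy(data):
    """Parse ABBYY XML held in a bytes object."""
    return list(iter_lines(io.BytesIO(data)))


def load_abbyy_gz(path):
    """Parse a gzip-compressed ABBYY XML file."""
    with gzip.open(path, "rb") as handle:
        return list(iter_lines(handle))


def to_plain_text(lines):
    """One output line per OCR line, words separated by single spaces."""
    return "\n".join(" ".join(word for word, _ in line) for line in lines)


def to_html(lines):
    """Same text as HTML, with words whose wordPenalty is above 0 in red."""
    rendered = []
    for line in lines:
        rendered.append(" ".join(
            '<span style="color:red">%s</span>' % escape(word)
            if penalty > 0 else escape(word)
            for word, penalty in line))
    return "<br>\n".join(rendered)
```

With a downloaded file (the name book_abbyy.gz is only an example),
to_plain_text(load_abbyy_gz("book_abbyy.gz")) would give the plain text and
to_html(...) the version with doubtful words marked in red. This sketch puts
the whole book in one list; splitting by <page> would be the next step.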

Alex


2013/6/15 Alex Brollo <[email protected]>

> I got it. o_O
>
> No need for regex, lxml, pyquery or XSLT... the simplest Python parsing
> routines can read ABBYY XML and extract both the text and information
> about the text.
>
> The goal was to get, with Python, both plain text (the same text the
> Wikisource server produces when creating a new page from a DjVu text layer)
> and some HTML formatting, in a format usable by VisualEditor; and if you
> take a look at http://it.wikipedia.org/wiki/Utente:Alex_brollo/Sandbox,
> you'll see in red only the words whose wordPenalty parameter is greater
> than 0 in the source ABBYY XML file.
>
> Alex brollo (from it.wikisource)
>
>
> 2013/6/14 Alex Brollo <[email protected]>
>
>> IA provides ABBYY XML files too (as .gz files); I opened one of them after a
>> suggestion from Phe, and I'm dreaming about extracting anything useful to
>> help proofreading. The only "small" problem is that I barely know what
>> XML is, beyond that it is similar to HTML in its (well-formed) structure,
>> and that something called XSLT exists. :-(
>>
>> Is any of you working on ABBYY XML files with a "little bit" more
>> skill?
>>
>> Alex brollo
>>
>>
>>
>>
>>
>
_______________________________________________
Wikisource-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikisource-l
