Just to pin down our present thoughts/"discoveries".

1. the ABBYY OCR procedure outputs an _abbyy.xml file, which contains every
detail of the multi-level text structure, plus character-by-character
information about formatting and recognition quality; IA publishes the
_abbyy.xml file as an _abbyy.gz file;
2. some of the _abbyy.xml data are wrapped into the IA djvu text layer; the
multi-level structure is kept, but the details about characters are discarded;
3. MediaWiki takes the "pure text" from the djvu text layer, discards all
the other multi-level data of that layer, and loads the text into new
nsPage pages;
4. finally & painfully, wikisource users add formatting back into the raw
text; to a large extent, they rebuild from scratch some of the data that was
present in the original source abbyy.xml file and - in part - in the djvu text
layer. :-(

This seems deeply unsound IMHO; doesn't it?
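
To make the point concrete, here is a minimal sketch of how the formatting and
recognition-quality data from step 1 could be read back instead of being
re-typed by hand. It is only an illustration under assumptions: the SAMPLE
fragment is hand-written to imitate the ABBYY layout (line > formatting >
charParams, with an italic attribute and wordPenalty), the function names are
hypothetical, and the standard-library ElementTree parser is just one possible
choice, not the routine used for the test pages.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment imitating the ABBYY layout; a real _abbyy.xml file
# carries an XML namespace and many more attributes per character.
SAMPLE = """<document><page><block><text><par>
<line>
<formatting lang="Italian"><charParams wordPenalty="0">L</charParams><charParams wordPenalty="0">a</charParams><charParams> </charParams></formatting>
<formatting lang="Italian" italic="true"><charParams wordPenalty="36">v</charParams><charParams wordPenalty="36">i</charParams><charParams wordPenalty="36">t</charParams><charParams wordPenalty="36">a</charParams></formatting>
</line>
</par></text></block></page></document>"""


def local(tag):
    """Drop a '{namespace}' prefix, so tags match with or without one."""
    return tag.rsplit("}", 1)[-1]


def lines_with_formatting(xml_text):
    """Yield, for each OCR line, a list of (text, italic, doubtful) runs."""
    root = ET.fromstring(xml_text)
    for line in root.iter():
        if local(line.tag) != "line":
            continue
        runs = []
        for fmt in line:
            if local(fmt.tag) != "formatting":
                continue
            chars = [c for c in fmt if local(c.tag) == "charParams"]
            text = "".join(c.text or "" for c in chars)
            # A run is "doubtful" if any character has wordPenalty > 0.
            doubtful = any(int(c.get("wordPenalty", "0")) > 0 for c in chars)
            runs.append((text, fmt.get("italic") == "true", doubtful))
        yield runs


def to_wikitext(runs):
    """Join the runs of one line, wrapping italic runs in wiki markup."""
    return "".join("''%s''" % text if italic else text
                   for text, italic, _ in runs)


for runs in lines_with_formatting(SAMPLE):
    print(to_wikitext(runs))
```

For a real book the XML text would first have to be read from the _abbyy.gz
file (for instance with Python's gzip module), and a whole book is probably
too large to parse in one go, so a streaming parser may be a better fit.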

Alex





2013/6/17 Alex Brollo <[email protected]>

> This is a link for digging into abbyy xml:
> http://www.abbyy-developers.com/en:tech:features:xml
>
> It's very exciting, and far less esoteric than it seems at first look.
> Perhaps abbyy xml could be used as the main source of usable OCR data in the
> proofread procedure (the abbyy.gz file is listed for every OCR-ed Internet
> Archive book, and it is possible to get the OCR with python routines: take a
> look at
> http://it.wikisource.org/wiki/Indice:Fisiologia_del_matrimonio.djvu, a
> test book where pages 17-30 come just from the abbyy.xml file).
>
> Alex
>
>
> 2013/6/15 Alex Brollo <[email protected]>
>
>> I got it. o_O
>>
>> No need for regex, lxml, pyquery or XSLT.... the simplest python parsing
>> routines can understand abbyy xml and extract both the text and information
>> about the text.
>>
>> The goal was to get, with python, both the plain text (the same produced by
>> the wikisource server when creating a new page from a djvu text layer) and
>> some html formatting, in a format usable by VisualEditor; and if you take a
>> look at http://it.wikipedia.org/wiki/Utente:Alex_brollo/Sandbox, you'll
>> see in red only the words whose wordPenalty parameter is more than 0 in the
>> source abbyy xml file.
>>
>> Alex brollo (from it.wikisource)
>>
>>
>> 2013/6/14 Alex Brollo <[email protected]>
>>
>>> IA gives abbyy xml files too (as .gz files); I opened one of them after
>>> a suggestion from Phe, and I'm dreaming about extracting anything useful to
>>> help proofreading. The only "small" problem is that I barely know what
>>> xml is, that it is similar to html in its (well-formed) structure, and that
>>> something called XSLT exists. :-(
>>>
>>> Is any of you working on abbyy xml files with a "little bit" more
>>> skill?
>>>
>>> Alex brollo
>>>
>>>
>>>
>>>
>>>
>>
>
