Are you following this thread?
Is it something we can share with one of the GSoCers?

Aubrey


On Mon, Jun 17, 2013 at 8:32 AM, Alex Brollo <[email protected]> wrote:

> Just to fix our present thoughts/"discoveries".
>
> 1. The ABBYY OCR procedure outputs an _abbyy.xml file, which contains every
> detail of the multi-level text structure plus character-by-character
> information about formatting and recognition quality; IA publishes the
> _abbyy.xml file as an _abbyy.gz file;
> 2. some of the _abbyy.xml data is wrapped into the IA djvu text layer; the
> multi-level structure is kept, but the details about characters are discarded;
> 3. MediaWiki gets the "pure text" from the djvu text layer, discards all
> the other multi-level data of that layer, and loads the text into new
> nsPage pages;
> 4. finally and painfully, wikisource users then add formatting to the raw
> text again; to a large extent, they rebuild from scratch some of the data
> that was present in the original source abbyy.xml file and, in part, in the
> djvu text layer. :-(
>
> This seems deeply unsound IMHO, doesn't it?
>
> Alex
>
>
>
>
>
> 2013/6/17 Alex Brollo <[email protected]>
>
>> This is a link for digging into abbyy xml:
>> http://www.abbyy-developers.com/en:tech:features:xml
>>
>> It's very exciting, and far less esoteric than it seems at first look.
>> Perhaps abbyy xml could be used as the main source of usable OCR data in
>> the proofread procedure (an abbyy.gz file is listed for every OCR-ed
>> Internet Archive book, and it is possible to get the OCR with python
>> routines: take a look at
>> http://it.wikisource.org/wiki/Indice:Fisiologia_del_matrimonio.djvu, a
>> test book where pages 17-30 come straight from the abbyy.xml file).
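
A minimal sketch of what "getting the OCR with python routines" could look like, offered as an illustration and not as the routine actually used for the test book. It assumes the usual ABBYY layout of page > ... > line > charParams elements, and it uses the standard-library gzip and xml.etree.ElementTree modules; the helper names (local_name, iter_page_text) and the file name in the usage note are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_page_text(path):
    """Yield the plain text of each <page> in a gzipped ABBYY XML file.

    Assumption: every character sits in its own <charParams> element
    inside a <line> element, as in the ABBYY files published by IA.
    """
    with gzip.open(path, "rb") as fh:
        # iterparse streams the file, so a whole book is never held in memory.
        for _event, el in ET.iterparse(fh, events=("end",)):
            if local_name(el.tag) != "page":
                continue
            lines = []
            for line in el.iter():
                if local_name(line.tag) != "line":
                    continue
                chars = [
                    cp.text or ""
                    for cp in line.iter()
                    if local_name(cp.tag) == "charParams"
                ]
                lines.append("".join(chars))
            yield "\n".join(lines)
            el.clear()  # free the page that has just been processed
```

Usage would be along the lines of `for number, text in enumerate(iter_page_text("book_abbyy.gz"), 1): ...`, where book_abbyy.gz stands for whatever file was downloaded from the Internet Archive.
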
>>
>> Alex
>>
>>
>> 2013/6/15 Alex Brollo <[email protected]>
>>
>>> I got it. o_O
>>>
>>> No need for regex, lxml, pyquery or XSLT.... the simplest python parsing
>>> routines can understand abbyy xml and extract both the text and
>>> information about the text.
>>>
>>> The goal was to get, with python, both the plain text (the same produced
>>> by the wikisource server when creating a new page from a djvu text layer)
>>> and some html formatting, in a format usable by VisualEditor; and if you
>>> take a look at http://it.wikipedia.org/wiki/Utente:Alex_brollo/Sandbox,
>>> you'll see in red only the words whose wordPenalty parameter is greater
>>> than 0 in the source abbyy xml file.
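
The wordPenalty idea above can be sketched roughly as follows. This is not the script behind the sandbox page: it swaps in the standard-library xml.etree.ElementTree parser instead of plain string routines, the embedded sample is hand-written and heavily simplified (real files carry many more attributes), and the names SAMPLE, parse_lines and to_html are hypothetical. It assumes each character is a charParams element carrying a wordPenalty attribute, and that whitespace characters separate words.

```python
import html
import xml.etree.ElementTree as ET

# Hand-written, simplified sample; not taken from a real book.
SAMPLE = """<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml">
<page><block blockType="Text"><text><par><line><formatting>
<charParams wordPenalty="0">L</charParams>
<charParams wordPenalty="0">a</charParams>
<charParams wordPenalty="0"> </charParams>
<charParams wordPenalty="36">c</charParams>
<charParams wordPenalty="36">n</charParams>
<charParams wordPenalty="36">s</charParams>
<charParams wordPenalty="36">a</charParams>
</formatting></line></par></text></block></page></document>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_lines(xml_text):
    """Return one list per <line>, holding (word, penalty) tuples."""
    result = []
    for line in ET.fromstring(xml_text).iter():
        if local_name(line.tag) != "line":
            continue
        words, chars, penalty = [], [], 0
        for cp in line.iter():
            if local_name(cp.tag) != "charParams":
                continue
            ch = cp.text or ""
            if ch.strip() == "":
                # A whitespace character closes the current word.
                if chars:
                    words.append(("".join(chars), penalty))
                chars, penalty = [], 0
            else:
                chars.append(ch)
                penalty = max(penalty, int(cp.get("wordPenalty", "0")))
        if chars:
            words.append(("".join(chars), penalty))
        result.append(words)
    return result


def to_html(lines):
    """Join words with spaces and wrap those with penalty > 0 in a red span."""
    rendered = []
    for words in lines:
        parts = []
        for word, penalty in words:
            safe = html.escape(word)
            if penalty > 0:
                safe = '<span style="color:red">%s</span>' % safe
            parts.append(safe)
        rendered.append(" ".join(parts))
    return "<br>\n".join(rendered)
```

Calling to_html(parse_lines(SAMPLE)) leaves "La" plain and wraps the doubtful word in a red span, which is the kind of hint a proofreader could use.
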
>>>
>>> Alex brollo (from it.wikisource)
>>>
>>>
>>> 2013/6/14 Alex Brollo <[email protected]>
>>>
>>>> IA gives abbyy xml files too (as .gz files); I opened one of them after
>>>> a suggestion from Phe, and I'm dreaming about extracting anything useful
>>>> to help proofreading. The only "small" problem is that I barely know what
>>>> xml is, that it is similar to html in its (well-formed) structure, and
>>>> that something called XSLT exists. :-(
>>>>
>>>> Is any of you working on abbyy xml files with a "little bit" more
>>>> skill?
>>>>
>>>> Alex brollo
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
> _______________________________________________
> Wikisource-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
>
>