Are you following this thread? Is it something we can share with one of the GSoCers?
Aubrey

On Mon, Jun 17, 2013 at 8:32 AM, Alex Brollo <[email protected]> wrote:

> Just to record our present thoughts/"discoveries":
>
> 1. The ABBYY OCR procedure outputs an _abbyy.xml file, which contains every
>    detail of the multi-level text structure plus character-by-character
>    information about formatting and recognition quality; IA publishes the
>    _abbyy.xml file as an _abbyy.gz file;
> 2. some of the _abbyy.xml data is wrapped into the IA djvu text layer; the
>    multi-layer structure is saved, but the details about characters are
>    discarded;
> 3. MediaWiki takes the "pure text" from the djvu text layer, discards all
>    the other multi-layer data of the djvu layer, and loads the text into
>    new nsPage pages;
> 4. finally & painfully, wikisource users add formatting back into the raw
>    text; to a large extent, they rebuild from scratch some of the data
>    that was present in the original source abbyy.xml file and, in part, in
>    the djvu text layer. :-(
>
> This seems deeply unsound IMHO; doesn't it?
>
> Alex
>
> 2013/6/17 Alex Brollo <[email protected]>
>
>> Here is a link for digging into abbyy xml:
>> http://www.abbyy-developers.com/en:tech:features:xml
>>
>> It's very exciting, and far less esoteric than it seems at first look.
>> Perhaps abbyy xml could be used as the main source of usable OCR data in
>> the proofread procedure (the abbyy.gz file is listed for every OCR-ed
>> Internet Archive book, and the OCR can be extracted with python routines:
>> take a look at
>> http://it.wikisource.org/wiki/Indice:Fisiologia_del_matrimonio.djvu, a
>> test book where pages 17-30 come straight from the abbyy.xml file).
>>
>> Alex
>>
>> 2013/6/15 Alex Brollo <[email protected]>
>>
>>> I got it. o_O
>>>
>>> No need for regex, lxml, pyquery or XSLT... the simplest python parsing
>>> routines can understand abbyy xml and extract both the text and
>>> information about the text.
>>>
>>> The goal was to get, with python, both the plain text (the same text
>>> the wikisource server produces when creating a new page from a djvu
>>> text layer) and some html formatting, in a format usable by
>>> VisualEditor; and if you take a look at
>>> http://it.wikipedia.org/wiki/Utente:Alex_brollo/Sandbox, you'll see in
>>> red only the words whose wordPenalty parameter is greater than 0 in the
>>> source abbyy xml file.
>>>
>>> Alex brollo (from it.wikisource)
>>>
>>> 2013/6/14 Alex Brollo <[email protected]>
>>>
>>>> IA gives abbyy xml files too (as .gz files); I opened one of them
>>>> after a suggestion from Phe, and I'm dreaming about extracting
>>>> anything useful to help proofreading. The only "small" problem is that
>>>> I barely know what xml is: only that it is similar to html in its
>>>> (well-formed) structure, and that something called XSLT exists. :-(
>>>>
>>>> Is any of you working on abbyy xml files with a "little bit" more
>>>> skill?
>>>>
>>>> Alex brollo
>
> _______________________________________________
> Wikisource-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
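For anyone picking this up, here is a minimal sketch of the kind of extraction described in the quoted thread: plain text plus HTML in which only the words with wordPenalty greater than 0 are shown in red. It is not Alex's routine, which is not shown in the thread; it uses the standard library's xml.etree.ElementTree instead of hand-written string parsing. The element and attribute names other than wordPenalty (line, charParams, wordStart), the namespace URI, and the SAMPLE string are assumptions written by hand for illustration, not real Internet Archive output, so check them against an actual _abbyy.gz file.

```python
import gzip
import xml.etree.ElementTree as ET
from html import escape

# Hand-written, heavily simplified sample (an assumption, not real IA data).
SAMPLE = """<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml">
<page><block blockType="Text"><text><par><line><formatting>
<charParams wordStart="true" wordPenalty="0">T</charParams>
<charParams wordStart="false" wordPenalty="0">o</charParams>
<charParams wordStart="false"> </charParams>
<charParams wordStart="true" wordPenalty="36">b</charParams>
<charParams wordStart="false" wordPenalty="36">c</charParams>
</formatting></line></par></text></block></page></document>"""


def read_abbyy_gz(path):
    """Return the XML bytes stored in a gzip-compressed abbyy file."""
    with gzip.open(path, "rb") as handle:
        return handle.read()


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def lines_of_words(xml_text):
    """Return one list per OCR line, each holding (word, penalty) tuples."""
    root = ET.fromstring(xml_text)
    result = []
    for line in root.iter():
        if local_name(line.tag) != "line":
            continue
        words, chars, penalty = [], [], 0
        for cp in line.iter():
            if local_name(cp.tag) != "charParams":
                continue
            ch = cp.text or ""
            if not ch.strip():
                # Whitespace (or an empty element) closes the current word.
                if chars:
                    words.append(("".join(chars), penalty))
                chars, penalty = [], 0
                continue
            if cp.get("wordStart") == "true" and chars:
                # A new word begins without a separating space.
                words.append(("".join(chars), penalty))
                chars, penalty = [], 0
            chars.append(ch)
            penalty = max(penalty, int(cp.get("wordPenalty", "0")))
        if chars:
            words.append(("".join(chars), penalty))
        result.append(words)
    return result


def to_plain(lines):
    """Join the words back into plain text, one OCR line per text line."""
    return "\n".join(" ".join(word for word, _ in words) for words in lines)


def to_html(lines):
    """Same text, with every word whose penalty is above 0 wrapped in red."""
    out = []
    for words in lines:
        parts = []
        for word, penalty in words:
            text = escape(word)
            if penalty > 0:
                text = '<span style="color:red">' + text + "</span>"
            parts.append(text)
        out.append(" ".join(parts))
    return "\n".join(out)


if __name__ == "__main__":
    parsed = lines_of_words(SAMPLE)
    print(to_plain(parsed))
    print(to_html(parsed))
```

To try it on a real book, pass the result of read_abbyy_gz("some_book_abbyy.gz") to lines_of_words; the file name here is hypothetical. Real files are large, so a streaming parser such as ET.iterparse may be a better fit than loading everything at once.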
