Just to pin down our present thoughts/"discoveries":

1. The ABBYY OCR procedure outputs an _abbyy.xml file, which contains every detail of the multi-level text structure plus character-by-character information about formatting and recognition quality; IA publishes this _abbyy.xml file as an _abbyy.gz file.
2. Some of the _abbyy.xml data is wrapped into the IA djvu text layer; the multi-level structure is kept, but the per-character details are discarded.
3. MediaWiki takes the "pure text" from the djvu text layer, discards all the other multi-level data of that layer, and loads the text into new nsPage pages.
4. Finally, and painfully, wikisource users add formatting back into the raw text; to a large extent, they rebuild from scratch some of the data that was present in the original source abbyy.xml file and, in part, in the djvu text layer. :-(
This seems deeply unsound IMHO; doesn't it?

Alex

2013/6/17 Alex Brollo <[email protected]>

> This is a link for digging into abbyy xml:
> http://www.abbyy-developers.com/en:tech:features:xml
>
> It's very exciting, and far less esoteric than it seems at first look.
> Perhaps abbyy xml could be used as the main source of usable OCR data in
> the proofread procedure (the abbyy.gz file is listed for any OCR-ed Internet
> Archive book, and it is possible to get the OCR with python routines: take a
> look at
> http://it.wikisource.org/wiki/Indice:Fisiologia_del_matrimonio.djvu, a
> test book where pages 17-30 come straight from the abbyy.xml file).
>
> Alex
>
>
> 2013/6/15 Alex Brollo <[email protected]>
>
>> I got it. o_O
>>
>> No need for regex, lxml, pyquery or XSLT... the simplest python parsing
>> routines can understand abbyy xml and extract both the text and
>> information about the text.
>>
>> The goal was to get by python both the plain text (the same produced by
>> the wikisource server when creating a new page from a djvu text layer)
>> and some html formatting, in a format usable by VisualEditor; and if you
>> take a look at http://it.wikipedia.org/wiki/Utente:Alex_brollo/Sandbox,
>> you'll see in red only the words where the parameter wordPenalty is more
>> than 0 in the source abbyy xml file.
>>
>> Alex brollo (from it.wikisource)
>>
>>
>> 2013/6/14 Alex Brollo <[email protected]>
>>
>>> IA gives abbyy xml files too (as .gz files); I opened one of them after
>>> a suggestion from Phe, and I'm dreaming about extracting anything useful
>>> to help proofreading. The only "small" problem is that I barely know
>>> what xml is, that it is similar to html in its (well-formed) structure,
>>> and that something called XSLT exists. :-(
>>>
>>> Is any of you working on abbyy xml files with a "little bit" more
>>> skill?
>>>
>>> Alex brollo
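
To make the wordPenalty idea from the 2013/6/15 message concrete, here is a minimal Python sketch. It is not the routine actually used for the Sandbox page, only an illustration under stated assumptions: that each character sits in a charParams element inside a line element, and that wordPenalty is an integer attribute on charParams. The function names (local, words_with_penalty, to_html, load_abbyy) are made up for this example.

```python
import gzip
import html
import xml.etree.ElementTree as ET


def local(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def words_with_penalty(root):
    """Yield (word, penalty) pairs, line by line.

    Assumption: <line> elements contain <charParams> elements, one per
    character, and wordPenalty (when present) is an integer attribute.
    """
    for line in root.iter():
        if local(line.tag) != "line":
            continue
        word, penalty = [], 0
        for ch in line.iter():
            if local(ch.tag) != "charParams":
                continue
            text = ch.text or ""
            if text.strip() == "":
                # A whitespace character closes the current word.
                if word:
                    yield "".join(word), penalty
                word, penalty = [], 0
                continue
            word.append(text)
            penalty = max(penalty, int(ch.get("wordPenalty", "0")))
        if word:
            # The end of a line also closes the current word.
            yield "".join(word), penalty


def to_html(pairs):
    """Join words with spaces, wrapping those with penalty > 0 in red."""
    out = []
    for word, penalty in pairs:
        safe = html.escape(word)
        if penalty > 0:
            safe = '<span style="color:red">%s</span>' % safe
        out.append(safe)
    return " ".join(out)


def load_abbyy(path):
    """Parse a gzip-compressed abbyy xml file and return its root element."""
    with gzip.open(path, "rb") as fh:
        return ET.parse(fh).getroot()
```

With a downloaded file, to_html(words_with_penalty(load_abbyy(path))) would give one long html string. Line and paragraph breaks are deliberately dropped here to keep the sketch short; a real routine would want to keep them.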
_______________________________________________
Wikisource-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikisource-l
