[Wikisource-l] Exploring abbyy.xml: a layman trip

Alex Brollo Sun, 30 Jun 2013 23:47:29 -0700

Just to let you know what I'm doing: I'm exploring abbyy.xml (the _abbyy.gz
file in the Internet Archive file list).


The abbyy.xml file contains a lot of data that could take us much further
toward "self-formatting" of text, with details that can't be found in the
text layer of djvu files. It contains the XCA_Extended version of the XML
output of the OCR (http://www.abbyy-developers.com/en:tech:features:xml),
and this is a brief list of its useful features:

1. coordinates l,t,r,b of every element (from page down to character);
2. three main "blockType" values: text, table, picture;
3. four levels of detail for text areas: region, paragraph, line, character
(and a fifth one, word, can be calculated);
4. data about indentation, font size, and word and character recognition
certainty.
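
To make the list above concrete, here is a minimal Python sketch of how
blocks, boxes and calculated words could be read. It runs on a tiny
hand-written sample, not on real OCR output. The l,t,r,b and blockType
attributes are the ones listed above; the element names (block, line,
charParams) and the wordStart and charConfidence attributes are my
assumptions about the schema and should be checked against a real
abbyy.xml. I also assume that a higher charConfidence means a more certain
character. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-made sample for illustration only; element and attribute names other
# than l/t/r/b and blockType are assumptions about the ABBYY schema.
SAMPLE = """<document>
  <page width="1000" height="1500">
    <block blockType="Text" l="100" t="200" r="900" b="260">
      <text><par><line l="100" t="200" r="900" b="260"><formatting>
        <charParams l="100" t="200" r="120" b="260" wordStart="true" charConfidence="90">H</charParams>
        <charParams l="122" t="200" r="140" b="260" wordStart="false" charConfidence="40">i</charParams>
        <charParams l="141" t="200" r="150" b="260" wordStart="false" charConfidence="100"> </charParams>
        <charParams l="155" t="200" r="175" b="260" wordStart="true" charConfidence="95">y</charParams>
        <charParams l="177" t="200" r="200" b="260" wordStart="false" charConfidence="98">o</charParams>
      </formatting></line></par></text>
    </block>
    <block blockType="Picture" l="100" t="300" r="500" b="700"/>
  </page>
</document>"""


def local_name(tag):
    """Drop any '{namespace}' prefix so the code works with or without one."""
    return tag.rsplit("}", 1)[-1]


def box(el):
    """Return the (l, t, r, b) coordinates of an element as integers."""
    return tuple(int(el.get(k)) for k in ("l", "t", "r", "b"))


def blocks(root):
    """List (blockType, box) for every block element in the document."""
    return [(el.get("blockType"), box(el))
            for el in root.iter() if local_name(el.tag) == "block"]


def words(line):
    """Group the characters of one line into (text, box, lowest confidence)."""
    out, cur = [], []

    def flush():
        if not cur:
            return
        boxes = [box(c) for c in cur]
        confs = [int(c.get("charConfidence")) for c in cur
                 if c.get("charConfidence") is not None]
        merged = (min(b[0] for b in boxes), min(b[1] for b in boxes),
                  max(b[2] for b in boxes), max(b[3] for b in boxes))
        out.append(("".join(c.text for c in cur), merged,
                    min(confs) if confs else None))
        cur.clear()

    for c in line.iter():
        if local_name(c.tag) != "charParams":
            continue
        if not (c.text or "").strip():      # whitespace closes the word
            flush()
            continue
        if c.get("wordStart") == "true":    # assumed word-start marker
            flush()
        cur.append(c)
    flush()
    return out


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(blocks(root))
    for line in (el for el in root.iter() if local_name(el.tag) == "line"):
        print(words(line))
```

Keeping the lowest character confidence per word is only one possible way to
flag a doubtful word; an average or a threshold count would work as well.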

Using the coordinates and the original page images, it's possible to crop
regions out of a page; this could be useful both for a "wikiReCaptcha"
engine (extracting doubtful words together with their images) and for
extracting pictures, or showing them without extracting. The latter can be
done by using a clone of the existing page thumbnail as the background of a
div and setting the div size and overflow coordinates appropriately, with a
very low server load.
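
The div trick could be sketched like this in Python: a small function that
turns one l,t,r,b box into an inline CSS style for a div. This is only a
rough sketch under assumptions: that the coordinates are pixels of the
original scan, that the thumbnail keeps the same aspect ratio as the page,
and that the page width is known. The function name, its parameters and the
example URL are hypothetical.

```python
def clip_style(box, page_width, thumb_width, thumb_url):
    """Build an inline CSS style that shows only `box` of a page thumbnail.

    box         -- (l, t, r, b) in pixels of the original page image
    page_width  -- width of the original page image, in pixels
    thumb_width -- width at which the thumbnail is displayed, in pixels
    thumb_url   -- URL of the page thumbnail (hypothetical here)
    """
    l, t, r, b = box
    scale = thumb_width / page_width       # original pixels -> thumbnail pixels
    width = round((r - l) * scale)
    height = round((b - t) * scale)
    x = round(l * scale)
    y = round(t * scale)
    # The background is shifted up and left so that the box lands at the
    # div's top-left corner; the div size hides everything outside the box.
    return (
        f"width:{width}px;height:{height}px;overflow:hidden;"
        f"background-image:url('{thumb_url}');"
        f"background-size:{thumb_width}px auto;"
        f"background-position:-{x}px -{y}px;"
        "background-repeat:no-repeat"
    )


if __name__ == "__main__":
    style = clip_style((200, 400, 600, 1000), page_width=2000,
                       thumb_width=500, thumb_url="page.jpg")
    print(f'<div style="{style}"></div>')
```

Since only a style string is computed, no new image is created on the
server; the browser reuses the thumbnail it already has.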

In brief: all this stuff is extremely exciting, and I'm going ahead with my
bold attempts, but IMHO the matter deserves the interest of the best source
geeks: I'm only playing, with very limited skill and a rough layman
programming style.

Alex Brollo (from it.wikisource)

