https://bugzilla.wikimedia.org/show_bug.cgi?id=32695

--- Comment #9 from Alessandro Brollo <[email protected]> ---
I'm exploring a new and, IMHO, interesting path: ignoring the djvu text layer
and parsing the abbyy.xml file instead, both to extract the bare text layer and
to obtain some interesting parameters. This file (really heavy and discouraging
at first glance) is published by Internet Archive in its file download area.

The interesting thing is that this heavy file contains both the coordinates of
words and a 'wordPenalty' parameter, something like an "uncertainty score" for
the whole word; there is also a character-by-character "certainty score".
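
A minimal Python sketch of this kind of extraction follows. It is not the
script mentioned in this comment. It assumes the layout commonly seen in ABBYY
FineReader XML, where every character is a charParams element carrying l/t/r/b
coordinates plus wordStart, wordPenalty and charConfidence attributes; the
sample data and the helper names (local_name, iter_words) are invented for
illustration and should be checked against a real abbyy.xml file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the attribute names are an assumption about the format.
SAMPLE = """<document><page><block><text><par><line>
<charParams l="10" t="5" r="18" b="20" wordStart="true" wordPenalty="0" charConfidence="95">H</charParams>
<charParams l="19" t="5" r="22" b="20" wordStart="false" wordPenalty="0" charConfidence="90">i</charParams>
<charParams l="23" t="5" r="29" b="20" wordStart="false" wordPenalty="0" charConfidence="100"> </charParams>
<charParams l="30" t="8" r="38" b="24" wordStart="true" wordPenalty="12" charConfidence="40">y</charParams>
<charParams l="39" t="8" r="46" b="20" wordStart="false" wordPenalty="12" charConfidence="88">o</charParams>
<charParams l="47" t="8" r="54" b="20" wordStart="false" wordPenalty="12" charConfidence="91">u</charParams>
</line></par></text></block></page></document>"""


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}line'."""
    return tag.rsplit("}", 1)[-1]


def iter_words(root):
    """Yield one dict per word: text, bounding box, penalty, confidences."""
    for line in root.iter():
        if local_name(line.tag) != "line":
            continue
        word = None
        for ch in line.iter():
            if local_name(ch.tag) != "charParams":
                continue
            text = ch.text or ""
            if not text.strip():
                # A whitespace character closes the word being built.
                if word is not None:
                    yield word
                    word = None
                continue
            if word is None or ch.get("wordStart") == "true":
                if word is not None:
                    yield word
                word = {
                    "text": "",
                    "box": None,
                    "penalty": int(ch.get("wordPenalty", "0")),
                    "confidences": [],
                }
            word["text"] += text
            box = tuple(int(ch.get(k, "0")) for k in ("l", "t", "r", "b"))
            if word["box"] is None:
                word["box"] = box
            else:
                # Grow the word box so that it also covers this character.
                l, t, r, b = word["box"]
                word["box"] = (min(l, box[0]), min(t, box[1]),
                               max(r, box[2]), max(b, box[3]))
            conf = ch.get("charConfidence")
            if conf is not None:
                word["confidences"].append(int(conf))
        if word is not None:
            yield word


words = list(iter_words(ET.fromstring(SAMPLE)))
```

For a full-size file, ET.iterparse would probably be a better fit than loading
the whole document at once, given how heavy these files are.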

I'm sharing the scripts with http://www.mediawiki.org/wiki/User:Rtdwivedi, who
is MUCH more skilled than me, since the idea is to upload the text layer from
the abbyy.xml file and to wrap uncertain words in a span tag, making them easy
to fix with VisualEditor. A test output of the extraction scripts can be seen on
any page of http://it.wikisource.org/wiki/Indice:Ricordi_di_Londra.djvu, where
words with a wordPenalty > 0 are red; unfortunately VisualEditor doesn't
currently run on wikisource, but you can test the resulting code with
VisualEditor in a wikipedia sandbox.
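
The span wrapping can be sketched roughly as below. This is only an
illustration, not the script actually used on it.wikisource: the function name
wrap_uncertain, the "ocr-uncertain" class and the inline red style are
assumptions, and the input is taken to be a list of (word, wordPenalty) pairs
already extracted from abbyy.xml.

```python
from html import escape


def wrap_uncertain(words, threshold=0, css_class="ocr-uncertain"):
    """Join words into one line, wrapping those whose penalty exceeds threshold."""
    parts = []
    for text, penalty in words:
        safe = escape(text)  # keep stray '<' or '&' from breaking the markup
        if penalty > threshold:
            safe = f'<span class="{css_class}" style="color:red">{safe}</span>'
        parts.append(safe)
    return " ".join(parts)


# Hypothetical words and penalties, for illustration only.
line = wrap_uncertain([("Ricordi", 0), ("di", 0), ("Lonclra", 7)])
print(line)
```

With threshold=0 this matches the rule described above, where every word with a
wordPenalty > 0 is marked.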

I presume that similar scripts, using abbyy.xml files, could extract lists of
uncertain words and their images from the abbyy.xml file and the related scans,
and feed them to a CAPTCHA engine.
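
As a rough sketch of that idea, the selection step could look like the code
below. Everything here is hypothetical: the word dicts (text, box, penalty) are
assumed to come from an earlier abbyy.xml parsing step, the function name
captcha_candidates and the padding value are invented, and the actual cropping
of the scan and the hand-off to a CAPTCHA engine are left out.

```python
def captcha_candidates(words, min_penalty=1, padding=4, page_size=None):
    """Pick uncertain words and compute a padded crop box for each one.

    words: iterable of dicts with 'text', 'box' (l, t, r, b) and 'penalty'.
    page_size: optional (width, height) used to clamp the crop box.
    """
    candidates = []
    for w in words:
        if w["penalty"] < min_penalty:
            continue
        l, t, r, b = w["box"]
        # Add a small margin so that the crop does not cut into the glyphs.
        l, t = max(0, l - padding), max(0, t - padding)
        r, b = r + padding, b + padding
        if page_size is not None:
            r, b = min(r, page_size[0]), min(b, page_size[1])
        candidates.append({"guess": w["text"],
                           "crop": (l, t, r, b),
                           "penalty": w["penalty"]})
    return candidates


# Hypothetical input, for illustration only.
sample_words = [
    {"text": "Londra", "box": (10, 5, 60, 20), "penalty": 0},
    {"text": "rnare", "box": (2, 30, 50, 44), "penalty": 9},
]
candidates = captcha_candidates(sample_words)
```

Each crop box could then be passed to an image library to cut the word out of
the page scan, with the OCR guess kept alongside for comparison with the human
answer.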

My suggestion is to ask Rtdwivedi for comments; personally I consider myself
curious, bold and sometimes lucky, but very far from a "programmer".

-- 
You are receiving this mail because:
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
