https://bugzilla.wikimedia.org/show_bug.cgi?id=32695
--- Comment #9 from Alessandro Brollo <[email protected]> ---

I'm exploring a new and, IMHO, interesting path: ignore the djvu text layer and instead parse the abbyy.xml file, both to extract the bare text layer and to pick up some interesting parameters. This file (really heavy and discouraging at first glance) is published by Internet Archive in its file download area. The interesting thing is that this heavy file contains both the coordinates of the words and a 'wordPenalty' parameter, something like an "uncertainty score" for the whole word; there is also a character-by-character "certainty score".

I'm sharing scripts with http://www.mediawiki.org/wiki/User:Rtdwivedi, who is much more skilled than I am, since the idea is to upload the text layer from the abbyy.xml file and to wrap uncertain words in a span tag, making them easy to fix with VisualEditor. A test output of the extraction scripts can be seen on any page of http://it.wikisource.org/wiki/Indice:Ricordi_di_Londra.djvu, where words with a wordPenalty > 0 are red; unfortunately VisualEditor does not currently run on wikisource, but you can test the resulting code with VisualEditor in a wikipedia sandbox.

I presume that similar scripts could extract lists of uncertain words and their images from the abbyy.xml file and the related scans, and feed them to a CAPTCHA engine.

My suggestion is to ask Rtdwivedi for comments; personally I consider myself curious, bold and sometimes lucky, but very far from a "programmer".
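To make the span-wrapping idea concrete, here is a minimal sketch in Python. It is not the script shared with Rtdwivedi, only an illustration under stated assumptions: that each character in abbyy.xml is a `charParams` element carrying `wordStart` and `wordPenalty` attributes, grouped under `line` elements. The `ocr-uncertain` class name, the function names and the tiny sample document (including its placeholder namespace) are all hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape


def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def words_from_line(line):
    """Yield (word, penalty) pairs for one <line> element.

    Assumption: a word begins at a charParams with wordStart="true"
    and ends at the next whitespace character or the next word start.
    """
    chars, penalty = [], 0
    for el in line.iter():
        if local(el.tag) != "charParams":
            continue
        ch = el.text or ""
        boundary = el.get("wordStart") == "true" or ch.isspace()
        if boundary and chars:
            yield "".join(chars), penalty
            chars, penalty = [], 0
        if ch.strip():
            chars.append(ch)
            # Keep the highest penalty seen on any character of the word.
            penalty = max(penalty, int(el.get("wordPenalty", "0")))
    if chars:
        yield "".join(chars), penalty


def mark_uncertain(xml_text, threshold=0):
    """Return one text line per <line>, wrapping words whose
    wordPenalty exceeds `threshold` in a span (hypothetical class name)."""
    root = ET.fromstring(xml_text)
    out = []
    for line in root.iter():
        if local(line.tag) != "line":
            continue
        parts = []
        for word, penalty in words_from_line(line):
            text = escape(word)
            if penalty > threshold:
                text = f'<span class="ocr-uncertain">{text}</span>'
            parts.append(text)
        out.append(" ".join(parts))
    return "\n".join(out)


# Hand-written sample, NOT taken from a real Internet Archive file;
# the namespace is a placeholder and is ignored by the parser above.
SAMPLE = (
    '<document xmlns="urn:example:abbyy"><page><block><text><par><line>'
    '<formatting>'
    '<charParams wordStart="true" wordPenalty="0">R</charParams>'
    '<charParams wordStart="false" wordPenalty="0">e</charParams>'
    '<charParams wordStart="false" wordPenalty="0"> </charParams>'
    '<charParams wordStart="true" wordPenalty="36">d</charParams>'
    '<charParams wordStart="false" wordPenalty="36">i</charParams>'
    '</formatting>'
    '</line></par></text></block></page></document>'
)

if __name__ == "__main__":
    print(mark_uncertain(SAMPLE))
```

Matching on the local tag name rather than a fixed namespace keeps the sketch independent of the schema version declared in a given file. The same `(word, penalty)` pairs could equally be collected into a list of uncertain words, which is the starting point for the CAPTCHA idea mentioned above.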
