A quick link I just received: http://www.digitalkoot.fi (also available in English)
It seems there are two Facebook games whose aim is precisely to correct OCR errors.

Sébastien

Sun, 20 Feb 2011 22:16:15 +0100, Seb35 <seb35wikipe...@gmail.com> wrote:
> Hi Andrea,
>
> I saw VIGNERON and Jean-Frédéric today and we spoke about that.
> Jean-Fred and I are a bit skeptical about the effective implementation
> of such a system; here are some questions that I (or we) were asking
> (the questions are listed in order of importance):
>
> - How many books have such coordinates? I know the BnF-partnership books
> have such coordinates because they were originally in the OCR files
> (1057 books), but on WS a lot of books have invalid coordinates
> (word 0 0 1 1 "") because Wikisourcians didn't know the meaning of these
> figures (the DjVu format is quite difficult to understand anyway). I
> don't know whether classical OCR software has a function to output the
> coordinates for books that will be OCRed in the future.
>
> - What is the confidence in the coordinates? If you serve half a word,
> it will be difficult to recognize the entire word.
>
> - I am wondering how you can validate the correctness of a given word
> for a given person: a person (e.g.) creates an account on WS, a Captcha
> is asked with a word; how do you know if his/her answer is correct? I
> agree this step disappears if you ask a pool of volunteers to answer
> different captcha words, but in this case it comes down to the classical
> check by Wikisourcians, in a specialized way to treat particular cases.
>
> - You give the example of a ^ in a word, but how do you select the
> OCR mistakes? This is not really an issue, though, since you can already
> make a list of common mistakes and it will be sufficient at first.
> I know French Wikisourcians (at least, probably others also)
> already make a list of frequent mistakes (II->Il, 1l->Il, c->e ...),
> sometimes for a given book (the Trévoux in French of 1771, it seems to me).
>
> I know Google had a similar system for their digitization, but I
> don't know the details exactly.
> For me there are a lot of details which
> make the overall idea difficult to carry out (although I would prefer to
> think the contrary), but perhaps you have some answers.
>
> Sébastien
>
> PS: I had another idea in a slightly different application field
> (roughly speaking, automated validation of texts) but close to this one;
> I will write an email next week about that (there are already some notes
> in <http://wikisource.org/wiki/User:Seb35/Reverse_OCR>).
>
> Sat, 05 Feb 2011 15:14:57 +0100, Andrea Zanni <zanni.andre...@gmail.com>
> wrote:
>> Dear wikisourcers,
>> while exploring the djvu text layer, the it.source community found
>> interesting features that are worth sharing (SPOILER ALERT: Wikisource
>> reCAPTCHA).
>> (I added the technicalities in the footnotes; please look at them if
>> you're interested.)
>>
>> We discovered that when the text layer is extracted with the DjVuLibre
>> djvused.exe tool [1], a text file is obtained, containing words and
>> their absolute coordinates in the image of the page.
>>
>> Here are some example rows of such a txt file from a running test:
>>
>> (line 402 2686 2424 2757
>> (word 402 2699 576 2756 "State.")
>> (word 679 2698 892 2757 "Effects")
>> (word 919 2698 991 2756 "of")
>> (word 1007 2697 1467 2755 "Domestication")
>> (word 1493 2698 1607 2755 "and")
>> (word 1637 2697 1910 2757 "Climate.")
>> (word 2000 2698 2132 2756 "The")
>> (word 2155 2686 2424 2754 "Persians^"))
>>
>> As you can see, the last word has a ^ character inside, which indicates
>> a doubtful character that the OCR software did not recognize.
>>
>> What's really interesting is that a python script can select these words
>> using the ^ character and automatically produce a file with the image of
>> the word, since all the parameters needed for a ddjvu.exe call can be
>> obtained (please consider that this code comes from a rough, but
>> *running*, test script [2]).
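To make the selection step above concrete, here is a minimal sketch in Python of how the doubtful words could be picked out of the rows shown in the sample. It is not the it.source script: the names doubtful_words and segment_geometry are hypothetical, the regular expression only assumes the "(word xmin ymin xmax ymax "text")" layout visible in the sample, and the WxH+X+Y string simply mirrors the format built in footnote [2].

```python
import re

# One "(word xmin ymin xmax ymax "text")" entry, as in the sample rows.
# Assumption: the quoted text may contain backslash-escaped characters.
WORD_RE = re.compile(
    r'\(word\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"((?:[^"\\]|\\.)*)"'
)


def doubtful_words(text_layer, marker="^"):
    """Yield (text, xmin, ymin, xmax, ymax) for words containing the marker."""
    for match in WORD_RE.finditer(text_layer):
        xmin, ymin, xmax, ymax = (int(g) for g in match.groups()[:4])
        text = match.group(5)
        if marker in text:
            yield text, xmin, ymin, xmax, ymax


def segment_geometry(xmin, ymin, xmax, ymax):
    """Format a word box as WxH+X+Y, the string built in footnote [2]."""
    return f"{xmax - xmin}x{ymax - ymin}+{xmin}+{ymin}"


sample = '''(line 402 2686 2424 2757
 (word 402 2699 576 2756 "State.")
 (word 2000 2698 2132 2756 "The")
 (word 2155 2686 2424 2754 "Persians^"))'''

for text, *box in doubtful_words(sample):
    print(text, segment_geometry(*box))  # Persians^ 269x68+2155+2686
```

Entries with an empty text such as (word 0 0 1 1 "") never contain the marker, so they are skipped by this sketch without any special handling.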
>>
>> So, in our it.source test script, a tiff image has been automatically
>> produced, containing exactly the image of the doubtful "Persians^" OCR
>> output.
>> Its name is built as name-of-djvu-file + page number + coordinates in
>> the page, which is all that is needed to link the image unambiguously to
>> the specific word on a specific page of a djvu file.
>>
>> The image has been uploaded to Commons as
>> http://commons.wikimedia.org/wiki/File:Word_image_from_wikicaptcha_project.tiff
>>
>> As you can easily imagine, this could be the core of a "wikicaptcha"
>> project (as John Vandenberg called it), enabling us to produce our own
>> Wikisource reCAPTCHA.
>>
>> A djvu file could be uploaded to a server (into an "incubator"); a
>> database of doubtful word images could be built; images could be
>> presented to wiki users (either as a voluntary task or as a formal
>> reCAPTCHA to confirm edits by logged-out contributors); the resulting
>> human interpretations could be validated somehow (i.e. by n matching,
>> independent interpretations) and then used to edit the text layer of the
>> djvu file. Finally, the edited djvu file could be uploaded to Commons
>> for formal source management.
>>
>> Please contact us if you would like a copy of the running test scripts.
>> There is also a shared Dropbox folder with the complete environment
>> where we are testing the scripts.
>>
>> Opinions, feedback or thoughts are more than welcome.
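The "validated somehow" step above could, for instance, be a simple agreement rule. The following is only a sketch under that assumption, not something the it.source scripts are known to do: the function name consensus and the threshold n=3 are hypothetical, and a real system would also have to decide how to treat case, punctuation and spacing.

```python
from collections import Counter


def consensus(answers, n=3):
    """Return the transcription typed at least n times, or None if none yet."""
    # Ignore empty answers and surrounding whitespace before counting.
    counts = Counter(a.strip() for a in answers if a.strip())
    if not counts:
        return None
    best, votes = counts.most_common(1)[0]
    return best if votes >= n else None


# Three of four users agree, so the word is accepted.
print(consensus(["Persians,", "Persians,", "Persians.", "Persians,"]))  # Persians,

# No answer has been repeated often enough yet.
print(consensus(["Persians,", "Persians."]))  # None
```

A word image would stay in the pool until some answer reaches the threshold; only then would the text layer be edited.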
>>
>>
>> Aubrey
>> Admin it.source
>> WMI Board
>>
>>
>> [1]
>> command='djvused name-of.file.djvu -e "select page-number; print-txt" >text-file.txt'
>> os.system(command)
>>
>> [2]
>> if "^" in word:
>>     # coord[1..4] are xmin, ymin, xmax, ymax of the word box
>>     coord=key.split()
>>     #print(coord)
>>     w=str(int(coord[3])-int(coord[1]))
>>     h=str(int(coord[4])-int(coord[2]))
>>     x=coord[1]
>>     y=coord[2]
>>     # output name: djvu name + page number + coordinates
>>     filetiff=fileDjvu.replace(".djvu","")+"-"+pag+"-"+"_".join(coord)+".tiff"
>>     segment="-segment WxH+X+Y".replace("W",w).replace("H",h).replace("X",x).replace("Y",y)
>>     command="ddjvu "+fileDjvu+" -page="+pag+" -format=tiff "+segment+" "+filetiff
>>     print(command)
>>     os.system(command)

_______________________________________________
Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l