Thanks a lot for this nice answer.

A technical answer to the question:

> Are there programmatic ways of getting at the data, for example downloading
> all page images and corresponding text that is marked as green, for a
> specific language / script?

Yes, you can get the list of Page: pages (the pages that contain the
wikitext for a given scan image) using this API request:

https://en.wikisource.org/w/api.php?action=query&generator=allpages&gapnamespace=104&prop=proofread&format=json

This is for en.wikisource, where the Page: namespace id is 104 (this id is
not the same in all Wikisources).
Doc: https://www.mediawiki.org/wiki/API:Allpages

Then you can retrieve the content of a "green" page (the ones with
"quality": 4) using this API request:

https://en.wikisource.org/w/api.php?action=query&prop=revisions&titles=Page:%22%27Keep%20%27em%20Flying%27%20is%20Our%20Battle%20Cry^%20First%20Class%20Fighting%20Men%20Needed.%22%20-%20NARA%20-%20513526.jpg&rvprop=content

Doc: https://www.mediawiki.org/wiki/API:Properties#revisions_/_rv

To get the image of a given Page: page, use this API request, which
retrieves the URL of a file from its title:

https://en.wikisource.org/w/api.php?action=query&titles=Image:Albert%20Einstein%20Head.jpg&prop=imageinfo&format=json&iiprop=url

The Page: pages have "Page:NAME_OF_THE_FILE" as their page title, sometimes
followed by "/PAGE_NUMBER_IN_A_MULTIPAGE_FILE", so NAME_OF_THE_FILE gives
you the name of the image to use.

Thanks again,

Thomas

> From: Nick White <[email protected]>
> Date: Tue, Aug 12, 2014 at 6:25 PM
> Subject: Re: [tesseract-ocr] Outreach from the Wikisource community
> To: [email protected]
> Cc: "discussion list for Wikisource, the free library"
> <[email protected]>, David Cuenca <[email protected]>
>
>
> Dear Wikisourcerers,
>
> It's good to hear from you. Wikisource is awesome, as far as I am
> concerned.
>
> > One of the most serious issues was raised by the Belarusian community,
> > which uses 2 different scripts with no commercial OCR support. This
> > means that the volunteers have to type each word manually. We wondered
> > if it would be possible to train Tesseract to recognize these old texts
> > using the text that has been already typed.
>
> Actually, Tesseract should already have support for Russian and
> Belarusian "out of the box"; see the 'rus' and 'bel' training data.
>
> > We would like to know if you would be interested in exploring
> > collaboration possibilities. I imagine that with your guidance we could
> > prepare training data
>
> The first thing to do would be to take a look at the results you get
> from Tesseract with the rus and bel training sets already available,
> and let us know if they aren't appropriate.
>
> > not only in different languages, but also from different time
> > periods, scripts, etc.
>
> As to training for specific scripts, time periods, etc., in theory
> that is super cool, in practice probably one training set should be
> able to cover more-or-less everything (except very different
> scripts, like fraktur). That has been my experience with training
> Ancient Greek (for which I have been interested in recognising
> printing from a variety of time periods).
>
> So give Tesseract a whirl, and if it isn't appropriate, or doesn't
> work for specific scripts, let us know and we can try to figure out
> a plan.
>
> > At the moment it is not very clear how to achieve this.
>
> My plan is to rewrite the training documentation very soon, so
> things should hopefully become clearer on that front.
>
> One thing that wikisource could potentially do for us would be
> provide loads of proofread, freely reusable "ground truth" data to
> test Tesseract with. Are there programmatic ways of getting at the
> data, for example downloading all page images and corresponding text
> that is marked as green, for a specific language / script?
>
> Thanks for getting in touch!
>
> Nick
>
>
>
> --
> Etiamsi omnes, ego non
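For anyone who wants to script the three API requests described in the reply, here is a minimal Python sketch. It only uses the endpoint and parameters quoted in the message. The helper names (`api_url`, `fetch`, `green_titles`, `split_page_title`), the User-Agent string, and the assumed layout of the JSON reply (`query` -> `pages` -> `proofread` -> `quality`) are illustrative assumptions, not something the message specifies. The sketch uses the `Image:` prefix exactly as in the example URL, and it treats a trailing `/digits` in a page title as the page number.

```python
import json
import re
import urllib.parse
import urllib.request

API = "https://en.wikisource.org/w/api.php"  # endpoint quoted in the message
PAGE_NS = 104  # Page: namespace id on en.wikisource; differs on other Wikisources
GREEN = 4      # "quality": 4 marks a green page


def api_url(**params):
    """Build a request URL, always asking for a JSON reply."""
    params["format"] = "json"
    return API + "?" + urllib.parse.urlencode(params)


def fetch(url):
    """Send the request and decode the JSON reply (needs network access)."""
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "ground-truth-sketch/0.1"},  # placeholder value
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


def list_pages_url():
    """Request 1: list Page: pages together with their proofread status."""
    return api_url(action="query", generator="allpages",
                   gapnamespace=PAGE_NS, prop="proofread")


def page_text_url(page_title):
    """Request 2: fetch the wikitext of one Page: page."""
    return api_url(action="query", prop="revisions",
                   titles=page_title, rvprop="content")


def image_info_url(file_name):
    """Request 3: look up the URL of the scan image from its file name."""
    return api_url(action="query", titles="Image:" + file_name,
                   prop="imageinfo", iiprop="url")


def green_titles(reply):
    """Keep the titles whose proofread quality is 4 (assumed reply layout)."""
    pages = reply.get("query", {}).get("pages", {})
    return [page["title"] for page in pages.values()
            if page.get("proofread", {}).get("quality") == GREEN]


def split_page_title(page_title):
    """Split 'Page:NAME/12' into ('NAME', 12); no page number gives None."""
    name = page_title[len("Page:"):] if page_title.startswith("Page:") else page_title
    match = re.fullmatch(r"(.+)/(\d+)", name)
    if match:
        return match.group(1), int(match.group(2))
    return name, None
```

A typical chain would be `green_titles(fetch(list_pages_url()))`, then `fetch(page_text_url(title))` for the text and `fetch(image_info_url(split_page_title(title)[0]))` for the image. Note that a single request returns only one batch of pages, so a real harvest would also need to follow the API's continuation mechanism, which is not shown here.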

