Fwd: [tesseract-ocr] Outreach from the Wikisource community

Thomas Tanon Wed, 13 Aug 2014 14:01:27 -0700

Thanks a lot for this nice answer,

A technical answer to the question:
> Are there programmatic ways of getting at the data, for example downloading 
> all page images and corresponding text that is marked as green, for a 
> specific language / script?


Yes. You can list the Page: pages (the pages that contain the wikitext for a 
given scan image), together with their proofreading status, using this API request: 
https://en.wikisource.org/w/api.php?action=query&generator=allpages&gapnamespace=104&prop=proofread&format=json
On en.wikisource the Page: namespace id is 104; this id is not the same in all 
Wikisources (doc: https://www.mediawiki.org/wiki/API:Allpages ).
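
As a rough illustration of the listing step, here is a minimal Python sketch. 
It rebuilds the request URL above and filters a decoded JSON reply for pages 
with "quality": 4. The function names, the User-Agent string and the exact 
reply layout (a "query" -> "pages" mapping whose entries carry a "proofread" 
object) are assumptions of this sketch, so check them against a real reply. 
Result continuation for long listings is not handled here.

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikisource.org/w/api.php"
PAGE_NAMESPACE = 104  # Page: namespace id on en.wikisource; differs on other Wikisources
VALIDATED = 4         # proofread quality level of "green" pages


def list_request_url(namespace_id=PAGE_NAMESPACE, api=API):
    """Build the allpages + proofread query URL given in the message."""
    params = {
        "action": "query",
        "generator": "allpages",
        "gapnamespace": namespace_id,
        "prop": "proofread",
        "format": "json",
    }
    return api + "?" + urllib.parse.urlencode(params)


def validated_titles(response):
    """Return the sorted titles whose proofread quality is 4.

    Assumes the reply looks like
    {"query": {"pages": {"<id>": {"title": ..., "proofread": {"quality": N}}}}}.
    """
    pages = response.get("query", {}).get("pages", {})
    return sorted(
        page["title"]
        for page in pages.values()
        if page.get("proofread", {}).get("quality") == VALIDATED
    )


def fetch_json(url):
    """Send the request and decode the JSON reply (hypothetical helper)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "ground-truth-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

Calling validated_titles(fetch_json(list_request_url())) would then give the 
green titles found in one batch of results.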

Then you can retrieve the content of a "green" page (one with 
"quality": 4) using this API request: 
https://en.wikisource.org/w/api.php?action=query&prop=revisions&titles=Page:%22%27Keep%20%27em%20Flying%27%20is%20Our%20Battle%20Cry^%20First%20Class%20Fighting%20Men%20Needed.%22%20-%20NARA%20-%20513526.jpg&rvprop=content
(doc: https://www.mediawiki.org/wiki/API:Properties#revisions_/_rv ).
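
A possible Python sketch of this retrieval step follows. It builds the 
revisions request for one Page: title and pulls the wikitext out of a decoded 
JSON reply. Two things here are assumptions rather than part of the request 
above: the added format=json parameter, and the reply layout in which the 
text sits under the "*" key of the first revision. The function names are 
made up for the example.

```python
import urllib.parse

API = "https://en.wikisource.org/w/api.php"


def content_request_url(page_title, api=API):
    """Build a prop=revisions request for the wikitext of one Page: page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": page_title,
        "rvprop": "content",
        "format": "json",  # assumption: added so the reply can be parsed as JSON
    }
    # urlencode percent-escapes quotes, spaces, etc. in the title
    return api + "?" + urllib.parse.urlencode(params)


def extract_wikitext(response):
    """Return the wikitext of the first page that has a revision, else None.

    Assumes the reply looks like
    {"query": {"pages": {"<id>": {"revisions": [{"*": "<wikitext>"}]}}}}.
    """
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        revisions = page.get("revisions")
        if revisions:
            return revisions[0].get("*")
    return None
```

Letting urlencode do the escaping avoids hand-writing sequences such as %22 
and %27 for titles that contain quotes.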

To get the image for a given Page: page, use this API request, which returns 
the URL of a file from its title: 
https://en.wikisource.org/w/api.php?action=query&titles=Image:Albert%20Einstein%20Head.jpg&prop=imageinfo&format=json&iiprop=url
The title of a Page: page is "Page:NAME_OF_THE_FILE", sometimes followed by 
"/PAGE_NUMBER_IN_A_MULTIPAGE_FILE", so NAME_OF_THE_FILE gives you the name of 
the image to use.
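
The title handling can be sketched in Python roughly as follows: split a 
Page: title into the file name and the optional page number, build the 
imageinfo request, and read the URL from a decoded JSON reply. The function 
names are hypothetical, the literal "Page:" prefix is assumed (it may be 
localized on other Wikisources), the sketch uses the "File:" prefix in place 
of "Image:", and the reply layout ("imageinfo" list with a "url" entry) 
should be checked against a real reply.

```python
import re
import urllib.parse

API = "https://en.wikisource.org/w/api.php"
PAGE_PREFIX = "Page:"  # assumption: English prefix; may differ per Wikisource


def split_page_title(page_title):
    """Split 'Page:NAME/12' into ('NAME', 12), or 'Page:NAME' into ('NAME', None)."""
    if not page_title.startswith(PAGE_PREFIX):
        raise ValueError("not a Page: title: %r" % page_title)
    name = page_title[len(PAGE_PREFIX):]
    # A trailing "/<digits>" is taken as the page number in a multipage file
    match = re.fullmatch(r"(.+)/([0-9]+)", name)
    if match:
        return match.group(1), int(match.group(2))
    return name, None


def image_request_url(file_name, api=API):
    """Build the imageinfo request that asks for the URL of a file."""
    params = {
        "action": "query",
        "titles": "File:" + file_name,
        "prop": "imageinfo",
        "format": "json",
        "iiprop": "url",
    }
    return api + "?" + urllib.parse.urlencode(params)


def extract_image_url(response):
    """Return the first file URL in the reply, or None if there is none.

    Assumes {"query": {"pages": {"<id>": {"imageinfo": [{"url": ...}]}}}}.
    """
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        info = page.get("imageinfo")
        if info:
            return info[0].get("url")
    return None
```

For example, split_page_title("Page:Some book.djvu/12") yields the file name 
"Some book.djvu" and the page number 12, and the file name can then be passed 
to image_request_url.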

Thanks again,

Thomas

> From: Nick White <[email protected]>
> Date: Tue, Aug 12, 2014 at 6:25 PM
> Subject: Re: [tesseract-ocr] Outreach from the Wikisource community
> To: [email protected]
> Cc: "discussion list for Wikisource, the free library" 
> <[email protected]>, David Cuenca <[email protected]>
> 
> 
> Dear Wikisourcerers,
> 
> It's good to hear from you. Wikisource is awesome, as far as I am
> concerned.
> 
> > One of the most serious issues was raised by the Belarusian community which
> > uses 2 different scripts with no commercial OCR support. This means that the
> > volunteers have to type each word manually. We wondered if it would be
> > possible to train Tesseract to recognize these old texts using the text that
> > has been already typed.
> 
> Actually, Tesseract should already have support for Russian and
> Belarusian "out of the box"; see the 'rus' and 'bel' training data.
> 
> > We would like to know if you would be interested in exploring collaboration
> > possibilities. I imagine that with your guidance we could prepare training
> > data
> 
> The first thing to do would be to take a look at the results you get
> from Tesseract with the rus and bel training sets already available,
> and let us know if they aren't appropriate.
> 
> > not only in different languages, but also from different time
> > periods, scripts, etc.
> 
> As to training for specific scripts, time periods, etc., in theory
> that is super cool, in practice probably one training set should be
> able to cover more-or-less everything (except very different
> scripts, like fraktur). That has been my experience with training
> Ancient Greek (for which I have been interested in recognising
> printing from a variety of time periods).
> 
> So give Tesseract a whirl, and if it isn't appropriate, or doesn't
> work for specific scripts, let us know and we can try to figure out
> a plan.
> 
> > At the moment it is not very clear how to achieve this.
> 
> My plan is to rewrite the training documentation very soon, so
> things should hopefully become clearer on that front.
> 
> One thing that wikisource could potentially do for us would be
> provide loads of proofread, freely reusable "ground truth" data to
> test Tesseract with. Are there programmatic ways of getting at the
> data, for example downloading all page images and corresponding text
> that is marked as green, for a specific language / script?
> 
> Thanks for getting in touch!
> 
> Nick
> 
> 
> 
> -- 
> Etiamsi omnes, ego non

