Re: [tesseract-ocr] Re: Tesseract for recognition the international phonetic transcription

Tom Morris Thu, 30 Apr 2015 19:13:33 -0700

Hi Nick,

Thanks for the heads up.  As a colonial, I'm afraid that I'm woefully
unaware of the dates of the editions of the OED and, consequently, whether
this edition is in the public domain.  I'm guessing it's not, given the
reference to CD-ROM, a rather new-ish technology.


Here's my fork of the OKFN project: https://github.com/tfmorris/oed

The overarching goal is to OCR, then crowdsource corrections to, a public domain
edition of the Oxford English Dictionary.  Of course, the devil is in the
details.  The OKFN's starting point is to use Internet Archive's ABBYY
FineReader transcription, with which I'm somewhat unimpressed.
Nonetheless, I've been attempting to push that front as far as it will go
as a starting point.  I have previous experience with reading/interpreting
FineReader XML, so it's a relatively easy path to explore to start with.

The ultimate pipeline will exploit lots of lexical, typographic, layout,
and semantic information to generate starting canonical word entries to be
corrected in some type of crowdsourcing environment.  I'm not an OCR
expert, but love the multi-layered and machine + man approaches that this
will require.

I've got hundreds of pages of vol. 1 HTML generated with color-coded low
OCR accuracy highlighting, cleaned up layout, and a few other things.  I've
committed to the OKFN that I'll get them up somewhere for folks to review
(probably github.io).  Of course, I've only just scratched the surface of
the analysis which will be required.  When I look at low accuracy segments,
I find blocks of Persian and other languages, in addition to all manner of
other variability.
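
To make the highlighting step above concrete, here is a minimal Python sketch of turning per-character OCR confidence into HTML with low-confidence characters wrapped in a span. It is not the code from the fork linked above. It assumes the FineReader XML exposes each character as a charParams element with a charConfidence attribute on a 0-100 scale, which should be checked against real Internet Archive files. The sample document, the threshold of 50, the "low" CSS class, and the function names are all made up for illustration.

```python
import html
import xml.etree.ElementTree as ET

# Made-up sample in the assumed shape of FineReader XML output.
# Real files are much larger and carry many more attributes.
SAMPLE = """<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml">
<page><block><text><par><line><formatting>
<charParams charConfidence="96">a</charParams>
<charParams charConfidence="30">b</charParams>
<charParams charConfidence="95">a</charParams>
<charParams charConfidence="20">c</charParams>
<charParams charConfidence="97">k</charParams>
</formatting></line></par></text></block></page>
</document>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def line_to_html(line_elem, threshold=50):
    """Render one line, wrapping characters below the threshold in a span."""
    parts = []
    for elem in line_elem.iter():
        if local_name(elem.tag) != "charParams":
            continue
        ch = html.escape(elem.text or " ")
        # Assumption: a missing attribute is treated as fully confident.
        conf = int(elem.get("charConfidence", "100"))
        if conf < threshold:
            parts.append('<span class="low">' + ch + "</span>")
        else:
            parts.append(ch)
    return "".join(parts)


def document_to_html(xml_text, threshold=50):
    """Render every line element in the document as one HTML paragraph."""
    root = ET.fromstring(xml_text)
    lines = [
        line_to_html(elem, threshold)
        for elem in root.iter()
        if local_name(elem.tag) == "line"
    ]
    return "\n".join("<p>" + line + "</p>" for line in lines)


if __name__ == "__main__":
    print(document_to_html(SAMPLE))
```

A stylesheet rule for the hypothetical "low" class (a background color, say) would then make the doubtful characters stand out for reviewers. Grouping adjacent low-confidence characters into a single span would produce smaller HTML, at the cost of slightly more bookkeeping.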

I'd love to have more collaborators on this.  I should write up some more
formal description of my current plan of attack and possible alternatives.
The OKFN isn't big on that kind of formality.  They're more in the "Let's
*do* this thing!" kind of vein.

Thanks for your interest.  I should tell you up front that this is a
sideline to my sidelines, so it may not get tons of attention.

Tom

On Thu, Apr 30, 2015 at 6:39 PM, Nick White <[email protected]> wrote:

> Hi Tom,
>
> Just on the off chance it's useful, I wrote some scripts which
> process the 2nd edition OED SGML into DICT format, called oed2dict:
> https://njw.name/oed2dict/
>
> Sounds like an interesting project you're involved in. Can I have
> more details?
>
> FWIW, I'd imagine the full IPA would be a hard script to train, as
> it's lots of little diacritical marks on similar letters. But, while
> the quality might not be as high as other scripts, it'd probably
> still be worth it, to provide a better base for crowdsourced
> proofing. The 'symbols' file oed2dict lists all the named symbols
> (basically all non-ascii characters) in the digital 2nd edition OED,
> which is probably a very good starting point for generating training
> texts.
>
> Nick
>
> On Thu, Apr 23, 2015 at 01:32:12PM -0700, Tom Morris wrote:
> > Has anyone tackled training for the IPA since this initial query?
> >
> > I'm considering using Tesseract to OCR the first edition of the Oxford English
> > Dictionary (as input to a crowdsourced proofing process) and trying to decide
> > whether it's worth training it to recognize the pronunciations.  I'm also not
> > sure how close the OED version is since the original IPA was developed in 1886
> > and the first volume of the first edition of the OED was published in 1888, so
> > perhaps they used a homegrown variant.
> >
> > Here's the alphabet:
> > https://archive.org/stream/ANewEnglishDictionaryOnHistoricalPrinciples.10VolumesWithSupplement/01.NEDHP.AB.Oxford.Murray.1888.#page/n22/mode/1up
> > Here's an example page with it in use:
> > https://archive.org/stream/ANewEnglishDictionaryOnHistoricalPrinciples.10VolumesWithSupplement/01.NEDHP.AB.Oxford.Murray.1888.#page/n326/mode/1up
> >
> > Any opinions on whether it's worth training for the phonetic alphabet or is it
> > going to just be too difficult to recognize even with specific training?
> >
> > Tom
> >
> > On Wednesday, January 22, 2014 at 11:55:28 AM UTC-5, Nick White wrote:
> >
> >     Hi Epin,
> >
> >     On Sat, Jan 18, 2014 at 01:32:11AM -0800, Epin Dorsal wrote:
> >     > I've been looking for software to recognize international
> >     > phonetic transcription that can be used from C++ Builder.
> >     > Would you help me find it, please?
> >
> >     Tesseract could certainly be used to recognise the International
> >     Phonetic Alphabet, though to my knowledge nobody has trained it for
> >     that yet. As there are quite a lot of different diacritics the
> >     training set would be quite large, but that's no problem,
> >     particularly if you use a tool like VietOCR[0] or my tools[1] to
> >     generate the training images. Detailed instructions for training
> >     Tesseract can be found on the Tesseract wiki[2].
> >
> >     Tesseract has a C++ API, so you can certainly integrate it into a C++
> >     project.
> >
> >     Hope this helps. If you need more advice, please try to be specific
> >     in your question.
> >
> >     Nick
> >
> >     0. http://vietocr.sourceforge.net/training.html
> >     1. https://gitorious.org/ancient-greek-training-for-tesseract/tesstrainingtools
> >     2. https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
> >
> > --
> > You received this message because you are subscribed to the Google Groups
> > "tesseract-ocr" group.
> > To unsubscribe from this group and stop receiving emails from it, send
> an email
> > to [email protected].
> > To post to this group, send email to [email protected].
> > Visit this group at http://groups.google.com/group/tesseract-ocr.
> > To view this discussion on the web visit
> https://groups.google.com/d/msgid/
> > tesseract-ocr/4f2be0bc-71e5-4483-b4a4-bf0064ddcdf1%40googlegroups.com.
> > For more options, visit https://groups.google.com/d/optout.
>
>

