Hi Tom,

Just on the off chance it's useful, I wrote some scripts which 
process the 2nd edition OED SGML into DICT format, called oed2dict: 
https://njw.name/oed2dict/

Sounds like an interesting project you're involved in. Can I have 
more details?

FWIW, I'd imagine the full IPA would be a hard script to train, as 
it has lots of little diacritical marks on similar letters. But, while 
the quality might not be as high as other scripts, it'd probably 
still be worth it, to provide a better base for crowdsourced 
proofing. The 'symbols' file in oed2dict lists all the named symbols 
(basically all non-ASCII characters) in the digital 2nd edition OED, 
which is probably a very good starting point for generating training 
texts.
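
As a rough sketch of what I mean by generating training texts from a 
symbol list: the Python below assumes the symbols have already been 
read into a list of strings (it makes no assumption about the format of 
the 'symbols' file), and the names make_training_lines and base_words 
are made up for illustration. It just makes sure every symbol turns up 
a set number of times, embedded in ordinary words, so it's a starting 
point rather than a proper training-text generator.

```python
import random


def make_training_lines(symbols, base_words, per_symbol=5,
                        words_per_line=8, seed=0):
    """Return lines of text in which each symbol occurs per_symbol times.

    symbols    -- list of strings to be covered (hypothetical input; how
                  they are extracted from the 'symbols' file is not shown)
    base_words -- non-empty ordinary words to embed the symbols in
    """
    rng = random.Random(seed)  # fixed seed so the output is repeatable
    tokens = []
    for sym in symbols:
        for _ in range(per_symbol):
            word = rng.choice(base_words)
            # Insert after at least one letter, so that a combining
            # diacritic always has a base character to attach to.
            pos = rng.randint(1, len(word))
            tokens.append(word[:pos] + sym + word[pos:])
    rng.shuffle(tokens)  # mix the symbols up across lines
    return [" ".join(tokens[i:i + words_per_line])
            for i in range(0, len(tokens), words_per_line)]


if __name__ == "__main__":
    # A few IPA letters as stand-ins for the real symbol list.
    for line in make_training_lines(["ə", "ʃ", "ŋ"], ["cat", "dog", "bird"]):
        print(line)
```

The resulting lines could then be rendered to images with whichever 
training tool you prefer.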

Nick

On Thu, Apr 23, 2015 at 01:32:12PM -0700, Tom Morris wrote:
> Has anyone tackled training for the IPA since this initial query?
> 
> I'm considering using Tesseract to OCR the first edition of the Oxford English
> Dictionary (as input to a crowdsourced proofing process) and trying to decide
> whether it's worth training it to recognize the pronunciations.  I'm also not
> sure how close the OED version is since the original IPA was developed in 1886
> and the first volume of the first edition of the OED was published in 1888, so
> perhaps they used a homegrown variant.
> 
> Here's the alphabet:
> https://archive.org/stream/ANewEnglishDictionaryOnHistoricalPrinciples.10VolumesWithSupplement/01.NEDHP.AB.Oxford.Murray.1888.#page/n22/mode/1up
> Here's an example page with it in use:
> https://archive.org/stream/ANewEnglishDictionaryOnHistoricalPrinciples.10VolumesWithSupplement/01.NEDHP.AB.Oxford.Murray.1888.#page/n326/mode/1up
> 
> Any opinions on whether it's worth training for the phonetic alphabet or is it
> going to just be too difficult to recognize even with specific training?
> 
> Tom
> 
> On Wednesday, January 22, 2014 at 11:55:28 AM UTC-5, Nick White wrote:
> 
>     Hi Epin,
> 
>     On Sat, Jan 18, 2014 at 01:32:11AM -0800, Epin Dorsal wrote:
>     > I've been looking for software that can recognise International
>     > Phonetic Alphabet transcription and that can be used from C++ Builder.
>     > Would you help me to find it, please?
> 
>     Tesseract could certainly be used to recognise the International
>     Phonetic Alphabet, though to my knowledge nobody has trained it for
>     that yet. As there are quite a lot of different diacritics the
>     training set would be quite large, but that's no problem,
>     particularly if you use a tool like VietOCR[0] or my tools[1] to
>     generate the training images. Detailed instructions for training
>     Tesseract can be found on the Tesseract wiki[2].
> 
>     Tesseract has a C++ API, so you can certainly integrate it into a C++
>     project.
> 
>     Hope this helps. If you need more advice, please try to be specific
>     in your question.
> 
>     Nick
> 
>     0. http://vietocr.sourceforge.net/training.html
>     1. https://gitorious.org/ancient-greek-training-for-tesseract/tesstrainingtools
>     2. https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
> 
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email
> to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit https://groups.google.com/d/msgid/
> tesseract-ocr/4f2be0bc-71e5-4483-b4a4-bf0064ddcdf1%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

