Has anyone tackled training for the IPA since this initial query?

I'm considering using Tesseract to OCR the first edition of the Oxford
English Dictionary (as input to a crowdsourced proofing process) and am trying
to decide whether it's worth training it to recognize the pronunciations.
I'm also not sure how close the OED's notation is to the IPA: the original IPA
was developed in 1886 and the first volume of the first edition of the OED was
published in 1888, so perhaps they used a homegrown variant.

Here's the alphabet:
https://archive.org/stream/ANewEnglishDictionaryOnHistoricalPrinciples.10VolumesWithSupplement/01.NEDHP.AB.Oxford.Murray.1888.#page/n22/mode/1up

Here's an example page with it in use:
https://archive.org/stream/ANewEnglishDictionaryOnHistoricalPrinciples.10VolumesWithSupplement/01.NEDHP.AB.Oxford.Murray.1888.#page/n326/mode/1up

Any opinions on whether it's worth training for the phonetic alphabet, or is
it just going to be too difficult to recognize even with specific training?
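
To get a rough feel for how large the training character set might be, here is
a minimal sketch in Python. The base symbols and combining marks below are
placeholders I picked for illustration, not the OED's actual pronunciation key,
and the helper names (training_glyphs, training_lines) are hypothetical. It
pairs each base symbol with each single combining mark and lays the results out
as lines of text that could be rendered into training images.

```python
import unicodedata
from itertools import product

# Illustrative sample only: NOT the OED's actual pronunciation key.
BASE_SYMBOLS = ["a", "e", "i", "o", "u", "\u0259", "\u0254", "\u00e6"]  # ... schwa, open o, ash
COMBINING_MARKS = ["\u0304", "\u0306", "\u0301", "\u0303"]  # macron, breve, acute, tilde


def training_glyphs(bases, marks):
    """Return each base symbol alone, plus each base with one combining mark.

    Results are NFC-normalized so that precomposed forms are used where
    Unicode defines them; other combinations stay as base + combining mark.
    """
    glyphs = list(bases)
    for base, mark in product(bases, marks):
        glyphs.append(unicodedata.normalize("NFC", base + mark))
    return glyphs


def training_lines(glyphs, per_line=10):
    """Group glyphs into space-separated lines of training text."""
    return [" ".join(glyphs[i:i + per_line]) for i in range(0, len(glyphs), per_line)]


if __name__ == "__main__":
    glyphs = training_glyphs(BASE_SYMBOLS, COMBINING_MARKS)
    print(len(glyphs), "distinct glyphs")
    for line in training_lines(glyphs):
        print(line)
```

Even this toy set of 8 symbols and 4 marks yields 40 glyphs, so the real key,
with stacked or rarer diacritics, would presumably need a good deal more
training text.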

Tom

On Wednesday, January 22, 2014 at 11:55:28 AM UTC-5, Nick White wrote:
>
> Hi Epin, 
>
> On Sat, Jan 18, 2014 at 01:32:11AM -0800, Epin Dorsal wrote: 
> > I've been looking for software for recognizing international
> > phonetic transcription that can be used from C++ Builder.
> > Would you help me find it, please?
>
> Tesseract could certainly be used to recognise the International 
> Phonetic Alphabet, though to my knowledge nobody has trained it for 
> that yet. As there are quite a lot of different diacritics the 
> training set would be quite large, but that's no problem, 
> particularly if you use a tool like VietOCR[0] or my tools[1] to 
> generate the training images. Detailed instructions for training 
> Tesseract can be found on the Tesseract wiki[2]. 
>
> Tesseract has a C++ API, so you can certainly integrate it into a C++ 
> project. 
>
> Hope this helps. If you need more advice, please try to be specific 
> in your question. 
>
> Nick 
>
> 0. http://vietocr.sourceforge.net/training.html 
> 1. https://gitorious.org/ancient-greek-training-for-tesseract/tesstrainingtools
> 2. https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 
>
