Has anyone tackled training for the IPA since this initial query? I'm considering using Tesseract to OCR the first edition of the Oxford English Dictionary (as input to a crowdsourced proofing process) and am trying to decide whether it's worth training it to recognize the pronunciations. I'm also not sure how close the OED's notation is to the IPA: the original IPA was developed in 1886 and the first volume of the first edition of the OED was published in 1888, so perhaps they used a homegrown variant.

Here's the alphabet:
https://archive.org/stream/ANewEnglishDictionaryOnHistoricalPrinciples.10VolumesWithSupplement/01.NEDHP.AB.Oxford.Murray.1888.#page/n22/mode/1up

Here's an example page with it in use:
https://archive.org/stream/ANewEnglishDictionaryOnHistoricalPrinciples.10VolumesWithSupplement/01.NEDHP.AB.Oxford.Murray.1888.#page/n326/mode/1up

Any opinions on whether it's worth training for the phonetic alphabet or is it going to just be too difficult to recognize even with specific training?

Tom

On Wednesday, January 22, 2014 at 11:55:28 AM UTC-5, Nick White wrote:
> Hi Epin,
>
> On Sat, Jan 18, 2014 at 01:32:11AM -0800, Epin Dorsal wrote:
> > I've been looking for a soft means for recognition the
> > international phonetic transcription, may be applied into C++ Builder.
> > Would you help me to find it, please!
>
> Tesseract could certainly be used to recognise the International
> Phonetic Alphabet, though to my knowledge nobody has trained it for
> that yet. As there are quite a lot of different diacritics the
> training set would be quite large, but that's no problem,
> particularly if you use a tool like VietOCR[0] or my tools[1] to
> generate the training images. Detailed instructions for training
> Tesseract can be found on the Tesseract wiki[2].
>
> Tesseract has a C++ API, so you can certainly integrate it into a C++
> project.
>
> Hope this helps. If you need more advice, please try to be specific
> in your question.
>
> Nick
>
> 0. http://vietocr.sourceforge.net/training.html
> 1. https://gitorious.org/ancient-greek-training-for-tesseract/tesstrainingtools
> 2. https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
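
To get a rough feel for how large the training set gets once diacritics are involved, here is a minimal Python sketch that builds a list of base letters combined with diacritics and lays them out as lines of training text. Everything in it is an assumption for illustration: the base letters are simply a-z plus the Unicode IPA Extensions block, the four diacritics are an arbitrary sample, and the function names are made up. The OED's own notation may well differ from the IPA, so the real character list would have to come from the key page linked above.

```python
import itertools
import unicodedata


def ipa_base_letters():
    """Return a-z plus the Unicode IPA Extensions block (U+0250 to U+02AF).

    Assumption: this stands in for the real character inventory, which
    would have to be read off the dictionary's own pronunciation key.
    """
    latin = [chr(c) for c in range(ord("a"), ord("z") + 1)]
    extensions = [chr(c) for c in range(0x0250, 0x02B0)]
    return latin + extensions


# Hypothetical sample of combining diacritics:
# tilde, syllabic mark (vertical line below), ring below, bridge below.
SAMPLE_DIACRITICS = ["\u0303", "\u0329", "\u0325", "\u032A"]


def training_glyphs(bases, diacritics):
    """Return every bare base letter plus every base+diacritic pair."""
    glyphs = list(bases)
    for base, mark in itertools.product(bases, diacritics):
        # NFC uses a precomposed code point where Unicode defines one.
        glyphs.append(unicodedata.normalize("NFC", base + mark))
    return glyphs


def training_lines(glyphs, per_line=40):
    """Lay the glyphs out as space-separated lines of training text."""
    return [
        " ".join(glyphs[i:i + per_line])
        for i in range(0, len(glyphs), per_line)
    ]


if __name__ == "__main__":
    glyphs = training_glyphs(ipa_base_letters(), SAMPLE_DIACRITICS)
    print(len(glyphs), "glyphs to cover")
    for line in training_lines(glyphs):
        print(line)
```

The resulting text could then be rendered to images with one of the tools Nick mentions. Even this small sample of diacritics multiplies the glyph count fivefold, which is the growth he was pointing at.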

