In addition to Art's training data, you might also want to test the IKU language data for Tesseract 3.04 that Google released a few hours ago:
https://github.com/tesseract-ocr/tessdata/blob/master/iku.traineddata

It was generated from the source language data here:
https://github.com/tesseract-ocr/langdata/tree/master/iku

and I think this is the script data:
https://github.com/tesseract-ocr/langdata/blob/master/Canadian_Aboriginal.unicharset
https://github.com/tesseract-ocr/langdata/blob/master/Canadian_Aboriginal.xheights

The fact that this is in the standard Google implementation now may also mean that you can (or soon will be able to) get IKU OCR search results for books in Google Books. That might be worth testing at some point.

Tom

On Tuesday, June 23, 2015 at 1:51:26 PM UTC-4, Art Rhyno wrote:
>
> Hi Riel,
>
> I did some volunteer work on Inuktitut OCR for an ongoing project
> collaboration between OurDigitalWorld.org and the Multicultural History
> Society of Ontario (MHSO). There is a presentation on that project here
> [1], but I was focused only on the OCR of the scanned titles in the MHSO
> collection. One of these is "Inuit Today", an Inuktitut/English publication
> from the 1970s.
>
> The training files I created are on GitHub [2]. I have attached the result
> of using the trained data set to this message, but I was relying on the
> English dataset for numbers, so none of the numeric characters are in the
> sample. Sad to say, I have no facility in the Inuktitut language and I was
> dealing with one publication and one font, so I was out of my depth for
> much of this, but it might give you a starting point. I would be happy to
> walk you through the process I went through for the dataset. The ability to
> add your own fonts is an area where Tesseract shines, though it's sad that
> the companies you approached didn't step forward to add it to the
> commercial options, since it is a major language in Canada.
>
> art
>
> ---
> 1. http://www.accessola2.com/superconference2014/sessions/329.pdf
> 2. https://github.com/OurDigitalWorld/odw-font-training
>
> *From:* [email protected] [mailto:[email protected]] *On Behalf Of* Riel Gallant
> *Sent:* Tuesday, June 23, 2015 11:52 AM
> *To:* [email protected]
> *Subject:* [tesseract-ocr] Inuktitut OCR problems - ᐃᓄᑦᑎᑐᑦ (Euphemia typeface)
>
> Hello everyone. Greetings from Nunavut, Canada.
>
> I'm fairly new to the technical side of OCR and Tesseract in general, so
> my apologies in advance.
>
> I've been OCRing quite a bit using Adobe Acrobat. It works quite well for
> English, but offers no support at all for the written language of
> Inuktitut <https://en.wikipedia.org/wiki/Inuktitut>. The Inuktitut
> language is native to the northeastern part of Canada and uses a non-Roman
> orthography script named "syllabics"
> <https://en.wikipedia.org/wiki/Canadian_Aboriginal_syllabics>, which was
> introduced by missionaries in the 1800s and is still used today. Some Cree
> dialects also use syllabics. Here's a link to the Unified Canadian
> Aboriginal Syllabics Official Unicode Consortium code chart
> <http://www.unicode.org/charts/PDF/U1400.pdf> (PDF) - Wikipedia link
> <https://en.wikipedia.org/wiki/Unified_Canadian_Aboriginal_Syllabics_%28Unicode_block%29>.
>
> Since Windows Vista, every Windows OS comes prepackaged with a font named
> Euphemia <https://en.wikipedia.org/wiki/Euphemia_%28typeface%29>, which
> is a Unicode font that supports syllabics. When you activate the Inuktitut
> keyboard and hit the caps lock, you can type syllabics. Apple also supports
> Euphemia, and a recently released app gives users an Inuktitut keyboard
> <https://itunes.apple.com/ca/app/inuktut-naqittautit/id993521673?mt=8/>.
> Android does not support it yet. There are also many pre-Unicode
> typefaces
> <http://www.pirurvik.ca/en/productions/iu-computing/font-download> that
> look slightly different than Euphemia syllabics, which I realize may be an
> issue.
>
> I've been able to manually fix OCR errors in Adobe Acrobat under Text
> Recognition -> Find All Suspects -> changing the font to Euphemia ->
> manually typing the correct text in the red box (see attached image for
> instructions). Though this was a step forward, we're looking for a batch
> production OCR solution. OCRing Inuktitut using Acrobat gives us results
> like this:
>
> [image: Image removed by sender.]
>
> Neither Adobe nor ABBYY has responded to our requests to have Inuktitut
> added as a language in their text recognition feature.
>
> Is there something we can try with Tesseract? I downloaded it but haven't
> made much progress. We'd love to be able to search our older scanned PDFs
> using syllabics and eventually put our historic documents on our website,
> which would then come up in Google search results. Any help would be
> greatly appreciated. I've attached a jpg of sample text from the Nunavut
> Land Claims Agreement <http://nlca.tunngavik.com> (table of contents for
> Article 26) if anyone needs some content for testing.
>
> ᓇᑯᕐᒦᒃ / Thank you!
>
> https://en.wikipedia.org/wiki/Unified_Canadian_Aboriginal_Syllabics_%28Unicode_block%29
>
> *Unified Canadian Aboriginal Syllabics*[1]
> <https://en.wikipedia.org/wiki/Unified_Canadian_Aboriginal_Syllabics_%28Unicode_block%29#endnote_U1400_as_of_Unicode_version>
> Official Unicode Consortium code chart
> <http://www.unicode.org/charts/PDF/U1400.pdf> (PDF)
>
>          0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
> U+140x   ᐀  ᐁ  ᐂ  ᐃ  ᐄ  ᐅ  ᐆ  ᐇ  ᐈ  ᐉ  ᐊ  ᐋ  ᐌ  ᐍ  ᐎ  ᐏ
> U+141x   ᐐ  ᐑ  ᐒ  ᐓ  ᐔ  ᐕ  ᐖ  ᐗ  ᐘ  ᐙ  ᐚ  ᐛ  ᐜ  ᐝ  ᐞ  ᐟ
> U+142x   ᐠ  ᐡ  ᐢ  ᐣ  ᐤ  ᐥ  ᐦ  ᐧ  ᐨ  ᐩ  ᐪ  ᐫ  ᐬ  ᐭ  ᐮ  ᐯ
> U+143x   ᐰ  ᐱ  ᐲ  ᐳ  ᐴ  ᐵ  ᐶ  ᐷ  ᐸ  ᐹ  ᐺ  ᐻ  ᐼ  ᐽ  ᐾ  ᐿ
> U+144x   ᑀ  ᑁ  ᑂ  ᑃ  ᑄ  ᑅ  ᑆ  ᑇ  ᑈ  ᑉ  ᑊ  ᑋ  ᑌ  ᑍ  ᑎ  ᑏ
> U+145x   ᑐ  ᑑ  ᑒ  ᑓ  ᑔ  ᑕ  ᑖ  ᑗ  ᑘ  ᑙ  ᑚ  ᑛ  ᑜ  ᑝ  ᑞ  ᑟ
> U+146x   ᑠ  ᑡ  ᑢ  ᑣ  ᑤ  ᑥ  ᑦ  ᑧ  ᑨ  ᑩ  ᑪ  ᑫ  ᑬ  ᑭ  ᑮ  ᑯ
> U+147x   ᑰ  ᑱ  ᑲ  ᑳ  ᑴ  ᑵ  ᑶ  ᑷ  ᑸ  ᑹ  ᑺ  ᑻ  ᑼ  ᑽ  ᑾ  ᑿ
> U+148x   ᒀ  ᒁ  ᒂ  ᒃ  ᒄ  ᒅ  ᒆ  ᒇ  ᒈ  ᒉ  ᒊ  ᒋ  ᒌ  ᒍ  ᒎ  ᒏ
> U+149x   ᒐ  ᒑ  ᒒ  ᒓ  ᒔ  ᒕ  ᒖ  ᒗ  ᒘ  ᒙ  ᒚ  ᒛ  ᒜ  ᒝ  ᒞ  ᒟ
> U+14Ax   ᒠ  ᒡ  ᒢ  ᒣ  ᒤ  ᒥ  ᒦ  ᒧ  ᒨ  ᒩ
>
> ...
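For anyone who wants to try the IKU data mentioned at the top of this thread, here is a minimal Python sketch of one way to run it and sanity-check the output. It assumes that `iku.traineddata` has already been copied into the tessdata directory your Tesseract install reads and that the `tesseract` executable is on your PATH. The function names (`build_tesseract_command`, `ocr_image`, `syllabics_ratio`) are hypothetical helpers written for illustration, not part of Tesseract. The ratio check only counts how many output characters fall in the Unified Canadian Aboriginal Syllabics block (U+1400 to U+167F); it says nothing about whether the recognized syllabics are the correct ones.

```python
import shutil
import subprocess
from pathlib import Path

# Unified Canadian Aboriginal Syllabics block, per the Unicode U1400 code chart.
SYLLABICS_START = 0x1400
SYLLABICS_END = 0x167F


def build_tesseract_command(image_path, output_base, lang="iku"):
    """Return the argv list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def syllabics_ratio(text):
    """Fraction of non-whitespace characters that fall in the syllabics block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if SYLLABICS_START <= ord(c) <= SYLLABICS_END)
    return hits / len(chars)


def ocr_image(image_path, output_base, lang="iku"):
    """Run tesseract on one image and return the recognized text.

    Assumes tesseract is installed and the requested language data is available.
    Tesseract writes its plain-text result to <output_base>.txt.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

A call such as `syllabics_ratio(ocr_image("sample.jpg", "sample_iku"))` (the file names are placeholders) gives a rough first signal: a value near zero on a page that is mostly syllabics suggests the language data was not picked up at all.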

