Thanks to both of you. I've made some progress on this front and will update you all shortly. Art helped us pick a box editor, so we're currently correcting some non-Unicode fonts, such as ProSyl and OldSyl.
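
For anyone following along who wants to try the IKU data Tom mentions below on a whole folder of scans, here is a minimal sketch in Python. It only uses the basic command-line form, tesseract <image> <outputbase> -l <lang>, and assumes iku.traineddata is already installed in your tessdata directory. The function names, the *.jpg folder layout, and the syllabics ratio check are illustrative assumptions, not something from the thread; the ratio is a rough sanity check that the output landed in the Unified Canadian Aboriginal Syllabics block (U+1400 to U+167F), not a measure of accuracy.

```python
import subprocess
from pathlib import Path
from typing import Dict, List

# Unified Canadian Aboriginal Syllabics block: U+1400..U+167F.
SYLLABICS_START, SYLLABICS_END = 0x1400, 0x167F


def build_command(image: Path, out_base: Path, lang: str = "iku") -> List[str]:
    """Build the CLI call: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", str(image), str(out_base), "-l", lang]


def syllabics_ratio(text: str) -> float:
    """Share of non-whitespace characters that fall in the syllabics block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(SYLLABICS_START <= ord(c) <= SYLLABICS_END for c in chars)
    return hits / len(chars)


def ocr_folder(folder: Path, lang: str = "iku") -> Dict[str, float]:
    """OCR every .jpg in a folder (hypothetical layout) and report the ratio."""
    results = {}
    for image in sorted(folder.glob("*.jpg")):
        out_base = image.with_suffix("")
        # Tesseract writes its text output to <outputbase>.txt.
        subprocess.run(build_command(image, out_base, lang), check=True)
        text = Path(f"{out_base}.txt").read_text(encoding="utf-8")
        results[image.name] = syllabics_ratio(text)
    return results
```

Calling ocr_folder(Path("scans")) on a folder of page images would leave one .txt file next to each image and return a ratio per file. Pages of English-only or mixed text will naturally score low, so a low ratio is only a hint that a page is worth a look.
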
On Thursday, June 25, 2015 at 12:22:32 AM UTC-4, Tom Morris wrote:
>
> In addition to Art's training data, you might also want to test the IKU
> language data for Tesseract 3.04 that Google released a few hours ago:
>
> https://github.com/tesseract-ocr/tessdata/blob/master/iku.traineddata
>
> It was generated from the source language data here:
>
> https://github.com/tesseract-ocr/langdata/tree/master/iku
>
> and I think this is the script data:
>
> https://github.com/tesseract-ocr/langdata/blob/master/Canadian_Aboriginal.unicharset
> https://github.com/tesseract-ocr/langdata/blob/master/Canadian_Aboriginal.xheights
>
> The fact that this is in the standard Google implementation now may also
> mean that you can (or soon will be able to) get IKU OCR search results for
> books in Google Books. That might be worth testing at some point.
>
> Tom
>
> On Tuesday, June 23, 2015 at 1:51:26 PM UTC-4, Art Rhyno wrote:
>>
>> Hi Riel,
>>
>> I did some volunteer work on Inuktitut OCR for an ongoing project
>> collaboration between OurDigitalWorld.org and the Multicultural History
>> Society of Ontario (MHSO). There is a presentation on that project here
>> [1], but I was focused only on the OCR of the scanned titles in the MHSO
>> collection. One of these is "Inuit Today", an Inuktitut/English
>> publication from the 1970s.
>>
>> The training files I created are on GitHub [2]. I have attached the
>> result of using the trained data set to this message, but I was relying
>> on the English dataset for numbers, so none of the numeric characters are
>> in the sample. Sad to say, I have no facility in the Inuktitut language
>> and I was dealing with one publication and one font, so I was out of my
>> depth for much of this, but it might give you a starting point. I would
>> be happy to walk you through the process I went through for the dataset.
>> The ability to add your own fonts is an area where Tesseract shines,
>> though it’s sad that the companies you approached didn’t step forward to
>> add it to the commercial options, since it is a major language in Canada.
>>
>> art
>>
>> ---
>>
>> 1. http://www.accessola2.com/superconference2014/sessions/329.pdf
>> 2. https://github.com/OurDigitalWorld/odw-font-training
>>
>> *From:* [email protected] [mailto:[email protected]] *On Behalf Of *Riel Gallant
>> *Sent:* Tuesday, June 23, 2015 11:52 AM
>> *To:* [email protected]
>> *Subject:* [tesseract-ocr] Inuktitut OCR problems - ᐃᓄᑦᑎᑐᑦ (Euphemia typeface)
>>
>> Hello everyone. Greetings from Nunavut, Canada.
>>
>> I'm fairly new to the technical side of OCR and Tesseract in general, so
>> my apologies in advance.
>>
>> I've been OCRing quite a bit using Adobe Acrobat. It works quite well for
>> English, but offers no support at all for the written language of
>> Inuktitut <https://en.wikipedia.org/wiki/Inuktitut>. The Inuktitut
>> language is native to the northeastern part of Canada and uses a
>> non-Roman orthography named "syllabics"
>> <https://en.wikipedia.org/wiki/Canadian_Aboriginal_syllabics>, which was
>> introduced by missionaries in the 1800s and is still used today. Some
>> Cree dialects also use syllabics. Here's a link to the Unified Canadian
>> Aboriginal Syllabics Official Unicode Consortium code chart
>> <http://www.unicode.org/charts/PDF/U1400.pdf> (PDF) - Wikipedia link
>> <https://en.wikipedia.org/wiki/Unified_Canadian_Aboriginal_Syllabics_%28Unicode_block%29>.
>>
>> Since Windows Vista, every Windows OS comes prepackaged with a font named
>> Euphemia <https://en.wikipedia.org/wiki/Euphemia_%28typeface%29>, which
>> is a Unicode font that supports syllabics. When you activate the
>> Inuktitut keyboard and hit the caps lock, you can type syllabics. Apple
>> also supports Euphemia, and a recent app gives users an Inuktitut keyboard
>> <https://itunes.apple.com/ca/app/inuktut-naqittautit/id993521673?mt=8/>.
>> Android does not support it yet. There are also many pre-Unicode typefaces
>> <http://www.pirurvik.ca/en/productions/iu-computing/font-download> that
>> look slightly different than Euphemia syllabics, which I realize may be
>> an issue.
>>
>> I've been able to manually fix OCR errors in Adobe Acrobat under Text
>> Recognition -> Find All Suspects -> changing the font to Euphemia ->
>> manually typing the correct text in the red box (see attached image for
>> instructions). Though this was a step forward, we're looking for a batch
>> production OCR solution. OCRing Inuktitut using Acrobat gives us results
>> like this:
>>
>> [image: Image removed by sender.]
>>
>> Neither Adobe nor ABBYY has responded to our requests to have Inuktitut
>> added as a language in their text recognition feature.
>>
>> Is there something we can try with Tesseract? I downloaded it but haven't
>> made much progress. We'd love to be able to search our older scanned PDFs
>> using syllabics and eventually put our historic documents on our website,
>> which would then come up in Google search results. Any help would be
>> greatly appreciated. I've attached a jpg of sample text from the Nunavut
>> Land Claims Agreement <http://nlca.tunngavik.com> (table of contents for
>> Article 26) if anyone needs some content for testing.
>>
>> ᓇᑯᕐᒦᒃ / Thank you!
>>
>> https://en.wikipedia.org/wiki/Unified_Canadian_Aboriginal_Syllabics_%28Unicode_block%29
>>
>> *Unified Canadian Aboriginal Syllabics*[1]
>> <https://en.wikipedia.org/wiki/Unified_Canadian_Aboriginal_Syllabics_%28Unicode_block%29#endnote_U1400_as_of_Unicode_version>
>> Official Unicode Consortium code chart
>> <http://www.unicode.org/charts/PDF/U1400.pdf> (PDF)
>>
>>          0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
>> U+140x   ᐀  ᐁ  ᐂ  ᐃ  ᐄ  ᐅ  ᐆ  ᐇ  ᐈ  ᐉ  ᐊ  ᐋ  ᐌ  ᐍ  ᐎ  ᐏ
>> U+141x   ᐐ  ᐑ  ᐒ  ᐓ  ᐔ  ᐕ  ᐖ  ᐗ  ᐘ  ᐙ  ᐚ  ᐛ  ᐜ  ᐝ  ᐞ  ᐟ
>> U+142x   ᐠ  ᐡ  ᐢ  ᐣ  ᐤ  ᐥ  ᐦ  ᐧ  ᐨ  ᐩ  ᐪ  ᐫ  ᐬ  ᐭ  ᐮ  ᐯ
>> U+143x   ᐰ  ᐱ  ᐲ  ᐳ  ᐴ  ᐵ  ᐶ  ᐷ  ᐸ  ᐹ  ᐺ  ᐻ  ᐼ  ᐽ  ᐾ  ᐿ
>> U+144x   ᑀ  ᑁ  ᑂ  ᑃ  ᑄ  ᑅ  ᑆ  ᑇ  ᑈ  ᑉ  ᑊ  ᑋ  ᑌ  ᑍ  ᑎ  ᑏ
>> U+145x   ᑐ  ᑑ  ᑒ  ᑓ  ᑔ  ᑕ  ᑖ  ᑗ  ᑘ  ᑙ  ᑚ  ᑛ  ᑜ  ᑝ  ᑞ  ᑟ
>> U+146x   ᑠ  ᑡ  ᑢ  ᑣ  ᑤ  ᑥ  ᑦ  ᑧ  ᑨ  ᑩ  ᑪ  ᑫ  ᑬ  ᑭ  ᑮ  ᑯ
>> U+147x   ᑰ  ᑱ  ᑲ  ᑳ  ᑴ  ᑵ  ᑶ  ᑷ  ᑸ  ᑹ  ᑺ  ᑻ  ᑼ  ᑽ  ᑾ  ᑿ
>> U+148x   ᒀ  ᒁ  ᒂ  ᒃ  ᒄ  ᒅ  ᒆ  ᒇ  ᒈ  ᒉ  ᒊ  ᒋ  ᒌ  ᒍ  ᒎ  ᒏ
>> U+149x   ᒐ  ᒑ  ᒒ  ᒓ  ᒔ  ᒕ  ᒖ  ᒗ  ᒘ  ᒙ  ᒚ  ᒛ  ᒜ  ᒝ  ᒞ  ᒟ
>> U+14Ax   ᒠ  ᒡ  ᒢ  ᒣ  ᒤ  ᒥ  ᒦ  ᒧ  ᒨ  ᒩ
>>
>> ...

