Thanks to Art Rhyno, we've successfully OCRed documents from our collection. He created a few training files for us and they work great. We will post updates in the future. Message us if you have questions.
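For anyone who wants to try something similar, here is a minimal sketch of how a custom training file might be used from Python. It is not the exact setup Art built for us. The language code "iku", the "tessdata" directory, and the function names are assumptions for illustration; only the `-l` and `--tessdata-dir` options are standard Tesseract command-line options. The second helper is a rough sanity check that reports how much of the OCR output falls inside the Unified Canadian Aboriginal Syllabics block (U+1400 to U+167F).

```python
import subprocess
from pathlib import Path

# Unified Canadian Aboriginal Syllabics block: U+1400 to U+167F.
SYLLABICS_START = 0x1400
SYLLABICS_END = 0x167F


def build_tesseract_command(image, output_base, lang="iku", tessdata_dir="tessdata"):
    """Build the tesseract command line for one page image.

    "iku" and "tessdata" are assumed names: the directory is expected to
    contain a matching iku.traineddata file.
    """
    return [
        "tesseract", str(image), str(output_base),
        "-l", lang,
        "--tessdata-dir", str(tessdata_dir),
    ]


def syllabics_ratio(text):
    """Return the fraction of non-whitespace characters that are syllabics."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if SYLLABICS_START <= ord(c) <= SYLLABICS_END)
    return hits / len(chars)


def ocr_page(image, output_base, **kwargs):
    """Run tesseract on one image and return the recognized text.

    Requires the tesseract executable to be installed and on the PATH.
    """
    subprocess.run(build_tesseract_command(image, output_base, **kwargs), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

A page that is mostly Inuktitut should give a ratio close to 1.0. A low ratio on such a page suggests the wrong training file was loaded or the scan needs cleanup.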
On Tuesday, June 23, 2015 at 12:40:49 PM UTC-4, Riel G wrote:
>
> Hello everyone. Greetings from Nunavut, Canada.
>
> I'm fairly new to the technical side of OCR and Tesseract in general, so
> my apologies in advance.
>
> I've been OCRing quite a bit using Adobe Acrobat. It works quite well for
> English, but offers no support at all for the written language of
> Inuktitut <https://en.wikipedia.org/wiki/Inuktitut>. The Inuktitut
> language is native to the northeastern part of Canada and uses a non-Roman
> script called "syllabics"
> <https://en.wikipedia.org/wiki/Canadian_Aboriginal_syllabics>, which was
> introduced by missionaries in the 1800s and is still used today. Some Cree
> dialects also use syllabics. Here's a link to the Unified Canadian
> Aboriginal Syllabics Official Unicode Consortium code chart
> <http://www.unicode.org/charts/PDF/U1400.pdf> (PDF) and the Wikipedia link
> <https://en.wikipedia.org/wiki/Unified_Canadian_Aboriginal_Syllabics_%28Unicode_block%29>.
>
> Since Windows Vista, every Windows OS comes prepackaged with a font named
> Euphemia <https://en.wikipedia.org/wiki/Euphemia_%28typeface%29>, which
> is a Unicode font that supports syllabics. When you activate the Inuktitut
> keyboard and hit the caps lock, you can type syllabics. Apple also supports
> Euphemia, and an app that came out recently gives users an Inuktitut keyboard
> <https://itunes.apple.com/ca/app/inuktut-naqittautit/id993521673?mt=8/>.
> Android does not support it yet. There are also many pre-Unicode
> typefaces
> <http://www.pirurvik.ca/en/productions/iu-computing/font-download> that
> look slightly different from Euphemia syllabics, which I realize may be an
> issue.
>
> I've been able to manually fix OCR errors in Adobe Acrobat under Text
> Recognition -> Find All Suspects -> changing the font to Euphemia ->
> manually typing the correct text in the red box (see attached image for
> instructions).
> Though this was a step forward, we're looking for a batch
> production OCR solution. OCRing Inuktitut using Acrobat gives us results
> like this:
>
> [image]
>
> Neither Adobe nor ABBYY has responded to our requests to have Inuktitut
> added as a language in their text recognition feature.
>
> Is there something we can try with Tesseract? I downloaded it but haven't
> made much progress. We'd love to be able to search our older scanned PDFs
> using syllabics and eventually put our historic documents on our website,
> so that they would come up in Google search results. Any help would be
> greatly appreciated. I've attached a jpg of sample text from the Nunavut
> Land Claims Agreement <http://nlca.tunngavik.com> (table of contents for
> Article 26) if anyone needs some content for testing.
>
> ᓇᑯᕐᒦᒃ / Thank you!
>
> https://en.wikipedia.org/wiki/Unified_Canadian_Aboriginal_Syllabics_%28Unicode_block%29
>
> *Unified Canadian Aboriginal Syllabics*[1]
> <https://en.wikipedia.org/wiki/Unified_Canadian_Aboriginal_Syllabics_%28Unicode_block%29#endnote_U1400_as_of_Unicode_version>
> Official Unicode Consortium code chart
> <http://www.unicode.org/charts/PDF/U1400.pdf> (PDF)
>
>         0 1 2 3 4 5 6 7 8 9 A B C D E F
> U+140x  ᐀ ᐁ ᐂ ᐃ ᐄ ᐅ ᐆ ᐇ ᐈ ᐉ ᐊ ᐋ ᐌ ᐍ ᐎ ᐏ
> U+141x  ᐐ ᐑ ᐒ ᐓ ᐔ ᐕ ᐖ ᐗ ᐘ ᐙ ᐚ ᐛ ᐜ ᐝ ᐞ ᐟ
> U+142x  ᐠ ᐡ ᐢ ᐣ ᐤ ᐥ ᐦ ᐧ ᐨ ᐩ ᐪ ᐫ ᐬ ᐭ ᐮ ᐯ
> U+143x  ᐰ ᐱ ᐲ ᐳ ᐴ ᐵ ᐶ ᐷ ᐸ ᐹ ᐺ ᐻ ᐼ ᐽ ᐾ ᐿ
> U+144x  ᑀ ᑁ ᑂ ᑃ ᑄ ᑅ ᑆ ᑇ ᑈ ᑉ ᑊ ᑋ ᑌ ᑍ ᑎ ᑏ
> U+145x  ᑐ ᑑ ᑒ ᑓ ᑔ ᑕ ᑖ ᑗ ᑘ ᑙ ᑚ ᑛ ᑜ ᑝ ᑞ ᑟ
> U+146x  ᑠ ᑡ ᑢ ᑣ ᑤ ᑥ ᑦ ᑧ ᑨ ᑩ ᑪ ᑫ ᑬ ᑭ ᑮ ᑯ
> U+147x  ᑰ ᑱ ᑲ ᑳ ᑴ ᑵ ᑶ ᑷ ᑸ ᑹ ᑺ ᑻ ᑼ ᑽ ᑾ ᑿ
> U+148x  ᒀ ᒁ ᒂ ᒃ ᒄ ᒅ ᒆ ᒇ ᒈ ᒉ ᒊ ᒋ ᒌ ᒍ ᒎ ᒏ
> U+149x  ᒐ ᒑ ᒒ ᒓ ᒔ ᒕ ᒖ ᒗ ᒘ ᒙ ᒚ ᒛ ᒜ ᒝ ᒞ ᒟ
> U+14Ax  ᒠ ᒡ ᒢ ᒣ ᒤ ᒥ ᒦ ᒧ ᒨ ᒩ ᒪ ᒫ ᒬ ᒭ ᒮ ᒯ
> U+14Bx  ᒰ ᒱ ᒲ ᒳ ᒴ ᒵ ᒶ ᒷ ᒸ ᒹ ᒺ ᒻ ᒼ ᒽ ᒾ ᒿ
> U+14Cx  ᓀ ᓁ ᓂ ᓃ ᓄ ᓅ ᓆ ᓇ ᓈ ᓉ ᓊ ᓋ ᓌ ᓍ ᓎ ᓏ
> U+14Dx  ᓐ ᓑ ᓒ ᓓ ᓔ ᓕ ᓖ ᓗ ᓘ ᓙ ᓚ ᓛ ᓜ ᓝ ᓞ ᓟ
> U+14Ex  ᓠ ᓡ ᓢ ᓣ ᓤ ᓥ ᓦ ᓧ ᓨ ᓩ ᓪ ᓫ ᓬ ᓭ ᓮ ᓯ
> U+14Fx  ᓰ ᓱ ᓲ ᓳ ᓴ ᓵ ᓶ ᓷ ᓸ ᓹ ᓺ ᓻ ᓼ ᓽ ᓾ ᓿ
> U+150x  ᔀ ᔁ ᔂ ᔃ ᔄ ᔅ ᔆ ᔇ ᔈ ᔉ ᔊ ᔋ ᔌ ᔍ ᔎ ᔏ
> U+151x  ᔐ ᔑ ᔒ ᔓ ᔔ ᔕ ᔖ ᔗ ᔘ ᔙ ᔚ ᔛ ᔜ ᔝ ᔞ ᔟ
> U+152x  ᔠ ᔡ ᔢ ᔣ ᔤ ᔥ ᔦ ᔧ ᔨ ᔩ ᔪ ᔫ ᔬ ᔭ ᔮ ᔯ
> U+153x  ᔰ ᔱ ᔲ ᔳ ᔴ ᔵ ᔶ ᔷ ᔸ ᔹ ᔺ ᔻ ᔼ ᔽ ᔾ ᔿ
> U+154x  ᕀ ᕁ ᕂ ᕃ ᕄ ᕅ ᕆ ᕇ ᕈ ᕉ ᕊ ᕋ ᕌ ᕍ ᕎ ᕏ
> U+155x  ᕐ ᕑ ᕒ ᕓ ᕔ ᕕ ᕖ ᕗ ᕘ ᕙ ᕚ ᕛ ᕜ ᕝ ᕞ ᕟ
> U+156x  ᕠ ᕡ ᕢ ᕣ ᕤ ᕥ ᕦ ᕧ ᕨ ᕩ ᕪ ᕫ ᕬ ᕭ ᕮ ᕯ
> U+157x  ᕰ ᕱ ᕲ ᕳ ᕴ ᕵ ᕶ ᕷ ᕸ ᕹ ᕺ ᕻ ᕼ ᕽ ᕾ ᕿ
> U+158x  ᖀ ᖁ ᖂ ᖃ ᖄ ᖅ ᖆ ᖇ ᖈ ᖉ ᖊ ᖋ ᖌ ᖍ ᖎ ᖏ
> U+159x  ᖐ
> ...

