Thanks to Art Rhyno, we've successfully OCRed documents from our 
collection. He created a few training files for us and they work great. 
Will post updates in the future. Message us if you have questions.
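
For anyone wanting to try a similar batch run, here is a rough sketch only, not the exact setup described above. It assumes the `tesseract` command-line tool is installed, that the training file is named `iku.traineddata` (an assumed name; use whatever your own file is called), and that the scans are JPEGs in one folder. The helper names are made up for illustration. It also reports how much of each result falls in the Unified Canadian Aboriginal Syllabics block (U+1400 to U+167F), as a quick sanity check on the output.

```python
import subprocess
from pathlib import Path

# Unified Canadian Aboriginal Syllabics block: U+1400 to U+167F.
SYLLABICS_START, SYLLABICS_END = 0x1400, 0x167F


def build_command(image, out_base, lang="iku", tessdata_dir=None):
    """Build a tesseract command line.

    'iku' is an assumed traineddata name; tessdata_dir is the folder
    that holds the .traineddata file, if it is not in the default place.
    """
    cmd = ["tesseract", str(image), str(out_base), "-l", lang]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def syllabics_ratio(text):
    """Return the fraction of non-whitespace characters that are syllabics."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(SYLLABICS_START <= ord(c) <= SYLLABICS_END for c in chars)
    return hits / len(chars)


def ocr_folder(folder, lang="iku", tessdata_dir=None):
    """OCR every .jpg in a folder and print the syllabics ratio per page."""
    for image in sorted(Path(folder).glob("*.jpg")):
        out_base = image.with_suffix("")  # tesseract appends .txt itself
        subprocess.run(build_command(image, out_base, lang, tessdata_dir),
                       check=True)
        text = Path(str(out_base) + ".txt").read_text(encoding="utf-8")
        print(image.name, round(syllabics_ratio(text), 2))
```

Calling `ocr_folder("scans")` would process a hypothetical `scans` folder. A low ratio on a page known to be in Inuktitut suggests the wrong language file was picked up or the scan needs cleaning first.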

On Tuesday, June 23, 2015 at 12:40:49 PM UTC-4, Riel G wrote:
>
> Hello everyone. Greetings from Nunavut, Canada. 
>
> I'm fairly new to the technical side of OCR and Tesseract in general, so 
> my apologies in advance.
>
> I've been OCRing quite a bit using Adobe Acrobat. It works quite well for 
> English, but offers no support at all for the written language of 
> Inuktitut <https://en.wikipedia.org/wiki/Inuktitut>. The Inuktitut 
> language is native to the northeastern part of Canada and uses a non-Roman 
> script called "syllabics 
> <https://en.wikipedia.org/wiki/Canadian_Aboriginal_syllabics>", which was 
> introduced by missionaries in the 1800s and is still used today. Some Cree 
> dialects also use syllabics. Here's a link to the Unified Canadian 
> Aboriginal Syllabics Official Unicode Consortium code chart 
> <http://www.unicode.org/charts/PDF/U1400.pdf> (PDF) - Wikipedia link 
> <https://en.wikipedia.org/wiki/Unified_Canadian_Aboriginal_Syllabics_%28Unicode_block%29>
> .
>
> Since Windows Vista, every Windows OS comes prepackaged with a font named 
> Euphemia <https://en.wikipedia.org/wiki/Euphemia_%28typeface%29>, which 
> is a Unicode font that supports syllabics. When you activate the Inuktitut 
> keyboard and hit the caps lock, you can type syllabics. Apple also supports 
> Euphemia, and a recently released app gives users an Inuktitut keyboard 
> <https://itunes.apple.com/ca/app/inuktut-naqittautit/id993521673?mt=8/>. 
> Android does not support it yet. There are also many pre-Unicode 
> typefaces 
> <http://www.pirurvik.ca/en/productions/iu-computing/font-download> that 
> look slightly different than Euphemia syllabics, which I realize may be an 
> issue.
>
> I've been able to manually fix OCR errors in Adobe Acrobat under Text 
> Recognition -> Find All Suspects -> changing the font to Euphemia -> 
> manually typing the correct text in the red box (see attached image for 
> instructions). Though this was a step forward, we're looking for a batch 
> production OCR solution. OCRing Inuktitut using Acrobat gives us results 
> like this:
>
> Neither Adobe nor ABBYY has responded to our requests to have Inuktitut 
> added as a language in their text recognition feature.
>
> Is there something we can try with Tesseract? I downloaded it but haven't 
> made much progress. We'd love to be able to search our older scanned PDFs 
> using syllabics and eventually put our historic documents on our website, 
> which would then come up in Google search results. Any help would be 
> greatly appreciated. I've attached a jpg of sample text from the Nunavut 
> Land Claims Agreement <http://nlca.tunngavik.com> (table of contents for 
> Article 26) if anyone needs some content for testing.
>
> ᓇᑯᕐᒦᒃ / Thank you!