Re: [tesseract-ocr] Inuktitut OCR problems - ᐃᓄᑦᑎᑐᑦ (Euphemia typeface)

Riel G Thu, 25 Jun 2015 07:55:42 -0700

Thanks to both of you. 

I've made some progress on this front and will update you all shortly. Art 
helped us pick a box editor, so we're currently correcting the training boxes 
for some non-Unicode fonts, such as ProSyl and OldSyl.
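
For anyone following along who has not seen a box file: in Tesseract 3.x 
training, each line holds a symbol followed by left, bottom, right, top and 
page number, with the origin at the bottom-left of the image. Below is a 
minimal Python sketch of reading and relabelling one entry. The sample line, 
the "w" to "ᐃ" correction and the helper names are hypothetical, not taken 
from our data or from any particular box editor.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One entry of a Tesseract 3.x box file."""
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right so the symbol field is kept whole,
    # whatever characters it contains.
    symbol, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def relabel(box: Box, new_symbol: str) -> Box:
    # Keep the coordinates, replace only the recognised symbol.
    return box._replace(symbol=new_symbol)


def format_box(box: Box) -> str:
    return f"{box.symbol} {box.left} {box.bottom} {box.right} {box.top} {box.page}"


# Hypothetical example: a glyph that was boxed as ASCII "w" is corrected
# to the syllabic it actually depicts.
wrong = parse_box_line("w 10 20 30 48 0")
fixed = relabel(wrong, "ᐃ")
print(format_box(fixed))
```

This is essentially what a box editor does behind its GUI: the coordinates 
stay put and only the symbol column changes.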

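For testing the iku.traineddata file Tom mentions below, a rough sketch of 
driving the Tesseract command line from Python follows. It assumes the 
tesseract executable is on the PATH and that iku.traineddata has already 
been copied into the tessdata directory; the image and output names are 
placeholders.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="iku"):
    # Standard invocation: tesseract <image> <outputbase> -l <lang>
    return ["tesseract", str(image), str(output_base), "-l", lang]


def run_ocr(image, output_base, lang="iku"):
    """Run Tesseract and return the path of the text file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    # Tesseract appends ".txt" to the output base for plain-text output.
    return Path(str(output_base) + ".txt")


if __name__ == "__main__":
    # Placeholder file names; substitute a real scan.
    print(" ".join(build_tesseract_command("sample_page.jpg", "sample_page")))
```

Keeping the command construction separate from the call makes it easy to 
swap in a locally trained language code in place of "iku" when comparing 
results.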

On Thursday, June 25, 2015 at 12:22:32 AM UTC-4, Tom Morris wrote:
>
> In addition to Art's training data, you might also want to test the IKU 
> language data for Tesseract 3.04 that Google released a few hours ago:
>
>    https://github.com/tesseract-ocr/tessdata/blob/master/iku.traineddata
>
> It was generated from the source language data here:
>
>     https://github.com/tesseract-ocr/langdata/tree/master/iku
>
> and I think this is the script data:
>
>     https://github.com/tesseract-ocr/langdata/blob/master/Canadian_Aboriginal.unicharset
>     https://github.com/tesseract-ocr/langdata/blob/master/Canadian_Aboriginal.xheights
>
> The fact that this is in the standard Google implementation now may also 
> mean that you can (or soon will be able to) get IKU OCR search results for 
> books in Google Books.  That might be worth testing at some point.
>
> Tom
>
>
> On Tuesday, June 23, 2015 at 1:51:26 PM UTC-4, Art Rhyno wrote:
>>
>>  Hi Riel,
>>
>>  
>>
>> I did some volunteer work on Inuktitut OCR for an ongoing project 
>> collaboration between OurDigitalWorld.org and the Multicultural History 
>> Society of Ontario (MHSO). There is a presentation on that project here 
>> [1], but I was focused only on the OCR of the scanned titles in the MHSO 
>> collection. One of these is "Inuit Today", an Inuktitut/English publication 
>> from the 1970s. 
>>
>>  
>>
>> The training files I created are on GitHub [2]. I have attached the 
>> result of using the trained data set to this message, but I was relying on 
>> the English dataset for numbers, so none of the numeric characters are in 
>> the sample. Sad to say, I have no facility in the Inuktitut language and I 
>> was dealing with one publication and one font, so I was out of my depth for 
>> much of this, but it might give you a starting point. I would be happy to 
>> walk you through the process I went through for the dataset. The ability to 
>> add your own fonts is an area where Tesseract shines, though it's sad that 
>> the companies you approached didn't step forward to add it to the 
>> commercial options, since it is a major language in Canada.
>>
>>  
>>
>> art
>>
>> ---
>>
>> 1. http://www.accessola2.com/superconference2014/sessions/329.pdf
>>
>> 2. https://github.com/OurDigitalWorld/odw-font-training
>>
>>  
>>
>> *From:* [email protected] [mailto:[email protected]] *On 
>> Behalf Of *Riel Gallant
>> *Sent:* Tuesday, June 23, 2015 11:52 AM
>> *To:* [email protected]
>> *Subject:* [tesseract-ocr] Inuktitut OCR problems - ᐃᓄᑦᑎᑐᑦ (Euphemia 
>> typeface)
>>
>>  
>>  
>> Hello everyone. Greetings from Nunavut, Canada. 
>>
>> I'm fairly new to the technical side of OCR and Tesseract in general, so 
>> my apologies in advance.
>>
>> I've been OCRing quite a bit using Adobe Acrobat. It works quite well for 
>> English, but offers no support at all for written 
>> Inuktitut <https://en.wikipedia.org/wiki/Inuktitut>. The Inuktitut 
>> language is native to the northeastern part of Canada and uses a non-Roman 
>> script called "syllabics 
>> <https://en.wikipedia.org/wiki/Canadian_Aboriginal_syllabics>", which 
>> was introduced by missionaries in the 1800s and is still used today. Some 
>> Cree dialects also use syllabics. Here's a link to the Unified Canadian 
>> Aboriginal Syllabics Official Unicode Consortium code chart 
>> <http://www.unicode.org/charts/PDF/U1400.pdf> (PDF) - Wikipedia link 
>> <https://en.wikipedia.org/wiki/Unified_Canadian_Aboriginal_Syllabics_%28Unicode_block%29>
>> .
>>
>> Since Windows Vista, every Windows OS has come prepackaged with a font named 
>> Euphemia <https://en.wikipedia.org/wiki/Euphemia_%28typeface%29>, which 
>> is a Unicode font that supports syllabics. When you activate the Inuktitut 
>> keyboard and hit the caps lock, you can type syllabics. Apple also supports 
>> Euphemia; a recent app gives users an Inuktitut keyboard 
>> <https://itunes.apple.com/ca/app/inuktut-naqittautit/id993521673?mt=8/>. 
>> Android does not support it yet. There are also many pre-Unicode 
>> typefaces 
>> <http://www.pirurvik.ca/en/productions/iu-computing/font-download> that 
>> look slightly different from Euphemia syllabics, which I realize may be an 
>> issue.
>>
>> I've been able to manually fix OCR errors in Adobe Acrobat under Text 
>> Recognition -> Find All Suspects -> changing the font to Euphemia -> 
>> manually typing the correct text in the red box (see attached image for 
>> instructions). Though this was a step forward, we're looking for a batch 
>> production OCR solution. OCRing Inuktitut using Acrobat gives us results 
>> like this:
>>
>> [image: Image removed by sender.]
>>
>> Neither Adobe nor ABBYY has responded to our requests to have Inuktitut 
>> added as a language in their text recognition features.
>>
>> Is there something we can try with Tesseract? I downloaded it but haven't 
>> made much progress. We'd love to be able to search our older scanned PDFs 
>> using syllabics and eventually put our historic documents on our website, 
>> which would then come up in Google search results. Any help would be 
>> greatly appreciated. I've attached a jpg of sample text from the Nunavut 
>> Land Claims Agreement <http://nlca.tunngavik.com> (table of contents for 
>> Article 26) if anyone needs some content for testing.
>>
>> ᓇᑯᕐᒦᒃ / Thank you!
>>
>> https://en.wikipedia.org/wiki/Unified_Canadian_Aboriginal_Syllabics_%28Unicode_block%29
>>
>> *Unified Canadian Aboriginal Syllabics*
>> Official Unicode Consortium code chart 
>> <http://www.unicode.org/charts/PDF/U1400.pdf> (PDF)
>>
>>          0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
>> U+140x   ᐀  ᐁ  ᐂ  ᐃ  ᐄ  ᐅ  ᐆ  ᐇ  ᐈ  ᐉ  ᐊ  ᐋ  ᐌ  ᐍ  ᐎ  ᐏ
>> U+141x   ᐐ  ᐑ  ᐒ  ᐓ  ᐔ  ᐕ  ᐖ  ᐗ  ᐘ  ᐙ  ᐚ  ᐛ  ᐜ  ᐝ  ᐞ  ᐟ
>> U+142x   ᐠ  ᐡ  ᐢ  ᐣ  ᐤ  ᐥ  ᐦ  ᐧ  ᐨ  ᐩ  ᐪ  ᐫ  ᐬ  ᐭ  ᐮ  ᐯ
>> U+143x   ᐰ  ᐱ  ᐲ  ᐳ  ᐴ  ᐵ  ᐶ  ᐷ  ᐸ  ᐹ  ᐺ  ᐻ  ᐼ  ᐽ  ᐾ  ᐿ
>> U+144x   ᑀ  ᑁ  ᑂ  ᑃ  ᑄ  ᑅ  ᑆ  ᑇ  ᑈ  ᑉ  ᑊ  ᑋ  ᑌ  ᑍ  ᑎ  ᑏ
>> U+145x   ᑐ  ᑑ  ᑒ  ᑓ  ᑔ  ᑕ  ᑖ  ᑗ  ᑘ  ᑙ  ᑚ  ᑛ  ᑜ  ᑝ  ᑞ  ᑟ
>> U+146x   ᑠ  ᑡ  ᑢ  ᑣ  ᑤ  ᑥ  ᑦ  ᑧ  ᑨ  ᑩ  ᑪ  ᑫ  ᑬ  ᑭ  ᑮ  ᑯ
>> U+147x   ᑰ  ᑱ  ᑲ  ᑳ  ᑴ  ᑵ  ᑶ  ᑷ  ᑸ  ᑹ  ᑺ  ᑻ  ᑼ  ᑽ  ᑾ  ᑿ
>> U+148x   ᒀ  ᒁ  ᒂ  ᒃ  ᒄ  ᒅ  ᒆ  ᒇ  ᒈ  ᒉ  ᒊ  ᒋ  ᒌ  ᒍ  ᒎ  ᒏ
>> U+149x   ᒐ  ᒑ  ᒒ  ᒓ  ᒔ  ᒕ  ᒖ  ᒗ  ᒘ  ᒙ  ᒚ  ᒛ  ᒜ  ᒝ  ᒞ  ᒟ
>> U+14Ax   ᒠ  ᒡ  ᒢ  ᒣ  ᒤ  ᒥ  ᒦ  ᒧ  ᒨ  ᒩ
>>
>> ...
>
>
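
As a rough sanity check on OCR output, one can measure how much of the text 
falls inside the Unified Canadian Aboriginal Syllabics block (U+1400 to 
U+167F, the range covered by the code chart linked in the original message). 
The sketch below is only an illustration; the function names are made up, 
and a low ratio merely hints that the recogniser fell back to Latin 
look-alikes rather than proving an error.

```python
UCAS_START = 0x1400  # first code point of the block
UCAS_END = 0x167F    # last code point of the block


def is_syllabic(ch: str) -> bool:
    """True if ch lies in the Unified Canadian Aboriginal Syllabics block."""
    return UCAS_START <= ord(ch) <= UCAS_END


def syllabic_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are syllabics."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_syllabic(c) for c in chars) / len(chars)


if __name__ == "__main__":
    print(syllabic_ratio("ᐃᓄᑦᑎᑐᑦ"))
    print(syllabic_ratio("Inuktitut"))
```

Mixed Inuktitut/English pages will naturally score somewhere in between, so 
any threshold would need to be chosen per collection.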

