[tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-02 Thread Ludvig F Aarstad
> Hm, in Norwegian it isn't that rare. Or at least shouldn't be ;). Æ is the uppercase version of æ, and it would never occur in the middle of a word. > I find it strange that it has been left out altogether. What must I do to get it in there? > Tuesday, 3 January 2017

[tesseract-ocr] Re: Preprocessing ideas besides cropping/resizing/thresholding and identifying individual letters.

2017-01-02 Thread Tom Morris
p.s. If you post some example images, I'm happy to knock together a quick example for you. It looks like the native file format is AVI and AVI files have the ability to incorporate streams of not only video and audio, but also closed captioning info and other metadata. Is it safe to assume

[tesseract-ocr] Re: I can't get accurate ocr of this can anyone help with settings?

2017-01-02 Thread Tom Morris
On Monday, January 2, 2017 at 12:49:38 AM UTC-5, jean-charles compagnon wrote: > I have attached the captcha that I cannot decode. It says YNXAJB. Or do you mean your computer program can't decode it? In that case, it sounds like it is working as intended.

[tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-02 Thread Tom Morris
First, the latest version is 3.04 (although there's also a tag for 3.05). Second, there will soon (hopefully) be a release for 4.00 which will make 3.x obsolete. Having said that, it looks like the root cause of your problem is that Tesseract doesn't know Æ is a possible letter for Norwegian.
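One way to confirm the diagnosis above is to check whether the character appears in the language's unicharset. The following is a minimal sketch, not an official tool: it assumes the unicharset has already been extracted from the traineddata file (for example with `combine_tessdata -u`), that its first line is the entry count, and that every later line begins with the character itself followed by property fields. The helper names and the sample content are hypothetical.

```python
def unicharset_chars(text):
    """Return the set of characters listed in unicharset file content."""
    chars = set()
    # Assumption: line 1 holds the entry count, so it is skipped;
    # each remaining line starts with the unichar, then its properties.
    for line in text.splitlines()[1:]:
        fields = line.split()
        if fields:
            chars.add(fields[0])
    return chars


def missing_chars(text, wanted):
    """List the characters in `wanted` that the unicharset does not contain."""
    known = unicharset_chars(text)
    return [c for c in wanted if c not in known]


# Hypothetical, heavily simplified sample; real files carry more fields per line.
sample = "4\nNULL 0 NULL 0\na 3 0\næ 3 0\nA 5 0\n"

if __name__ == "__main__":
    print(missing_chars(sample, "æøåÆØÅ"))
```

If a needed letter is reported as missing, the language data would have to be retrained or replaced; the check itself changes nothing.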

[tesseract-ocr] Re: Comparing GetComponentImages to iterate_level

2017-01-02 Thread T G
I'm still hoping to learn how to use GetComponentImages / SetRectangle better, but I found a workaround to get what I need out of GetIterator / iterate_level... BoundingBoxInternal is not something I can find documentation for, but I saw a reference to it
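For anyone following this thread, here is a rough sketch of how the two approaches could be compared side by side with tesserocr. It is an illustration under assumptions, not the poster's code: `compare_boxes` and `component_box_to_ltrb` are hypothetical helper names, the text-line level is an arbitrary choice, and the result has not been checked against any particular Tesseract version. `GetComponentImages` reports each box as a dict with `x`, `y`, `w`, `h`, while the iterator's `BoundingBox` returns `(left, top, right, bottom)`, so the sketch converts the former before comparing.

```python
def component_box_to_ltrb(box):
    """Convert an {'x', 'y', 'w', 'h'} dict to (left, top, right, bottom)."""
    return (box["x"], box["y"], box["x"] + box["w"], box["y"] + box["h"])


def compare_boxes(image_path):
    """Collect text-line boxes from GetComponentImages and from the result iterator."""
    # Imported lazily so the helper above can be used without tesserocr installed.
    from tesserocr import PyTessBaseAPI, RIL, iterate_level

    level = RIL.TEXTLINE
    with PyTessBaseAPI() as api:
        api.SetImageFile(image_path)
        # Each entry is (image, box, block_id, paragraph_id).
        from_components = [
            component_box_to_ltrb(box)
            for _, box, _, _ in api.GetComponentImages(level, True)
        ]
        api.Recognize()
        iterator = api.GetIterator()
        from_iterator = []
        if iterator is not None:
            # Read each box immediately; the iterator object is reused per step.
            from_iterator = [
                item.BoundingBox(level) for item in iterate_level(iterator, level)
            ]
    return from_components, from_iterator
```

Printing the two lists next to each other for the same image should show where the reported regions diverge.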

[tesseract-ocr] Tesseract v3.03 and norwegian language

2017-01-02 Thread Ludvig F Aarstad
Greetings and salutations fellow OCR'ers ;). I have been playing around with various PowerShell modules for reading text from an image, but I have landed on using tesseract directly. It all works fine, and it reads like a dream :). However, it seems it is having problems with

[tesseract-ocr] Re: Comparing GetComponentImages to iterate_level

2017-01-02 Thread T G
I've continued to spend a little time each day working on my problem. I've found something that fuels my desire to understand what GetComponentImages does differently from iterate_level.
    from PIL import Image
    Image.MAX_IMAGE_PIXELS=10
    from tesserocr import PyTessBaseAPI, RIL
    image =

Re: [tesseract-ocr] I can't get accurate ocr of this can anyone help with settings?

2017-01-02 Thread Allistair C
The whole point of a captcha is to evade automated reading. That's why letters are very close together and letters are heavily rotated off a consistent baseline. OCR is designed for normal text input so you need to do clever preprocessing here first.