[tesseract-ocr] Tesseract config for simple single-word text and questions about learning

2018-04-28 Thread Lorenzo Blz

Hi, I'm using tesseract to recognize small fragments of text like this 
(actual images I'm using):

[sample images omitted]

Numbers are fixed length (7 digits) and letters are always 2 uppercase 
chars. I'm using a whitelist (a different one depending on whether the 
fragment is text or digits; I know this in advance), and it works 
reasonably well. The size of these fragments is fixed; I rescale them to 
the same height (54 pixels, though I could change it or add some borders). 
These are extracted from smartphone pictures, so the original resolution 
varies a lot.
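Since the fragment formats are fixed (exactly 7 digits, or exactly 2 uppercase letters), a cheap post-OCR sanity check can flag misreads before they go anywhere. A minimal sketch (the function name and patterns are mine, not part of tesserocr):

```python
import re

# Expected fragment formats described above: exactly 7 digits,
# or exactly 2 uppercase letters.
DIGITS_RE = re.compile(r"\d{7}")
LETTERS_RE = re.compile(r"[A-Z]{2}")

def is_valid_fragment(text: str, is_digits: bool) -> bool:
    """Check an OCR result against the known fixed format."""
    cleaned = text.strip()  # tesseract appends trailing newlines
    pattern = DIGITS_RE if is_digits else LETTERS_RE
    return bool(pattern.fullmatch(cleaned))
```

A result like `'5748788\n\n'` passes the digit check; anything that fails can be queued for a retry with different parameters or for manual review.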

I'm using lang "eng+ita" because this way I get better results. I'm also 
using user patterns, but they are not helping much. I'm using the API 
through the tesserocr Python bindings.

I think there are many parameters I could fine-tune. I tried a few 
(load_system_dawg, load_freq_dawg, textord_min_linesize), but none of them 
improved the results (a very small textord_min_linesize=0.2 made them 
worse, so they are being used). I've read the FAQ and the docs, but there 
are really too many parameters to understand what to change and how.

In particular, my current problem is adaptive learning: when I process a 
large batch of pictures, the result varies depending on the other 
fragments. Fragments that are perfectly readable and correctly classified 
when processed individually give different, wrong results when processed 
in a batch (I mean reusing the same API instance for multiple images).

I tried to disable it, but it looks like it cannot be disabled when using 
multiple languages(?).
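For reference, Tesseract does expose parameters that are meant to control the adaptive classifier; whether they fully take effect when multiple languages are loaded is exactly the open question above, so treat this as something to verify. A config-file sketch (plain "name value" lines, loadable as a config or set one by one via tesserocr's SetVariable):

```text
classify_enable_learning         0
classify_enable_adaptive_matcher 0
```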

If I use only "ita" (and no whitelist, no learning), the first image in 
this post is recognized as (text [confidence]):

('5748788\n\n', [81])
('5748788\n\n', [81])
('5748788\n\n', [81])
('5748788\n\n', [81])

With learning (multiple calls, no whitelist, lang: ita):

('5748788\n\n', [81])
('5748788\n\n', [81])
('5748788\n\n', [90])
('5748788\n\n', [90])

so it improves to a higher confidence (I do not know how much the 
confidence value matters in practice). It looks like learning is doing 
something good even with no whitelist (I could use the whitelist anyway, 
just to be sure, but the starting point looks better).

I'm wondering if I can do some kind of "warm-up" with learning enabled 
and turn it off later (I'll try this today). But how many samples do I 
need? And it seems a little hacky.

Or maybe there is some way to print debug information from the learning 
part, to see which parameters are changed and set them manually later (I 
tried a few debug params but got no output).

Or maybe it is quite easy to manually find good parameters for this kind of 
regular text to get close to 90 confidence.

On the "AT" fragment I get 89 confidence, which seems quite low for this 
kind of simple, clean text.

What I need are (good) consistent results in all situations for the same 
image. What do you think?


Thanks, bye

Lorenzo

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/563f2458-d63f-4198-8e73-abc448112423%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] tesseract 4 beta: openCL usage

2018-04-28 Thread Janpieter Sollie
Oops, I forgot the attachment.  Here it is :-)
I believe it will help you decide further. What it CAN do:
- find white lines
- map a zone to a certain character probability
- train itself.
It does NOT decide whether something is a certain character or not; this
needs to be decided on the host, not the GPU.


Re: [tesseract-ocr] tesseract 4 beta: openCL usage

2018-04-28 Thread ShreeDevi Kumar
@zdenko This discussion may be better suited for the tesseract-dev forum, or do
you want to track it as an issue on github?

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Re: [tesseract-ocr] tesseract 4 beta: openCL usage

2018-04-28 Thread Janpieter Sollie
Would it be a problem for you if I rewrite the OpenCL engine completely,
and you people help me link the tesseract kernel to the OpenCL
engine parts?
In the attachment, I already have a list of features I'd like to port to
OpenCL.  As this uses the GPU heavily, I will implement multi-card
support on the host.
Would it be a problem for you guys to think of tesseract 5.0 as a milestone?


2018-04-27 15:53 GMT+00:00 Janpieter Sollie :

> If I'm right, a neural net is about the engine parts, not the image
> characterization/rendering method, correct? Because I see many
> presentations, and most of them talk about the history of tesseract, but
> that's not what I need.
>
> 2018-04-27 14:27 GMT+00:00 ShreeDevi Kumar :
>
>> Please see
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM
>>
>> For info about neural nets used by tesseract
>>
>> On Fri 27 Apr, 2018, 7:48 PM Janpieter Sollie, 
>> wrote:
>>
>>> I had a quick thought about what you could offload to OpenCL.  I will
>>> need some help from you people (I am a C programmer, not C++, at least not
>>> experienced) to do the host code, but this algorithm is perfectly
>>> optimizable in OpenCL.
>>> The way I'd do it:
>>>
>>> Prerequisites:
>>> - you can define 65k offsets (x,y) in which you want the OpenCL engine
>>> to look for dots (x,y); the optimal position and closest neighbour can be
>>> reported in the first part.
>>> - you can make a RAW image of both the image and the characters. The size
>>> of the letters doesn't matter, but they must be trimmed properly.
>>>
>>> 1. You give me a matrix of 256*256 offsets (short, short) to analyze,
>>> with a max of 64 dots (char, char) (I assume these are neurons) to analyze
>>> at each offset.
>>> This gives you a starting memory usage of 2⁸ * 2⁸ * 4 + 64*2 = 256K +
>>> 128 bytes.
>>> Each dot MUST contain a black pixel.
>>> Then we add the image: this is a character image of some maximum size (to
>>> be discussed with you guys); I assume a 4096*4096 pixel image would be
>>> fine, especially when a character can contain a 4x4 matrix defining a 0/1
>>> (black/white) value.
>>> 2. Then I follow these steps in the OpenCL engine:
>>> - we analyze the neurons:
>>> - draw a circle of x black points around each one (this circle can be
>>> 0, in which case the neuron is white), such that the circle is completely
>>> black.
>>> - when we encounter one or more white points, a direction for the
>>> points is calculated. If there's no whitespace on the other side, the
>>> neuron offset is moved by x/2 in the opposite direction and "analyze
>>> neuron" is restarted for x/2; else, quit the "analyze neuron" part. This
>>> can be done in local memory, in which case it will cost you 256*2 = 512
>>> bytes of local RAM to determine the optimal neuron position. Most graphics
>>> cards have a limit of 32K local RAM, so this is no problem :-)
>>> - determine the closest dot next to this one:
>>> for each dot != this one, draw a line of black points; if no line
>>> can be found, jump to the next dot.
>>> Watch the distance. If it's smaller than the previous neuron's, and this
>>> dot id doesn't have a link pointing from the destination to this one, save
>>> the dot id.
>>> So, at the end:
>>> - each neuron of each offset is optimally centered, in a return
>>> matrix of 256*256*64*2 = 2²³ = 8M of memory
>>> - each neuron has a unique id for its closest neighbour, to which
>>> it's guaranteed to be attached. An id of -1 means no id could be found.
>>> 256*256*64 = 4M of memory
>>>
>>> 3. We focus on the neuron list -> character mapping. This is a separate
>>> kernel. A "probability" factor is involved here, but I will think about it
>>> further. I suggest using a list of 64 character images at once, otherwise
>>> you need lots of memory :-)
>>> - define the top, left and right neurons. Create a zoom factor for the
>>> image. Calculate the aspect ratio. The probability is
>>> 1 - diff(aspect_ratio1, aspect_ratio2).
>>> - analyze each link in the font character: total probability *=
>>> (found_link_length / total_link_length)
>>> - report the probability.
>>> On the PC: the character with the highest probability is the character
>>> you're looking for. Be aware that you need to compare the probabilities
>>> of the different offsets if they overlap.
>>>
>>> If the tesseract project can use this, please let me know.
>>>
>>> 2018-04-27 9:36 GMT+00:00 Zdenko Podobny :
>>>
 The only documentation we have is the code itself ;-) But you can start by
 searching for the opencl issues in the tesseract issue tracker on github...

 Zdenko


 On Fri 27. 4. 2018 at 10:56, Janpieter Sollie 
 wrote:

> I'd be glad to help.  Using tesseract 4, I am able to achieve 90%
> accuracy on OpenCL.  I do not have any experience with neural networks (I'm
> just a high-school (no college) educated IT-support guy with some knowledge
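As an aside, the memory figures in the quoted OpenCL sketch are easy to double-check. A few lines of plain Python (just the arithmetic from the proposal, nothing OpenCL-specific; `aspect_probability` is a hypothetical name for the 1 - diff(aspect_ratio1, aspect_ratio2) step):

```python
# Offsets matrix: 256*256 entries of (short, short) -> 4 bytes each,
# plus 64 dots of (char, char) -> 2 bytes each.
offsets_bytes = 256 * 256 * 4        # 262144 bytes = 256 KiB
dots_bytes = 64 * 2                  # 128 bytes

# Return matrix: a centered (char, char) position for every neuron.
centered_bytes = 256 * 256 * 64 * 2  # 8388608 bytes = 2**23 = 8 MiB

# Closest-neighbour ids: one byte per neuron (-1 = none found).
neighbour_bytes = 256 * 256 * 64     # 4194304 bytes = 4 MiB

def aspect_probability(ar1: float, ar2: float) -> float:
    """Probability from step 3: 1 - diff(aspect_ratio1, aspect_ratio2)."""
    return 1.0 - abs(ar1 - ar2)

print(offsets_bytes, dots_bytes, centered_bytes, neighbour_bytes)
```

The figures match the 256K + 128 bytes, 8M, and 4M quoted above.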