Re: [tesseract-ocr] does it make sense to train existing languages? how to fix repeatedly wrong letters?

2018-04-02 Thread ShreeDevi Kumar
My suggestion would be to do post processing of the OCR output.
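
A minimal sketch of such a post-processing pass, in Python (an assumption; the thread names no language). The rules cover three of the four confusions listed in the quoted message below. The contexts they match (start of line, standalone token, a following digit) are guesses about the page format, and `fix_ocr` is a hypothetical name. The "~ read as -" case is left out because a plain hyphen is also legitimate text.

```python
import re

# Each rule: (compiled pattern, replacement). Contexts are assumptions.
RULES = [
    # "00" at the start of a line, followed by whitespace -> "oo"
    (re.compile(r"^00(?=\s)", re.MULTILINE), "oo"),
    # A standalone three-letter token made only of "I"/"l" -> Roman "III"
    (re.compile(r"\b[Il]{3}\b"), "III"),
    # A standalone two-letter token made only of "I"/"l" -> Roman "II"
    (re.compile(r"\b[Il]{2}\b"), "II"),
    # "%" directly before a number (optional spaces) -> "*/~"
    (re.compile(r"%(?=\s*\d)"), "*/~"),
]


def fix_ocr(text):
    """Apply every rule in order and return the corrected text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

Run it over the text files Tesseract writes and compare a sample against the scans before trusting it on the whole corpus; the "III" rule, for instance, would also rewrite the English word "Ill".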

On Mon 2 Apr, 2018, 6:09 PM JP T wrote:

> Hi
>
> I don't really understand the consequences of training.
>
> My problem:
> I've got tons of pages in a special format (a "one place study" about the
> historic inhabitants of a town).
>
> tesseract repeatedly fails on a few special tokens:
> oo (oh-oh) at the start of a line, meaning "wedding", is often read as 00
> (zero zero)
> Roman numerals 2 and 3 (II, III) in Arial are read as two lowercase L's (ll)
> or as an uppercase I plus two lowercase L's (Ill)
> */~ (born around) is read as percent %
> ~ is read as -
>
> My scans are of almost perfect quality (I used Fred's scripts), so there is
> nothing more I can do on that side.
> Adding oo to the user words did not help.
>
> Can I use training to solve these, or should I instead write a script that
> fixes the mistakes after OCR?
> The problem is that OCR needs to know some semantics. The Arial letters
> themselves hardly give a hint as to which reading is correct.
>
> thanks
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/5cd68a84-a7d2-4185-91c9-115c9e62d1d4%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>



[tesseract-ocr] in the script data directory , script data of English is Latin.traineddata ?

2018-04-02 Thread notoriousterran
Hi,
In the script data directory (tess_best/script), is Latin.traineddata the
script data for English?
Waiting for an answer.

Thank you



Re: [tesseract-ocr] Extracting pristine rasterized text

2018-04-02 Thread ShreeDevi Kumar
Thank you for the detailed info.

My suggestion is to try recognition with eng.traineddata from the
tessdata_fast repository with --oem 1.
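
A minimal sketch of that suggestion, assuming Python's `subprocess` module and a local copy of tessdata_fast at a path of your choosing. `--tessdata-dir`, `--oem` and `-l` are options listed in Tesseract 4's usage text; `build_fast_cmd` and `time_ocr` are illustrative helper names, and nothing here predicts how much faster the fast model will be.

```python
import subprocess
import time


def build_fast_cmd(image, outbase, tessdata_dir):
    """Build a tesseract command line that uses the LSTM engine (--oem 1)."""
    # Image and output base come first, options after them.
    return [
        "tesseract", image, outbase,
        "--tessdata-dir", tessdata_dir,  # directory holding eng.traineddata
        "--oem", "1",                    # 1 = LSTM engine only
        "-l", "eng",
    ]


def time_ocr(image, outbase, tessdata_dir):
    """Run tesseract once and return the wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(build_fast_cmd(image, outbase, tessdata_dir), check=True)
    return time.perf_counter() - start
```

Calling `time_ocr` once with the current tessdata directory and once with the tessdata_fast directory gives a like-for-like comparison on the same page.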


On Tue 3 Apr, 2018, 3:13 AM Patrick Ramsey wrote:

> Answers below inline. And thank you very much for your help :)
>
> |PTR
>
> On Friday, March 30, 2018 at 2:00:18 AM UTC-7, shree wrote:
>>
>> Please check GitHub/issues for similar reports and suggestions.
>>
>> Also specify,
>>
>> Which version/commit of tesseract 4
>>
>
> commit hash: 40f43111e05b3dd2f2f8aeae3aba33016523c881
> tag: 4.0.0-beta.1
>
>> Which traineddata file, from which repo
>>
>
> eng.traineddata from https://github.com/tesseract-ocr/tessdata at commit
> 9b2e3f6642285b3e9a7a5852e5b10259e42d5510
>
>> Which o/s
>>
>
> Ubuntu 17.10 on amd64
>
>> tesseract -v
>>
>
> tesseract 4.0.0-beta.1
>  leptonica-1.74.4
>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.1) : libpng 1.6.34 :
> libtiff 4.0.8 : zlib 1.2.11 : libwebp 0.6.0 : libopenjp2 2.2.0
>
>  Found AVX2
>  Found AVX
>  Found SSE
>
>> On Fri 30 Mar, 2018, 2:19 PM Patrick Ramsey wrote:
>>
>>> Hi!
>>>
>>> So, I am running tesseract4 on clean, 1-bit images of rasterized text
>>> (not printed and scanned).  I'm getting very accurate output, as expected,
>>> but tesseract is taking about 1 second to process a single page on a core
>>> i7 cpu, and that seems a lot longer than I'd have expected.
>>>
>>> I've been trying to enable debug output so that I can see what's taking
>>> the most time, to see if there is anything that I could get away with
>>> turning off to speed it up (since I don't need to account for e.g. dirt on
>>> the lens), but thus far I'm feeling pretty stupid.  So:
>>>
>>> A) is there any straightforward way to get more information on what
>>> tesseract is actually doing? (I've built with --enable-debug and it doesn't
>>> seem to have changed the output on the command line)
>>> B) are there any control parameters you folks would suggest setting to
>>> speed up image processing/turn off unnecessary work, given the inputs I've
>>> described?
>>>
>>> Many thanks,
>>>
>>> PTR


Re: [tesseract-ocr] Extracting pristine rasterized text

2018-04-02 Thread Patrick Ramsey
Answers below inline. And thank you very much for your help :) 

|PTR

On Friday, March 30, 2018 at 2:00:18 AM UTC-7, shree wrote:
>
> Please check GitHub/issues for similar reports and suggestions.
>
> Also specify,
>
> Which version/commit of tesseract 4
>

commit hash: 40f43111e05b3dd2f2f8aeae3aba33016523c881
tag: 4.0.0-beta.1

> Which traineddata file, from which repo
>

eng.traineddata from https://github.com/tesseract-ocr/tessdata at commit
9b2e3f6642285b3e9a7a5852e5b10259e42d5510

> Which o/s
>

Ubuntu 17.10 on amd64

> tesseract -v
>

tesseract 4.0.0-beta.1
 leptonica-1.74.4
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.1) : libpng 1.6.34 : libtiff
4.0.8 : zlib 1.2.11 : libwebp 0.6.0 : libopenjp2 2.2.0

 Found AVX2
 Found AVX
 Found SSE

> On Fri 30 Mar, 2018, 2:19 PM Patrick Ramsey wrote:
>
>> Hi!
>>
>> So, I am running tesseract4 on clean, 1-bit images of rasterized text 
>> (not printed and scanned).  I'm getting very accurate output, as expected, 
>> but tesseract is taking about 1 second to process a single page on a core 
>> i7 cpu, and that seems a lot longer than I'd have expected.  
>>
>> I've been trying to enable debug output so that I can see what's taking 
>> the most time, to see if there is anything that I could get away with 
>> turning off to speed it up (since I don't need to account for e.g. dirt on 
>> the lens), but thus far I'm feeling pretty stupid.  So:
>>
>> A) is there any straightforward way to get more information on what 
>> tesseract is actually doing? (I've built with --enable-debug and it doesn't 
>> seem to have changed the output on the command line)
>> B) are there any control parameters you folks would suggest setting to 
>> speed up image processing/turn off unnecessary work, given the inputs I've 
>> described?
>>
>> Many thanks,
>>
>> PTR


Re: [tesseract-ocr] [4.0.0-beta.1] read_params_file: parameter not found: PNG

2018-04-02 Thread Zdenko Podobny
The aim is to have a tool that is easily portable, with minimal dependencies.
IMO it is standard on Linux/Unix-like systems to use the --help option for an
explanation of usage.

Zdenko

2018-04-02 14:38 GMT+02:00 JP T:

> Well, the problem is error handling.
> If tesseract had given a meaningful error message...
> This is about basic parameter handling, nothing sophisticated.
>
> On Monday, 2 April 2018 09:02:02 UTC+2, zdenop wrote:
>>
>> ... and it was exactly the same in tesseract 3.0x as in 4.0


[tesseract-ocr] does it make sense to train existing languages? how to fix repeatedly wrong letters?

2018-04-02 Thread JP T
Hi

I don't really understand the consequences of training.

My problem:
I've got tons of pages in a special format (a "one place study" about the
historic inhabitants of a town).

tesseract repeatedly fails on a few special tokens:
oo (oh-oh) at the start of a line, meaning "wedding", is often read as 00
(zero zero)
Roman numerals 2 and 3 (II, III) in Arial are read as two lowercase L's (ll)
or as an uppercase I plus two lowercase L's (Ill)
*/~ (born around) is read as percent %
~ is read as -

My scans are of almost perfect quality (I used Fred's scripts), so there is
nothing more I can do on that side.
Adding oo to the user words did not help.

Can I use training to solve these, or should I instead write a script that
fixes the mistakes after OCR?
The problem is that OCR needs to know some semantics. The Arial letters
themselves hardly give a hint as to which reading is correct.

thanks




Re: [tesseract-ocr] [4.0.0-beta.1] read_params_file: parameter not found: PNG

2018-04-02 Thread JP T
Well, the problem is error handling.
If tesseract had given a meaningful error message...
This is about basic parameter handling, nothing sophisticated.

On Monday, 2 April 2018 09:02:02 UTC+2, zdenop wrote:
>
> ... and it was exactly the same in tesseract 3.0x as in 4.0
>



[tesseract-ocr] Where is /path/to/eng.user-words?

2018-04-02 Thread 이경준
Hi,

I am citing the page quoted below.

I cannot find (lang).user-words.

Where can I find it?


Tesseract config files consist of lines with variable-value pairs (space 
separated). The variables are documented as flags in the source code like 
the following one in tesseractclass.h:

STRING_VAR_H(tessedit_char_blacklist, "", "Blacklist of chars not to recognize");

These variables may enable or disable various features of the engine, and 
may cause it to load (or not load) various data. For instance, let’s 
suppose you want to OCR in English, but suppress the normal dictionary and 
load an alternative word list and an alternative list of patterns — these 
two files are the most commonly used extra data files.

If your language pack is in /path/to/eng.traineddata and the hocr config is 
in /path/to/configs/hocr then create three new files:

/path/to/eng.user-words:

the
quick
brown
fox
jumped

/path/to/eng.user-patterns:

1-\d\d\d-GOOG-411
www.\n\\\*.com

/path/to/configs/bazaar:

load_system_dawg     F
load_freq_dawg       F
user_words_suffix    user-words
user_patterns_suffix user-patterns

Now, if you pass the word *bazaar* as a trailing command line parameter to 
Tesseract, Tesseract will not bother loading the system dictionary nor the 
dictionary of frequent words and will load and use the eng.user-words and 
eng.user-patterns files you provided. The former is a simple word list, one 
per line. The format of the latter is documented in dict/trie.h on 
read_pattern_list().
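
The file layout described above could be generated with a short script. This is a sketch only, in Python (an assumption; the quoted documentation shows no code), and `write_bazaar_files` is a hypothetical helper. The file names and contents are copied from the passage; the target directory should be whichever directory holds your eng.traineddata.

```python
from pathlib import Path

WORDS = ["the", "quick", "brown", "fox", "jumped"]
# Raw strings keep the backslashes exactly as in the quoted documentation.
PATTERNS = [r"1-\d\d\d-GOOG-411", r"www.\n\\\*.com"]
CONFIG = [
    "load_system_dawg     F",
    "load_freq_dawg       F",
    "user_words_suffix    user-words",
    "user_patterns_suffix user-patterns",
]


def write_bazaar_files(data_dir, lang="eng"):
    """Create <lang>.user-words, <lang>.user-patterns and configs/bazaar."""
    base = Path(data_dir)
    (base / "configs").mkdir(parents=True, exist_ok=True)
    files = {
        base / f"{lang}.user-words": WORDS,
        base / f"{lang}.user-patterns": PATTERNS,
        base / "configs" / "bazaar": CONFIG,
    }
    for path, lines in files.items():
        # One entry per line, with a trailing newline.
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    # Report what was written, relative to the data directory.
    return sorted(p.relative_to(base).as_posix() for p in files)
```

So the files are not shipped with the language pack; per the passage they are created by hand (or by a script like this) next to eng.traineddata, and `bazaar` is then passed as the trailing command line parameter.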



[tesseract-ocr] Re: When tesseract(3.04) makes a box, is there a way to control it if it is made more than the number of letters?

2018-04-02 Thread notoriousterran
Correction to my wording: the original image contains eight characters, but
tesseract (3.04) makes nine boxes.

Command used:
$ tesseract (lang).(fontname).exp(num).tif tesseract (lang).(fontname).exp(num) -l lang batch.nochop makebox

On Monday, April 2, 2018 at 4:33:27 PM UTC+9, notorio...@gmail.com wrote:
>
> Hi
>
> When tesseract (3.04) makes a box file, is there a way to control it if it
> produces more boxes than there are letters?
>
> The original image contains eight characters, but tesseract (3.04) makes
> nine boxes.
>
> So I put only 8 boxes into the box file, but A showed 9 characters in the
> execution result.
>
>



[tesseract-ocr] When tesseract(3.04) makes a box, is there a way to control it if it is made more than the number of letters?

2018-04-02 Thread notoriousterran
Hi

When tesseract (3.04) makes a box file, is there a way to control it if it
produces more boxes than there are letters?

The original image contains eight characters, but tesseract (3.04) makes nine
boxes.

So I put only 8 boxes into the box file, but A showed 9 characters in the
execution result.



Re: [tesseract-ocr] [4.0.0-beta.1] read_params_file: parameter not found: PNG

2018-04-02 Thread Zdenko Podobny
... and it was exactly the same in tesseract 3.0x as in 4.0

Zdenko
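
The fix reported in the quoted message below comes down to argument order. A small sketch in Python (an assumption; the thread contains no code) that always places the input file and output base before any options. `tesseract_argv` is a hypothetical helper, and `--psm` is the option spelling shown in Tesseract 4's usage text.

```python
def tesseract_argv(infile, outbase, *options):
    """Return argv in the order: tesseract infile outbase options..."""
    for name in (infile, outbase):
        # Catch an option placed where a file name belongs.
        # A bare "-" is allowed because it is not an option name.
        if name.startswith("-") and len(name) > 1:
            raise ValueError(f"expected a file name, got option {name!r}")
    return ["tesseract", infile, outbase, *options]
```

For example, `tesseract_argv("page.png", "out", "--psm", "6")` yields the working order, while passing `--psm` first raises a `ValueError` instead of letting Tesseract misread the arguments.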

2018-04-02 0:14 GMT+02:00 JP T:

> Solved:
> It must be *tesseract infile outfile options* instead of the standard Unix
> *program options infile outfile*.
>
> On Sun 1 Apr, 2018, 7:25 PM JP T wrote:
>
>> Hi
>>
>> I just updated from version 3.04.01, but now tesseract fails with the above
>> message if I give the -psm option.
>> The input files are PNG.
>>
>> Any idea?
>>
>> thanks