[tesseract-ocr] Re: Training error "Couldn't find a matching blob"

2018-05-31 Thread shree
This has been an issue for a long time. Thanks for finding the problem.

Please submit a PR on github.

On Friday, June 1, 2018 at 1:55:25 AM UTC+5:30, Paul Kitchen wrote:
>
> After a lot of stepping through the tesseract code, I found the problem.
>
> 1)  In file coutln.cpp, function C_OUTLINE::IsLegallyNested(), we
> assign outer_area() to an inT32, parent_area. Lower in the function,
> we multiply child->outer_area() by parent_area. This caused an integer
> overflow, which gave the product the wrong sign. The fix was
> to make parent_area an inT64 so that the multiplication cannot overflow.
>
> The two 32-bit integers being multiplied were -51874 and 60218. The true
> result should be -3123748532, but a signed 32-bit result must lie between
> -2^31 and 2^31 - 1, so the product overflows here. The
> computed result was 1171218764, causing the if-statement to go down the
> wrong path.
>
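
The arithmetic in the quoted message can be checked with a short Python sketch. It is not the tesseract code: wrap_int32 is a hypothetical helper that imitates two's-complement wraparound, and the names a and b are placeholders, since the message does not say which operand was the parent area.

```python
def wrap_int32(value: int) -> int:
    """Reduce a Python int to the value a signed 32-bit integer would hold."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    # Values with the top bit set represent negative numbers.
    return value - 0x100000000 if value >= 0x80000000 else value


# The two operands reported in the message (placeholder names).
a = -51874
b = 60218

true_product = a * b                        # Python ints do not overflow
wrapped_product = wrap_int32(true_product)  # what a 32-bit product would hold

print(true_product)     # -3123748532
print(wrapped_product)  # 1171218764
```

The wrapped value is positive although the true product is negative, which is the sign flip that sent the if-statement down the wrong path.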

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/1ef0e822-9518-4cbb-af39-5a8ec6370d00%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[tesseract-ocr] Preprocess Image

2018-05-31 Thread Hongguo An



Hi,
When I try to OCR the above image, the date 09/02/2017 is always read 
incorrectly (as 0G/02/2017).


This is Tesseract 4 running on Linux. The command line is:

tesseract stdin stdout -l eng --psm 11 --oem 1 -c textonly_pdf=1 -c tessedit_create_pdf=1 | pdftotext -layout - -


Is there any way to pre-process the image to make it work? (preferably 
using convert)


Thanks

Hongguo An



[tesseract-ocr] Re: Not able install tesseract ocr on ubuntu 17.04

2018-05-31 Thread Александр Поздняков
In /etc/apt/sources.list, replace the repository 
http://us.archive.ubuntu.com/ubuntu with 
http://old-releases.ubuntu.com/ubuntu/ and then run:

sudo apt-get update
sudo apt install tesseract-ocr 
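
The repository swap described above can be sketched in Python as a plain text substitution. This is only an illustration: point_to_old_releases is a hypothetical helper, and the sample lines stand in for a real /etc/apt/sources.list, which should be backed up before editing.

```python
OLD_MIRROR = "http://us.archive.ubuntu.com/ubuntu"
NEW_MIRROR = "http://old-releases.ubuntu.com/ubuntu"


def point_to_old_releases(sources_text: str) -> str:
    """Return sources.list content with the old mirror URL swapped out."""
    return sources_text.replace(OLD_MIRROR, NEW_MIRROR)


# Hypothetical sample entries, not copied from a real system.
sample = (
    "deb http://us.archive.ubuntu.com/ubuntu/ zesty main restricted\n"
    "deb http://us.archive.ubuntu.com/ubuntu/ zesty universe\n"
)

updated = point_to_old_releases(sample)
print(updated)
```

Writing the result back to /etc/apt/sources.list needs root privileges, after which the two commands above apply.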

As for the beta version, I'll think ...

On Thursday, May 31, 2018 at 10:04:19 AM UTC+3, RT-Rakesh wrote:
>
> user@computer:~$ sudo apt install tesseract-ocr
> Reading package lists... Done
> Building dependency tree   
> Reading state information... Done
> The following packages were automatically installed and are no longer 
> required:
>   libgnutls-openssl27 postfix-sqlite
> Use 'sudo apt autoremove' to remove them.
> The following additional packages will be installed:
>   libgif7 liblept5 libtesseract-data libtesseract3 tesseract-ocr-eng
>   tesseract-ocr-equ tesseract-ocr-osd
> The following NEW packages will be installed:
>   libgif7 liblept5 libtesseract-data libtesseract3 tesseract-ocr
>   tesseract-ocr-eng tesseract-ocr-equ tesseract-ocr-osd
> 0 upgraded, 8 newly installed, 0 to remove and 180 not upgraded.
> Need to get 945 kB/14.6 MB of archives.
> After this operation, 57.5 MB of additional disk space will be used.
> Do you want to continue? [Y/n] y
> Err:1 http://us.archive.ubuntu.com/ubuntu zesty/main amd64 libgif7 amd64 
> 5.1.4-0.4
>   404  Not Found [IP: 91.189.91.23 80]
> Err:2 http://us.archive.ubuntu.com/ubuntu zesty/universe amd64 liblept5 
> amd64 1.74.1-1
>   404  Not Found [IP: 91.189.91.23 80]
> E: Failed to fetch 
> http://us.archive.ubuntu.com/ubuntu/pool/main/g/giflib/libgif7_5.1.4-0.4_amd64.deb
>   
> 404  Not Found [IP: 91.189.91.23 80]
> E: Failed to fetch 
> http://us.archive.ubuntu.com/ubuntu/pool/universe/l/leptonlib/liblept5_1.74.1-1_amd64.deb
>   
> 404  Not Found [IP: 91.189.91.23 80]
> E: Unable to fetch some archives, maybe run apt-get update or try with 
> --fix-missing?
>
>
> This is the error being thrown. Can someone help me solve 
> this issue?
>



Re: [tesseract-ocr] lstmeval gives a perfect result but tesseract fails

2018-05-31 Thread ShreeDevi Kumar
> I've trained a LSTM model for a custom language from scratch as explained
> here.
>
> The language only has about 100 words and 17 characters, so it's pretty
> simple.

For such a small model, try to build the legacy version rather than LSTM.

$tesstrain_dir/tesstrain.sh \
   --lang $Lang \
   --exposures "0" \
   --fonts_dir $fonts_dir \
   --fontlist $fonts_for_training \
   --langdata_dir $langdata_dir \
   --tessdata_dir  $tessdata_dir \
   --training_text $langdata_dir/$Lang/$Lang.training_text \
   --output_dir $train_output_dir



ShreeDevi

Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com

On Thu, May 31, 2018 at 3:43 PM, Julien Jemine wrote:

> Hi,
>
> I've trained a LSTM model for a custom language from scratch as explained
> here
> .
>
> The language only has about 100 words and 17 characters, so it's pretty
> simple.
>
> When I run lstmeval on my model, I get a perfect match:
> [icm@u16-offcao-07] train1$ lstmeval --model 
> /home/icm/share/tessdata/iqi.traineddata
> --eval_listfile iqitrain2/iqi.training_files.txt --verbosity 2
> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/
> iqi.Arial.exp0.lstmf
> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/
> iqi.Calibri.exp0.lstmf
> Warning: LSTMTrainer deserialized an LSTMRecognizer!
> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
> OCR  :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
> Truth:6CUEN 6 CU EN
> OCR  :6CUEN 6 CU EN
> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/
> iqi.Lucida_Sans_Typewriter_Semi-Condensed.exp0.lstmf
> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
> OCR  :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
> Truth:6CUEN 6 CU EN
> OCR  :6CUEN 6 CU EN
> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/
> iqi.Verdana.exp0.lstmf
> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
> OCR  :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
> Truth:6CUEN 6 CU EN
> OCR  :6CUEN 6 CU EN
> Truth:6CUEN 6 CU EN
> OCR  :6CUEN 6 CU EN
> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
> OCR  :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>
> However, when I put my iqi.traineddata file in my tessdata folder and try
> to run tesseract on the same tif file, I get errors:
> [icm@u16-offcao-07] train1$ tesseract iqitrain2/iqi.training_img.txt
> stdout -l iqi
> Page 0 : /home/icm/train1/iqitrain2/iqi.Arial.exp0.tif
> 6CFN
> 6CUEN 1 CU EN
> Page 1 : /home/icm/train1/iqitrain2/iqi.Calibri.exp0.tif
>
> 6CM 10FEEN 0 6 FEE 13CUEN 11 6 FE EEN 1116
> 6UEN 16 FE
> Page 2 : /home/icm/train1/iqitrain2/iqi.Lucida_Sans_Typewriter_
> Semi-Condensed.exp0.tif
>
> 6TM 13CUEN 13 1 EN 11CUE 11 CU EN 12B 11 16
> 6 6 CU EN
> Page 3 : /home/icm/train1/iqitrain2/iqi.Verdana.exp0.tif
>
> ASTM 103UEEN 13 1CU EN 13CUEN 13 6 FE EEN 11 16
> 6CUEN 6 CU EN
>
>
> Now the really frustrating part: I have the opposite phenomenon with the
> "eng" language! (with eng.traineddata taken from tessdata_best)
> lstmeval gives me a few errors (Eval Char error rate=2.4665552, Word error
> rate=16.67)
> tesseract gives me the right answer! (But the images are generated with
> tesstrain.sh and very common fonts, it's probably to be expected).
>
> Am I doing something wrong?
> What's going on here?
>



Re: [tesseract-ocr] Help for training tesseract to recognize a new (dead) language

2018-05-31 Thread Ramast Magdy
Impressive! I thought we would need to do a lot of work in order to 
reach that stage.


The "??" in the text corresponds to a character that is unknown to me; I also 
can't find it among the available Unicode characters.

It is certainly not part of the text. It is probably an indicator of a new chapter.
Maybe we could use the § symbol for it?

This is very exciting. How do we access the work you have done so far 
and add to it?

Thanks a lot


ⲁⲩⲱ ⲟⲛ ⲁⲓ̈ⲧⲣⲉⲩ ⲣ̄ ⲥⲟⲟⲩ ⲛ̄ ⲉⲃⲟⲧ ⲉⲩⲕⲏⲧ ⲉ ϩⲃⲟⲩⲣ
ⲉⲩⲉⲓⲣⲉ ⲛ̄ ⲛⲉ ϩⲃⲏⲩⲉ ⲛ̄ ⲛⲉⲩⲁⲡⲟⲧⲉⲗⲉⲥⲙⲁ ⲙⲛ̄ ⲛⲉⲩ–
ⲥⲭⲏⲙⲁ ⲧⲏⲣⲟⲩ· ϫⲉ ⲕⲁⲥ ϩⲛ̄ ⲟⲩ ϩⲃⲁ ⲉⲩⲉⲣ̄ ϩⲃⲁ·
ⲁⲩⲱ ϩⲛ̄ ⲟⲩ ⲡⲗⲁⲛⲏ ⲉⲩⲉⲡⲗⲁⲛⲁ ⲛ̄ϭⲓ ⲛ ⲁⲣⲭⲱ̄ ⲉⲧ
ϣⲟⲟⲡ ϩⲛ̄ ⲛ ⲁⲓⲱ̄ ⲁⲩⲱ ϩⲛ̄ ⲛⲉⲩⲥⲫⲁⲓⲣⲁ ⲁⲩⲱ ϩⲛ̄  5
ⲛⲉⲩⲙ̄ⲡⲏⲩⲉ· ⲁⲩⲱ ϩⲛ̄ ⲛⲉⲩⲧⲟⲡⲟⲥ ⲧⲏⲣⲟⲩ· ϫⲉ ⲕⲁⲥ ⲛ̄
ⲛⲉⲩⲛⲟⲓ̈ ⲛ̄ ⲧⲉⲩϭⲓⲛⲙⲟⲟϣⲉ ⲙ̄ⲙⲓⲛ ⲙ̄ⲙⲟ–
?? ⲟⲩ: ⲁⲥϣⲱⲡⲉ ϭⲉ ⲛ̄ⲧⲉⲣⲉ ⲓ̄ⲥ̄ ⲟⲩⲱ ⲉϥϫⲱ ⲛ̄
ⲡⲉⲓ̈ ϣⲁϫⲉ ⲉⲣⲉ ⲫⲓⲗⲓⲡⲡⲟⲥ ϩⲙⲟⲟⲥ ⲉϥⲥϩⲁⲓ̈ ⲛ̄ ϣⲁϫⲉ
ⲗ̄ⲁ̄ ⲁ. ⲛⲓⲙ ⲉⲧ ⲉⲣⲉ ⲓ̄ⲥ̄ ϫⲱ ⲙ̄ⲙⲟⲟⲩ; ⲁⲥϣⲱⲡⲉ ϭⲉ ⲙⲛ̄ⲛ̄ⲥⲁ 10



On 05/31/2018 06:42 AM, ShreeDevi Kumar wrote:
I am attaching the recognition result of the one page image you gave 
from the test model for Coptic I have built. If you can send me the 
correct unicode transcription for that page, I can further fine tune 
it. You can then further modify as per your needs.




Re: [tesseract-ocr] Where to find tessdata folder?

2018-05-31 Thread Zdenko Podobny
Did you follow the installation instructions for that package?
Did you try an internet search before posting on the forum?
Did you try searching for help in the tesserocr project?

I just searched Google and got:
https://pypi.org/project/tesserocr/
https://github.com/sirfz/tesserocr
https://oded.blog/2017/01/08/ocr-made-easy-using-tesserocr/
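
The error quoted below says the tessdata path may be invalid, so one thing worth checking is where the *.traineddata files actually are. The following is a minimal sketch using only the Python standard library; find_tessdata_dir is a hypothetical helper, not part of tesserocr, and the search root is whatever directory you choose to scan.

```python
from pathlib import Path
from typing import Optional


def find_tessdata_dir(search_root: str) -> Optional[str]:
    """Return the first directory under search_root holding *.traineddata files."""
    # Sort the matches so the result is the same on every run.
    for candidate in sorted(Path(search_root).rglob("*.traineddata")):
        return str(candidate.parent)
    return None  # no trained data found under search_root
```

If a directory is found, it could be supplied to tesserocr, for example through the TESSDATA_PREFIX environment variable or the path argument that the tesserocr documentation describes for image_to_text; check the linked project pages for the exact usage in your installed version.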

Zdenko


On Thu, May 31, 2018 at 8:40 AM, Abel Tan wrote:

> Hi,
>
> I installed tesserocr by using Anaconda
>
> conda install -c simonflueckiger tesserocr
>
>
> The path to Anaconda is :C:\Users\Tan\Anaconda3\
>
> The path to tesseract package is :
> C:\Users\Tan\Anaconda3\Lib\site-packages\tesserocr
>
> However when I start up Jupyter notebook and run :
>
> import tesserocr
>
>
> print(tesserocr.image_to_text(image))
>
> I get error:
>
>
> RuntimeError  Traceback (most recent call 
> last) in ()> 1 
> print(tesserocr.image_to_text(imagee))
> tesserocr.pyx in tesserocr._tesserocr.image_to_text()
> RuntimeError: Failed to init API, possibly an invalid tessdata path: 
> C:\Users\Tan\Anaconda3\
>
>
>
>
> How do I solve this problem?
>
>
> Thanks
>
> Abel
>



[tesseract-ocr] Not able install tesseract ocr on ubuntu 17.04

2018-05-31 Thread RT-Rakesh
user@computer:~$ sudo apt install tesseract-ocr
Reading package lists... Done
Building dependency tree   
Reading state information... Done
The following packages were automatically installed and are no longer 
required:
  libgnutls-openssl27 postfix-sqlite
Use 'sudo apt autoremove' to remove them.
The following additional packages will be installed:
  libgif7 liblept5 libtesseract-data libtesseract3 tesseract-ocr-eng
  tesseract-ocr-equ tesseract-ocr-osd
The following NEW packages will be installed:
  libgif7 liblept5 libtesseract-data libtesseract3 tesseract-ocr
  tesseract-ocr-eng tesseract-ocr-equ tesseract-ocr-osd
0 upgraded, 8 newly installed, 0 to remove and 180 not upgraded.
Need to get 945 kB/14.6 MB of archives.
After this operation, 57.5 MB of additional disk space will be used.
Do you want to continue? [Y/n] y
Err:1 http://us.archive.ubuntu.com/ubuntu zesty/main amd64 libgif7 amd64 
5.1.4-0.4
  404  Not Found [IP: 91.189.91.23 80]
Err:2 http://us.archive.ubuntu.com/ubuntu zesty/universe amd64 liblept5 
amd64 1.74.1-1
  404  Not Found [IP: 91.189.91.23 80]
E: Failed to fetch 
http://us.archive.ubuntu.com/ubuntu/pool/main/g/giflib/libgif7_5.1.4-0.4_amd64.deb
  
404  Not Found [IP: 91.189.91.23 80]
E: Failed to fetch 
http://us.archive.ubuntu.com/ubuntu/pool/universe/l/leptonlib/liblept5_1.74.1-1_amd64.deb
  
404  Not Found [IP: 91.189.91.23 80]
E: Unable to fetch some archives, maybe run apt-get update or try with 
--fix-missing?


This is the error being thrown. Can someone help me solve 
this issue?



[tesseract-ocr] Where to find tessdata folder?

2018-05-31 Thread Abel Tan
Hi,

I installed tesserocr by using Anaconda 

conda install -c simonflueckiger tesserocr


The path to Anaconda is :C:\Users\Tan\Anaconda3\

The path to tesseract package is :
C:\Users\Tan\Anaconda3\Lib\site-packages\tesserocr

However when I start up Jupyter notebook and run :

import tesserocr


print(tesserocr.image_to_text(image))

I get error:


RuntimeError  Traceback (most recent call last)
in ()
> 1 print(tesserocr.image_to_text(imagee))
tesserocr.pyx in tesserocr._tesserocr.image_to_text()
RuntimeError: Failed to init API, possibly an invalid tessdata path: C:\Users\Tan\Anaconda3\




How do I solve this problem?


Thanks

Abel



[tesseract-ocr] Training of tesseract

2018-05-31 Thread AKS

Hi,

I want to use Tesseract OCR on images with varying font types and font 
sizes. The backgrounds also vary a lot from image to image: multi-colored 
backgrounds, backgrounds with designs, uneven illumination, and plain white 
backgrounds.
If I simply run Tesseract with a configuration chosen for my 
requirements, I do not get very good accuracy. Also, because of the wide 
variation in backgrounds and text styles, I cannot apply a fixed image 
preprocessing step to the images before using Tesseract.

Can I solve this problem by fine-tuning Tesseract with my set of images?
