Re: [tesseract-ocr] Re: problems with upper-case character

2019-09-21 Thread Lorenzo Bolzani
If you are not sure whether you have a single line or a single block, use psm 6.

See tesseract --help-extra

Psm 6 generally works fine for single lines too.


If you have full pages and single lines mixed, you need a preprocessing
step (threshold, morphology, etc.) to work out which psm is the correct
one.
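
A minimal sketch of such a preprocessing step: count the rows of text with a horizontal projection profile and pick the psm from that count. The helper names (count_text_lines, choose_psm), the 0/1 row representation and the rule "exactly one line gives psm 7, anything else falls back to psm 6" are assumptions for illustration, not something stated in this thread. In practice the binary rows would come from a thresholding step (for example Otsu in OpenCV).

```python
def count_text_lines(binary_rows, min_ink=1):
    """Count runs of consecutive rows that contain ink.

    binary_rows -- list of rows, each a list of 0/1 values (1 = ink pixel)
    min_ink     -- minimum number of ink pixels for a row to count as text
    """
    lines = 0
    in_line = False
    for row in binary_rows:
        has_ink = sum(row) >= min_ink
        if has_ink and not in_line:
            lines += 1  # a new run of inked rows starts here
        in_line = has_ink
    return lines


def choose_psm(n_lines):
    """Heuristic (assumption): psm 7 for exactly one line, psm 6 otherwise."""
    return 7 if n_lines == 1 else 6


# Two inked rows that touch each other form a single text line.
rows = [[0, 0], [1, 0], [1, 1], [0, 0]]
print(choose_psm(count_text_lines(rows)))
```

Noisy images would need a larger min_ink, and a full page with columns would need more than a row count, so this only covers the "one line or one block" case.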



On Fri, 20 Sep 2019 at 10:54, 'Sandra M.' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> I realized that this also happens for strings without the symbol. The
> image below, for example, also returns an empty string. In this case it is
> recognized correctly with config='--psm 7', but unfortunately I cannot
> assume in general that the text is only one line. Maybe the problem is
> that the text is not a word in the dictionary? I read that it is possible
> to disable the dictionary and get back the single letters with the highest
> confidence, but I did not understand how to do this. I tried this config:
>
> text = pytesseract.image_to_string(gray, config='load_system_dawg=0')
>
> but it didn't improve anything, and I'm not even sure I applied it
> correctly...
>
> [image: googleforum.png]
>
>
>
> On Thursday, 19 September 2019 19:36:32 UTC+2, zdenop wrote:
>>
>>
>> please provide image for testing.
>>
>> Zdenko
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/ebbdd84b-0928-43b1-a0d8-d7c9308f7616%40googlegroups.com
> 
> .
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAMgOLLwrPVC4JBOWK4d6UsGQSicOJ8FQsm1XPp0Fe2YsPk74hw%40mail.gmail.com.


Re: [tesseract-ocr] Re: problems with upper-case character

2019-09-20 Thread 'Sandra M.' via tesseract-ocr


I realized that this also happens for strings without the symbol. The image
below, for example, also returns an empty string. In this case it is
recognized correctly with config='--psm 7', but unfortunately I cannot
assume in general that the text is only one line. Maybe the problem is that
the text is not a word in the dictionary? I read that it is possible to
disable the dictionary and get back the single letters with the highest
confidence, but I did not understand how to do this. I tried this config:

text = pytesseract.image_to_string(gray, config='load_system_dawg=0')

but it didn't improve anything, and I'm not even sure I applied it
correctly...

[image: googleforum.png] 
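
On the config string above: the Tesseract command line sets variables with `-c NAME=VALUE`, so a bare `load_system_dawg=0` is most likely read as the name of a config file rather than as a variable assignment. The sketch below builds a config string in the `-c` form. The helper names (tesseract_config, ocr) are hypothetical; load_system_dawg and load_freq_dawg are real Tesseract variables, but whether switching the dictionaries off helps for this image is not tested here.

```python
def tesseract_config(psm=None, **variables):
    """Build a pytesseract config string from a psm and Tesseract variables."""
    parts = []
    if psm is not None:
        parts += ["--psm", str(psm)]  # double dash is the 4.x+ spelling
    for name, value in variables.items():
        parts += ["-c", "{}={}".format(name, value)]
    return " ".join(parts)


def ocr(image, **kwargs):
    """Run pytesseract with a config built by tesseract_config."""
    import pytesseract  # imported lazily; only needed when OCR is actually run

    return pytesseract.image_to_string(image, config=tesseract_config(**kwargs))


print(tesseract_config(psm=7, load_system_dawg=0, load_freq_dawg=0))
```

With the grayscale image from the thread this would be called as ocr(gray, psm=7, load_system_dawg=0, load_freq_dawg=0).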



On Thursday, 19 September 2019 19:36:32 UTC+2, zdenop wrote:
>
>
> please provide image for testing.
>
> Zdenko
>



Re: [tesseract-ocr] Re: problems with upper-case character

2019-09-19 Thread Zdenko Podobny
please provide image for testing.

Zdenko


On Thu, 19 Sep 2019 at 18:06, 'Sandra M.' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> But now I get empty strings instead, because the image contains a symbol
> that Tesseract does not know. I had this problem before and could fix it,
> for whatever reason, with config='--psm 7'. That no longer works... Do you
> have an idea for this as well? I don't need to detect the symbol; I just
> don't want the rest of the string to be "thrown away"...
>



Re: [tesseract-ocr] Re: problems with upper-case character

2019-09-19 Thread 'Sandra M.' via tesseract-ocr
But now I get empty strings instead, because the image contains a symbol
that Tesseract does not know. I had this problem before and could fix it,
for whatever reason, with config='--psm 7'. That no longer works... Do you
have an idea for this as well? I don't need to detect the symbol; I just
don't want the rest of the string to be "thrown away"...



Re: [tesseract-ocr] Re: problems with upper-case character

2019-09-19 Thread 'Sandra M.' via tesseract-ocr
You were both right - updating to version 5 fixed the problem more or less! 
Only in one case there is still a problem with lower and upper case 
letters, but for the other cases it's working now!

On Thursday, 19 September 2019 12:49:43 UTC+2, zdenop wrote:
>
> Your Tesseract version is old. The current version is 4.1 (the dev version
> is 5.0).
> For 4.x and above you can use different tessdata: best, fast or with the
> 3.x module.
>
> Zdenko
>
>
> On Thu, 19 Sep 2019 at 11:55, 'Sandra M.' via tesseract-ocr <
> tesser...@googlegroups.com > wrote:
>
>> I use Tesseract 3.02 with leptonica-1.68. What do you mean by
>> tessdata_best? I'm new to this field and only know how to call tesseract
>> with the code below. How can the resolution be 0 dpi?
>>
>> I'm using this Python code:
>>
>> import argparse
>> import os
>>
>> import cv2
>> import pytesseract
>>
>> # construct the argument parser and parse the arguments
>> ap = argparse.ArgumentParser()
>> ap.add_argument("-i", "--image", required=True,
>>                 help="path to input image to be OCR'd")
>> args = vars(ap.parse_args())
>>
>> # load the example image and convert it to grayscale
>> image = cv2.imread(args["image"])
>> gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
>>
>> # write the grayscale image to disk as a temporary file
>> filename = "{}.png".format(os.getpid())
>> cv2.imwrite(filename, gray)
>>
>> # apply OCR to the in-memory grayscale array (the temporary file written
>> # above is neither read back nor deleted here)
>> text = pytesseract.image_to_string(gray)
>> print("Output: " + text)
>>
>>
>> On Thursday, 19 September 2019 11:23:50 UTC+2, zdenop wrote:
>>>
>>> Please provide more information (version info, how you run OCR - it
>>> seems like you are calling it from code).
>>> I just tried the tesseract command line (tesseract
>>> 5.0.0-alpha-416-g408d6) with tessdata_best and it works for me:
>>> tesseract unnamed.png -
>>> Warning: Invalid resolution 0 dpi. Using 70 instead.
>>> Estimating resolution as 497
>>> Calibrations
>>>
>>> Zdenko
>>>
>>>
>>> On Thu, 19 Sep 2019 at 10:43, 'Sandra M.' via tesseract-ocr <
>>> tesser...@googlegroups.com> wrote:
>>>
>>>> [image: currentImage.png]
>>>> @Lorenzo Blz: This is an example image. The output of my code is
>>>> "calibrations". The height of the letters is not the same. Of course a
>>>> lone "c" cannot be classified, but in the context of the other letters
>>>> Tesseract should be able to tell whether it is a lower-case or a
>>>> capital letter, I think. The image has no noise or anything else, so I
>>>> don't understand the problem. Nevertheless, your suggestion to change
>>>> the size helped! If I resize it to 150% or 75%, for example, it works.
>>>> I just don't know how to handle this later on, when I have no reference
>>>> value: how do I decide which spelling is right, the one from the 100%
>>>> image or the one from 150%? Or can one say that resizing the image in
>>>> preprocessing always gives a more reliable result?
>>>>
>>>> On Wednesday, 18 September 2019 17:19:22 UTC+2, Sandra M. wrote:
>>>>>
>>>>> I'm using Tesseract with Python. I have an image with 1-6 words in it
>>>>> and need to read the text. Sometimes the character "C", which looks
>>>>> the same in upper and lower case, is detected as lower-case c instead
>>>>> of upper-case C. I see why this is hard, but in the context of the
>>>>> following letters it should be possible to detect the right case. Is
>>>>> there any configuration option or something else to improve this?
>>>>>
>>>>> I had a look at the config='-psm x' option with different values for
>>>>> x, but nothing fits my problem.
>>>>>
>


Re: [tesseract-ocr] Re: problems with upper-case character

2019-09-19 Thread Lorenzo Bolzani
I tried to upscale, downscale, with and without the white border and I
always get Calibrations. I even tried a few psm modes.

I'm using:

tesseract 4.0.0
 leptonica-1.76.0
  libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib
1.2.11

What I would do is this:
- prepare a test set with some data so that you can check what, on average,
gives you an improvement and what does not
- remove the white border (see here)
- now rescale the text so that it is about 35/55px high, try a few values
and see what works best. I would also try a few completely different values
(75, 100) while I'm there (just make sure you always start from the
original images when you rescale, so that you do not degrade the images too
much; I would use find+imagemagick).
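
The test-set and rescaling steps above could be sketched as follows. Everything here is an assumption for illustration: the helper names, the idea that the text height of each sample has already been measured in pixels, and the case-sensitive exact-match score. The ocr_resized function shows one possible OCR callable using cv2.resize and pytesseract; any callable with the same shape works.

```python
def scale_factor(text_height_px, target_height_px):
    """Factor that brings text of the measured height to the target height."""
    if text_height_px <= 0:
        raise ValueError("text height must be positive")
    return target_height_px / text_height_px


def exact_match_rate(predictions, truths):
    """Case-sensitive share of predictions equal to the ground truth."""
    if not truths:
        return 0.0
    hits = sum(p.strip() == t for p, t in zip(predictions, truths))
    return hits / len(truths)


def best_target_height(samples, candidates, ocr):
    """Score every candidate text height on the test set.

    samples    -- list of (image, measured_text_height_px, ground_truth)
    candidates -- target text heights in pixels to try
    ocr        -- callable(image, factor) -> recognized string
    """
    scores = {}
    for target in candidates:
        predictions = [ocr(img, scale_factor(h, target)) for img, h, _ in samples]
        scores[target] = exact_match_rate(predictions, [t for _, _, t in samples])
    return max(scores, key=scores.get), scores


def ocr_resized(image, factor, config="--psm 6"):
    """Example ocr callable: rescales the original image, then runs OCR."""
    import cv2  # imported lazily so the scoring helpers work without OpenCV
    import pytesseract

    resized = cv2.resize(image, None, fx=factor, fy=factor,
                         interpolation=cv2.INTER_CUBIC)
    return pytesseract.image_to_string(resized, config=config)
```

A call such as best_target_height(samples, [35, 45, 55, 75, 100], ocr_resized) returns the best-scoring height and the score per height. Because the factor is always applied to the original image, repeated runs do not compound resampling blur.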

If this doesn't work, you could look at the character boxes size. If the
text height is fixed you should be able to tell immediately what is what.
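
A rough sketch of the box-size idea, assuming the character boxes arrive in Tesseract's box format ("char left bottom right top page", one character per line, origin at the bottom left), which is what pytesseract.image_to_boxes returns. The set of ambiguous letters, the 0.8 height ratio and the helper names are assumptions chosen for illustration and would need tuning on real data.

```python
# Letters whose upper- and lower-case shapes differ mainly in size (assumption).
AMBIGUOUS = set("cCoOsSvVwWxXzZ")


def parse_boxes(box_text):
    """Return a list of (char, box_height) from box-format text."""
    boxes = []
    for line in box_text.splitlines():
        fields = line.rsplit(" ", 5)
        if len(fields) != 6:
            continue  # skip empty or malformed lines
        char, _left, bottom, _right, top, _page = fields
        boxes.append((char, int(top) - int(bottom)))
    return boxes


def fix_case_by_height(boxes, ratio=0.8):
    """Re-case ambiguous letters by comparing each box to the tallest box."""
    if not boxes:
        return ""
    tallest = max(height for _, height in boxes)
    out = []
    for char, height in boxes:
        if char in AMBIGUOUS:
            char = char.upper() if height >= ratio * tallest else char.lower()
        out.append(char)
    return "".join(out)


sample = "c 10 5 40 45 0\na 42 5 65 33 0\nl 67 5 75 46 0"
print(fix_case_by_height(parse_boxes(sample)))
```

The sketch needs at least one tall reference character (a capital or an ascender) in the string, and since the box output lists characters only, word spacing has to be restored separately.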

If this doesn't work and if you have some data, you could consider doing
some fine tuning (for example with ocrd-train), but if your text is so clear
and standard you should not need it.


I just saw that you are using version 3.x; this is the old version and does
not use neural networks. The current stable version is 4.1.


Lorenzo

On Thu, 19 Sep 2019 at 10:43, 'Sandra M.' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> [image: currentImage.png]
> @Lorenzo Blz: This is an example image. The output of my code is
> "calibrations". The height of the letters is not the same. Of course a
> lone "c" cannot be classified, but in the context of the other letters
> Tesseract should be able to tell whether it is a lower-case or a capital
> letter, I think. The image has no noise or anything else, so I don't
> understand the problem. Nevertheless, your suggestion to change the size
> helped! If I resize it to 150% or 75%, for example, it works. I just don't
> know how to handle this later on, when I have no reference value: how do I
> decide which spelling is right, the one from the 100% image or the one
> from 150%? Or can one say that resizing the image in preprocessing always
> gives a more reliable result?
>
> On Wednesday, 18 September 2019 17:19:22 UTC+2, Sandra M. wrote:
>>
>> I'm using Tesseract with Python. I have an image with 1-6 words in it and
>> need to read the text. Sometimes the character "C", which looks the same
>> in upper and lower case, is detected as lower-case c instead of upper-case
>> C. I see why this is hard, but in the context of the following letters it
>> should be possible to detect the right case. Is there any configuration
>> option or something else to improve this?
>>
>> I had a look at the config='-psm x' option with different values for x,
>> but nothing fits my problem.
>>



Re: [tesseract-ocr] Re: problems with upper-case character

2019-09-19 Thread Zdenko Podobny
Please provide more information (version info, how you run OCR - it seems
like you are calling it from code).
I just tried the tesseract command line (tesseract 5.0.0-alpha-416-g408d6)
with tessdata_best and it works for me:
tesseract unnamed.png -
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 497
Calibrations

Zdenko
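
On the "Invalid resolution 0 dpi" warning in the output above: a PNG only carries a resolution if it contains a pHYs chunk, and many tools write PNGs without one, which is a plausible reason for the image reporting 0 dpi. The sketch below reads that chunk using only the standard library. The function name png_dpi is hypothetical, and it checks the file metadata only; it is not how Tesseract itself reads the value.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dpi(data):
    """Return (x_dpi, y_dpi) from a PNG's pHYs chunk, or None if unavailable."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Every chunk is: 4-byte length, 4-byte type, data, 4-byte CRC.
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if chunk_type == b"pHYs":
            x_ppu, y_ppu, unit = struct.unpack(">IIB", body)
            if unit != 1:  # 1 = pixels per metre; 0 = unit unknown
                return None
            return (round(x_ppu * 0.0254), round(y_ppu * 0.0254))
        if chunk_type == b"IEND":
            break
        pos += 12 + length
    return None
```

Typical use would be png_dpi(open("unnamed.png", "rb").read()). If it returns None, the resolution can be supplied explicitly, for instance with the --dpi option of recent Tesseract versions, or by letting Tesseract estimate it as in the output above.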


On Thu, 19 Sep 2019 at 10:43, 'Sandra M.' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> [image: currentImage.png]
> @Lorenzo Blz: This is an example image. The output of my code is
> "calibrations". The height of the letters is not the same. Of course a
> lone "c" cannot be classified, but in the context of the other letters
> Tesseract should be able to tell whether it is a lower-case or a capital
> letter, I think. The image has no noise or anything else, so I don't
> understand the problem. Nevertheless, your suggestion to change the size
> helped! If I resize it to 150% or 75%, for example, it works. I just don't
> know how to handle this later on, when I have no reference value: how do I
> decide which spelling is right, the one from the 100% image or the one
> from 150%? Or can one say that resizing the image in preprocessing always
> gives a more reliable result?
>
> On Wednesday, 18 September 2019 17:19:22 UTC+2, Sandra M. wrote:
>>
>> I'm using Tesseract with Python. I have an image with 1-6 words in it and
>> need to read the text. Sometimes the character "C", which looks the same
>> in upper and lower case, is detected as lower-case c instead of upper-case
>> C. I see why this is hard, but in the context of the following letters it
>> should be possible to detect the right case. Is there any configuration
>> option or something else to improve this?
>>
>> I had a look at the config='-psm x' option with different values for x,
>> but nothing fits my problem.
>>
