[tesseract-ocr] image_to_string returns nothing sometimes

2018-07-06 Thread lolongeryan
Hi all, I'm new to pytesseract. I tried the image_to_string method on two identical images. Image A was saved from a screen grab using Image.save. When I applied image_to_string to A, it returned nothing. Then I used Photoshop to load the same image and save it as a copy, image B. I applied

Re: [tesseract-ocr] Re: Explanation for training_text and wordlist files

2018-07-06 Thread Shree Devi Kumar
Also see a community-contributed Perl script for generating langdata in https://github.com/tesseract-ocr/tesseract/tree/master/contrib On Fri 6 Jul, 2018, 10:52 PM Shree Devi Kumar, wrote: > See the following link to comment by Ray regarding building of Training > data > > >

Re: [tesseract-ocr] Re: Explanation for training_text and wordlist files

2018-07-06 Thread Shree Devi Kumar
See the following link to comment by Ray regarding building of Training data https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951 On Fri 6 Jul, 2018, 10:38 PM James Q, wrote: > No tool I can think of. What I would do is edit the file in a large text > file editor (such

[tesseract-ocr] Re: Explanation for training_text and wordlist files

2018-07-06 Thread James Q
No tool I can think of. What I would do is edit the file in a text editor that handles large files (such as EmEditor) to remove duplicate words. You could do this by replacing all spaces with newlines, then sorting and removing duplicates. After that you can randomize the unique list of words, add an
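The editor steps described above (split on spaces, sort, remove duplicates, randomize) could also be scripted. The following is only a minimal sketch of that idea, not a tool from the Tesseract project; the function name `unique_shuffled_words`, the `seed` parameter, and the sample text are assumptions made for illustration.

```python
import random


def unique_shuffled_words(text, seed=None):
    """Return the unique words of `text` in random order (hypothetical helper)."""
    # Splitting on whitespace has the same effect as replacing spaces with newlines.
    words = text.split()
    # Sort and drop duplicates, mirroring the "sort and remove duplicates" step.
    unique = sorted(set(words))
    # Randomize the unique list; the seed only makes a run reproducible.
    random.Random(seed).shuffle(unique)
    return unique


sample = "the cat sat on the mat the end"  # illustrative text, not real training data
words = unique_shuffled_words(sample, seed=0)
print("\n".join(words))
```

For a real training text, the input would be read from the file instead of a literal string, and the result written back one word per line.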

Re: [tesseract-ocr] Really poor performance with decimal numbers

2018-07-06 Thread Lorenzo Bolzani
Hi, upscale and enhance contrast, but upscaling is what really matters: each letter is 20px and a dot is about three pixels, so it's probably "seen" as noise. Bye Lorenzo 2018-07-06 5:51 GMT+02:00 Alberto Andreotti : > Hello, > > I'm having problems with the simplest image possible. > It's a screenshot

Re: [tesseract-ocr] Really poor performance with decimal numbers

2018-07-06 Thread James Q
Have you tried removing all surrounding whitespace from the image except for a thin border (say 8px thick)? On Friday, July 6, 2018 at 4:52:08 PM UTC+1, Alberto Andreotti wrote: > > Hi, > > tried it with same results, also, all other cases work well. > > 23.78 > 15 > 1.6 > 1.7 > 1.2 > 1.3 > 1.4
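The cropping suggestion above can be sketched without any imaging library by treating the image as a grid of grayscale values. This is an illustration under stated assumptions, not code from the thread: the helper `crop_to_content`, the value 255 for white, and the synthetic 40x40 test image are all hypothetical.

```python
def crop_to_content(pixels, border=8, white=255):
    """Crop a grayscale grid to its non-white bounding box plus a thin border."""
    # Rows and columns that contain at least one non-white pixel.
    rows = [y for y, row in enumerate(pixels) if any(p != white for p in row)]
    if not rows:
        return pixels  # nothing but whitespace, leave the image alone
    width = len(pixels[0])
    cols = [x for x in range(width) if any(row[x] != white for row in pixels)]
    # Expand the bounding box by `border`, clamped to the image edges
    # (no padding is added if the existing margin is already thinner).
    top = max(rows[0] - border, 0)
    bottom = min(rows[-1] + border, len(pixels) - 1)
    left = max(cols[0] - border, 0)
    right = min(cols[-1] + border, width - 1)
    return [row[left:right + 1] for row in pixels[top:bottom + 1]]


# Synthetic 40x40 white image with a 2x2 dark block standing in for text.
image = [[255] * 40 for _ in range(40)]
for y in (20, 21):
    for x in (20, 21):
        image[y][x] = 0

cropped = crop_to_content(image, border=8)
print(len(cropped), len(cropped[0]))
```

With a real screenshot the same bounding-box logic would be applied to pixel data loaded through an imaging library before passing the result to tesseract.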

Re: [tesseract-ocr] Really poor performance with decimal numbers

2018-07-06 Thread Alberto Andreotti
Hi, I tried it with the same results; all other cases work well. 23.78 15 1.6 1.7 1.2 1.3 1.4 1.8 1.9 The only one that won't come out well is "1.5". That's pretty crazy. Is there any config I could provide? Thanks, Alberto. On Friday, July 6, 2018 at 11:38:45 AM UTC-3, shree wrote: > > try

Re: [tesseract-ocr] Really poor performance with decimal numbers

2018-07-06 Thread Shree Devi Kumar
try --psm 6 On Fri, Jul 6, 2018 at 2:23 PM Alberto Andreotti wrote: > Hello, > > I'm having problems with the simplest image possible. > It's a screenshot from GEdit(Ubuntu's text editor), with numbers and > points. This is what I get, > > 23.78 > 15 > 1.6 > 17.6 > 25 > 225 > 2235 > 0.5 > >
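For context, `--psm 6` selects the page segmentation mode that assumes a single uniform block of text. A minimal sketch of how the command line might be assembled from Python follows; the helper `tesseract_command` and the file name `numbers.png` are hypothetical, and the original poster's exact command is not shown in the thread.

```python
def tesseract_command(image_path, psm=6):
    """Build an argument list for the tesseract CLI (hypothetical helper)."""
    # "stdout" as the output base makes tesseract print the recognized text
    # instead of writing it to a .txt file.
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


cmd = tesseract_command("numbers.png")  # hypothetical image file
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` on a machine where tesseract is installed.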

Re: [tesseract-ocr] how to improve dot-matrix digits recognize accuracy

2018-07-06 Thread Shree Devi Kumar
You could try fine-tuning for the dot-matrix font. On Fri, Jul 6, 2018 at 3:43 PM Wenjie Chen wrote: > Hi folks, > > Below is the dot-matrix digits picture, *tesseract* recognizes it > incorrectly without any pre-processing. > >

[tesseract-ocr] Paragraph/Block Reading Order of Text

2018-07-06 Thread Mohit Jain
I'd like to know what algorithm/heuristics Tesseract follows to determine the order in which blocks of text are read. Analysing the output of Tesseract on complex layout documents, I see that it's not a simple row-order/column-order, but rather some sort of hybrid of the two. Can someone

Re: [tesseract-ocr] Not Able to get Text

2018-07-06 Thread Zdenko Podobny
AFAIK Google does not use Tesseract in its OCR API. On Fri, 6 Jul 2018, 11:43 Pranay Saxena wrote: > Hi > > Thanks for your reply. But the Google API for OCR also uses the same Tesseract, and > with the Google API we are able to read the text. > > I also tried removing noise from the image, but it still did not work.

[tesseract-ocr] how to improve dot-matrix digits recognize accuracy

2018-07-06 Thread Wenjie Chen
Hi folks, Below is the dot-matrix digits picture, *tesseract* recognizes it incorrectly without any pre-processing. I applied erosion via OpenCV, the digit 1

Re: [tesseract-ocr] Not Able to get Text

2018-07-06 Thread Pranay Saxena
Hi, thanks for your reply. But the Google API for OCR also uses the same Tesseract, and with the Google API we are able to read the text. I also tried removing noise from the image, but it still did not work. Can you suggest any other way to get it done? Regards, Pranay On Fri, Jul 6, 2018, 14:48 Zdenko Podobny

Re: [tesseract-ocr] Not Able to get Text

2018-07-06 Thread Zdenko Podobny
The images you provided are noisy. Tesseract is not designed to work with such images (e.g. to break CAPTCHAs). Zdenko On Fri, 6 Jul 2018 at 11:14 Pranay Saxena wrote: > Hi > > I have read and made all the changes to increase the quality for a better result. > > I tried the Google API as well, and Google

Re: [tesseract-ocr] Not Able to get Text

2018-07-06 Thread Pranay Saxena
Hi, I have read and made all the changes to increase the quality for a better result. I tried the Google API as well, and it works properly, but on my own I'm not able to read the text. Could you please help me out with this? Regards, Pranay On Fri, Jul 6, 2018, 14:22 wrote: > Hi, >

Re: [tesseract-ocr] Not Able to get Text

2018-07-06 Thread Zdenko Podobny
Please read the wiki on improving Tesseract results. Zdenko On Fri, 6 Jul 2018 at 10:52 wrote: > Hi, > > I have been using tesseract for a long time and it has been working fine. We got some > new images, but these images are not being parsed by tesseract. > > I removed extra noise, changed to greyscale,

[tesseract-ocr] Really poor performance with decimal numbers

2018-07-06 Thread Alberto Andreotti
Hello, I'm having problems with the simplest image possible. It's a screenshot from GEdit (Ubuntu's text editor), with numbers and points. This is what I get: 23.78 15 1.6 17.6 25 225 2235 0.5 Alberto. Version: tesseract 4.0.0-beta.1-285-g8d3f, run from the command line like this: tesseract