Re: [tesseract-ocr] Many 'question mark' chars in recognized text

2014-10-17 Thread Salvo Piazza
Hi Zdenko, thanks for your response. I am only a beginner with Tesseract, so can you tell me how I can check it? (I use a Linux version of Tesseract...) Thanks, Salvo. On Thursday, October 16, 2014 21:46:31 UTC+2, zdenop wrote: fl is recognized as a ligature in English, so there

[tesseract-ocr] Re: Difference between sudo apt-get install tesseract and installing from source

2014-10-17 Thread Rick Leir
You probably got the source for a different version of Tesseract. This might not matter, depending on what you are doing. Find out the version by running it: you will see 'Tesseract Open Source OCR Engine v3.04.00 with Leptonica' or similar. How to train:
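As a rough illustration of the version check mentioned above, here is a minimal Python sketch that pulls the version number out of a banner like the one quoted. The function names are hypothetical, and the assumption that the installed binary accepts `--version` and may print to either stdout or stderr should be verified on your system.

```python
import re
import subprocess


def parse_tesseract_version(banner):
    """Return the first dotted version number in a banner string, or None."""
    match = re.search(r"(\d+\.\d+(?:\.\d+)?)", banner)
    return match.group(1) if match else None


def installed_tesseract_banner():
    """Run the tesseract executable and return whatever it prints.

    Assumption: `tesseract --version` is available on PATH; both output
    streams are captured because the stream used can vary by release.
    """
    result = subprocess.run(
        ["tesseract", "--version"], capture_output=True, text=True
    )
    return result.stdout + result.stderr


if __name__ == "__main__":
    print(parse_tesseract_version(installed_tesseract_banner()))
```

Comparing this value for the packaged binary and for the one built from source shows whether the two installations are actually different versions.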

Re: [tesseract-ocr] Many 'question mark' chars in recognized text

2014-10-17 Thread Rick Leir
On Linux try YAGF, it is a GUI front end for Tesseract. As zdenop said, you have a Unicode problem. You need to use UTF-8 for strings. On Friday, October 17, 2014 6:07:26 AM UTC-4, Salvo Piazza wrote: Hi Zdenko, thanks for your response. I am only a beginner with Tesseract, so can you

Re: [tesseract-ocr] Many 'question mark' chars in recognized text

2014-10-17 Thread zdenko podobny
OCR a test image with your app and store the result in a text file. Then OCR the same image with the tesseract executable (output should be in a text file by default) and compare the results. If the output from the tesseract executable is OK, but the output from your app is wrong (e.g. there are only ASCII letters), then you have a problem
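The comparison described above can be sketched in Python. This is only an illustration under stated assumptions: the function names and the file names `cli_output.txt` and `app_output.txt` are hypothetical, and both files are assumed to be readable as UTF-8.

```python
def non_ascii_chars(text):
    """Return the distinct non-ASCII characters in text, sorted by code point."""
    return sorted({ch for ch in text if ord(ch) > 127})


def lost_characters(cli_text, app_text):
    """Non-ASCII characters present in the CLI output but absent from the app output."""
    return [ch for ch in non_ascii_chars(cli_text) if ch not in app_text]


def compare_files(cli_path, app_path):
    """Read both result files as UTF-8 and report what the app output lost."""
    with open(cli_path, encoding="utf-8") as f:
        cli_text = f.read()
    # errors="replace" keeps the comparison running even if the app wrote bad bytes
    with open(app_path, encoding="utf-8", errors="replace") as f:
        app_text = f.read()
    return lost_characters(cli_text, app_text)


if __name__ == "__main__":
    # Hypothetical file names; substitute the two files you produced.
    print(compare_files("cli_output.txt", "app_output.txt"))
```

A non-empty result, for example a list containing the "ﬂ" ligature, suggests the text is being damaged in the application's string handling and not in the recognition step.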

[tesseract-ocr] how can I get better results for this

2014-10-17 Thread Rick Leir
I have been getting great results from Tesseract when the images are clear. However, many of my images are crummy. How would you get the best results for this? Maybe improved training, maybe image pre-processing? The original is like this:

[tesseract-ocr] Re: produce delimited output using hOCR or by preserving original document spacing

2014-10-17 Thread Rick Leir
If you like Perl you can parse values from the hOCR. You will need to change this to suit:
    sub saveStats {
        my ( $outHcr, $outStats) = @_;
        open( STFILE, $outStats);
        # get just the x_wconf values from the hocr file:
        # write to a stats file with a wconf per line
        my $confsum
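For readers who prefer Python, the same idea (collecting the x_wconf values from an hOCR file) might look like the sketch below. It is not a translation of the Perl above, which is cut off in this preview. The function names and the sample markup are illustrative, and the regular expression assumes each word's `title` attribute contains `x_wconf` followed by a number.

```python
import re

# Matches "x_wconf 87" or "x_wconf 87.5" inside an hOCR title attribute.
WCONF_PATTERN = re.compile(r"x_wconf\s+(-?\d+(?:\.\d+)?)")


def word_confidences(hocr_text):
    """Return every x_wconf value found in the hOCR text, in document order."""
    return [float(value) for value in WCONF_PATTERN.findall(hocr_text)]


def mean_confidence(hocr_text):
    """Average word confidence, or None when the text has no x_wconf values."""
    values = word_confidences(hocr_text)
    return sum(values) / len(values) if values else None


if __name__ == "__main__":
    sample = (
        "<span class='ocrx_word' title='bbox 10 10 50 30; x_wconf 91'>Hello</span>"
        "<span class='ocrx_word' title='bbox 60 10 99 30; x_wconf 64'>world</span>"
    )
    for value in word_confidences(sample):
        print(value)  # one confidence per line, as in the stats file idea
```

A regular expression is enough here because only one attribute fragment is needed; a full HTML parser would be the safer choice if the word text or bounding boxes are also required.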

Re: [tesseract-ocr] how can I get better results for this

2014-10-17 Thread ShreeDevi Kumar
https://code.google.com/p/tesseract-ocr/wiki/ImproveQuality Try with the image at 300 dpi or higher. Resize 300%. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Fri, Oct 17, 2014 at 8:35 PM, Rick Leir rich...@c7a.ca

Re: [tesseract-ocr] how can I get better results for this

2014-10-17 Thread Rick Leir
Thanks, ShreeDevi. I opened the JPG in GIMP, and you can see that it is about 100 pixels per text line: https://lh5.googleusercontent.com/-jAAkrAFL_wE/VEE3pA5LMbI/ADs/1kExQh_pdiA/s1600/gimpOriginal.png On Friday, October 17, 2014 11:23:37 AM UTC-4, shree wrote:

Re: [tesseract-ocr] how can I get better results for this

2014-10-17 Thread ShreeDevi Kumar
You have to experiment. I got better results after some image processing and VietOCR. that it has bcln dooi transfer of a portzon which has been leased an. M- nan-ant.‘ 0n Mu [image: Inline image 1] ShreeDevi भजन - कीर्तन -

Re: [tesseract-ocr] how can I get better results for this

2014-10-17 Thread Robert Komar
On Fri, 17 Oct 2014, Rick Leir wrote: I opened the jpg in Gimp, and you can see that it is about 100 pixels per text line: [gimpOriginal.png] That image looks to be scanned at about 150 dpi. With such faint characters, scanning at 300 or 600 dpi would have been better. Anyway, try scaling
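The arithmetic behind rescaling a low-resolution scan can be sketched as follows. This is only an illustration: the 150 dpi figure is the estimate given above, 300 dpi is the target suggested in this thread, the pixel dimensions are made up, and the function names are hypothetical. Upscaling cannot recover detail that a higher-resolution scan would have captured.

```python
def scale_factor(current_dpi, target_dpi):
    """Factor by which to enlarge an image so its effective resolution reaches target_dpi."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return target_dpi / current_dpi


def scaled_size(size, factor):
    """Pixel dimensions (width, height) after scaling by factor, rounded to whole pixels."""
    width, height = size
    return (round(width * factor), round(height * factor))


if __name__ == "__main__":
    factor = scale_factor(150, 300)            # estimated scan dpi, suggested target dpi
    print(factor)
    print(scaled_size((1240, 1754), factor))   # hypothetical page size in pixels
```

The resulting dimensions can be passed to whatever image tool is already in use for the actual resize.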