Re: [tesseract-ocr] Reading Device labels to get model number

2014-11-13 Thread ShreeDevi Kumar
Straighten the image before sending it to Tesseract. You can use ScanTailor or unpaper. ImageMagick may also have an option, but you'll have to look. See the attached images (output from ScanTailor), which were then OCRed using VietOCR (a GUI frontend to Tesseract): MODEL NAME 7 MOORE RF28HMEDBSR ml.“ | mt

[tesseract-ocr] Re: Help with lottery tickets

2014-11-13 Thread PorridgeBear
You will need to perform some kind of pre-processing before sending it to Tesseract. For instance, if you always knew the ticket was a certain size and the image was always straight, you could first crop out the rectangular areas for each row (I'm assuming you are looking for row numbers here
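
The row-cropping idea above can be sketched as follows. This is only a minimal sketch, assuming a straightened ticket image of known size with evenly spaced rows; the function name `row_boxes` and the layout numbers are hypothetical and not taken from the original ticket. The boxes follow the (left, upper, right, lower) convention that Pillow's `Image.crop` accepts.

```python
def row_boxes(width, top, row_height, n_rows):
    """Return one (left, upper, right, lower) crop box per ticket row."""
    boxes = []
    for i in range(n_rows):
        upper = top + i * row_height
        boxes.append((0, upper, width, upper + row_height))
    return boxes


# Hypothetical layout: a 600px-wide ticket whose rows start at y=120
# and are 40px tall each.
for box in row_boxes(600, 120, 40, 3):
    print(box)
```

Each box could then be cropped out and sent to Tesseract on its own, so the engine only ever sees one row of numbers at a time.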

[tesseract-ocr] Re: Reading Device labels to get model number

2014-11-13 Thread Allistair C
I think the table lines are not helping. I up-sized your image to 1000px wide, then ran it through Tesseract with PSM=6 and got mostly rubbish. Then I removed the table lines manually in Photoshop, up-sized the image to 1000px wide again, and ran it through Tesseract with PSM=6: RFZBHMEDBSR R 134a/
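
The two steps described above (up-sizing to 1000px wide, then running with page segmentation mode 6, "assume a single uniform block of text") can be sketched like this. The helper names and file names are hypothetical. The sketch only computes the target size and builds the command line; it does not resize or run anything. Tesseract 3.x spells the flag `-psm`, while 4.0 and later use `--psm`, so the flag is switchable here.

```python
def scaled_size(width, height, target_width=1000):
    """Scale (width, height) to target_width, preserving the aspect ratio."""
    scale = target_width / width
    return target_width, round(height * scale)


def tesseract_command(image_path, output_base, psm=6, legacy_flag=False):
    """Build a tesseract command line with the given page segmentation mode.

    legacy_flag=True uses the 3.x spelling (-psm) instead of --psm.
    """
    flag = "-psm" if legacy_flag else "--psm"
    return ["tesseract", image_path, output_base, flag, str(psm)]


# Hypothetical 500x200 label image and file names.
print(scaled_size(500, 200))
print(tesseract_command("label.png", "out", legacy_flag=True))
```

The resulting list can be passed to `subprocess.run` once the image has actually been resized with an image tool of your choice.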

[tesseract-ocr] What are the possible output file extensions?

2014-11-13 Thread Traun Leyden
I'm trying to fix a bug (https://github.com/tleyden/open-ocr/issues/18#issuecomment-62277655) where the code looks for the output in filename.txt when it should be looking in filename.hocr instead. Are there any other file extensions that tesseract can write to? Or just .txt / .hocr?

[tesseract-ocr] Re: Reading Device labels to get model number

2014-11-13 Thread Allistair C
Do you have higher resolution images to work with? That's one issue here: the edges of your text are very fuzzy, and at that resolution it's pretty hard for Tesseract. You can also play with Thresholding and Opening (Erosion/Dilation) to thicken some of your lines up (using e.g.
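
To make the thresholding and erosion/dilation terms above concrete, here is a minimal pure-Python sketch on a tiny image stored as nested lists. It assumes dark text on a light background, a fixed threshold of 128, and a 3x3 neighbourhood; all of these are illustrative choices, not the poster's settings. Opening (erosion followed by dilation) removes small specks, while dilation on its own is the step that thickens strokes. A real pipeline would normally use an image library instead.

```python
def threshold(gray, t=128):
    """Binarize: pixels darker than t become foreground (1), others 0."""
    return [[1 if p < t else 0 for p in row] for row in gray]


def _neighbours(img, r, c):
    """Values in the 3x3 window around (r, c); out-of-bounds counts as 0."""
    h, w = len(img), len(img[0])
    return [img[rr][cc] if 0 <= rr < h and 0 <= cc < w else 0
            for rr in (r - 1, r, r + 1) for cc in (c - 1, c, c + 1)]


def erode(img):
    """Keep a pixel only if its whole 3x3 window is foreground."""
    return [[1 if all(_neighbours(img, r, c)) else 0
             for c in range(len(img[0]))] for r in range(len(img))]


def dilate(img):
    """Set a pixel if any pixel in its 3x3 window is foreground."""
    return [[1 if any(_neighbours(img, r, c)) else 0
             for c in range(len(img[0]))] for r in range(len(img))]


def opening(img):
    """Erosion followed by dilation: drops specks smaller than the window."""
    return dilate(erode(img))
```

Applying `dilate` alone after `threshold` grows every stroke by one pixel on each side, which is the "thickening" effect mentioned above.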

Re: [tesseract-ocr] What are the possible output file extensions?

2014-11-13 Thread ShreeDevi Kumar
.txt, .pdf and .hocr. pdf and hocr can be passed as CONFIG file options when using tesseract from the command line, and txt output is created automatically (in both cases, I think). This is with the latest version of tesseract from git. ShreeDevi
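
Based only on the extensions listed in this reply, a caller that needs to find the output file could map the config name to an extension as sketched below. The helper names are hypothetical, and the mapping reflects the git version described above; other releases may behave differently. Whether a .txt file is also written alongside .hocr or .pdf is left unconfirmed in the reply, so the sketch does not rely on it.

```python
# Config name passed on the command line -> extension of the output file,
# as described in the reply above (None means no config, plain text).
OUTPUT_EXTENSIONS = {None: ".txt", "hocr": ".hocr", "pdf": ".pdf"}


def tesseract_command(image_path, output_base, config=None):
    """Build the command line; the config name goes after the output base."""
    cmd = ["tesseract", image_path, output_base]
    if config is not None:
        cmd.append(config)
    return cmd


def expected_output(output_base, config=None):
    """Return the file name the caller should look for."""
    return output_base + OUTPUT_EXTENSIONS[config]


print(tesseract_command("scan.png", "filename", "hocr"))
print(expected_output("filename", "hocr"))
```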

[tesseract-ocr] Re: Reading Device labels to get model number

2014-11-13 Thread shree
Also take a look at the pre-processing method mentioned at https://github.com/tleyden/open-ocr/wiki/Stroke-Width-Transform-In-Action On Thursday, November 13, 2014 3:30:03 AM UTC+5:30, Bill Garrison wrote: So if someone sends in labels like the attached ones, I need to grab the model number.

Re: [tesseract-ocr] Covering ASCII Extended range.

2014-11-13 Thread Ryan Dev
Wow! Awesome. That file definitely helps. It fixed a few issues, but introduced a few of its own, so currently I am running eng+asc and that is giving great output, and is running faster than eng+deu. Attached is an example image and output using asc. Note that asc is getting the 'ü' as a

Re: [tesseract-ocr] Covering ASCII Extended range.

2014-11-13 Thread ShreeDevi Kumar
The asc traineddata does not have a wordlist or dictionary, so using eng will help with that. Also, I just trained using a few fonts that support the whole range. If you train with the font you are using, you will get better results. You can use the 'combine_tessdata' command with the -u (unpack) option
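
As a rough sketch of the two commands discussed in this thread, the helpers below build a `combine_tessdata -u` invocation and a Tesseract invocation that combines languages with `-l eng+asc`. The helper names, file paths and output prefix are hypothetical. The unpack command takes the traineddata file followed by a path prefix for the extracted components. Nothing is executed here.

```python
def unpack_command(traineddata_path, output_prefix):
    """Build a combine_tessdata command that unpacks a traineddata file."""
    return ["combine_tessdata", "-u", traineddata_path, output_prefix]


def ocr_command(image_path, output_base, languages=("eng", "asc")):
    """Build a tesseract command using several languages joined by '+'."""
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


# Hypothetical paths; pass the lists to subprocess.run to execute them.
print(unpack_command("tessdata/asc.traineddata", "unpacked/asc."))
print(ocr_command("sample.png", "out"))
```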