All of the transformations were applied with GIMP 2.10.30. I don't think the tools are going to be much different in any recent version.
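For anyone who would rather script this than click through GIMP, the steps in this message can be roughly approximated in Python with Pillow. This is only a sketch, not what was actually run: the function name `preprocess` and its parameter names are made up for illustration, the white canvas fill is an assumption, and Pillow's blur and threshold are not guaranteed to match GIMP's pixel for pixel, so the 2.5 and 217 values may need some adjusting.

```python
from PIL import Image, ImageFilter


def preprocess(img, blur=2.5, low=217, canvas_height=500, final_height=1000):
    """Approximate the GIMP steps: blur, threshold, add border, scale up.

    All parameter names are hypothetical; the defaults mirror the numbers
    quoted in this message.
    """
    # Work on a single-channel grayscale copy.
    gray = img.convert("L")

    # Gaussian blur (GIMP: Filters -> Blur -> Gaussian Blur, size 2.50).
    blurred = gray.filter(ImageFilter.GaussianBlur(radius=blur))

    # Threshold: values in [low, 255] become white, everything else black.
    binary = blurred.point(lambda v: 255 if v >= low else 0)

    # Canvas resize with the aspect ratio locked and the image centered.
    # Assumption: the new border is filled with white.
    w, h = binary.size
    canvas_w = round(w * canvas_height / h)
    canvas = Image.new("L", (canvas_w, canvas_height), 255)
    canvas.paste(binary, ((canvas_w - w) // 2, (canvas_height - h) // 2))

    # Scale the whole image up with cubic interpolation.
    final_w = round(canvas_w * final_height / canvas_height)
    return canvas.resize((final_w, final_height), Image.BICUBIC)
```

Usage would look something like `preprocess(Image.open("input.png")).save("fixedup.png")`, where `input.png` is a placeholder for your own scan. Saving with a `.png` name keeps the output lossless.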
Blur is Filters -> Blur -> Gaussian Blur. Set Size X and Size Y to 2.50.

Colors -> Threshold... Set the left number to 217. The right number should be 255 already. You might be able to get better 5's and 3's by playing with these numbers a little bit.

We now have a binary image which is generally best for OCR performance.

Next is Image -> Canvas size... Lock the Width:Height ratio with the rectangular chain thingy, set Height to 500, click the Center button and Resize.

Image -> Scale Image... Lock the Width:Height ratio by clicking the square chain thingy. Change the Height to 1000 pixels. The default interpolation of Cubic is fine. Hit "Scale".

Now File -> Export as... and save it as "fixedup.png". Don't use jpeg for OCR if you can possibly avoid it.

On Sun, Nov 5, 2023 at 5:21 AM Des Bw <[email protected]> wrote:

> Dear piggy, can you elaborate what you did with the images please?
> The tools you used; and the modifications you did.
> I was trying to replicate what you did. But, I am not getting what you
> get.
> Is scaling up the image the same thing as increasing the DPI of the image?
>
> On Friday, November 3, 2023 at 4:28:49 PM UTC+3 piggy wrote:
>
>> I think the biggest improvement came from the blur followed by the right
>> thresholding. That improves the division of the page into separate letters.
>>
>> The added border allowed tesseract to pick up the right-hand side of the
>> numbers better. I was hoping to pick up the 1's down the left-hand side,
>> but that didn't work.
>>
>> Scaling up is a heuristic trick I've used in the past, and it helped here.
>>
>> On Thu, Nov 2, 2023 at 6:36 PM Slartybartfast <[email protected]> wrote:
>>
>>> Thank you! The original has much more border around it. I just cropped
>>> it for easier viewing here. I already did a little bit of pre-processing
>>> but looks like I need to do more. Interesting that scaling up improved
>>> things. According to one analysis done, accuracy depends on character
>>> height.
>>> According to that - I had the optimum character height, but maybe
>>> things have changed. The original scan was done at 300 dpi. I'll try 600.
>>>
>>> Incidentally ... I got so frustrated I wrote my own OCR program today.
>>> Only took me a few hours. Much more accurate than Tesseract, though working
>>> with fixed-width fonts makes life a lot easier!! Just divide the image up
>>> into a grid, and pattern match each "cell". As I was only interested in the
>>> numbers, I only had 16 (hex digits) to match against.
>>>
>>> Cheers
>>>
>>>
>>> On Thursday, November 2, 2023 at 12:43:12 PM UTC piggy wrote:
>>>
>>>> I added more white space around the target text by scaling the canvas
>>>> to 500 pixels wide, and then scaled up the whole image by a factor of 2.
>>>>
>>>> -230 6 5O
>>>>
>>>> 90 6 50
>>>>
>>>> 90 6 -100
>>>> 130 6 -100
>>>> 130 6 -150
>>>>
>>>> On Thu, Nov 2, 2023 at 8:35 AM La Monte H. P. Yarroll <[email protected]> wrote:
>>>>
>>>>> I had a little success applying 2.5 pixels of blur and then
>>>>> thresholding at 217-255. FWIW, I used gimp for the preprocessing. Here's
>>>>> what I got after just a few minutes:
>>>>> a i @)
>>>>>
>>>>> -230 & 50
>>>>> 90 6 50
>>>>> 90 6 -100
>>>>>
>>>>> 130 6
>>>>> 130 6
>>>>>
>>>>> ~100
>>>>> -130
>>>>>
>>>>> I don't know what happened to the first column or why the last 2 lines
>>>>> got split the way they did.
>>>>>
>>>>>
>>>>> On Wed, Nov 1, 2023 at 4:30 PM Slartybartfast <[email protected]> wrote:
>>>>>
>>>>>> Doesn't anybody have any ideas? :-(
>>>>>>
>>>>>> On Tuesday, October 24, 2023 at 5:40:20 PM UTC+1 Slartybartfast wrote:
>>>>>>
>>>>>>> Hi
>>>>>>> I am a new tesseract user, and I'm really struggling to get it to
>>>>>>> produce any kind of sensible results, especially with numerical text.
>>>>>>> I have some text that looks like this:
>>>>>>> [image: example_input.jpg]
>>>>>>> I've read the documentation, and looked through the parameter list,
>>>>>>> and I added the following to the command line:
>>>>>>> --psm 6
>>>>>>> -c preserve_interword_spaces=1
>>>>>>> -c textord_dotmatrix_gap=6
>>>>>>> -c classify_bln_numeric_mode=1
>>>>>>> -c rej_alphas_in_number_perm=1
>>>>>>>
>>>>>>> But I just get garbage out:
>>>>>>>
>>>>>>> Oo -250 6 3a
>>>>>>> 190 & So
>>>>>>> 190 6 -100
>>>>>>> 1 $1290 6 ~140
>>>>>>> 1 $130 6 ~150
>>>>>>>
>>>>>>> I've tried all sorts of additional image processing to try and
>>>>>>> improve the look of the text, but none of it works. In fact, this is the
>>>>>>> best output I've seen. It's usually worse. I'm really hoping someone who
>>>>>>> has worked with dot-matrix input can offer some magic incantation to make
>>>>>>> tesseract come to its senses. Thanks.
>>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/15797f86-58c9-4e71-b316-54f663d04cbfn%40googlegroups.com
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/15797f86-58c9-4e71-b316-54f663d04cbfn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>>
>>>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAL7mBq7HLt1GagN_Te3sf6EYWOBTfZxErOtrvuf1tzqYsncAuw%40mail.gmail.com.

