Unfortunately, GIMP is an interactive application, so it is difficult to make it part of a cleanup pipeline. It can be done, though, and there is a tutorial on doing exactly that: https://www.gimp.org/tutorials/Automate_Editing_in_GIMP/
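For anyone who wants to try the steps quoted below (blur, threshold at 217, add a white border, scale up by 2) without GIMP, here is a minimal sketch in plain Python. It is an illustration and not what was actually run: the image is assumed to be a grayscale list of rows with values 0..255, the 2.5 is treated as a Gaussian standard deviation, the `border` parameter and all function names are hypothetical, and nearest-neighbour scaling stands in for GIMP's Cubic interpolation.

```python
import math

def gaussian_kernel(sigma):
    """1-D Gaussian weights, truncated at 3 sigma and normalised to sum to 1."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(img, kernel):
    """Convolve every row with the kernel, clamping indices at the edges."""
    radius = len(kernel) // 2
    out = []
    for row in img:
        n = len(row)
        out.append([
            sum(k * row[min(max(x + i - radius, 0), n - 1)]
                for i, k in enumerate(kernel))
            for x in range(n)
        ])
    return out

def transpose(img):
    return [list(col) for col in zip(*img)]

def gaussian_blur(img, sigma):
    """Separable blur: one horizontal pass, then one vertical pass."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(img, kernel)), kernel))

def threshold(img, low=217):
    """Values in [low, 255] become white (255), everything else black (0)."""
    return [[255 if v >= low else 0 for v in row] for row in img]

def pad_center(img, new_w, new_h, fill=255):
    """Centre the image on a larger canvas (new size must not be smaller)."""
    h, w = len(img), len(img[0])
    left, top = (new_w - w) // 2, (new_h - h) // 2
    out = [[fill] * new_w for _ in range(new_h)]
    for y, row in enumerate(img):
        out[top + y][left:left + w] = row
    return out

def scale_nearest(img, factor):
    """Integer upscale by repeating each pixel and each row `factor` times."""
    return [[v for v in row for _ in range(factor)]
            for row in img for _ in range(factor)]

def preprocess(img, sigma=2.5, low=217, border=50, factor=2):
    """Blur, binarise, add a white border, then scale up."""
    binary = threshold(gaussian_blur(img, sigma), low)
    h, w = len(binary), len(binary[0])
    padded = pad_center(binary, w + 2 * border, h + 2 * border)
    return scale_nearest(padded, factor)
```

Reading and writing real image files is left out on purpose; in practice a tool such as ImageMagick or Leptonica would do these steps far faster than nested Python lists.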
Once I work out the steps to clean my images, I usually code something up using the ImageMagick suite. If it is exotic enough that I need to write C code for it, I generally use the Leptonica library.

On Mon, Nov 6, 2023 at 5:06 PM La Monte H. P. Yarroll <[email protected]> wrote:

> All of the transformations were applied with GIMP 2.10.30. I don't think
> the tools are going to be much different for any recent version.
>
> Blur is Filters -> Blur -> Gaussian Blur. Set Size X and Size Y to 2.50.
>
> Colors -> Threshold... Set the left number to 217. The right number should
> be 255 already. You might be able to get better 5's and 3's by playing with
> these numbers a little bit.
>
> We now have a binary image, which is generally best for OCR performance.
>
> Next is Image -> Canvas size... Lock the Width:Height ratio with the
> rectangular chain thingy, set Height to 500, click the Center button and
> Resize.
>
> Image -> Scale Image... Lock the Width:Height ratio by clicking the square
> chain thingy. Change the Height to 1000 pixels. The default interpolation
> of Cubic is fine. Hit "Scale".
>
> Now File -> Export as... and save it as "fixedup.png". Don't use JPEG for
> OCR if you can possibly avoid it.
>
> On Sun, Nov 5, 2023 at 5:21 AM Des Bw <[email protected]> wrote:
>
>> Dear piggy, can you elaborate on what you did with the images please?
>> The tools you used, and the modifications you did.
>> I was trying to replicate what you did. But I am not getting what you
>> get.
>> Is scaling up the image the same thing as increasing the DPI of the image?
>>
>> On Friday, November 3, 2023 at 4:28:49 PM UTC+3 piggy wrote:
>>
>>> I think the biggest improvement came from the blur followed by the right
>>> thresholding. That improves the division of the page into separate letters.
>>>
>>> The added border allowed tesseract to pick up the right-hand side of the
>>> numbers better. I was hoping to pick up the 1's down the left-hand side,
>>> but that didn't work.
>>>
>>> Scaling up is a heuristic trick I've used in the past, and it helped
>>> here.
>>>
>>> On Thu, Nov 2, 2023 at 6:36 PM Slartybartfast <[email protected]> wrote:
>>>
>>>> Thank you! The original has much more border around it. I just cropped
>>>> it for easier viewing here. I already did a little bit of pre-processing
>>>> but it looks like I need to do more. Interesting that scaling up improved
>>>> things. According to one analysis done, accuracy depends on character
>>>> height. According to that, I had the optimum character height, but maybe
>>>> things have changed. The original scan was done at 300 dpi. I'll try 600.
>>>>
>>>> Incidentally ... I got so frustrated I wrote my own OCR program today.
>>>> Only took me a few hours. Much more accurate than Tesseract, though working
>>>> with fixed-width fonts makes life a lot easier!! Just divide the image up
>>>> into a grid, and pattern match each "cell". As I was only interested in the
>>>> numbers, I only had 16 (hex digits) to match against.
>>>>
>>>> Cheers
>>>>
>>>> On Thursday, November 2, 2023 at 12:43:12 PM UTC piggy wrote:
>>>>
>>>>> I added more white space around the target text by scaling the canvas
>>>>> to 500 pixels wide, and then scaled up the whole image by a factor of 2.
>>>>>
>>>>> -230 6 5O
>>>>>
>>>>> 90 6 50
>>>>>
>>>>> 90 6 -100
>>>>> 130 6 -100
>>>>> 130 6 -150
>>>>>
>>>>> On Thu, Nov 2, 2023 at 8:35 AM La Monte H. P. Yarroll <[email protected]> wrote:
>>>>>
>>>>>> I had a little success applying 2.5 pixels of blur and then
>>>>>> thresholding at 217-255. FWIW, I used GIMP for the preprocessing. Here's
>>>>>> what I got after just a few minutes:
>>>>>> a i @)
>>>>>>
>>>>>> -230 & 50
>>>>>> 90 6 50
>>>>>> 90 6 -100
>>>>>>
>>>>>> 130 6
>>>>>> 130 6
>>>>>>
>>>>>> ~100
>>>>>> -130
>>>>>>
>>>>>> I don't know what happened to the first column or why the last 2
>>>>>> lines got split the way they did.
>>>>>>
>>>>>> On Wed, Nov 1, 2023 at 4:30 PM Slartybartfast <[email protected]> wrote:
>>>>>>
>>>>>>> Doesn't anybody have any ideas? :-(
>>>>>>>
>>>>>>> On Tuesday, October 24, 2023 at 5:40:20 PM UTC+1 Slartybartfast wrote:
>>>>>>>
>>>>>>>> Hi
>>>>>>>> I am a new tesseract user, and I'm really struggling to get it to
>>>>>>>> produce any kind of sensible results, especially with numerical text. I
>>>>>>>> have some text that looks like this:
>>>>>>>> [image: example_input.jpg]
>>>>>>>> I've read the documentation, and looked through the parameter list,
>>>>>>>> and I added the following to the command line:
>>>>>>>> --psm 6
>>>>>>>> -c preserve_interword_spaces=1
>>>>>>>> -c textord_dotmatrix_gap=6
>>>>>>>> -c classify_bln_numeric_mode=1
>>>>>>>> -c rej_alphas_in_number_perm=1
>>>>>>>>
>>>>>>>> But I just get garbage out:
>>>>>>>>
>>>>>>>> Oo -250 6 3a
>>>>>>>> 190 & So
>>>>>>>> 190 6 -100
>>>>>>>> 1 $1290 6 ~140
>>>>>>>> 1 $130 6 ~150
>>>>>>>>
>>>>>>>> I've tried all sorts of additional image processing to try and
>>>>>>>> improve the look of the text, but none of it works. In fact, this is
>>>>>>>> the best output I've seen. It's usually worse. I'm really hoping someone
>>>>>>>> who has worked with dot-matrix input can offer some magic incantation
>>>>>>>> to make tesseract come to its senses. Thanks.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAL7mBq7kXezErNJcEa0FnUEUJ7b3vRaL5x12ptmTLM-cbuKz5w%40mail.gmail.com.
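The grid approach mentioned in the thread (divide a fixed-width image into cells and pattern match each cell) can be sketched roughly as follows. This is not Slartybartfast's program, which was not posted; it is a guess at the general idea. The image is assumed to be already binarised and aligned to the grid, rows are strings of `#` (ink) and `.` (paper), the tiny 3x3 templates are invented for the example, and matching by counting differing pixels is just one simple choice.

```python
def split_cells(img, cell_w, cell_h):
    """Cut the image into a grid of cells, in reading order."""
    rows = len(img) // cell_h
    cols = len(img[0]) // cell_w
    return [
        [
            [row[c * cell_w:(c + 1) * cell_w]
             for row in img[r * cell_h:(r + 1) * cell_h]]
            for c in range(cols)
        ]
        for r in range(rows)
    ]

def distance(a, b):
    """Number of pixels that differ between two cells of equal size."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def classify(cell, templates):
    """Return the character whose template differs least from the cell."""
    return min(templates, key=lambda ch: distance(cell, templates[ch]))

def read_grid(img, templates, cell_w, cell_h):
    """Recognise every cell and return one string per text line."""
    return [
        "".join(classify(cell, templates) for cell in line)
        for line in split_cells(img, cell_w, cell_h)
    ]

# Hypothetical 3x3 templates, purely for illustration.
TEMPLATES = {
    "0": ["###", "#.#", "###"],
    "1": [".#.", ".#.", ".#."],
    "7": ["###", "..#", "..#"],
    " ": ["...", "...", "..."],
}

# One text line holding "1", a "0" with one damaged pixel, and "7".
image = [
    ".#.######",
    ".#.#.#..#",
    ".#.##...#",
]
print(read_grid(image, TEMPLATES, 3, 3))  # prints ['107']
```

With only 16 hex digits to tell apart, a nearest-template rule like this tolerates a few damaged pixels per cell, which is presumably why it copes with dot-matrix print better than a general-purpose engine.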

