This is not about training, but you should use the latest version of Tess4J, which corresponds to the latest Tesseract releases.
https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j
https://github.com/nguyenq/tess4j

Hope it will produce better results for you.

On Saturday, January 29, 2022 at 1:52:48 AM UTC-6 Bernd Angelo wrote:

> Hello, I am having trouble getting numbers recognized.
> I am using Tess4J from http://tess4j.sourceforge.net/
> which, if I am not wrong, uses Tesseract 3.05 in the background.
>
> I followed the instructions outlined here:
> http://tess4j.sourceforge.net/tutorial/
> (using the command-line version, no Eclipse, Maven or anything else)
>
> I can modify the TesseractExample.java file without any issue and, by
> running the two command-line commands mentioned on that page, I can run
> a Tesseract OCR scan on any PNG or JPG I want.
>
> What I ultimately want is to use OCR to make my program "read" the
> balance of an online casino. With that balance available as a string
> variable, I will take all kinds of actions based on it, so reading the
> numbers properly is important.
>
> For test purposes I took two screenshots that together include all the
> different digits that can appear, so 0-9.
>
> When I run the normal OCR as instructed on the page above (as far as I
> know, it then uses the standard pre-trained eng.traineddata file),
> both the digits 4 and 6 in the image are sadly read as 5.
> The euro sign € is also read as the pound sign instead, but that is of
> minor importance to me.
> The OCR not being able to distinguish between 4 and 6 really sucks.
>
> The pictures used are these:
> https://ibb.co/ZTRFqVg
> https://ibb.co/p23w7nj
>
> As said, they are basically screenshots of the casino site, so I can't
> influence the font, the size or anything else.
>
> The OCR reads the "4,6" part as "5,5", which is bad.
>
> So I thought, why not use the two images to train Tesseract? Obviously,
> Tesseract having seen all the possible digits should give it 100%
> accuracy, right?
> Well, I got myself jTessBoxEditor, got myself Serak Tesseract Trainer,
> did a ton of stuff and created the traineddata from the image.
> Then I made the OCR program use it to OCR the image again.
> I wrote a line in my code to System.out.print the string and also
> print its length.
> I don't know what the OCR does, but the result written to the
> command-line window is an empty line (where the result string should
> be), and the string length is reported as 6 (it should be 11 with all
> the digits plus the . and , included).
> So I don't know what the OCR is doing; it performs far worse than with
> the standard eng language.
>
> I did a bit of googling. Apparently the font "Alte DIN 1451
> Mittelschrift" is VERY similar to my numbers; the casino (for the
> balance display at least) uses this font or a very similar one.
> So while I know of a font worth training with (I have also already
> downloaded its TTF file), I haven't the slightest idea how to train
> with the font.
>
> Can someone please help me and explain why the OCR result can be that
> bad after training with the actual image to be OCRed?
> (It was a pain to fit the rectangles perfectly to the digits!)
> Or how to train Tess4J with the given font?
> Google even tells me about a one-click service for this, but sadly it
> is apparently gone by now.
>
> Can someone help me please? :-)
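
Regarding the result that prints as an empty line but reports a length of 6: one way to see what those six characters actually are is to print each code point instead of the raw string. The following is only a minimal sketch using the standard Java library. The class name OcrResultInspector and the method describe are hypothetical helpers, not part of Tess4J, and the sample string is a stand-in, since the real OCR output is not known here.

```java
import java.util.StringJoiner;

/** Hypothetical helper (not part of Tess4J): lists the code points of an OCR result. */
final class OcrResultInspector {

    /** Returns each code point of the text as "U+XXXX", separated by single spaces. */
    static String describe(String text) {
        StringJoiner out = new StringJoiner(" ");
        text.codePoints().forEach(cp -> out.add(String.format("U+%04X", cp)));
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by the OCR call; the real content is unknown.
        String result = " \n\n";
        System.out.println("length = " + result.length()); // prints "length = 3"
        System.out.println(describe(result));              // prints "U+0020 U+000A U+000A"
    }
}
```

If the real result turns out to consist only of spaces and line breaks, the trained data is recognizing nothing in the image; if it contains other characters, the listing shows which ones.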