To clarify: Shree's script is useful when your images contain more than one line of text. If every image is a single line, the script won't do much for you.

On Wednesday, November 1, 2023 at 4:20:09 PM UTC+3 Des Bw wrote:
>
> *1. Using synthetic data:*
> What can you do if you do not have data that is confirmed to be accurate?
> The only way around it that I know of is to use synthetic data. That is, you
> generate the images from the texts using the text2image script, then train
> on those images. The accuracy of the resulting model is not going to be
> perfect, because the actual data is messier than the synthetic data. But
> you can try different methods to get better accuracy:
> (a) Training from an existing network: that is, you can cut the top layer
> of a working model and train from that one.
> (b) Configuring the text2image script to add noise to the synthetic data so
> that it looks more like the actual images.
> (c) Using a larger dataset.
> etc.
>
> *2. The hocr hack:*
> I haven't tried this method myself, but I read on GitHub that Shree has
> some kind of hack (script) that uses the hocr script inside tesseract:
> https://github.com/tesseract-ocr/tesstrain/issues/7
> (a) First, OCR the images with the standard model to hocr format.
> (b) He then breaks the hocr output down into box, tif, and text files.
> (c) He then compares the text files with the images and manually corrects
> the faulty ones.
> This one also requires a lot of manual work, because the standard model
> will miss a lot of characters.
>
> 3. Alternatively, you can try other OCR engines such as *EasyOCR*. Some
> people say EasyOCR is better at OCRing those kinds of images, while tesseract
> is better for scanned docs.
>
> On Wednesday, November 1, 2023 at 3:57:48 PM UTC+3 [email protected]
> wrote:
>
>> Thank you for your responses. Regarding my question, and referring to the
>> official documentation at
>> https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html , the
>> generated .box files for LSTM-based training have the *same coordinates* for
>> every character because they use line-level boxes instead of
>> character-level boxes.
>>
>> Also, I have a couple of concerns:
>> 1) I'm working on license plate recognition and have 80K car plate images
>> with noise. Most of the .box files generated by lstmbox are incorrect
>> compared with the ground truth text. Manually editing all these box files
>> will be very time-consuming. Do you have any suggestions to shorten the time?
>> 2) Do I need to manually check all 80K box files to ensure the accuracy
>> of my training data?
>>
>> On Wednesday, November 1, 2023 at 9:21:36 PM UTC+9 [email protected]
>> wrote:
>>
>>> "Please note that box files generated using makebox config file are OK
>>> for training legacy models but not for LSTM training." Makebox is the
>>> tool included in tesseract to generate box files. It looks like it
>>> was used for the legacy model. For the current model, text2image is the
>>> way to do it.
>>>
>>> On Wednesday, November 1, 2023 at 3:02:28 PM UTC+3 Des Bw wrote:
>>>
>>>> I don't know what you are trying to do. I am not familiar with this
>>>> method of box generation. But I think the command you are running is
>>>> supposed to generate them with the same coordinates. Look at the example
>>>> here:
>>>> https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html
>>>>
>>>> On Wednesday, November 1, 2023 at 2:57:46 PM UTC+3 [email protected]
>>>> wrote:
>>>>
>>>>> On 1 Nov 2023 at 11:51:27 AM, TRAN TRONG KHANH[학생](대학원 컴퓨터공학과) <
>>>>> [email protected]> wrote:
>>>>>
>>>>>>
>>>>> Are you trying to generate box files from the images (tif files)?
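
For single-line images where you already trust the ground-truth text, one option is to write the line-level box file directly from the text instead of correcting lstmbox output by hand. Below is a minimal sketch of that idea, not Shree's script or any official tool. It assumes each image holds exactly one text line and that the box should span the whole image. `make_line_box` is a hypothetical helper name, the plate text in the usage line is made up, and the layout (one row per character sharing the same line box, then a tab row marking the end of the line) is my reading of the Make-Box-Files page linked in the thread. Combining characters are not handled.

```python
def make_line_box(text: str, width: int, height: int, page: int = 0) -> str:
    """Build line-level box-file content for one single-line image.

    Hypothetical helper: every character gets the same box, which is
    assumed to cover the whole image (left=0, bottom=0, right=width,
    top=height). A final row starting with a tab marks the end of the line.
    """
    text = text.strip()
    if not text:
        raise ValueError("ground-truth text is empty")
    if "\n" in text:
        raise ValueError("expected exactly one line of text")

    box = f"0 0 {width} {height} {page}"
    rows = [f"{ch} {box}" for ch in text]  # one row per character
    rows.append(f"\t {box}")               # end-of-line marker row
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    # Made-up plate text and image size, for illustration only.
    print(make_line_box("12A 3456", 200, 60), end="")
```

The width and height would come from the image itself (for example, read with an imaging library), and the returned string would be saved next to the image with a .box extension. This only removes the box-editing step; the ground-truth text still has to be correct.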

