To clarify, Shree's script is useful when your images contain more than one 
line of text. If they are all single-line images, that script won't do much for you. 
On Wednesday, November 1, 2023 at 4:20:09 PM UTC+3 Des Bw wrote:

>  
> *1. Using synthetic data:*
> What can you do if you do not have data that is confirmed to be accurate?
> The only way around that I know is to use synthetic data. That is: you 
> generate the images from the texts using the text2image tool. You then train 
> on those. The accuracy of the resulting model is not going to be 
> perfect because the actual data is messier than the synthetic data. But 
> you can try different methods to get better accuracy: 
> (a) fine-tuning from an existing network: that is, you can cut off the top 
> layer of a working model and train from there. 
> (b) configuring text2image to add noise to the synthetic data so that 
> it looks more like the actual images. 
> (c) using a larger dataset
> etc.
>
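A rough sketch of option 1 for the plate case: generate random plate-like strings, write one ground-truth file per string, and build one text2image call per file. The plate pattern (two letters plus four digits), the font name, and the directory names are made up for illustration. As far as I know `--text`, `--outputbase`, `--font`, `--fonts_dir`, and `--exposure` are existing text2image options, but check `text2image --help` on your install. Nothing is executed here; the commands are only returned.

```python
import random
import string
from pathlib import Path


def random_plate(rng):
    """Return a plate-like string such as 'AB1234' (the pattern is an assumption)."""
    letters = "".join(rng.choice(string.ascii_uppercase) for _ in range(2))
    digits = "".join(rng.choice(string.digits) for _ in range(4))
    return letters + digits


def text2image_command(text_file, output_base, font, fonts_dir, exposure=0):
    """Build the argument list for one text2image run (it is not executed)."""
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={output_base}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
        f"--exposure={exposure}",
    ]


def write_samples(out_dir, count, font="Arial", fonts_dir="/usr/share/fonts", seed=0):
    """Write one .gt.txt file per sample and return the commands to render them."""
    rng = random.Random(seed)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for i in range(count):
        base = out_dir / f"plate_{i:05d}"
        gt_file = out_dir / f"plate_{i:05d}.gt.txt"
        gt_file.write_text(random_plate(rng) + "\n", encoding="utf-8")
        # Vary the exposure a little so the rendered images are not all identical.
        exposure = rng.choice([-1, 0, 1])
        commands.append(text2image_command(gt_file, base, font, fonts_dir, exposure))
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the font and paths match your system.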
> *2. The hOCR hack:*
> - I haven't tried this method myself. But I read on GitHub that Shree has 
> some kind of hack (a script) that uses Tesseract's hOCR output.
> https://github.com/tesseract-ocr/tesstrain/issues/7
> (a) First, OCR the images with the standard model, writing hOCR output. 
> (b) He then breaks the hOCR output down into box, tif, and text files.
> (c) He then compares the text files with the images and manually corrects 
> the faulty ones. 
> This one also requires a lot of manual work because the standard model 
> will miss a lot of characters. 
>
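I have not looked at how Shree's script does the splitting step, so the following is only a guess at the idea, not his code: read the hOCR file and pull out the bounding box and text of every line, which is what you would need to crop line images and write matching text files. It assumes lines are marked with class `ocr_line` and carry a `bbox x0 y0 x1 y1` entry in the `title` attribute; other line classes are ignored. The sample markup is hand-written for illustration.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrLineParser(HTMLParser):
    """Collect (bbox, text) for every span with class 'ocr_line'."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished (bbox, text) pairs
        self._depth = 0    # span nesting depth inside the current line
        self._bbox = None
        self._words = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth:
            # Already inside a line: track nested spans so we know when it ends.
            if tag == "span":
                self._depth += 1
        elif tag == "span" and "ocr_line" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            self._bbox = tuple(map(int, match.groups())) if match else None
            self._depth = 1
            self._words = []

    def handle_endtag(self, tag):
        if self._depth and tag == "span":
            self._depth -= 1
            if self._depth == 0:
                self.lines.append((self._bbox, " ".join(self._words)))

    def handle_data(self, data):
        if self._depth and data.strip():
            self._words.append(data.strip())


def hocr_lines(hocr_text):
    """Return a list of ((x0, y0, x1, y1), text) tuples, one per ocr_line."""
    parser = HocrLineParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.lines


# Hand-written sample in the general shape of hOCR output (illustrative only).
SAMPLE = """
<div class='ocr_page' title='image "plate.tif"; bbox 0 0 400 100'>
 <span class='ocr_line' id='line_1_1' title="bbox 10 20 390 80; baseline 0 -5">
  <span class='ocrx_word' id='word_1_1' title='bbox 10 20 120 80; x_wconf 91'>AB</span>
  <span class='ocrx_word' id='word_1_2' title='bbox 140 20 390 80; x_wconf 88'>1234</span>
 </span>
</div>
"""
```

Calling `hocr_lines(SAMPLE)` gives one entry holding the line box and the joined word text. The manual correction step still has to happen afterwards.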
> 3. Alternatively, you can try other OCR engines such as *EasyOCR*. Some 
> people say EasyOCR is better at OCRing those kinds of images, while Tesseract 
> is better for scanned documents. 
>
> On Wednesday, November 1, 2023 at 3:57:48 PM UTC+3 [email protected] 
> wrote:
>
>> Thank you for your responses. Regarding my question and referring to the 
>> official documentation at  
>> https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html , the 
>> generated .box files for LSTM-based training have the *same coordinates* for 
>> every character because they use line-level boxes instead of 
>> character-level boxes.
>> Also, I have a couple of concerns:
>> 1) I'm working on license plate recognition and have 80K car plate images 
>> with noise. Most of the .box files generated by lstmbox are incorrect 
>> compared with ground truth text. Manually editing all these box files will 
>> be very time-consuming. Do you have any suggestions to shorten the time?
>> 2) Do I need to manually check all 80K box files to ensure the accuracy 
>> of my training data?
>>
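Since the LSTM box files repeat one line-level box for every character, one option for single-line plates with known ground truth is to write the box file from the ground-truth text and the image size, instead of editing the lstmbox output by hand. This is a minimal sketch under assumptions: one code point per box row, the whole image treated as the line box, and a tab row at the end as the end-of-line marker. The exact coordinates expected on the tab row are an assumption on my part, so compare against a box file that trains correctly before using this on all 80K images. The function names are made up.

```python
def line_box_text(ground_truth, width, height):
    """Return box-file content with the same line-level box for every character."""
    rows = [f"{char} 0 0 {width} {height} 0" for char in ground_truth.strip()]
    # A row starting with a tab marks the end of the text line.
    rows.append(f"\t 0 0 {width} {height} 0")
    return "\n".join(rows) + "\n"


def box_file_text(box_content):
    """Recover the text a box file encodes, assuming one code point per row."""
    chars = []
    for row in box_content.splitlines():
        if row and row[0] != "\t":
            chars.append(row[0])
    return "".join(chars)


def needs_review(box_content, ground_truth):
    """True when the box file text differs from the ground truth."""
    return box_file_text(box_content) != ground_truth.strip()
```

With `needs_review` you can run over the existing lstmbox output and look only at the files that disagree with the ground truth, rather than opening all of them.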
>> On Wednesday, November 1, 2023 at 9:21:36 PM UTC+9 [email protected] 
>> wrote:
>>
>>> "Please note that box files generated using makebox config file are OK 
>>> for training legacy models but not for LSTM training." makebox is a 
>>> config file shipped with Tesseract for generating box files. It looks like 
>>> it was meant for the legacy model. For the current model, text2image is 
>>> the way to do it.  
>>>
>>> On Wednesday, November 1, 2023 at 3:02:28 PM UTC+3 Des Bw wrote:
>>>
>>>>
>>>> I don't know what you are trying to do. I am not familiar with this 
>>>> method of box generation. But, I think the command you are running is 
>>>> supposed to generate them with the same coordinates. Look at the example 
>>>> here:  
>>>> https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html
>>>>
>>>>
>>>> On Wednesday, November 1, 2023 at 2:57:46 PM UTC+3 [email protected] 
>>>> wrote:
>>>>
>>>>> On 1 Nov 2023 at 11:51:27 AM, TRAN TRONG KHANH[학생](대학원 컴퓨터공학과) ‍ <
>>>>> [email protected]> wrote:
>>>>>
>>>>>>
>>>>> Are you trying to generate box files from the images (tif files)?
>>>>>
>>>>
