Re: [tesseract-ocr] LSTM-based training produces .box files with the same coordinates

Des Bw Wed, 01 Nov 2023 06:22:01 -0700

To clarify, Shree's script is useful if your images contain more than one 
line. If they are all single-line, that script won't do much for you. 
On Wednesday, November 1, 2023 at 4:20:09 PM UTC+3 Des Bw wrote:


>  
> *1. Using synthetic data:*
> What can you do if you do not have data that is confirmed to be accurate?
> The only way around that I know of is to use synthetic data. That is, you 
> generate the images from the texts using the text2image tool, then train 
> on those. The accuracy of the resulting model is not going to be 
> perfect because the actual data is messier than the synthetic data. But 
> you can try different methods to get better accuracy: 
> (a) fine-tuning from an existing network: that is, you can cut the top 
> layer off a working model and train from that one. 
> (b) configuring text2image to add noise to the synthetic data so that 
> it looks more like the actual images. 
> (c) using a larger dataset
> etc.
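
For 1(b), a rough Python sketch of how the text2image calls could be put together, one per exposure level. The flag names (--text, --outputbase, --font, --fonts_dir, --exposure) are the ones I know from text2image, but check them against `text2image --help` on your build. The file names, the font name and the two helper functions are made up for illustration, and nothing is executed here: the commands are only built and printed.

```python
import shlex


def build_text2image_cmd(text_file, output_base, font, fonts_dir, exposure=0):
    """Return the argument list for one text2image run (hypothetical helper)."""
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={output_base}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
        # Exposure changes how dark or light the rendering is; varying it
        # is one way to make the synthetic images less uniform.
        f"--exposure={exposure}",
    ]


def exposure_variants(text_file, output_base, font, fonts_dir, exposures=(-1, 0, 1)):
    """Build one command per exposure level, each with its own output base."""
    return [
        build_text2image_cmd(text_file, f"{output_base}.exp{e}", font, fonts_dir, e)
        for e in exposures
    ]


if __name__ == "__main__":
    # All paths and the font name below are placeholders.
    for cmd in exposure_variants("plates.txt", "out/plates", "Arial", "/usr/share/fonts"):
        print(shlex.join(cmd))
```

To actually run them you would pass each list to subprocess.run(cmd, check=True) on a machine where the tesseract training tools are installed.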
>
> *2) The hOCR hack:*
> - I haven't tried this method myself. But I read on GitHub that Shree has 
> some kind of hack (script) that uses tesseract's hOCR output:
> https://github.com/tesseract-ocr/tesstrain/issues/7
> a) First, OCR the images with the standard model, writing hOCR output. 
> b) Then break the hOCR output down into box, tif, and text files. 
> c) Then compare the text files with the images and manually correct the 
> faulty ones. 
> This one also requires a lot of manual work because the standard model 
> will miss a lot of characters. 
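
For steps (a) and (b), a minimal sketch of pulling line boxes and text out of hOCR with only the Python standard library. This is not Shree's script. It assumes the hOCR came from something like `tesseract image.tif out hocr`, that each text line is a span with class ocr_line, and that its title attribute contains a `bbox x0 y0 x1 y1` field. HocrLineParser and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class HocrLineParser(HTMLParser):
    """Collect (bbox, text) pairs for every ocr_line span in an hOCR document."""

    def __init__(self):
        super().__init__()
        self.lines = []    # list of ((x0, y0, x1, y1), text)
        self._depth = 0    # span nesting depth inside the current line, 0 = outside
        self._bbox = None
        self._words = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        if self._depth:
            # A nested span (usually a word) inside the current line.
            self._depth += 1
        elif "ocr_line" in (attrs.get("class") or "").split():
            self._depth = 1
            self._bbox = self._parse_bbox(attrs.get("title") or "")
            self._words = []

    def handle_data(self, data):
        if self._depth and data.strip():
            self._words.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.lines.append((self._bbox, " ".join(self._words)))

    @staticmethod
    def _parse_bbox(title):
        # title looks like "bbox 10 20 300 60; baseline 0 -5; ..."
        for field in title.split(";"):
            parts = field.split()
            if len(parts) == 5 and parts[0] == "bbox":
                return tuple(int(p) for p in parts[1:])
        return None


if __name__ == "__main__":
    sample = (
        "<span class='ocr_line' title='bbox 10 20 300 60; baseline 0 -5'>"
        "<span class='ocrx_word' title='bbox 10 20 120 60'>AB</span> "
        "<span class='ocrx_word' title='bbox 130 20 300 60'>1234</span>"
        "</span>"
    )
    parser = HocrLineParser()
    parser.feed(sample)
    print(parser.lines)
```

Each (bbox, text) pair could then be used to crop the line out of the page image and write the recognized text next to it for manual correction, which is step (c).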
>
> 3) Alternatively, you can try other OCR engines such as *EasyOCR*. Some 
> people say EasyOCR is better at reading those kinds of images, while 
> tesseract is better for scanned docs. 
>
> On Wednesday, November 1, 2023 at 3:57:48 PM UTC+3 [email protected] 
> wrote:
>
>> Thank you for your responses. Regarding my question and referring to the 
>> official documentation at  
>> https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html , the 
>> generated .box files for LSTM-based training have the *same coordinates* for 
>> every character because they use line-level boxes instead of 
>> character-level boxes.
>> Also, I have a couple of concerns:
>> 1) I'm working on license plate recognition and have 80K car plate images 
>> with noise. Most of the .box files generated by lstmbox are incorrect 
>> compared with ground truth text. Manually editing all these box files will 
>> be very time-consuming. Do you have any suggestions to shorten the time?
>> 2) Do I need to manually check all 80K box files to ensure the accuracy 
>> of my training data?
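
On the line-level boxes: if the ground truth text for each single-line image is already known, the box content can in principle be written from the text and the image size, without running lstmbox at all. Below is a minimal sketch of that idea. The layout (every character gets the box of the whole line, followed by a tab entry that marks the end of the line) is how I remember the line-box helper in tesstrain doing it, so treat the exact end-of-line coordinates as an assumption and compare against a box file that is known to train correctly. line_box_text is a hypothetical name, and combining characters are not handled.

```python
def line_box_text(ground_truth, width, height):
    """Return line-level box file content for one single-line image.

    Every character shares the same box (the full image), which is why the
    coordinates in LSTM box files look identical from row to row.
    """
    rows = [f"{ch} 0 0 {width} {height} 0" for ch in ground_truth]
    # Assumed end-of-line marker: a tab "character" with a tiny box placed
    # just past the corner of the line box.
    rows.append(f"\t {width} {height} {width + 1} {height + 1} 0")
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    # Hypothetical plate text and image size in pixels.
    print(line_box_text("AB1", 200, 50), end="")
```

With 80K images this would be a loop over (image, ground truth) pairs, reading each image size with whatever imaging library is at hand.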
>>
>> On Wednesday, November 1, 2023 at 9:21:36 PM UTC+9 [email protected] 
>> wrote:
>>
>>> "Please note that box files generated using makebox config file are OK 
>>> for training legacy models but not for LSTM training.". Makebox is the 
>>> tool included inside tesseract to generate box files. It looks like that 
>>> was used for the legacy model. For the current model, text2image is the way 
>>> to do it.  
>>>
>>> On Wednesday, November 1, 2023 at 3:02:28 PM UTC+3 Des Bw wrote:
>>>
>>>>
>>>> I don't know what you are trying to do. I am not familiar with this 
>>>> method of box generation. But, I think the command you are running is 
>>>> supposed to generate them with the same coordinates. Look at the example 
>>>> here:  
>>>> https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html
>>>>
>>>>
>>>> On Wednesday, November 1, 2023 at 2:57:46 PM UTC+3 [email protected] 
>>>> wrote:
>>>>
>>>>> On 1 Nov 2023 at 11:51:27 AM, TRAN TRONG KHANH[학생](대학원 컴퓨터공학과) ‍ <
>>>>> [email protected]> wrote:
>>>>>
>>>>>>
>>>>> Are you trying to generate box files from the images (tif files)?
>>>>>
>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/d19af743-d9f6-4a0c-8f13-9a78ba922fd7n%40googlegroups.com.
