Re: [tesseract-ocr] Re: lstmeval shows good result but visualized result looks bad

Shree Devi Kumar Mon, 17 Jun 2019 09:33:58 -0700

I don't think you need training to improve results.

You need to pre-process the image first: straighten (deskew) it, use a
separate tool to identify each cell of data, and then OCR each cell on its
own. That should give you the best results.
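
To make the "identify each cell, then OCR that" step concrete, here is a minimal sketch in plain Python. It is not the tool Shree had in mind, which the message does not name. It assumes the page is already binarized and deskewed, and that the table's ruling lines run across the whole image. The helper names (`line_positions`, `gaps`, `find_cells`, `tesseract_command`) are hypothetical, and `eng` is only a placeholder for whatever model is actually used.

```python
from typing import List, Tuple

Grid = List[List[int]]        # 1 = dark (ink) pixel, 0 = background
Span = Tuple[int, int]        # (start, end), end-exclusive


def line_positions(profile: List[int], full: int, ratio: float = 0.9) -> List[Span]:
    """Return runs of indices whose dark-pixel count is at least ratio * full.

    A row (or column) that is almost entirely dark is treated as a ruling line.
    """
    runs: List[Span] = []
    start = None
    for i, count in enumerate(profile):
        is_line = count >= ratio * full
        if is_line and start is None:
            start = i
        elif not is_line and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def gaps(runs: List[Span]) -> List[Span]:
    """Return the spans lying between consecutive ruling lines."""
    return [(a_end, b_start)
            for (_, a_end), (b_start, _) in zip(runs, runs[1:])
            if b_start > a_end]


def find_cells(img: Grid) -> List[Tuple[int, int, int, int]]:
    """Return cell boxes as (top, left, bottom, right), end-exclusive, row by row."""
    height, width = len(img), len(img[0])
    row_profile = [sum(row) for row in img]          # dark pixels per row
    col_profile = [sum(col) for col in zip(*img)]    # dark pixels per column
    h_lines = line_positions(row_profile, width)
    v_lines = line_positions(col_profile, height)
    return [(top, left, bottom, right)
            for top, bottom in gaps(h_lines)
            for left, right in gaps(v_lines)]


def tesseract_command(cell_image_path: str, lang: str = "eng") -> List[str]:
    """Build a command that OCRs one cropped cell as a single text line (--psm 7).

    The command is only built here, not run; pass it to subprocess.run if wanted.
    """
    return ["tesseract", cell_image_path, "stdout", "--psm", "7", "-l", lang]
```

Each box returned by `find_cells` would be cropped out of the page, saved as its own image, and passed to the command from `tesseract_command`. A real scan will usually need a more tolerant line detector than this threshold on row and column sums, since ruling lines there are rarely perfectly straight or unbroken.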


On Mon, Jun 17, 2019 at 6:07 PM [email protected] <[email protected]>
wrote:

> Thanks, Shree, for your reply. I can see you are very busy answering a lot
> of questions here, so thanks again for taking the time to help me.
>>
>> Your files have prefix of jpn, so I assume you are training for Japanese,
>> but the image in question has only numbers in it.
>>
> I forgot to mention that my model only needs to recognize digits, not all
> Japanese characters. I used the jpn prefix only because I am working with
> Japanese documents.
> Anyway, from your answer I understand that I am most likely dealing with
> an overfitting problem, not a problem with how I convert the checkpoint
> file to a .traineddata file. Am I right? If so, I guess the first thing I
> should try is to fine-tune your digits model (which I found you shared on
> GitHub at https://github.com/Shreeshrii/tessdata_shreetest). Correct me if
> I am wrong.
>
> By the way, I have two more questions:
> 1. About how I generate the training data: since I could not find the
> right font for my document, I cropped digit images from the data I have
> and randomly picked cropped digits to compose each training image. Do you
> think this is the right way to do data augmentation?
> 2. I generated 2000 samples for training. Is that enough?
>
> On Mon, Jun 17, 2019 at 5:19 PM shree <[email protected]> wrote:
>
>> Your files have prefix of jpn, so I assume you are training for Japanese,
>> but the image in question has only numbers in it.
>>
>> Getting good results on eval data but bad results on OCR could be the
>> result of overfitting the model, if you have used a small sample and
>> trained for a large number of iterations.
>>
>>
>> On Friday, June 14, 2019 at 8:35:40 AM UTC+5:30, Phuc wrote:
>>>
>>> Hi
>>> I am training a model using Tesseract's lstmtraining and am confused
>>> about the results I get. I wonder whether I did anything wrong in the
>>> steps below:
>>>
>>>    - I created the training data (.box and .tif) following
>>>    https://github.com/tesseract-ocr/tesseract/issues/2357. Note that each
>>>    (.box, .tif) pair includes multiple text lines.
>>>    - I ran the training process using https://github.com/OCR-D/ocrd-train.
>>>    Since I already have the .box files, I simply commented out the line
>>>    with `generate_line_box.py` inside the Makefile.
>>>    - After training, I used lstmeval to evaluate the model on an
>>>    evaluation dataset and got an error rate that is not so bad.
>>>
>>> [image: 図1.png]
>>>
>>>
>>>    - But when I take the exact same image from the evaluation dataset and
>>>    run prediction using the .traineddata file, the result seems to be
>>>    totally different.
>>>
>>> I have also attached some of my training data files and the visualized
>>> result in case anyone wants to take a look.
>>>
>>> I would appreciate it if someone could tell me what I did wrong.
>>>
>>> Thanks
>>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/a6090eb0-6803-4242-b2e9-9cf27ca65126%40googlegroups.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/a6090eb0-6803-4242-b2e9-9cf27ca65126%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>


-- 

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

