Re: [tesseract-ocr] Re: OCR-D training process - High error rate [Tess 4]

Joe Sat, 07 Jul 2018 07:36:01 -0700


Hi, Lorenzo!


Thank you for your tips!


When I run those check commands I get this:


<https://lh3.googleusercontent.com/-Soz3vfA1HVc/W0DO5ya_HPI/AAAAAAAAAH8/3sJ-_tf0eWslqt9BxHXmRIFqIZYagMr1ACLcBGAs/s1600/tess4eval.JPG>

I'm gathering more data and as soon as I get any result I will share it 
here.

Have a nice weekend!
Joe.

On Wednesday, 4 July 2018 at 13:39:41 UTC-3, Lorenzo Blz wrote:
>
>
> I suspect 1800 lines may not be enough data for training from scratch and 
> you are simply overfitting. I think the 5% refers to the evaluation set; the 
> default split is 80/20, I think.
>
> Try this to check the accuracy on the training set and the eval set:
>
> lstmeval --model your-model.traineddata --eval_listfile data/list.train
> lstmeval --model your-model.traineddata --eval_listfile data/list.eval
>
> If the error rate on the training set is much lower than on the eval set, 
> like 0.1% or even 2%, you are overfitting: too little data and/or a model 
> that is too large.
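
The train/eval comparison described above can be sketched in Python. This is only an illustration: the function name `overfit_gap`, the threshold of 5 percentage points, and the sample numbers are assumptions, not values produced by lstmeval. The two error rates are meant to be read by hand from the two lstmeval runs.

```python
def overfit_gap(train_err, eval_err, threshold=5.0):
    """Compare character error rates (in percent) from the two lstmeval runs.

    Returns (gap, suspicious): gap is eval error minus train error, and
    suspicious is True when the gap exceeds the (arbitrary) threshold.
    """
    if train_err < 0 or eval_err < 0:
        raise ValueError("error rates must be non-negative")
    gap = eval_err - train_err
    return gap, gap > threshold


# Hypothetical numbers, typed in by hand after running lstmeval twice.
gap, suspicious = overfit_gap(train_err=2.0, eval_err=60.0)
print(f"gap = {gap:.1f} points, overfitting suspected: {suspicious}")
```

A small gap with both rates high would point away from overfitting and toward a data or setup problem instead.
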
>
> If so, you could add more, and more varied, data (I guess at least 10 times 
> as much), and also try some augmentation, though I think you already do.
>
>
>
> Lorenzo
>
>
> 2018-07-04 18:13 GMT+02:00 Joe <[email protected]>:
>
>> Thank you for your answer, Lorenzo!
>>
>> I was following the sample data provided by ocr-d and I realized every 
>> tiff in ocrd-testset.zip has no left or right white border. That's why my 
>> tiffs are the same way.
>> Anyway, I'll give it a try with some space and with non-binarized data.
>>
>> I'm training from scratch and I used the 10000 iterations given by 
>> default by ocr-d (then I tried with 20K/30K but only with slightly better 
>> results).  The training process takes about 2-3 hours to complete (4-5h 
>> with 20K iterations).
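
To relate these iteration counts to the "how many epochs" question, here is a rough back-of-the-envelope sketch. It assumes that each training iteration consumes one text line and that 80% of the 1869 lines end up in the training list. Both are assumptions (the split ratio is only a guess in this thread), and the helper name `approx_epochs` is made up, so treat the result as an order-of-magnitude estimate.

```python
def approx_epochs(iterations, total_lines, train_fraction=0.8):
    """Estimate passes over the training set, assuming one line per iteration."""
    train_lines = int(total_lines * train_fraction)
    if train_lines <= 0:
        raise ValueError("no training lines")
    return iterations / train_lines


for iters in (10000, 20000, 30000):
    print(iters, "iterations ~", round(approx_epochs(iters, 1869), 1), "epochs")
```
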
>>
>> This is the best result I got:
>>
>>
>> <https://lh3.googleusercontent.com/-LpN72wYMGOo/WzzxEMcwkjI/AAAAAAAAAHY/GQ7kUm3ekV8PptNwyNh6ObNQe_SsiKqNgCLcBGAs/s1600/tess4lstmEx.JPG>
>>
>> After that, with more iterations, the char train value stays almost the 
>> same and sometimes ends up higher.
>>
>> The thread you mentioned only covers fine-tuning, so I'll probably use it 
>> later. Thank you once again!
>>
>>
>> On Wednesday, 4 July 2018 at 12:33:41 UTC-3, Lorenzo Blz wrote:
>>>
>>>
>>> I had no problems training with the ocr-d boxes. Looking at the tiffs 
>>> the first thing I'd try to do is adding some white border on left and right.
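
A minimal sketch of the padding idea, with the image modeled as a list of rows of grayscale values (255 = white). The function name `pad_left_right` and the pad width are illustrative assumptions. A real pipeline would do the same thing with an imaging library on the actual TIFFs.

```python
WHITE = 255


def pad_left_right(rows, pad=10, fill=WHITE):
    """Return a copy of a grayscale image with `pad` fill-colored columns on each side."""
    if pad < 0:
        raise ValueError("pad must be non-negative")
    margin = [fill] * pad
    return [margin + list(row) + margin for row in rows]


# Tiny 2x3 "text line" whose ink touches both edges.
line = [[0, 255, 0],
        [0, 0, 0]]
padded = pad_left_right(line, pad=2)
print(len(padded[0]))  # width grows from 3 to 7
```
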
>>>
>>> For my training I used non-binarized (grayscale) data and I think it 
>>> could be better (more information is available).
>>>
>>> Are you training from scratch or fine-tuning a model? How many epochs 
>>> did you do? How long did it run? Maybe you just need to wait longer.
>>>
>>> Please, have a look at this thread too:
>>>
>>> https://groups.google.com/forum/#!topic/tesseract-ocr/be4-rjvY2tQ
>>>
>>>
>>> Bye
>>>
>>> Lorenzo
>>>
>>>
>>> 2018-07-04 17:03 GMT+02:00 Joe <[email protected]>:
>>>
>>>> I forgot to mention:
>>>> The *.box files created by OCR-D are not in the same format as 
>>>> described in 
>>>> https://github.com/tesseract-ocr/tesseract/wiki/Making-Box-Files---4.0
>>>> I know Tesseract 4 boxes only need to cover a text line instead of 
>>>> individual chars, but in the example given in that link every character 
>>>> box has different values, while in the *.box files created by OCR-D they 
>>>> all have the same values.
>>>>
>>>> Is that a problem?
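
For illustration, this is roughly what a line-level box file of the kind described above looks like when built by hand. The sketch assumes the usual box layout `<symbol> <left> <bottom> <right> <top> <page>`, gives every character the bounding box of the whole line, and closes the line with a tab entry. The helper name `line_boxes` and the exact coordinates of the tab entry are assumptions, not a copy of the OCR-D script.

```python
def line_boxes(text, width, height, page=0):
    """Build box-file lines where every character shares the whole-line box."""
    boxes = [f"{ch} 0 0 {width} {height} {page}" for ch in text]
    # End-of-line marker: a tab symbol placed just past the line (assumed layout).
    boxes.append(f"\t {width} {height} {width + 1} {height + 1} {page}")
    return boxes


for entry in line_boxes("Tess", width=120, height=32):
    print(entry)
```

Identical coordinates on every character are therefore expected in this style of box file: the box only tells the trainer where the line is, not where each glyph sits.
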
>>>>
>>>>
>>>> On Wednesday, 4 July 2018 at 11:50:54 UTC-3, Joe wrote:
>>>>>
>>>>> Hi everybody!
>>>>>
>>>>> I'm trying this tool https://github.com/OCR-D/ocrd-train/ but without 
>>>>> success so far. Tesseract and Leptonica are installed by the scripts.
>>>>> Inspired by the test set provided in that repo, I created pairs of 
>>>>> [*.tif, *.gt.txt] with binarized chars and TTFs from two fonts (1869 
>>>>> text lines in total).
>>>>> You can see an example of my set in the attachment, which also contains 
>>>>> files created by the training process.
>>>>>
>>>>> My guess is that something is wrong with my data.
>>>>> Sometimes I can see the char train value increasing instead of 
>>>>> decreasing, and the final error rate is still too high (about 60%).
>>>>>
>>>>> That new training process with LSTM is driving me crazy!
>>>>> I would appreciate it if anyone with experience could take a look at 
>>>>> my data set.
>>>>>
>>>>>
>>>>> Joe.
>>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/601364b4-3ebd-4a04-9f6a-3d418ab728ab%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/601364b4-3ebd-4a04-9f6a-3d418ab728ab%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>
>
>

