Re: [tesseract-ocr] Re: OCR-D training process - High error rate [Tess 4]

Lorenzo Bolzani Wed, 04 Jul 2018 09:40:05 -0700

I suspect 1800 lines are not enough data for training from scratch and
you are simply overfitting. I think the 5% refers to the evaluation set,
with a default 80/20 train/eval split, if I remember correctly.


Try this to check the accuracy on the training set and the eval set:

lstmeval --model your-model.traineddata --eval_listfile data/list.train
lstmeval --model your-model.traineddata --eval_listfile data/list.eval

If the error rate on the training set is much lower than on the eval set,
like 0.1% or even 2%, you are overfitting: too little data and/or a model
too large.
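
A minimal sketch of that comparison, assuming you copy the two char error
rates (in percent) by hand from the lstmeval output. The check_overfit
helper and its default gap of 5 percentage points are made up for
illustration; they are not part of Tesseract.

```shell
#!/bin/sh
# Hypothetical helper, not a Tesseract tool.
# Usage: check_overfit TRAIN_ERR EVAL_ERR [MAX_GAP]
# All values are char error rates in percent, read off the lstmeval output.
check_overfit() {
    train_err=$1
    eval_err=$2
    max_gap=${3:-5}   # arbitrary default threshold, tune to taste
    # awk does the floating-point arithmetic that plain sh lacks
    awk -v t="$train_err" -v e="$eval_err" -v g="$max_gap" 'BEGIN {
        if (e - t > g) print "likely overfitting"
        else print "no obvious overfitting"
    }'
}
```

For example, check_overfit 0.1 60 reports likely overfitting, while
check_overfit 4.8 5.0 does not.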

If so, add more varied data (I guess at least 10 times as much) and try
some augmentation, though I think you already do.



Lorenzo


2018-07-04 18:13 GMT+02:00 Joe <[email protected]>:

> Thank you for your answer, Lorenzo!
>
> I was following the sample data provided by ocr-d and I realized every
> tiff in ocrd-testset.zip has no left or right white border. That's why my
> tiffs are the same way.
> Anyway, I'll give it a try with some space and with non-binarized data.
>
> I'm training from scratch and I used the 10000 iterations given by default
> by ocr-d (then I tried with 20K/30K but only with slightly better
> results).  The training process takes about 2-3 hours to complete (4-5h
> with 20K iterations).
>
> This is the best result I got:
>
>
> <https://lh3.googleusercontent.com/-LpN72wYMGOo/WzzxEMcwkjI/AAAAAAAAAHY/GQ7kUm3ekV8PptNwyNh6ObNQe_SsiKqNgCLcBGAs/s1600/tess4lstmEx.JPG>
>
> After that with more iterations the char train value remains almost the
> same and sometimes it ends up bigger.
>
> The thread you commented about only refers to fine tuning, so I'd probably
> use it later. Thank you once again!
>
>
> On Wednesday, 4 July 2018 at 12:33:41 UTC-3, Lorenzo Blz wrote:
>>
>>
>> I had no problems training with the ocr-d boxes. Looking at the tiffs the
>> first thing I'd try to do is adding some white border on left and right.
>>
>> For my training I used non-binarized (grayscale) data and I think it
>> could be better (more information is available).
>>
>> Are you training from scratch or fine-tuning a model? How many epochs did
>> you do? How long did it run? Maybe you just need to wait more.
>>
>> Please, have a look at this thread too:
>>
>> https://groups.google.com/forum/#!topic/tesseract-ocr/be4-rjvY2tQ
>>
>>
>> Bye
>>
>> Lorenzo
>>
>>
>> 2018-07-04 17:03 GMT+02:00 Joe <[email protected]>:
>>
>>> I forgot to mention:
>>> The *.box files created by OCR-D are not in the same format as described
>>> in https://github.com/tesseract-ocr/tesseract/wiki/Making-Box-Files---4.0
>>> I know Tesseract 4 boxes only need to cover a text line instead of
>>> individual chars, but in the example given in that link every character box
>>> value is different, while in the *.box files created by OCR-D they all have
>>> the same values.
>>>
>>> Is that a problem?
>>>
>>>
>>> On Wednesday, 4 July 2018 at 11:50:54 UTC-3, Joe wrote:
>>>>
>>>> Hi everybody!
>>>>
>>>> I'm trying this tool https://github.com/OCR-D/ocrd-train/ but without
>>>> success so far. Tesseract and Leptonica are installed by the scripts.
>>>> Inspired by the test set provided in that repo, I created pairs of
>>>> [*.tif, *.gt.txt] with binarized chars and TTF's from two fonts (1869 text
>>>> lines in total).
>>>> You can see an example of my set in the attachment, which also contains
>>>> files created by the training process.
>>>>
>>>> My guess is that something is wrong with my data.
>>>> Sometimes I can see the char train value increasing instead of
>>>> decreasing, and the final error rate is still too high (about 60%).
>>>>
>>>> That new training process with LSTM is driving me crazy!
>>>> I would appreciate it if anyone with experience could take a look at my
>>>> data set.
>>>>
>>>>
>>>> Joe.
>>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit https://groups.google.com/d/ms
>>> gid/tesseract-ocr/601364b4-3ebd-4a04-9f6a-3d418ab728ab%40goo
>>> glegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/601364b4-3ebd-4a04-9f6a-3d418ab728ab%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>

