Re: [tesseract-ocr] Not getting results with numbers and currency simbols in tables

Lorenzo Bolzani Thu, 26 Jul 2018 03:46:48 -0700

First, read this: "Fine Tuning for ± a few characters"
<https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters>



Then check the data/unicharset file to make sure it contains all the
characters you want.
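
A few lines of Python can run that check. This is only a sketch: it
assumes the usual unicharset layout (first line is the entry count, then one
entry per line whose first space-separated field is the character, with the
token NULL standing in for the space). The SAMPLE text uses abbreviated,
made-up property columns, and missing_chars is a hypothetical helper name.

```python
def missing_chars(unicharset_text, required):
    """Return the characters from `required` that have no unicharset entry."""
    lines = unicharset_text.splitlines()
    # Skip the first line (entry count); the first field of each entry is the unichar.
    entries = {line.split(" ", 1)[0] for line in lines[1:] if line.strip()}
    # Assumption: the space character is stored as the literal token NULL.
    if "NULL" in entries:
        entries.add(" ")
    return sorted(set(required) - entries)


# Abbreviated, illustrative sample; a real file has more columns per entry.
SAMPLE = "4\nNULL 0 Common 0\n0 8 Common 1\n. 10 Common 2\nC 5 Latin 3\n"

# For a real run, read the file instead:
#   text = open("data/unicharset", encoding="utf-8").read()
print(missing_chars(SAMPLE, "0.C¥"))
```

If the yen sign or any digit shows up in the returned list, the fine-tuned
model has no way to output it, whatever the number of iterations.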


Then, 15000 iterations are way too many and 300 samples are really too few.
If you train too much you'll get worse results.

I usually get the best fine tuning results from 400 to 2000 iterations. I
can do more, up to 20k iterations, only when I have many sample images: a
few thousand with multiple words.


I do it like this (this is not a complete guide, just to give you the
general idea):

- clean the data and data/checkpoints folders (do NOT add -rf, you do not
want to wipe out the training data):

rm data/*

rm data/checkpoints/*


(do this only once, when you start a new training session, not after each
training step)

- go into the Makefile and fix this (in the "data/list.eval" block, remove
the + before $$no):


     tail -n "$$no" $(ALL_LSTMF) > "$@"


then add somewhere at the top:

ITERATIONS=100

and change the max_iterations line to this (do not change the tabs/spaces
at the beginning, just replace the number):

--max_iterations $(ITERATIONS)

- now run the training as normal like this:

make training ITERATIONS=100

- when it finishes run this:

lstmeval --model data/YOUR_MODEL.traineddata --eval_listfile data/list.eval

In the last line you'll get something like this:

At iteration 0, stage 0, Eval Char error rate=0.96153846, Word error rate=3.8461538

These are the only values that matter. Take note of these values and the
iteration numbers.

Make a backup of the model:

cp data/YOUR_MODEL.traineddata data/YOUR_MODEL.traineddata_100

- Now start the training again with ITERATIONS=200; it will resume from the
previous iteration and continue up to 200:

make training ITERATIONS=200

- Run lstmeval again, note the error rates, back up the model, and repeat
with ITERATIONS=300, 400, 500, and so on.
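
The train / evaluate / backup rounds above can be scripted. The sketch below
only builds and prints the commands (a dry run); the commands themselves are
the ones from this message, while plan_round, plan and the start/stop/step
defaults are hypothetical names and values chosen for illustration.

```python
import shlex

MODEL = "data/YOUR_MODEL.traineddata"  # placeholder name, replace with your model


def plan_round(iterations, model=MODEL):
    """Return the three commands for one train / evaluate / backup round."""
    return [
        ["make", "training", f"ITERATIONS={iterations}"],
        ["lstmeval", "--model", model, "--eval_listfile", "data/list.eval"],
        ["cp", model, f"{model}_{iterations}"],
    ]


def plan(start=100, stop=500, step=100):
    """Commands for every round, from `start` to `stop` iterations inclusive."""
    return [cmd for n in range(start, stop + 1, step) for cmd in plan_round(n)]


# Dry run: print the commands instead of executing them.
for cmd in plan(100, 300):
    print(shlex.join(cmd))
```

To actually execute a round, each command list could be passed to
subprocess.run(cmd, check=True), keeping the lstmeval output for later
comparison.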

You should see that the error rate will go down for a while then it will
slow down and then will start to get worse. Use the model where you got the
best score.
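
Picking the best backup can also be automated. A minimal sketch, assuming the
lstmeval output ends with a line in the format shown above; parse_eval and
best_checkpoint are hypothetical helpers, and the numbers in the example
dictionary are made up.

```python
import re

# Tolerates the line being wrapped between "Word error" and "rate=".
EVAL_RE = re.compile(
    r"Eval Char error rate=([0-9.]+),\s+Word\s+error\s+rate=([0-9.]+)"
)


def parse_eval(output):
    """Return (char_error, word_error) from the last matching line of the output."""
    matches = EVAL_RE.findall(output)
    if not matches:
        raise ValueError("no 'Eval Char error rate' line found")
    char_err, word_err = matches[-1]
    return float(char_err), float(word_err)


def best_checkpoint(results):
    """results maps iterations -> (char_error, word_error).

    The lowest char error wins; ties are broken by the word error.
    """
    return min(results, key=lambda n: results[n])


line = ("At iteration 0, stage 0, Eval Char error rate=0.96153846, "
        "Word error rate=3.8461538")
print(parse_eval(line))

# Made-up error rates, only to show the selection step.
example = {100: (2.0, 8.0), 200: (1.0, 4.0), 300: (1.5, 6.0)}
print(best_checkpoint(example))
```

With the made-up numbers above, the backup to keep would be the one saved
after 200 iterations.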

You can try this, but 300 samples are likely way too few for this to be
meaningful.

I'm attaching my training scripts. They should work, but double check
everything.


About thresholding: you probably do not need it. Just increase the contrast
a little and do not go binary; you probably do not even need that. Whatever
you do, apply the same processing to the training data that you will apply to
your real data.

Two important things, for both training and recognition: use PSM=13
(PSM.RAW_LINE), and trim all the white borders, then upscale the image so
that the text is 30-50 pixels tall.

Again, train with the same processing you'll use for recognition.
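
As a rough illustration of the trimming and scaling step, here is a NumPy
sketch. It assumes a single-line cell image already loaded as a 2-D grayscale
array with a near-white background, and it treats the trimmed image height as
the text height. The threshold of 250, the target of 40 pixels and the
function names are assumptions, not values from my scripts.

```python
import numpy as np


def trim_white_borders(gray, white_threshold=250):
    """Crop a 2-D grayscale array to the bounding box of its non-white pixels."""
    ink = gray < white_threshold
    rows = np.flatnonzero(ink.any(axis=1))
    cols = np.flatnonzero(ink.any(axis=0))
    if rows.size == 0:
        return gray  # blank image: nothing to trim
    return gray[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]


def upscale_factor(text_height, target_height=40):
    """Factor that brings the text to target_height pixels (assumption: never shrink)."""
    return max(1.0, target_height / text_height)


# Synthetic example: a white 20x60 image with a dark 10x40 block of "text".
img = np.full((20, 60), 255, dtype=np.uint8)
img[5:15, 10:50] = 0

cropped = trim_white_borders(img)
factor = upscale_factor(cropped.shape[0])
print(cropped.shape, factor)
```

The resize itself could then be done with something like
cv2.resize(cropped, None, fx=factor, fy=factor, interpolation=cv2.INTER_CUBIC),
and recognition run with config='--psm 13' in pytesseract. Use the same
function on the training images and on the real ones.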


Bye

Lorenzo


2018-07-25 16:49 GMT+02:00 Emiliano Isaza Villamizar <[email protected]>:

> Hello,
>
> I'm trying to train tesseract to accurately extract information from a
> table. Initially, I ran pytesseract like this:
>
> pytesseract.image_to_string(img, lang='eng', config='--psm 11 --oem 1 -c tessedit_char_whitelist=0123456789')
>
> I get these results:
>
> ground truth                            Tesseract
>
> CN¥6.94                                 CN#6.94
>
> ¥31660.90                               ¥31660.90
>
> Ltd                                     Lid
>
> I retrained tesseract with OCR-D, I extracted each cell and wrote the
> ground truth for 3 tables that add up to 300 cells (300 labeled images). I
> ran it for 15000 iterations and got an error of 0.5%. But now I get worse
> results. Tesseract doesn't seem to read numbers and basic acronyms. Attached
> you may find an example of an image used for training.
>
> ground truth                              New tesseract
>
> 000426.China                            ooo426.cin
>
> How can I improve tesseract to read these weird characters? I already
> tried to improve the image quality by transforming the image using CV2. This
> is an example:
>
>
> import cv2
> # convert to grayscale first, then apply the adaptive threshold
> img_grey = cv2.cvtColor(atable, cv2.COLOR_BGR2GRAY)
> th3 = cv2.adaptiveThreshold(img_grey, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
>                             cv2.THRESH_BINARY, 11, 2)
>
>
> Thanks!!
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/1b05ace0-4ca6-4caf-94a8-d53f7c0bec35%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>


train-multi.sh
Description: application/shellscript

monitor-eval.sh
Description: application/shellscript

train.sh
Description: application/shellscript
