Re: [tesseract-ocr] Not getting results with numbers and currency simbols in tables

Emiliano Isaza Villamizar Mon, 30 Jul 2018 08:19:36 -0700

Lorenzo, thank you so much for your help. I followed everything step by step 
and got a very good result. I think what helped most was upscaling the 
images. In case anyone is following the thread, here is the Python code I 
used:


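from PIL import Image

# imagepath and baseheight (the target height in pixels) are set elsewhere
img = Image.open(imagepath)
# scale factor that brings the image to the target height
hpercent = baseheight / float(img.size[1])
# scale the width by the same factor to keep the aspect ratio
wsize = int(float(img.size[0]) * hpercent)
# Image.LANCZOS is the same filter as the older Image.ANTIALIAS name
img = img.resize((wsize, baseheight), Image.LANCZOS)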

I'm a real newbie in bash, so I didn't use your scripts: I kept getting a 
permission error. Thank you again, Lorenzo!
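
For anyone following the thread: Lorenzo's advice below (trim the white 
borders, then upscale so that the text is 30-50 pixels tall) can be sketched 
roughly as follows. This is only an illustration under assumptions that are 
not from the thread: the image is a single text line given as rows of 
grayscale values, the helper names content_bbox and upscaled_size are made 
up, the white threshold of 250 and the 40-pixel target are arbitrary picks, 
and the height of the trimmed line is treated as the text height.

```python
def content_bbox(pixels, white_threshold=250):
    """Return (left, top, right, bottom) of the non-white area, or None.

    `pixels` is a list of rows of grayscale values (0 = black, 255 = white).
    right and bottom are exclusive, like a Pillow crop box.
    """
    # Rows that contain at least one pixel darker than the threshold.
    rows = [y for y, row in enumerate(pixels)
            if any(v < white_threshold for v in row)]
    if not rows:
        return None  # the image is entirely white
    # Columns that contain at least one dark pixel.
    cols = [x for x in range(len(pixels[0]))
            if any(row[x] < white_threshold for row in pixels)]
    return cols[0], rows[0], cols[-1] + 1, rows[-1] + 1


def upscaled_size(bbox, target_text_height=40):
    """Return (width, height) of the trimmed box scaled to the target height."""
    left, top, right, bottom = bbox
    scale = target_text_height / (bottom - top)
    # Scale the width by the same factor to keep the aspect ratio.
    return max(1, round((right - left) * scale)), target_text_height
```

The box and the size returned here could then be passed to Pillow's crop() 
and resize() in place of the fixed baseheight used above.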






On Thursday, July 26, 2018 at 5:46:44 AM UTC-5, Lorenzo Blz wrote:
>
> First, read this: "Fine Tuning for ± a few characters" 
> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters>
>
>
> Then check the data/unicharset file to make sure everything is OK and that 
> it contains all the characters you want.
>
>
> Then, 15000 iterations are way too many and 300 samples are really too 
> few. If you train too much you'll get worse results. 
>
> I usually get the best fine tuning results from 400 to 2000 iterations. I 
> can do more, up to 20k iterations, only when I have many sample images: a 
> few thousand with multiple words.
>
>
> I do it like this (this is not a complete guide, just to give you the 
> general idea):
>
> - Clean the data and data/checkpoints folders (do NOT add -rf, you do not 
> want to wipe out the training data):
>
> rm data/*
>
> rm data/checkpoints/*
>
>
> (do this only once, when you start a new training session, not after each 
> training step)
>
> - Go into the Makefile and fix this (in the "data/list.eval" block, remove 
> the + before $$no):
>
>
>      tail -n "$$no" $(ALL_LSTMF) > "$@"
>
>
> then add somewhere at the top:
>
> ITERATIONS=100
>
> and change the max_iterations line to this (do not change the tabs/spaces 
> at the beginning, just replace the number):
>
> --max_iterations $(ITERATIONS)
>
> - now run the training as normal like this:
>
> make training ITERATIONS=100
>
> - when it finishes run this:
>
> lstmeval --model data/YOUR_MODEL.traineddata --eval_listfile data/list.eval
>
> In the last line you'll get something like this:
>
> At iteration 0, stage 0, Eval Char error rate=0.96153846, Word error 
> rate=3.8461538
>
> These are the only values that matter. Take note of these values and the 
> iteration numbers.
>
> Make a backup of the model:
>
> cp data/YOUR_MODEL.traineddata data/YOUR_MODEL.traineddata_100
>
> - Now start the training again with ITERATIONS=200, it will resume from 
> the previous iteration up to 200:
>
> make training ITERATIONS=200
>
> - Run lstmeval again, take note, backup and so on, 300, 400, 500....
>
> You should see that the error rate will go down for a while then it will 
> slow down and then will start to get worse. Use the model where you got the 
> best score. 
>
> You can try this, but 300 samples are likely way too few for this to be 
> meaningful.
>
> I'm attaching my training scripts, they should work but double check 
> everything.
>
>
> About thresholding: you probably do not need it. Just increase the 
> contrast a little and do not go binary; you may not even need the contrast 
> boost. Whatever processing you will apply to your real data, apply the same 
> to the training data.
>
> Two important things, for both training and recognition: use PSM=13 
> (PSM.RAW_LINE), and trim all the white borders, then upscale the image so 
> that the text is 30-50 pixels tall.
>
> Again, train with the same processing you'll use for recognition.
>
>
> Bye
>
> Lorenzo
>
>
> 2018-07-25 16:49 GMT+02:00 Emiliano Isaza Villamizar <[email protected]>:
>
>> Hello,
>>
>> I'm trying to train Tesseract to accurately extract information from a 
>> table. Initially I ran it through pytesseract like this:
>>
>> pytesseract.image_to_string(img, lang='eng', config='--psm 11 --oem 1 -c 
>> tessedit_char_whitelist=0123456789')
>>
>> I get these results:
>>
>> ground truth                            Tesseract
>>
>> CN¥6.94                                 CN#6.94
>> ¥31660.90                               ¥31660.90
>> Ltd                                     Lid
>>
>> I retrained Tesseract with OCR-D: I extracted each cell and wrote the 
>> ground truth for 3 tables that add up to 300 cells (300 labeled images). I 
>> ran it for 15000 iterations and got an error of 0.5%, but now I get worse 
>> results. Tesseract doesn't seem to read numbers and basic acronyms. 
>> Attached you may find an example of an image used for training.
>>
>> ground truth                              New tesseract
>>
>> 000426.China                            ooo426.cin
>>
>> How can I improve Tesseract to read these weird characters? I already 
>> tried to improve the image quality by transforming the image with cv2. 
>> This is an example:
>>
>>
>> # convert to grayscale first: adaptiveThreshold needs a single channel
>> img_grey = cv2.cvtColor(atable, cv2.COLOR_BGR2GRAY)
>> th3 = cv2.adaptiveThreshold(img_grey, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
>>                             cv2.THRESH_BINARY, 11, 2)
>>
>>
>> Thanks!!
>>

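The train, evaluate and back up loop that Lorenzo describes above leaves you 
with one lstmeval summary line per round. Here is a minimal sketch of how 
those lines could be compared to pick the best backup, assuming you saved the 
lstmeval output of each round yourself. The function names and the dictionary 
layout are hypothetical, and the regular expression is based only on the 
sample line quoted in the thread, so other Tesseract versions may print 
something different.

```python
import re

# Pattern for the summary line lstmeval prints, as quoted in the thread.
# \s+ between words tolerates the line being wrapped by a mail client.
_EVAL_RE = re.compile(
    r"Eval\s+Char\s+error\s+rate=(\d+(?:\.\d+)?),"
    r"\s*Word\s+error\s+rate=(\d+(?:\.\d+)?)"
)


def parse_lstmeval(output):
    """Return (char_error, word_error) from lstmeval output, or None."""
    match = None
    for match in _EVAL_RE.finditer(output):
        pass  # keep the last occurrence: the summary is on the last line
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def best_checkpoint(results):
    """Return the iteration count with the lowest character error rate.

    `results` maps an iteration count (100, 200, ...) to the lstmeval
    output captured after that training round.
    """
    scored = {}
    for iterations, output in results.items():
        rates = parse_lstmeval(output)
        if rates is not None:
            scored[iterations] = rates
    if not scored:
        raise ValueError("no lstmeval summary line found")
    # Tuples compare element by element, so word error rate breaks ties.
    return min(scored, key=lambda iterations: scored[iterations])
```

For example, best_checkpoint({100: out_100, 200: out_200, 300: out_300}) 
returns the iteration count whose backup (data/YOUR_MODEL.traineddata_100 and 
so on) scored best, which is the model Lorenzo suggests keeping.
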
-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/0ae0ec83-b0d3-43b3-bfd5-5f612b297d3b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
