Re: [tesseract-ocr] Not getting results with numbers and currency simbols in tables

Zdenko Podobny Sun, 24 Mar 2019 00:29:01 -0700

Tesseract is an OCR library, i.e. the user is responsible for image preprocessing.


Zdenko


On Sun, 24 Mar 2019 at 4:12, <[email protected]> wrote:

> Hi, I'm confused about why upscaling works. Tesseract itself also has a
> step that prescales the image to a height of 36 pixels.
>
> On Monday, July 30, 2018 at 11:19:23 PM UTC+8, Emiliano Isaza Villamizar wrote:
>>
>> Lorenzo, thank you so much for your help. I did everything step by step
>> and got a very good result. I think what helped me most was upscaling the
>> images. My code is in Python; here it is in case anyone is following the
>> thread:
>>
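>> from PIL import Image
>>
>> # imagepath and baseheight (target height in pixels) are set elsewhere.
>> img = Image.open(imagepath)
>> # Scale the width by the same factor as the height to keep the aspect ratio.
>> hpercent = baseheight / float(img.size[1])
>> wsize = int(float(img.size[0]) * hpercent)
>> # Image.LANCZOS is the current name of the filter formerly called ANTIALIAS.
>> img = img.resize((wsize, baseheight), Image.LANCZOS)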
>>
>> I'm a real newbie in bash, so I didn't use your scripts: I kept getting a
>> permission error. Thank you again, Lorenzo!
>>
>>
>>
>>
>>
>>
>> On Thursday, July 26, 2018 at 5:46:44 AM UTC-5, Lorenzo Blz wrote:
>>>
>>> First, read this: "Fine Tuning for ± a few characters"
>>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters>
>>>
>>>
>>> Then check the data/unicharset file to make sure it contains all the
>>> characters you want.
>>>
>>>
>>> Then, 15000 iterations are way too many and 300 samples are really too
>>> few. If you train too much you'll get worse results.
>>>
>>> I usually get the best fine tuning results from 400 to 2000 iterations.
>>> I can do more, up to 20k iterations, only when I have many sample images: a
>>> few thousand with multiple words.
>>>
>>>
>>> I do it like this (this is not a complete guide, just to give you the
>>> general idea):
>>>
>>> - Clean the data and data/checkpoints folders (do NOT add -rf, you do not
>>> want to wipe out the training data):
>>>
>>> rm data/*
>>>
>>> rm data/checkpoints/*
>>>
>>>
>>> (do this only once, when you start a new training session, not after
>>> each training step)
>>>
>>> - Go into the Makefile and fix this (in the "data/list.eval" block, remove
>>> the + before $$no):
>>>
>>>
>>>      tail -n "$$no" $(ALL_LSTMF) > "$@"
>>>
>>>
>>> then add somewhere at the top:
>>>
>>> ITERATIONS=100
>>>
>>> and change the max_iterations line to this (do not change the
>>> tabs/spaces at the beginning, just replace the number):
>>>
>>> --max_iterations $(ITERATIONS)
>>>
>>> - now run the training as normal like this:
>>>
>>> make training ITERATIONS=100
>>>
>>> - when it finishes run this:
>>>
>>> lstmeval --model data/YOUR_MODEL.traineddata --eval_listfile data/list.eval
>>>
>>> In the last line you'll get something like this:
>>>
>>> At iteration 0, stage 0, Eval Char error rate=0.96153846, Word error rate=3.8461538
>>>
>>> These are the only values that matter. Take note of these values and the
>>> iteration numbers.
>>>
>>> Make a backup of the model:
>>>
>>> cp data/YOUR_MODEL.traineddata data/YOUR_MODEL.traineddata_100
>>>
>>> - Now start the training again with ITERATIONS=200, it will resume from
>>> the previous iteration up to 200:
>>>
>>> make training ITERATIONS=200
>>>
>>> - Run lstmeval again, take note, backup and so on, 300, 400, 500....
>>>
>>> You should see that the error rate will go down for a while then it will
>>> slow down and then will start to get worse. Use the model where you got the
>>> best score.
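
The bookkeeping in the steps above (run lstmeval after each round, note the error rates, keep the best backup) can be sketched in Python. This is only a minimal sketch: it assumes the lstmeval summary line looks exactly like the one quoted above, and the names `parse_eval` and `best_checkpoint` are made up for illustration, they are not part of Tesseract.

```python
import re

# Pattern for the summary line that lstmeval prints last, in the form quoted
# in the message above. \s+ tolerates the line being wrapped by a mail client.
EVAL_LINE = re.compile(
    r"Eval Char error rate=([0-9.]+), Word error\s+rate=([0-9.]+)"
)


def parse_eval(output):
    """Return (char_error, word_error) found in lstmeval output, or None."""
    match = EVAL_LINE.search(output)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def best_checkpoint(results):
    """Return the iteration count with the lowest character error rate.

    `results` maps an iteration count (100, 200, ...) to the text that
    lstmeval printed for the model saved at that point.
    """
    scored = {}
    for iterations, output in results.items():
        parsed = parse_eval(output)
        if parsed is not None:
            scored[iterations] = parsed[0]
    if not scored:
        raise ValueError("no parsable lstmeval output")
    return min(scored, key=scored.get)


# The sample line from the message above:
sample = ("At iteration 0, stage 0, Eval Char error rate=0.96153846, "
          "Word error rate=3.8461538")
print(parse_eval(sample))
```

The iteration count returned by `best_checkpoint` tells you which backup to keep, e.g. data/YOUR_MODEL.traineddata_200 if it returns 200.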
>>>
>>> You can try this, but 300 samples are likely way too few for this to be
>>> meaningful.
>>>
>>> I'm attaching my training scripts, they should work but double check
>>> everything.
>>>
>>>
>>> About thresholding: you probably do not need it. Just increase the
>>> contrast a little, do not go binary; you may not need even that.
>>> And apply the same processing to the training data that you will apply to
>>> your real data.
>>>
>>> Two important things, for training and recognition. Use PSM=13
>>> (PSM.RAW_LINE). Trim all the white borders, upscale the image so that the
>>> text is 30-50 pixels tall.
>>>
>>> Again, train with the same processing you'll use for recognition.
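
Note that the advice above is about the height of the text, not of the whole image. A rough sketch of the arithmetic, assuming you have already trimmed the white borders and measured the text height in pixels yourself; the function name `rescale_size` and the default target of 40 px (the middle of the 30-50 range) are my own choices for illustration:

```python
def rescale_size(width, height, text_height, target_text_height=40):
    """Return the (width, height) that makes the text target_text_height tall.

    width, height: current image size in pixels.
    text_height:   measured height of the text in the image, in pixels
                   (for a tightly trimmed single line, roughly the image height).
    """
    if text_height <= 0:
        raise ValueError("text_height must be positive")
    # One scale factor for both axes, so the aspect ratio is preserved.
    scale = target_text_height / text_height
    return max(1, round(width * scale)), max(1, round(height * scale))
```

The result can be passed directly to Pillow, for example `img.resize(rescale_size(img.size[0], img.size[1], text_height), Image.LANCZOS)`. Whatever target you pick, use the same one for the training images and for recognition.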
>>>
>>>
>>> Bye
>>>
>>> Lorenzo
>>>
>>>
>>> 2018-07-25 16:49 GMT+02:00 Emiliano Isaza Villamizar <[email protected]>:
>>>
>>>> Hello,
>>>>
>>>> I'm trying to train tesseract to accurately extract information from a
>>>> table. Initially, when running it with pytesseract like this:
>>>>
>>>> pytesseract.image_to_string(img, lang='eng', config='--psm 11 --oem 1 -c tessedit_char_whitelist=0123456789')
>>>>
>>>> I get these results:
>>>>
>>>> ground truth    Tesseract
>>>> CN¥6.94         CN#6.94
>>>> ¥31660.90       ¥31660.90
>>>> Ltd             Lid
>>>>
>>>> I retrained tesseract with OCR-D: I extracted each cell and wrote the
>>>> ground truth for 3 tables that add up to 300 cells (300 labeled images). I
>>>> ran it for 15000 iterations and got an error of 0.5%. But now I get worse
>>>> results. Tesseract doesn't seem to read numbers and basic acronyms. Attached
>>>> you may find an example of an image used for training.
>>>>
>>>> ground truth                              New tesseract
>>>>
>>>> 000426.China                            ooo426.cin
>>>>
>>>> How can I improve tesseract to read these weird characters? I already
>>>> tried to improve the image quality by transforming the image with cv2;
>>>> this is an example:
>>>>
>>>>
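>>>> import cv2
>>>> # Convert to grayscale first; adaptiveThreshold needs a single-channel image.
>>>> img_grey = cv2.cvtColor(atable, cv2.COLOR_BGR2GRAY)
>>>> th3 = cv2.adaptiveThreshold(img_grey, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
>>>>                             cv2.THRESH_BINARY, 11, 2)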
>>>>
>>>>
>>>> Thanks!!
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/1b05ace0-4ca6-4caf-94a8-d53f7c0bec35%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1b05ace0-4ca6-4caf-94a8-d53f7c0bec35%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>

