Re: File Uzn using tesseract 3

zdenko podobny Wed, 19 Jun 2013 23:34:25 -0700

On Wed, Jun 19, 2013 at 8:45 PM, llozano <[email protected]> wrote:

> This is awesome. Thanks for your reply. So, one more for you.. just to
> clarify..
> In your command example: should 8309_016.2B_psm4 have the suffix
> _psm4? Is that intended or just a typo?
>


As the second argument (the output basename, in this case 8309_016.2B_psm4)
you can use any free text.
I prefer to use the image name (to easily identify the image source) plus
something that identifies how I ran tesseract (_psm4). This can be useful if
you are planning to test different page segmentation modes on the same image.
Then you can use a tool like kdiff3 (or WinMerge, or Compare by content in
Total Commander if you are on Windows) to see the differences coming from
the different psm values.
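
A minimal sketch of that naming scheme in Python, in case it helps. The
function names are made up for illustration, and the "-psm" spelling is the
one used in this thread (newer tesseract releases spell it "--psm"). It only
builds the command lines and diffs two text results; it assumes tesseract
writes its result to <basename>.txt, and note that Path(...).stem drops any
directory part of the image path.

```python
import difflib
from pathlib import Path


def build_commands(image, psm_modes):
    """Return one tesseract argv list per psm mode, tagging each output basename."""
    stem = Path(image).stem  # "8309_016.2B.tif" -> "8309_016.2B"
    return [
        ["tesseract", image, f"{stem}_psm{psm}", "-psm", str(psm)]
        for psm in psm_modes
    ]


def diff_outputs(text_a, text_b, name_a="a.txt", name_b="b.txt"):
    """Return a unified diff of two OCR results as a single string."""
    lines = difflib.unified_diff(
        text_a.splitlines(), text_b.splitlines(), name_a, name_b, lineterm=""
    )
    return "\n".join(lines)


if __name__ == "__main__":
    for cmd in build_commands("8309_016.2B.tif", [3, 4, 6]):
        # Pass cmd to subprocess.run(cmd, check=True) to actually run tesseract.
        print(" ".join(cmd))
```

After running the commands you could read the resulting .txt files and pass
their contents to diff_outputs() instead of opening a graphical diff tool.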


> Do I need to pass the tiff file through some filter to remove colors or
> something like that? The examples you shared in your tar.gz file, which are
> awesome, are in grayscale, and I am not sure about the resolution. Is
> there some preparation of the image that improves the output?
>

The images you saw are part of the UNLV tests (see [1]). There are many more
files there, with different DPI.
Tesseract binarizes the input image by itself (see e.g. [2] for the parameter
that makes tesseract write out the binarized image). If you are not satisfied
with the result, you can binarize the images yourself in advance (e.g. to use
a different algorithm). Search the tesseract forum if you need more details
about the binarization algorithm it uses.

[1] https://code.google.com/p/tesseract-ocr/wiki/TestingTesseract
[2] http://www.sk-spell.sk.cx/through-tesseract-ocr-eye
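
If you do want to binarize in advance, here is a minimal sketch of one
common global method (Otsu's threshold) in Python. It works on a flat list
of 8-bit grayscale values so that it stays self-contained; loading and saving
the image is left out, and the function names are only illustrative. I am
not claiming this matches what tesseract does internally.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0..255) that maximizes between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no pixels in the lower class yet
        w1 = total - w0
        if w1 == 0:
            break     # all pixels are in the lower class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (at or below threshold) or 255 (above threshold)."""
    return [255 if p > threshold else 0 for p in pixels]


if __name__ == "__main__":
    sample = [10] * 50 + [200] * 50  # dark text pixels and light background
    t = otsu_threshold(sample)
    print(t, binarize(sample, t)[:3], binarize(sample, t)[-3:])
```

A global threshold like this tends to struggle with uneven lighting; in that
case a different (e.g. local) algorithm may be worth testing against the
same image.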


> Thanks again!
>
>
>
> On Wednesday, June 19, 2013 2:18:40 PM UTC-4, zdenop wrote:
>>
>>
>> On Wed, Jun 19, 2013 at 3:20 PM, llozano <[email protected]> wrote:
>>
>>> Francesco,
>>>
>>> Would you mind posting what this uzn file may look like
>>>
>>
>> Have a look at (e.g.)
>> https://isri-ocr-evaluation-tools.googlecode.com/files/zset.2B.tar.gz
>>
>>> and what should the entire command be?
>>>
>>
>> As far as I remember, if you use psm > 3 tesseract will look for a uzn file
>> (based on the image name). If you are on Linux you can easily check this
>> with strace.
>>
>> So you can try something like this:
>> tesseract 8309_016.2B.tif 8309_016.2B_psm4 -psm 4
>>
>>
>>> I'm starting to research this area for one project and I am a bit puzzled.
>>> All I know is that I need to specify areas of a document to extract text
>>> from. The document is laid out in tables. Do I need to remove the lines if
>>> I specify areas?
>>>
>>
>> The best way is to run your own test and share your findings.
>>
>>>
>>> Thanks
>>>
>>>
>>> On Thursday, July 5, 2012 11:00:10 AM UTC-4, Di Perna Francesco wrote:
>>>>
>>>> Ok. No one can help me.
>>>> I have found the solution anyway....:-)
>>>> Calling tesseract with the parameter "-psm 4" and renaming the uzn file
>>>> to the same name as the image seems to work.
>>>> Bye
>>>>
>>>> On Jul 4, 13:16, Di Perna Francesco <[email protected]>
>>>> wrote:
>>>> > Hi, we use tesseract in a web application to recognize some numbers in
>>>> > documents acquired with a scanner.
>>>> > With tesseract 2 we used the "uzn" file to indicate in which area
>>>> > of the tiff file the numbers to be recognized are (the uzn file should
>>>> > have the same name as the tiff file, with the "uzn" extension).
>>>> > We have now installed tesseract 3. My error was to suppose that the uzn
>>>> > file works as in the previous version, but it doesn't.
>>>> > Can anyone explain to me how to recognize some areas of the file in
>>>> > tesseract 3?
>>>> > Regards
>>>
>>>  --
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To post to this group, send email to [email protected]
>>>
>>> To unsubscribe from this group, send email to
>>> tesseract-oc...@googlegroups.com
>>>
>>> For more options, visit this group at
>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>>
>>> For more options, visit
>>> https://groups.google.com/groups/opt_out.

