Re: Chinese Simplified on this image not working

zdenko podobny Tue, 18 Dec 2012 11:11:44 -0800

I do apologize, but I am not familiar with Chinese (or other Asian
languages ;-) ). So I tried


tesseract original.jpg original -l chi_sim

and the message was:

Too many unichars in ambiguity on line 0
Too many unichars in ambiguity on line 0
Tesseract Open Source OCR Engine v3.02.02 with Leptonica

It created output. Then I tried:

tesseract preprocessed.tiff preprocessed -l chi_sim

and I got:

Too many unichars in ambiguity on line 0
Too many unichars in ambiguity on line 0
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
Garbage result of merge? Left Ragged (414,645)->(258,2108) w=6 s=0, sort key=14617152, boxes=53, partners=0
Garbage result of merge? Right Ragged (425,619)->(291,1877) w=5 s=0, sort key=14872288, boxes=19, partners=0
Garbage result of merge? Right Ragged (686,619)->(552,1881) w=6 s=0, sort key=22788162, boxes=41, partners=0
Garbage result of merge? Right Ragged (1967,899)->(1840,2098) w=5 s=0, sort key=62484498, boxes=38, partners=0
Garbage result of merge? Right Ragged (2166,687)->(2055,1729) w=6 s=0, sort key=67827472, boxes=30, partners=0
Garbage result of merge? Left Ragged (3336,571)->(3173,2096) w=5 s=0, sort key=102864440, boxes=45, partners=0
Garbage result of merge? Right Ragged (3403,646)->(3248,2100) w=5 s=0, sort key=105121744, boxes=58, partners=0
Garbage result of merge? Left Ragged (3510,565)->(3401,1596) w=5 s=0, sort key=108127380, boxes=33, partners=0
Garbage result of merge? Right Ragged (3632,565)->(3469,2096) w=6 s=0, sort key=111800270, boxes=60, partners=0

and no output was created.
Messages before "Tesseract Open Source..." are from the init phase, so it
looks like there could be a problem in the "chi_sim" language file. Messages
after "Tesseract Open Source..." are from the OCR phase, so it looks like
your preprocessing is causing tesseract problems that result in no output
being created.
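
To make that init/OCR split easier to check on other images, here is a
minimal Python sketch (not part of tesseract; the name split_tesseract_log
is made up for illustration). It assumes the banner line always starts with
"Tesseract Open Source OCR Engine", as it does in the 3.02.02 output above,
and that you have captured the console messages as text:

```python
BANNER_PREFIX = "Tesseract Open Source OCR Engine"


def split_tesseract_log(log_text):
    """Return (init_messages, ocr_messages), split at the banner line.

    Assumption: if no banner line is found, every message is treated
    as init-phase output.
    """
    init_messages, ocr_messages = [], []
    seen_banner = False
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if not seen_banner and line.startswith(BANNER_PREFIX):
            seen_banner = True  # the banner itself belongs to neither group
            continue
        (ocr_messages if seen_banner else init_messages).append(line)
    return init_messages, ocr_messages
```

Anything in the first list points at the language data, anything in the
second at the image or its preprocessing.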

I am not sure if I can help you, but here are some general suggestions:

   - if you modified the source or use the API and you have a problem with
   some image, try the tesseract executable built by a "test" compiler (gcc
   on linux, VS2008 on Windows). This can help to find out whether the
   problem is in your code, in the compiler or in the tesseract library.
   - try to use another (similar) language file. Some problems have been
   reported with Kannada (issue 801), Esperanto (issue 791) and
   Spanish (issue 758). If a symbol was not trained (e.g. accented symbols
   are missing), no preprocessing will help you.

For your case I would also suggest creating a test case where you can test
whether size matters (or maybe you will find an optimum size).
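
One possible shape for such a test case, as a rough Python sketch. It
assumes ImageMagick's "convert" and the tesseract executable are on the
PATH; the percentages are arbitrary starting points and the function names
are made up for illustration:

```python
import subprocess
from pathlib import Path


def sweep_commands(image, lang="chi_sim", percents=(50, 75, 100, 150, 200, 300)):
    """Build one (resize_cmd, ocr_cmd) pair per scale percentage."""
    src = Path(image)
    pairs = []
    for pct in percents:
        scaled = src.with_name(f"{src.stem}_{pct}.png")
        # ImageMagick: convert <in> -resize <pct>% <out>
        resize_cmd = ["convert", str(src), "-resize", f"{pct}%", str(scaled)]
        # tesseract <image> <outputbase> -l <lang>
        ocr_cmd = ["tesseract", str(scaled), str(scaled.with_suffix("")), "-l", lang]
        pairs.append((resize_cmd, ocr_cmd))
    return pairs


def run_sweep(image, **kwargs):
    """Run every pair and print the size of the text file tesseract wrote."""
    for resize_cmd, ocr_cmd in sweep_commands(image, **kwargs):
        subprocess.run(resize_cmd, check=True)
        subprocess.run(ocr_cmd, check=False)  # keep going even if OCR fails
        out = Path(ocr_cmd[2] + ".txt")
        size = out.stat().st_size if out.exists() else 0
        print(f"{resize_cmd[3]:>5}  {out.name}: {size} bytes")
```

Calling run_sweep("original.jpg") would then show at which sizes an output
file is produced at all; judging the quality of the text still needs a
human who reads Chinese.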

Zdenko



On Tue, Dec 18, 2012 at 3:23 PM, occorled <[email protected]> wrote:

> Thank you for the reply.  I appreciate the help.
>
> We are compiling on Windows7 (64-bit machine), but src compiled in
> 32-bit.  We are using tesseract 3.02 compiled from scratch with VS2010.  I
> link to the built static libs from a wrapper.  Also using leptonica 1.68
> static lib, not built from scratch.
>
> Note: this file is in Chinese Simplified ("chi_sim"), not Traditional,
> which has a different hanzi set.
>
> We are evaluating tesseract-ocr as an alternative to some of the more
> pricey apis we currently use like Accusoft (which I believe just wraps
> OmniPage under the hood - which it seems Ray Smith is very familiar with),
> and Aspose.OCR (which is in its infancy).  We don't use the .exe, but
> rather the api, specifically (pixRead, Init, SetPageSegMode,
> "preprocessing", ProcessPage).
>
> Given the above, I need an algorithm for attempting OCR on any image
> thrown at it. I have posted here before with a 96 dpi image, and the
> suggestion was to upscale to 300 dpi.  This is also suggested on the
> forum link you provided.  This Chinese Simplified image reports that it is
> 96 dpi after pixRead.  So my silly algorithm is to take any image, read
> the pix, upscale by the appropriate amount if it is less than 300 dpi,
> check the depth and convert to grayscale if necessary, apply a TRC, and
> finally convert to 1bpp - similar to what is done here:
> http://tpgit.github.com/UnOfficialLeptDocs/leptonica/document-image-analysis.html
>
> Perhaps you can help me come up with a better generic algorithm that
> preprocesses/prepares any file for OCRing.  We cannot have customers
> setting specific values for each image to OCR them separately.  We need a
> generic algorithm to prepare the image.  Our Accusoft solution takes just
> about any image (including vector images) and produces pretty good OCR
> results.  Then again, it's crazy expensive as well... so we are looking to
> find an alternative with leptonica/tesseract.   With the forum's help of
> course :)  I admit that I am not very experienced with all this and would
> gladly accept your help in finding a better algorithm.  Perhaps I can
> decipher one in the tesseract src from the command argument "original" that
> you passed?
>
>
>
> On Tuesday, December 18, 2012 8:54:04 AM UTC-5, zdenop wrote:
>
>> Which OS do you use, which version of tesseract, etc.?
>> I tried
>>     tesseract original.jpg original -l chi_tra
>> and
>>     tesseract preprocessed.tiff preprocessed -l chi_tra
>> and I did not get any error message (on openSUSE linux 64bit 12.2 with
>> tesseract 3.02.02)...
>>
>> Why did you upscale the image? It is not a universal solution - see some
>> past experience[1].
>> [1] https://groups.google.com/d/msg/tesseract-ocr/KVHsGxfDdy0/hh6r4AFUvRMJ
>>
>> Zdenko
>>
>>
>>
>> On Tue, Dec 18, 2012 at 2:55 AM, occorled <[email protected]> wrote:
>>
>>>  I have an image here that was initially 32 bit depth, but I scaled it
>>> larger, grayscaled, TRCed, then thresholded to binary to produce the
>>> image I send to tesseract.  However it still cannot capture anything.  Is
>>> there something wrong with this image because the text is so big?
>>>
>>> Tesseract prints out the following errors:
>>>
>>>> Too many unichars in ambiguity on line 102163292
>>>> Too many unichars in ambiguity on line 102163292
>>>> Too many unichars in ambiguity on line 102163292
>>>> Garbage result of merge? Left Ragged (414,645)->(258,2108) w=6 s=0, sort key=14617152, boxes=53, partners=0
>>>> Garbage result of merge? Right Ragged (425,619)->(291,1877) w=5 s=0, sort key=14872288, boxes=19, partners=0
>>>> Garbage result of merge? Right Ragged (686,619)->(552,1881) w=6 s=0, sort key=22788162, boxes=41, partners=0
>>>> Garbage result of merge? Right Ragged (1967,899)->(1840,2098) w=5 s=0, sort key=62484498, boxes=38, partners=0
>>>> Garbage result of merge? Right Ragged (2166,687)->(2055,1729) w=6 s=0, sort key=67827472, boxes=30, partners=0
>>>> Garbage result of merge? Left Ragged (3336,571)->(3173,2096) w=5 s=0, sort key=102864440, boxes=45, partners=0
>>>> Garbage result of merge? Right Ragged (3403,646)->(3248,2100) w=5 s=0, sort key=105121744, boxes=58, partners=0
>>>> Garbage result of merge? Left Ragged (3510,565)->(3401,1596) w=5 s=0, sort key=108127380, boxes=33, partners=0
>>>> Garbage result of merge? Right Ragged (3632,565)->(3469,2096) w=6 s=0, sort key=111800270, boxes=60, partners=0
>>>>
>>>  --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To post to this group, send email to [email protected]
>>>
>>> To unsubscribe from this group, send email to
>>> tesseract-oc...@googlegroups.com
>>>
>>> For more options, visit this group at
>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>
>>
