Thank you for the reply.  I appreciate the help.

We are compiling on Windows 7 (64-bit machine), but the source is compiled 
as 32-bit.  We are using tesseract 3.02 built from scratch with VS2010.  I 
link to the built static libs from a wrapper.  We also use the leptonica 
1.68 static lib, which was not built from scratch.

Note: this file is in Chinese Simplified ("chi_sim"), not Traditional, which 
has a different hanzi set.

We are evaluating tesseract-ocr as an alternative to some of the pricier 
APIs we currently use, like Accusoft (which I believe just wraps OmniPage 
under the hood, something Ray Smith seems very familiar with) and 
Aspose.OCR (which is in its infancy).  We don't use the .exe, but rather 
the API, specifically pixRead, Init, SetPageSegMode, "preprocessing", and 
ProcessPage.

Given the above, I need an algorithm for attempting OCR on any image thrown 
at it.  I have posted here before about a 96 dpi image, and the suggestion 
was to upscale to 300 dpi.  This is also suggested in the forum link you 
provided.  This Chinese Simplified image reports 96 dpi after pixRead.  So 
my silly algorithm is to take any image, read it into a pix, upscale by the 
appropriate amount if it is below 300 dpi, check the depth and convert to 
grayscale if necessary, apply a TRC, and finally convert to 1bpp, similar 
to what is done here: 
http://tpgit.github.com/UnOfficialLeptDocs/leptonica/document-image-analysis.html
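
To make the "upscale if less than 300 dpi" step concrete, here is a minimal 
sketch of just the scale-factor decision.  The helper name upscaleFactor and 
the choice to leave images with an unknown (zero) resolution unscaled are my 
own assumptions for illustration, not anything taken from tesseract or 
leptonica.

```cpp
#include <cassert>

// Hypothetical helper: factor by which to scale an image so that it
// reaches targetDpi. A reported resolution of 0 or less is treated as
// "unknown" and left unscaled (an assumption made for this sketch).
double upscaleFactor(int reportedDpi, int targetDpi = 300) {
    assert(targetDpi > 0);
    if (reportedDpi <= 0) return 1.0;          // resolution not recorded
    if (reportedDpi >= targetDpi) return 1.0;  // already at or above target
    return static_cast<double>(targetDpi) / reportedDpi;
}
```

For the 96 dpi image this gives 3.125, which would then be passed as both 
the x and y factor to a scaling call such as leptonica's pixScale.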

Perhaps you can help me come up with a better generic algorithm that 
prepares any file for OCR.  We cannot have customers setting specific 
values for each image to OCR them separately.  We need a generic algorithm 
to prepare the image.  Our Accusoft solution takes just about any image 
(including vector images) and produces pretty good OCR results.  Then 
again, it's crazy expensive as well, so we are looking for an alternative 
with leptonica/tesseract, with the forum's help of course :)  I admit that 
I am not very experienced with all this and would gladly accept your help 
in finding a better algorithm.  Perhaps I can decipher one in the tesseract 
source from the command argument "original" that you passed?
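
One common way to avoid per-image settings in the final 1bpp step is to pick 
the threshold from the image histogram, for example with Otsu's method.  The 
sketch below works on a plain 8-bit grayscale buffer rather than a leptonica 
pix, and the function names are hypothetical.  I have not verified that it 
helps with this particular image.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Otsu's method: pick the threshold t that maximizes the between-class
// variance of "dark" (value <= t) and "light" (value > t) pixels.
// Returns 0 if the image contains only one gray value.
int otsuThreshold(const std::vector<std::uint8_t>& gray) {
    assert(!gray.empty());
    std::array<long long, 256> hist{};
    for (std::uint8_t v : gray) ++hist[v];

    const double total = static_cast<double>(gray.size());
    double sumAll = 0.0;
    for (int i = 0; i < 256; ++i) sumAll += i * static_cast<double>(hist[i]);

    double weightDark = 0.0, sumDark = 0.0, bestVar = -1.0;
    int best = 0;
    for (int t = 0; t < 256; ++t) {
        weightDark += static_cast<double>(hist[t]);
        sumDark += t * static_cast<double>(hist[t]);
        if (weightDark == 0.0) continue;       // no dark pixels yet
        const double weightLight = total - weightDark;
        if (weightLight == 0.0) break;         // no light pixels left
        const double diff = sumDark / weightDark
                          - (sumAll - sumDark) / weightLight;
        const double between = weightDark * weightLight * diff * diff;
        if (between > bestVar) { bestVar = between; best = t; }
    }
    return best;
}

// Binarize: 1 marks a dark (foreground) pixel, 0 a light (background) one.
std::vector<std::uint8_t> binarize(const std::vector<std::uint8_t>& gray,
                                   int threshold) {
    std::vector<std::uint8_t> out(gray.size());
    for (std::size_t i = 0; i < gray.size(); ++i)
        out[i] = gray[i] <= threshold ? 1 : 0;
    return out;
}
```

Leptonica ships its own Otsu-based binarization (pixOtsuAdaptiveThreshold), 
which would likely be the better choice in a real pipeline.  The code above 
is only meant to show why no per-image value has to be supplied.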


On Tuesday, December 18, 2012 8:54:04 AM UTC-5, zdenop wrote:
>
> What OS are you using, what version of tesseract, etc.?
> I tried
>     tesseract original.jpg original -l chi_tra
> and
>     tesseract preprocessed.tiff preprocessed -l chi_tra
> and I did not get any error message (on openSUSE linux 64bit 12.2 with 
> tesseract 3.02.02)...
>
> Why did you upscale the image? It is not a universal solution; see some 
> past experience [1].
> [1] https://groups.google.com/d/msg/tesseract-ocr/KVHsGxfDdy0/hh6r4AFUvRMJ
>
> Zdenko
>
>
>
> On Tue, Dec 18, 2012 at 2:55 AM, occorled <[email protected]> wrote:
>
>> I have an image here that was initially 32-bit depth, but I scaled it 
>> larger, converted it to grayscale, applied a TRC, then thresholded it to 
>> binary to produce the image I send to tesseract.  However, it still cannot 
>> capture anything.  Is there something wrong with this image because the 
>> text is so big?
>>
>> Tesseract prints out the following errors:
>>
>>> Too many unichars in ambiguity on line 102163292
>>> Too many unichars in ambiguity on line 102163292
>>> Too many unichars in ambiguity on line 102163292
>>> Garbage result of merge? Left Ragged (414,645)->(258,2108) w=6 s=0, sort key=14617152, boxes=53, partners=0
>>> Garbage result of merge? Right Ragged (425,619)->(291,1877) w=5 s=0, sort key=14872288, boxes=19, partners=0
>>> Garbage result of merge? Right Ragged (686,619)->(552,1881) w=6 s=0, sort key=22788162, boxes=41, partners=0
>>> Garbage result of merge? Right Ragged (1967,899)->(1840,2098) w=5 s=0, sort key=62484498, boxes=38, partners=0
>>> Garbage result of merge? Right Ragged (2166,687)->(2055,1729) w=6 s=0, sort key=67827472, boxes=30, partners=0
>>> Garbage result of merge? Left Ragged (3336,571)->(3173,2096) w=5 s=0, sort key=102864440, boxes=45, partners=0
>>> Garbage result of merge? Right Ragged (3403,646)->(3248,2100) w=5 s=0, sort key=105121744, boxes=58, partners=0
>>> Garbage result of merge? Left Ragged (3510,565)->(3401,1596) w=5 s=0, sort key=108127380, boxes=33, partners=0
>>> Garbage result of merge? Right Ragged (3632,565)->(3469,2096) w=6 s=0, sort key=111800270, boxes=60, partners=0
>>>
>>  -- 
>> You received this message because you are subscribed to the Google
>> Groups "tesseract-ocr" group.
>> To post to this group, send email to [email protected]
>> To unsubscribe from this group, send email to
>> [email protected]
>> For more options, visit this group at
>> http://groups.google.com/group/tesseract-ocr?hl=en
>>
>
>
