Thank you for the reply. I appreciate the help.
We are compiling on Windows 7 (64-bit machine), but the source is compiled
as 32-bit. We are using tesseract 3.02 compiled from scratch with VS2010. I
link to the built static libs from a wrapper. We are also using the
leptonica 1.68 static lib, which was not built from scratch.
Note: this file is in Chinese Simplified ("chi_sim"), not Traditional, which
has a different hanzi set.
We are evaluating tesseract-ocr as an alternative to some of the more
pricey APIs we currently use, like Accusoft (which I believe just wraps
OmniPage under the hood - which it seems Ray Smith is very familiar with)
and Aspose.OCR (which is in its infancy). We don't use the .exe, but
rather the API, specifically this sequence: pixRead, Init, SetPageSegMode,
"preprocessing", ProcessPage.
Given the above, I need an algorithm for attempting OCR on any image thrown
at it. I have posted here before about a 96 dpi image, and the suggestion
was to upscale it to 300 dpi. This is also suggested in the forum link you
provided. This Chinese Simplified image reports 96 dpi after pixRead. So my
silly algorithm is to take any image, read it into a pix, upscale it to
300 dpi if its resolution is lower, check the depth and convert to
grayscale if necessary, apply a TRC, and finally threshold to 1 bpp -
similar to what is done here:
http://tpgit.github.com/UnOfficialLeptDocs/leptonica/document-image-analysis.html
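To make the decision part of those steps concrete, here is a rough C++
sketch. PrepPlan and planPreprocessing are names I made up for
illustration, not leptonica or tesseract calls, and the 96 dpi fallback for
images that report no resolution is an assumption. The actual pixel work
(scaling, gray conversion, TRC, threshold) would still be done with
leptonica and is not shown.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper types; nothing here comes from leptonica or tesseract.
struct PrepPlan {
    double scale;   // factor for both axes (1.0 means leave the size alone)
    bool toGray;    // image is deeper than 8 bpp and needs a gray conversion
    bool toBinary;  // image is not yet 1 bpp and needs a final threshold
};

// Decide which preprocessing steps an image needs from its reported
// resolution and bit depth. Colormapped images are ignored in this sketch.
PrepPlan planPreprocessing(int reportedDpi, int depthBpp,
                           int targetDpi = 300, int fallbackDpi = 96) {
    assert(targetDpi > 0 && fallbackDpi > 0);

    // Assumption: a missing (zero or negative) resolution is treated as
    // the fallback value rather than rejected.
    const int dpi = reportedDpi > 0 ? reportedDpi : fallbackDpi;

    PrepPlan plan;
    // Only scale up; images at or above the target keep their size.
    plan.scale = std::max(1.0, static_cast<double>(targetDpi) / dpi);
    plan.toGray = depthBpp > 8;
    plan.toBinary = depthBpp > 1;
    return plan;
}
```

For the image in question (96 dpi, 32-bit depth), planPreprocessing(96, 32)
gives a scale of 3.125 with both conversions enabled.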
Perhaps you can help me come up with a better generic algorithm that
prepares any file for OCR. We cannot have customers setting specific
values for each image in order to OCR them separately; we need one generic
algorithm to prepare the image. Our Accusoft solution takes just about any
image (including vector images) and produces pretty good OCR results. Then
again, it's crazy expensive as well... so we are looking for an
alternative with leptonica/tesseract. With the forum's help of course :)
I admit that I am not very experienced with all this and would gladly
accept your help in finding a better algorithm. Perhaps I can decipher one
in the tesseract src, starting from the command argument "original" that
you passed?
On Tuesday, December 18, 2012 8:54:04 AM UTC-5, zdenop wrote:
>
> Which OS are you using, which version of tesseract, etc.?
> I tried
> tesseract original.jpg original -l chi_tra
> and
> tesseract preprocessed.tiff preprocessed -l chi_tra
> and I did not get any error message (on openSUSE linux 64bit 12.2 with
> tesseract 3.02.02)...
>
> Why did you upscale the image? It is not a universal solution - see some
> past experience[1].
> [1] https://groups.google.com/d/msg/tesseract-ocr/KVHsGxfDdy0/hh6r4AFUvRMJ
>
> Zdenko
>
>
>
> On Tue, Dec 18, 2012 at 2:55 AM, occorled <[email protected]> wrote:
>
>> I have an image here that initially had 32-bit depth, but I scaled it
>> larger, grayscaled it, applied a TRC, then thresholded it to binary to
>> produce the image I send to tesseract. However, it still cannot capture
>> anything. Is there something wrong with this image because the text is
>> so big?
>>
>> Tesseract prints out the following errors:
>>
>>> Too many unichars in ambiguity on line 102163292
>>> Too many unichars in ambiguity on line 102163292
>>> Too many unichars in ambiguity on line 102163292
>>> Garbage result of merge? Left Ragged (414,645)->(258,2108) w=6 s=0, sort key=14617152, boxes=53, partners=0
>>> Garbage result of merge? Right Ragged (425,619)->(291,1877) w=5 s=0, sort key=14872288, boxes=19, partners=0
>>> Garbage result of merge? Right Ragged (686,619)->(552,1881) w=6 s=0, sort key=22788162, boxes=41, partners=0
>>> Garbage result of merge? Right Ragged (1967,899)->(1840,2098) w=5 s=0, sort key=62484498, boxes=38, partners=0
>>> Garbage result of merge? Right Ragged (2166,687)->(2055,1729) w=6 s=0, sort key=67827472, boxes=30, partners=0
>>> Garbage result of merge? Left Ragged (3336,571)->(3173,2096) w=5 s=0, sort key=102864440, boxes=45, partners=0
>>> Garbage result of merge? Right Ragged (3403,646)->(3248,2100) w=5 s=0, sort key=105121744, boxes=58, partners=0
>>> Garbage result of merge? Left Ragged (3510,565)->(3401,1596) w=5 s=0, sort key=108127380, boxes=33, partners=0
>>> Garbage result of merge? Right Ragged (3632,565)->(3469,2096) w=6 s=0, sort key=111800270, boxes=60, partners=0
>>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "tesseract-ocr" group.
>> To post to this group, send email to [email protected]
>> To unsubscribe from this group, send email to
>> [email protected]
>> For more options, visit this group at
>> http://groups.google.com/group/tesseract-ocr?hl=en
>>
>
>