Re: Quality of OCR

Dmitri Silaev Thu, 08 Sep 2011 04:05:53 -0700

Your results correspond to what I wrote earlier. However, I don't
know the reason for the error in your third experiment. It was probably
caused by a browser glitch or something similar. In any case, you should
be able to send the image for processing right after opening the demo's
webpage. Please try again.
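
As noted earlier in this thread, speckle noise next to the text produces
extra characters, and a large empty area around the text also lowers
accuracy. Below is a minimal Python sketch of both cleanup steps, assuming
the page is already binarized and loaded as a list of rows of 0/1 values
(1 = ink). The names remove_specks and crop_to_ink and the parameters
max_size and margin are made up for illustration; this is not what the
demo or Tesseract does internally.

```python
from collections import deque


def remove_specks(img, max_size=2):
    """Return a copy of a binary image (1 = ink, 0 = background) in which
    every 8-connected ink component of at most max_size pixels is erased."""
    h = len(img)
    w = len(img[0]) if h else 0
    out = [row[:] for row in img]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill one connected component starting at (y, x).
            comp = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                comp.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Small components are treated as noise and cleared.
            if len(comp) <= max_size:
                for cy, cx in comp:
                    out[cy][cx] = 0
    return out


def crop_to_ink(img, margin=1):
    """Crop a binary image to the bounding box of its ink pixels,
    keeping up to `margin` background pixels on each side."""
    if not img or not img[0]:
        return [row[:] for row in img]
    rows = [y for y, row in enumerate(img) if any(row)]
    if not rows:
        # Blank page: nothing to crop to.
        return [row[:] for row in img]
    cols = [x for x in range(len(img[0])) if any(row[x] for row in img)]
    top = max(rows[0] - margin, 0)
    bottom = min(rows[-1] + margin, len(img) - 1)
    left = max(cols[0] - margin, 0)
    right = min(cols[-1] + margin, len(img[0]) - 1)
    return [row[left:right + 1] for row in img[top:bottom + 1]]
```

Keep max_size small: if it is set too high, the dots on "i" and full stops
are erased together with the speckles.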


As for Kannada, I think the latest traineddata file is sufficient
for the demo at this point. Please send it to me.
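
For a local check of a traineddata file, here is a hedged Python sketch
that calls the command-line tool with a language other than English. It
assumes the tesseract executable is on PATH and that a file named
kan.traineddata is in the tessdata directory. The language code "kan" and
the helper names are assumptions; use whatever name your traineddata file
actually has.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract IMAGE OUTPUT_BASE -l LANG"""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run Tesseract and return the path of the text file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, lang),
                   check=True)
    # Tesseract appends ".txt" to the output base name.
    return output_base + ".txt"


if __name__ == "__main__":
    # "kan" is an assumed language code; it must match <lang>.traineddata.
    print(" ".join(build_tesseract_command("kannada.tif", "kannada_out", "kan")))
```

If the -l option is left out, Tesseract falls back to English, which is
why a Kannada image comes back as Latin characters.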

Warm regards,
Dmitri Silaev
www.CustomOCR.com





On Thu, Sep 8, 2011 at 12:01 PM, Sriranga(78yrsold)
<[email protected]> wrote:
>
>
> On Thu, Sep 8, 2011 at 1:24 PM, Sriranga(78yrsold) <[email protected]>
> wrote:
>>
>> Hi Dmitri,
>> Thanks for the encouragement to pursue OCR. I am extremely grateful to
>> you for all the valuable guidance you have given me from time to time; I
>> cannot forget your generous help.
>>
>> As suggested, I downloaded OCR.tim again from Tim's link, saved it, and
>> uploaded it to your cloud demo. The results are reproduced below:
>> 1) ocrbook-1.tif (unedited original):
>> s On the Insert tab, the galleries include items that are designed to
>> coordinate with the overall look of
>> your document. You can use these galleries to insert tables, headers,
>> footers, lists, cover pages, and
>> other document building blocks. When you create pictures, charts, or
>> diagrams, they also coordinate
>> = g with your current document look.
>> I ran this twice and the output was the same.
>> 2) ocrbook-2.tif (edited in Paintbrush: I removed the speckles using the
>> magnifier in Paintbrush itself, then tested it in your demo). The output
>> was correct and is reproduced below:
>>
>> On the Insert tab, the galleries include items that are designed to
>> coordinate with the overall look of
>> your document. You can use these galleries to insert tables, headers,
>> footers, lists, cover pages, and
>> other document building blocks. When you create pictures, charts, or
>> diagrams, they also coordinate
>> with your current document look.
>>
>> 3) testing.tif (the renamed untitled.tif that I forwarded to you
>> earlier):
>> When I uploaded it to your demo, this error was displayed:
>> ERROR: "Illegal parameter values: 'name' cannot be blank."
>> Where did I make a mistake?
>>
>> For your experiment, is the traineddata file sufficient, or do you need
>> all the generated data files (unicharset, etc.) for your testing?
>>
>> I tested with the kannada.tif file and the output was in English. I think
>> this shows that your demo could support any language, depending on which
>> <lang>.traineddata files are installed in the cloud.
>> With Warmest Regards,
>> -sriranga(78yrs)
>>
>>
>>
>> On Thu, Sep 8, 2011 at 2:56 AM, Dmitri Silaev <[email protected]>
>> wrote:
>>>
>>> Hi Sriranga!
>>>
>>> Glad you are OK now. I must express my respect and admiration for your
>>> efforts in the OCR field despite all these troubles with your
>>> health.
>>>
>>> You are right, the result for *your image* with the CustomOCR Tesseract
>>> demo is exactly what you attached. But *your image* is not the same
>>> as the image *Tim sent*: Tim's is much smaller, with only as much
>>> background as needed around the text, while yours has a huge area of
>>> whitespace below and to the right of the text. It is a shame for
>>> Tesseract, but this degrades recognition accuracy considerably.
>>>
>>> Here is how to obtain Tim's image the right way. Click Tim's link,
>>> then choose File\Download Original in the menu and save the file
>>> to your local hard drive. After that, select that file in the Image
>>> file field of the CustomOCR Standard Tesseract OCR demo and run the
>>> processing.
>>>
>>> Once you test the demo with Tim's image you will get a perfect,
>>> crisp and clear result; check this yourself.
>>>
>>> One last thing. I have absolutely no objection to making Kannada
>>> recognition available as a CustomOCR demo. As I see it now, this should
>>> be a separate demo. I'll be glad to make it for the community and am
>>> waiting for you to kindly send me your latest traineddata components as
>>> well as the compiled traineddata file.
>>>
>>> Warm regards,
>>> Dmitri Silaev
>>> www.CustomOCR.com
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Sep 7, 2011 at 6:58 AM, Sriranga(78yrsold)
>>> <[email protected]> wrote:
>>> > Hi Dmitri,
>>> > I got laser treatment for my blurred vision and am OK now. I tested
>>> > your demo; the output is attached below:
>>> > On ﬁle Insert tab, the gallzries xnclude items that are dcslgled to
>>> > onnrdinab: with the ova:-all look of
>>> > your dncumem. You can use than galleries w insert mum, heudms, footers,
>>> > um,
>>> > cover pig»-5, and
>>> > other ducumcnt budding blucls. wh=- you mm piclures, mm, at diagrams,
>>> > they
>>> > also ccordinlle
>>> > with yuur wmm document lnnk.
>>> > I am using r-527 on WinXP.
>>> > The command line I used is as follows:
>>> > M:\>tesseract untitled.TIF testtif
>>> > Tesseract Open Source OCR Engine with Leptonica
>>> > Number of found pages: 1.
>>> > M:\>
>>> > M:\>tesseract untitled.TIF 2testtif -l eng
>>> > Tesseract Open Source OCR Engine with Leptonica
>>> > Number of found pages: 1.
>>> >
>>> > M:\>
>>> > Submitted for your perusal. I find no difference between the demo and
>>> > the command-line output. Where did I make a mistake?
>>> > Could you kindly tell me whether your demo can be tested with Kannada?
>>> > With regards,
>>> > -sriranga(78yrs)
>>> >
>>> >
>>> >
>>> > On Fri, Sep 2, 2011 at 8:19 AM, Sriranga(78yrsold)
>>> > <[email protected]>
>>> > wrote:
>>> >>
>>> >> Hi Dmitri,
>>> >> I am still using r-527 on WinXP. I am suffering from blurred vision.
>>> >> With warm regards,
>>> >> -sriranga(78)
>>> >>
>>> >> On Thu, Sep 1, 2011 at 8:22 PM, Dmitri Silaev <[email protected]>
>>> >> wrote:
>>> >>>
>>> >>> I don't know your Tesseract version, but here you can see that
>>> >>> with rev. 580 the result is perfect:
>>> >>>
>>> >>> http://www.customocr.com/index.php?r=site/page&view=demos.tesseract_ocr
>>> >>>
>>> >>> The extra characters in the first and last lines are due to some
>>> >>> speckle noise to the left of these lines.
>>> >>>
>>> >>> Warm regards,
>>> >>> Dmitri Silaev
>>> >>> www.CustomOCR.com
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Thu, Sep 1, 2011 at 2:36 PM, Tim Alexander
>>> >>> <[email protected]>
>>> >>> wrote:
>>> >>> > Apologies.  Have google docced a portion of the tif file I ran
>>> >>> > tesseract on:
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B-BfHrAa9J5kZDEzNWRmODItZGFiZi00Y2NkLWI2N2MtZjA5MDg1OTEzYjky&hl=en_US
>>> >>> >
>>> >>> > Regards
>>> >>> >
>>> >>> > Tim
>>> >>> >
>>> >>> > On Aug 31, 8:08 pm, Dmitri Silaev <[email protected]> wrote:
>>> >>> >> No chance to answer your questions without a sample image. Please
>>> >>> >> provide.
>>> >>> >>
>>> >>> >> Warm regards,
>>> >>> >> Dmitri Silaev
>>> >>> >> www.CustomOCR.com
>>> >>> >>
>>> >>> >> On Wed, Aug 31, 2011 at 3:43 PM, Tim Alexander
>>> >>> >> <[email protected]> wrote:
>>> >>> >> > Seem to have tesseract set up and scripted OK, running on
>>> >>> >> > Ubuntu 11.04. However, I am finding my accuracy for OCR to be
>>> >>> >> > fairly low. At first I thought it was the scanned documents I
>>> >>> >> > was using, but I recently ran my script against a printed and
>>> >>> >> > scanned Word document using Times New Roman, with the output
>>> >>> >> > from MS Word's random paragraph function.
>>> >>> >>
>>> >>> >> > I was under the impression that the English training data that
>>> >>> >> > is downloadable from the site included Times New Roman as one
>>> >>> >> > of the pre-trained fonts? Either way, my results look like this:
>>> >>> >>
>>> >>> >> > "On the Insertt ab, the galleriesi nclude itemst hat are
>>> >>> >> > designedto
>>> >>> >> > coordinatew ith the overall look of
>>> >>> >> > yourd ocumenYt. ou canu set heseg alleriesto insertt ablesh,
>>> >>> >> > eadersfo,
>>> >>> >> > otersl,i sts,c overp agesa, nd
>>> >>> >> > other document building blocks. When you create pictures,
>>> >>> >> > charts, or
>>> >>> >> > diagrams, they also coordinate
>>> >>> >> > with your current document look."
>>> >>> >>
>>> >>> >> > As you can see there are several words where the delineation
>>> >>> >> > between two words is somewhat jumbled. Is this a case of having
>>> >>> >> > to train tesseract or is it more down to the scan quality or
>>> >>> >> > preprocessing (or lack of)?
>>> >>> >>
>>> >>> >> > Any help or input greatly appreciated.
>>> >>> >>
>>> >>> >> > Regards
>>> >>> >>
>>> >>> >> > Tim
>>> >>> >>
>>> >>> >> > --
>>> >>> >> > You received this message because you are subscribed to the
>>> >>> >> > Google
>>> >>> >> > Groups "tesseract-ocr" group.
>>> >>> >> > To post to this group, send email to
>>> >>> >> > [email protected]
>>> >>> >> > To unsubscribe from this group, send email to
>>> >>> >> > [email protected]
>>> >>> >> > For more options, visit this group at
>>> >>> >> > http://groups.google.com/group/tesseract-ocr?hl=en
>>> >>> >
>>> >>> >
>>> >>>
>>> >>
>>> >
>>> >
>>>
>>
>
>

