Re: Quality of OCR

M.N.S.Rao Mon, 12 Sep 2011 03:09:53 -0700

Can I have the link to the cloud OCR website to test Kannada OCR?
MNS Rao

----- Original Message -----
From: "Dmitri Silaev" <[email protected]>
To: <[email protected]>
Sent: Monday, September 12, 2011 2:54 PM
Subject: Re: Quality of OCR



Hi Sriranga,

Thanks for the files. I'll try to build a demo around them and place
it at our website as soon as possible. Can you please provide more
pairs of source images and result .txt files for me to do some tests?

Warm regards,
Dmitri Silaev
www.CustomOCR.com





On Fri, Sep 9, 2011 at 5:20 PM, Sriranga(78yrsold)
<[email protected]> wrote:

Dear Dmitry,

I trust everything is in order for uploading the Kannada data files for the
demo cloud OCR. Is any more information required by you?
I am ready to conduct beta testing at any time if you require it.
I hope that, if successful, it will be the first OCR of its kind in the
world. I am ready to generate data files for other Indic languages, viz.
Tamil, Telugu, and Hindi, to the extent possible, for your demo version.

One small request: is it possible to develop a small program for
post-processing image files to remove any noise (in other words, to
generate clean image files from the existing files)?
Awaiting the outcome of your cloud demo OCR project.
With Warmest Regards,
-sriranga(78yrs)
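On the request above for a small noise-removal program, here is a minimal sketch of one common approach. It assumes the scan has already been binarized into rows of 0/1 values (1 = ink, 0 = background). The name `remove_speckles` and the `min_size` threshold are illustrative only and are not part of Tesseract or the CustomOCR demo. It erases connected ink blobs smaller than the threshold, roughly what manual speckle clean-up in a paint program does.

```python
from collections import deque


def remove_speckles(image, min_size=3):
    """Return a copy of a binary image (1 = ink, 0 = background) in which
    connected ink blobs with fewer than min_size pixels are erased."""
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search collects one 8-connected blob of ink.
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Blobs below the size threshold are treated as noise.
            if len(blob) < min_size:
                for by, bx in blob:
                    cleaned[by][bx] = 0
    return cleaned


page = [
    [1, 0, 0, 0],  # isolated speckle in the corner
    [0, 0, 1, 1],  # a 4-pixel blob standing in for a glyph
    [0, 0, 1, 1],
]
print(remove_speckles(page, min_size=3))
```

The threshold needs care: set too high, it would also erase periods, dots on letters, and small vowel signs, so it should probably be tuned per script and scan resolution.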

2011/9/8 Sriranga(78yrsold) <[email protected]>


I was unable to post to the tesseract forum, so I am forwarding this to you directly.
-sriranga

---------- Forwarded message ----------
From: Sriranga(78yrsold) <[email protected]>
Date: 2011/9/8
Subject: Re: Quality of OCR
To: [email protected]


Dmitri,
Based on the attached tif file, I have just generated a dmi.traineddata
file. I have also attached the output produced in FreeOCR as well as from
CMD; it appears the two outputs are not identical. Tesseract version r-527
was used.
If any help is required, I am ready to perform beta testing and give you
feedback whenever you want.
I tried the third experiment again and it failed. I noticed that connecting
takes a longer time for it, whereas the other tif files work quickly.
With regards,
-sriranga
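Whether two OCR outputs (for example, one from FreeOCR and one from the command line) really differ, and where, can be checked mechanically rather than by eye. Below is a rough sketch using Python's standard `difflib`; the function names `similarity` and `differing_words` are made up for illustration, and the sample strings are short excerpts from the outputs quoted in this thread.

```python
import difflib


def similarity(text_a, text_b):
    """Return a 0..1 similarity ratio between two OCR outputs."""
    # Collapse whitespace runs so different line wrapping is not counted.
    a = " ".join(text_a.split())
    b = " ".join(text_b.split())
    return difflib.SequenceMatcher(None, a, b).ratio()


def differing_words(text_a, text_b):
    """List (words_from_a, words_from_b) pairs where the outputs disagree."""
    a, b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(None, a, b)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


expected = "with your current document look."
observed = "with yuur wmm document lnnk."
print(round(similarity(expected, observed), 2))
print(differing_words(expected, observed))
```

Run against a source image's known text, the same ratio gives a crude accuracy figure for each image and result .txt pair.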

On Thu, Sep 8, 2011 at 4:19 PM, Dmitri Silaev <[email protected]>
wrote:


Your results correspond to what I had written earlier. However, I don't
know the reasons for the error in your third experiment. Probably this
was because of a browser glitch or something. Anyway, you should be able
to send the image for processing just after you have opened the demo's
webpage. You can try again.

As for Kannada, I think, at this moment, the latest traineddata file
is sufficient for the demo. Kindly send it to me please.

Warm regards,
Dmitri Silaev
www.CustomOCR.com





On Thu, Sep 8, 2011 at 12:01 PM, Sriranga(78yrsold)
<[email protected]> wrote:
>
>
> On Thu, Sep 8, 2011 at 1:24 PM, Sriranga(78yrsold)
> <[email protected]>
> wrote:
>>
>> Hi Dmitri,
>> Thanks for the encouragement to pursue the OCR. I am extremely
>> grateful to you for all the valuable guidance you have given me from
>> time to time. I cannot forget your great and noble help.
>>
>> As suggested, I again downloaded OCR.tim from Tim's website, saved it,
>> and uploaded it to your cloud demo. The results are reproduced
>> below:
>> 1)for ocrbook-1.tif (unedited - original)=
>> s On the Insert tab, the galleries include items that are designed to
>> coordinate with the overall look of
>> your document. You can use these galleries to insert tables, headers,
>> footers, lists, cover pages, and
>> other document building blocks. When you create pictures, charts, or
>> diagrams, they also coordinate
>> = g with your current document look.
>> I ran this twice and the output was the same.
>> 2) ocrbook-2.tif (edited in Paintbrush: I removed the speckles with the
>> help of the magnifier in Paintbrush itself, then tested it in your
>> demo). The output was correct and is reproduced below =
>>
>> On the Insert tab, the galleries include items that are designed to
>> coordinate with the overall look of
>> your document. You can use these galleries to insert tables, headers,
>> footers, lists, cover pages, and
>> other document building blocks. When you create pictures, charts, or
>> diagrams, they also coordinate
>> with your current document look.
>>
>> 3) testing.tif (this is the renamed untitled.tif file forwarded to you
>> earlier)
>> When uploaded to your demo, an error was displayed:
>> ERROR: "Illegal parameter values: 'name' cannot be blank."
>> Where did I make a mistake?
>> For your experiments, is the traineddata file sufficient, or are all
>> the generated data files (unicharset, etc.) required for your testing?
>> I tested using the kannada.tif file and the output was in English. I
>> think this shows that your demo can support all languages, depending on
>> which <lang>.traineddata files are installed in the cloud.
>> With Warmest Regards,
>> -sriranga(78yrs)
>>
>>
>>
>> On Thu, Sep 8, 2011 at 2:56 AM, Dmitri Silaev <[email protected]>
>> wrote:
>>>
>>> Hi Sriranga!
>>>
>>> Glad you are now OK. I must express my respect and admiration on your
>>> efforts in the OCR field while having all these troubles with your
>>> health.
>>>
>>> You are right, the result for *your image* with the CustomOCR Tesseract
>>> demo is exactly like you've attached. But *your image* is not the same
>>> as the image *Tim had sent*: Tim's is much smaller, having only as much
>>> background as needed around the text, while yours has huge whitespace
>>> to the bottom and right of the text. Shame on Tesseract, but this
>>> degrades recognition accuracy a lot.
>>>
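The whitespace problem described above can also be worked around by cropping the image to its text before uploading. The following is only a sketch under stated assumptions: the image is already binarized into rows of 0/1 values (1 = ink), and `crop_to_text` and its `margin` parameter are hypothetical names, not something the demo or Tesseract provides.

```python
def crop_to_text(image, margin=1):
    """Crop a binary image (1 = ink, 0 = background) to the bounding box
    of its ink, keeping up to `margin` background pixels on each side."""
    # Rows and columns that contain at least one ink pixel.
    rows = [y for y, row in enumerate(image) if any(row)]
    if not rows:
        return [row[:] for row in image]  # blank page: nothing to crop
    cols = [x for x in range(len(image[0])) if any(row[x] for row in image)]

    # Expand the box by the margin, clamped to the image edges.
    top = max(rows[0] - margin, 0)
    bottom = min(rows[-1] + margin, len(image) - 1)
    left = max(cols[0] - margin, 0)
    right = min(cols[-1] + margin, len(image[0]) - 1)
    return [row[left:right + 1] for row in image[top:bottom + 1]]


page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],  # the only ink sits near the top left
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
for row in crop_to_text(page, margin=1):
    print(row)
```

Stray speckles far from the text would stretch the bounding box, so noise removal probably has to come before cropping.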
>>> A hint on how to obtain Tim's image the right way. Click Tim's link,
>>> then in the menu choose File\Download Original. Then save the file
>>> onto your local hard drive. After that, select that file in the Image
>>> file field of the CustomOCR Standard Tesseract OCR demo and then run
>>> processing.
>>>
>>> Once you've tested the demo with Tim's image you will get the
>>> perfect,
>>> crisp and clear result, check this yourself.
>>>
>>> And the last thing. Absolutely no objections to making Kannada
>>> recognition available in the form of a CustomOCR demo. As I see now,
>>> this should be a separate demo. I'll be glad to make this for the
>>> community, and I am waiting for you to kindly send me your latest
>>> traineddata components as well as the compiled traineddata file.
>>>
>>> Warm regards,
>>> Dmitri Silaev
>>> www.CustomOCR.com
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Sep 7, 2011 at 6:58 AM, Sriranga(78yrsold)
>>> <[email protected]> wrote:
>>> > Hi Dmitri,
>>> > I got laser treatment for my blurred vision and am now OK. I tested
>>> > in your demo; the output is attached below:
>>> > On ﬁle Insert tab, the gallzries xnclude items that are dcslgled
>>> > to
>>> > onnrdinab: with the ova:-all look of
>>> > your dncumem. You can use than galleries w insert mum, heudms,
>>> > footers,
>>> > um,
>>> > cover pig»-5, and
>>> > other ducumcnt budding blucls. wh=- you mm piclures, mm, at
>>> > diagrams,
>>> > they
>>> > also ccordinlle
>>> > with yuur wmm document lnnk.
>>> > I am using r-527 on WinXP.
>>> > The command lines used were as follows:
>>> > M:\>tesseract untitled.TIF testtif
>>> > Tesseract Open Source OCR Engine with Leptonica
>>> > Number of found pages: 1.
>>> > M:\>
>>> > M:\>tesseract untitled.TIF 2testtif -l eng
>>> > Tesseract Open Source OCR Engine with Leptonica
>>> > Number of found pages: 1.
>>> >
>>> > M:\>
>>> > Submitted for your perusal. I find no difference between the demo and
>>> > cmd output. Where did I make a mistake?
>>> > Could you kindly tell me whether your demo can be tested for Kannada?
>>> > With regards,
>>> > -sriranga(78yrs)
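The two command-line runs quoted above follow the pattern `tesseract <image> <outputbase> [-l <lang>]`, with the recognized text written to `<outputbase>.txt`. A small wrapper like the one below could script such runs. It is a sketch only: the function names are invented here, and `run_tesseract` assumes a `tesseract` executable is on the PATH and that the output file is UTF-8 text.

```python
import subprocess


def tesseract_command(image_path, output_base, lang=None):
    """Build the argument list for one Tesseract run."""
    cmd = ["tesseract", image_path, output_base]
    if lang:
        # -l selects the language, i.e. which <lang>.traineddata is used.
        cmd += ["-l", lang]
    return cmd


def run_tesseract(image_path, output_base, lang=None):
    """Run Tesseract and return the text it wrote to <output_base>.txt.

    Assumption: `tesseract` is installed and reachable on PATH.
    """
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


# Mirrors the second run quoted above, without actually executing it.
print(tesseract_command("untitled.TIF", "2testtif", lang="eng"))
```

Testing another language would then only mean passing a different `lang` value, provided the matching traineddata file is installed.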
>>> >
>>> >
>>> >
>>> > On Fri, Sep 2, 2011 at 8:19 AM, Sriranga(78yrsold)
>>> > <[email protected]>
>>> > wrote:
>>> >>
>>> >> Hi Dmitri,
>>> >> I am still using r-527 and WinXP. I am suffering from blurred
>>> >> vision.
>>> >> With warm regards,
>>> >> -sriranga(78)
>>> >>
>>> >> On Thu, Sep 1, 2011 at 8:22 PM, Dmitri Silaev
>>> >> <[email protected]>
>>> >> wrote:
>>> >>>
>>> >>> I don't know your Tesseract's version but here you can witness that
>>> >>> with rev. 580 the result is perfect:
>>> >>>
>>> >>> http://www.customocr.com/index.php?r=site/page&view=demos.tesseract_ocr
>>> >>>
>>> >>> The extra chars in the first and last lines are due to some speckle
>>> >>> noise to the left of these lines.
>>> >>>
>>> >>> Warm regards,
>>> >>> Dmitri Silaev
>>> >>> www.CustomOCR.com
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Thu, Sep 1, 2011 at 2:36 PM, Tim Alexander
>>> >>> <[email protected]>
>>> >>> wrote:
>>> >>> > Apologies. Have google docced a portion of the tif file I ran
>>> >>> > tesseract on:
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B-BfHrAa9J5kZDEzNWRmODItZGFiZi00Y2NkLWI2N2MtZjA5MDg1OTEzYjky&hl=en_US
>>> >>> >
>>> >>> > Regards
>>> >>> >
>>> >>> > Tim
>>> >>> >
>>> >>> > On Aug 31, 8:08 pm, Dmitri Silaev <[email protected]>
>>> >>> > wrote:
>>> >>> >> No chance to answer your questions without a sample image. Please
>>> >>> >> provide.
>>> >>> >>
>>> >>> >> Warm regards,
>>> >>> >> Dmitri Silaev
>>> >>> >> www.CustomOCR.com
>>> >>> >>
>>> >>> >> On Wed, Aug 31, 2011 at 3:43 PM, Tim Alexander
>>> >>> >> <[email protected]> wrote:
>>> >>> >> > Seem to have tesseract set up and scripted OK, running on Ubuntu
>>> >>> >> > 11.04. However, I am finding my accuracy for OCR to be fairly low.
>>> >>> >> > At first I thought it was the scanned documents I was using, but I
>>> >>> >> > recently ran my script against a printed and scanned Word document
>>> >>> >> > using Times New Roman with the output from MS Word's random
>>> >>> >> > paragraph function.
>>> >>> >>
>>> >>> >> > I was under the impression that the English training data that is
>>> >>> >> > downloadable from the site included Times New Roman as one of the
>>> >>> >> > pre-trained fonts? Either way my results look like this:
>>> >>> >>
>>> >>> >> > "On the Insertt ab, the galleriesi nclude itemst hat are
>>> >>> >> > designedto
>>> >>> >> > coordinatew ith the overall look of
>>> >>> >> > yourd ocumenYt. ou canu set heseg alleriesto insertt
>>> >>> >> > ablesh,
>>> >>> >> > eadersfo,
>>> >>> >> > otersl,i sts,c overp agesa, nd
>>> >>> >> > other document building blocks. When you create pictures,
>>> >>> >> > charts, or
>>> >>> >> > diagrams, they also coordinate
>>> >>> >> > with your current document look."
>>> >>> >>
>>> >>> >> > As you can see there are several words where the delineation
>>> >>> >> > between two words is somewhat jumbled. Is this a case of having to
>>> >>> >> > train tesseract or is it more down to the scan quality or
>>> >>> >> > preprocessing (or lack of)?
>>> >>> >>
>>> >>> >> > Any help or input greatly appreciated.
>>> >>> >>
>>> >>> >> > Regards
>>> >>> >>
>>> >>> >> > Tim
>>> >>> >>
>>> >>> >> > --
>>> >>> >> > You received this message because you are subscribed to the
>>> >>> >> > Google
>>> >>> >> > Groups "tesseract-ocr" group.
>>> >>> >> > To post to this group, send email to
>>> >>> >> > [email protected]
>>> >>> >> > To unsubscribe from this group, send email to
>>> >>> >> > [email protected]
>>> >>> >> > For more options, visit this group at
>>> >>> >> >http://groups.google.com/group/tesseract-ocr?hl=en