I don't know which version of Tesseract you are using, but you can see
here that with rev. 580 the result is perfect:
http://www.customocr.com/index.php?r=site/page&view=demos.tesseract_ocr
The extra chars in the first and last lines are due to some speckle
noise to the left of these lines.
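
For illustration only, here is a minimal sketch of how isolated speckles
could be suppressed before running OCR. It is not what the demo page does.
It assumes the page is an 8-bit grayscale image held as a list of rows
(0 = black, 255 = white); the function name `despeckle` and its `size`
parameter are hypothetical, and the median filter is just one common choice
for this kind of noise.

```python
def despeckle(image, size=3):
    """Return a median-filtered copy of `image` (hypothetical helper).

    `image` is a list of rows of grayscale values (0 = black, 255 = white).
    Pixels near the border use only the neighbours that lie inside the image.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    radius = size // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood around (x, y), clipped to the image.
            window = sorted(
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            )
            # Take the middle value (upper median for even-sized windows),
            # so a lone dark pixel on a white background is replaced by white.
            row.append(window[len(window) // 2])
        result.append(row)
    return result
```

A single dark pixel surrounded by white disappears, while the interior of
strokes several pixels wide is kept. Thin strokes and the corners of glyphs
can be eroded, so the filter size needs checking against the actual scan. In
practice an imaging library would do this faster, for example Pillow's
`ImageFilter.MedianFilter`.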

Warm regards,
Dmitri Silaev
www.CustomOCR.com

On Thu, Sep 1, 2011 at 2:36 PM, Tim Alexander <[email protected]> wrote:
> Apologies. I have uploaded a portion of the TIFF file I ran Tesseract
> on to Google Docs:
>
> https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B-BfHrAa9J5kZDEzNWRmODItZGFiZi00Y2NkLWI2N2MtZjA5MDg1OTEzYjky&hl=en_US
>
> Regards
>
> Tim
>
> On Aug 31, 8:08 pm, Dmitri Silaev <[email protected]> wrote:
>> Your questions can't be answered without a sample image. Please provide one.
>>
>> Warm regards,
>> Dmitri Silaev
>> www.CustomOCR.com
>>
>> On Wed, Aug 31, 2011 at 3:43 PM, Tim Alexander <[email protected]> wrote:
>> > I seem to have Tesseract set up and scripted OK on Ubuntu 11.04.
>> > However, I am finding my OCR accuracy to be fairly low. At first I
>> > thought it was the scanned documents I was using, but I recently ran my
>> > script against a printed and scanned Word document set in Times New
>> > Roman, with text from MS Word's random paragraph function.
>>
>> > I was under the impression that the English training data that is
>> > downloadable from the site included Times New Roman as one of the
>> > pre-trained fonts? Either way, my results look like this:
>>
>> > "On the Insertt ab, the galleriesi nclude itemst hat are designedto
>> > coordinatew ith the overall look of
>> > yourd ocumenYt. ou canu set heseg alleriesto insertt ablesh, eadersfo,
>> > otersl,i sts,c overp agesa, nd
>> > other document building blocks. When you create pictures, charts, or
>> > diagrams, they also coordinate
>> > with your current document look."
>>
>> > As you can see, in several places the spaces between words are
>> > misplaced or missing. Is this a case of having to train Tesseract,
>> > or is it more down to the scan quality or preprocessing (or lack
>> > of it)?
>>
>> > Any help or input greatly appreciated.
>>
>> > Regards
>>
>> > Tim
>>
>> > --
>> > You received this message because you are subscribed to the Google
>> > Groups "tesseract-ocr" group.
>> > To post to this group, send email to [email protected]
>> > To unsubscribe from this group, send email to
>> > [email protected]
>> > For more options, visit this group at
>> > http://groups.google.com/group/tesseract-ocr?hl=en
>
