Tesseract makes an estimation of the size of text in an image. Text which
is significantly larger or smaller than that estimate gets ignored. See if
you can isolate the larger text in an image by itself, you may find more
success.



On Wed, Jul 30, 2014 at 3:24 AM, <[email protected]> wrote:

> Hi All,
>
> I have been playing around with Tesseract as I need to write some software
> that uses OCR (preferably open source). So far I have found that Tesseract
> reads part of my document quite accurately, but ignores some of it
> completely. Unfortunately the part it ignores is the part I need to read.
> The following is the image I am trying to read (I have blanked out certain
> parts to protect identities):
>
>
> <https://lh3.googleusercontent.com/-471DmisZhzY/U9iZNx0vR8I/AAAAAAAAAAM/NHaxpDnPWFQ/s1600/summary.bmp>
>
> What I need to read is the contract and schedule number at the top of the
> page just right of centre. These numbers are amongst the biggest font on
> the page, so Tesseract should have no difficulty reading it given it's
> accuracy on the smaller fonts. However these numbers seem to get ignored. I
> am assuming it has something to do with the page layout, or more
> specifically, assumptions that Tesseract makes about page layout. So my
> question is, can I force Tesseract not to make these assumptions about page
> layout? I am not concerned if the text comes out in a strange order as I
> can procedurally filter out the noise, as long as it reads everything on
> the page. For reference, this is the output I get when I use Tesseract on
> this document (note that I have inserted asterisks again to protect
> identities):
>
>  -—  ~ Contract Schedule Type Page Of Total Tonnes
> On Site 24/6/14 Load 24/6/14 Produce 24/6/14
> Customer ******************* SERVICES %EW01(HT ************* 24/6/14
> Project RYDE Phone (02) ******** ‘ Fax (02) *********
> Description    V-R/2-9 Contact   427 210
> At RYDE Site Phone Fax
> **************** Cust Ref TBA
> RYDE NSW 2112 Map Ref Fixing CUSTOMER
> Drawing Numbers Rev Supply Site 206
> NO DRAWING AVAILABLE 1.0 Narrative URGENT DELIVERY-RESUPPLY
> Colour Bar Mark Product No.Off Length Shapes (Dimensions Overall)  A
> Remarks / Item Wt.
> BEIGE SEC.50-2 N12 48 2990 0.127
> Min Pin 60 mm
> I
> SUMMARY N12 0.127 TONNES  BENT 0.127
> SERVICES FIXBAR 1 EACH OTHER 0.000
>
>   5 I < ‘“‘
> 0..//‘
> Printed: 24/06/14 at10:32 KARIMCONS-LSUM-0456 0.127
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/d596be43-4fa3-4cba-ad6c-9c31fe1f068d%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/d596be43-4fa3-4cba-ad6c-9c31fe1f068d%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAL7mBq5a_AOOiAp6WuPSP2fAgEXRu4sDz2Q%3DNX9jNryJkW7r-g%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to