In my opinion, a fixed layout/template like this gives you more control over how you perform the OCR. Rather than giving Tesseract the entire spreadsheet, why not write a preprocessing stage that extracts the text you want cleanly into a new image, given that you know the (X, Y, WIDTH, HEIGHT) rectangle of every field in such an input image?
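To make that concrete, here is a rough sketch of such a preprocessing stage in Python (you mentioned Java; the same idea carries over). It assumes Pillow and pytesseract are installed. The `Field` type, the function names, the 2x upscaling and the choice of `--psm 7` (treat the image as a single text line) are my own assumptions for illustration, not something I have tested against your schedule, and you would have to measure the field coordinates on your template yourself.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Field(NamedTuple):
    """One known rectangle on the page template, in pixels (hypothetical type)."""
    name: str
    x: int
    y: int
    width: int
    height: int


def to_box(field: Field) -> Tuple[int, int, int, int]:
    """Convert (x, y, width, height) to Pillow's (left, upper, right, lower)."""
    return (field.x, field.y, field.x + field.width, field.y + field.height)


def crop_fields(page, fields: Iterable[Field], scale: int = 2) -> Dict[str, object]:
    """Cut each field out of `page` (a PIL image) as its own grayscale image."""
    crops = {}
    for field in fields:
        region = page.crop(to_box(field)).convert("L")
        if scale > 1:
            # Upscaling small crops is a guess at what helps; tune or drop it.
            region = region.resize((region.width * scale, region.height * scale))
        crops[field.name] = region
    return crops


def ocr_fields(image_path: str, fields: Iterable[Field]) -> Dict[str, str]:
    """OCR every field separately, treating each crop as a single text line."""
    # Imported here so the geometry helpers above work without these packages.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as page:
        crops = crop_fields(page, fields)
        return {
            name: pytesseract.image_to_string(crop, config="--psm 7").strip()
            for name, crop in crops.items()
        }
```

You would call it with something like `ocr_fields("schedule.tiff", [Field("start_time", 400, 120, 90, 24)])`, where both the file name and the numbers are placeholders. Cropping inside the cell borders should also keep the table rules out of the image, which is likely where some of the stray `|` and `_` characters come from.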

On 11 July 2016 at 22:00, Raphael Budd <[email protected]> wrote:
> Hey everyone,
>
> I've got this pdf document which is a schedule. I'm trying to extract the
> text from it via tesseract but I'm not having that good results.
>
> I've tried a lot of different things, in my inexperienced opinion the
> image seems very high quality as I can zoom in a lot without seeing pixels.
> I've also tried to convert the pdf->tiff and add grayscale filter (all via
> java).
>
> I've attached both the end result and the original pdf here along with a
> sample of the output, any help making the output better would be
> appreciated.
>
> The tiff file is too big for the attachement; see this link:
> http://wltd.org/Daily%20schedule-14.tiff
>
> ---Begin text---
> 008 KIERA MCG 3:00 PM 11:00 PM TRWN 8.00 —
> 718 KYLE s 11:00 PM 7:00 AM MT 8.00 < —
> 686 JOSEPH e 11:00 PM 5:00 AM MT 6.00 —
>
> 718 KYLE s 11:00 PM 7:00 AM MT 8.00 —
>
> 656 CHANDLER A 1:00 PM 4:00 PM MB 3.00 —
> 720 TYLER D 11:00 PM 7:00 AM T|_ F 8.00 < —
> 720 TYLER D 11:00 PM 7:00 AM T|_ F 8.00 —
>
> 052 SH ELLY L 5:30 AM 2:00 PM FLRIFFIMGR F 8.50 _:I
> Riley M 372 8:00 AM 4:00 PM FLR F 8.00 —
> ‘ Raphael B602 4:00 PM 12:00 AM FLRIMGR F 8.00 ‘ —:| I
> ‘ Kevin G 652 11:00 AM 7:00 PM g$Y$IWNIMNY$I F 8.00 ‘ I:-:| I
> Joseph C 191 8:00 AM 4:00 PM ADMIBKIMB F 8.00 -:—
> 2014 ROXANA T 11:00 AM 7:00 PM ADM F 8.00 _
>
> --END TEXT---
>
> As you can see tesseract becomes quite creative with its attempt at
> parsing this, earlier in the document it even parsed the letter "N" as
> "|\|", creative but useless for parsing!
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/f77f8dd8-f6d2-4f6b-b5fe-5510fac4f878%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

