Great stuff. My parting advice: don't assume it will always be 100% perfect. I hope it will be, but you could get an unusual name that puts two letters just close enough together for Tesseract to misread them. I would do further testing against lots of test images. How much testing depends on what you have promised downstream and what else relies on this system: is it just a helper tool, or does it need to be 100% accurate 100% of the time? :) Glad you got somewhere with it.
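One rough way to put a number on that testing is to compare the OCR text for each test image against a hand-typed expected string and compute a character-level accuracy. The sketch below is only an illustration: the class and method names are made up, the metric (1 minus edit distance over expected length) is one choice among several, and the expected string "SHELLY" is my guess at what the "SH ELLY" row in the sample output should read.

```java
public final class OcrAccuracy {
    private OcrAccuracy() {}

    /** Levenshtein edit distance between two strings (two-row dynamic programming). */
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Character accuracy in [0, 1]: 1 - distance / expected length, floored at 0. */
    public static double charAccuracy(String expected, String actual) {
        if (expected.isEmpty()) {
            return actual.isEmpty() ? 1.0 : 0.0;
        }
        double accuracy = 1.0 - (double) editDistance(expected, actual) / expected.length();
        return Math.max(0.0, accuracy);
    }

    public static void main(String[] args) {
        // Assumed ground truth vs. the text Tesseract produced in the sample output.
        String expected = "SHELLY";
        String actual = "SH ELLY";
        System.out.printf("distance=%d accuracy=%.3f%n",
                editDistance(expected, actual), charAccuracy(expected, actual));
    }
}
```

Averaging `charAccuracy` over a folder of test images would show whether a rare failure like the one above is creeping in, without needing a 100% pass/fail check on every row.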
Cheers

On 15 July 2016 at 16:49, Raphael Budd <[email protected]> wrote:

> Okay so after all of that I did something with maven - not really sure
> what but now it works. I have 100% accuracy and everything is amazing.
> Thanks for all the help! Just as an aside for anyone reading this also
> trying to do this: cutting out the individual rows and removing the borders
> makes Tesseract a lot happier than just throwing the entire document at it.
> I went from maybe around 60% accuracy to 100% with some preprocessing. I
> also had to scale the image up a lot, but it works great now.
>
> On Tuesday, July 12, 2016 at 2:10:27 AM UTC-4, Raphael Budd wrote:
>
>> Hey everyone,
>>
>> I've got this pdf document which is a schedule. I'm trying to extract the
>> text from it via tesseract but I'm not getting very good results.
>>
>> I've tried a lot of different things. In my inexperienced opinion the
>> image seems very high quality, as I can zoom in a lot without seeing pixels.
>> I've also tried to convert the pdf->tiff and add a grayscale filter (all via
>> java).
>>
>> I've attached both the end result and the original pdf here along with a
>> sample of the output; any help making the output better would be
>> appreciated.
>>
>> The tiff file is too big for the attachment; see this link:
>> http://wltd.org/Daily%20schedule-14.tiff
>>
>> ---Begin text---
>> 008 KIERA MCG 3:00 PM 11:00 PM TRWN 8.00 —
>> 718 KYLE s 11:00 PM 7:00 AM MT 8.00 < —
>> 686 JOSEPH e 11:00 PM 5:00 AM MT 6.00 — >
>> 718 KYLE s 11:00 PM 7:00 AM MT 8.00 — >
>> 656 CHANDLER A 1:00 PM 4:00 PM MB 3.00 —
>> 720 TYLER D 11:00 PM 7:00 AM T|_ F 8.00 < —
>> 720 TYLER D 11:00 PM 7:00 AM T|_ F 8.00 — >
>> 052 SH ELLY L 5:30 AM 2:00 PM FLRIFFIMGR F 8.50 _:I
>> Riley M 372 8:00 AM 4:00 PM FLR F 8.00 —
>> ‘ Raphael B602 4:00 PM 12:00 AM FLRIMGR F 8.00 ‘ —:| I
>> ‘ Kevin G 652 11:00 AM 7:00 PM g$Y$IWNIMNY$I F 8.00 ‘ I:-:| I
>> Joseph C 191 8:00 AM 4:00 PM ADMIBKIMB F 8.00 -:—
>> 2014 ROXANA T 11:00 AM 7:00 PM ADM F 8.00 _
>>
>> --END TEXT---
>>
>> As you can see, tesseract becomes quite creative with its attempt at
>> parsing this. Earlier in the document it even parsed the letter "N" as
>> "|\|", creative but useless for parsing!
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/f7688cd4-63a7-4ade-b150-0133c49364d7%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/f7688cd4-63a7-4ade-b150-0133c49364d7%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
> For more options, visit https://groups.google.com/d/optout.
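For anyone landing on this thread later, the preprocessing described in the quoted message (cut out individual rows, drop the borders, scale up, grayscale) could look roughly like the sketch below using only the JDK's `java.awt` classes. This is not the original poster's code, which was never shared. The class name, the fixed `border` trim, the bicubic interpolation, and the idea that row coordinates are already known are all assumptions made for illustration.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public final class RowPreprocessor {
    private RowPreprocessor() {}

    /**
     * Crops one table row out of the page, trims `border` pixels from every edge
     * to drop the ruling lines, then scales the crop up by `scale` and converts
     * it to 8-bit grayscale. Row coordinates are assumed to be known in advance.
     */
    public static BufferedImage prepareRow(BufferedImage page, int x, int y,
                                           int width, int height, int border, int scale) {
        // getSubimage shares pixel data with the page, so no copy is made here.
        BufferedImage row = page.getSubimage(
                x + border, y + border, width - 2 * border, height - 2 * border);

        int scaledWidth = row.getWidth() * scale;
        int scaledHeight = row.getHeight() * scale;

        // Drawing into a TYPE_BYTE_GRAY image performs the grayscale conversion.
        BufferedImage out = new BufferedImage(scaledWidth, scaledHeight, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = out.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(row, 0, 0, scaledWidth, scaledHeight, null);
        } finally {
            g.dispose();
        }
        return out;
    }
}
```

Each returned image would then be handed to whichever Tesseract binding is in use, one row at a time. The right `border` and `scale` values depend on the scan and would need to be found by experiment.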

