I could. The only issue is that the box grows with the number of people 
scheduled, which would change all the x,y coordinates.
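
One way around shifting coordinates might be to locate the table's horizontal rules on each page instead of hard-coding them. The sketch below is only an illustration under assumptions not verified against the actual schedule: that the table is drawn with near-full-width dark rules on a light background, and that a pixel row counts as a rule when at least `minFraction` of its pixels are darker than `darkBelow`. `RuleFinder`, `horizontalRuleRows`, and both parameters are made-up names.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

final class RuleFinder {

    /**
     * Returns the y positions of pixel rows that look like horizontal rules:
     * rows in which at least minFraction of the pixels are darker than darkBelow.
     * Assumption: dark rules on a light background.
     */
    static List<Integer> horizontalRuleRows(BufferedImage img, int darkBelow, double minFraction) {
        List<Integer> rows = new ArrayList<>();
        for (int y = 0; y < img.getHeight(); y++) {
            int dark = 0;
            for (int x = 0; x < img.getWidth(); x++) {
                if (brightness(img.getRGB(x, y)) < darkBelow) {
                    dark++;
                }
            }
            if (dark >= minFraction * img.getWidth()) {
                rows.add(y);
            }
        }
        return rows;
    }

    /** Plain average of the red, green and blue channels, in the range 0..255. */
    static int brightness(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (r + g + b) / 3;
    }
}
```

The first and last entries of the returned list would then give the top and bottom of the table, however many people are scheduled.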

What I can easily do is narrow the scope of the OCR by keeping only the 
horizontal table part and omitting the rest. I'm guessing that might also 
help?
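
Narrowing the OCR to the table band could look roughly like this in Java. It's a minimal sketch, not a tested pipeline: `TableCrop` and `cropRows` are hypothetical names, and `top` and `bottom` are assumed to come from wherever the table's vertical extent is determined.

```java
import java.awt.image.BufferedImage;

final class TableCrop {

    /**
     * Copies the full-width band of rows [top, bottom) out of the page
     * into a new, independent image.
     */
    static BufferedImage cropRows(BufferedImage page, int top, int bottom) {
        if (top < 0 || bottom > page.getHeight() || top >= bottom) {
            throw new IllegalArgumentException("invalid row range: " + top + ".." + bottom);
        }
        int width = page.getWidth();
        int height = bottom - top;

        // getSubimage returns a view that shares pixel data with the page,
        // so the pixels are copied into a fresh image before further processing.
        BufferedImage view = page.getSubimage(0, top, width, height);
        int[] pixels = view.getRGB(0, 0, width, height, null, 0, width);

        BufferedImage band = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        band.setRGB(0, 0, width, height, pixels, 0, width);
        return band;
    }
}
```

The cropped image can then be written out with `ImageIO.write(band, "png", new File("band.png"))` (the file name is just an example) and passed to Tesseract in place of the full page.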


Thanks for the help by the way!

On Tuesday, July 12, 2016 at 5:14:01 AM UTC-4, Allistair C wrote:
>
> In my opinion, having a very fixed layout/template gives you more control 
> over how you perform the OCR. Rather than giving Tesseract the entire 
> spreadsheet, why not program a preprocessing stage that extracts the text 
> you want cleanly into a new image, given that you know all the 
> (X, Y, WIDTH, HEIGHT) rectangle locations for such an input image?
>
> On 11 July 2016 at 22:00, Raphael Budd <[email protected]> wrote:
>
>> Hey everyone,
>>
>> I've got a PDF document containing a schedule. I'm trying to extract the 
>> text from it with Tesseract, but the results aren't very good.
>>
>> I've tried a lot of different things. In my inexperienced opinion the 
>> image seems very high quality, as I can zoom in a lot without seeing 
>> pixels. I've also tried converting the PDF to TIFF and adding a grayscale 
>> filter (all via Java).
>>
>> I've attached both the end result and the original PDF here, along with a 
>> sample of the output. Any help improving the output would be 
>> appreciated. 
>>
>> The TIFF file is too big to attach; see this link: 
>> http://wltd.org/Daily%20schedule-14.tiff
>>
>> ---Begin text---
>> 008 KIERA MCG 3:00 PM 11:00 PM TRWN 8.00 —
>> 718 KYLE s 11:00 PM 7:00 AM MT 8.00 < —
>> 686 JOSEPH e 11:00 PM 5:00 AM MT 6.00 — >
>> 718 KYLE s 11:00 PM 7:00 AM MT 8.00 — >
>> 656 CHANDLER A 1:00 PM 4:00 PM MB 3.00 —
>> 720 TYLER D 11:00 PM 7:00 AM T|_ F 8.00 < —
>> 720 TYLER D 11:00 PM 7:00 AM T|_ F 8.00 — >
>> 052 SH ELLY L 5:30 AM 2:00 PM FLRIFFIMGR F 8.50 _:I
>> Riley M 372 8:00 AM 4:00 PM FLR F 8.00 —
>> ‘ Raphael B602 4:00 PM 12:00 AM FLRIMGR F 8.00 ‘ —:| I
>> ‘ Kevin G 652 11:00 AM 7:00 PM g$Y$IWNIMNY$I F 8.00 ‘ I:-:| I
>> Joseph C 191 8:00 AM 4:00 PM ADMIBKIMB F 8.00 -:—
>> 2014 ROXANA T 11:00 AM 7:00 PM ADM F 8.00 _
>>
>> --END TEXT---
>>
>> As you can see, Tesseract gets quite creative in its attempts at 
>> parsing this. Earlier in the document it even parsed the letter "N" as 
>> "|\|": creative, but useless for parsing!
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/f77f8dd8-f6d2-4f6b-b5fe-5510fac4f878%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

