It only seems to happen with this one. It's super weird because all the other 
ones work perfectly fine!

I'll try resizing to 2x though.
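
For reference, a minimal sketch of what a 2x resize could look like in plain Java, assuming the page is already loaded as a BufferedImage. The class and method names (Upscale, upscale) are made up for illustration, and bicubic interpolation with a grayscale target is just one reasonable choice, not something recommended earlier in this thread.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of Tesseract or any OCR library.
final class Upscale {
    // Returns a grayscale copy of src enlarged by an integer factor.
    static BufferedImage upscale(BufferedImage src, int factor) {
        int w = src.getWidth() * factor;
        int h = src.getHeight() * factor;
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = dst.createGraphics();
        try {
            // Bicubic interpolation keeps glyph edges smoother than the default.
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                               RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return dst;
    }
}
```

Calling Upscale.upscale(page, 2) on the cropped row before handing it to tesseract would be the 2x case. Whether it actually fixes the M is something only a test run can tell.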

On Thursday, July 14, 2016 at 3:46:38 AM UTC-4, Allistair C wrote:
>
> Have you tried resizing your image to be larger? Try 2x larger; it can 
> sometimes help. Is this happening to all Ms or just one?
>
> On 14 Jul 2016, at 03:44, Raphael Budd <[email protected]> wrote:
>
> So I added really strong preprocessing that chops up the schedule, 
> however the result is still weird.
>
> Output of the attached image is: 
>
> 721 BENJI B 7:00 AM 3:00 PIVI DT 8.00
>
> Once again, almost perfect, but the M becoming IVI is a deal breaker. 
> Fixing it in post-processing is going to be hell because there is no 
> guarantee that IVI never appears legitimately.
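
One way to limit that risk might be to correct IVI only where it directly follows a clock time and an A or P, since the AM/PM column is the one place an M is certain. A rough sketch, assuming the output keeps the "3:00 PIVI" shape shown above; TimeFix and fixMeridiem are hypothetical names, and the pattern is untested against the full schedule.

```java
import java.util.regex.Pattern;

// Hypothetical post-processing helper for the OCR text.
final class TimeFix {
    // A clock time, a space, A or P, then the misread "IVI" (e.g. "3:00 PIVI").
    private static final Pattern MISREAD =
            Pattern.compile("\\b(\\d{1,2}:\\d{2}) ([AP])IVI\\b");

    // Rewrites "h:mm AIVI" / "h:mm PIVI" to "h:mm AM" / "h:mm PM";
    // any other occurrence of IVI is left alone.
    static String fixMeridiem(String line) {
        return MISREAD.matcher(line).replaceAll("$1 $2M");
    }
}
```

This only covers the time column. An M misread inside a name or a department code would still get through.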
>
>
> On Tuesday, July 12, 2016 at 2:10:27 AM UTC-4, Raphael Budd wrote:
>>
>> Hey everyone,
>>
>> I've got a PDF document which is a schedule. I'm trying to extract the 
>> text from it via tesseract but I'm not getting very good results.
>>
>> I've tried a lot of different things. In my inexperienced opinion the 
>> image seems very high quality, as I can zoom in a lot without seeing pixels. 
>> I've also tried converting the PDF to TIFF and adding a grayscale filter 
>> (all via Java).
>>
>> I've attached both the end result and the original PDF here, along with a 
>> sample of the output. Any help making the output better would be 
>> appreciated. 
>>
>> The TIFF file is too big to attach; see this link: 
>> http://wltd.org/Daily%20schedule-14.tiff
>>
>> ---Begin text---
>> 008 KIERA MCG 3:00 PM 11:00 PM TRWN 8.00 —
>> 718 KYLE s 11:00 PM 7:00 AM MT 8.00 < —
>> 686 JOSEPH e 11:00 PM 5:00 AM MT 6.00 — >
>> 718 KYLE s 11:00 PM 7:00 AM MT 8.00 — >
>> 656 CHANDLER A 1:00 PM 4:00 PM MB 3.00 —
>> 720 TYLER D 11:00 PM 7:00 AM T|_ F 8.00 < —
>> 720 TYLER D 11:00 PM 7:00 AM T|_ F 8.00 — >
>> 052 SH ELLY L 5:30 AM 2:00 PM FLRIFFIMGR F 8.50 _:I
>> Riley M 372 8:00 AM 4:00 PM FLR F 8.00 —
>> ‘ Raphael B602 4:00 PM 12:00 AM FLRIMGR F 8.00 ‘ —:| I
>> ‘ Kevin G 652 11:00 AM 7:00 PM g$Y$IWNIMNY$I F 8.00 ‘ I:-:| I
>> Joseph C 191 8:00 AM 4:00 PM ADMIBKIMB F 8.00 -:—
>> 2014 ROXANA T 11:00 AM 7:00 PM ADM F 8.00 _
>>
>> --END TEXT---
>>
>> As you can see, tesseract gets quite creative in its attempts at 
>> parsing this. Earlier in the document it even parsed the letter "N" as 
>> "|\|": creative, but useless for parsing!
>>
> -- 
> You received this message because you are subscribed to the Google Groups 
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/tesseract-ocr/32721b73-7333-468c-8232-d6f5f68487a1%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>
> <Daily schedule-11348.tiff>
>
>

