In case anybody is curious, I figured it out: you need the -psm 6 option,
which tells tesseract to treat the page as a single uniform block of text
instead of splitting it into columns.
For context, I needed to convert a PDF document to text, but it had a weird
encoding (so you couldn't do things like copy/paste the text).
Long story short, this is what you want to do for PDF documents if
you need to use tesseract:
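# Render each PDF page to its own TIFF (file-prefix-<page>.tif)
pdftoppm -tiff whatever.pdf file-prefix
# OCR each page; the text ends up in file-prefix-<page>.txt
for i in file-prefix*.tif
do
    tesseract -psm 6 "$i" "$(basename "$i" .tif)"
done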
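If you want a single text file at the end rather than one per page, the loop above can be wrapped up roughly like this. It is only a sketch: the function name ocr_pdf_pages and the output file name are made up for illustration, and it assumes the glob order of the .tif files matches the page order (which holds when the page numbers in the file names are zero-padded).

```shell
# Hypothetical helper (name and arguments are illustrative, not part of
# tesseract or poppler): OCR every page image that starts with a prefix
# and append each page's text to one combined file.
# Note: on Tesseract 4 and later the option is spelled --psm instead of -psm.
ocr_pdf_pages() {
    prefix=$1      # e.g. file-prefix, as passed to pdftoppm
    combined=$2    # e.g. whatever.txt
    : > "$combined"                      # start with an empty output file
    for page in "$prefix"*.tif; do
        [ -e "$page" ] || continue       # the glob matched nothing
        base=$(basename "$page" .tif)
        # tesseract writes its result to "$base.txt"
        tesseract -psm 6 "$page" "$base" || return 1
        cat "$base.txt" >> "$combined"
    done
}
```

After running pdftoppm as above, you would call it as: ocr_pdf_pages file-prefix whatever.txt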
On Thursday, May 16, 2013 12:22:07 AM UTC-4, Jonathan Frias wrote:
>
> I have this document that I want tesseract to process, but I'm running
> into the same issue as the guy from this thread:
> https://groups.google.com/forum/?fromgroups=#!searchin/tesseract-ocr/don$27t$20split$20column/tesseract-ocr/FiC4kKbR00s/gsdSQto6wVkJ
>
> Is there any way to set it to ignore columns and read line-by-line?
>
> P.S. The last time I used tesseract was with version 2.0.4. which didn't
> have a problem with this type of file. I'm not sure if that helps.
>