In case anybody is curious, I figured it out: you need the -psm 6 option, 
which makes tesseract assume a single uniform block of text instead of 
splitting the page into columns.

I needed to convert a PDF document to text, but it had a weird encoding (so 
you couldn't do things like copy/paste the text).

Long story short, this is what you want to do for PDF documents if you need 
to use tesseract:

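# Render each PDF page to its own TIFF: file-prefix-1.tif, file-prefix-2.tif, ...
pdftoppm -tiff whatever.pdf file-prefix
# OCR each page as one block of text; tesseract adds .txt to the output name
for i in file-prefix*.tif
do
    tesseract -psm 6 "$i" "$(basename "$i" .tif)"
done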
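
If you want a single text file at the end instead of one per page, the same 
steps can be wrapped in a small shell function. This is only a rough sketch: 
ocr_pdf and its two arguments are names I made up, it assumes pdftoppm 
zero-pads the page numbers so the glob comes back in page order, and newer 
tesseract releases spell the option --psm rather than -psm, so adjust that 
for your version.

```shell
#!/bin/sh
# Sketch: OCR every page of a PDF and join the results into one text file.
# ocr_pdf is a hypothetical helper, not something shipped with tesseract.
ocr_pdf() {
    pdf=$1    # input PDF
    out=$2    # combined text output
    workdir=$(mktemp -d) || return 1

    # One TIFF per page: "$workdir/page-1.tif", "$workdir/page-2.tif", ...
    pdftoppm -tiff "$pdf" "$workdir/page" || return 1

    : > "$out"    # start with an empty output file
    for img in "$workdir"/page-*.tif; do
        [ -e "$img" ] || continue    # glob matched nothing
        # tesseract appends ".txt" to the output base name itself
        tesseract -psm 6 "$img" "${img%.tif}" || return 1
        cat "${img%.tif}.txt" >> "$out"
    done

    rm -rf "$workdir"
}
```

After sourcing that, something like `ocr_pdf whatever.pdf whatever.txt` 
should leave the text of all pages in whatever.txt.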




On Thursday, May 16, 2013 12:22:07 AM UTC-4, Jonathan Frias wrote:
>
> I have this document that I want tesseract to process, but I'm running 
> into the same issue as the guy from this thread: 
> https://groups.google.com/forum/?fromgroups=#!searchin/tesseract-ocr/don$27t$20split$20column/tesseract-ocr/FiC4kKbR00s/gsdSQto6wVkJ
>
> Is there any way to set it to ignore columns and read line-by-line? 
>
> P.S. The last time I used tesseract was with version 2.0.4, which didn't 
> have a problem with this type of file. I'm not sure if that helps. 
>

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
