OK, we're getting somewhere! I figured out that the packages in the Ubuntu repo just don't work properly with TIFFs, so I recompiled and installed tesseract and leptonica from source.
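For anyone following along, one quick way to tell whether a given build picked up TIFF support is to look for libtiff in the version banner. This is only a minimal sketch, assuming a POSIX shell; has_libtiff is a hypothetical helper name of mine, not something that ships with tesseract, and the 2>&1 is there on the assumption that some versions print the banner to stderr rather than stdout.

```shell
# Hypothetical helper: reads a version banner on stdin and succeeds
# (exit status 0) if any line mentions libtiff, case-insensitively.
has_libtiff() {
    grep -qi 'libtiff'
}

# Only run the real check when tesseract is actually installed.
if command -v tesseract >/dev/null 2>&1; then
    if tesseract -v 2>&1 | has_libtiff; then
        echo "libtiff is listed in the tesseract -v output"
    else
        echo "libtiff is NOT listed in the tesseract -v output"
    fi
fi
```

If libtiff is missing from the banner, the binary was most likely built without TIFF support.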
So now when I run tesseract -v, I get:

↪ tesseract -v
tesseract 3.02.02
leptonica-1.69
libjpeg 8b : libpng 1.2.46 : libtiff 3.9.5 : zlib 1.2.3.4

Whereas previously, I didn't get anything mentioning libtiff.

From there, I ran the convert command from the Stack Overflow post:

convert -depth 4 -density 300 -background white -flatten +matte united_states_v._ups_customhouse_brokerage_inc..pdf united_states_v._ups_customhouse_brokerage_inc2.tiff

The resulting file worked well with tesseract, but it only had the last page of the PDF...so it's close -- very close -- but not quite there yet.

On Sun, Feb 3, 2013 at 2:08 PM, zdenko podobny <[email protected]> wrote:

> BTW: spp means Samples-per-pixel[1]. Are you able to instruct imagick to
> use 1, 3 or 4?
> And I found a report on stackoverflow[2] - it mentions that imagick use
> to set spp to 2, which should be invalid for tiff...
>
> [1] http://tpgit.github.com/Leptonica/tiffio_8c_source.html
> [2]
> http://stackoverflow.com/questions/5083492/problem-with-tesseract-and-tiff-format
>
> Zdenko
>
>
> On Sun, Feb 3, 2013 at 11:00 PM, zdenko podobny <[email protected]> wrote:
>
>> Are you able to generate just one page or a small example? Or can you
>> provide the steps you use to create it (so I can create it)?
>> Tiff could be tricky. E.g. libtiff-4 does not work for me...
>>
>> Zdenko
>>
>>
>> On Sun, Feb 3, 2013 at 10:29 PM, Mike Lissner <
>> [email protected]> wrote:
>>
>>> It's about 300MB, unfortunately, but I generate it programmatically
>>> using imagemagick in a way that's worked in the past, so I don't think the
>>> tiff file itself is the issue.
>>>
>>> If you're willing to download this monster, I'll post it to dropbox. I'd
>>> love the help, but I don't think it's the right problem.
>>>
>>>
>>> On Sun, Feb 3, 2013 at 1:16 PM, zdenko podobny <[email protected]> wrote:
>>>
>>>> Can you send an example of your tif file?
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Sun, Feb 3, 2013 at 10:08 PM, Michael Lissner <
>>>> [email protected]> wrote:
>>>>
>>>>> I have Ubuntu 12.04, which has tesseract 3.02 and leptonica version
>>>>> 1.69.
>>>>>
>>>>> I've installed these, and also installed libtiff4 using apt-get.
>>>>>
>>>>> When I try to process a document, I get:
>>>>>
>>>>> ↪ sudo tesseract united_states_v._ups_customhouse_brokerage_inc.tif
>>>>> united_states_v._ups_customhouse_brokerage_inc -l eng
>>>>> Tesseract Open Source OCR Engine v3.02 with Leptonica
>>>>> Error in pixReadFromTiffStream: spp not in set {1,3,4}
>>>>> Error in pixReadStreamTiff: pix not read
>>>>> Error in pixReadStream: tiff: no pix returned
>>>>> Error in pixRead: pix not read
>>>>> Unsupported image type.
>>>>>
>>>>>
>>>>> Which seems baffling to me. I've tried reinstalling leptonica,
>>>>> reinstalling the tiff libraries, and reinstalling tesseract in the hope
>>>>> that they'd support tiffs once reinstalled. So far, nothing is helping.
>>>>>
>>>>> I was hoping that Ubuntu 12.04 would support everything I needed it to
>>>>> without having to compile from source, but so far I've had bad luck. Is
>>>>> there a way to make this work?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mike
>>>>>
>>>>> --
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To post to this group, send email to [email protected]
>>>>> To unsubscribe from this group, send email to
>>>>> [email protected]
>>>>> For more options, visit this group at
>>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>>
>>>>> ---
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> For more options, visit https://groups.google.com/groups/opt_out.

