Sure, I will write that up. Thanks for helping, zdenop. Would you happen to know which is the most recent version that does not exhibit this issue so I can switch to that?
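For comparing the failing TIFF with one that works, here is a minimal sketch that reads a few baseline tags (bit depth, compression, and so on) from the first page using only the Python standard library. The function name `read_tiff_tags` and the choice of tags are assumptions made for illustration. It is not part of Tesseract or Leptonica. It handles classic TIFF only, reads only SHORT and LONG fields, and does not say which property actually triggers the error.

```python
import struct

# Baseline TIFF tag numbers (TIFF 6.0) that are useful when comparing two files.
TAGS = {
    256: "ImageWidth",
    257: "ImageLength",
    258: "BitsPerSample",
    259: "Compression",  # 1 = uncompressed, 5 = LZW
    262: "PhotometricInterpretation",
    277: "SamplesPerPixel",
}

# TIFF field types handled by this sketch: 3 = SHORT, 4 = LONG.
TYPE_FORMATS = {3: "H", 4: "I"}


def read_tiff_tags(data: bytes) -> dict:
    """Return selected tags from the first IFD (first page) of a classic TIFF."""
    order = data[:2]
    if order == b"II":
        endian = "<"  # little-endian
    elif order == b"MM":
        endian = ">"  # big-endian
    else:
        raise ValueError("not a TIFF file")

    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")

    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    tags = {}
    for i in range(entry_count):
        entry_offset = ifd_offset + 2 + 12 * i
        tag, field_type, count = struct.unpack_from(endian + "HHI", data, entry_offset)
        if tag not in TAGS or field_type not in TYPE_FORMATS:
            continue
        fmt = endian + str(count) + TYPE_FORMATS[field_type]
        if struct.calcsize(fmt) <= 4:
            # Small values are stored inline in the entry's value field.
            value_offset = entry_offset + 8
        else:
            # Larger values live elsewhere; the field holds their file offset.
            (value_offset,) = struct.unpack_from(endian + "I", data, entry_offset + 8)
        tags[TAGS[tag]] = list(struct.unpack_from(fmt, data, value_offset))
    return tags
```

Running something like `read_tiff_tags(open("ocrIn_1.tif", "rb").read())` on both the failing file and the edited one, then comparing the two dictionaries, should show whether they differ in `BitsPerSample`, `SamplesPerPixel`, or `Compression`.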
On Tuesday, June 7, 2022 at 12:27:08 AM UTC-5 zdenop wrote:

> Can you please create an issue at
> https://github.com/tesseract-ocr/tesseract/issues?
>
> I confirm a problem with recent tesseract and leptonica, so it should be
> fixed for the next release...
>
> Zdenko
>
> On Mon, 6 Jun 2022 at 22:47, Lucas L. <[email protected]> wrote:
>
>> OK, I have a sample document to share now. I've pulled out one page from
>> a document exhibiting this error that does not have any identifying
>> information on it.
>>
>> In the process of doing this, I noticed that the same original document
>> (they usually come in as PDFs) split into TIFFs by other applications
>> (e.g., Foxit) doesn't seem to run into issues. The TIFFs are not invalid
>> when I look at them on my personal PC. However, when the document goes
>> through our pipeline and is split into TIFFs in preparation for being
>> OCR'd, Tesseract throws the "selectDefaultPdfEncoding" error mentioned
>> above. Unfortunately, unless I know exactly what about this document is
>> causing this, I won't be able to address it in our pipeline.
>>
>> On Monday, June 6, 2022 at 12:00:45 PM UTC-5 Lucas L. wrote:
>>
>>> No luck, sadly. When I edited the image in IrfanView to block out the
>>> sensitive parts and tried to OCR it again, the error didn't occur. I'm
>>> not sure what changed in the .tiff image file. Any ideas on what kind of
>>> image metadata could cause this "selectDefaultPdfEncoding" error?
>>>
>>> The only difference I can notice between the two files is that the
>>> original has 16 BPP color depth. They both have LZW compression.
>>>
>>> On Monday, June 6, 2022 at 11:47:31 AM UTC-5 Lucas L. wrote:
>>>
>>>> Oh yeah, here's the output of tesseract -v:
>>>>
>>>> tesseract 5.1.0
>>>>  leptonica-1.79.0
>>>>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
>>>>  Found AVX2
>>>>  Found AVX
>>>>  Found FMA
>>>>  Found SSE4.1
>>>>  Found OpenMP 201511
>>>>  Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
>>>>
>>>> On Monday, June 6, 2022 at 11:46:30 AM UTC-5 Lucas L. wrote:
>>>>
>>>>> It seems to be specific to the document in question. However, I'm
>>>>> afraid I can't post the document because it has sensitive information
>>>>> on it. I guess I can try to scrub the info using an image editing tool
>>>>> and see if the error still occurs.
>>>>>
>>>>> On Monday, June 6, 2022 at 11:21:25 AM UTC-5 zdenop wrote:
>>>>>
>>>>>> Can you please share ocrIn_1.tif + info which tessdata version you
>>>>>> use?
>>>>>> + output of 'tesseract -v'
>>>>>>
>>>>>> Zdenko
>>>>>>
>>>>>> On Mon, 6 Jun 2022 at 17:53, Lucas L. <[email protected]> wrote:
>>>>>>
>>>>>>> Hi, I'm trying to upgrade Tesseract from 4.1.1 to 5.1 in the Ubuntu
>>>>>>> 20.04 VMs we use to OCR documents. Both versions were built from
>>>>>>> source on that VM. 4.1.1 worked, but 5.1 throws an error that I
>>>>>>> can't seem to find anywhere else online:
>>>>>>>
>>>>>>> sudo -u userx tesseract --loglevel ALL --oem 1 -l eng /opt/.../pdfprocessor/test/ocr-working/1/ocrIn_1.tif /opt/.../pdfprocessor/test/test pdf
>>>>>>> Error in selectDefaultPdfEncoding: type selection failure
>>>>>>> Error during processing.
>>>>>>>
>>>>>>> I have tried the training data from both "tessdata" and
>>>>>>> "tessdata_best" and got the same error. Any help would be
>>>>>>> appreciated.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Lucas LeBlanc

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ff6a5fb8-6ece-49e7-9e55-4656c77dd1f0n%40googlegroups.com.

