Can you please create an issue at https://github.com/tesseract-ocr/tesseract/issues?
I confirm a problem with recent tesseract and leptonica, so it should be fixed for the next release...

Zdenko

On Mon, 6 Jun 2022 at 22:47, Lucas L. <infinitepant...@gmail.com> wrote:

> OK, I have a sample document to share now. I've pulled out one page from a
> document exhibiting this error that does not have any identifying
> information on it.
>
> In the process, I noticed that when the same original document (they
> usually come in as PDFs) is split into TIFFs by other applications
> (e.g., FoxIt), it doesn't seem to run into issues. The TIFFs are not
> invalid when I look at them on my personal PC. However, when the document
> goes through our pipeline and is split into TIFFs in preparation for being
> OCR'd, Tesseract throws the "selectDefaultPdfEncoding" error mentioned
> above. Unfortunately, unless I know exactly what about this document is
> causing this, I won't be able to address it in our pipeline.
>
> On Monday, June 6, 2022 at 12:00:45 PM UTC-5 Lucas L. wrote:
>
>> No luck, sadly. When I edited the image in Irfanview to block out the
>> sensitive parts and tried to OCR it again, the error didn't occur. I'm
>> not sure what changed in the .tiff image file. Any ideas on what kind of
>> image metadata could cause this "selectDefaultPdfEncoding" error?
>>
>> The only difference I can see between the two files is that the original
>> has 16 BPP color depth. They both have LZW compression.
>>
>> On Monday, June 6, 2022 at 11:47:31 AM UTC-5 Lucas L. wrote:
>>
>>> Oh yeah, here's the output of tesseract -v:
>>>
>>> tesseract 5.1.0
>>> leptonica-1.79.0
>>> libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
>>> Found AVX2
>>> Found AVX
>>> Found FMA
>>> Found SSE4.1
>>> Found OpenMP 201511
>>> Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
>>>
>>> On Monday, June 6, 2022 at 11:46:30 AM UTC-5 Lucas L. wrote:
>>>
>>>> It seems to be specific to the document in question. However, I'm
>>>> afraid I can't post the document because it has sensitive information
>>>> on it. I guess I can try to scrub the info using an image editing tool
>>>> and see if the error still occurs.
>>>>
>>>> On Monday, June 6, 2022 at 11:21:25 AM UTC-5 zdenop wrote:
>>>>
>>>>> Can you please share ocrIn_1.tif, tell us which tessdata version you
>>>>> use, and post the output of 'tesseract -v'?
>>>>>
>>>>> Zdenko
>>>>>
>>>>> On Mon, 6 Jun 2022 at 17:53, Lucas L. <infinit...@gmail.com> wrote:
>>>>>
>>>>>> Hi, I'm trying to upgrade Tesseract from 4.1.1 to 5.1 on the Ubuntu
>>>>>> 20.04 VMs we use to OCR documents. Both versions were built from
>>>>>> source on that VM. 4.1.1 worked, but 5.1 throws an error that I
>>>>>> can't seem to find anywhere else online:
>>>>>>
>>>>>> sudo -u userx tesseract --loglevel ALL --oem 1 -l eng /opt/.../pdfprocessor/test/ocr-working/1/ocrIn_1.tif /opt/.../pdfprocessor/test/test pdf
>>>>>> Error in selectDefaultPdfEncoding: type selection failure
>>>>>> Error during processing.
>>>>>>
>>>>>> I have tried the training data from both "tessdata" and
>>>>>> "tessdata_best" and got the same error. Any help would be
>>>>>> appreciated.
>>>>>>
>>>>>> Thanks,
>>>>>> Lucas LeBlanc
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8wPcngWC4s-gw-McyrVjW2pAuPo0aPq8_%2BQD-qJ%3DE0X0g%40mail.gmail.com.
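
For anyone comparing a failing TIFF against a re-saved copy that works, as Lucas did with the 16 BPP original and the Irfanview edit, a few baseline tags can be dumped without extra libraries. This is only a rough sketch: `read_tiff_tags` is a hypothetical helper, not part of Tesseract or Leptonica, it reads only the first IFD of a classic (non-Big) TIFF, and nothing in this thread confirms that bit depth is what triggers the error.

```python
import struct

# Baseline tag numbers and field types from the TIFF 6.0 specification.
TAGS = {
    256: "ImageWidth",
    257: "ImageLength",
    258: "BitsPerSample",
    259: "Compression",
    262: "PhotometricInterpretation",
    277: "SamplesPerPixel",
}
TYPE_FORMATS = {1: "B", 3: "H", 4: "I"}  # BYTE, SHORT, LONG


def read_tiff_tags(data: bytes) -> dict:
    """Return selected tags from the first IFD of a classic TIFF (hypothetical helper)."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled)")

    (entry_count,) = struct.unpack_from(order + "H", data, ifd_offset)
    result = {}
    for i in range(entry_count):
        entry_pos = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, type_id, count = struct.unpack_from(order + "HHI", data, entry_pos)
        fmt = TYPE_FORMATS.get(type_id)
        if tag not in TAGS or fmt is None:
            continue
        width = struct.calcsize(fmt) * count
        if width <= 4:
            value_pos = entry_pos + 8  # value is stored inline in the entry
        else:
            # value field holds an offset to the actual data
            (value_pos,) = struct.unpack_from(order + "I", data, entry_pos + 8)
        values = struct.unpack_from(f"{order}{count}{fmt}", data, value_pos)
        result[TAGS[tag]] = list(values)
    return result


# Example use (file name taken from the thread; adjust the path as needed):
# with open("ocrIn_1.tif", "rb") as f:
#     print(read_tiff_tags(f.read()))
```

Running it on both files and diffing the output should show whether they differ in BitsPerSample, SamplesPerPixel, or PhotometricInterpretation. In the Compression tag, a value of 5 means LZW.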