Sure, I will write that up. Thanks for helping, zdenop. Would you happen to 
know which is the most recent version that does not exhibit this issue so I 
can switch to that?
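
In the meantime, in case it helps anyone compare a failing TIFF against a working one: below is a minimal sketch that reads the bit depth and compression straight from the TIFF header, since the only difference I spotted earlier was the 16 BPP color depth. I have not confirmed that bit depth is what triggers the error. The names `read_tiff_tags` and `TAGS` are my own, and the sketch assumes a classic (non-BigTIFF) file and only looks at the first page.

```python
import struct

# Baseline tag numbers from the TIFF 6.0 specification.
TAGS = {258: "BitsPerSample", 259: "Compression", 277: "SamplesPerPixel"}


def read_tiff_tags(data: bytes) -> dict:
    """Return selected SHORT-valued tags from the first IFD of a classic TIFF."""
    # The first two bytes select the byte order: "II" little, "MM" big endian.
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:  # 43 would be BigTIFF, which this sketch does not handle
        raise ValueError("unsupported TIFF variant")

    (count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    found = {}
    for i in range(count):
        # Each IFD entry is 12 bytes: tag, field type, value count, value/offset.
        start = ifd_offset + 2 + 12 * i
        tag, ftype, n = struct.unpack(order + "HHI", data[start:start + 8])
        if tag not in TAGS or ftype != 3:  # field type 3 = SHORT (16-bit)
            continue
        if n <= 2:
            # Up to two SHORT values are stored inline in the entry itself.
            raw = data[start + 8:start + 8 + 2 * n]
        else:
            # Otherwise the last four bytes hold an offset to the values.
            (offset,) = struct.unpack(order + "I", data[start + 8:start + 12])
            raw = data[offset:offset + 2 * n]
        found[TAGS[tag]] = list(struct.unpack(order + f"{n}H", raw))
    return found
```

To use it, read the file in binary mode and pass the bytes in, e.g. `read_tiff_tags(open(path, "rb").read())`. A `Compression` value of 5 means LZW, and `BitsPerSample` holds one entry per channel.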

On Tuesday, June 7, 2022 at 12:27:08 AM UTC-5 zdenop wrote:

> Can you please create an issue at 
> https://github.com/tesseract-ocr/tesseract/issues?
>
> I confirm a problem with recent tesseract and leptonica, so it should be 
> fixed for the next release...
>
> Zdenko
>
>
> On Mon, 6 Jun 2022 at 22:47, Lucas L. <[email protected]> wrote:
>
>> OK, I have a sample document to share now. I've pulled out one page, with
>> no identifying information on it, from a document that exhibits this error.
>> While doing this, I noticed that the same original document (they usually
>> come in as PDFs) doesn't seem to run into issues when it is split into
>> TIFFs by other applications (e.g., Foxit). The TIFFs do not appear invalid
>> when I look at them on my personal PC. However, when the document goes
>> through our pipeline and is split into TIFFs in preparation for OCR,
>> Tesseract throws the "selectDefaultPdfEncoding" error mentioned above.
>> Unfortunately, unless I know exactly what about this document is causing
>> the error, I won't be able to address it in our pipeline.
>>
>> On Monday, June 6, 2022 at 12:00:45 PM UTC-5 Lucas L. wrote:
>>
>>> No luck, sadly. When I edited the image in IrfanView to block out the
>>> sensitive parts and tried to OCR it again, the error didn't occur. I'm not
>>> sure what changed in the .tiff image file. Any ideas on what kind of image
>>> metadata could cause this "selectDefaultPdfEncoding" error?
>>>
>>> The only difference I can see between the two files is that the original
>>> has a 16 BPP color depth. Both use LZW compression.
>>>
>>> On Monday, June 6, 2022 at 11:47:31 AM UTC-5 Lucas L. wrote:
>>>
>>>> Oh yeah, here's the output of tesseract -v:
>>>>
>>>> tesseract 5.1.0
>>>>  leptonica-1.79.0
>>>>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
>>>>  Found AVX2
>>>>  Found AVX
>>>>  Found FMA
>>>>  Found SSE4.1
>>>>  Found OpenMP 201511
>>>>  Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
>>>>
>>>> On Monday, June 6, 2022 at 11:46:30 AM UTC-5 Lucas L. wrote:
>>>>
>>>>> It seems to be specific to the document in question. However, I'm
>>>>> afraid I can't post the document because it has sensitive information
>>>>> on it. I guess I can try to scrub the info using an image editing tool
>>>>> and see if the error still occurs.
>>>>>
>>>>> On Monday, June 6, 2022 at 11:21:25 AM UTC-5 zdenop wrote:
>>>>>
>>>>>> Can you please share ocrIn_1.tif, tell us which tessdata version you
>>>>>> use, and post the output of 'tesseract -v'?
>>>>>>
>>>>>> Zdenko
>>>>>>
>>>>>>
>>>>>> On Mon, 6 Jun 2022 at 17:53, Lucas L. <[email protected]> wrote:
>>>>>>
>>>>>>> Hi, I'm trying to upgrade Tesseract from 4.1.1 to 5.1 on the Ubuntu
>>>>>>> 20.04 VMs we use to OCR documents. Both versions were built from
>>>>>>> source on that VM. 4.1.1 worked, but 5.1 throws an error that I
>>>>>>> can't seem to find anywhere else online:
>>>>>>>
>>>>>>> sudo -u userx tesseract --loglevel ALL --oem 1 -l eng /opt/.../pdfprocessor/test/ocr-working/1/ocrIn_1.tif /opt/.../pdfprocessor/test/test pdf
>>>>>>> Error in selectDefaultPdfEncoding: type selection failure
>>>>>>> Error during processing.
>>>>>>>
>>>>>>> I have tried the training data from both "tessdata" and 
>>>>>>> "tessdata_best" and got the same error. Any help would be appreciated.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Lucas LeBlanc
>>>>>>>
>>>>>>> -- 
>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>> send an email to [email protected].
>>>>>>> To view this discussion on the web visit 
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/6a8a3c7c-5c09-478e-a897-dca4314646e6n%40googlegroups.com
>>>>>>>  
>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/6a8a3c7c-5c09-478e-a897-dca4314646e6n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>>
>
