Can you please create an issue at
https://github.com/tesseract-ocr/tesseract/issues?

I can confirm the problem with recent Tesseract and Leptonica, so it should be
fixed for the next release.

Zdenko
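
For the issue report, it may help to state exactly how the failing TIFF is encoded. The quoted messages below mention 16 BPP color depth and LZW compression on the original file. The following is a minimal sketch, using only the Python standard library, that reads a few baseline tags from the first image directory of a classic TIFF. The function name `read_tiff_tags` and the choice of tags are illustrative assumptions, not part of Tesseract or Leptonica, and BigTIFF files are not handled. Tag numbers follow the TIFF 6.0 specification, where a Compression value of 5 means LZW.

```python
import struct
import sys

# Tag IDs from the TIFF 6.0 specification (a small, illustrative subset).
TAGS = {
    256: "ImageWidth",
    257: "ImageLength",
    258: "BitsPerSample",
    259: "Compression",
    262: "PhotometricInterpretation",
    277: "SamplesPerPixel",
}

# Field type -> (struct format code, size in bytes): BYTE, SHORT, LONG.
TYPES = {1: ("B", 1), 3: ("H", 2), 4: ("I", 4)}


def read_tiff_tags(data: bytes) -> dict:
    """Return selected tags from the first IFD of a classic (non-Big) TIFF."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled)")
    (count,) = struct.unpack_from(order + "H", data, ifd_offset)
    tags = {}
    for i in range(count):
        entry = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, ftype, n = struct.unpack_from(order + "HHI", data, entry)
        if tag not in TAGS or ftype not in TYPES:
            continue
        code, size = TYPES[ftype]
        # Values that fit in 4 bytes are stored inline in the entry;
        # otherwise the entry holds an offset to the values.
        if n * size <= 4:
            pos = entry + 8
        else:
            (pos,) = struct.unpack_from(order + "I", data, entry + 8)
        values = struct.unpack_from(f"{order}{n}{code}", data, pos)
        tags[TAGS[tag]] = values[0] if n == 1 else list(values)
    return tags


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as fh:
        print(read_tiff_tags(fh.read()))
```

Running the script with the path of the failing TIFF and again with the path of the edited copy that works would show whether BitsPerSample, SamplesPerPixel, or PhotometricInterpretation differ between the two files.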


On Mon, 6 Jun 2022 at 22:47, Lucas L. <infinitepant...@gmail.com> wrote:

> OK, I have a sample document to share now. I've pulled out one page from a
> document exhibiting this error that does not have any identifying
> information on it.
> While doing this, I noticed that the same original document (they usually
> come in as PDFs) doesn't seem to run into issues when it is split into
> TIFFs by other applications (e.g., Foxit). The TIFFs are not invalid when
> I look at them on my personal PC. However, when the document goes through
> our pipeline and is split into TIFFs in preparation for being OCR'd,
> Tesseract throws the "selectDefaultPdfEncoding" error mentioned above.
> Unfortunately, unless I know exactly what about this document is causing
> this, I won't be able to address it in our pipeline.
>
> On Monday, June 6, 2022 at 12:00:45 PM UTC-5 Lucas L. wrote:
>
>> No luck, sadly. When I edited the image in IrfanView to block out the
>> sensitive parts and tried to OCR it again, the error didn't occur. I'm not
>> sure what changed in the .tiff image file. Any ideas on what kind of image
>> metadata could cause this "selectDefaultPdfEncoding" error?
>>
>> The only difference I can see between the two files is that the original
>> has 16 BPP color depth. They both have LZW compression.
>>
>> On Monday, June 6, 2022 at 11:47:31 AM UTC-5 Lucas L. wrote:
>>
>>> Oh yeah, here's the output of tesseract -v:
>>>
>>> tesseract 5.1.0
>>>  leptonica-1.79.0
>>>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
>>>  Found AVX2
>>>  Found AVX
>>>  Found FMA
>>>  Found SSE4.1
>>>  Found OpenMP 201511
>>>  Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
>>>
>>> On Monday, June 6, 2022 at 11:46:30 AM UTC-5 Lucas L. wrote:
>>>
>>>> It seems to be specific to the document in question. However, I'm
>>>> afraid I can't post the document because it contains sensitive
>>>> information. I guess I can try to scrub the info using an image editing
>>>> tool and see if the error still occurs.
>>>>
>>>> On Monday, June 6, 2022 at 11:21:25 AM UTC-5 zdenop wrote:
>>>>
>>>>> Can you please share ocrIn_1.tif, tell us which tessdata version you
>>>>> use, and post the output of 'tesseract -v'?
>>>>>
>>>>> Zdenko
>>>>>
>>>>>
>>>>> On Mon, 6 Jun 2022 at 17:53, Lucas L. <infinit...@gmail.com> wrote:
>>>>>
>>>>>> Hi, I'm trying to upgrade Tesseract from 4.1.1 to 5.1 on the Ubuntu
>>>>>> 20.04 VMs we use to OCR documents. Both versions were built from
>>>>>> source on that VM. 4.1.1 worked, but 5.1 throws an error that I can't
>>>>>> seem to find anywhere else online:
>>>>>>
>>>>>> sudo -u userx tesseract --loglevel ALL --oem 1 -l eng
>>>>>> /opt/.../pdfprocessor/test/ocr-working/1/ocrIn_1.tif
>>>>>> /opt/.../pdfprocessor/test/test pdf
>>>>>> Error in selectDefaultPdfEncoding: type selection failure
>>>>>> Error during processing.
>>>>>>
>>>>>> I have tried the training data from both "tessdata" and
>>>>>> "tessdata_best" and got the same error. Any help would be appreciated.
>>>>>>
>>>>>> Thanks,
>>>>>> Lucas LeBlanc
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/6a8a3c7c-5c09-478e-a897-dca4314646e6n%40googlegroups.com
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/6a8a3c7c-5c09-478e-a897-dca4314646e6n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>>
