Re: [tesseract-ocr] TessPDFRenderer outputs invalid PDF file (+gosseract)

'blaumedia' via tesseract-ocr Mon, 22 Nov 2021 10:54:40 -0800

Hey zdenop,

it turns out I can't rely on 5.0.0 after all, because OpenCV (another 
requirement of my project) seems to be compatible only with 4.x so far.
Does your script from above work on tesseract 4.x for you?
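
A side note for comparing builds: the corrupt files described in this thread
stop before the closing markers of the PDF, so a shallow structural check makes
it easy to tell a truncated file from a complete one. Below is a minimal sketch
using only the C++ standard library. The names PdfCheck, check_pdf_bytes and
check_pdf_file are made up for illustration and are not part of tesseract or
leptonica, and the 1024-byte tail window is an assumption as well. It only
looks for the "%PDF-" header and the "startxref" / "%%EOF" trailer markers, so
passing it does not prove that a file is a valid PDF.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Result of a shallow structural check; this is not a full PDF validation.
struct PdfCheck {
    bool has_header;     // data starts with "%PDF-"
    bool has_startxref;  // "startxref" keyword found near the end
    bool has_eof;        // "%%EOF" marker found near the end
    bool ok() const { return has_header && has_startxref && has_eof; }
};

// Inspect raw PDF bytes held in memory (hypothetical helper).
PdfCheck check_pdf_bytes(const std::string& data) {
    PdfCheck r{};  // all flags start as false
    r.has_header = data.compare(0, 5, "%PDF-") == 0;
    // The trailer sits at the end of the file, so only the tail is searched.
    // The 1024-byte window is an arbitrary choice for this sketch.
    const std::size_t tail_start = data.size() > 1024 ? data.size() - 1024 : 0;
    const std::string tail = data.substr(tail_start);
    r.has_startxref = tail.find("startxref") != std::string::npos;
    r.has_eof = tail.find("%%EOF") != std::string::npos;
    return r;
}

// Read a whole file and run the check (hypothetical helper).
// A file that cannot be opened reports every marker as missing.
PdfCheck check_pdf_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return PdfCheck{};
    const std::string data((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());
    return check_pdf_bytes(data);
}
```

Calling check_pdf_file("output.pdf").ok() on a file that ends without its
trailer, as described further down in this thread, would be expected to return
false, while has_header can still be true.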


blaumedia wrote on Monday, 22 November 2021 at 18:51:38 UTC+1:

> It works!
>
> I tried tesseract-ocr 5.0.0 RC2 + leptonica 1.82, and with that combination 
> both my code and yours worked flawlessly. It seems that 4.1.3 has a bug that 
> has been fixed in 5.0.0. I hadn't tested 5.0 before because I thought it 
> would be less stable.
> I also tested 4.1.3 + leptonica 1.82 (I was on 1.7.x before), and 
> the problem with the corrupt PDF still exists. But that's not a problem; I 
> will use 5.0.0 instead.
>
> Thank you zdenop!
>
> zdenop wrote on Monday, 22 November 2021 at 14:33:15 UTC+1:
>
>> This is an old snippet of mine, so part of the code (opening the input 
>> image as PIX) is not needed for PDF rendering.
>>
>> Zdenko
>>
>>
>> On Mon, 22 Nov 2021 at 14:28, Zdenko Podobny <[email protected]> wrote:
>>
>>> Here is a simple example that works for me (with tesseract 5 and leptonica 
>>> 1.82):
>>>
>>> #include <cerrno>
>>> #include <cstdio>
>>> #include <cstdlib>
>>> #include <cstring>
>>> #include <string>
>>>
>>> #include <leptonica/allheaders.h>
>>> #include <tesseract/baseapi.h>
>>> #include <tesseract/renderer.h>
>>>
>>> int main() {
>>>     const char* datapath = "f:/Project-Personal/tessdata_best/tessdata";
>>>     std::string language_ = "eng";
>>>     std::string inputFile_ = "input.png";
>>>     const char* outputbase = "output";  // the renderer appends ".pdf"
>>>
>>>     tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
>>>     // Init() returns 0 on success.
>>>     if (api->Init(datapath, language_.c_str(),
>>>                   tesseract::OEM_LSTM_ONLY)) {
>>>         fprintf(stderr, "Could not initialize tesseract.\n");
>>>         delete api;
>>>         return EXIT_FAILURE;
>>>     }
>>>
>>>     // Not needed for PDF rendering: ProcessPages() reads the file itself.
>>>     PIX *sourceImg = pixRead(inputFile_.c_str());
>>>     if (!sourceImg) {
>>>         fprintf(stderr, "Leptonica can't process input file: %s\n",
>>>                 inputFile_.c_str());
>>>         api->End();
>>>         delete api;
>>>         return EXIT_FAILURE;
>>>     }
>>>     api->SetImage(sourceImg);
>>>     api->SetInputName(inputFile_.c_str());
>>>     api->SetOutputName(outputbase);
>>>
>>>     tesseract::TessPDFRenderer* renderer =
>>>         new tesseract::TessPDFRenderer(outputbase, api->GetDatapath());
>>>     bool succeed = renderer->happy();
>>>     if (!succeed) {
>>>         fprintf(stderr, "Error, could not create PDF output file: %s\n",
>>>                 strerror(errno));
>>>     } else {
>>>         // ProcessPages() calls BeginDocument()/EndDocument() itself.
>>>         succeed = api->ProcessPages(inputFile_.c_str(), nullptr, 0,
>>>                                     renderer);
>>>         if (!succeed) {
>>>             fprintf(stderr, "Error during processing.\n");
>>>         }
>>>     }
>>>
>>>     // Deleting the renderer closes the output file.
>>>     delete renderer;
>>>     api->End();
>>>     delete api;
>>>     pixDestroy(&sourceImg);
>>>     return succeed ? EXIT_SUCCESS : EXIT_FAILURE;
>>> }
>>>
>>>
>>> Zdenko
>>>
>>>
>>> On Sun, 21 Nov 2021 at 23:16, 'blaumedia' via tesseract-ocr <
>>> [email protected]> wrote:
>>>
>>>> Hi zdenop,
>>>>
>>>> thanks for your tip, but I'm using the ProcessPage*s* function, so it 
>>>> should write the header and footer of the file itself.
>>>> However, I also experimented with ProcessPage(), calling BeginDocument() 
>>>> before it and EndDocument() after it, and the resulting file differs 
>>>> significantly. Sadly, the file is still corrupt.
>>>>
>>>> So the problem seems to come from a failing BeginDocument() or 
>>>> EndDocument() call. But even there I'm seeing mysterious bugs.
>>>> Using only EndDocument(), I get something like a footer at the end of 
>>>> the file:
>>>> [image: r3mxpijfjkxk073pmzquqm343_testjpg.pdf_ff — gosseract [SSH: 
>>>> root.debdocker.home.blaumedia.com]_2021-11-21 23-07-40_MacPro.png]
>>>>
>>>> But it suddenly stops at "Produce". And when I use BeginDocument(), 
>>>> ProcessPage() and then EndDocument(), the file ends with raw bytes and 
>>>> there is no "endstream" or "endobj".
>>>> I've updated to the latest version, 4.1.3, but the problem still exists.
>>>>
>>>> I updated the bug branch in 
>>>> https://github.com/dnnspaul/gosseract/tree/tesseract/bug%2F3652 so the 
>>>> problem is reproducible.
>>>> To disable the BeginDocument() call, one has to comment out 
>>>> https://github.com/dnnspaul/gosseract/blob/tesseract/bug/3652/tessbridge.cpp#L187
>>>> .
>>>>
>>>> I tried to use the code from the tesseract CLI 1:1, but it still does 
>>>> not work...
>>>>
>>>> zdenop wrote on Sunday, 21 November 2021 at 13:18:52 UTC+1:
>>>>
>>>>> This seems like the same problem as 
>>>>> https://github.com/sirfz/tesserocr/issues/271#issuecomment-919334885
>>>>>
>>>>> Did you use BeginDocument() and EndDocument()?
>>>>>
>>>>> Zdenko
>>>>>
>>>>>
>>>>> On Sun, 21 Nov 2021 at 9:27, 'blaumedia' via tesseract-ocr <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Already described in this issue: 
>>>>>> https://github.com/tesseract-ocr/tesseract/issues/3652
>>>>>>
>>>>>> I'm trying to generate a searchable PDF from a JPG image, 
>>>>>> but the file that gets written is an invalid PDF that can't be read 
>>>>>> by any PDF reader.
>>>>>>
>>>>>> I have added a Docker image to the issue for reproducing the problem, 
>>>>>> but here is the bash snippet for it:
>>>>>>
>>>>>> git clone [email protected]:dnnspaul/gosseract.git
>>>>>> cd gosseract
>>>>>> git checkout tesseract/bug/3652
>>>>>>
>>>>>> docker build -t tessbug .
>>>>>> docker run -it -v $PWD/tmp:/tmp tessbug go run main.go
>>>>>>
>>>>>> When I feed the same file to the tesseract CLI, the resulting PDF 
>>>>>> is readable, but I can't find any difference between the CLI and my 
>>>>>> snippet.
>>>>>>
>>>>>> Thanks in advance for any help! I'm very sorry, I'm more of a Go 
>>>>>> developer than a C++ developer, so I struggle with even the 
>>>>>> simplest syntax, but I tried my best.
>>>>>>
>>>>>> -- 
>>>>>> You received this message because you are subscribed to the Google 
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>> send an email to [email protected].
>>>>>> To view this discussion on the web visit 
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/f34562d3-d11e-4385-9c78-b24092413dean%40googlegroups.com
>>>>>>  
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/f34562d3-d11e-4385-9c78-b24092413dean%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>>
>>>

