I am using the Tesseract C API from Python via ctypes. Everything works 
well except for multi-page TIFFs: I only get the text from the last page 
instead of the text from all pages. 

This is what I'm doing:

    path = "multipage.tiff"
    self.tesseract.TessBaseAPIProcessPages.argtypes = [
        POINTER(TessBaseAPI), c_char_p, c_char_p, c_int,
        POINTER(TessResultRenderer)]
    self.tesseract.TessBaseAPIProcessPages.restype = c_bool
    # Arguments: handle, filename, retry_config, timeout_millisec, renderer
    success = self.tesseract.TessBaseAPIProcessPages(
        self.api, create_string_buffer(path), None, 0, None)
    ocr_r = self.tesseract.TessBaseAPIGetUTF8Text(self.api)
    result = string_at(ocr_r)  # contains text only from the last page

Has anyone come across this before, or does anyone know how to resolve 
it? 

I [opened this as an issue][1] against Tesseract, but apparently it is not 
a bug in the Tesseract command line or API, since the command line works 
fine and returns the text for all pages. 

Perhaps something else should be called instead of 
`self.tesseract.TessBaseAPIGetUTF8Text(self.api)` to get all the text?
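
One direction I could try, as a rough sketch I have not verified against my 
setup: pass a text renderer to `TessBaseAPIProcessPages` instead of `None`, 
so that every page is written to `<output_base>.txt`, and read that file 
back. This assumes the library exports `TessTextRendererCreate` and 
`TessDeleteResultRenderer`, and that `ProcessPages` begins and ends the 
document on the renderer itself. The helper name `ocr_all_pages` and its 
parameters are made up for illustration; `lib` is the handle from 
`ctypes.CDLL` and `api` is an already initialised `TessBaseAPI` handle.

```python
import os
from ctypes import c_char_p, c_int, c_void_p


def ocr_all_pages(lib, api, image_path, output_base):
    """Hypothetical helper: OCR every page and return the combined text.

    Assumes the text renderer writes its output to output_base + ".txt".
    """
    # Opaque C handles are treated as plain void pointers here.
    lib.TessTextRendererCreate.argtypes = [c_char_p]
    lib.TessTextRendererCreate.restype = c_void_p
    lib.TessBaseAPIProcessPages.argtypes = [
        c_void_p, c_char_p, c_char_p, c_int, c_void_p]
    # capi.h defines BOOL as int, so c_int is used for the return value.
    lib.TessBaseAPIProcessPages.restype = c_int
    lib.TessDeleteResultRenderer.argtypes = [c_void_p]
    lib.TessDeleteResultRenderer.restype = None

    renderer = lib.TessTextRendererCreate(os.fsencode(output_base))
    try:
        ok = lib.TessBaseAPIProcessPages(
            api, os.fsencode(image_path), None, 0, renderer)
    finally:
        # Deleting the renderer should also close the output file.
        lib.TessDeleteResultRenderer(renderer)

    if not ok:
        raise RuntimeError("TessBaseAPIProcessPages failed for %r" % image_path)

    # Read the text only after the renderer has been deleted.
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.read()
```

In my code this would be called roughly as 
`ocr_all_pages(self.tesseract, self.api, "multipage.tiff", "/tmp/out")`, 
where the output base path is again just a placeholder.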


  [1]: https://github.com/tesseract-ocr/tesseract/issues/1138
