On Thu, 21 Oct 2021, nskarthik wrote:
Question: I need to extract text / images at page level using Java. I did not find any example on the web or on the Tika website.

For PDF, you should fetch the contents as XHTML rather than plain text. You can then split on the page divs, since the PDF parser wraps each page in a <div class="page"> element. This isn't available for formats which aren't page-based, but luckily PDF is.
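
As a rough sketch of the splitting step, something like the following could work. It uses only the JDK and assumes you already have the XHTML as a String (with Tika you would typically get that by passing a ToXMLContentHandler to AutoDetectParser.parse(...) and calling toString() on it). The class name PageSplitter and the method splitPages are made up for this example, and it assumes the XHTML is well-formed XML with one <div class="page"> per page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika: splits XHTML into per-page text.
class PageSplitter {

    /** Returns the text of each div with class="page", in document order. */
    static List<String> splitPages(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // The XHTML carries a default namespace, so parse namespace-aware
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // "*" matches div elements in any namespace
        NodeList divs = doc.getElementsByTagNameNS("*", "div");
        List<String> pages = new ArrayList<>();
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            // Assumption: page boundaries are marked with class="page"
            if ("page".equals(div.getAttribute("class"))) {
                pages.add(div.getTextContent().trim());
            }
        }
        return pages;
    }
}
```

This loads the whole document into memory as a DOM, so it is probably only reasonable for small to medium PDFs.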

Depending on what you want to do, it might make sense to write a custom ContentHandler which works a lot like the ToTextContentHandler in Tika, but which starts writing to a new text buffer each time it hits the event for a new page.
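
A minimal sketch of such a handler might look like this. It is not Tika code: PerPageTextHandler and getPages are names invented for the example, it extends the JDK's DefaultHandler rather than ToTextContentHandler, and it assumes each page arrives as a <div class="page"> start/end event pair.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of each page into its own buffer.
class PerPageTextHandler extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current; // null while outside a page div
    private int depth;             // div nesting depth inside the current page

    private static String nameOf(String localName, String qName) {
        // localName is empty when the source is not namespace-aware
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"div".equals(nameOf(localName, qName))) {
            return;
        }
        if (current == null) {
            // Assumption: a new page is signalled by <div class="page">
            if ("page".equals(atts.getValue("class"))) {
                current = new StringBuilder();
                depth = 1;
            }
        } else {
            depth++; // a nested div inside the page
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null || !"div".equals(nameOf(localName, qName))) {
            return;
        }
        if (--depth == 0) {
            // The page div has closed: store its text and reset the buffer
            pages.add(current.toString().trim());
            current = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Keep whitespace events too, so paragraphs do not run together
        characters(ch, start, length);
    }

    public List<String> getPages() {
        return pages;
    }
}
```

With Tika you would pass an instance as the ContentHandler argument of AutoDetectParser.parse(stream, handler, metadata, context) and read getPages() afterwards. Since it streams, it never needs to hold the full XHTML in memory.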

Nick
