Re: Tika 2.1.0 pdf parser

Nick Burch Thu, 21 Oct 2021 11:27:10 -0700

On Thu, 21 Oct 2021, nskarthik wrote:

Question: I need to extract text / images at page level using Java. I did not find any example on the web or the Tika website.

For PDF, you should fetch the contents as XHTML rather than plain text. You can then split on the page divs. This isn't available for formats which aren't page-based, but luckily PDF is.
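
As a rough sketch of the "split on the page divs" step: the code below assumes you already have the XHTML as a String (with Tika that would typically come from passing a ToXMLContentHandler to the parser and calling toString() on it), and that each page is wrapped in a <div class="page"> element, which is my understanding of what the PDF parser emits. The class name PageDivSplitter and the method splitPages are made up for illustration. It uses only the JDK's DOM parser, so it does not show the Tika call itself.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: splits an XHTML string into one text chunk per page div.
class PageDivSplitter {

    // Returns the text of every <div class="page">, in document order.
    // Assumption: pages are marked with class="page" on a div element.
    static List<String> splitPages(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            NodeList divs = builder
                    .parse(new InputSource(new StringReader(xhtml)))
                    .getElementsByTagName("div");

            List<String> pages = new ArrayList<>();
            for (int i = 0; i < divs.getLength(); i++) {
                Element div = (Element) divs.item(i);
                if ("page".equals(div.getAttribute("class"))) {
                    // getTextContent() concatenates all text below the div.
                    pages.add(div.getTextContent().trim());
                }
            }
            return pages;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
    }
}
```

Calling PageDivSplitter.splitPages(xhtml) then gives a list whose first entry is the text of page 1, the second the text of page 2, and so on. This loads the whole document into memory, so it is probably only reasonable for modestly sized PDFs.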

Depending on what you want to do, it might make sense to write a custom ContentHandler which works a lot like the ToTextContentHandler in Tika, but which starts writing to a new text buffer each time it hits the event for a new page.
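
A minimal sketch of such a handler, under the same assumption that a new page shows up as a start-element event for <div class="page">. PageTextHandler, getPages and parseXhtml are hypothetical names, and this is not a copy of ToTextContentHandler, just the same general idea built on the JDK's DefaultHandler. The parseXhtml helper drives the handler with the JDK SAX parser so the example runs without Tika.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects character data into one buffer per page.
class PageTextHandler extends DefaultHandler {
    private final List<StringBuilder> pages = new ArrayList<>();
    private int depthInPage = 0; // 0 means we are outside any page div

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName is empty when the parser is not namespace-aware.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if (depthInPage > 0) {
            depthInPage++; // nested element inside the current page
        } else if ("div".equals(name) && "page".equals(atts.getValue("class"))) {
            pages.add(new StringBuilder()); // new page: start a fresh buffer
            depthInPage = 1;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depthInPage > 0) {
            depthInPage--; // reaches 0 when the page div itself closes
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depthInPage > 0) {
            pages.get(pages.size() - 1).append(ch, start, length);
        }
    }

    // Text collected so far, one entry per page.
    List<String> getPages() {
        List<String> result = new ArrayList<>();
        for (StringBuilder page : pages) {
            result.add(page.toString().trim());
        }
        return result;
    }

    // Demo driver using the JDK SAX parser instead of Tika.
    static List<String> parseXhtml(String xhtml) {
        try {
            PageTextHandler handler = new PageTextHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getPages();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
    }
}
```

With Tika on the classpath, the same handler should be usable as the ContentHandler argument of Parser.parse(InputStream, ContentHandler, Metadata, ParseContext), after which getPages() returns the per-page text. Because it works on the event stream, it never needs the whole XHTML document in memory.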


Nick
