Re: Tika 2.1.0 pdf parser

nskarthik Fri, 22 Oct 2021 07:23:50 -0700

Hi

Thanks for the suggestion.


Is there a simple example of this approach?

If so, please share it.


with regards
Karthik

On 2021/10/21 18:26:58, Nick Burch <[email protected]> wrote: 
> On Thu, 21 Oct 2021, nskarthik wrote:
> > Question: I need to extract text / images at page level using Java.
> > I did not find any example on the web or the Tika website.
> 
> For PDF, you should fetch the contents as XHTML rather than plain text. 
> You can then split on the page divs. This isn't available for formats 
> which aren't page-based, but luckily PDF is.
> 
> Depending on what you want to do, it might make sense to write a custom 
> ContentHandler which works a lot like the ToTextContentHandler in Tika, 
> but which starts writing to a new text buffer each time it hits the event 
> for a new page.
> 
> Nick
>
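
For reference, the custom ContentHandler idea described above could look
roughly like the sketch below. It is only an illustration, not code from
Tika: the class name PageTextHandler and its getPages() method are made up
for this example, and it assumes the PDF parser's XHTML output wraps each
page in a <div class="page"> element. It uses only the SAX classes that
ship with the JDK, and it collects text only (no images).

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of each page into its own buffer.
// Assumption: every page arrives as <div class="page"> ... </div>.
class PageTextHandler extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current; // null while outside a page div
    private int divDepth;          // open divs, counting the page div itself

    // Fall back to the qualified name if the parser is not namespace-aware.
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if (!"div".equals(name(localName, qName))) {
            return;
        }
        if (current == null) {
            // Start a fresh buffer when a page div opens.
            if ("page".equals(atts.getValue("class"))) {
                current = new StringBuilder();
                divDepth = 1;
            }
        } else {
            divDepth++; // a div nested inside the current page
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) {
            return;
        }
        String n = name(localName, qName);
        if ("p".equals(n)) {
            current.append('\n'); // keep paragraphs on separate lines
        } else if ("div".equals(n) && --divDepth == 0) {
            // The page div itself closed: store the finished page.
            pages.add(current.toString());
            current = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    // One entry per page, in document order.
    public List<String> getPages() {
        return pages;
    }
}
```

With Tika on the classpath, an instance of this handler would be passed as
the ContentHandler argument of
AutoDetectParser.parse(stream, handler, metadata, context), and getPages()
read afterwards. Text outside any page div is ignored by design.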
