Re: Tika 2.1.0 pdf parser

Tim Allison Fri, 22 Oct 2021 13:43:38 -0700

Hi Karthik,

  Tika hasn't been set up well to extract images and text per page.
As Nick pointed out, we do mark page breaks in the xhtml, and we do
put links for image locations within the text for file types that
support that.
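
  As a rough illustration of using those page markers: the PDF parser
wraps each page in <div class="page"> in its XHTML output, so a SAX
ContentHandler can collect text page by page. The sketch below is not
part of Tika. The class name PageSplitter and its split helper are
hypothetical, it uses only JDK classes, and it parses a hand-written
XHTML sample rather than real Tika output. The img src="embedded:..."
form in the sample is assumed for illustration. With Tika on the
classpath, the same handler could be passed to
AutoDetectParser.parse(stream, handler, metadata, context).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects text (and image link markers) per <div class="page">.
class PageSplitter extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current;   // non-null only while inside a page div
    private int divDepth;            // nesting depth of divs within the current page

    // SAX parsers that are not namespace-aware leave localName empty, so fall back to qName.
    private static String nameOf(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = nameOf(localName, qName);
        if ("div".equals(name)) {
            if (current != null) {
                divDepth++;                       // nested div inside a page
            } else if ("page".equals(atts.getValue("class"))) {
                current = new StringBuilder();    // a new page starts here
                divDepth = 1;
            }
        } else if ("img".equals(name) && current != null) {
            // Record where an image link occurs within the page text.
            current.append("[image: ").append(atts.getValue("src")).append("]");
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("div".equals(nameOf(localName, qName)) && current != null && --divDepth == 0) {
            pages.add(current.toString().trim()); // page div closed
            current = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    List<String> getPages() {
        return pages;
    }

    // Parses an XHTML string with the JDK SAX parser and returns one entry per page.
    static List<String> split(String xhtml) throws Exception {
        PageSplitter handler = new PageSplitter();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getPages();
    }

    public static void main(String[] args) throws Exception {
        // Hand-written stand-in for Tika's XHTML output (an assumption, not real output).
        String sample = "<html><body>"
                + "<div class=\"page\"><p>First page</p><img src=\"embedded:image0.png\"/></div>"
                + "<div class=\"page\"><p>Second page</p></div>"
                + "</body></html>";
        List<String> pages = split(sample);
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("Page " + (i + 1) + ": " + pages.get(i));
        }
    }
}
```

  This only records where image links fall within each page's text.
Getting the image bytes themselves would still need Tika's embedded
document handling, and file types without page markers would come back
with no pages at all.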

   Part of the challenge is that not all document types are paged
(doc/docx). Images also get tricky quickly
(https://issues.apache.org/jira/browse/TIKA-3416).

  There was a request to do something like this here:
https://issues.apache.org/jira/browse/TIKA-3348, and I feel like we've
been getting more requests to do this.  We might want to improve our
/unpack endpoint or create a new one.  I don't think I'll be able to
work on this for a bit.  Let us know what you find and how you solve
this.

      Best,

            Tim

On Fri, Oct 22, 2021 at 11:15 AM nskarthik <[email protected]> wrote:
>
> Hi
>
> I plan to get text/images out of pdf/docx/xlsx/html/csv/mht and so on.
>
> Instead of using POI / PDFBox / ..., I thought Tika would be a single
> source of data extraction.
>
> Hence I wanted to use it.
>
>
> with regards
> Karthik
>
> On 2021/10/22 14:41:38, AJ Weber <[email protected]> wrote:
> >
> > >>> Question :  Need to extract Text / images at page level using java.
> > >>> Did not find any example on www or Tika website.
> >
> > Why not use a library specifically suited to the job like Apache PDFBox 
> > (directly)?
> >
> >
> >
