The other complication is how to handle embedded files. Perhaps punt on
them to start?
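
For the per-page part, here is a rough, untested-against-real-files sketch of
splitting Tika's XHTML output on the page markers Tim describes below. It
assumes you already have the XHTML as a string (for example from a
ToXMLContentHandler), that each page is wrapped in <div class="page"> as Tika
does for PDFs, and that image locations show up as <img src="...">. The names
PageSplitSketch, Page and splitPages are made up for illustration, and it
only uses the JDK's SAX parser, nothing from Tika itself.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: groups text and image references by page marker.
class PageSplitSketch {

    /** Text and image references collected for one page. */
    static final class Page {
        final StringBuilder text = new StringBuilder();
        final List<String> imageRefs = new ArrayList<>();
    }

    /** Splits XHTML on <div class="page"> elements (assumed page marker). */
    static List<Page> splitPages(String xhtml) throws Exception {
        List<Page> pages = new ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // the XHTML carries a default namespace

        DefaultHandler handler = new DefaultHandler() {
            private Page current; // null while outside any page div
            private int depth;    // open elements inside the current page div

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (current == null) {
                    if ("div".equals(localName)
                            && "page".equals(atts.getValue("", "class"))) {
                        current = new Page();
                        depth = 1;
                    }
                    return;
                }
                depth++;
                String src = atts.getValue("", "src");
                if ("img".equals(localName) && src != null) {
                    current.imageRefs.add(src);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // The page ends when its own div closes.
                if (current != null && --depth == 0) {
                    pages.add(current);
                    current = null;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (current != null) {
                    current.text.append(ch, start, length);
                }
            }
        };

        factory.newSAXParser()
               .parse(new InputSource(new StringReader(xhtml)), handler);
        return pages;
    }
}
```

This does nothing for unpaged formats like doc/docx, where there are no page
divs to split on, and it only records where images are referenced rather than
pulling out the image bytes.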

On Fri, Oct 22, 2021 at 4:43 PM Tim Allison <[email protected]> wrote:

> Hi Karthik,
>
>   Tika hasn't been set up well to extract images and text per page.
> As Nick pointed out, we do mark page breaks in the xhtml, and we do
> put links for image locations within the text for file types that
> support that.
>
>    Part of the challenge is that not all document types are paged
> (doc/docx), but also images get tricky quickly
> (https://issues.apache.org/jira/browse/TIKA-3416).
>
>   There was a request to do something like this here:
> https://issues.apache.org/jira/browse/TIKA-3348, and I feel like we've
> been getting more requests to do this.  We might want to improve our
> /unpack endpoint or create a new one.  I don't think I'll be able to
> work on this for a bit.  Let us know what you find and how you solve
> this.
>
>
>       Best,
>
>             Tim
>
> On Fri, Oct 22, 2021 at 11:15 AM nskarthik <[email protected]> wrote:
> >
> > Hi
> >
> > I plan to get text/images out of pdf/docx/xlsx/html/csv/mht and so on.
> >
> > Instead of using POI / PDFBox /... thought Tika would be single source
> > of Data extraction...
> >
> > Hence wanted to use the same.
> >
> >
> > with regards
> > Karthik
> >
> > On 2021/10/22 14:41:38, AJ Weber <[email protected]> wrote:
> > >
> > > >>> Question :  Need to extract Text / images at page level using java.
> > > >>> Did not find any example on www or Tika website.
> > >
> > > > Why not use a library specifically suited to the job like Apache
> > > > PDFBox (directly)?
> > >
> > >
> > >
>
