Hi

OK, so you are saying POI is currently only a text extractor for doc/docx.

I will do some homework and get back on the same.
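
As a starting point for that homework, here is a minimal sketch of per-page splitting based on the page-break markers mentioned in the quoted replies below. It assumes the XHTML has already been produced by Tika (for example with AutoDetectParser writing into a ToXMLContentHandler), that each page is wrapped in <div class="page"> as Tika's PDF parser does, and that images appear as <img src="..."> links. PageSplitter, Page and splitPages are hypothetical names used only for illustration, and only the JDK's SAX parser is used. Paged markers will not be present for unpaged formats such as doc/docx.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: splits Tika-style XHTML into pages.
public class PageSplitter {

    /** Text and image links collected for one page. */
    public static class Page {
        public final StringBuilder text = new StringBuilder();
        public final List<String> images = new ArrayList<>();
    }

    /** Assumes well-formed XHTML with one <div class="page"> per page. */
    public static List<Page> splitPages(String xhtml) throws Exception {
        List<Page> pages = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private Page current; // page being filled, or null outside a page
            private int depth;    // element nesting inside the current page div

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (current == null) {
                    // Start a new page only on the page marker div.
                    if ("div".equals(qName) && "page".equals(atts.getValue("class"))) {
                        current = new Page();
                        pages.add(current);
                        depth = 0;
                    }
                    return;
                }
                depth++;
                // Record image links found inside the page.
                if ("img".equals(qName) && atts.getValue("src") != null) {
                    current.images.add(atts.getValue("src"));
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (current == null) {
                    return;
                }
                if (depth == 0) {
                    current = null; // the page div itself just closed
                } else {
                    depth--;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (current != null) {
                    current.text.append(ch, start, length);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return pages;
    }
}
```

Calling PageSplitter.splitPages(xhtml) returns one Page per marker div, in document order. Embedded files and nested page markers are not handled here.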

This thread can be closed.

Thanks for the help, much appreciated.

On 2021/10/22 20:57:00, Tim Allison <[email protected]> wrote: 
> The other complication is how to handle embedded files. Perhaps punt on
> them to start?
> 
> On Fri, Oct 22, 2021 at 4:43 PM Tim Allison <[email protected]> wrote:
> 
> > Hi Karthik,
> >
> >   Tika hasn't been set up well to extract images and text per page.
> > As Nick pointed out, we do mark page breaks in the xhtml, and we do
> > put links for image locations within the text for file types that
> > support that.
> >
> >    Part of the challenge is that not all document types are paged
> > (doc/docx), but also images get tricky quickly
> > (https://issues.apache.org/jira/browse/TIKA-3416).
> >
> >   There was a request to do something like this here:
> > https://issues.apache.org/jira/browse/TIKA-3348, and I feel like we've
> > been getting more requests to do this.  We might want to improve our
> > /unpack endpoint or create a new one.  I don't think I'll be able to
> > work on this for a bit.  Let us know what you find and how you solve
> > this.
> >
> >
> >       Best,
> >
> >             Tim
> >
> > On Fri, Oct 22, 2021 at 11:15 AM nskarthik <[email protected]> wrote:
> > >
> > > Hi
> > >
> > > I plan to get text/images out of pdf/docx/xlsx/html/csv/mht and so on.
> > >
> > > Instead of using POI / PDFBox / ..., I thought Tika would be a single
> > > source of data extraction.
> > >
> > > Hence I wanted to use the same.
> > >
> > >
> > > with regards
> > > Karthik
> > >
> > > On 2021/10/22 14:41:38, AJ Weber <[email protected]> wrote:
> > > >
> > > > >>> Question :  Need to extract Text / images at page level using java.
> > > > >>> Did not find any example on www or Tika website.
> > > >
> > > > Why not use a library specifically suited to the job like Apache
> > > > PDFBox (directly)?
> > > >
> > > >
> > > >
> >
> 
