Hi,

OK, so you are saying POI is currently the only text extractor for doc/docx...
I will do some homework and get back to you on this. This thread can be closed.

Thanks for the help, much appreciated.

On 2021/10/22 20:57:00, Tim Allison <[email protected]> wrote:
> The other complication is how to handle embedded files. Perhaps punt on
> them to start?
>
> On Fri, Oct 22, 2021 at 4:43 PM Tim Allison <[email protected]> wrote:
>
> > Hi Karthik,
> >
> > Tika hasn't been set up well to extract images and text per page.
> > As Nick pointed out, we do mark page breaks in the xhtml, and we do
> > put links for image locations within the text for file types that
> > support that.
> >
> > Part of the challenge is that not all document types are paged
> > (doc/docx), but also images get tricky quickly
> > (https://issues.apache.org/jira/browse/TIKA-3416).
> >
> > There was a request to do something like this here:
> > https://issues.apache.org/jira/browse/TIKA-3348, and I feel like we've
> > been getting more requests to do this. We might want to improve our
> > /unpack endpoint or create a new one. I don't think I'll be able to
> > work on this for a bit. Let us know what you find and how you solve
> > this.
> >
> > Best,
> >
> > Tim
> >
> > On Fri, Oct 22, 2021 at 11:15 AM nskarthik <[email protected]> wrote:
> >
> > > Hi
> > >
> > > I plan to get Text/images out of pdf/docx/xlsx/html/csv/mht... and so on
> > >
> > > Instead of using POI / PDFBox /... thought Tika would be single source
> > > of Data extraction...
> > >
> > > Hence wanted to use the same.
> > >
> > > with regards
> > > Karthik
> > >
> > > On 2021/10/22 14:41:38, AJ Weber <[email protected]> wrote:
> > > >
> > > > >>> Question : Need to extract Text / images at page level using java.
> > > > >>> Did not find any example on www or Tika website.
> > > >
> > > > Why not use a library specifically suited to the job like Apache
> > > > PDFBox (directly)?
