The other complication is how to handle embedded files. Perhaps punt on them to start?
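
For the per-page text part, here is a rough sketch that leans on the page-break markup mentioned below. It assumes you already have Tika's XHTML output as a string, and that each page is wrapped in a <div class="page"> element, which is what the PDF parser emits as far as I know; other paged formats may mark pages differently. The names PageSplitter and splitPages are made up for illustration, and the code uses only the JDK's SAX parser, not any Tika API. It makes no attempt to separate out embedded files.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: splits XHTML into per-page text.
// Assumption: pages are delimited by <div class="page"> elements.
public class PageSplitter {

    /** Returns the text inside each page div, in document order. */
    public static List<String> splitPages(String xhtml) throws Exception {
        List<String> pages = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder current; // null while outside a page div
            private int depth;             // div nesting depth inside the current page

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (!"div".equals(qName)) {
                    return;
                }
                if (current == null && "page".equals(atts.getValue("class"))) {
                    current = new StringBuilder(); // a new page starts here
                    depth = 1;
                } else if (current != null) {
                    depth++; // nested div inside the current page
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (current == null) {
                    return;
                }
                if ("p".equals(qName)) {
                    current.append('\n'); // keep paragraphs on separate lines
                }
                if ("div".equals(qName) && --depth == 0) {
                    pages.add(current.toString().trim()); // page div closed
                    current = null;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (current != null) {
                    current.append(ch, start, length);
                }
            }
        };
        // The default factory is not namespace-aware, so qName is the plain tag name.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return pages;
    }

    public static void main(String[] args) throws Exception {
        // Hand-written sample standing in for real Tika output.
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div class=\"page\"><p>First page</p></div>"
                + "<div class=\"page\"><p>Second page</p></div>"
                + "</body></html>";
        List<String> pages = splitPages(sample);
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("Page " + (i + 1) + ": " + pages.get(i));
        }
    }
}
```

Documents that are not paged (doc/docx) would simply come back as an empty list with this approach, so a caller would need a fallback for those.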

On Fri, Oct 22, 2021 at 4:43 PM Tim Allison <[email protected]> wrote:

> Hi Karthik,
>
> Tika hasn't been set up well to extract images and text per page.
> As Nick pointed out, we do mark page breaks in the xhtml, and we do
> put links for image locations within the text for file types that
> support that.
>
> Part of the challenge is that not all document types are paged
> (doc/docx), but also images get tricky quickly
> (https://issues.apache.org/jira/browse/TIKA-3416).
>
> There was a request to do something like this here:
> https://issues.apache.org/jira/browse/TIKA-3348, and I feel like we've
> been getting more requests to do this. We might want to improve our
> /unpack endpoint or create a new one. I don't think I'll be able to
> work on this for a bit. Let us know what you find and how you solve
> this.
>
> Best,
>
> Tim
>
> On Fri, Oct 22, 2021 at 11:15 AM nskarthik <[email protected]> wrote:
> >
> > Hi
> >
> > I plan to get Text/images out of pdf/docx/xlsx./html/csv/mht......so on
> >
> > Instead of using POI / PDFBox /... thought Tika would be single source
> > of Data extraction...
> >
> > Hence wanted to use the same.
> >
> > with regards
> > Karthik
> >
> > On 2021/10/22 14:41:38, AJ Weber <[email protected]> wrote:
> > >
> > > >>> Question : Need to extract Text / images at page level using java.
> > > >>> Did not find any example on www or Tika website.
> > >
> > > Why not use a library specifically suited to the job like Apache
> > > PDFBox (directly)?
