Re: Tika 2.1.0 pdf parser

2021-10-22 Thread nskarthik
Hi, OK, so you are saying POI is currently the only text extractor for doc/docx. I will do some homework and get back on this. This thread can be closed. Thanks for the help, much appreciated. On 2021/10/22 20:57:00, Tim Allison wrote: > The other complication is how to handle embedded files. Perhaps punt on > the

Re: Tika 2.1.0 pdf parser

2021-10-22 Thread Tim Allison
The other complication is how to handle embedded files. Perhaps punt on them to start? On Fri, Oct 22, 2021 at 4:43 PM Tim Allison wrote: > Hi Karthik, > > Tika hasn't been set up well to extract images and text per page. > As Nick pointed out, we do mark page breaks in the xhtml, and we do >

Re: Tika 2.1.0 pdf parser

2021-10-22 Thread Tim Allison
Hi Karthik, Tika hasn't been set up well to extract images and text per page. As Nick pointed out, we do mark page breaks in the xhtml, and we do put links for image locations within the text for file types that support that. Part of the challenge is that not all document types are paged (do

Re: Tika 2.1.0 pdf parser

2021-10-22 Thread nskarthik
Hi, I plan to get text/images out of pdf/docx/xlsx/html/csv/mht and so on. Instead of using POI, PDFBox, etc. separately, I thought Tika would be a single source for data extraction, hence I wanted to use it. With regards, Karthik On 2021/10/22 14:41:38, AJ Weber wrote: > > >>> Question : Need to ex

Re: Tika 2.1.0 pdf parser

2021-10-22 Thread AJ Weber
> Question : Need to extract Text / images at page level using java. > Did not find any example on www or Tika website. Why not use a library specifically suited to the job, like Apache PDFBox (directly)?

Re: Tika 2.1.0 pdf parser

2021-10-22 Thread nskarthik
Hi, thanks for the suggestion. Do we have a simple example for this? Please share. With regards, Karthik On 2021/10/21 18:26:58, Nick Burch wrote: > On Thu, 21 Oct 2021, nskarthik wrote: > > Question : Need to extract Text / images at page level using java. > > Did not find any example on

Re: Tika 2.1.0 pdf parser

2021-10-21 Thread Nick Burch
On Thu, 21 Oct 2021, nskarthik wrote: > Question : Need to extract Text / images at page level using java. > Did not find any example on www or Tika website. For PDF, you should fetch the contents as XHTML rather than plain text. You can then split on the page divs. This isn't available for forma
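
The "fetch XHTML, then split on the page divs" suggestion above could look roughly like the sketch below. It is only an illustration, not an official Tika example: the class name PageSplitter and the method splitPages are hypothetical, and the code assumes the XHTML is well-formed and that each page is wrapped in a <div class="page"> element, which is how Tika's PDF parser marks pages. To keep the sketch runnable with the JDK alone, the XHTML is a hard-coded string; in a real run it would come from Tika, for example by passing an org.apache.tika.sax.ToXMLContentHandler to AutoDetectParser.parse(...) and calling toString() on the handler.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika: splits Tika-style XHTML into per-page text.
class PageSplitter {

    /** Returns the text of each <div class="page"> element, in document order. */
    static List<String> splitPages(String xhtml) throws Exception {
        // Parse the XHTML as plain XML (assumes it is well-formed).
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        List<String> pages = new ArrayList<>();
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            // Keep only the divs that mark a page; other divs are ignored.
            if ("page".equals(div.getAttribute("class"))) {
                pages.add(div.getTextContent().trim());
            }
        }
        return pages;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for XHTML produced by Tika; the content here is made up.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title/></head><body>"
                + "<div class=\"page\"><p>First page text</p></div>"
                + "<div class=\"page\"><p>Second page text</p></div>"
                + "</body></html>";

        List<String> pages = splitPages(xhtml);
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("Page " + (i + 1) + ": " + pages.get(i));
        }
    }
}
```

Note that tika-core on its own does not include the PDF parser, so producing the XHTML from a real PDF would also need the Tika parser modules on the classpath.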

Tika 2.1.0 pdf parser

2021-10-21 Thread nskarthik
Hi, Spec: JDK 15.0, tika-core-2.1.0.jar, Windows 10. Process: text extraction at page level, using Java, from simple PDFs that are not password-protected. Question: Need to extract text/images at page level using Java. Did not find any example on the web or the Tika website. Request: Please share Java sni