Re: Exclude headers & footers for PDF & PPT

Tim Allison Wed, 04 Sep 2019 08:18:24 -0700

Hi Kushal,

  Removing headers and footers from PDFs is a notoriously hard problem
because text is stored only by x/y coordinates, not by object type, so
nothing marks a given run of text as a header or footer.  You might try
the GROBID PDF parser and see if that works.  If your PDFs are
homogeneous enough, regexes over the extracted text should do the
trick.  But, sorry, we don't have a generalizable solution to that.
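
  To make the regex idea concrete, here is a minimal sketch in plain
Java that post-processes text already extracted from a PDF.  The class
name, the "ACME Quarterly Report" header, and the "Page N of M" footer
are made up for illustration; you would substitute whatever your own
documents repeat on every page.  It only helps if that boilerplate is
consistent enough to match line by line.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: strips repeated header/footer
// lines from text that has already been extracted.
class HeaderFooterStripper {
    // Assumed footer shape: a line containing only "Page N of M".
    private static final Pattern FOOTER =
            Pattern.compile("(?m)^[ \\t]*Page \\d+ of \\d+[ \\t]*$\\R?");
    // Assumed header shape: a line containing only a fixed title.
    private static final Pattern HEADER =
            Pattern.compile("(?m)^[ \\t]*ACME Quarterly Report[ \\t]*$\\R?");

    // Removes every line that matches the footer or header pattern,
    // including its trailing line break.
    static String strip(String text) {
        String withoutFooters = FOOTER.matcher(text).replaceAll("");
        return HEADER.matcher(withoutFooters).replaceAll("");
    }

    public static void main(String[] args) {
        String extracted =
                "ACME Quarterly Report\nRevenue grew this quarter.\nPage 1 of 2\n"
              + "ACME Quarterly Report\nCosts were flat.\nPage 2 of 2\n";
        System.out.print(strip(extracted));
    }
}
```

  In your code you would call something like
HeaderFooterStripper.strip(handler.toString()) before sending the text
to Solr.  Anchoring the patterns to whole lines keeps them from
deleting body text that merely mentions the same words.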


  As for PPT and PPTX, from the source code, it _looks_ like the most
recent version of Tika (at least) should respect
includeHeadersAndFooters.  There's a chance that what you're seeing as
headers and footers aren't technically stored as those objects, but
rather show up as boilerplate text boxes copied onto each slide, or as
something else.  If you can share a file on our JIRA, I'll take a look
and let you know if there's anything we can do.

   Best,

            Tim

On Wed, Sep 4, 2019 at 3:18 AM Khare, Kushal (MIND)
<[email protected]> wrote:
>
> Hello community!
>
> I am new to Tika, and I am using it to parse documents for indexing
> in Solr.
> I have the content extraction working, but the problem I am facing is
> excluding the text in the headers and footers of documents in PDF and
> PPT format. I handled the Word and Excel formats with the help of
> OfficeParserConfig.
> I searched the internet and tried several approaches but could not
> achieve it. Please help me get this done.
>
> I am using the following code:
>
> public void parseExample() {
>     ParseContext parseContext = new ParseContext();
>     AutoDetectParser parser = new AutoDetectParser();
>     BodyContentHandler handler = new BodyContentHandler();
>     Metadata metadata = new Metadata();
>
>     OfficeParserConfig officeParserConfig = new OfficeParserConfig();
>     officeParserConfig.setIncludeHeadersAndFooters(false);
>     boolean hf = officeParserConfig.getIncludeHeadersAndFooters();
>     parseContext.set(OfficeParserConfig.class, officeParserConfig);
>     System.out.println("headfoot" + hf);
>
>     try (FileInputStream fin = new FileInputStream("D:\\docs\\Out22.docx")) {
>         parser.parse(fin, handler, metadata, parseContext);
>         String text = handler.toString();
>         System.out.println("output :" + text);
>     } catch (IOException | SAXException | TikaException ex) {
>         ex.printStackTrace();
>     }
> }
>
> What could be the way to deal with PDF & PPT? I also read about 
> configuring/customizing Tika, but have no idea how to proceed with it. Please 
> help!
>
>
> Cheers!
>
> Kushal
