Hello community!
I am new to Tika and am using it to parse documents for indexing into Solr.
Content extraction works, but I need to exclude the header and footer text
of documents in PDF and PPT format. For the Word and Excel formats I managed
this with the help of OfficeParserConfig.
I searched the internet and tried several approaches, but could not achieve
it. Please help me get this done.

I am using the following code:

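import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.OfficeParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public void parseExample() {
    ParseContext parseContext = new ParseContext();
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();

    // Turn off header/footer extraction for the Office formats.
    OfficeParserConfig officeParserConfig = new OfficeParserConfig();
    officeParserConfig.setIncludeHeadersAndFooters(false);
    boolean hf = officeParserConfig.getIncludeHeadersAndFooters();
    parseContext.set(OfficeParserConfig.class, officeParserConfig);
    System.out.println("headfoot: " + hf);

    try (FileInputStream fin = new FileInputStream("D:\\docs\\Out22.docx")) {
        parser.parse(fin, handler, metadata, parseContext);
        String text = handler.toString();
        System.out.println("output: " + text);
    } catch (IOException | SAXException | TikaException ex) {
        ex.printStackTrace();
    }
}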

How can I handle PDF and PPT in the same way? I have also read about
configuring/customizing Tika, but I have no idea how to proceed with it.
Please help!
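
A possible fallback, independent of Tika, is sketched below: if the text can be obtained one page (or slide) at a time, which the code above does not do, lines that repeat on most pages can be treated as headers or footers and dropped. This is only a heuristic under that assumption. The class RepeatedLineFilter, the method stripRepeatedLines, and the threshold parameter are hypothetical names for illustration and are not part of any Tika API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RepeatedLineFilter {

    // Hypothetical helper, not a Tika API: removes lines that occur on at
    // least minFraction of the pages, treating them as headers or footers.
    public static List<String> stripRepeatedLines(List<String> pages, double minFraction) {
        // Count on how many pages each normalized line appears.
        Map<String, Integer> pageCounts = new HashMap<>();
        for (String page : pages) {
            Set<String> seenOnPage = new HashSet<>();
            for (String line : page.split("\\R")) {
                String key = normalize(line);
                if (!key.isEmpty() && seenOnPage.add(key)) {
                    pageCounts.merge(key, 1, Integer::sum);
                }
            }
        }

        // Rebuild each page without the lines that repeat often enough.
        List<String> cleaned = new ArrayList<>();
        for (String page : pages) {
            StringBuilder sb = new StringBuilder();
            for (String line : page.split("\\R")) {
                int count = pageCounts.getOrDefault(normalize(line), 0);
                // With a single page there is nothing to compare against.
                boolean repeated = pages.size() > 1
                        && count >= minFraction * pages.size();
                if (!repeated) {
                    if (sb.length() > 0) {
                        sb.append('\n');
                    }
                    sb.append(line);
                }
            }
            cleaned.add(sb.toString());
        }
        return cleaned;
    }

    // Replace digit runs so that "Page 1" and "Page 2" count as the same line.
    private static String normalize(String line) {
        return line.trim().replaceAll("\\d+", "#");
    }
}
```

It could be called as RepeatedLineFilter.stripRepeatedLines(pages, 0.6), where pages holds one string per page. The digit normalization can also remove body lines that differ only by a number and appear on most pages, so the threshold would need tuning against real documents.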

Cheers!
Kushal
