I am parsing PDF documents in order to index them in Solr, specifically through 
Solr Cell. My problem is excluding certain areas of a document from being 
indexed, for example the copyright section. The same applies to HTML pages, 
where I don't want to index the header, footer, or other irrelevant sections. I 
looked at the MatchingContentHandler API, which uses XPath. However, the 
XPathParser supports only a very limited set of XPath features. Ideally, I want 
to use an XPath expression such as:

//xhtml:body/xhtml:div[not(contains(p,'EXCLUDE TEXT'))]
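
To make the intent concrete, here is a minimal sketch of the filtering I am 
after. It uses the JDK's own javax.xml.xpath engine on a DOM, not Tika's 
XPathParser or MatchingContentHandler, so it only illustrates the desired 
result and is not a Tika solution. The class name ExcludeDivSketch, the method 
indexableText, and the sample markup are made up for illustration. I wrote 
xhtml:p instead of p in the predicate because, with namespace-aware parsing, an 
unprefixed p would not match XHTML elements.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: keeps only the top-level divs that do not carry the marker text.
class ExcludeDivSketch {
    private static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Same idea as the expression above, with the p step namespace-qualified.
    private static final String KEEP_DIVS =
            "//xhtml:body/xhtml:div[not(contains(xhtml:p, 'EXCLUDE TEXT'))]";

    static String indexableText(String xhtml) throws Exception {
        // Parse the XHTML into a DOM; namespace awareness is needed for the xhtml: prefix.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        // Bind the "xhtml" prefix used in the expression to the XHTML namespace.
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override public String getNamespaceURI(String prefix) {
                return "xhtml".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            @Override public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "xhtml" : null;
            }
            @Override public Iterator<String> getPrefixes(String uri) {
                return XHTML_NS.equals(uri)
                        ? Collections.singletonList("xhtml").iterator()
                        : Collections.<String>emptyIterator();
            }
        });

        // Select the divs to keep and join their text, one div per line.
        NodeList divs = (NodeList) xpath.evaluate(KEEP_DIVS, doc, XPathConstants.NODESET);
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < divs.getLength(); i++) {
            if (text.length() > 0) {
                text.append('\n');
            }
            text.append(divs.item(i).getTextContent().trim());
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div><p>Real content to index</p></div>"
                + "<div><p>EXCLUDE TEXT Copyright notice</p></div>"
                + "</body></html>";
        System.out.println(indexableText(sample));
    }
}
```

Note that contains(xhtml:p, ...) only tests the first p child of each div; 
testing the whole div would need contains(., ...) instead. Building a DOM also 
gives up the streaming behaviour of Tika's content handlers, so this is 
probably only practical for small documents.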

I am thinking that I may need to customize Tika to handle these cases. Any 
suggestions on where I should start? What are my options here?

Any help is greatly appreciated.
