I am parsing a PDF document in order to index it in Solr, specifically with Solr Cell. My problem is excluding certain areas of the document from the index, for example the copyright section. The same applies to HTML pages, where I don't want to index the header, footer, or other irrelevant sections. I looked at the MatchingContentHandler API, which uses XPath. However, the XPathParser supports only a very limited subset of XPath. Ideally, I want to use an XPath expression such as:

//xhtml:body/xhtml:div[not(contains(p,'EXCLUDE TEXT'))]

I am thinking that I may need to customize Tika to handle these cases. Any suggestions on where I should start? What are my options here? Any help is greatly appreciated.
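
To make the goal concrete, the following is a minimal sketch of the intended filtering, written against the JDK's standard javax.xml.xpath engine rather than Tika's XPathParser. The class name, method name, and sample markup are made up for illustration, and it assumes the XHTML is already available as a string. It uses xhtml:p instead of the unprefixed p, since in XPath 1.0 an unprefixed name does not match elements in the XHTML namespace.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of Tika or Solr.
class ExcludeDivSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Returns the text of every top-level <div> in <body> that has no
    // <p> child containing the marker string.
    static List<String> keptDivText(String xhtml, String marker) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required so the xhtml: prefix can match
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the "xhtml" prefix used in the expression to the XHTML namespace.
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "xhtml".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "xhtml" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                return XHTML_NS.equals(uri)
                        ? Collections.singletonList("xhtml").iterator()
                        : Collections.<String>emptyIterator();
            }
        });
        // Pass the marker as an XPath variable to avoid quoting problems.
        xpath.setXPathVariableResolver(
                name -> "marker".equals(name.getLocalPart()) ? marker : null);

        NodeList divs = (NodeList) xpath.evaluate(
                "//xhtml:body/xhtml:div[not(xhtml:p[contains(., $marker)])]",
                doc, XPathConstants.NODESET);

        List<String> kept = new ArrayList<>();
        for (int i = 0; i < divs.getLength(); i++) {
            kept.add(divs.item(i).getTextContent().trim());
        }
        return kept;
    }

    public static void main(String[] args) throws Exception {
        String sample =
                "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
              + "<div><p>Main content</p></div>"
              + "<div><p>EXCLUDE TEXT: Copyright notice</p></div>"
              + "</body></html>";
        System.out.println(keptDivText(sample, "EXCLUDE TEXT"));
    }
}
```

This only demonstrates the selection logic on an in-memory DOM; whether and how it can be wired into Tika's SAX-based content handlers or Solr Cell is exactly the open question.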
