Excluding part of document when parsing.

Koorosh Vakhshoori Mon, 20 Jun 2011 14:29:50 -0700

I am parsing a PDF document in order to index it in Solr, specifically through 
Solr Cell. My problem is excluding certain areas of the document from being 
indexed, for example the copyright section. The same applies to HTML pages, 
where I don't want to index the footer, header, or other irrelevant sections. I 
looked at the MatchingContentHandler API, which uses XPath. However, the 
XPathParser supports only a very limited subset of XPath. Ideally, I want to 
use an XPath expression such as:


//xhtml:body/xhtml:div[not(contains(p,'EXCLUDE TEXT'))]

I am thinking that I may need to customize Tika to handle these cases. Any 
suggestions on where I should start? What are my options here?
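
One possible starting point, as a rough sketch rather than an established Tika 
recipe: since Tika reports parsed content as XHTML SAX events, a wrapping 
ContentHandler could buffer each outermost div and forward it only if its text 
does not contain the marker. The code below uses only the JDK's SAX classes. 
The names ExcludingDivHandler and ExcludeDemo are hypothetical, and the check 
looks at the whole text of the div instead of only its p child, so it 
approximates the XPath above without reproducing it exactly. It also assumes 
namespace-aware events, so that the local name is "div".

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical filter: drops each outermost div whose text contains a marker. */
class ExcludingDivHandler extends DefaultHandler {
    /** A buffered SAX event that can be replayed on the delegate later. */
    private interface Event { void replay(ContentHandler h) throws SAXException; }

    private final ContentHandler delegate;
    private final String marker;
    private final List<Event> buffer = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private int divDepth = 0; // > 0 while inside an outermost div

    ExcludingDivHandler(ContentHandler delegate, String marker) {
        this.delegate = delegate;
        this.marker = marker;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        boolean isDiv = "div".equals(local);
        if (divDepth == 0 && !isDiv) {
            delegate.startElement(uri, local, qName, atts); // outside any div: pass through
            return;
        }
        if (isDiv) divDepth++;
        Attributes copy = new AttributesImpl(atts); // the parser may reuse atts
        buffer.add(h -> h.startElement(uri, local, qName, copy));
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (divDepth == 0) {
            delegate.endElement(uri, local, qName);
            return;
        }
        buffer.add(h -> h.endElement(uri, local, qName));
        if ("div".equals(local) && --divDepth == 0) {
            // The outermost div is complete: replay it unless the marker was seen.
            if (text.indexOf(marker) < 0) {
                for (Event e : buffer) e.replay(delegate);
            }
            buffer.clear();
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (divDepth == 0) {
            delegate.characters(ch, start, length);
            return;
        }
        String s = new String(ch, start, length);
        text.append(s);
        buffer.add(h -> h.characters(s.toCharArray(), 0, s.length()));
    }
}

/** Small demo harness: returns the text that survives the filter. */
class ExcludeDemo {
    static String filter(String xml, String marker) throws Exception {
        StringBuilder out = new StringBuilder();
        ContentHandler sink = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so that the local name is "div"
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                new ExcludingDivHandler(sink, marker));
        return out.toString();
    }
}
```

With Tika, the idea would be to pass such a handler, wrapping the real content 
handler, to the parser's parse call. This sketch forwards only element and 
character events; a fuller version would also forward document start and end, 
prefix mappings, and ignorable whitespace. Whether Solr Cell lets you plug in 
a custom handler this way is something I have not verified.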

Any help is greatly appreciated.
