I'm trying to index a set of PDF documents with Solr 4.5.0. So far I can get Solr to ingest the entire document as one long string, stored in the index as "content". However, I want to index the structure within each document (pages and paragraphs), not just its flat text.

I know that the ExtractingRequestHandler uses Apache Tika to convert the documents to XHTML. I've used the Tika GUI to look at the XHTML representation, and I can see that each page is represented as a <div> element, and that structure within pages is represented by <p> elements. How do I configure Solr to index documents at this level of granularity?
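
To make the granularity I'm after concrete, here is a minimal Python sketch of the splitting step, done outside Solr. It is not a Solr configuration. It assumes the XHTML is well-formed, uses the standard XHTML namespace, and has one <div> per page directly under <body> with <p> children, as I saw in the Tika GUI. The sample markup, the function name, and the "id", "page" and "paragraph" field names are hypothetical; only "content" matches my current index.

```python
import xml.etree.ElementTree as ET

# Assumption: Tika's output uses the standard XHTML namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical sample standing in for the XHTML that Tika produces for a PDF.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>sample</title></head>
<body>
<div class="page"><p>First paragraph.</p><p>Second paragraph.</p></div>
<div class="page"><p>Only paragraph on page two.</p></div>
</body>
</html>"""


def paragraphs_by_page(xhtml, doc_id):
    """Return one record per non-empty <p>, tagged with page and paragraph numbers."""
    root = ET.fromstring(xhtml)
    body = root.find(XHTML_NS + "body")
    records = []
    # Each <div> directly under <body> is treated as one page.
    for page_no, div in enumerate(body.findall(XHTML_NS + "div"), start=1):
        for para_no, p in enumerate(div.findall(XHTML_NS + "p"), start=1):
            # itertext() also collects text nested in inline child elements.
            text = "".join(p.itertext()).strip()
            if not text:
                continue  # skip empty paragraphs
            records.append({
                "id": f"{doc_id}_p{page_no}_{para_no}",  # hypothetical id scheme
                "page": page_no,
                "paragraph": para_no,
                "content": text,
            })
    return records


if __name__ == "__main__":
    for record in paragraphs_by_page(SAMPLE, "doc1"):
        print(record)
```

Each record would then be one candidate Solr document. What I'd like to know is whether Solr can be configured to produce something equivalent itself.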

Dr Peter J Bleackley
Computational Linguistics Contractor
Playful Technology Ltd
