I'm trying to index a set of PDF documents with Solr 4.5.0. So far I can
get Solr to ingest each entire document as one long string, stored in the
index as "content". However, I want to index the structure within the documents.
I know that the ExtractingRequestHandler uses Apache Tika to convert the
documents to XHTML. I've used the Tika GUI to look at the XHTML
representation, and I can see that each page is represented as a <div>
element, and that structure within pages is represented by <p> elements.
How do I configure Solr to index documents at this level of granularity?
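
To make the target granularity concrete, here is a rough Python sketch of
the kind of split I have in mind, done outside Solr on the XHTML itself. It
is only an illustration, not a Solr configuration: the sample markup is
hand-written to resemble the Tika output described above, and the function
name and the field names (id, page, paragraph, content) are my own
assumptions rather than anything Solr or Tika defines.

```python
import xml.etree.ElementTree as ET

# Namespace used by XHTML documents.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hand-written sample that mimics the page/paragraph layout described
# above (one <div> per page, <p> elements inside). Not real Tika output.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>sample</title></head>
<body>
<div class="page"><p>First paragraph.</p><p>Second paragraph.</p></div>
<div class="page"><p>Third paragraph.</p></div>
</body>
</html>"""


def split_paragraphs(xhtml, doc_id):
    """Return one record per non-empty <p>, tagged with page and position.

    The record keys are hypothetical field names, chosen for illustration.
    """
    root = ET.fromstring(xhtml)
    body = root.find(XHTML_NS + "body")
    records = []
    for page_no, div in enumerate(body.findall(XHTML_NS + "div"), start=1):
        for para_no, p in enumerate(div.findall(XHTML_NS + "p"), start=1):
            # Join all nested text so inline markup inside <p> is kept.
            text = "".join(p.itertext()).strip()
            if text:
                records.append({
                    "id": f"{doc_id}_p{page_no}_{para_no}",
                    "page": page_no,
                    "paragraph": para_no,
                    "content": text,
                })
    return records


records = split_paragraphs(SAMPLE, "doc1")
for record in records:
    print(record)
```

Each record would then correspond to one unit I would like to be able to
search and retrieve separately, rather than the whole document at once.
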
Dr Peter J Bleackley
Computational Linguistics Contractor
Playful Technology Ltd