Using Solr Cell to index the internal structure of a PDF

Peter Bleackley Thu, 10 Oct 2013 03:48:45 -0700

I'm trying to index a set of PDF documents with Solr 4.5.0. So far I can get Solr to ingest the entire document as one long string, stored in the index as "content". However, I want to index structure within the documents.

I know that the ExtractingRequestHandler uses Apache Tika to convert the documents to XHTML. I've used the Tika GUI to look at the XHTML representation, and I can see that each page is represented as a <div> element, and that structure within pages is represented by <p> elements. How do I configure Solr to index documents at this level of granularity?
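
To make the target granularity concrete, here is a minimal Python sketch of the split I have in mind, done outside Solr. It is not a Solr Cell configuration. It assumes the XHTML has already been obtained from Tika as a string, that it uses the standard XHTML namespace, and that the page <div> elements are direct children of <body>. The function name split_pages, the field names, and the id scheme are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the XHTML is in the standard XHTML namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def split_pages(xhtml, doc_id="doc"):
    """Return one record per non-empty <p>, tagged with page and paragraph number."""
    root = ET.fromstring(xhtml)
    body = root.find(XHTML_NS + "body")
    if body is None:
        return []
    records = []
    # Assumption: each page is a <div> directly under <body>.
    for page_no, div in enumerate(body.findall(XHTML_NS + "div"), start=1):
        for para_no, p in enumerate(div.findall(XHTML_NS + "p"), start=1):
            text = "".join(p.itertext()).strip()
            if not text:
                continue  # skip empty paragraphs
            records.append({
                "id": f"{doc_id}-p{page_no}-{para_no}",  # hypothetical id scheme
                "page": page_no,
                "paragraph": para_no,
                "content": text,
            })
    return records


# Hand-written sample standing in for real Tika output.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="page"><p>First paragraph.</p><p>Second paragraph.</p></div>
<div class="page"><p>Page two text.</p></div>
</body></html>"""

if __name__ == "__main__":
    for record in split_pages(SAMPLE):
        print(record)
```

Each record would then correspond to one Solr document, so that a search hit could be traced back to a page and paragraph rather than to the whole PDF.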


Dr Peter J Bleackley
Computational Linguistics Contractor
Playful Technology Ltd
