On Thu, 30 Aug 2012, Alex Cougarman wrote:
> Hi. Is it possible to specifically extract footer/header and body text
> out of a Word document using Solr? In other words, we'd like to
> index/store those items in different Solr fields.
As long as they have a suitable style applied, yes, Tika will be able to
tell you.
If you run Tika against this sample document from POI:
https://svn.apache.org/repos/asf/poi/trunk/test-data/document/HeaderFooterUnicode.doc
You can see the headers and footers in the XHTML output:
<body><div class="header"><p>This is a simple header, with a € euro symbol
in it.</p>
</div>
<p>This is a fairly simple word document, over two pages, with headers and
footers.</p>
(snip)
<p>This is page two. <i>Les Précieuses ridicules. </i>The end.
</p>
<div class="footer"><p>The footer, with Molière, has Unicode in it.
</p>
</div>
</body></html>
Just filter on the footer and header classes on the surrounding divs.
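
A minimal sketch of that filtering step, assuming you already have the
XHTML that Tika produced as a string. It uses only Python's standard
html.parser module; the names HeaderFooterSplitter and split_sections are
made up for this example and are not part of Tika or Solr. It also assumes
header/footer divs are marked exactly as in the output above
(class="header" / class="footer").

```python
from html.parser import HTMLParser


class HeaderFooterSplitter(HTMLParser):
    """Collect text into 'header', 'footer' or 'body' by enclosing div class."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = {"header": [], "footer": [], "body": []}
        self._div_stack = []   # bucket name for each currently open <div>
        self._in_body = False  # ignore anything outside <body>, e.g. <title>

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            # A nested div inherits its parent's bucket unless it is
            # itself marked as a header or footer.
            bucket = self._div_stack[-1] if self._div_stack else "body"
            for name in ("header", "footer"):
                if name in classes:
                    bucket = name
            self._div_stack.append(bucket)

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag == "div" and self._div_stack:
            self._div_stack.pop()

    def handle_data(self, data):
        if self._in_body and data.strip():
            bucket = self._div_stack[-1] if self._div_stack else "body"
            # Collapse runs of whitespace left over from the markup layout.
            self.parts[bucket].append(" ".join(data.split()))


def split_sections(xhtml):
    """Return a dict with the header, footer and body text of the XHTML."""
    parser = HeaderFooterSplitter()
    parser.feed(xhtml)
    parser.close()
    return {name: " ".join(chunks) for name, chunks in parser.parts.items()}
```

The three strings in the returned dict could then be sent to three
separate Solr fields of your choosing. Text chunks are joined with single
spaces, so inline markup such as <i> in the middle of a word would
introduce a stray space; that is a simplification of this sketch.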
Nick