On Thu, 30 Aug 2012, Alex Cougarman wrote:
Hi. Is it possible to specifically extract footer/header and body text out of a Word document using Solr? In other words, we'd like to index/store those items in different Solr fields.

As long as they have a suitable style applied, yes, Tika will be able to tell you.

If you run Tika against this sample document from POI:
https://svn.apache.org/repos/asf/poi/trunk/test-data/document/HeaderFooterUnicode.doc

You can see the headers and footers in the xhtml:


<body><div class="header"><p>This is a simple header, with a € euro symbol in it.</p>
</div>
<p>This is a fairly simple word document, over two pages, with headers and footers.</p>

(snip)

<p>This is page two. <i>Les Précieuses ridicules. </i>The end.
</p>
<div class="footer"><p>The footer, with Molière, has Unicode in it.
</p>
</div>
</body></html>


Just filter on the footer and header classes on the surrounding DIVs.
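
For illustration only, here is a rough sketch of that filtering step in Python, using just the standard library's html.parser. It assumes you already have the XHTML that Tika produced as a string. The names HeaderFooterSplitter and split_regions are made up for this example and are not part of Tika or Solr.

```python
from html.parser import HTMLParser


class HeaderFooterSplitter(HTMLParser):
    """Sort text from Tika-style XHTML into header, footer and body buckets."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = {"header": [], "footer": [], "body": []}
        self._div_stack = []   # one entry per open <div>: "header", "footer" or None
        self._in_body = False  # skip <head> content such as <title>

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            region = next((c for c in classes if c in ("header", "footer")), None)
            self._div_stack.append(region)

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag == "div" and self._div_stack:
            self._div_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not self._in_body or not text:
            return
        # The innermost enclosing header/footer div wins; otherwise it is body text.
        region = next((r for r in reversed(self._div_stack) if r), "body")
        self.parts[region].append(text)


def split_regions(xhtml):
    """Return a dict with the joined header, footer and body text."""
    parser = HeaderFooterSplitter()
    parser.feed(xhtml)
    parser.close()
    return {name: " ".join(chunks) for name, chunks in parser.parts.items()}
```

Each of the three values in the returned dict could then be sent to its own Solr field. How you wire that into your indexing setup depends on how you call Tika, so treat this as a starting point rather than a finished recipe.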

Nick
