Re: FW: Extract footer/header text out of Word docs

Nick Burch Thu, 30 Aug 2012 11:16:47 -0700

On Thu, 30 Aug 2012, Alex Cougarman wrote:

Hi. Is it possible to specifically extract footer/header and body text out of a Word document using Solr? In other words, we'd like to index/store those items in different Solr fields.

As long as they have a suitable style applied, yes, Tika will be able to tell you.


If you run Tika against this sample document from POI:
https://svn.apache.org/repos/asf/poi/trunk/test-data/document/HeaderFooterUnicode.doc

You can see the headers and footers in the xhtml:

<body><div class="header"><p>This is a simple header, with a € euro symbol in it.</p>

</div>

<p>This is a fairly simple word document, over two pages, with headers and footers.</p>


(snip)

<p>This is page two. <i>Les Précieuses ridicules. </i>The end.
</p>
<div class="footer"><p>The footer, with Molière, has Unicode in it.
</p>
</div>
</body></html>


Just filter on the header and footer classes of the surrounding DIVs.
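As a rough illustration of that filtering step, here is a minimal sketch in Python that splits Tika-style XHTML into header, footer and body text using only the standard library's html.parser. The names SectionSplitter and split_sections are made up for this example and are not part of Tika or Solr, and the sample string is an abridged copy of the output above. How the three strings get mapped onto Solr fields is left open, since that depends on your schema.

```python
from html.parser import HTMLParser


class SectionSplitter(HTMLParser):
    """Hypothetical helper: sort body text into header/footer/body buckets."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.sections = {"header": [], "footer": [], "body": []}
        self._div_classes = []  # class attribute of each open <div>, innermost last
        self._in_body = False

    def _target(self):
        # Nearest enclosing header/footer div wins; otherwise plain body text.
        for cls in reversed(self._div_classes):
            if cls in ("header", "footer"):
                return cls
        return "body"

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag == "div":
            self._div_classes.append(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag == "p" and self._in_body:
            # Keep adjacent paragraphs from running together.
            self.sections[self._target()].append(" ")
        elif tag == "div" and self._div_classes:
            self._div_classes.pop()

    def handle_data(self, data):
        if self._in_body:
            self.sections[self._target()].append(data)


def split_sections(xhtml):
    """Return a dict with whitespace-normalised header, footer and body text."""
    parser = SectionSplitter()
    parser.feed(xhtml)
    parser.close()
    return {name: " ".join("".join(parts).split())
            for name, parts in parser.sections.items()}


# Abridged copy of the XHTML shown above.
sample = """<html><body><div class="header"><p>This is a simple header, with a € euro symbol in it.</p>
</div>
<p>This is a fairly simple word document, over two pages, with headers and footers.</p>
<p>This is page two. <i>Les Précieuses ridicules. </i>The end.
</p>
<div class="footer"><p>The footer, with Molière, has Unicode in it.
</p>
</div>
</body></html>"""

fields = split_sections(sample)
for name, text in fields.items():
    print(f"{name}: {text}")
```

Each of the three values could then be sent to its own Solr field. The sketch treats any text outside a header or footer DIV as body text, which is an assumption about how you want the remainder handled.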

Nick
