You could try using the Tika Extractor in ManifoldCF.  There's support for
boilerplate removal, but I'm not sure how well it works.
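
To illustrate the general idea of dropping page chrome before indexing,
here is a rough Python sketch. It is not what the Tika Extractor does
internally, and the names SKIP_TAGS, MainTextExtractor and
extract_main_text are made up for this example. It also assumes the
header and footer are marked up with <header>/<footer> tags, which may
well not be true for a SharePoint master page.

```python
from html.parser import HTMLParser

# Hypothetical choice of tags to treat as page chrome. A real SharePoint
# page would more likely need matching on element ids or classes.
SKIP_TAGS = {"header", "footer", "nav", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text while ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the text of an HTML page with skipped elements removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><body><header>Site menu</header>"
    "<div><p>Actual content</p></div>"
    "<footer>Copyright</footer></body></html>"
)
print(extract_main_text(page))
```

Something like this would have to run before the text reaches Solr,
since by the time an update processor sees the document the markup is
already gone.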

Karl


On Tue, Jan 13, 2015 at 9:06 AM, Salih Sen <[email protected]> wrote:

> Hi Everyone,
>
> I'm trying to index a SharePoint 2013 web site with ManifoldCF 1.7.2,
> using Solr as the output connection.
>
> How can I remove header and footer of aspx files so they are not
> indexed with the rest of the document?
>
> I tried using a custom updateRequestProcessorChain, but since aspx pages
> are indexed through ExtractingRequestHandler, the HTML is stripped before
> the document reaches the chain.
>
> --
> Salih Şen
>
> Dilişim Bilgi Bilgisayar ve İletişim Teknolojileri Sanayi ve Ticaret Ltd.
> Sti.
>
> email: [email protected]
>
> Tel: 0 222 330 20 21
>
> GSM: 0 507 296 15 51
>
