You could try using the Tika Extractor in ManifoldCF. There's support for boilerplate removal, but I'm not sure how well it works.
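
If the boilerplate removal turns out not to work well enough, the general idea can also be approximated by hand before the content reaches Solr. Below is a rough Python sketch of that idea. It is not Tika's algorithm and not a ManifoldCF feature. The element ids `site-header` and `site-footer` and the function `extract_main_text` are made up for illustration; real SharePoint pages will use different markup. It also assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Hypothetical ids for the page chrome; adjust to the real page markup.
BOILERPLATE_IDS = {"site-header", "site-footer"}
# Tags whose whole subtree is dropped.
BOILERPLATE_TAGS = {"header", "footer", "nav", "script", "style"}
# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class BodyTextExtractor(HTMLParser):
    """Collects text, skipping everything nested inside boilerplate elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a boilerplate subtree
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            # Already inside boilerplate: track nesting so we know when it ends.
            self._skip_depth += 1
        elif tag in BOILERPLATE_TAGS or dict(attrs).get("id") in BOILERPLATE_IDS:
            self._skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_main_text(html):
    """Return the text of `html` with header/footer subtrees removed."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = ('<html><body><div id="site-header">Menu <a href="#">Home</a></div>'
            '<p>Real <b>content</b> here.</p>'
            '<footer>Copyright<br></footer></body></html>')
    print(extract_main_text(page))
```

Something like this would have to run before the document is handed to the output connection, since (as noted in the original question) the HTML is already gone by the time an updateRequestProcessorChain sees it.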

Karl

On Tue, Jan 13, 2015 at 9:06 AM, Salih Sen <[email protected]> wrote:

> Hi Everyone,
>
> I'm trying to index a SharePoint 2013 web site with ManifoldCF 1.7.2
> using Solr as the output connection.
>
> How can I remove the header and footer of aspx files so they are not
> indexed with the rest of the document?
>
> I tried using a custom updateRequestProcessorChain, but since aspx pages
> are indexed through ExtractingRequestHandler, the HTML is stripped before it
> reaches there.
>
> --
> Salih Şen
>
> Dilişim Bilgi Bilgisayar ve İletişim Teknolojileri Sanayi ve Ticaret Ltd.
> Sti.
>
> email: [email protected]
>
> Tel: 0 222 330 20 21
>
> GSM: 0 507 296 15 51
>
