Hi Rushikesh,

I'm very new to Nutch, so I'll let Sebastian and the other experts guide you. I suspect that success in removing the header and footer will depend heavily on the HTML files you're processing.
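
To show what I mean about it depending on the HTML, here is a rough sketch. It is not Nutch code, just plain Python using the standard library. It assumes your pages mark their header and footer with the HTML5 <header> and <footer> elements; if they use something like <div id="footer"> instead, the rule would have to be different. The names BodyTextExtractor and extract_main_text are made up for this example.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect page text, ignoring anything nested inside skipped tags."""

    # Assumption: the site wraps its boilerplate in these HTML5 elements.
    SKIP_TAGS = {"header", "footer"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while we are inside a skipped element
        self.chunks = []      # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside header/footer.
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the text of `html` with header/footer content left out."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Something like extract_main_text(page_html) would then give you the text without the header and footer, but only for pages that follow that markup convention.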

A quick Google search finds these pages:

http://grokbase.com/t/nutch/user/155ensey7k/parsing-pages-but-removing-headers-and-footers
http://grokbase.com/t/nutch/user/1563bdhv85/crawling-pages-but-ignoring-header-and-footer
http://lucene.472066.n3.nabble.com/Removing-Common-Web-Page-Header-and-Footer-from-content-td4168764.html

I suggest you start a new thread since I don't believe your question has anything to do with this regex-urlfilter.txt discussion. I also suggest that you try to implement what is suggested in those pages and then write back (in a new discussion thread) what you did and what isn't working.

Sol

On Thu, Nov 9, 2017 at 11:02 AM, Rushikesh K <[email protected]> wrote:

> Hi Sol,
> i have a question we are trying to use Nutch 1.3 for our website crawling
> ,we have a requirement of skipping the header and footer .I was searching
> online but there isnt an exact solution i found.Can you please guide us
> through that.
>
> I really appreciate you in advance!
>
> On Thu, Nov 9, 2017 at 11:23 AM, Sol Lederman <[email protected]>
> wrote:
>
> > Awesome! Thank you.
> >
> >
>
> --
> Regards
> Rushikesh M
> .Net Developer
>

