Re: different regex-urlfilter.txt files for different sets of URLs?

Sol Lederman Thu, 09 Nov 2017 10:16:53 -0800

Hi Rushikesh,

I'm very new to Nutch. I'll let Sebastian and the other experts guide you.
I suspect that success in removing the header and footer will be very
dependent on the HTML files you're processing.

A quick Google search finds these pages:

http://grokbase.com/t/nutch/user/155ensey7k/parsing-pages-but-removing-headers-and-footers
http://grokbase.com/t/nutch/user/1563bdhv85/crawling-pages-but-ignoring-header-and-footer
http://lucene.472066.n3.nabble.com/Removing-Common-Web-Page-Header-and-Footer-from-content-td4168764.html

I suggest you start a new thread since I don't believe your question has
anything to do with this regex-urlfilter.txt discussion.

I also suggest that you try to implement what is suggested in those pages
and then write back (in a new discussion thread) what you did and what
isn't working.

Sol

On Thu, Nov 9, 2017 at 11:02 AM, Rushikesh K <[email protected]>
wrote:

> Hi Sol,
> I have a question. We are trying to use Nutch 1.3 to crawl our website,
> and we have a requirement to skip the header and footer. I searched
> online but didn't find an exact solution. Can you please guide us
> through that?
>
> I really appreciate your help in advance!
>
> On Thu, Nov 9, 2017 at 11:23 AM, Sol Lederman <[email protected]>
> wrote:
>
> > Awesome! Thank you.
> >
>
>
>
> --
> Regards
> Rushikesh M
> .Net Developer
>
