Hello Markus, I have one point of confusion: should I implement the change in the crawl-urlfilter or in the regex-urlfilter?
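For reference, a minimal sketch of what such a rule might look like in `conf/regex-urlfilter.txt` (the host name, token length, and character class below are assumptions for illustration, not taken from the actual site):

```
# Hypothetical rule: skip URLs whose first path segment looks like a
# session token (an alphanumeric run of 10-20 characters) followed by
# an .aspx page, e.g. www.example.com/s38r84rejkfndn/xyz.aspx.
# A leading "-" means "exclude URLs matching this pattern".
-^http://www\.example\.com/[0-9a-z]{10,20}/.*\.aspx$

# Accept anything else (Nutch applies the first matching rule).
+.
```

Note that filtering these URLs out means they are never fetched at all; if the pages themselves are wanted and only the session segment varies, a URL normalizer rule that strips the token may be the better fit.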


On Wed, Jul 10, 2013 at 3:12 PM, Markus Jelsma
<[email protected]>wrote:

> Hi,
>
> Use a regex URL filter to filter out those URLs and prevent them from being
> crawled again.
>
> Cheers
>
> -----Original message-----
> > From:devang pandey <[email protected]>
> > Sent: Wednesday 10th July 2013 10:29
> > To: [email protected]
> > Subject: nutch crawling issues
> >
> > I have a website, e.g. www.example.com. When I crawl it using Nutch 1.4,
> > the problem is duplicated crawling. There are a number of pages like
> > www.example.com/s38r84rejkfndn/xyz.aspx, and the segment s38r84rejkfndn
> > changes on every visit, so the crawler fetches these pages again and
> > again, because to Nutch each one looks like a new URL every time. Please
> > suggest how to overcome this issue.
> >
>
