So I see only one solution for crawling a limited number of sites that
behave like mobile365: restrict the scope of the crawl in
regex-urlfilter.txt with a list like this:

+^www.mobile365.ru
+^mobile365.ru
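
A rough Python sketch of how such a list would be applied, assuming the
filter tests each rule with a regex search against the full URL (scheme
included), lets the first matching rule decide, and rejects URLs that match
no rule. The names RULES and accept are made up for illustration and are not
Nutch code. Under that assumption the patterns probably need the http://
prefix, since a URL string never starts with "www":

```python
import re

# Hypothetical rule list in file order: (accept?, compiled pattern).
# Assumption: rules are matched against the full URL, scheme included.
RULES = [
    (True, re.compile(r"^https?://www\.mobile365\.ru/")),
    (True, re.compile(r"^https?://mobile365\.ru/")),
]

def accept(url, rules=RULES):
    """Return the verdict of the first matching rule; reject if none match."""
    for allowed, pattern in rules:
        if pattern.search(url):
            return allowed
    return False
```

With these rules, accept("http://mobile365.ru/news") passes while a URL on
any other host is dropped.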

Thanks.

On Wed, Aug 8, 2012 at 5:56 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:

>
> If it starts to redirect and you are on the wrong side of the redirect,
> you're in trouble. But with the HostNormalizer you can then renormalize all
> URLs to the host that is being redirected to.
>
>
> -----Original message-----
> > From: Alexei Korolev <alexei.koro...@gmail.com>
> > Sent: Wed 08-Aug-2012 15:55
> > To: user@nutch.apache.org
> > Subject: Re: crawling site without www
> >
> > > You can use the HostURLNormalizer for this task or just crawl the www OR
> > > the non-www, not both.
> > >
> >
> > I'm trying to crawl only the version without www. As far as I can see, I
> > can remove the www. prefix using a properly configured regex-normalize.xml.
> > But will it work if mobile365.ru redirects to www.mobile365.ru (a very
> > common situation on the web)?
> >
> > Thanks.
> >
> > Alexei
> >
>
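
The renormalization idea from the quoted messages can be sketched in Python
as follows: every URL is rewritten to the host the site redirects to, so the
crawler only ever sees one variant. CANONICAL_HOSTS and normalize_host are
hypothetical names, and this is not the configuration format of the Nutch
normalizer, only an illustration of the mapping it would perform.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping: host as found in links -> host the site redirects to.
CANONICAL_HOSTS = {"mobile365.ru": "www.mobile365.ru"}

def normalize_host(url, mapping=CANONICAL_HOSTS):
    """Rewrite the host of url to its canonical form, if one is configured."""
    parts = urlsplit(url)
    target = mapping.get(parts.hostname or "")
    if target is None:
        return url  # host not in the mapping: leave the URL untouched
    # Keep an explicit port if present; userinfo is dropped in this sketch.
    netloc = f"{target}:{parts.port}" if parts.port else target
    return urlunsplit(parts._replace(netloc=netloc))
```

Here normalize_host("http://mobile365.ru/a?b=1") yields the www variant,
while URLs already on www.mobile365.ru or on other hosts are returned as-is.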



-- 
Alexei A. Korolev
