So I see just one solution for crawling a limited number of sites that behave like mobile365: limit the scope of sites in regex-urlfilter.txt with a list like this:
+^www.mobile365.ru
+^mobile365.ru

Thanks.

On Wed, Aug 8, 2012 at 5:56 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> If it starts to redirect and you are on the wrong side of the redirect,
> you're in trouble. But with the HostNormalizer you can then renormalize all
> URL's to the host that is being redirected to.
>
> -----Original message-----
> > From:Alexei Korolev <alexei.koro...@gmail.com>
> > Sent: Wed 08-Aug-2012 15:55
> > To: user@nutch.apache.org
> > Subject: Re: crawling site without www
> >
> > > You can use the HostURLNormalizer for this task or just crawl the www OR
> > > the non-www, not both.
> > >
> >
> > I'm trying to crawl only version without www. As I see, I can remove www.
> > using proper configured regex-normalize.xml.
> > But will it work if mobile365.ru redirect on www.mobile365.ru (it's very
> > common situation in web)
> >
> > Thanks.
> >
> > Alexei
>

--
Alexei A. Korolev
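
P.S. The filter list at the top of this message can be sanity-checked outside Nutch. Below is a minimal Python sketch, not Nutch code. It assumes that each rule is searched against the full URL string (scheme included), that the first matching rule decides, and that a URL matching no rule is dropped. The `accepts` helper and the scheme-aware pattern in `SCHEME_AWARE` are hypothetical illustrations, not something taken from the thread.

```python
import re

# The two rules as written in the message, in file order.
AS_WRITTEN = [
    ("+", r"^www.mobile365.ru"),
    ("+", r"^mobile365.ru"),
]

# Hypothetical alternative: anchor on the scheme, escape the dots,
# and accept both the www and the non-www host.
SCHEME_AWARE = [
    ("+", r"^https?://(www\.)?mobile365\.ru(/|$)"),
]


def accepts(url, rules):
    """Return True if the first rule that matches `url` is a '+' rule.

    Assumption: rules are searched (not full-matched) against the whole
    URL, and a URL that matches no rule is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    for url in ("http://www.mobile365.ru/", "http://mobile365.ru/catalog"):
        print(url, accepts(url, AS_WRITTEN), accepts(url, SCHEME_AWARE))
```

Under those assumptions the rules as written never match, because `^` anchors at the start of the URL and the URL begins with the scheme. If the real filter behaves the same way, the patterns would need the `http://` prefix; worth checking against the comments in the stock regex-urlfilter.txt before relying on this.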