Ok. Thank you a lot. I'll try later :)

On Wed, Aug 8, 2012 at 9:18 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> Hi Alexei,
>
> > So I see just one solution for crawling limited count of sites with
> > behaviour like on mobile365. Its limit scope of sites using
> > regex-urlfilter.txt with list like this
> >
> > +^www.mobile365.ru
> > +^mobile365.ru
>
> Better:
>   +^https?://(?:www\.)?mobile365\.ru/
> or to catch all of mobile365.ru
>   +^https?://(?:[a-z0-9-]+\.)*mobile365\.ru/
>
> and don't forget to remove the final rule
>
>   # accept anything else
>   +.
>
> and replace it by
>
>   # skip everything else
>   -.
>
> If you have more than a few hosts / domains you want to allow
> the urlfilter-domain would be a more comfortable choice.
> Here a simple line has the desired effect:
>   mobile365.ru
>
>
> Sebastian
>
> > Thanks.
> >
> > On Wed, Aug 8, 2012 at 5:56 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >
> >> If it starts to redirect and you are on the wrong side of the redirect,
> >> you're in trouble. But with the HostNormalizer you can then renormalize all
> >> URL's to the host that is being redirected to.
> >>
> >> -----Original message-----
> >>> From: Alexei Korolev <alexei.koro...@gmail.com>
> >>> Sent: Wed 08-Aug-2012 15:55
> >>> To: user@nutch.apache.org
> >>> Subject: Re: crawling site without www
> >>>
> >>>> You can use the HostURLNormalizer for this task or just crawl the www OR
> >>>> the non-www, not both.
> >>>
> >>> I'm trying to crawl only version without www. As I see, I can remove www.
> >>> using proper configured regex-normalize.xml.
> >>> But will it work if mobile365.ru redirect on www.mobile365.ru (it's very
> >>> common situation in web)
> >>>
> >>> Thanks.
> >>>
> >>> Alexei

--
Alexei A. Korolev
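
The filter rules quoted above can be tried out offline before touching the crawl configuration. The following is only a rough Python sketch, not Nutch code: it assumes the rules are applied top to bottom, that the first matching rule decides, that a pattern may match anywhere in the URL unless anchored with `^`, and that a URL matching no rule is rejected. The names `RULES`, `COMPILED` and `accepts` are made up for this illustration.

```python
import re

# Rules in the order they would appear in regex-urlfilter.txt:
# '+' accepts a URL, '-' rejects it.
RULES = [
    ("+", r"^https?://(?:[a-z0-9-]+\.)*mobile365\.ru/"),  # mobile365.ru and any subdomain
    ("-", r"."),                                          # skip everything else
]

COMPILED = [(sign == "+", re.compile(pattern)) for sign, pattern in RULES]


def accepts(url):
    """Return True if the first rule that matches the URL is an accept rule."""
    for accept, regex in COMPILED:
        if regex.search(url):
            return accept
    return False  # assumption: a URL that matches no rule is rejected


if __name__ == "__main__":
    for url in [
        "http://mobile365.ru/",
        "http://www.mobile365.ru/catalog",
        "http://notmobile365.ru/",
        "http://example.com/",
    ]:
        print(url, "accepted" if accepts(url) else "rejected")
```

The same check shows why the original `+^www.mobile365.ru` rule cannot work: the filter sees the full URL, which starts with the scheme (`http://`), so a pattern anchored at `^www` never matches.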