Hi Alexei,

> So I see just one solution for crawling limited count of sites with
> behaviour like on mobile365. Its limit scope of sites using
> regex-urlfilter.txt with list like this
>
> +^www.mobile365.ru
> +^mobile365.ru

Better:

  +^https?://(?:www\.)?mobile365\.ru/

or, to catch all of mobile365.ru including subdomains:

  +^https?://(?:[a-z0-9-]+\.)*mobile365\.ru/

And don't forget to remove the final rule

  # accept anything else
  +.

and replace it by

  # skip everything else
  -.

If you have more than a few hosts / domains you want to allow,
the urlfilter-domain plugin would be a more comfortable choice.
There, a single line has the desired effect:

  mobile365.ru

Sebastian

>
> Thanks.
>
> On Wed, Aug 8, 2012 at 5:56 PM, Markus Jelsma
> <[email protected]> wrote:
>
>>
>> If it starts to redirect and you are on the wrong side of the redirect,
>> you're in trouble. But with the HostNormalizer you can then renormalize all
>> URL's to the host that is being redirected to.
>>
>>
>> -----Original message-----
>>> From:Alexei Korolev <[email protected]>
>>> Sent: Wed 08-Aug-2012 15:55
>>> To: [email protected]
>>> Subject: Re: crawling site without www
>>>
>>>> You can use the HostURLNormalizer for this task or just crawl the www OR
>>>> the non-www, not both.
>>>>
>>>
>>> I'm trying to crawl only version without www. As I see, I can remove www.
>>> using proper configured regex-normalize.xml.
>>> But will it work if mobile365.ru redirect on www.mobile365.ru (it's very
>>> common situation in web)
>>>
>>> Thanks.
>>>
>>> Alexei
>>>
>>
>
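
P.S. To sanity-check the rules suggested in this mail before a crawl, something like the following Python sketch may help. It is not Nutch code: it only assumes that the rules are tried top to bottom, that the first rule whose pattern is found in the URL decides, and that URLs reach the filter with their scheme (http:// or https://). The helper name `accepts` and the sample URLs are made up for illustration.

```python
import re

# Rules in the order they would appear in regex-urlfilter.txt:
# (sign, pattern). "+" accepts the URL, "-" rejects it.
SUGGESTED_RULES = [
    ("+", r"^https?://(?:[a-z0-9-]+\.)*mobile365\.ru/"),
    ("-", r"."),  # skip everything else
]

# The rules from the original question, followed by the same catch-all.
ORIGINAL_RULES = [
    ("+", r"^www.mobile365.ru"),
    ("+", r"^mobile365.ru"),
    ("-", r"."),
]


def accepts(url, rules):
    """Return True if the first rule matching the URL is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL matching no rule is dropped


if __name__ == "__main__":
    samples = [
        "http://mobile365.ru/",
        "https://www.mobile365.ru/page",
        "http://notmobile365.ru/",
        "http://example.com/",
    ]
    for url in samples:
        print(url, accepts(url, SUGGESTED_RULES), accepts(url, ORIGINAL_RULES))
```

The second column shows why the anchored patterns without a scheme do not help: a URL starting with http:// can never match ^www or ^mobile365, so it falls through to the final rule.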

