Re: crawling site without www

Sebastian Nagel Wed, 08 Aug 2012 10:18:34 -0700

Hi Alexei,

> So I see just one solution for crawling a limited number of sites that
> behave like mobile365: limit the scope of sites using
> regex-urlfilter.txt with a list like this
> 
> +^www.mobile365.ru
> +^mobile365.ru


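Better (the filter is applied to the full URL including the scheme, so a
rule anchored at ^www would never match, and the dots should be escaped):
+^https?://(?:www\.)?mobile365\.ru/
or, to catch all hosts under mobile365.ru:
+^https?://(?:[a-z0-9-]+\.)*mobile365\.ru/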

and don't forget to remove the final rule

# accept anything else
+.

and replace it with

# skip everything else
-.
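
To check the rules before a crawl, here is a minimal Python sketch of how
I understand the filter to evaluate them: the first pattern found in the
URL decides, and a URL matched by no rule is dropped. The accepts() helper
and the test URLs are only illustrations, not part of Nutch, and Python's
re module stands in for Java regex here.

```python
import re

# Rules in file order: '+' accepts, '-' rejects (assumed first-match-wins).
RULES = [
    ("+", r"^https?://(?:[a-z0-9-]+\.)*mobile365\.ru/"),
    ("-", r"."),  # skip everything else
]

# The rules from the original question, for comparison.
OLD_RULES = [
    ("+", r"^www.mobile365.ru"),
    ("+", r"^mobile365.ru"),
    ("-", r"."),
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: the URL is dropped


if __name__ == "__main__":
    for url in (
        "http://mobile365.ru/",
        "http://www.mobile365.ru/catalog",
        "http://notmobile365.ru/",
        "http://mobile365.ru.example.com/",
    ):
        print(url, accepts(url), accepts(url, OLD_RULES))
```

With OLD_RULES every URL falls through to "-." because the URL string
starts with "http://", not with the host name.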

If you have more than a few hosts or domains you want to allow,
the urlfilter-domain plugin would be a more convenient choice.
There, a single line has the desired effect:
mobile365.ru
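
Roughly, such a line allows the domain itself and every host below it.
A simplified Python sketch of that idea follows; the real plugin also
knows about domain suffixes, and host_allowed() and ALLOWED are
hypothetical names used only for this illustration.

```python
from urllib.parse import urlsplit

# One entry per line of the domain list (assumed contents).
ALLOWED = {"mobile365.ru"}


def host_allowed(url, allowed=ALLOWED):
    """Return True if the URL's host equals an allowed domain or is a subdomain of it."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed)


if __name__ == "__main__":
    for url in ("http://www.mobile365.ru/page", "http://notmobile365.ru/"):
        print(url, host_allowed(url))
```

No regex escaping is needed here, which is why this gets more comfortable
as the list of hosts grows.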


Sebastian

> 
> Thanks.
> 
> On Wed, Aug 8, 2012 at 5:56 PM, Markus Jelsma
> <[email protected]> wrote:
> 
>>
>> If it starts to redirect and you are on the wrong side of the redirect,
>> you're in trouble. But with the HostNormalizer you can then renormalize all
>> URLs to the host that is being redirected to.
>>
>>
>> -----Original message-----
>>> From:Alexei Korolev <[email protected]>
>>> Sent: Wed 08-Aug-2012 15:55
>>> To: [email protected]
>>> Subject: Re: crawling site without www
>>>
>>>> You can use the HostURLNormalizer for this task or just crawl the www OR
>>>> the non-www, not both.
>>>>
>>>
>>> I'm trying to crawl only the version without www. As I see it, I can remove
>>> www. using a properly configured regex-normalize.xml.
>>> But will it work if mobile365.ru redirects to www.mobile365.ru (a very
>>> common situation on the web)?
>>>
>>> Thanks.
>>>
>>> Alexei
>>>
>>
> 
> 
>
