Re: crawling site without www

Alexei Korolev Wed, 08 Aug 2012 10:20:07 -0700

OK, thanks a lot. I'll try it later :)

On Wed, Aug 8, 2012 at 9:18 PM, Sebastian Nagel
<wastl.na...@googlemail.com> wrote:


> Hi Alexei,
>
> > So I see just one solution for crawling a limited number of sites that
> > behave like mobile365: limit the scope of the crawl with
> > regex-urlfilter.txt, using a list like this:
> >
> > +^www.mobile365.ru
> > +^mobile365.ru
>
> Better:
> +^https?://(?:www\.)?mobile365\.ru/
> or, to catch every host under mobile365.ru:
> +^https?://(?:[a-z0-9-]+\.)*mobile365\.ru/
>
> and don't forget to remove the final rule
>
> # accept anything else
> +.
>
> and replace it with
>
> # skip everything else
> -.
>
> If you want to allow more than a few hosts or domains,
> the urlfilter-domain plugin would be a more convenient choice.
> There, a single line has the desired effect:
> mobile365.ru
>
>
> Sebastian
>
> >
> > Thanks.
> >
> > On Wed, Aug 8, 2012 at 5:56 PM, Markus Jelsma
> > <markus.jel...@openindex.io> wrote:
> >
> >>
> >> If it starts to redirect and you are on the wrong side of the redirect,
> >> you're in trouble. But with the HostURLNormalizer you can then
> >> renormalize all URLs to the host that is being redirected to.
> >>
> >>
> >> -----Original message-----
> >>> From: Alexei Korolev <alexei.koro...@gmail.com>
> >>> Sent: Wed 08-Aug-2012 15:55
> >>> To: user@nutch.apache.org
> >>> Subject: Re: crawling site without www
> >>>
> >>>> You can use the HostURLNormalizer for this task or just crawl the www
> >>>> OR the non-www, not both.
> >>>>
> >>>
> >>> I'm trying to crawl only the version without www. As I see it, I can
> >>> remove the www. using a properly configured regex-normalize.xml.
> >>> But will it work if mobile365.ru redirects to www.mobile365.ru? (That's
> >>> a very common situation on the web.)
> >>>
> >>> Thanks.
> >>>
> >>> Alexei
> >>>
> >>
> >
> >
> >
>
>


-- 
Alexei A. Korolev
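
The filter rules quoted above can be checked outside Nutch before starting a
crawl. What follows is only a rough sketch in Python, not Nutch code. It
assumes the rules are tried in file order, that the first pattern found
anywhere in the URL decides, and that a URL matching no rule is rejected. The
`accepts` helper, the `RULES` list and the sample URLs are hypothetical names
made up for illustration. Nutch itself uses Java regular expressions, but
these particular patterns should behave the same way in Python.

```python
import re

# Rules in file order, as (sign, pattern). "+" accepts, "-" rejects.
# These mirror the suggested regex-urlfilter.txt lines; the list itself
# is a hypothetical stand-in for the real file.
RULES = [
    ("+", r"^https?://(?:[a-z0-9-]+\.)*mobile365\.ru/"),
    ("-", r"."),  # skip everything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule.

    Assumption: first match wins, and a URL that matches no rule is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    for url in (
        "http://mobile365.ru/",
        "http://www.mobile365.ru/catalog",
        "http://example.com/",
    ):
        print(url, accepts(url))
```

The same check shows why the original rule `+^www.mobile365.ru` cannot match:
the filter sees the full URL, which begins with `http://`, so a pattern
anchored at `^www` never finds anything.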
