Re: Relative urls are not crawled ?

Bahadir Cambel Wed, 22 Sep 2010 06:46:56 -0700

Hi,

Thanks for the responses. Adding the following rules to the configuration
seems to have fixed it:


crawl-urlfilter:
+^/
regex-urlfilter:
+^/

This solved the relative URL problem.
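
For reference, the effect of the two rule sets can be sketched outside Nutch.
The Python snippet below is only an illustration, not Nutch code: parse_rules
and accepts are hypothetical helpers that assume the filter file is read top
to bottom, the first matching +/- rule decides, and a URL that matches no rule
is rejected. It also assumes the page is served from test.mydomain.com, the
host used in the filter rules.

```python
import re
from urllib.parse import urljoin


def parse_rules(text):
    """Parse '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first matching rule decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# Rule sets taken from this thread: before and after adding "+^/".
ORIGINAL = parse_rules("+^http://test.mydomain.com")
UPDATED = parse_rules("+^http://test.mydomain.com\n+^/")

# Assumed page URL (host spelled as in the filter rules) and a menu link.
page = "http://test.mydomain.com/enGB/ProductLanding/Products.html"
link = "/Products/default.html"
resolved = urljoin(page, link)  # absolute form of the relative link

print(accepts(ORIGINAL, link))      # raw relative link, host-only rule
print(accepts(ORIGINAL, resolved))  # resolved link, host-only rule
print(accepts(UPDATED, link))       # raw relative link, with "+^/" added
```

Under these assumptions the host-only rule rejects the link in its raw
relative form but accepts it once it is resolved against the page URL, so
which form the filter sees determines whether "+^/" is needed.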

Kind regards,
Bahadir Cambel

On Tue, Sep 21, 2010 at 6:48 PM, Thumuluri, Sai <
[email protected]> wrote:

> I would recommend making the filters accept everything in your dev
> environment and running the crawler again to see if that works. If it
> works, then you can focus on the suggestions from Raj; if not, the issue
> is elsewhere.
>
> -----Original Message-----
> From: Nemani, Raj [mailto:[email protected]]
> Sent: Tuesday, September 21, 2010 12:28 PM
> To: [email protected]
> Subject: RE: Relative urls are not crawled ?
>
> I think you are on the right track.  Try to modify the regEx filters
> appropriately.  You may have to think about the regEx a bit depending on
> your website folder structure.  Seeing that you have relative URLs the
> regEx you have will not work.  You have to start looking for patterns in
> your relative URLs (if any).
>
> -----Original Message-----
> From: Bahadir Cambel [mailto:[email protected]]
> Sent: Tuesday, September 21, 2010 10:53 AM
> To: [email protected]
> Subject: Re: Relative urls are not crawled ?
>
> We have the following settings:
>
> *in crawl-urlfilter:*
> +^http://test.mydomain.com
>
> *in regex-urlfilter:*
> +^http://test.mydomain.com
>
> *in seed.txt:*
> http://test.mydomian.com/enGB/ProductLanding/Products.html
>
> What you say made me realize that, since I have the full URLs in these
> configuration files, the crawler drops the relative URLs as well. Is this
> what you mean? I thought there would be a setting. Can you give an example
> of a combination with which they should work?
>
> Thanks
>
> On Tue, Sep 21, 2010 at 4:42 PM, Thumuluri, Sai <
> [email protected]> wrote:
>
> > Did you check regex-url and crawl filters in nutch conf to make sure you
> > are not excluding the relative URLs?
> >
> > -----Original Message-----
> > From: Bahadir Cambel [mailto:[email protected]]
> > Sent: Tuesday, September 21, 2010 10:35 AM
> > To: [email protected]
> > Subject: Relative urls are not crawled ?
> >
> > Hey guys,
> >
> > Our website is built using relative URLs; for example, the menu links are
> > "/Products/default.html" and "/Brands/default.html".
> >
> > When Nutch crawls the website, I cannot see that these anchors are
> > fetched, although I set the depth to 2. The resulting index contains only
> > 1 document.
> >
> > If I run it against e.g. http://androidyou.blogspot.com, I can see that
> > the other URLs are fetched as well, and the links on that web site are
> > full URLs.
> >
> > Is there any configuration for this?
> >
> > I hope I was able to describe the issue clearly.
> >
> > Kind regards,
> > Bahadir Cambel
> >
>
