I think you are on the right track.  Try modifying the regex filters
so that they match the URLs you want crawled.  How the patterns should
look depends on your website's folder structure.  Since your links are
relative URLs, the regex you have will not match them as written, so
start by looking for patterns in your relative URLs (if any).
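
One way to check a rule set before re-crawling is a small script like the
one below.  It is only a sketch in Python, not Nutch code.  It assumes the
rule semantics described in the comments of the default regex-urlfilter
file: each line is "+" or "-" followed by a regex, the first matching
line decides, and a URL that matches nothing is dropped.  The names
parse_rules and accepts are made up for this illustration.  Whether the
crawler filters the link as written or the link resolved against the page
URL is not settled in this thread, so the script prints both.

```python
import re
from urllib.parse import urljoin


def parse_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first matching rule decides; no match means drop."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False


# Rule and seed copied exactly as quoted in the message below.
RULES = parse_rules("+^http://test.mydomain.com")
SEED = "http://test.mydomian.com/enGB/ProductLanding/Products.html"

print("seed", SEED, accepts(RULES, SEED))
for link in ["/Products/default.html", "/Brands/default.html"]:
    absolute = urljoin(SEED, link)  # the link resolved against the page URL
    print(link, accepts(RULES, link), absolute, accepts(RULES, absolute))
```

Any URL printed with False would be dropped under the assumed semantics.
If the seed itself prints False, compare the host in the seed with the
host in the rule character by character before changing the regex.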

-----Original Message-----
From: Bahadir Cambel [mailto:[email protected]] 
Sent: Tuesday, September 21, 2010 10:53 AM
To: [email protected]
Subject: Re: Relative urls are not crawled ?

Our settings are as follows:

*in crawl-urlfilter:*
+^http://test.mydomain.com

*in regex-urlfilter:*
+^http://test.mydomain.com

*in seed.txt:*
http://test.mydomian.com/enGB/ProductLanding/Products.html

What you say made me realize that, since I have full URLs in these
configuration files, the crawler may be dropping the relative URLs. Is
that what you mean? I thought there would be a setting for this. Can you
give an example of a combination with which relative URLs should work?

Thanks

On Tue, Sep 21, 2010 at 4:42 PM, Thumuluri, Sai <
[email protected]> wrote:

> Did you check the regex-urlfilter and crawl-urlfilter files in the Nutch
> conf directory to make sure you are not excluding the relative URLs?
>
> -----Original Message-----
> From: Bahadir Cambel [mailto:[email protected]]
> Sent: Tuesday, September 21, 2010 10:35 AM
> To: [email protected]
> Subject: Relative urls are not crawled ?
>
> Hey guys,
>
> Our website is built with relative URLs; for example, the menu links are
> "/Products/default.html" and "/Brands/default.html".
>
> When Nutch crawls the website, I cannot see these anchors being fetched,
> although I set the depth to 2. The resulting index contains only 1
> document.
>
> If I run it against e.g. http://androidyou.blogspot.com, I can see that
> the other URLs are fetched as well, and the links on that site are full
> URLs.
>
> Is there a configuration setting for this?
>
> I hope I have described the issue clearly.
>
> Kind regards,
> Bahadir Cambel
>
