We have these settings:

In crawl-urlfilter:
+^http://test.mydomain.com

In regex-urlfilter:
+^http://test.mydomain.com

In seed.txt:
http://test.mydomian.com/enGB/ProductLanding/Products.html
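For context, Nutch applies these regex filters to the fully resolved (absolute) form of each discovered link, so a relative link is first joined with the page's base URL and only then tested against the pattern. A minimal Python sketch of that matching step, using the filter pattern from the config above (the relative link `/Products/default.html` is an illustrative example, and this is not Nutch's actual code):

```python
import re
from urllib.parse import urljoin

# Filter line from crawl-urlfilter / regex-urlfilter above.
# (Note the '.' characters are unescaped, so they match any character,
# but that is how the filter is written in the config.)
pattern = re.compile(r"^http://test.mydomain.com")

# Page being crawled and a hypothetical relative menu link on it.
base = "http://test.mydomain.com/enGB/ProductLanding/Products.html"
relative_link = "/Products/default.html"

# A relative link is resolved against the page URL before filtering.
absolute = urljoin(base, relative_link)
print(absolute)  # http://test.mydomain.com/Products/default.html

# The resolved URL matches the '+' pattern, so the filter would accept it.
print(bool(pattern.match(absolute)))  # True
```

So, in principle, a pattern anchored at the host should still accept relative links on the same site once they are resolved to absolute URLs.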

What you said made me wonder: since I have full URLs in these
configuration files, does the crawler drop the relative URLs as well? Is that
what you mean? I thought there would be a setting for this. Can you give an
example of a filter combination with which relative URLs should work?

Thanks

On Tue, Sep 21, 2010 at 4:42 PM, Thumuluri, Sai <
[email protected]> wrote:

> Did you check regex-url and crawl filters in nutch conf to make sure you
> are not excluding the relative URLs?
>
> -----Original Message-----
> From: Bahadir Cambel [mailto:[email protected]]
> Sent: Tuesday, September 21, 2010 10:35 AM
> To: [email protected]
> Subject: Relative urls are not crawled ?
>
> Hey Guys ,
>
> Our website is built using relative URLs; for example, the menu links are
> "/Products/default.html" and "/Brands/default.html".
>
> Once Nutch crawls the website, I cannot see that these anchors are
> fetched, although I set the depth to 2. The resulting index contains only
> 1 document.
>
> If I run it against e.g. http://androidyou.blogspot.com, I can see that
> the other URLs are fetched as well, and the links on that site are full
> URLs.
>
> Is there any configuration for this?
>
> I hope I was able to describe the issue clearly.
>
> Kind regards ,
> Bahadir Cambel
>
