Hi,

Thanks for the responses. Here is the configuration that seems to be working now; I added the following items:
urlfilter:       +^/
regex-urlfilter: +^/

This solved the relative URL problem.

Kind regards,
Bahadir Cambel

On Tue, Sep 21, 2010 at 6:48 PM, Thumuluri, Sai <[email protected]> wrote:

> I would recommend making the filters accept everything in your dev
> environment and running the crawler again to see if that works. If it
> works, then you can focus on the suggestions from Raj; if not, the
> issue is elsewhere.
>
> -----Original Message-----
> From: Nemani, Raj [mailto:[email protected]]
> Sent: Tuesday, September 21, 2010 12:28 PM
> To: [email protected]
> Subject: RE: Relative urls are not crawled ?
>
> I think you are on the right track. Try to modify the regex filters
> appropriately. You may have to think about the regex a bit depending on
> your website folder structure. Since you have relative URLs, the
> regex you have will not work. You have to start looking for patterns in
> your relative URLs (if any).
>
> -----Original Message-----
> From: Bahadir Cambel [mailto:[email protected]]
> Sent: Tuesday, September 21, 2010 10:53 AM
> To: [email protected]
> Subject: Re: Relative urls are not crawled ?
>
> We have settings like these:
>
> *in Crawl-urlfilter:*
> +^http://test.mydomain.com
>
> *in regex-urlfilter:*
> +^http://test.mydomain.com
>
> *in seed.txt:*
> http://test.mydomian.com/enGB/ProductLanding/Products.html
>
> What you say made me realize that, since I have the full URLs in this
> configuration, the crawler drops the relative URLs. Is this what you
> mean? I thought there would be a setting. Can you give an example of
> a combination with which they should work?
>
> Thanks
>
> On Tue, Sep 21, 2010 at 4:42 PM, Thumuluri, Sai <[email protected]> wrote:
>
> > Did you check the regex-url and crawl filters in the nutch conf to
> > make sure you are not excluding the relative URLs?
> >
> > -----Original Message-----
> > From: Bahadir Cambel [mailto:[email protected]]
> > Sent: Tuesday, September 21, 2010 10:35 AM
> > To: [email protected]
> > Subject: Relative urls are not crawled ?
> >
> > Hey Guys,
> >
> > Our website is constructed using relative URLs; for example, the menu
> > links are "/Products/default.html" and "/Brands/default.html".
> >
> > Once Nutch crawls the website, I cannot see that these anchors are
> > fetched, although I set the depth to 2. The resulting index contains
> > only 1 document.
> >
> > If I run it against e.g. http://androidyou.blogspot.com, I can see
> > that the other URLs are fetched as well, and the links on that site
> > are full URLs.
> >
> > Is there any configuration for this?
> >
> > Hope I was able to explain the issue clearly.
> >
> > Kind regards,
> > Bahadir Cambel
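
For anyone finding this thread later, here is a minimal sketch of how filter rules of this form can be evaluated, assuming first-match-wins semantics: rules are tried in order, a `+` rule accepts, a `-` rule rejects, and a URL that matches no rule is dropped. It is written in Python for illustration only and is not Nutch's code; `parse_rules` and `accepts` are hypothetical helper names. The sketch models only pattern matching on whatever string the filter receives, not how Nutch resolves or normalizes links before filtering.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines into (accept, compiled_pattern) pairs.

    Blank lines and lines starting with '#' are skipped.
    """
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first matching rule (assumed semantics).

    A URL that matches no rule at all is rejected.
    """
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# The original configuration from the thread: only full URLs on one host.
strict = parse_rules("+^http://test.mydomain.com")

# The same configuration with the rule that was added afterwards.
relaxed = parse_rules("+^http://test.mydomain.com\n+^/")

print(accepts(strict, "/Products/default.html"))
print(accepts(relaxed, "/Products/default.html"))
```

Under these assumptions, the string "/Products/default.html" matches no rule in the original configuration and is dropped, while the added `+^/` rule lets it through.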

