Hi,

It appears that both of the URLs you posted return 404 Not Found and then
auto-redirect to a domain seller!

Further to this, did you remove the comment on this line?

#-[?*!@=]

From the info provided below, it appears you have not.

hth

Lewis

________________________________________
From: [email protected] [[email protected]]
Sent: 21 April 2011 16:15
To: [email protected]
Subject: Fetching urls with query string

Hello,

I have problems fetching some URLs with GET parameters using Nutch. For
example, Nutch fetches:

http://www.mywebsite.com/studies/formation-offer/Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT&composante=

but will not fetch:
http://www.mywebsite.com/studies/formation-offer/Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT&composante=&mention=FR_RNE_0593559Y_PR_ST-dut-000001&specialite=FR_RNE_0593559Y_PR_formation-DUT-INFO

I updated the crawl-urlfilter file:
#-[?*!@=]

+^http://www.mywebsite.com/studies/formation-offer/

and nutch-default.xml:

<property>
  <name>db.max.anchor.length</name>
  <value>300</value>
  <description>The maximum number of characters permitted in an anchor.
  </description>
</property>

but I still get the same result. I didn't find anything in the configuration
files to make it work. Does anybody have an idea?
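
To make the effect of the filter lines above easier to check, here is a minimal Python sketch of how such rules are evaluated. It is not Nutch code. It assumes the filter file contains only the two lines quoted above, that the first matching rule decides and an unmatched URL is ignored, and that Python's `re` treats these two patterns the same way Java regex does. The names `parse_rules` and `accepts` are hypothetical helpers made up for this illustration.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        # True means "include", False means "exclude".
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides.

    Assumption: a URL that matches no rule is ignored (returns False).
    """
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False


# The two filter lines exactly as quoted in the message (rule commented out).
COMMENTED = """\
#-[?*!@=]
+^http://www.mywebsite.com/studies/formation-offer/
"""
# The same file with the leading '#' removed, so the exclusion rule is active.
ACTIVE = COMMENTED.replace("#-", "-", 1)

SHORT_URL = (
    "http://www.mywebsite.com/studies/formation-offer/"
    "Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT&composante="
)
LONG_URL = (
    SHORT_URL
    + "&mention=FR_RNE_0593559Y_PR_ST-dut-000001"
    + "&specialite=FR_RNE_0593559Y_PR_formation-DUT-INFO"
)

for url in (SHORT_URL, LONG_URL):
    print(accepts(parse_rules(COMMENTED), url), accepts(parse_rules(ACTIVE), url))
```

Under these assumptions both URLs pass when the `-[?*!@=]` line is commented out, and both are rejected when it is active, because each URL contains `?` and `=`. If that also holds for the real filter file, these two lines alone would not explain why only the longer URL is skipped.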

Best regards,
David
