Hi,

Has this moved on any? Did you manage to fetch your URLs successfully? I have been away and didn't get time to complete.

________________________________________
From: [email protected] [[email protected]]
Sent: 21 April 2011 21:11
To: [email protected]
Subject: RE: Fetching urls with query string

Hi,

Sorry I didn't provide the real URLs. Here they are:

Nutch fetches this:
http://www.univ-lille1.fr/etudes/offre-de-formation/Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT&composante=

Nutch does not fetch this:
http://www.univ-lille1.fr/etudes/offre-de-formation/Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT&composante=&mention=FR_RNE_0593559Y_PR_ST-dut-000001&specialite=FR_RNE_0593559Y_PR_formation-DUT-INFO

My crawl-urlfilter:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|crt|cert)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# crawl only on front-ig1
+^http://www.univ-lille1.fr/etudes/offre-de-formation

# skip everything else
-.

If I remove the comment on -[?*!@=], Nutch doesn't fetch URLs with query strings at all.

For information, I use Nutch 0.9 (but I tried with a fresh install of 1.2 and I'm having the same problem).

Thanks for your answer, John.

Best regards
David

Quoting "McGibbney, Lewis John" <[email protected]>:

> Hi,
>
> It appears that both of the urls you posted return 404 not found then
> autoredirect to a domain seller!
>
> Further to this, did you remove the comment on this
>
> #-[?*!@=]... from the info provided below it appears you have not.
>
> hth
>
> Lewis
>
> ________________________________________
> From: [email protected] [[email protected]]
> Sent: 21 April 2011 16:15
> To: [email protected]
> Subject: Fetching urls with query string
>
> Hello,
>
> I have problems fetching some urls having GET parameters with nutch. For
> example, nutch is fetching :
>
> http://www.mywebsite.com/studies/formation-offer/Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT&composante=
>
> but will not fetch :
> http://www.mywebsite.com/studies/formation-offer/Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT&composante=&mention=FR_RNE_0593559Y_PR_ST-dut-000001&specialite=FR_RNE_0593559Y_PR_formation-DUT-INFO
>
> I updated the crawl-urlfilter :
> #-[?*!@=]
>
> +^http://www.mywebsite.com/studies/formation-offer/
>
> and nutch-default.xml :
>
> <property>
> <name>db.max.anchor.length</name>
> <value>300</value>
> <description>The maximum number of characters permitted in an anchor.
> </description>
> </property>
>
> but i have the same result, i didn't find anything in the configuration files to
> make it work. Have somebody an idea ?
>
> Best regards,
> David
>
> Email has been scanned for viruses by Altman Technologies' email management
> service - www.altman.co.uk/emailsystems
>
> Glasgow Caledonian University is a registered Scottish charity, number
> SC021474
>
> Winner: Times Higher Education’s Widening Participation Initiative of the
> Year 2009 and Herald Society’s Education Initiative of the Year 2009.
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
>
> Winner: Times Higher Education’s Outstanding Support for Early Career
> Researchers of the Year 2010, GCU as a lead with Universities Scotland
> partners.
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,15691,en.html
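
One way to check whether the crawl-urlfilter rules quoted in this thread could explain the difference between the two URLs is to replay them offline. The following is only a minimal sketch, not Nutch's own code. It assumes the filter file is read top to bottom, that the first rule whose pattern is found anywhere in the URL decides (+ accepts, - rejects), and that a URL matching no rule is rejected. Python's re module stands in for Java's regex engine here, which should behave the same for these particular patterns. The names RULES, QUERY_RULE, WITH_QUERY_RULE and accepts are made up for this illustration.

```python
import re

# Rules copied from the crawl-urlfilter posted in the thread, in file order.
# Each entry is (sign, pattern): "+" accepts the URL, "-" rejects it.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls"
          r"|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|crt|cert)$"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("+", r"^http://www.univ-lille1.fr/etudes/offre-de-formation"),
    ("-", r"."),
]

# The rule that is commented out in the posted file.
QUERY_RULE = ("-", r"[?*!@=]")

# Same rule list with the query rule re-enabled at its original position
# (after the suffix rule, before the repeated-segment rule).
WITH_QUERY_RULE = RULES[:2] + [QUERY_RULE] + RULES[2:]

SHORT_URL = ("http://www.univ-lille1.fr/etudes/offre-de-formation/"
             "Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT&composante=")
LONG_URL = (SHORT_URL
            + "&mention=FR_RNE_0593559Y_PR_ST-dut-000001"
            + "&specialite=FR_RNE_0593559Y_PR_formation-DUT-INFO")


def accepts(url, rules):
    """Return True if the first rule found in url is a '+' rule (assumed semantics)."""
    for sign, pattern in rules:
        if re.search(pattern, url):  # partial match anywhere in the URL
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected


if __name__ == "__main__":
    for label, rules in (("as posted", RULES), ("query rule enabled", WITH_QUERY_RULE)):
        for url in (SHORT_URL, LONG_URL):
            print(label, accepts(url, rules), url[-30:])
```

Under these assumptions both URLs pass the filter as posted, and both are rejected once -[?*!@=] is uncommented, which agrees with what David reports. If the real filter behaves like this sketch, the posted rules alone would not seem to explain why only the longer URL is skipped.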

