Hi,

Has there been any progress on this?

Did you manage to fetch your URLs successfully? I have been away and did not
have time to follow up.

________________________________________
From: [email protected] [[email protected]]
Sent: 21 April 2011 21:11
To: [email protected]
Subject: RE: Fetching urls with query string

Hi,

Sorry I didn't provide the real URLs. Here they are:

Nutch fetches this:
http://www.univ-lille1.fr/etudes/offre-de-formation/Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT&composante=

Nutch does not fetch this:
http://www.univ-lille1.fr/etudes/offre-de-formation/Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT&composante=&mention=FR_RNE_0593559Y_PR_ST-dut-000001&specialite=FR_RNE_0593559Y_PR_formation-DUT-INFO

My crawl-urlfilter:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|crt|cert)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# crawl only on front-ig1
+^http://www.univ-lille1.fr/etudes/offre-de-formation

# skip everything else
-.


With the comment removed from -[?*!@=], Nutch doesn't fetch URLs with query
strings at all. For information, I use Nutch 0.9 (but I tried a fresh install
of 1.2 and have the same problem).
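
To check the rule order outside Nutch, here is a rough sketch in Python. It is
not Nutch code. It assumes the rules are tried top to bottom, that the first
pattern found anywhere in the URL decides, and that a URL matching no rule is
dropped. The names accepts, RULES, and WITH_QUERY_RULE are made up for this
illustration.

```python
import re

# Rules copied in order from the crawl-urlfilter listing above.
# "+" means accept, "-" means reject.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg"
          r"|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|crt|cert)$"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("+", r"^http://www.univ-lille1.fr/etudes/offre-de-formation"),
    ("-", r"."),
]

# Same list with the commented-out query-character rule enabled
# at the position it occupies in the file.
WITH_QUERY_RULE = RULES[:2] + [("-", r"[?*!@=]")] + RULES[2:]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is a "+" rule (assumed semantics)."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches nothing is ignored


BASE = ("http://www.univ-lille1.fr/etudes/offre-de-formation/"
        "Sciences-Technologies-Sante")
SHORT_URL = BASE + "?domaine=1&diplome=TI-DUT&composante="
LONG_URL = (SHORT_URL
            + "&mention=FR_RNE_0593559Y_PR_ST-dut-000001"
            + "&specialite=FR_RNE_0593559Y_PR_formation-DUT-INFO")

print(accepts(SHORT_URL), accepts(LONG_URL))
print(accepts(SHORT_URL, WITH_QUERY_RULE), accepts(LONG_URL, WITH_QUERY_RULE))
```

Under those assumptions both URLs pass the filter as posted, and both are
rejected once -[?*!@=] is enabled. If that holds for the real filter too, the
difference between the two URLs probably comes from somewhere other than
crawl-urlfilter.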

Thanks for your answer, John.
Best regards
David

Quoting "McGibbney, Lewis John" <[email protected]>:

> Hi,
>
> It appears that both of the URLs you posted return 404 Not Found and then
> redirect automatically to a domain seller!
>
> Further to this, did you remove the comment on this line?
>
> #-[?*!@=]
>
> From the info provided below, it appears you have not.
>
> hth
>
> Lewis
>
> ________________________________________
> From: [email protected] [[email protected]]
> Sent: 21 April 2011 16:15
> To: [email protected]
> Subject: Fetching urls with query string
>
> Hello,
>
> I have problems fetching some URLs that have GET parameters with Nutch. For
> example, Nutch fetches:
>
> http://www.mywebsite.com/studies/formation-offer/Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT&composante=
>
> but will not fetch:
>
> http://www.mywebsite.com/studies/formation-offer/Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT&composante=&mention=FR_RNE_0593559Y_PR_ST-dut-000001&specialite=FR_RNE_0593559Y_PR_formation-DUT-INFO
>
> I updated the crawl-urlfilter:
> #-[?*!@=]
>
> +^http://www.mywebsite.com/studies/formation-offer/
>
> and nutch-default.xml:
>
> <property>
>   <name>db.max.anchor.length</name>
>   <value>300</value>
>   <description>The maximum number of characters permitted in an anchor.
>   </description>
> </property>
>
> but I get the same result. I didn't find anything in the configuration files
> to make it work. Does anybody have an idea?
>
> Best regards,
> David
>
> Email has been scanned for viruses by Altman Technologies' email management
> service - www.altman.co.uk/emailsystems
>
> Glasgow Caledonian University is a registered Scottish charity, number
> SC021474
>
> Winner: Times Higher Education’s Widening Participation Initiative of the
> Year 2009 and Herald Society’s Education Initiative of the Year 2009.
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
>
> Winner: Times Higher Education’s Outstanding Support for Early Career
> Researchers of the Year 2010, GCU as a lead with Universities Scotland
> partners.
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,15691,en.html
>



