Hi, you probably have URL filtering enabled, specifically the regex URL
filter. By default it filters out URLs that contain query strings. Check
your URL filters.
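
For reference, a stock conf/regex-urlfilter.txt usually contains a reject
rule like the second one below, which drops any URL containing ?, *, !, @
or =. This is only a sketch and assumes your copy looks like the default.
The accept line for example.com is a hypothetical illustration; it has to
sit above the reject rule, because the first matching rule decides:

```text
# hypothetical accept rule for the locale URL (first matching rule wins)
+^http://example\.com/?\?request_locale=

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept anything else
+.
```

Alternatively, commenting out the -[?*!@=] line should let query strings
through for every host, which may or may not be what you want.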

Markus

-----Original message-----
> From: Krishnanand, Kartik <[email protected]>
> Sent: Friday 12th September 2014 13:04
> To: [email protected]
> Subject: Crawl URL with varying query parameters values
> 
> Hi, Nutch Gurus,
> 
> I need to crawl two dynamically generated pages:
> 
> 
> 1. http://example.com and
> 
> 2. http://example.com?request_locale=es_US
> 
> The difference is that when the query parameter "request_locale" equals 
> "es_US", Spanish content is loaded. We would like to be able to crawl both 
> URLs if possible. I have listed both URLs in my seed.txt, but the logs 
> show that only the first URL is being crawled, not the second.
> 
> I modified regex-normalize.xml so that it does not strip out query 
> parameters; it is given below. How do I configure Nutch to crawl both URLs?
> 
> Kartik
> 
> <regex-normalize>
> 
> <!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
> <regex>
>   <pattern>(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
>   <substitution>$4</substitution>
> </regex>
> 
> <!-- changes default pages into standard for /index.html, etc. into /
> <regex>
>   <pattern>/((?i)index|default)\.((?i)js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?)(\?|&amp;|#|$)</pattern>
>   <substitution>/$3</substitution>
> </regex> -->
> 
> <!-- removes interpage href anchors such as site.com#location -->
> <regex>
>   <pattern>#.*?(\?|&amp;|$)</pattern>
>   <substitution>$1</substitution>
> </regex>
> 
> <!-- cleans ?&amp;var=value into ?var=value -->
> <regex>
>   <pattern>\?&amp;</pattern>
>   <substitution>\?</substitution>
> </regex>
> 
> <!-- cleans multiple sequential ampersands into a single ampersand -->
> <regex>
>   <pattern>&amp;{2,}</pattern>
>   <substitution>&amp;</substitution>
> </regex>
> 
> <!-- removes trailing ? -->
> <regex>
>   <pattern>[\?&amp;\.]$</pattern>
>   <substitution></substitution>
> </regex>
> 
> <!-- removes duplicate slashes -->
> <regex>
>   <pattern>(?&lt;!:)/{2,}</pattern>
>   <substitution>/</substitution>
> </regex>
> 
> </regex-normalize>