Hi, Nutch Gurus,

I need to crawl two dynamically generated pages:

1. http://example.com and
2. http://example.com?request_locale=es_US

The difference is that when the query parameter "request_locale" equals 
"es_US", Spanish content is loaded. We would like to crawl both URLs if 
possible. I have put both URLs in my seed.txt, but the logs show that only 
the first URL is being crawled, not the second.

I modified regex-normalize.xml so that it does not strip out query 
parameters; the modified file is given below. How do I configure Nutch to 
crawl both URLs?

Kartik

<regex-normalize>

<!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
<regex>
  <pattern>(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
  <substitution>$4</substitution>
</regex>

<!-- changes default pages into standard for /index.html, etc. into /
<regex>
  <pattern>/((?i)index|default)\.((?i)js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?)(\?|&amp;|#|$)</pattern>
  <substitution>/$3</substitution>
</regex> -->

<!-- removes interpage href anchors such as site.com#location -->
<regex>
  <pattern>#.*?(\?|&amp;|$)</pattern>
  <substitution>$1</substitution>
</regex>

<!-- cleans ?&amp;var=value into ?var=value -->
<regex>
  <pattern>\?&amp;</pattern>
  <substitution>\?</substitution>
</regex>

<!-- cleans multiple sequential ampersands into a single ampersand -->
<regex>
  <pattern>&amp;{2,}</pattern>
  <substitution>&amp;</substitution>
</regex>

<!-- removes trailing ? -->
<regex>
  <pattern>[\?&amp;\.]$</pattern>
  <substitution></substitution>
</regex>

<!-- removes duplicate slashes -->
<regex>
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>

</regex-normalize>
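
As a sanity check on the rules above, here is a minimal Python sketch that applies the active (non-commented) patterns in order to a URL. It is only an approximation of what Nutch does: it uses Python's `re` module rather than Java's regex engine, the XML entity `&amp;` is written as `&`, and the substitutions are translated by hand (`$4` becomes `\4`, `\?` becomes `?`). The names `RULES` and `normalize` are made up for this illustration and are not part of Nutch.

```python
import re

# Active rules from the regex-normalize.xml above, in file order.
# Assumption: each rule is applied as a replace-all, one after another.
RULES = [
    # removes session ids (jsessionid, PHPSESSID, ...)
    (r"(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&|#|$)", r"\4"),
    # removes interpage href anchors such as site.com#location
    (r"#.*?(\?|&|$)", r"\1"),
    # cleans ?&var=value into ?var=value
    (r"\?&", "?"),
    # cleans multiple sequential ampersands into a single ampersand
    (r"&{2,}", "&"),
    # removes a trailing ?, & or .
    (r"[\?&\.]$", ""),
    # removes duplicate slashes (but not the // after the scheme)
    (r"(?<!:)/{2,}", "/"),
]


def normalize(url):
    """Apply every rule in order and return the rewritten URL."""
    for pattern, substitution in RULES:
        url = re.sub(pattern, substitution, url)
    return url


if __name__ == "__main__":
    for seed in ("http://example.com",
                 "http://example.com?request_locale=es_US"):
        print(seed, "->", normalize(seed))
```

If both seeds come out of this unchanged, these normalizer rules are probably not what drops the second URL, and the URL filter configuration may be worth checking next.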
