======================================================================
exorbyte GmbH
Sebastian Nagel
Software developer

Line-Eid-Str. 1 | D-78467 Konstanz
Phone: 0049 7531 363 39 15 | Fax: 0049 7531 363 39 01
Email: [email protected] | Web: www.exorbyte.de

Commercial register: AG Freiburg, HRB 381802
VAT ID: DE213331910
Managing directors: Gero Lüben, Benno Nieswand

Just add a rule to your regex-normalize.xml:

<!-- lowercase URLs -->
<regex>
  <pattern>([A-Z]+)</pattern>
  <substitution>\L$1</substitution>
</regex>

\L converts the matched sequence $1 to lowercase; see
http://jakarta.apache.org/oro/api/org/apache/oro/text/regex/Perl5Substitution.html

This is smarter (and faster) than adding one rule per letter:

<regex>
  <pattern>A</pattern>
  <substitution>a</substitution>
</regex>
<regex>
  <pattern>B</pattern>
  <substitution>b</substitution>
</regex>
...
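
To illustrate the effect of the single rule, here is a minimal Python sketch that mimics it outside of Nutch. It is not Nutch code: Python's re module has no \L, so a callback does the lowercasing, and the function name lowercase_url is made up for this example.

```python
import re

def lowercase_url(url: str) -> str:
    # Lowercase every run of ASCII uppercase letters, mirroring the
    # pattern ([A-Z]+) with a lowercasing substitution.
    # Hypothetical helper for illustration only, not part of Nutch.
    return re.sub(r"[A-Z]+", lambda m: m.group(0).lower(), url)

print(lowercase_url("http://www.mysite.com/DEFAULT.ASPX"))
# prints http://www.mysite.com/default.aspx
```

Note that such a rule lowercases the whole URL string, including any query parameters.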

Of course, you could also write a URL normalizer plug-in.
It could take into account that some servers are case-sensitive,
i.e., return a 404 for the lowercased URL.
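
The logic such a plug-in might apply can be sketched as follows, in Python for brevity rather than as an actual Nutch plug-in. It assumes a hand-maintained set of hosts known to be case-insensitive; the name CASE_INSENSITIVE_HOSTS, its contents, and the function normalize_case are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical allow-list: hosts known to ignore case in URL paths.
CASE_INSENSITIVE_HOSTS = {"www.mysite.com"}

def normalize_case(url: str) -> str:
    parts = urlsplit(url)  # scheme and hostname come back lowercased
    # Lowercase only the host[:port] part, leaving any user info untouched.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    path = parts.path
    # Lowercase the path only for hosts known to be case-insensitive,
    # so case-sensitive servers do not answer with a 404.
    if parts.hostname in CASE_INSENSITIVE_HOSTS:
        path = path.lower()
    return urlunsplit((parts.scheme, netloc, path, parts.query, parts.fragment))

print(normalize_case("http://www.mysite.com/DEFAULT.ASPX"))
print(normalize_case("http://Example.org/CaseSensitive/Page"))
```

Scheme and host name are always safe to lowercase; the path is only touched for listed hosts, and the query string is left alone in this sketch.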

On 06/04/2011 04:12 PM, Marseld Dedgjonaj wrote:
Hello Everyone,

I am using nutch-1.2 + Solr-1.3 to index a site.

I see in my results that nutch-1.2 considers "www.mysite.com/default.aspx"
and "www.mysite.com/DEFAULT.ASPX" as 2 different sites.

Since the site is an ASPX site, its URLs should not be case-sensitive.

Any help or suggestions on how to make the crawl case-insensitive?

Thanks in advance.

Best Regards,

Marseldi
