Hi,

this is a deduplication problem caused by differing rules about case in URLs
(cf. [1]).
As you mentioned, it is hard to handle by URL normalization:
- only the path element of a URL (protocol://host/path?query=value) has to be
  normalized, not necessarily the parameters, which are handled by the
  application (ASP, cf. [2])
- and only for servers running on Windows or, more precisely, Windows IIS
  (eservice2.gkd-re.de appears to run on Linux)

A custom URL normalizer would be possible:
- check whether the host belongs to the list of Windows servers
- convert the path element to lowercase
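
The two steps above could be sketched in plain Java roughly as follows. This is
only an illustration, not a ready-made Nutch plugin: the class name
PathCaseNormalizer and the host list are assumptions (www.gladbeck.de is listed
only because its URLs end in .asp), and wiring the logic into Nutch's URL
normalizer extension point is not shown.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

class PathCaseNormalizer {
    // Assumption: hosts known to be served by Windows IIS (hypothetical list).
    private static final Set<String> CASE_INSENSITIVE_HOSTS = Set.of("www.gladbeck.de");

    // Lowercases only the path of URLs on listed hosts; query and fragment stay untouched.
    static String normalize(String url) {
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            String host = uri.getHost();
            String path = uri.getRawPath();
            if (scheme == null || host == null || path == null
                    || !CASE_INSENSITIVE_HOSTS.contains(host.toLowerCase(Locale.ROOT))) {
                return url; // not a listed host: leave the URL as it is
            }
            StringBuilder sb = new StringBuilder();
            sb.append(scheme).append("://").append(uri.getRawAuthority());
            sb.append(path.toLowerCase(Locale.ROOT));
            if (uri.getRawQuery() != null) {
                sb.append('?').append(uri.getRawQuery());
            }
            if (uri.getRawFragment() != null) {
                sb.append('#').append(uri.getRawFragment());
            }
            return sb.toString();
        } catch (URISyntaxException e) {
            return url; // unparsable URLs are passed through unchanged
        }
    }
}
```

With this sketch, the path /Leben_Wohnen/autostart.asp on www.gladbeck.de is
lowercased while its query string keeps its case, and URLs on
eservice2.gkd-re.de are returned unchanged.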

The regex normalizer no longer supports \L since it was moved from ORO to Java
regexes (NUTCH-1013). In any case, it would be difficult (if not impossible) to
formulate a proper regular expression which catches only the path elements for
certain hosts.
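
To illustrate the \L point with plain java.util.regex (independent of Nutch): a
backslash in a Java replacement string merely escapes the next character, so
"\L" yields a literal "L" rather than switching to lowercase. Lowercasing a
match needs code, for example the function-based replaceAll available since
Java 9. The class and method names below are made up for this sketch.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class LowercaseReplacement {
    private static final Pattern UPPER = Pattern.compile("([A-Z]+)");

    // Same pattern and substitution as in regex-normalize.xml:
    // Java inserts a literal 'L' before each match instead of lowercasing it.
    static String withBackslashL(String s) {
        return UPPER.matcher(s).replaceAll("\\L$1");
    }

    // Lowercasing done in code for each match (requires Java 9 or later).
    static String withFunction(String s) {
        return UPPER.matcher(s).replaceAll(m -> m.group(1).toLowerCase(Locale.ROOT));
    }
}
```

For the input "/Leben_Wohnen", the first method produces "/LLeben_LWohnen" and
the second "/leben_wohnen".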

Maybe it's best to find a pragmatic solution, which could be one of:

- (if you are in contact with the web admins)
  * find the links causing the duplicates and fix them
    (duplicates are also an issue for SEO, so it may be worth the effort)
  * possibly Windows IIS can be configured to send redirects if the case does
    not match

- (if there are few of these duplicates)
  maintain a list of duplicates and send deletions to Solr right after each
  run of Nutch

- (if there are many duplicates)
  use "nutch dedup" to remove duplicates by content, but make sure to choose a
  signature (see property db.signature.class) that does recognize the
  duplicates
  * org.apache.nutch.crawl.MD5Signature may not work because the paths
    differing in case can appear in the HTML as hrefs
  * org.apache.nutch.crawl.TextMD5Signature should work but is only available
    since Nutch 1.10 (it's easily ported, see NUTCH-1693)
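
Why the choice of signature matters can be shown with a small stand-alone
sketch. It does not use the Nutch classes: rawSignature and textSignature are
hypothetical stand-ins for a hash over the raw page content and a hash over the
extracted text, and the tag stripping is deliberately naive.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class SignatureSketch {
    static String md5(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hash over the raw HTML: any difference in markup changes the signature.
    static String rawSignature(String html) {
        return md5(html);
    }

    // Hash over the visible text only (naive tag stripping, for illustration).
    static String textSignature(String html) {
        String text = html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
        return md5(text);
    }
}
```

Two pages that differ only in the case of an href, such as
<a href="/Leben_Wohnen/info.asp"> versus <a href="/leben_wohnen/info.asp">,
get different raw signatures but the same text signature, so only the
text-based signature would let them be recognized as duplicates.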


Best,
Sebastian

[1] https://webmasters.stackexchange.com/questions/90339/why-are-urls-case-sensitive
[2] https://forums.iis.net/t/1165661.aspx


On 09/07/2017 05:37 PM, Schwank, Désirée wrote:
> Hello community,
> 
> we use Nutch in combination with Solr for crawling internet and intranet
> sites for our clients. Unfortunately, I did not find a suitable solution for
> the following problem, but I am convinced there has to be one.
> 
> The versions installed on a Linux Debian system are Solr 4.10.2 and Nutch
> 1.9. However, the sites being crawled are on a Windows web server (IIS).
> Nutch on Linux treats URLs as case-sensitive, but the Windows server treats
> them as case-insensitive.
> 
> I have tried the following substitution in regex-normalize.xml:
> <regex>
>    <pattern>([A-Z]+)</pattern>
>    <substitution>\L$1</substitution>
> </regex>.
> 
> First, it is of no use because some URLs must not be changed to lowercase,
> here because of parameters or names of servlets:
> https://eservice2.gkd-re.de/bsointer320/DokumentServlet?dokumentenname=320l1946.pdf
> https://www.gladbeck.de/Leben_Wohnen/autostart.asp?db=404&form=report&searchfieldBeginndatum.max=06.09.2017&searchfieldAblaufdatum.min=06.09.2017&top=5
> 
> Second, it doesn't work, presumably because of the installed Nutch version
> 1.9. I have read somewhere that it is not supposed to work since Nutch 1.5,
> for whatever reason. It was suggested to use a custom URL normalizer.
> Otherwise it might be possible to prepare some regular expressions. Could
> that be what the mentioned deduplication is about (see message
> https://www.mail-archive.com/user@nutch.apache.org/msg03904.html)?
> 
> Thanks for help or any useful hints in advance.
> 
> Kind regards
> Désirée Schwank
> Team Verfahrensintegration/E-Government
> Gemeinsame Kommunale Datenzentrale Recklinghausen
> Zweckverband
> Castroper Straße 30, 45665 Recklinghausen
> Tel.: +49(0)2361-3033-247
> Fax: +49(0)2361-3033-333
> E-Mail desiree.schw...@gkd-re.de<mailto:desiree.schw...@gkd-re.de>
> Internet: www.gkd-re.de<http://www.gkd-re.de/>
> Please use our OTRS ticket system. Send your requests to
> e-governm...@support.gkd-re.de<mailto:e-governm...@support.gkd-re.de>. That
> way you can be sure that someone will always take care of your problem.
> 
