Hi,

it's a deduplication problem caused by differing rules about case in URLs (cf. [1]). As you mentioned, it's hard to handle by URL normalization:
- only the path element of a URL (protocol://host/path?query=value) has to be normalized, not necessarily the parameters, which are handled by the application (ASP, cf. [2])
- and only for servers running on Windows, or more precisely Windows IIS (eservice2.gkd-re.de appears to run on Linux)
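To illustrate the two constraints above, here is a minimal sketch in plain Java. It is not wired into Nutch's URL normalizer plugin mechanism; the class name PathLowercaser and the host list are hypothetical, and treating www.gladbeck.de as a Windows host is only an assumption for the example:

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class PathLowercaser {

    // Hypothetical list of hosts assumed to run on Windows IIS.
    private static final Set<String> WINDOWS_HOSTS =
            new HashSet<>(Arrays.asList("www.gladbeck.de"));

    /** Lowercases only the path, and only for hosts in WINDOWS_HOSTS. */
    public static String normalize(String urlString) {
        URL url;
        try {
            url = new URL(urlString);
        } catch (MalformedURLException e) {
            return urlString; // leave unparsable URLs untouched
        }
        String host = url.getHost().toLowerCase(Locale.ROOT);
        if (!WINDOWS_HOSTS.contains(host)) {
            return urlString; // other hosts may be case-sensitive
        }
        String path = url.getPath().toLowerCase(Locale.ROOT);
        String query = url.getQuery(); // kept as-is: the application interprets it
        String file = (query == null) ? path : path + "?" + query;
        try {
            // Rebuild the URL; a fragment (#...) is dropped here.
            return new URL(url.getProtocol(), url.getHost(), url.getPort(), file).toString();
        } catch (MalformedURLException e) {
            return urlString;
        }
    }
}
```

With this sketch, PathLowercaser.normalize("https://www.gladbeck.de/Leben_Wohnen/autostart.asp?db=404") returns https://www.gladbeck.de/leben_wohnen/autostart.asp?db=404, while URLs on eservice2.gkd-re.de are returned unchanged.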
A custom URL normalizer would be possible:
- check whether the host belongs to the list of Windows servers
- convert the path element to lowercase

The regex normalizer no longer supports \L since it was moved from ORO to Java regexes (NUTCH-1013). In any case, it would be difficult (perhaps impossible) to formulate a proper regular expression which catches only the path elements for certain hosts.

Maybe it's best to find a pragmatic solution, which could be one of:
- (if you are in contact with the web admins)
  * find the links causing the duplicates and fix them (duplicates are also an issue for SEO, so it may be worth the work)
  * possibly configure Windows IIS to send redirects if the case does not match
- (if there are few of these duplicates) maintain a list of duplicates and send deletions to Solr right after each run of Nutch
- (if there are many duplicates) use "nutch dedup" to remove duplicates by content, but make sure to choose a signature (see property db.signature.class) that recognizes the duplicates
  * org.apache.nutch.crawl.MD5Signature may not work because the paths differing in case can appear in the HTML as hrefs
  * org.apache.nutch.crawl.TextMD5Signature should work but is only available since Nutch 1.10 (it's easily ported, see NUTCH-1693)

Best,
Sebastian

[1] https://webmasters.stackexchange.com/questions/90339/why-are-urls-case-sensitive
[2] https://forums.iis.net/t/1165661.aspx

On 09/07/2017 05:37 PM, Schwank, Désirée wrote:
> Hello community,
>
> we use Nutch in combination with Solr for crawling internet and
> intranet sites for our clients. Unfortunately I did not find a suitable
> solution for the following problem, but I am convinced there has to be one.
>
> The versions installed on a Debian Linux system are Solr 4.10.2 and Nutch
> 1.9. However, the sites are crawled on a Windows web server (IIS). Nutch on
> Linux behaves case-sensitively, but the Windows results are case-insensitive.
>
> I have tried the following substitution in regex-normalize.xml:
> <regex>
>   <pattern>([A-Z]+)</pattern>
>   <substitution>\L$1</substitution>
> </regex>
>
> First, it is useless because some URLs should not be changed to
> lowercase, here because of parameters or names of servlets:
> https://eservice2.gkd-re.de/bsointer320/DokumentServlet?dokumentenname=320l1946.pdf
> https://www.gladbeck.de/Leben_Wohnen/autostart.asp?db=404&form=report&searchfieldBeginndatum.max=06.09.2017&searchfieldAblaufdatum.min=06.09.2017&top=5
>
> Second, it doesn't work, supposedly because of the installed Nutch
> version 1.9. I have read somewhere that it is not supposed to work since
> Nutch 1.5, for whatever reason. It was suggested to use a custom
> URL normalizer. Otherwise it could be possible to prepare some regular
> expressions. Could that be what the mentioned deduplication is about (see message
> https://www.mail-archive.com/user@nutch.apache.org/msg03904.html)?
>
> Thanks in advance for any help or useful hints.
>
> Kind regards
> Désirée Schwank
> Team Verfahrensintegration/E-Government
> Gemeinsame Kommunale Datenzentrale Recklinghausen
> Zweckverband
> Castroper Straße 30, 45665 Recklinghausen
> Tel.: +49(0)2361-3033-247
> Fax: +49(0)2361-3033-333
> E-Mail: desiree.schw...@gkd-re.de
> Internet: www.gkd-re.de
> Please use our OTRS ticket system. Please send your requests to
> e-governm...@support.gkd-re.de. That way
> you can be sure that someone is always taking care of your problem.
>