Hello community,

we use Nutch in combination with Solr to crawl internet and intranet sites for our clients. Unfortunately, I have not found a suitable solution for the following problem, but I am convinced there has to be one.
We run Solr 4.10.2 and Nutch 1.9 on a Debian Linux system. The sites being crawled, however, are hosted on a Windows web server (IIS). Nutch on Linux treats URLs as case-sensitive, whereas the Windows server handles them case-insensitively, so URLs that differ only in case lead to the same page.

I have tried the following substitution in regex-normalize.xml:

  <regex>
    <pattern>([A-Z]+)</pattern>
    <substitution>\L$1</substitution>
  </regex>

There are two problems with it.

First, it is of no use, because some URLs must not be changed to lowercase, for example because of parameters or servlet names:

  https://eservice2.gkd-re.de/bsointer320/DokumentServlet?dokumentenname=320l1946.pdf
  https://www.gladbeck.de/Leben_Wohnen/autostart.asp?db=404&form=report&searchfieldBeginndatum.max=06.09.2017&searchfieldAblaufdatum.min=06.09.2017&top=5

Second, it does not work at all, supposedly because of the installed Nutch version 1.9. I have read somewhere that this substitution has not been supposed to work since Nutch 1.5, for whatever reason.

It was suggested to use a custom URL normalizer. Alternatively, it might be possible to prepare some regular expressions. Could that be what the deduplication mentioned in the following message is about?

  https://www.mail-archive.com/[email protected]/msg03904.html

Thanks in advance for any help or useful hints.

Kind regards,
Désirée Schwank
Team Verfahrensintegration/E-Government
Gemeinsame Kommunale Datenzentrale Recklinghausen, Zweckverband
Internet: www.gkd-re.de
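
P.S. To make the requirement concrete, here is a minimal sketch in plain Java of the kind of selective normalization described above: lowercase only the path, only for hosts known to be case-insensitive, and leave the query string untouched. It is not a Nutch plugin and not a tested solution. The class name PathCaseFolder, the method normalize, and the host list are assumptions made for illustration; the logic would still have to be wrapped in a custom URL normalizer.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class PathCaseFolder {

    // Hypothetical list of hosts that serve paths case-insensitively (e.g. IIS).
    private static final Set<String> CASE_INSENSITIVE_HOSTS =
            new HashSet<>(Arrays.asList("www.gladbeck.de"));

    /**
     * Lowercases the path of the URL if its host is in the list above.
     * The query string and fragment are copied unchanged, and URLs on
     * any other host are returned exactly as they came in.
     */
    public static String normalize(String url) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return url; // leave unparsable URLs alone
        }

        String scheme = uri.getScheme();
        String host = uri.getHost();
        String path = uri.getRawPath();
        if (scheme == null || host == null || path == null
                || !CASE_INSENSITIVE_HOSTS.contains(host.toLowerCase(Locale.ROOT))) {
            return url;
        }

        StringBuilder result = new StringBuilder();
        result.append(scheme).append("://").append(uri.getRawAuthority());
        result.append(path.toLowerCase(Locale.ROOT));
        if (uri.getRawQuery() != null) {
            result.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            result.append('#').append(uri.getRawFragment());
        }
        return result.toString();
    }
}
```

With this sketch, the gladbeck.de URL above would get the path /leben_wohnen/autostart.asp while its parameters keep their case, and the DokumentServlet URL would pass through unchanged because its host is not in the list.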

