Hello community,

We use Nutch together with Solr to crawl internet and intranet sites for our 
clients. Unfortunately I have not found a suitable solution for the following 
problem, but I am convinced there has to be one.

The versions installed on a Debian Linux system are Solr 4.10.2 and Nutch 1.9. 
However, the sites being crawled are hosted on a Windows web server (IIS). 
Nutch on Linux treats URLs as case-sensitive, whereas the Windows server treats 
them as case-insensitive.

I have tried the following substitution in regex-normalize.xml:
<regex>
   <pattern>([A-Z]+)</pattern>
   <substitution>\L$1</substitution>
</regex>

First, this is not useful, because some URLs must not be changed to lowercase, 
for example because of parameters or servlet names:
https://eservice2.gkd-re.de/bsointer320/DokumentServlet?dokumentenname=320l1946.pdf
https://www.gladbeck.de/Leben_Wohnen/autostart.asp?db=404&form=report&searchfieldBeginndatum.max=06.09.2017&searchfieldAblaufdatum.min=06.09.2017&top=5
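
To make the requirement concrete, what I am looking for would behave roughly 
like the following sketch. It is written in Python only to show the logic and 
is not Nutch code; the host list and the function name normalize_case are my 
own placeholders. The idea is to lowercase only the path, and only for hosts 
known to be case-insensitive, while leaving the query string untouched.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of hosts assumed to be served case-insensitively (e.g. IIS).
CASE_INSENSITIVE_HOSTS = {"www.gladbeck.de"}


def normalize_case(url: str) -> str:
    """Lowercase only the path of URLs on case-insensitive hosts."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in CASE_INSENSITIVE_HOSTS:
        return url  # URLs on all other hosts are left untouched
    # Query string and fragment are kept exactly as they are.
    return urlunsplit(parts._replace(path=parts.path.lower()))


if __name__ == "__main__":
    print(normalize_case(
        "https://www.gladbeck.de/Leben_Wohnen/autostart.asp?db=404&form=report"
    ))
    print(normalize_case(
        "https://eservice2.gkd-re.de/bsointer320/DokumentServlet"
        "?dokumentenname=320l1946.pdf"
    ))
```

With this approach the servlet URL would stay as it is, because its host is 
not in the list, and parameter names such as searchfieldBeginndatum.max would 
keep their case.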

Second, it does not work, presumably because of the installed Nutch version 
1.9. I have read somewhere that this has not been supposed to work since Nutch 
1.5, for whatever reason. It was suggested to use a custom URL normalizer 
instead. Alternatively, it might be possible to prepare some regular 
expressions. Could that be what the mentioned deduplication is about (see 
message https://www.mail-archive.com/[email protected]/msg03904.html)?
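
My understanding of deduplication, which may well be wrong, is roughly the 
following: pages that were fetched under URLs differing only in case have 
identical content, so they could be detected after the crawl by comparing 
content hashes. The sketch below only illustrates that idea in Python. The 
function group_duplicates, the input format and the example.org URLs are 
assumptions of mine, not something taken from Nutch or from the linked message.

```python
import hashlib
from collections import defaultdict


def group_duplicates(pages):
    """Group URLs whose fetched content is byte-for-byte identical.

    `pages` is an iterable of (url, content_bytes) pairs. Both the input
    shape and this function are hypothetical, for illustration only.
    """
    groups = defaultdict(list)
    for url, content in pages:
        digest = hashlib.sha256(content).hexdigest()
        groups[digest].append(url)
    # Only digests seen under more than one URL indicate duplicates.
    return [urls for urls in groups.values() if len(urls) > 1]


if __name__ == "__main__":
    sample = [
        ("http://example.org/Start.asp", b"<html>home</html>"),
        ("http://example.org/start.asp", b"<html>home</html>"),
        ("http://example.org/other.asp", b"<html>other</html>"),
    ]
    for urls in group_duplicates(sample):
        print(urls)
```

If deduplication works along these lines, it would remove the duplicates from 
the index afterwards, but the crawler would still fetch every case variant.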

Thanks in advance for any help or useful hints.

Kind regards
Désirée Schwank
Team Verfahrensintegration/E-Government
Gemeinsame Kommunale Datenzentrale Recklinghausen
Zweckverband
Castroper Straße 30, 45665 Recklinghausen
Tel.: +49(0)2361-3033-247
Fax: +49(0)2361-3033-333
E-Mail: [email protected]
Internet: www.gkd-re.de
Please use our OTRS ticket system. Please send your requests to 
[email protected]. That way you can be sure that someone will always take care 
of your problem.
