Hi,
this is a deduplication problem caused by differing rules regarding case in
URLs (cf. [1]).
As you mentioned, it's hard to handle by URL normalization:
- only the path element of a URL (protocol://host/path?query=value) has to be
  normalized, not necessarily the query parameters, which are handled by the
  application (ASP, cf. [2])
- and only for servers running on Windows, or more precisely Windows IIS
  (eservice2.gkd-re.de appears to run on Linux)
A custom URL normalizer would be possible:
- check whether the host belongs to the list of Windows servers
- convert the path element to lowercase
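As a rough sketch of these two steps (written in Python only to keep it short;
in Nutch itself this would have to be a Java plugin for the URL normalizer
extension point, org.apache.nutch.net.URLNormalizer). The host list and the
function name are assumptions for illustration, and I have not verified that
www.gladbeck.de really runs on IIS:

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of hosts known to treat paths case-insensitively
# (i.e. Windows IIS servers). This would have to be maintained by hand.
CASE_INSENSITIVE_HOSTS = {"www.gladbeck.de"}


def normalize(url: str) -> str:
    """Lowercase only the path element, and only for listed hosts."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in CASE_INSENSITIVE_HOSTS:
        # Unknown or case-sensitive server: leave the URL untouched.
        return url
    # The query string is kept as-is because the application may
    # treat parameter names and values as case-sensitive.
    return urlunsplit(parts._replace(path=parts.path.lower()))


if __name__ == "__main__":
    for u in (
        "https://www.gladbeck.de/Leben_Wohnen/autostart.asp?db=404&form=report",
        "https://eservice2.gkd-re.de/bsointer320/DokumentServlet?dokumentenname=320l1946.pdf",
    ):
        print(normalize(u))
```

The first URL gets a lowercased path with its query untouched, the second one
is returned unchanged because its host is not in the list.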
The regex normalizer no longer supports \L since it was moved from ORO to Java
regexes (NUTCH-1013). However, even if it did, it would be difficult (if not
impossible) to formulate a regular expression which catches only the path
elements for certain hosts.
Maybe it's best to find a pragmatic solution, which could be one of:
- (if you are in contact with the web admins)
  * find the links causing the duplicates and fix them
    (duplicates are also an issue for SEO, so it may be worth the effort)
  * possibly, Windows IIS can be configured to send redirects if the case
    does not match
- (if there are few of these duplicates)
  maintain a list of duplicates and send deletions to Solr right after each
  run of Nutch
- (if there are many duplicates)
  use "nutch dedup" to remove duplicates by content, but make sure to choose
  a signature (see property db.signature.class) that recognizes the duplicates
  * org.apache.nutch.crawl.MD5Signature may not work because paths differing
    in case can appear in the HTML as hrefs
  * org.apache.nutch.crawl.TextMD5Signature should work but is only available
    since Nutch 1.10 (it's easily ported, see NUTCH-1693)
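To illustrate why the choice of signature matters, here is a small Python
sketch. It is not the actual code of MD5Signature or TextMD5Signature (Nutch
uses the text extracted by its parser); the two helper functions and the
example pages with the path /Leben_Wohnen/info.asp are made up for the
demonstration:

```python
import hashlib
import re


def raw_md5(html: str) -> str:
    """Digest over the raw content, including markup and href values."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()


def text_md5(html: str) -> str:
    """Digest over the visible text only (tags crudely stripped by regex)."""
    text = re.sub(r"<[^>]+>", " ", html)
    text = " ".join(text.split())  # collapse whitespace
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Same page fetched under two URLs; only the case of an href differs.
page_a = '<p>See the <a href="/Leben_Wohnen/info.asp">info page</a>.</p>'
page_b = '<p>See the <a href="/leben_wohnen/info.asp">info page</a>.</p>'

print(raw_md5(page_a) == raw_md5(page_b))    # False
print(text_md5(page_a) == text_md5(page_b))  # True
```

With a digest over the raw content the two variants look like different
documents, while a digest over the text alone treats them as duplicates.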
Best,
Sebastian
[1] https://webmasters.stackexchange.com/questions/90339/why-are-urls-case-sensitive
[2] https://forums.iis.net/t/1165661.aspx
On 09/07/2017 05:37 PM, Schwank, Désirée wrote:
> Hello community,
>
> we use nutch in combination with solr for crawling internet- and
> intranet-sites for our clients. Unfortunately I did not find a suitable
> solution for the following problem but I am convinced there has to be one.
>
> The versions installed on a Linux Debian system are Solr 4.10.2 and Nutch
> 1.9. However, the crawled sites are hosted on a Windows web server (IIS).
> Nutch on Linux treats URLs as case-sensitive, but the Windows server handles
> them case-insensitively.
>
> I have tried the substitution in the regex-normalize.xml
> <regex>
> <pattern>([A-Z]+)</pattern>
> <substitution>\L$1</substitution>
> </regex>.
>
> In the first case it is useless because some URLs should not be changed to
> lowercase, here because of parameters or names of servlets.
> https://eservice2.gkd-re.de/bsointer320/DokumentServlet?dokumentenname=320l1946.pdf
> https://www.gladbeck.de/Leben_Wohnen/autostart.asp?db=404&form=report&searchfieldBeginndatum.max=06.09.2017&searchfieldAblaufdatum.min=06.09.2017&top=5
>
> In the second case it doesn't work, presumably because of the installed Nutch
> version 1.9. I have read somewhere that it is no longer supposed to work since
> Nutch 1.5, for whatever reason. It was suggested to use a custom
> URL normalizer. Alternatively, it might be possible to prepare some regular
> expressions. Could that be what the mentioned deduplication is about (see message
> https://www.mail-archive.com/[email protected]/msg03904.html)?
>
> Thanks for help or any useful hints in advance.
>
> Kind regards
> Désirée Schwank
> Team Verfahrensintegration/E-Government
> Gemeinsame Kommunale Datenzentrale Recklinghausen
> Zweckverband
> Castroper Straße 30, 45665 Recklinghausen
> Tel.: +49(0)2361-3033-247
> Fax: +49(0)2361-3033-333
> E-Mail [email protected]<mailto:[email protected]>
> Internet: www.gkd-re.de<http://www.gkd-re.de/>
> Please use our OTRS ticket system. Please send your requests to
> [email protected]<mailto:[email protected]>. That way
> you can be sure that someone will always take care of your problem.
>