> 1) Do we have a config setting that we can use already?

Not out of the box. But there is already an extension point for your use case [1]:
the filter method takes two arguments (fromURL and toURL). Have a look at it;
maybe you can solve this by implementing and contributing a plugin.
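
For illustration only, the two-argument check described in the answer to 1) could look roughly like the stand-alone sketch below. The class name WwwExemptionSketch and the helper normalizeHost are hypothetical and not part of Nutch. A real plugin would implement org.apache.nutch.net.URLExemptionFilter and be registered as a plugin, which is not shown here. The sketch also assumes that returning true means "exempt this link from the external-link rule".

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Locale;

// Hypothetical stand-alone sketch, NOT an actual Nutch plugin: it only mirrors
// the (fromUrl, toUrl) shape of the filter method mentioned above.
class WwwExemptionSketch {

    // Lower-cases the host and drops one leading "www." label, if present.
    static String normalizeHost(String url) throws MalformedURLException {
        URL parsed = URI.create(url).toURL();
        String host = parsed.getHost().toLowerCase(Locale.ROOT);
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    // Assumption: true means "exempt toUrl from the external-link rule".
    // Here a link is exempted when both hosts match once "www." is ignored.
    static boolean filter(String fromUrl, String toUrl) {
        try {
            String from = normalizeHost(fromUrl);
            return !from.isEmpty() && from.equals(normalizeHost(toUrl));
        } catch (MalformedURLException | IllegalArgumentException e) {
            return false; // unparsable URLs are never exempted
        }
    }

    public static void main(String[] args) {
        // www.wincs.be -> wincs.be: same host once "www." is ignored
        System.out.println(filter("http://www.wincs.be/", "http://wincs.be/lakindustrie.html"));
        // www.somewebsite.com -> www.art.somewebsite.com: still a different host
        System.out.println(filter("http://www.somewebsite.com/", "http://www.art.somewebsite.com/"));
    }
}
```

Only a leading "www." is ignored, so other subdomains such as www.art.somewebsite.com stay external, which matches the byHost use case described in the thread.
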

> 2) ... It looks more like same Host problem rather ...

To determine the host of a URL, Nutch uses java.net.URL.getHost() everywhere,
which implements RFC 1738 [2]. We cannot change Java, but it would be possible
to modify URLUtil.getDomainName(...), at least as a work-around.

> 3) Where this problem should be solved? Only in ParseOutputFormat.java or
> somewhere else as well?

You may also want to fix it in FetcherThread.handleRedirect(...), which also
affects your use case of following only internal links
(if db.ignore.also.redirects == true).

Best,
Sebastian

[1] https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/net/URLExemptionFilter.html
    https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.html
[2] https://tools.ietf.org/html/rfc1738#section-3.1

On 02/21/2018 01:52 PM, Semyon Semyonov wrote:
> Hi Sebastian,
>
> If I
> - modify the method URLUtil.getDomainName(URL url)
>
> doesn't it mean that I don't need
> - set db.ignore.external.links.mode=byDomain
>
> anymore? http://www.somewebsite.com becomes the same host as somewebsite.com.
>
> To make it as generic as possible I can create an issue/pull request for
> this, but I would like to hear your suggestion about the best way to do so.
> 1) Do we have a config setting that we can use already?
> 2) The domain discussion [1] is quite wide though. In my case I cover only one
> issue with the mapping www -> _ . It looks more like same Host problem rather
> than the same Domain problem. What do you think about such host resolution?
> 3) Where this problem should be solved? Only in ParseOutputFormat.java or
> somewhere else as well?
>
> Semyon.
>
> Sent: Wednesday, February 21, 2018 at 11:51 AM
> From: "Sebastian Nagel" <[email protected]>
> To: [email protected]
> Subject: Re: Internal links appear to be external in Parse.
> Improvement of the crawling quality
>
> Hi Semyon,
>
>> interpret www.somewebsite.com and somewebsite.com as one host?
>
> Yes, that's a common problem. More because of external links which must
> include the host name - well-designed sites would use relative links
> for internal same-host links.
>
> For a quick work-around:
> - set db.ignore.external.links.mode=byDomain
> - modify the method URLUtil.getDomainName(URL url)
>   so that it returns the hostname with www. stripped
>
> For a final solution we could make it configurable
> which method or class is called. Since the definition of "domain"
> is somewhat debatable [1], we could even provide alternative
> implementations.
>
>> PS. For me it is not really clear how ProtocolResolver works.
>
> It's only a heuristic to avoid duplicates by protocol (http and https).
> If you care about duplicates and cannot get rid of them afterwards by a
> deduplication job, you may have a look at urlnormalizer-protocol and NUTCH-2447.
>
> Best,
> Sebastian
>
> [1] https://github.com/google/guava/wiki/InternetDomainNameExplained
>
> On 02/21/2018 10:44 AM, Semyon Semyonov wrote:
>> Thanks Yossi, Markus,
>>
>> I have an issue with the db.ignore.external.links.mode=byDomain solution.
>>
>> I crawl specific hosts only, therefore I have a finite number of hosts to
>> crawl.
>> Let's say, www.somewebsite.com
>>
>> I want to stay limited to this host. In other words, neither
>> www.art.somewebsite.com nor www.sport.somewebsite.com.
>> That's why db.ignore.external.links.mode=byHost and db.ignore.external =
>> true (no external websites).
>>
>> Although, I want to get the links that seem to belong to the same
>> host (www.somewebsite.com -> somewebsite.com/games, without www).
>> The question is: shouldn't we include it as a default behavior (or configured
>> behavior) in Nutch and interpret www.somewebsite.com and somewebsite.com as one
>> host?
>>
>> PS. For me it is not really clear how ProtocolResolver works.
>>
>> Semyon
>>
>> Sent: Tuesday, February 20, 2018 at 9:40 PM
>> From: "Markus Jelsma" <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Subject: RE: Internal links appear to be external in Parse. Improvement of
>> the crawling quality
>>
>> Hello Semyon,
>>
>> Yossi is right, you can use the db.ignore.* set of directives to resolve the
>> problem.
>>
>> Regarding protocol, you can use urlnormalizer-protocol to set up per host
>> rules. This is, of course, a tedious job if you operate a crawl on an
>> indefinite amount of hosts, so use the uncommitted ProtocolResolver for that
>> to do it for you.
>>
>> See: https://issues.apache.org/jira/browse/NUTCH-2247
>>
>> If i remember it tomorrow afternoon, i can probably schedule some time to
>> work on it the coming seven days or so, and commit.
>>
>> Regards,
>> Markus
>>
>> -----Original message-----
>>> From:Yossi Tamari <[email protected]>
>>> Sent: Tuesday 20th February 2018 21:06
>>> To: [email protected]
>>> Subject: RE: Internal links appear to be external in Parse. Improvement of
>>> the crawling quality
>>>
>>> Hi Semyon,
>>>
>>> Wouldn't setting db.ignore.external.links.mode=byDomain solve your wincs.be
>>> issue?
>>> As far as I can see the protocol (HTTP/HTTPS) does not play any part in the
>>> decision if this is the same domain.
>>>
>>> Yossi.
>>>
>>>> -----Original Message-----
>>>> From: Semyon Semyonov [mailto:[email protected]]
>>>> Sent: 20 February 2018 20:43
>>>> To: usernutch.apache.org <[email protected]>
>>>> Subject: Internal links appear to be external in Parse.
>>>> Improvement of the crawling quality
>>>>
>>>> Dear All,
>>>>
>>>> I'm trying to increase quality of the crawling. A part of my database has
>>>> DB_FETCHED = 1.
>>>>
>>>> Example: http://www.wincs.be/ in seed list.
>>>>
>>>> The root of the problem is in Parse/ParseOutputFormat.java line 364 - 374
>>>>
>>>> Nutch considers one of the links (http://wincs.be/lakindustrie.html)
>>>> as external and therefore rejects it.
>>>>
>>>> If I insert http://wincs.be in seed file, everything works fine.
>>>>
>>>> Do you think it is a good behavior? I mean, formally it is indeed two
>>>> different domains, but from user perspective it is exactly the same.
>>>>
>>>> And if it is a default behavior, how can I fix it for my case? The same
>>>> question for similar switch http -> https etc.
>>>>
>>>> Thanks.
>>>>
>>>> Semyon.
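
The behaviour reported in the original question can be reproduced with plain java.net.URL, independent of Nutch. The sketch below is a simplified illustration, not the actual code in ParseOutputFormat.java. The class name HostComparisonSketch and the method sameHost are hypothetical. It compares the values returned by getHost(), which is the kind of by-host check the thread discusses.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical illustration of a by-host comparison; not Nutch source code.
class HostComparisonSketch {

    // Returns true only when both URLs parse and their hosts are identical
    // (ignoring case). The scheme (http vs. https) is not part of the check.
    static boolean sameHost(String fromUrl, String toUrl) {
        try {
            URL from = URI.create(fromUrl).toURL();
            URL to = URI.create(toUrl).toURL();
            return from.getHost().equalsIgnoreCase(to.getHost());
        } catch (MalformedURLException | IllegalArgumentException e) {
            return false; // treat unparsable URLs as different hosts
        }
    }

    public static void main(String[] args) {
        // Seed with "www.", outlink without it: hosts differ
        System.out.println(sameHost("http://www.wincs.be/", "http://wincs.be/lakindustrie.html"));
        // Same host, different protocol: hosts are equal
        System.out.println(sameHost("http://wincs.be/", "https://wincs.be/lakindustrie.html"));
    }
}
```

Under this comparison www.wincs.be and wincs.be count as different hosts, while a change from http to https alone does not change the host, in line with Yossi's remark that the protocol plays no part in the decision.
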

