Hi Semyon,

> interpret www.somewebsite.com and somewebsite.com as one host?

Yes, that's a common problem. It is mostly caused by external links, which
must include the host name; well-designed sites use relative links for
internal same-host links.

For a quick work-around:
- set db.ignore.external.links.mode=byDomain
- modify the method URLUtil.getDomainName(URL url) so that it returns the
  hostname with www. stripped

For a final solution we could make it configurable which method or class is
called. Since the definition of "domain" is somewhat debatable [1], we could
even provide alternative implementations.

> PS. For me it is not really clear how ProtocolResolver works.

It's only a heuristic to avoid duplicates by protocol (http and https). If you
care about duplicates and cannot get rid of them afterwards by a deduplication
job, you may have a look at urlnormalizer-protocol and NUTCH-2447.

Best,
Sebastian

[1] https://github.com/google/guava/wiki/InternetDomainNameExplained

On 02/21/2018 10:44 AM, Semyon Semyonov wrote:
> Thanks Yossi, Markus,
>
> I have an issue with the db.ignore.external.links.mode=byDomain solution.
>
> I crawl specific hosts only therefore I have a finite number of hosts to
> crawl.
> Let's say, www.somewebsite.com
>
> I want to stay limited with this host. In other words, neither
> www.art.somewebsite.com nor www.sport.somewebsite.com.
> That's why db.ignore.external.links.mode=byHost and db.ignore.external =
> true (no external websites).
>
> Although, I want to get the links that seem to belong to the same
> host (www.somewebsite.com -> somewebsite.com/games, without www).
> The question is shouldn't we include it as a default behavior (or configured
> behavior) in Nutch and interpret www.somewebsite.com and somewebsite.com as
> one host?
>
>
>
> PS. For me it is not really clear how ProtocolResolver works.
>
> Semyon
>
>
>
> Sent: Tuesday, February 20, 2018 at 9:40 PM
> From: "Markus Jelsma" <[email protected]>
> To: "[email protected]" <[email protected]>
> Subject: RE: Internal links appear to be external in Parse. Improvement of
> the crawling quality
> Hello Semyon,
>
> Yossi is right, you can use the db.ignore.* set of directives to resolve the
> problem.
>
> Regarding protocol, you can use urlnormalizer-protocol to set up per host
> rules. This is, of course, a tedious job if you operate a crawl on an
> indefinite amount of hosts, so use the uncommitted ProtocolResolver for that
> to do it for you.
>
> See: https://issues.apache.org/jira/browse/NUTCH-2247
>
> If i remember it tomorrow afternoon, i can probably schedule some time to
> work on it the coming seven days or so, and commit.
>
> Regards,
> Markus
>
> -----Original message-----
>> From:Yossi Tamari <[email protected]>
>> Sent: Tuesday 20th February 2018 21:06
>> To: [email protected]
>> Subject: RE: Internal links appear to be external in Parse. Improvement of
>> the crawling quality
>>
>> Hi Semyon,
>>
>> Wouldn't setting db.ignore.external.links.mode=byDomain solve your wincs.be
>> issue?
>> As far as I can see the protocol (HTTP/HTTPS) does not play any part in the
>> decision if this is the same domain.
>>
>> Yossi.
>>
>>> -----Original Message-----
>>> From: Semyon Semyonov [mailto:[email protected]]
>>> Sent: 20 February 2018 20:43
>>> To: usernutch.apache.org <[email protected]>
>>> Subject: Internal links appear to be external in Parse. Improvement of the
>>> crawling quality
>>>
>>> Dear All,
>>>
>>> I'm trying to increase quality of the crawling. A part of my database has
>>> DB_FETCHED = 1.
>>>
>>> Example, http://www.wincs.be/ in seed list.
>>>
>>> The root of the problem is in Parse/ParseOutputFormat.java line 364 - 374
>>>
>>> Nutch considers one of the
>>> link(http://wincs.be/lakindustrie.html)
>>> as external
>>> and therefore reject it.
>>>
>>>
>>> If I insert http://wincs.be in seed file, everything works
>>> fine.
>>>
>>> Do you think it is a good behavior? I mean, formally it is indeed two
>>> different
>>> domains, but from user perspective it is exactly the same.
>>>
>>> And if it is a default behavior, how can I fix it for my case? The same
>>> question for
>>> similar switch http -> https etc.
>>>
>>> Thanks.
>>>
>>> Semyon.
>>
>>
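
For reference, the quick work-around from the reply at the top of this thread (have URLUtil.getDomainName(URL url) return the hostname with www. stripped) could look roughly like the sketch below. This is not Nutch code: the class WwwStrippingHost and its methods are hypothetical names used only for illustration, and how the logic would be wired into URLUtil is left open. It only shows the comparison key that would make www.somewebsite.com and somewebsite.com count as the same site, while www.art.somewebsite.com stays distinct.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of Apache Nutch. */
final class WwwStrippingHost {

  private static final String WWW_PREFIX = "www.";

  /** Lower-cases the host and removes one leading "www." if anything remains after it. */
  static String stripWww(String host) {
    String h = host.toLowerCase(Locale.ROOT);
    if (h.startsWith(WWW_PREFIX) && h.length() > WWW_PREFIX.length()) {
      return h.substring(WWW_PREFIX.length());
    }
    return h;
  }

  /** Same as stripWww, applied to the host part of a URL. */
  static String hostWithoutWww(URL url) {
    return stripWww(url.getHost());
  }

  public static void main(String[] args) throws MalformedURLException {
    String[] samples = {
      "http://www.somewebsite.com/",
      "http://somewebsite.com/games",
      "http://www.art.somewebsite.com/"
    };
    for (String s : samples) {
      URL url = URI.create(s).toURL();
      // Print each URL next to the key that would be compared
      System.out.println(s + " -> " + hostWithoutWww(url));
    }
  }
}
```

With such a key, the first two sample URLs compare equal, and the third keeps its art. subdomain, which matches the requirement of staying on a single host apart from the www. prefix.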

