Hello Semyon,

Yossi is right: you can use the db.ignore.* set of directives to resolve the problem.
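
For reference, a minimal sketch of what this could look like in conf/nutch-site.xml, assuming the goal is to keep the crawl inside the seed domains while treating www.wincs.be and wincs.be as the same site. The property names are the db.ignore.* directives discussed in this thread; the values are an assumption about what fits this particular crawl, not a recommendation for every setup.

```xml
<!-- conf/nutch-site.xml (fragment; place inside the <configuration> element) -->

<!-- Assumption: external outlinks should be dropped for this crawl. -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>

<!-- Decide "external" by domain instead of by full host name,
     so www.wincs.be and wincs.be count as the same site. -->
<property>
  <name>db.ignore.external.links.mode</name>
  <value>byDomain</value>
</property>
```

With byDomain, the comparison is made on the domain rather than the host, so http://wincs.be/lakindustrie.html found on http://www.wincs.be/ should no longer be rejected as external.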

Regarding the protocol, you can use urlnormalizer-protocol to set up per-host rules. This is, of course, a tedious job if you crawl an indefinite number of hosts, so use the as-yet uncommitted ProtocolResolver to do it for you. See:
https://issues.apache.org/jira/browse/NUTCH-2247

If I remember it tomorrow afternoon, I can probably schedule some time to work on it in the coming seven days or so, and commit.

Regards,
Markus

-----Original message-----
> From: Yossi Tamari <[email protected]>
> Sent: Tuesday 20th February 2018 21:06
> To: [email protected]
> Subject: RE: Internal links appear to be external in Parse. Improvement of the crawling quality
>
> Hi Semyon,
>
> Wouldn't setting db.ignore.external.links.mode=byDomain solve your wincs.be issue?
> As far as I can see the protocol (HTTP/HTTPS) does not play any part in the decision if this is the same domain.
>
> Yossi.
>
> > -----Original Message-----
> > From: Semyon Semyonov [mailto:[email protected]]
> > Sent: 20 February 2018 20:43
> > To: usernutch.apache.org <[email protected]>
> > Subject: Internal links appear to be external in Parse. Improvement of the crawling quality
> >
> > Dear All,
> >
> > I'm trying to increase quality of the crawling. A part of my database has DB_FETCHED = 1.
> >
> > Example, http://www.wincs.be/ in seed list.
> >
> > The root of the problem is in Parse/ParseOutputFormat.java line 364 - 374
> >
> > Nutch considers one of the link(http://wincs.be/lakindustrie.html) as external and therefore reject it.
> >
> > If I insert http://wincs.be in seed file, everything works fine.
> >
> > Do you think it is a good behavior? I mean, formally it is indeed two different domains, but from user perspective it is exactly the same.
> >
> > And if it is a default behavior, how can I fix it for my case? The same question for similar switch http -> https etc.
> >
> > Thanks.
> >
> > Semyon.

