Dear All, I'm trying to improve the quality of my crawl. Part of my database has DB_FETCHED = 1.
For example, http://www.wincs.be/ is in the seed list. The root of the problem is in Parse/ParseOutputFormat.java, lines 364-374: Nutch considers one of the links (http://wincs.be/lakindustrie.html) to be external and therefore rejects it. If I put http://wincs.be in the seed file instead, everything works fine.

Do you think this is good behavior? Formally these are indeed two different domains, but from the user's perspective they are exactly the same site. And if this is the default behavior, how can I change it for my case? The same question applies to similar switches such as http -> https.

Thanks,
Semyon
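
To make the mismatch concrete, here is a minimal Java sketch of the kind of host comparison that would produce this result. It is not the code from ParseOutputFormat.java; the class HostCheck and its methods are hypothetical names used only for illustration, and treating a leading "www." as insignificant is an assumption about what "the same site" should mean here.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not part of Nutch.
final class HostCheck {

    // Lower-cased host part of a URL, or "" if the URL has no host.
    static String host(String url) {
        String h = URI.create(url).getHost();
        return h == null ? "" : h.toLowerCase(Locale.ROOT);
    }

    // Assumption: a leading "www." does not change which site is meant.
    static String stripWww(String host) {
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    // Strict comparison: the host names must match exactly.
    static boolean sameHost(String a, String b) {
        return host(a).equals(host(b));
    }

    // Relaxed comparison: ignore a leading "www." on either side.
    static boolean sameHostIgnoringWww(String a, String b) {
        return stripWww(host(a)).equals(stripWww(host(b)));
    }

    public static void main(String[] args) {
        String seed = "http://www.wincs.be/";
        String link = "http://wincs.be/lakindustrie.html";
        System.out.println(sameHost(seed, link));            // false
        System.out.println(sameHostIgnoringWww(seed, link)); // true
    }
}
```

Because only the host part is compared, a change of scheme alone (http -> https on the same host) counts as the same host in this sketch.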

