Dear All,

I'm trying to improve the quality of my crawl. Part of my database has 
DB_FETCHED = 1.

For example, take http://www.wincs.be/ in the seed list.

The root of the problem is in Parse/ParseOutputFormat.java, lines 364-374.

Nutch considers one of the links (http://wincs.be/lakindustrie.html) to be 
external and therefore rejects it. 
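
To illustrate what I believe is happening, here is a minimal sketch of a 
host-based check. It is not the actual Nutch code: the class and method names 
are hypothetical, and it assumes the check simply compares the host of the 
source page with the host of the outlink. It also shows a more lenient 
variant that ignores a leading "www.".

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical illustration only; not taken from ParseOutputFormat.java.
final class HostCheckSketch {

    // Returns the lower-cased host part of a URL, e.g. "www.wincs.be".
    static String host(String url) {
        return URI.create(url).getHost().toLowerCase(Locale.ROOT);
    }

    // Strict comparison: "www.wincs.be" and "wincs.be" count as different.
    static boolean sameHost(String fromUrl, String toUrl) {
        return host(fromUrl).equals(host(toUrl));
    }

    // Removes a leading "www." if present.
    static String stripWww(String host) {
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    // Lenient comparison: a leading "www." is ignored on both sides.
    static boolean sameHostIgnoringWww(String fromUrl, String toUrl) {
        return stripWww(host(fromUrl)).equals(stripWww(host(toUrl)));
    }

    public static void main(String[] args) {
        String seed = "http://www.wincs.be/";
        String link = "http://wincs.be/lakindustrie.html";
        System.out.println(sameHost(seed, link));            // false
        System.out.println(sameHostIgnoringWww(seed, link)); // true
    }
}
```

With the strict comparison the link is treated as external; with the lenient 
one it would be kept.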


If I put http://wincs.be in the seed file instead, everything works fine.

Do you think this is good behavior? I mean, formally these are indeed two 
different domain names, but from the user's perspective they are exactly the same.

And if this is the default behavior, how can I fix it for my case? The same 
question applies to similar switches, such as http -> https. 

Thanks.

Semyon.
