> 1) Do we have a config setting that we can use already?

Not out-of-the-box. But there is already an extension point for your use case 
[1]:
the filter method takes two arguments (fromURL and toURL).
Have a look at it, maybe you can fix it by implementing/contributing a plugin.
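
For illustration, here is a minimal, self-contained sketch of the check such a
plugin could perform. The class name is hypothetical; a real plugin would
implement org.apache.nutch.net.URLExemptionFilter and be enabled via
plugin.includes, with this logic inside its filter(fromUrl, toUrl) method.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Stand-in for an exemption-filter plugin: treat "www.example.com" and
// "example.com" as the same host, so such links are not rejected as
// external. Class and method placement are illustrative only.
public class SameHostExemption {

  // Exempt a link if source and target share the same host once an
  // optional leading "www." is ignored.
  public static boolean filter(String fromUrl, String toUrl) {
    try {
      return stripWww(new URL(fromUrl).getHost())
          .equalsIgnoreCase(stripWww(new URL(toUrl).getHost()));
    } catch (MalformedURLException e) {
      return false; // malformed URLs are never exempted
    }
  }

  private static String stripWww(String host) {
    return host.regionMatches(true, 0, "www.", 0, 4) ? host.substring(4) : host;
  }
}
```

Note that the protocol is ignored here on purpose, so http -> https links
between the same host would be exempted as well.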

> 2) ... It looks more like same Host problem rather ...

To determine the host of a URL, Nutch consistently uses java.net.URL.getHost(),
which implements RFC 1738 [2]. We cannot change Java, but it would be possible
to modify URLUtil.getDomainName(...), at least as a work-around.
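
A stand-alone sketch of that work-around (in Nutch itself this would be a local
change to URLUtil.getDomainName(URL), effective together with
db.ignore.external.links.mode=byDomain; the class name here is hypothetical):

```java
import java.net.MalformedURLException;
import java.net.URL;

// Work-around sketch: derive the "domain" name by taking the host and
// stripping a leading "www.". Subdomains other than "www" still count
// as different, so art.somewebsite.com stays separate.
public class DomainWorkaround {

  public static String getDomainName(URL url) {
    String host = url.getHost().toLowerCase();
    return host.startsWith("www.") ? host.substring(4) : host;
  }

  public static String getDomainName(String url) throws MalformedURLException {
    return getDomainName(new URL(url));
  }
}
```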

> 3) Where this problem should be solved? Only in ParseOutputFormat.java or 
> somewhere else as well?

You may also want to fix it in FetcherThread.handleRedirect(...), which also
affects your use case of following only internal links
(if db.ignore.also.redirects == true).
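
A sketch of the decision such a fix might implement there (class and method
names are hypothetical, not Nutch's actual handleRedirect signature):

```java
import java.net.MalformedURLException;
import java.net.URL;

// Illustrative redirect policy: when external redirects are ignored
// (db.ignore.also.redirects == true), follow a redirect only if the
// target stays on the same host, ignoring a leading "www.".
public class RedirectPolicy {

  public static boolean followRedirect(String fromUrl, String redirUrl,
      boolean ignoreExternalRedirects) {
    if (!ignoreExternalRedirects) {
      return true; // external redirects are allowed
    }
    try {
      String from = stripWww(new URL(fromUrl).getHost());
      String to = stripWww(new URL(redirUrl).getHost());
      return from.equalsIgnoreCase(to);
    } catch (MalformedURLException e) {
      return false; // unparsable target: treat as external
    }
  }

  private static String stripWww(String host) {
    return host.regionMatches(true, 0, "www.", 0, 4) ? host.substring(4) : host;
  }
}
```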

Best,
Sebastian


[1] 
https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/net/URLExemptionFilter.html

https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.html
[2] https://tools.ietf.org/html/rfc1738#section-3.1


On 02/21/2018 01:52 PM, Semyon Semyonov wrote:
> Hi Sabastian,
> 
> If I
> - modify the method URLUtil.getDomainName(URL url)
> 
> doesn't it mean that I don't need 
>  - set db.ignore.external.links.mode=byDomain
> 
> anymore? http://www.somewebsite.com becomes the same host as somewebsite.com.
> 
> 
> To make it as generic as possible I can create an issue/pull request for 
> this, but I would like to hear your suggestion about the best way to do so.
> 1) Do we have a config setting that we can use already?
> 2) The domain discussion [1] is quite wide though. In my case I cover only one 
> issue, the mapping www -> _ . It looks more like a same-Host problem rather 
> than a same-Domain problem. What do you think about such host resolution?
> 3) Where this problem should be solved? Only in ParseOutputFormat.java or 
> somewhere else as well?
> 
> Semyon.
> 
> 
>  
> 
> Sent: Wednesday, February 21, 2018 at 11:51 AM
> From: "Sebastian Nagel" <[email protected]>
> To: [email protected]
> Subject: Re: Internal links appear to be external in Parse. Improvement of 
> the crawling quality
> Hi Semyon,
> 
>> interpret www.somewebsite.com and somewebsite.com as one host?
> 
> Yes, that's a common problem - mostly because of absolute links, which must
> include the host name; well-designed sites would use relative links
> for internal same-host links.
> 
> For a quick work-around:
> - set db.ignore.external.links.mode=byDomain
> - modify the method URLUtil.getDomainName(URL url)
> so that it returns the hostname with www. stripped
> 
> For a final solution we could make it configurable
> which method or class is called. Since the definition of "domain"
> is somewhat debatable [1], we could even provide alternative
> implementations.
> 
>> PS. For me it is not really clear how ProtocolResolver works.
> 
> It's only a heuristic to avoid duplicates by protocol (http and https).
> If you care about duplicates and cannot get rid of them afterwards with a
> deduplication job, you may have a look at urlnormalizer-protocol and NUTCH-2447.
> 
> Best,
> Sebastian
> 
> 
> [1] https://github.com/google/guava/wiki/InternetDomainNameExplained
> 
> On 02/21/2018 10:44 AM, Semyon Semyonov wrote:
>> Thanks Yossi, Markus,
>>
>> I have an issue with the db.ignore.external.links.mode=byDomain solution.
>>
>> I crawl specific hosts only therefore I have a finite number of hosts to 
>> crawl.
>> Let's say, www.somewebsite.com
>>
>> I want to stay limited to this host. In other words, neither 
>> www.art.somewebsite.com nor www.sport.somewebsite.com.
>> That's why db.ignore.external.links.mode=byHost and db.ignore.external = 
>> true (no external websites).
>>
>> However, I want to get the links that seem to belong to the same 
>> host (www.somewebsite.com -> somewebsite.com/games, without www).
>> The question is: shouldn't we include it as a default behavior (or configurable 
>> behavior) in Nutch and interpret www.somewebsite.com and somewebsite.com as one 
>> host?
>>
>>
>>
>> PS. For me it is not really clear how ProtocolResolver works.
>>
>> Semyon
>>
>>
>>  
>>
>> Sent: Tuesday, February 20, 2018 at 9:40 PM
>> From: "Markus Jelsma" <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Subject: RE: Internal links appear to be external in Parse. Improvement of 
>> the crawling quality
>> Hello Semyon,
>>
>> Yossi is right, you can use the db.ignore.* set of directives to resolve the 
>> problem.
>>
>> Regarding protocol, you can use urlnormalizer-protocol to set up per-host 
>> rules. This is, of course, a tedious job if you operate a crawl on an 
>> indefinite number of hosts, so use the uncommitted ProtocolResolver to do 
>> it for you.
>>
>> See: https://issues.apache.org/jira/browse/NUTCH-2247
>>
>> If I remember it tomorrow afternoon, I can probably schedule some time to 
>> work on it in the coming seven days or so, and commit.
>>
>> Regards,
>> Markus
>>
>> -----Original message-----
>>> From:Yossi Tamari <[email protected]>
>>> Sent: Tuesday 20th February 2018 21:06
>>> To: [email protected]
>>> Subject: RE: Internal links appear to be external in Parse. Improvement of 
>>> the crawling quality
>>>
>>> Hi Semyon,
>>>
>>> Wouldn't setting db.ignore.external.links.mode=byDomain solve your wincs.be 
>>> issue?
>>> As far as I can see, the protocol (HTTP/HTTPS) does not play any part in the 
>>> decision whether this is the same domain.
>>>
>>> Yossi.
>>>
>>>> -----Original Message-----
>>>> From: Semyon Semyonov [mailto:[email protected]]
>>>> Sent: 20 February 2018 20:43
>>>> To: usernutch.apache.org <[email protected]>
>>>> Subject: Internal links appear to be external in Parse. Improvement of the
>>>> crawling quality
>>>>
>>>> Dear All,
>>>>
>>>> I'm trying to increase quality of the crawling. A part of my database has
>>>> DB_FETCHED = 1.
>>>>
>>>> Example: http://www.wincs.be/ in the seed list.
>>>>
>>>> The root of the problem is in Parse/ParseOutputFormat.java, lines 364-374.
>>>>
>>>> Nutch considers one of the links (http://wincs.be/lakindustrie.html) as 
>>>> external and therefore rejects it.
>>>>
>>>>
>>>> If I insert http://wincs.be in the seed file, everything works fine.
>>>>
>>>> Do you think this is good behavior? I mean, formally these are indeed two 
>>>> different domains, but from the user's perspective they are exactly the same.
>>>>
>>>> And if it is the default behavior, how can I fix it for my case? The same 
>>>> question for similar switches like http -> https, etc.
>>>>
>>>> Thanks.
>>>>
>>>> Semyon.
>>>
>>>
>  
> 
