I have proposed a solution for this problem: 
https://issues.apache.org/jira/browse/NUTCH-2522.

The other question is how the voting mechanism of UrlExemptionFilters should work.

UrlExemptionFilters.java, lines 60-65:
    //An URL is exempted when all the filters accept it to pass through
    for (int i = 0; i < this.filters.length && exempted; i++) {
      exempted = this.filters[i].filter(fromUrl, toUrl);
    }

We apply logical AND here, which is not really reasonable.

I think if one of the filters votes to exempt, then we should exempt the URL, 
i.e. use logical OR instead.
For example, with the new filter, links such as http://www.website.com -> 
http://website.com/about can be exempted, but the standard filters will not 
exempt them because the URLs are from different hosts. With the current logic 
(logical AND), the URL will not be exempted.
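For illustration, here is a minimal sketch of what OR-based voting could look like. The ExemptionFilter interface and class names below are hypothetical stand-ins for this example, not Nutch's actual URLExemptionFilter classes:

```java
// Hypothetical stand-in for Nutch's URLExemptionFilter interface.
interface ExemptionFilter {
    boolean filter(String fromUrl, String toUrl);
}

public class OrVoting {

    // OR semantics: a single positive vote is enough to exempt the URL.
    // The loop stops as soon as one filter accepts the pair.
    static boolean isExempted(ExemptionFilter[] filters,
                              String fromUrl, String toUrl) {
        boolean exempted = false;
        for (int i = 0; i < filters.length && !exempted; i++) {
            exempted = filters[i].filter(fromUrl, toUrl);
        }
        return exempted;
    }

    public static void main(String[] args) {
        // A filter that always rejects, and one that ignores a "www." prefix.
        ExemptionFilter rejects = (from, to) -> false;
        ExemptionFilter wwwAware = (from, to) ->
            from.replaceFirst("//www\\.", "//")
                .equals(to.replaceFirst("//www\\.", "//"));
        ExemptionFilter[] filters = { rejects, wwwAware };

        // With AND this would be false (the first filter rejects);
        // with OR the second filter's vote is enough.
        System.out.println(isExempted(filters,
            "http://www.website.com/", "http://website.com/")); // prints "true"
    }
}
```

With AND semantics the same filter chain would reject the pair, because the first filter never votes to exempt.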


Any ideas?


Sent: Wednesday, February 21, 2018 at 2:58 PM
From: "Sebastian Nagel" <wastl.na...@googlemail.com>
To: user@nutch.apache.org
Subject: Re: Internal links appear to be external in Parse. Improvement of the 
crawling quality
> 1) Do we have a config setting that we can use already?

Not out-of-the-box. But there is already an extension point for your use case 
[1]:
the filter method takes two arguments (fromURL and toURL).
Have a look at it, maybe you can fix it by implementing/contributing a plugin.
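As a sketch of such a plugin: the class below uses the same filter(fromUrl, toUrl) signature mentioned above, but as a standalone class for illustration; a real plugin would implement Nutch's URLExemptionFilter extension point and ship with a plugin descriptor. It exempts a link when the two hosts differ only by a leading "www.":

```java
import java.net.MalformedURLException;
import java.net.URL;

// Illustrative sketch, not a real Nutch plugin: treats www.example.com
// and example.com as the same site for exemption purposes.
public class WwwExemptionFilter {

    public boolean filter(String fromUrl, String toUrl) {
        try {
            String fromHost = stripWww(new URL(fromUrl).getHost());
            String toHost = stripWww(new URL(toUrl).getHost());
            return fromHost.equalsIgnoreCase(toHost);
        } catch (MalformedURLException e) {
            return false; // unparsable URLs are never exempted
        }
    }

    private static String stripWww(String host) {
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    public static void main(String[] args) {
        WwwExemptionFilter f = new WwwExemptionFilter();
        System.out.println(
            f.filter("http://www.website.com/", "http://website.com/about"));
        // prints "true"
    }
}
```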

> 2) ... It looks more like same Host problem rather ...

To determine the host of a URL Nutch uses everywhere java.net.URL.getHost()
which implements RFC 1738 [2]. We cannot change Java but it would be possible
to modify URLUtil.getDomainName(...), at least, as a work-around.

> 3) Where this problem should be solved? Only in ParseOutputFormat.java or 
> somewhere else as well?

You may also want to fix it in FetcherThread.handleRedirect(...), which also
affects your use case of following only internal links (if
db.ignore.also.redirects == true).

Best,
Sebastian


[1] 
https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/net/URLExemptionFilter.html

https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.html
[2] 
https://tools.ietf.org/html/rfc1738#section-3.1


On 02/21/2018 01:52 PM, Semyon Semyonov wrote:
> Hi Sebastian,
>
> If I
> - modify the method URLUtil.getDomainName(URL url)
>
> doesn't it mean that I don't need
>  - set db.ignore.external.links.mode=byDomain
>
> anymore? http://www.somewebsite.com becomes the same host as somewebsite.com.
>
>
> To make it as generic as possible I can create an issue/pull request for 
> this, but I would like to hear your suggestion about the best way to do so.
> 1) Do we have a config setting that we can use already?
> 2) The domain discussion[1] is quite wide though. In my case I cover only one 
> issue with the mapping www -> _ . It looks more like same Host problem rather 
> than the same Domain problem. What do you think about such host resolution?
> 3) Where this problem should be solved? Only in ParseOutputFormat.java or 
> somewhere else as well?
>
> Semyon.
>
>
>  
>
> Sent: Wednesday, February 21, 2018 at 11:51 AM
> From: "Sebastian Nagel" <wastl.na...@googlemail.com>
> To: user@nutch.apache.org
> Subject: Re: Internal links appear to be external in Parse. Improvement of 
> the crawling quality
> Hi Semyon,
>
>> interpret www.somewebsite.com and somewebsite.com as one host?
>
> Yes, that's a common problem. More because of external links which must
> include the host name - well-designed sites would use relative links
> for internal same-host links.
>
> For a quick work-around:
> - set db.ignore.external.links.mode=byDomain
> - modify the method URLUtil.getDomainName(URL url)
> so that it returns the hostname with www. stripped
>
> For a final solution we could make it configurable
> which method or class is called. Since the definition of "domain"
> is somewhat debatable [1], we could even provide alternative
> implementations.
>
>> PS. For me it is not really clear how ProtocolResolver works.
>
> It's only a heuristic to avoid duplicates by protocol (http and https).
> If you care about duplicates and cannot get rid of them afterwards by a
> deduplication job, you may have a look at urlnormalizer-protocol and NUTCH-2447.
>
> Best,
> Sebastian
>
>
> [1] https://github.com/google/guava/wiki/InternetDomainNameExplained
>
> On 02/21/2018 10:44 AM, Semyon Semyonov wrote:
>> Thanks Yossi, Markus,
>>
>> I have an issue with the db.ignore.external.links.mode=byDomain solution.
>>
>> I crawl specific hosts only therefore I have a finite number of hosts to 
>> crawl.
>> Let's say, www.somewebsite.com.
>>
>> I want to stay limited to this host. In other words, neither
>> www.art.somewebsite.com nor www.sport.somewebsite.com.
>> That's why db.ignore.external.links.mode=byHost and db.ignore.external =
>> true (no external websites).
>>
>> However, I want to get the links that seem to belong to the same host
>> (www.somewebsite.com -> somewebsite.com/games, without www).
>> The question is: shouldn't we include it as a default (or configurable)
>> behavior in Nutch and interpret www.somewebsite.com and somewebsite.com as
>> one host?
>>
>>
>>
>> PS. For me it is not really clear how ProtocolResolver works.
>>
>> Semyon
>>
>>
>>  
>>
>> Sent: Tuesday, February 20, 2018 at 9:40 PM
>> From: "Markus Jelsma" <markus.jel...@openindex.io>
>> To: "user@nutch.apache.org" <user@nutch.apache.org>
>> Subject: RE: Internal links appear to be external in Parse. Improvement of 
>> the crawling quality
>> Hello Semyon,
>>
>> Yossi is right, you can use the db.ignore.* set of directives to resolve the 
>> problem.
>>
>> Regarding protocol, you can use urlnormalizer-protocol to set up per-host
>> rules. This is, of course, a tedious job if you operate a crawl on an
>> indefinite number of hosts, so use the uncommitted ProtocolResolver to do it
>> for you.
>>
>> See: https://issues.apache.org/jira/browse/NUTCH-2247
>>
>> If I remember it tomorrow afternoon, I can probably schedule some time to
>> work on it in the coming seven days or so, and commit.
>>
>> Regards,
>> Markus
>>
>> -----Original message-----
>>> From:Yossi Tamari <yossi.tam...@pipl.com>
>>> Sent: Tuesday 20th February 2018 21:06
>>> To: user@nutch.apache.org
>>> Subject: RE: Internal links appear to be external in Parse. Improvement of 
>>> the crawling quality
>>>
>>> Hi Semyon,
>>>
>>> Wouldn't setting db.ignore.external.links.mode=byDomain solve your wincs.be 
>>> issue?
>>> As far as I can see, the protocol (HTTP/HTTPS) does not play any part in
>>> the decision of whether this is the same domain.
>>>
>>> Yossi.
>>>
>>>> -----Original Message-----
>>>> From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
>>>> Sent: 20 February 2018 20:43
>>>> To: user@nutch.apache.org
>>>> Subject: Internal links appear to be external in Parse. Improvement of the
>>>> crawling quality
>>>>
>>>> Dear All,
>>>>
>>>> I'm trying to increase the quality of the crawling. A part of my database
>>>> has DB_FETCHED = 1.
>>>>
>>>> Example: http://www.wincs.be/ in the seed list.
>>>>
>>>> The root of the problem is in Parse/ParseOutputFormat.java, lines 364-374.
>>>>
>>>> Nutch considers one of the links (http://wincs.be/lakindustrie.html) as
>>>> external and therefore rejects it.
>>>>
>>>>
>>>> If I insert http://wincs.be in the seed file, everything works fine.
>>>>
>>>> Do you think this is good behavior? I mean, formally these are indeed two
>>>> different domains, but from a user perspective they are exactly the same.
>>>>
>>>> And if this is the default behavior, how can I fix it for my case? The
>>>> same question applies to similar switches, such as http -> https.
>>>>
>>>> Thanks.
>>>>
>>>> Semyon.
>>>
>>>
>  
>