Hi again,

Another issue has appeared with the introduction of the bidirectional URL 
exemption filter.

Having 
http://www.website.com/page1
and
http://website.com/page2

Before, as indexer output (let's say a text file), I had one 
parent/host (www.website.com) with children/pages (http://www.website.com/page1, 
http://www.website.com/...).
Now I have two different hosts and therefore two different parents in my 
output. I would prefer to have the same hostname/alias for both hosts.
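
For illustration, here is a minimal sketch of what I mean (a hypothetical helper, not existing Nutch code): strip a leading "www." before using the host as the parent key, so both pages group under one alias:

```java
// Hypothetical helper, not part of Nutch: normalize a host name so that
// "www.website.com" and "website.com" map to the same parent/alias.
class HostAlias {
    // Treat a leading "www." as insignificant for grouping purposes.
    static String aliasFor(String host) {
        return host.startsWith("www.") ? host.substring(4) : host;
    }
}
```

With something like this, both http://www.website.com/page1 and http://website.com/page2 would fall under the parent alias website.com.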

I checked the URL exemption filters and they don't allow adding metadata to the 
parsed data.

Therefore, two questions:
1) What is the best way to do it?
2) Should I include it in the Nutch code, or is it not generally needed and I 
should make a quick fix for myself?

Semyon.
 

Sent: Tuesday, March 06, 2018 at 11:08 AM
From: "Sebastian Nagel" <wastl.na...@googlemail.com>
To: user@nutch.apache.org
Subject: Re: Internal links appear to be external in Parse. Improvement of the 
crawling quality
Hi Semyon,

> We apply logical AND here, which is not really reasonable here.

Until now there was only a single exemption filter, so it made no difference.
But yes, it sounds plausible to change this to an OR, i.e. return true
as soon as one of the filters accepts/exempts the URL. Please open an issue
to change it.
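
Sketched with a stand-in interface (not the actual Nutch classes; the real type is org.apache.nutch.net.URLExemptionFilter), the OR semantics would be:

```java
// Stand-in interface to illustrate the voting change only.
interface ExemptionFilter {
    boolean filter(String fromUrl, String toUrl);
}

class ExemptionVoting {
    // Logical OR: exempt as soon as any single filter accepts the pair,
    // instead of requiring every filter to accept (the current AND).
    static boolean isExempted(ExemptionFilter[] filters, String fromUrl, String toUrl) {
        for (ExemptionFilter f : filters) {
            if (f.filter(fromUrl, toUrl)) {
                return true;
            }
        }
        return false;
    }
}
```

That way a single exempting filter is sufficient, and adding a new filter can only widen, never narrow, the set of exempted links.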

Thanks,
Sebastian

On 03/06/2018 10:28 AM, Semyon Semyonov wrote:
> I have proposed a solution for this problem 
> https://issues.apache.org/jira/browse/NUTCH-2522.
>
> The other question is how the voting mechanism of UrlExemptionFilters should work.
>
> UrlExemptionFilters.java, lines 60-65:
>
> // A URL is exempted when all the filters accept it to pass through
> for (int i = 0; i < this.filters.length && exempted; i++) {
>     exempted = this.filters[i].filter(fromUrl, toUrl);
> }
>
> We apply logical AND here, which is not really reasonable.
>
> I think if one of the filters votes to exempt, then we should exempt it; 
> therefore logical OR instead.
> For example, with the new filter, links such as 
> http://www.website.com -> http://website.com/about can be exempted, but the 
> standard filter will not exempt them because they are from different hosts. 
> With the current logic, the URL will not be exempted because of the logical AND.
>
>
> Any ideas?
>
>  
>  
>
> Sent: Wednesday, February 21, 2018 at 2:58 PM
> From: "Sebastian Nagel" <wastl.na...@googlemail.com>
> To: user@nutch.apache.org
> Subject: Re: Internal links appear to be external in Parse. Improvement of 
> the crawling quality
>> 1) Do we have a config setting that we can use already?
>
> Not out-of-the-box. But there is already an extension point for your use case 
> [1]:
> the filter method takes two arguments (fromURL and toURL).
> Have a look at it, maybe you can fix it by implementing/contributing a plugin.
>
>> 2) ... It looks more like same Host problem rather ...
>
> To determine the host of a URL Nutch uses everywhere java.net.URL.getHost()
> which implements RFC 1738 [2]. We cannot change Java but it would be possible
> to modify URLUtil.getDomainName(...), at least, as a work-around.
>
>> 3) Where this problem should be solved? Only in ParseOutputFormat.java or 
>> somewhere else as well?
>
> You may also want to fix it in FetcherThread.handleRedirect(...), which also 
> affects your use case
> of following only internal links (if db.ignore.also.redirects == true).
>
> Best,
> Sebastian
>
>
> [1] 
> https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/net/URLExemptionFilter.html
>
> https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.html
> [2] 
> https://tools.ietf.org/html/rfc1738#section-3.1
>
>
> On 02/21/2018 01:52 PM, Semyon Semyonov wrote:
>> Hi Sebastian,
>>
>> If I
>> - modify the method URLUtil.getDomainName(URL url)
>>
>> doesn't it mean that I don't need
>>  - set db.ignore.external.links.mode=byDomain
>>
>> anymore? http://www.somewebsite.com becomes the same host as 
>> somewebsite.com.
>>
>>
>> To make it as generic as possible I can create an issue/pull request for 
>> this, but I would like to hear your suggestion about the best way to do so.
>> 1) Do we have a config setting that we can use already?
>> 2) The domain discussion [1] is quite wide, though. In my case I cover only 
>> one issue, the mapping www -> _ . It looks more like a same-host problem 
>> rather than a same-domain problem. What do you think about such host 
>> resolution?
>> 3) Where this problem should be solved? Only in ParseOutputFormat.java or 
>> somewhere else as well?
>>
>> Semyon.
>>
>>
>>  
>>
>> Sent: Wednesday, February 21, 2018 at 11:51 AM
>> From: "Sebastian Nagel" <wastl.na...@googlemail.com>
>> To: user@nutch.apache.org
>> Subject: Re: Internal links appear to be external in Parse. Improvement of 
>> the crawling quality
>> Hi Semyon,
>>
>>> interpret www.somewebsite.com and somewebsite.com as one host?
>>
>> Yes, that's a common problem, mostly because of external links, which must
>> include the host name; well-designed sites would use relative links
>> for internal same-host links.
>>
>> For a quick work-around:
>> - set db.ignore.external.links.mode=byDomain
>> - modify the method URLUtil.getDomainName(URL url)
>> so that it returns the hostname with www. stripped
>>
>> For a final solution we could make it configurable
>> which method or class is called. Since the definition of "domain"
>> is somewhat debatable [1], we could even provide alternative
>> implementations.
>>
>>> PS. For me it is not really clear how ProtocolResolver works.
>>
>> It's only a heuristic to avoid duplicates by protocol (http and https).
>> If you care about duplicates and cannot get rid of them afterwards by a 
>> deduplication job,
>> you may have a look at urlnormalizer-protocol and NUTCH-2447.
>>
>> Best,
>> Sebastian
>>
>>
>> [1] 
>> https://github.com/google/guava/wiki/InternetDomainNameExplained
>>
>> On 02/21/2018 10:44 AM, Semyon Semyonov wrote:
>>> Thanks Yossi, Markus,
>>>
>>> I have an issue with the db.ignore.external.links.mode=byDomain solution.
>>>
>>> I crawl specific hosts only therefore I have a finite number of hosts to 
>>> crawl.
>>> Let's say, www.somewebsite.com
>>>
>>> I want to stay limited to this host. In other words, neither 
>>> www.art.somewebsite.com nor www.sport.somewebsite.com.
>>> That's why db.ignore.external.links.mode=byHost and db.ignore.external = 
>>> true (no external websites).
>>>
>>> Although, I want to get the links that seem to belong to the same 
>>> host (www.somewebsite.com -> somewebsite.com/games, without www).
>>> The question is: shouldn't we include it as a default behavior (or 
>>> configurable behavior) in Nutch and interpret 
>>> www.somewebsite.com and somewebsite.com as one host?
>>>
>>>
>>>
>>> PS. For me it is not really clear how ProtocolResolver works.
>>>
>>> Semyon
>>>
>>>
>>>  
>>>
>>> Sent: Tuesday, February 20, 2018 at 9:40 PM
>>> From: "Markus Jelsma" <markus.jel...@openindex.io>
>>> To: "user@nutch.apache.org" <user@nutch.apache.org>
>>> Subject: RE: Internal links appear to be external in Parse. Improvement of 
>>> the crawling quality
>>> Hello Semyon,
>>>
>>> Yossi is right, you can use the db.ignore.* set of directives to resolve 
>>> the problem.
>>>
>>> Regarding protocol, you can use urlnormalizer-protocol to set up per-host 
>>> rules. This is, of course, a tedious job if you operate a crawl on an 
>>> indefinite number of hosts, so use the uncommitted ProtocolResolver 
>>> to do it for you.
>>>
>>> See: https://issues.apache.org/jira/browse/NUTCH-2247
>>>
>>> If I remember it tomorrow afternoon, I can probably schedule some time to 
>>> work on it in the coming seven days or so, and commit.
>>>
>>> Regards,
>>> Markus
>>>
>>> -----Original message-----
>>>> From:Yossi Tamari <yossi.tam...@pipl.com>
>>>> Sent: Tuesday 20th February 2018 21:06
>>>> To: user@nutch.apache.org
>>>> Subject: RE: Internal links appear to be external in Parse. Improvement of 
>>>> the crawling quality
>>>>
>>>> Hi Semyon,
>>>>
>>>> Wouldn't setting db.ignore.external.links.mode=byDomain solve your 
>>>> wincs.be issue?
>>>> As far as I can see, the protocol (HTTP/HTTPS) does not play any part in 
>>>> deciding whether this is the same domain.
>>>>
>>>> Yossi.
>>>>
>>>>> -----Original Message-----
>>>>> From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
>>>>> Sent: 20 February 2018 20:43
>>>>> To: user@nutch.apache.org
>>>>> Subject: Internal links appear to be external in Parse. Improvement of the
>>>>> crawling quality
>>>>>
>>>>> Dear All,
>>>>>
>>>>> I'm trying to increase the quality of the crawling. A part of my database has
>>>>> DB_FETCHED = 1.
>>>>>
>>>>> Example, http://www.wincs.be/ in the seed list.
>>>>>
>>>>> The root of the problem is in Parse/ParseOutputFormat.java, lines 364-374.
>>>>>
>>>>> Nutch considers one of the links (http://wincs.be/lakindustrie.html) 
>>>>> as external and therefore rejects it.
>>>>>
>>>>>
>>>>> If I insert http://wincs.be in the seed file, everything works fine.
>>>>>
>>>>> Do you think this is good behavior? I mean, formally these are indeed two 
>>>>> different domains, but from the user's perspective they are exactly the same.
>>>>>
>>>>> And if this is the default behavior, how can I fix it for my case? The same 
>>>>> question applies to similar switches, http -> https etc.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Semyon.
>>>>
>>>>
>>  
>>
>  
>