Hi again,
Another issue has appeared with the introduction of the bidirectional URL
exemption filter.
Given the URLs
http://www.website.com/page1
and
http://website.com/page2
Previously, the indexer output (let's say a text file) had one
parent/host (www.website.com) with children/pages (http://www.website.com/page1,
http://www.website.com/...).
Now I have two different hosts, and therefore two different parents, in my
output. I would prefer to have the same hostname/alias for both hosts.
I checked the URL exemption filters, and they don't allow adding metadata to
the parsed data.
Therefore, two questions:
1) What is the best way to do it?
2) Should I include it in the Nutch code, or is it not needed there and I
should make a quick fix for myself?
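
To make the question concrete, here is a minimal sketch of the kind of quick
fix I have in mind. It assumes that dropping a leading "www." label is an
acceptable way to derive a shared alias; the class HostAlias and the method
hostAlias are hypothetical names and are not part of Nutch.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.util.Locale;

class HostAlias {

    // Hypothetical helper (not a Nutch API): returns the host of the URL
    // with a leading "www." label removed, so that www.website.com and
    // website.com map to the same alias.
    static String hostAlias(String url) {
        String host;
        try {
            // java.net.URL.getHost() is the call the thread says Nutch relies on.
            host = URI.create(url).toURL().getHost();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
        host = host.toLowerCase(Locale.ROOT);
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    public static void main(String[] args) {
        System.out.println(hostAlias("http://www.website.com/page1")); // website.com
        System.out.println(hostAlias("http://website.com/page2"));     // website.com
    }
}
```

With such an alias used as the parent key, both pages would end up under the
same parent in the output.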
Semyon.
Sent: Tuesday, March 06, 2018 at 11:08 AM
From: "Sebastian Nagel"
To: user@nutch.apache.org
Subject: Re: Internal links appear to be external in Parse. Improvement of the
crawling quality
Hi Semyon,
> We apply logical AND here, which is not really reasonable here.
Until now there was only a single exemption filter, so it made no difference.
But yes, it sounds plausible to change this to an OR, i.e. to return true
as soon as one of the filters accepts/exempts the URL. Please open an issue
to change it.
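
Roughly, the two voting schemes compare as in the sketch below. It is only an
illustration: the filters are modelled as BiPredicate<String, String> instead
of the real URLExemptionFilter interface, the class ExemptionVoting and the two
sample filters are hypothetical, and the initial value of "exempted" in the AND
variant is an assumption.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.BiPredicate;

class ExemptionVoting {

    // Hypothetical stand-in for a filter that exempts only same-host links.
    static final BiPredicate<String, String> SAME_HOST =
        (from, to) -> host(from).equals(host(to));

    // Hypothetical stand-in for the new filter: hosts are considered equal
    // once a leading "www." is dropped.
    static final BiPredicate<String, String> WWW_ALIAS =
        (from, to) -> stripWww(host(from)).equals(stripWww(host(to)));

    static String host(String url) {
        String h = URI.create(url).getHost();
        return h == null ? "" : h.toLowerCase(Locale.ROOT);
    }

    static String stripWww(String host) {
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    // OR voting: the link is exempted as soon as one filter accepts it.
    static boolean isExemptedOr(List<BiPredicate<String, String>> filters,
                                String fromUrl, String toUrl) {
        for (BiPredicate<String, String> filter : filters) {
            if (filter.test(fromUrl, toUrl)) {
                return true;
            }
        }
        return false;
    }

    // AND voting, mirroring the quoted loop: every filter has to accept.
    // Assumption: "exempted" starts as true only if there is at least one filter.
    static boolean isExemptedAnd(List<BiPredicate<String, String>> filters,
                                 String fromUrl, String toUrl) {
        boolean exempted = !filters.isEmpty();
        for (int i = 0; i < filters.size() && exempted; i++) {
            exempted = filters.get(i).test(fromUrl, toUrl);
        }
        return exempted;
    }

    public static void main(String[] args) {
        List<BiPredicate<String, String>> filters = Arrays.asList(SAME_HOST, WWW_ALIAS);
        String from = "http://www.website.com";
        String to = "http://website.com/about";
        System.out.println("AND: " + isExemptedAnd(filters, from, to)); // AND: false
        System.out.println("OR:  " + isExemptedOr(filters, from, to));  // OR:  true
    }
}
```

With a single filter both variants give the same result, which is why the
difference did not show up before.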
Thanks,
Sebastian
On 03/06/2018 10:28 AM, Semyon Semyonov wrote:
> I have proposed a solution for this problem
> https://issues.apache.org/jira/browse/NUTCH-2522.
>
> The other question is how the voting mechanism of UrlExemptionFilters should work.
>
> UrlExemptionFilters.java : lines 60-65
> //An URL is exempted when all the filters accept it to pass through
> for (int i = 0; i < this.filters.length && exempted; i++) {
>   exempted = this.filters[i].filter(fromUrl, toUrl);
> }
> URLExemptionFilter
> We apply logical AND here, which is not really reasonable here.
>
> I think that if one of the filters votes to exempt a URL, we should exempt
> it, so a logical OR should be used instead.
> For example, with the new filter, links such as
> http://www.website.com -> http://website.com/about can be exempted, but the
> standard filter will not exempt them because they are from different hosts.
> With the current logic, the URL will not be exempted because of the logical
> AND.
>
>
> Any ideas?
>
>
>
>
> Sent: Wednesday, February 21, 2018 at 2:58 PM
> From: "Sebastian Nagel"
> To: user@nutch.apache.org
> Subject: Re: Internal links appear to be external in Parse. Improvement of
> the crawling quality
>> 1) Do we have a config setting that we can use already?
>
> Not out-of-the-box. But there is already an extension point for your use case
> [1]:
> the filter method takes two arguments (fromURL and toURL).
> Have a look at it, maybe you can fix it by implementing/contributing a plugin.
>
>> 2) ... It looks more like same Host problem rather ...
>
> To determine the host of a URL, Nutch uses java.net.URL.getHost() everywhere,
> which implements RFC 1738 [2]. We cannot change Java, but it would be possible
> to modify URLUtil.getDomainName(...), at least as a work-around.
>
>> 3) Where should this problem be solved? Only in ParseOutputFormat.java or
>> somewhere else as well?
>
> You may also want to fix it in FetcherThread.handleRedirect(...), which
> also affects your use case of following only internal links
> (if db.ignore.also.redirects == true).
>
> Best,
> Sebastian
>
>
> [1]
> https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/net/URLExemptionFilter.html
>
> https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.html
> [2]
> https://tools.ietf.org/html/rfc1738#section-3.1
>
>
> On 02/21/2018 01:52 PM, Semyon Semyonov wrote:
>> Hi Sebastian,
>>
>> If I
>> - modify the method URLUtil.getDomainName(URL url)
>>
>> doesn't it mean that I don't need to
>> - set db.ignore.external.links.mode=byDomain
>>
>> anymore?
>> http://www.somewebsite.com
>> becomes the same host as somewebsite.com.
>>
>>
>> To make it as generic as possible I can create an issue/pull request for
>> this, but I would like to hear your suggestion about the best way to do so.
>> 1) Do we have a config setting that we can use already?
>> 2) The domain discussion [1] is quite broad, though. In my case I cover only
>> one issue, the mapping www -> _ . It looks more like same Host problem
>> rather than the same Domain problem. What do you think about such host
>> resolution?
>> 3) Where should this problem be solved? Only in ParseOutputFormat.java or
>>