Re: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-03-20 Thread Semyon Semyonov
I found out that there is no direct way to do it; the problem was solved by 
calling the regex transformation one more time in IndexerMapReduce, 
before the Indexer gets the Doc for writing.

Something like (IndexerMapReduce.java, line 369):
 doc.add("modifiedId",
   URLUtil.getHost(BidirectionalUrlExemptionFilter.transform(key.toString())));
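Spelled out, the idea looks roughly like this. The regex below is a stand-in for whatever the bidirectional exemption filter's transformation actually does, and `modifiedId` / the class names follow the snippet above; none of this is stock Nutch, it is only a sketch of the grouping logic:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ModifiedIdSketch {
    // Stand-in for the exemption filter's regex transformation:
    // canonicalize a URL to its no-"www." form.
    static String transform(String url) {
        return url.replaceFirst("^(https?://)www\\.", "$1");
    }

    // Host the document would be grouped under after the transformation,
    // i.e. the value written into the hypothetical "modifiedId" field.
    static String modifiedId(String key) throws MalformedURLException {
        return new URL(transform(key)).getHost();
    }

    public static void main(String[] args) throws MalformedURLException {
        System.out.println(modifiedId("http://www.website.com/page1")); // website.com
        System.out.println(modifiedId("http://website.com/page2"));     // website.com
    }
}
```

With this, both spellings of the host collapse to one parent key at indexing time.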
 

Sent: Friday, March 16, 2018 at 7:20 PM
From: "Semyon Semyonov" 
To: user@nutch.apache.org
Subject: Re: Internal links appear to be external in Parse. Improvement of the 
crawling quality
Hi again,

Another issue has appeared with the introduction of the bidirectional URL 
exemption filter.

Having
http://www.website.com/page1
and
http://website.com/page2

Before, as the indexer output (let's say a text file) I had one 
parent/host (www.website.com) with 
children/pages (http://www.website.com/page1, 
http://www.website.com/, ...).
Now I have two different hosts and therefore two different parents for my 
output. I prefer to have the same hostname/alias for both hosts.

I checked the URL exemption filters and they don't allow adding metadata to the 
parsed data.

Therefore, two questions:
1) What is the best way to do it?
2) Should I include it in the Nutch code, or is it not needed there and I should 
make a quick fix for myself?

Semyon.
 

Sent: Tuesday, March 06, 2018 at 11:08 AM
From: "Sebastian Nagel" 
To: user@nutch.apache.org
Subject: Re: Internal links appear to be external in Parse. Improvement of the 
crawling quality
Hi Semyon,

> We apply a logical AND here, which is not really reasonable.

Until now there was only a single exemption filter, so it made no difference.
But yes, it sounds plausible to change this to an OR, i.e. return true
as soon as one of the filters accepts/exempts the URL. Please open an issue
to change it.
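A minimal stand-alone sketch of the OR-style voting (the real loop lives in UrlExemptionFilters.java; the BiPredicate stand-ins below are illustrative, not Nutch classes):

```java
import java.util.List;
import java.util.function.BiPredicate;

public class ExemptionVoting {
    // OR semantics: a link fromUrl -> toUrl is exempted as soon as
    // one filter votes to exempt it.
    static boolean isExempted(List<BiPredicate<String, String>> filters,
                              String fromUrl, String toUrl) {
        for (BiPredicate<String, String> f : filters) {
            if (f.test(fromUrl, toUrl)) {
                return true; // short-circuit: one positive vote suffices
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A filter that never exempts, and one that exempts www/no-www pairs.
        BiPredicate<String, String> never = (from, to) -> false;
        BiPredicate<String, String> wwwPair =
            (from, to) -> from.replaceFirst("//www\\.", "//")
                              .equals(to.replaceFirst("//www\\.", "//"));
        System.out.println(isExempted(List.of(never, wwwPair),
            "http://www.website.com/", "http://website.com/")); // true
    }
}
```

Under the current AND semantics the `never` filter would veto the exemption; with OR, the single positive vote carries.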

Thanks,
Sebastian

On 03/06/2018 10:28 AM, Semyon Semyonov wrote:
> I have proposed a solution for this problem 
> https://issues.apache.org/jira/browse/NUTCH-2522.
>
> The other question is how voting mechanism of UrlExemptionFilters should work.
>
> UrlExemptionFilters.java : lines 60-65
> //An URL is exempted when all the filters accept it to pass through
> for (int i = 0; i < this.filters.length && exempted; i++) {
> exempted = this.filters[i].filter(fromUrl, toUrl);
> }
> We apply a logical AND here, which is not really reasonable.
>
> I think that if one of the filters votes to exempt it, then we should exempt it, 
> hence a logical OR instead.
> For example, with the new filter, links such as 
> http://www.website.com -> http://website.com/about can be exempted, but the 
> standard filter will not exempt them because they are from different hosts. 
> With the current logic, the URL will not be exempted because of the logical AND.
>
>
> Any ideas?
>
>  
>  
>
> Sent: Wednesday, February 21, 2018 at 2:58 PM
> From: "Sebastian Nagel" 
> To: user@nutch.apache.org
> Subject: Re: Internal links appear to be external in Parse. Improvement of 
> the crawling quality
>> 1) Do we have a config setting that we can use already?
>
> Not out-of-the-box. But there is already an extension point for your use case 
> [1]:
> the filter method takes two arguments (fromURL and toURL).
> Have a look at it, maybe you can fix it by implementing/contributing a plugin.
>
>> 2) ... It looks more like same Host problem rather ...
>
> To determine the host of a URL Nutch uses everywhere java.net.URL.getHost()
> which implements RFC 1738 [2]. We cannot change Java but it would be possible
> to modify URLUtil.getDomainName(...), at least, as a work-around.
>
>> 3) Where should this problem be solved? Only in ParseOutputFormat.java or 
>> somewhere else as well?
>
> You may also want to fix it in FetcherThread.handleRedirect(...), which also 
> affects your use case of following only internal links 
> (if db.ignore.also.redirects == true).
>
> Best,
> Sebastian
>
>
> [1] 
> https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/net/URLExemptionFilter.html
>
> 

Re: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-03-16 Thread Semyon Semyonov
Hi again,

Another issue has appeared with the introduction of the bidirectional URL 
exemption filter.

Having 
http://www.website.com/page1
and
http://website.com/page2

Before, as the indexer output (let's say a text file) I had one 
parent/host (www.website.com) with children/pages (http://www.website.com/page1, 
http://www.website.com/, ...).
Now I have two different hosts and therefore two different parents for my 
output. I prefer to have the same hostname/alias for both hosts.

I checked the URL exemption filters and they don't allow adding metadata to the 
parsed data.

Therefore, two questions:
1) What is the best way to do it?
2) Should I include it in the Nutch code, or is it not needed there and I should 
make a quick fix for myself?

Semyon.
 

Sent: Tuesday, March 06, 2018 at 11:08 AM
From: "Sebastian Nagel" 
To: user@nutch.apache.org
Subject: Re: Internal links appear to be external in Parse. Improvement of the 
crawling quality
Hi Semyon,

> We apply a logical AND here, which is not really reasonable.

Until now there was only a single exemption filter, so it made no difference.
But yes, it sounds plausible to change this to an OR, i.e. return true
as soon as one of the filters accepts/exempts the URL. Please open an issue
to change it.

Thanks,
Sebastian

On 03/06/2018 10:28 AM, Semyon Semyonov wrote:
> I have proposed a solution for this problem 
> https://issues.apache.org/jira/browse/NUTCH-2522.
>
> The other question is how voting mechanism of UrlExemptionFilters should work.
>
> UrlExemptionFilters.java : lines 60-65
> //An URL is exempted when all the filters accept it to pass through
> for (int i = 0; i < this.filters.length && exempted; i++) {
> exempted = this.filters[i].filter(fromUrl, toUrl);
> }
> We apply a logical AND here, which is not really reasonable.
>
> I think that if one of the filters votes to exempt it, then we should exempt it, 
> hence a logical OR instead.
> For example, with the new filter, links such as 
> http://www.website.com -> http://website.com/about can be exempted, but the 
> standard filter will not exempt them because they are from different hosts. 
> With the current logic, the URL will not be exempted because of the logical AND.
>
>
> Any ideas?
>
>  
>  
>
> Sent: Wednesday, February 21, 2018 at 2:58 PM
> From: "Sebastian Nagel" 
> To: user@nutch.apache.org
> Subject: Re: Internal links appear to be external in Parse. Improvement of 
> the crawling quality
>> 1) Do we have a config setting that we can use already?
>
> Not out-of-the-box. But there is already an extension point for your use case 
> [1]:
> the filter method takes two arguments (fromURL and toURL).
> Have a look at it, maybe you can fix it by implementing/contributing a plugin.
>
>> 2) ... It looks more like same Host problem rather ...
>
> To determine the host of a URL Nutch uses everywhere java.net.URL.getHost()
> which implements RFC 1738 [2]. We cannot change Java but it would be possible
> to modify URLUtil.getDomainName(...), at least, as a work-around.
>
>> 3) Where should this problem be solved? Only in ParseOutputFormat.java or 
>> somewhere else as well?
>
> You may also want to fix it in FetcherThread.handleRedirect(...), which also 
> affects your use case of following only internal links 
> (if db.ignore.also.redirects == true).
>
> Best,
> Sebastian
>
>
> [1] 
> https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/net/URLExemptionFilter.html
>
> https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.html
> [2] 
> https://tools.ietf.org/html/rfc1738#section-3.1
>
>
> On 02/21/2018 01:52 PM, Semyon Semyonov wrote:
>> Hi Sebastian,
>>
>> If I
>> - modify the method URLUtil.getDomainName(URL url)
>>
>> doesn't it mean that I don't need
>>  - set db.ignore.external.links.mode=byDomain
>>
>> anymore? http://www.somewebsite.com becomes the same host as somewebsite.com.
>>
>>
>> To make it as generic as possible I can create an issue/pull request for 
>> this, but I would like to hear your suggestion about the best way to do so.
>> 1) Do we have a config setting that we can use already?
>> 2) The domain discussion [1] is quite wide, though. In my case I cover only 
>> one issue, the mapping www -> _ . It looks more like a same-Host problem 
>> rather than a same-Domain problem. What do you think about such host 
>> resolution?
>> 3) Where should this problem be solved? Only in ParseOutputFormat.java or 
>> somewhere else as well?

Re: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-03-06 Thread Sebastian Nagel
Hi Semyon,

> We apply a logical AND here, which is not really reasonable.

Until now there was only a single exemption filter, so it made no difference.
But yes, it sounds plausible to change this to an OR, i.e. return true
as soon as one of the filters accepts/exempts the URL. Please open an issue
to change it.

Thanks,
Sebastian

On 03/06/2018 10:28 AM, Semyon Semyonov wrote:
> I have proposed a solution for this problem 
> https://issues.apache.org/jira/browse/NUTCH-2522.
> 
> The other question is how voting mechanism of UrlExemptionFilters should work.
> 
> UrlExemptionFilters.java : lines 60-65
> //An URL is exempted when all the filters accept it to pass through
> for (int i = 0; i < this.filters.length && exempted; i++) {
>   exempted = this.filters[i].filter(fromUrl, toUrl);
> }
> We apply a logical AND here, which is not really reasonable.
> 
> I think that if one of the filters votes to exempt it, then we should exempt it, 
> hence a logical OR instead.
> For example, with the new filter, links such as http://www.website.com -> 
> http://website.com/about can be exempted, but the standard filter will not exempt 
> them because they are from different hosts. With the current logic, the URL will 
> not be exempted because of the logical AND.
> 
> 
> Any ideas?
> 
>  
>  
> 
> Sent: Wednesday, February 21, 2018 at 2:58 PM
> From: "Sebastian Nagel" 
> To: user@nutch.apache.org
> Subject: Re: Internal links appear to be external in Parse. Improvement of 
> the crawling quality
>> 1) Do we have a config setting that we can use already?
> 
> Not out-of-the-box. But there is already an extension point for your use case 
> [1]:
> the filter method takes two arguments (fromURL and toURL).
> Have a look at it, maybe you can fix it by implementing/contributing a plugin.
> 
>> 2) ... It looks more like same Host problem rather ...
> 
> To determine the host of a URL Nutch uses everywhere java.net.URL.getHost()
> which implements RFC 1738 [2]. We cannot change Java but it would be possible
> to modify URLUtil.getDomainName(...), at least, as a work-around.
> 
>> 3) Where should this problem be solved? Only in ParseOutputFormat.java or 
>> somewhere else as well?
> 
> You may also want to fix it in FetcherThread.handleRedirect(...), which also 
> affects your use case of following only internal links 
> (if db.ignore.also.redirects == true).
> 
> Best,
> Sebastian
> 
> 
> [1] 
> https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/net/URLExemptionFilter.html
> 
> https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.html
> [2] 
> https://tools.ietf.org/html/rfc1738#section-3.1
> 
> 
> On 02/21/2018 01:52 PM, Semyon Semyonov wrote:
>> Hi Sebastian,
>>
>> If I
>> - modify the method URLUtil.getDomainName(URL url)
>>
>> doesn't it mean that I don't need
>>  - set db.ignore.external.links.mode=byDomain
>>
>> anymore? http://www.somewebsite.com becomes the same host as somewebsite.com.
>>
>>
>> To make it as generic as possible I can create an issue/pull request for 
>> this, but I would like to hear your suggestion about the best way to do so.
>> 1) Do we have a config setting that we can use already?
>> 2) The domain discussion [1] is quite wide, though. In my case I cover only 
>> one issue, the mapping www -> _ . It looks more like a same-Host problem 
>> rather than a same-Domain problem. What do you think about such host 
>> resolution?
>> 3) Where should this problem be solved? Only in ParseOutputFormat.java or 
>> somewhere else as well?
>>
>> Semyon.
>>
>>
>>  
>>
>> Sent: Wednesday, February 21, 2018 at 11:51 AM
>> From: "Sebastian Nagel" 
>> To: user@nutch.apache.org
>> Subject: Re: Internal links appear to be external in Parse. Improvement of 
>> the crawling quality
>> Hi Semyon,
>>
>>> interpret www.somewebsite.com and somewebsite.com as one host?
>>
>> Yes, that's a common problem. More because of external links which must
>> include the host name - well-designed sites would use relative links
>> for internal same-host links.
>>
>> For a quick work-around:
>> - set db.ignore.external.links.mode=byDomain
>> - modify the method URLUtil.getDomainName(URL url)
>> so that it returns the hostname with www. stripped
>>
>> For a final solution we could make it configurable
>> which method or class is called. Since the definition of "domain"
>> is somewhat debatable [1], we could even provide alternative
>> implementations.
>>
>>> PS. For me it is not really clear how ProtocolResolver works.
>>
>> It's only a heuristic to avoid duplicates by protocol (http and https).
>> If you care about duplicates and cannot get rid of them afterwards by a 
>> deduplication job, you may have a look at urlnormalizer-protocol and NUTCH-2447.

Re: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-03-06 Thread Semyon Semyonov
I have proposed a solution for this problem 
https://issues.apache.org/jira/browse/NUTCH-2522.

The other question is how voting mechanism of UrlExemptionFilters should work.

UrlExemptionFilters.java : lines 60-65
//An URL is exempted when all the filters accept it to pass through
for (int i = 0; i < this.filters.length && exempted; i++) {
  exempted = this.filters[i].filter(fromUrl, toUrl);
}

We apply a logical AND here, which is not really reasonable.

I think that if one of the filters votes to exempt it, then we should exempt it, 
hence a logical OR instead.
For example, with the new filter, links such as http://www.website.com -> 
http://website.com/about can be exempted, but the standard filter will not exempt 
them because they are from different hosts. With the current logic, the URL will 
not be exempted because of the logical AND.


Any ideas?

 
 

Sent: Wednesday, February 21, 2018 at 2:58 PM
From: "Sebastian Nagel" 
To: user@nutch.apache.org
Subject: Re: Internal links appear to be external in Parse. Improvement of the 
crawling quality
> 1) Do we have a config setting that we can use already?

Not out-of-the-box. But there is already an extension point for your use case 
[1]:
the filter method takes two arguments (fromURL and toURL).
Have a look at it, maybe you can fix it by implementing/contributing a plugin.

> 2) ... It looks more like same Host problem rather ...

To determine the host of a URL Nutch uses everywhere java.net.URL.getHost()
which implements RFC 1738 [2]. We cannot change Java but it would be possible
to modify URLUtil.getDomainName(...), at least, as a work-around.

> 3) Where should this problem be solved? Only in ParseOutputFormat.java or 
> somewhere else as well?

You may also want to fix it in FetcherThread.handleRedirect(...), which also 
affects your use case of following only internal links 
(if db.ignore.also.redirects == true).

Best,
Sebastian


[1] 
https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/net/URLExemptionFilter.html

https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.html
[2] 
https://tools.ietf.org/html/rfc1738#section-3.1


On 02/21/2018 01:52 PM, Semyon Semyonov wrote:
> Hi Sebastian,
>
> If I
> - modify the method URLUtil.getDomainName(URL url)
>
> doesn't it mean that I don't need
>  - set db.ignore.external.links.mode=byDomain
>
> anymore? http://www.somewebsite.com becomes the same host as somewebsite.com.
>
>
> To make it as generic as possible I can create an issue/pull request for 
> this, but I would like to hear your suggestion about the best way to do so.
> 1) Do we have a config setting that we can use already?
> 2) The domain discussion [1] is quite wide, though. In my case I cover only one 
> issue, the mapping www -> _ . It looks more like a same-Host problem rather 
> than a same-Domain problem. What do you think about such host resolution?
> 3) Where should this problem be solved? Only in ParseOutputFormat.java or 
> somewhere else as well?
>
> Semyon.
>
>
>  
>
> Sent: Wednesday, February 21, 2018 at 11:51 AM
> From: "Sebastian Nagel" 
> To: user@nutch.apache.org
> Subject: Re: Internal links appear to be external in Parse. Improvement of 
> the crawling quality
> Hi Semyon,
>
>> interpret www.somewebsite.com and somewebsite.com as one host?
>
> Yes, that's a common problem. More because of external links which must
> include the host name - well-designed sites would use relative links
> for internal same-host links.
>
> For a quick work-around:
> - set db.ignore.external.links.mode=byDomain
> - modify the method URLUtil.getDomainName(URL url)
> so that it returns the hostname with www. stripped
>
> For a final solution we could make it configurable
> which method or class is called. Since the definition of "domain"
> is somewhat debatable [1], we could even provide alternative
> implementations.
>
>> PS. For me it is not really clear how ProtocolResolver works.
>
> It's only a heuristic to avoid duplicates by protocol (http and https).
> If you care about duplicates and cannot get rid of them afterwards by a 
> deduplication job,
> you may have a look at urlnormalizer-protocol and NUTCH-2447.
>
> Best,
> Sebastian
>
>
> [1] 
> https://github.com/google/guava/wiki/InternetDomainNameExplained
>
> On 02/21/2018 10:44 AM, Semyon Semyonov wrote:
>> Thanks Yossi, Markus,
>>
>> I have an issue with the db.ignore.external.links.mode=byDomain solution.
>>
>> I crawl specific hosts only, therefore I have a finite number of hosts to 
>> crawl.
>> Let's say, www.somewebsite.com

Re: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-02-21 Thread Sebastian Nagel
> 1) Do we have a config setting that we can use already?

Not out-of-the-box. But there is already an extension point for your use case 
[1]:
the filter method takes two arguments (fromURL and toURL).
Have a look at it, maybe you can fix it by implementing/contributing a plugin.
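A hedged sketch of such a plugin: a filter that exempts a link when the two hosts differ only by a leading "www.". The interface is reproduced locally so the sketch is self-contained; in Nutch the extension point is org.apache.nutch.net.URLExemptionFilter, and the class name below is made up:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class WwwExemptionSketch {
    // Local copy of the extension point's shape: exempt fromUrl -> toUrl?
    interface URLExemptionFilter {
        boolean filter(String fromUrl, String toUrl);
    }

    // Exempts a link when both hosts are equal after stripping "www."
    static class BidirectionalWwwFilter implements URLExemptionFilter {
        public boolean filter(String fromUrl, String toUrl) {
            try {
                return stripWww(new URL(fromUrl).getHost())
                        .equalsIgnoreCase(stripWww(new URL(toUrl).getHost()));
            } catch (MalformedURLException e) {
                return false; // unparsable URLs are never exempted
            }
        }

        private static String stripWww(String host) {
            return host.startsWith("www.") ? host.substring(4) : host;
        }
    }

    public static void main(String[] args) {
        URLExemptionFilter f = new BidirectionalWwwFilter();
        System.out.println(f.filter("http://www.website.com/page1",
                                    "http://website.com/page2"));      // true
        System.out.println(f.filter("http://www.art.somewebsite.com/",
                                    "http://www.somewebsite.com/"));   // false
    }
}
```

Note the second case: a subdomain such as www.art.somewebsite.com is still treated as a different host, which matches the byHost crawling intent described later in the thread.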

> 2) ... It looks more like same Host problem rather ...

To determine the host of a URL Nutch uses everywhere java.net.URL.getHost()
which implements RFC 1738 [2].  We cannot change Java but it would be possible
to modify URLUtil.getDomainName(...), at least, as a work-around.
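For illustration, java.net.URL.getHost() keeps the "www." prefix, so the two spellings really are distinct hosts as far as Nutch is concerned:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class GetHostDemo {
    public static void main(String[] args) throws MalformedURLException {
        // getHost() does no normalization beyond parsing the authority part.
        System.out.println(new URL("http://www.somewebsite.com/games").getHost());
        // -> www.somewebsite.com
        System.out.println(new URL("http://somewebsite.com/games").getHost());
        // -> somewebsite.com
    }
}
```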

> 3) Where should this problem be solved? Only in ParseOutputFormat.java or 
> somewhere else as well?

You may also want to fix it in FetcherThread.handleRedirect(...), which also 
affects your use case of following only internal links 
(if db.ignore.also.redirects == true).

Best,
Sebastian


[1] 
https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/net/URLExemptionFilter.html

https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.html
[2] https://tools.ietf.org/html/rfc1738#section-3.1


On 02/21/2018 01:52 PM, Semyon Semyonov wrote:
> Hi Sebastian,
> 
> If I
> - modify the method URLUtil.getDomainName(URL url)
> 
> doesn't it mean that I don't need 
>  - set db.ignore.external.links.mode=byDomain
> 
> anymore? http://www.somewebsite.com becomes the same host as somewebsite.com.
> 
> 
> To make it as generic as possible I can create an issue/pull request for 
> this, but I would like to hear your suggestion about the best way to do so.
> 1) Do we have a config setting that we can use already?
> 2) The domain discussion [1] is quite wide, though. In my case I cover only one 
> issue, the mapping www -> _ . It looks more like a same-Host problem rather 
> than a same-Domain problem. What do you think about such host resolution?
> 3) Where should this problem be solved? Only in ParseOutputFormat.java or 
> somewhere else as well?
> 
> Semyon.
> 
> 
>  
> 
> Sent: Wednesday, February 21, 2018 at 11:51 AM
> From: "Sebastian Nagel" 
> To: user@nutch.apache.org
> Subject: Re: Internal links appear to be external in Parse. Improvement of 
> the crawling quality
> Hi Semyon,
> 
>> interpret www.somewebsite.com and somewebsite.com as one host?
> 
> Yes, that's a common problem. More because of external links which must
> include the host name - well-designed sites would use relative links
> for internal same-host links.
> 
> For a quick work-around:
> - set db.ignore.external.links.mode=byDomain
> - modify the method URLUtil.getDomainName(URL url)
> so that it returns the hostname with www. stripped
> 
> For a final solution we could make it configurable
> which method or class is called. Since the definition of "domain"
> is somewhat debatable [1], we could even provide alternative
> implementations.
> 
>> PS. For me it is not really clear how ProtocolResolver works.
> 
> It's only a heuristic to avoid duplicates by protocol (http and https).
> If you care about duplicates and cannot get rid of them afterwards by a 
> deduplication job,
> you may have a look at urlnormalizer-protocol and NUTCH-2447.
> 
> Best,
> Sebastian
> 
> 
> [1] 
> https://github.com/google/guava/wiki/InternetDomainNameExplained
> 
> On 02/21/2018 10:44 AM, Semyon Semyonov wrote:
>> Thanks Yossi, Markus,
>>
>> I have an issue with the db.ignore.external.links.mode=byDomain solution.
>>
>> I crawl specific hosts only, therefore I have a finite number of hosts to 
>> crawl.
>> Let's say, www.somewebsite.com
>>
>> I want to stay limited to this host. In other words, neither 
>> www.art.somewebsite.com nor www.sport.somewebsite.com.
>> That's why db.ignore.external.links.mode=byHost and db.ignore.external = 
>> true (no external websites).
>>
>> Although, I want to get the links that seem to belong to the same 
>> host (www.somewebsite.com -> somewebsite.com/games, without www).
>> The question is: shouldn't we include it as default behavior (or configured 
>> behavior) in Nutch and interpret www.somewebsite.com and somewebsite.com as 
>> one host?
>>
>>
>>
>> PS. For me it is not really clear how ProtocolResolver works.
>>
>> Semyon
>>
>>
>>  
>>
>> Sent: Tuesday, February 20, 2018 at 9:40 PM
>> From: "Markus Jelsma" 
>> To: "user@nutch.apache.org" 
>> Subject: RE: Internal links appear to be external in Parse. Improvement of 
>> the crawling quality
>> Hello Semyon,
>>
>> Yossi is right, you can use the db.ignore.* set of directives to resolve the 
>> problem.
>>
>> Regarding protocol, you can use urlnormalizer-protocol to set up per-host 
>> rules. This is, of course, a tedious job if you operate a crawl on an 
>> indefinite number of hosts, so use the uncommitted ProtocolResolver to do it 
>> for you.

Re: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-02-21 Thread Semyon Semyonov
Hi Sebastian,

If I
- modify the method URLUtil.getDomainName(URL url)

doesn't it mean that I don't need 
 - set db.ignore.external.links.mode=byDomain

anymore? http://www.somewebsite.com becomes the same host as somewebsite.com.


To make it as generic as possible I can create an issue/pull request for this, 
but I would like to hear your suggestion about the best way to do so.
1) Do we have a config setting that we can use already?
2) The domain discussion [1] is quite wide, though. In my case I cover only one 
issue, the mapping www -> _ . It looks more like a same-Host problem rather 
than a same-Domain problem. What do you think about such host resolution?
3) Where should this problem be solved? Only in ParseOutputFormat.java or 
somewhere else as well?

Semyon.


 

Sent: Wednesday, February 21, 2018 at 11:51 AM
From: "Sebastian Nagel" 
To: user@nutch.apache.org
Subject: Re: Internal links appear to be external in Parse. Improvement of the 
crawling quality
Hi Semyon,

> interpret www.somewebsite.com and somewebsite.com as one host?

Yes, that's a common problem. More because of external links which must
include the host name - well-designed sites would use relative links
for internal same-host links.

For a quick work-around:
- set db.ignore.external.links.mode=byDomain
- modify the method URLUtil.getDomainName(URL url)
so that it returns the hostname with www. stripped

For a final solution we could make it configurable
which method or class is called. Since the definition of "domain"
is somewhat debatable [1], we could even provide alternative
implementations.

> PS. For me it is not really clear how ProtocolResolver works.

It's only a heuristic to avoid duplicates by protocol (http and https).
If you care about duplicates and cannot get rid of them afterwards by a 
deduplication job,
you may have a look at urlnormalizer-protocol and NUTCH-2447.

Best,
Sebastian


[1] 
https://github.com/google/guava/wiki/InternetDomainNameExplained

On 02/21/2018 10:44 AM, Semyon Semyonov wrote:
> Thanks Yossi, Markus,
>
> I have an issue with the db.ignore.external.links.mode=byDomain solution.
>
> I crawl specific hosts only, therefore I have a finite number of hosts to 
> crawl.
> Let's say, www.somewebsite.com
>
> I want to stay limited to this host. In other words, neither 
> www.art.somewebsite.com nor www.sport.somewebsite.com.
> That's why db.ignore.external.links.mode=byHost and db.ignore.external = 
> true (no external websites).
>
> Although, I want to get the links that seem to belong to the same 
> host (www.somewebsite.com -> somewebsite.com/games, without www).
> The question is: shouldn't we include it as default behavior (or configured 
> behavior) in Nutch and interpret www.somewebsite.com and somewebsite.com as 
> one host?
>
>
>
> PS. For me it is not really clear how ProtocolResolver works.
>
> Semyon
>
>
>  
>
> Sent: Tuesday, February 20, 2018 at 9:40 PM
> From: "Markus Jelsma" 
> To: "user@nutch.apache.org" 
> Subject: RE: Internal links appear to be external in Parse. Improvement of 
> the crawling quality
> Hello Semyon,
>
> Yossi is right, you can use the db.ignore.* set of directives to resolve the 
> problem.
>
> Regarding protocol, you can use urlnormalizer-protocol to set up per-host 
> rules. This is, of course, a tedious job if you operate a crawl on an 
> indefinite number of hosts, so use the uncommitted ProtocolResolver to do it 
> for you.
>
> See: 
> https://issues.apache.org/jira/browse/NUTCH-2247
>
> If i remember it tomorrow afternoon, i can probably schedule some time to 
> work on it the coming seven days or so, and commit.
>
> Regards,
> Markus
>
> -Original message-
>> From:Yossi Tamari 
>> Sent: Tuesday 20th February 2018 21:06
>> To: user@nutch.apache.org
>> Subject: RE: Internal links appear to be external in Parse. Improvement of 
>> the crawling quality
>>
>> Hi Semyon,
>>
>> Wouldn't setting db.ignore.external.links.mode=byDomain solve your wincs.be 
>> issue?
>> As far as I can see the protocol (HTTP/HTTPS) does not play any part in the 
>> decision if this is the same domain.
>>
>> Yossi.
>>
>>> -Original Message-
>>> From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
>>> Sent: 20 February 2018 20:43
>>> To: user@nutch.apache.org 
>>> Subject: Internal links appear to be external in Parse. Improvement of the
>>> crawling quality
>>>
>>> Dear All,
>>>
>>> I'm trying to increase quality of the crawling. A part of my database has
>>> DB_FETCHED = 1.
>>>
>>> Example, http://www.wincs.be/ in seed list.

Re: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-02-21 Thread Sebastian Nagel
Hi Semyon,

> interpret www.somewebsite.com and somewebsite.com as one host?

Yes, that's a common problem. More because of external links which must
include the host name - well-designed sites would use relative links
for internal same-host links.

For a quick work-around:
- set db.ignore.external.links.mode=byDomain
- modify the method URLUtil.getDomainName(URL url)
  so that it returns the hostname with www. stripped
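A sketch of that work-around, assuming db.ignore.external.links.mode=byDomain is set in nutch-site.xml. This is a stand-alone illustration of the stripping logic only, not a drop-in patch for URLUtil.getDomainName (the real method also handles domain suffixes):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class DomainNameSketch {
    // Work-around behavior: treat "www.example.com" and "example.com"
    // as the same "domain" by stripping a leading "www.".
    static String getDomainName(URL url) {
        String host = url.getHost();
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    public static void main(String[] args) throws MalformedURLException {
        System.out.println(getDomainName(new URL("http://www.somewebsite.com/page1")));
        System.out.println(getDomainName(new URL("http://somewebsite.com/page2")));
        // Both print: somewebsite.com
    }
}
```

With byDomain mode, both URLs then resolve to the same "domain" and outlinks between them are no longer treated as external.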

For a final solution we could make it configurable
which method or class is called. Since the definition of "domain"
is somewhat debatable [1], we could even provide alternative
implementations.

> PS. For me it is not really clear how ProtocolResolver works.

It's only a heuristic to avoid duplicates by protocol (http and https).
If you care about duplicates and cannot get rid of them afterwards by a 
deduplication job,
you may have a look at urlnormalizer-protocol and NUTCH-2447.

Best,
Sebastian


[1] https://github.com/google/guava/wiki/InternetDomainNameExplained

On 02/21/2018 10:44 AM, Semyon Semyonov wrote:
> Thanks Yossi, Markus,
> 
> I have an issue with the db.ignore.external.links.mode=byDomain solution.
> 
> I crawl specific hosts only, therefore I have a finite number of hosts to 
> crawl.
> Let's say, www.somewebsite.com
> 
> I want to stay limited to this host. In other words, neither 
> www.art.somewebsite.com nor www.sport.somewebsite.com.
> That's why db.ignore.external.links.mode=byHost and db.ignore.external = 
> true (no external websites).
> 
> Although, I want to get the links that seem to belong to the same 
> host (www.somewebsite.com -> somewebsite.com/games, without www).
> The question is: shouldn't we include it as default behavior (or configured 
> behavior) in Nutch and interpret www.somewebsite.com and somewebsite.com as 
> one host?
> 
> 
> 
> PS. For me it is not really clear how ProtocolResolver works.
> 
> Semyon
> 
> 
>  
> 
> Sent: Tuesday, February 20, 2018 at 9:40 PM
> From: "Markus Jelsma" 
> To: "user@nutch.apache.org" 
> Subject: RE: Internal links appear to be external in Parse. Improvement of 
> the crawling quality
> Hello Semyon,
> 
> Yossi is right, you can use the db.ignore.* set of directives to resolve the 
> problem.
> 
> Regarding protocol, you can use urlnormalizer-protocol to set up per-host 
> rules. This is, of course, a tedious job if you operate a crawl on an 
> indefinite number of hosts, so use the uncommitted ProtocolResolver 
> to do it for you.
> 
> See: https://issues.apache.org/jira/browse/NUTCH-2247
> 
> If I remember it tomorrow afternoon, I can probably schedule some time to 
> work on it in the coming seven days or so, and commit.
> 
> Regards,
> Markus
> 
> -Original message-
>> From:Yossi Tamari 
>> Sent: Tuesday 20th February 2018 21:06
>> To: user@nutch.apache.org
>> Subject: RE: Internal links appear to be external in Parse. Improvement of 
>> the crawling quality
>>
>> Hi Semyon,
>>
>> Wouldn't setting db.ignore.external.links.mode=byDomain solve your wincs.be 
>> issue?
>> As far as I can see the protocol (HTTP/HTTPS) does not play any part in the 
>> decision whether this is the same domain.
>>
>> Yossi.
>>
>>> -Original Message-
>>> From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
>>> Sent: 20 February 2018 20:43
>>> To: user@nutch.apache.org
>>> Subject: Internal links appear to be external in Parse. Improvement of the
>>> crawling quality
>>>
>>> Dear All,
>>>
>>> I'm trying to increase quality of the crawling. A part of my database has
>>> DB_FETCHED = 1.
>>>
>>> Example, http://www.wincs.be/ in seed list.
>>>
>>> The root of the problem is in Parse/ParseOutputFormat.java line 364 - 374
>>>
>>> Nutch considers one of the links
>>> (http://wincs.be/lakindustrie.html) as external
>>> and therefore rejects it.
>>>
>>>
>>> If I insert http://wincs.be in the seed file, everything works fine.
>>>
>>> Do you think this is good behavior? I mean, formally these are indeed two
>>> different domains, but from the user's perspective they are exactly the same.
>>>
>>> And if it is the default behavior, how can I fix it for my case? The same
>>> question applies to similar switches such as http -> https.
>>>
>>> Thanks.
>>>
>>> Semyon.
>>
>>



Re: RE: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-02-21 Thread Semyon Semyonov
Thanks Yossi, Markus,

I have an issue with the db.ignore.external.links.mode=byDomain solution.

I crawl specific hosts only, therefore I have a finite number of hosts to crawl.
Let's say, www.somewebsite.com

I want to stay limited with this host. In other words, neither 
www.art.somewebsite.com nor www.sport.somewebsite.com.
That's why db.ignore.external.links.mode=byHost and db.ignore.external = 
true (no external websites).

However, I want to keep the links that seem to belong to the same 
host (www.somewebsite.com -> somewebsite.com/games, without www).
The question is: shouldn't we include this as default (or configurable) 
behavior in Nutch and interpret www.somewebsite.com and somewebsite.com as 
one host?
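The proposed equivalence can be sketched as a small host comparison (hypothetical helper names, not existing Nutch code): hosts are treated as identical when they differ only by a leading "www.", while real subdomains such as art.somewebsite.com remain distinct.

```java
public class HostAlias {

    /**
     * Treats two hosts as the same when they differ only by a leading
     * "www." (www.somewebsite.com == somewebsite.com), while still
     * rejecting real subdomains like art.somewebsite.com.
     */
    public static boolean sameHost(String a, String b) {
        return strip(a).equals(strip(b));
    }

    private static String strip(String host) {
        String h = host.toLowerCase();
        return h.startsWith("www.") ? h.substring(4) : h;
    }

    public static void main(String[] args) {
        System.out.println(sameHost("www.somewebsite.com", "somewebsite.com"));  // true
        System.out.println(sameHost("art.somewebsite.com", "somewebsite.com"));  // false
    }
}
```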



PS. For me it is not really clear how ProtocolResolver works.

Semyon


 

Sent: Tuesday, February 20, 2018 at 9:40 PM
From: "Markus Jelsma" 
To: "user@nutch.apache.org" 
Subject: RE: Internal links appear to be external in Parse. Improvement of the 
crawling quality
Hello Semyon,

Yossi is right, you can use the db.ignore.* set of directives to resolve the 
problem.

Regarding protocol, you can use urlnormalizer-protocol to set up per-host 
rules. This is, of course, a tedious job if you operate a crawl on an 
indefinite number of hosts, so use the uncommitted ProtocolResolver to 
do it for you.

See: https://issues.apache.org/jira/browse/NUTCH-2247

If I remember it tomorrow afternoon, I can probably schedule some time to work 
on it in the coming seven days or so, and commit.

Regards,
Markus

-Original message-
> From:Yossi Tamari 
> Sent: Tuesday 20th February 2018 21:06
> To: user@nutch.apache.org
> Subject: RE: Internal links appear to be external in Parse. Improvement of 
> the crawling quality
>
> Hi Semyon,
>
> Wouldn't setting db.ignore.external.links.mode=byDomain solve your wincs.be 
> issue?
> As far as I can see the protocol (HTTP/HTTPS) does not play any part in the 
> decision whether this is the same domain.
>
> Yossi.
>
> > -Original Message-
> > From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
> > Sent: 20 February 2018 20:43
> > To: user@nutch.apache.org
> > Subject: Internal links appear to be external in Parse. Improvement of the
> > crawling quality
> >
> > Dear All,
> >
> > I'm trying to increase quality of the crawling. A part of my database has
> > DB_FETCHED = 1.
> >
> > Example, http://www.wincs.be/ in seed list.
> >
> > The root of the problem is in Parse/ParseOutputFormat.java line 364 - 374
> >
> > Nutch considers one of the links
> > (http://wincs.be/lakindustrie.html) as external
> > and therefore rejects it.
> >
> >
> > If I insert http://wincs.be in the seed file, everything works fine.
> >
> > Do you think this is good behavior? I mean, formally these are indeed two
> > different domains, but from the user's perspective they are exactly the same.
> >
> > And if it is the default behavior, how can I fix it for my case? The same
> > question applies to similar switches such as http -> https.
> >
> > Thanks.
> >
> > Semyon.
>
>


RE: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-02-20 Thread Yossi Tamari
Hi Semyon,

Wouldn't setting db.ignore.external.links.mode=byDomain solve your wincs.be 
issue?
As far as I can see the protocol (HTTP/HTTPS) does not play any part in the 
decision whether this is the same domain.

Yossi.

> -Original Message-
> From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
> Sent: 20 February 2018 20:43
> To: user@nutch.apache.org
> Subject: Internal links appear to be external in Parse. Improvement of the
> crawling quality
> 
> Dear All,
> 
> I'm trying to increase quality of the crawling. A part of my database has
> DB_FETCHED = 1.
> 
> Example, http://www.wincs.be/ in seed list.
> 
> The root of the problem is in Parse/ParseOutputFormat.java line 364 - 374
> 
> Nutch considers one of the links (http://wincs.be/lakindustrie.html) as external
> and therefore rejects it.
> 
> 
> If I insert http://wincs.be in the seed file, everything works fine.
> 
> Do you think this is good behavior? I mean, formally these are indeed two
> different domains, but from the user's perspective they are exactly the same.
> 
> And if it is the default behavior, how can I fix it for my case? The same
> question applies to similar switches such as http -> https.
> 
> Thanks.
> 
> Semyon.