Re: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-03-16 Thread Semyon Semyonov
Hi again,

Another issue has appeared with introduction of bidirectional url exemption 
filter.

Having 
http://www.website.com/page1
and
http://website.com/page2

Before, the indexer output (let's say a text file) had one
parent/host (www.website.com) with children/pages (http://www.website.com/page1,
http://www.website.com/...).
Now I get two different hosts and therefore two different parents in my
output. I would prefer to have the same hostname/alias for both hosts.

I checked the URL exemption filters and they don't allow adding metadata to
the parsed data.

Therefore, two questions:
1) What is the best way to do it?
2) Should I include it in the Nutch code, or is it not needed there and I
should just make a quick fix for myself?
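
For illustration, a quick fix on my side could be as simple as normalizing
the host before grouping the output, along these lines (just a sketch; the
class and method names are only illustrative, not existing Nutch code):

    import java.net.MalformedURLException;
    import java.net.URL;

    public class HostAlias {

      // Sketch only: map www.website.com and website.com to the same alias
      // by stripping a leading "www." from the host.
      public static String hostAlias(String url) throws MalformedURLException {
        String host = new URL(url).getHost().toLowerCase();
        return host.startsWith("www.") ? host.substring(4) : host;
      }

      public static void main(String[] args) throws MalformedURLException {
        // Both pages end up under the same parent/host "website.com"
        System.out.println(hostAlias("http://www.website.com/page1"));
        System.out.println(hostAlias("http://website.com/page2"));
      }
    }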

Semyon.
 

Sent: Tuesday, March 06, 2018 at 11:08 AM
From: "Sebastian Nagel" 
To: user@nutch.apache.org
Subject: Re: Internal links appear to be external in Parse. Improvement of the 
crawling quality
Hi Semyon,

> We apply a logical AND here, which is not really reasonable.

Until now there was only a single exemption filter, so it made no difference.
But yes, it sounds plausible to change this to an OR, i.e. return true
as soon as one of the filters accepts/exempts the URL. Please open an issue
to change it.
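
Based on the loop you quoted, the changed voting could look roughly like this
(just a sketch of the idea):

    // Sketch: a URL is exempted as soon as one of the filters exempts it
    // (logical OR instead of the current logical AND)
    boolean exempted = false;
    for (int i = 0; i < this.filters.length && !exempted; i++) {
      exempted = this.filters[i].filter(fromUrl, toUrl);
    }
    return exempted;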

Thanks,
Sebastian

On 03/06/2018 10:28 AM, Semyon Semyonov wrote:
> I have proposed a solution for this problem 
> https://issues.apache.org/jira/browse/NUTCH-2522.
>
> The other question is how voting mechanism of UrlExemptionFilters should work.
>
> UrlExemptionFilters.java, lines 60-65:
>
>     //An URL is exempted when all the filters accept it to pass through
>     for (int i = 0; i < this.filters.length && exempted; i++) {
>       exempted = this.filters[i].filter(fromUrl, toUrl);
>     }
>
> We apply a logical AND here, which is not really reasonable.
>
> I think that if one of the filters votes to exempt a URL, we should exempt
> it, so a logical OR instead.
> For example, with the new filter, links such as
> http://www.website.com -> http://website.com/about can be exempted, but the
> standard filter will not exempt them because they come from different hosts.
> With the current logic the URL will not be exempted, because of the
> logical AND.
>
>
> Any ideas?
>
>  
>  
>
> Sent: Wednesday, February 21, 2018 at 2:58 PM
> From: "Sebastian Nagel" 
> To: user@nutch.apache.org
> Subject: Re: Internal links appear to be external in Parse. Improvement of 
> the crawling quality
>> 1) Do we have a config setting that we can use already?
>
> Not out-of-the-box. But there is already an extension point for your use case 
> [1]:
> the filter method takes two arguments (fromURL and toURL).
> Have a look at it, maybe you can fix it by implementing/contributing a plugin.
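>
> To give an idea, a rough sketch of what the filter method of such a plugin
> could look like for your www/non-www case (the plugin descriptor and build
> wiring are left out, and the setConf/getConf part assumes the interface
> extends Hadoop's Configurable like the other filter interfaces - please
> check the javadoc under [1]):
>
>     import java.net.MalformedURLException;
>     import java.net.URL;
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.nutch.net.URLExemptionFilter;
>
>     // Sketch only: exempt an outlink when source and target host differ
>     // merely by a leading "www." (www.website.com vs website.com).
>     public class ExemptWwwHostFilter implements URLExemptionFilter {
>
>       private Configuration conf;
>
>       public boolean filter(String fromUrl, String toUrl) {
>         try {
>           String from = stripWww(new URL(fromUrl).getHost());
>           String to = stripWww(new URL(toUrl).getHost());
>           return from.equalsIgnoreCase(to);
>         } catch (MalformedURLException e) {
>           return false; // malformed URLs are never exempted
>         }
>       }
>
>       private static String stripWww(String host) {
>         return host.regionMatches(true, 0, "www.", 0, 4)
>             ? host.substring(4) : host;
>       }
>
>       public void setConf(Configuration conf) {
>         this.conf = conf;
>       }
>
>       public Configuration getConf() {
>         return conf;
>       }
>     }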
>
>> 2) ... It looks more like a same-Host problem rather ...
>
> To determine the host of a URL Nutch uses everywhere java.net.URL.getHost()
> which implements RFC 1738 [2]. We cannot change Java but it would be possible
> to modify URLUtil.getDomainName(...), at least, as a work-around.
>
>> 3) Where should this problem be solved? Only in ParseOutputFormat.java or
>> somewhere else as well?
>
> You may also want to fix it in FetcherThread.handleRedirect(...), which also
> affects your use case of following only internal links
> (if db.ignore.also.redirects == true).
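>
> For reference, the relevant settings would look roughly like this in
> nutch-site.xml (the values are only an example for an internal-links-only
> crawl):
>
>     <property>
>       <name>db.ignore.external.links</name>
>       <value>true</value>
>     </property>
>     <property>
>       <name>db.ignore.external.links.mode</name>
>       <value>byDomain</value>
>     </property>
>     <property>
>       <name>db.ignore.also.redirects</name>
>       <value>true</value>
>     </property>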
>
> Best,
> Sebastian
>
>
> [1] https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/net/URLExemptionFilter.html
>     https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.html
> [2] https://tools.ietf.org/html/rfc1738#section-3.1
>
>
> On 02/21/2018 01:52 PM, Semyon Semyonov wrote:
>> Hi Sebastian,
>>
>> If I
>> - modify the method URLUtil.getDomainName(URL url)
>>
>> doesn't it mean that I don't need
>>  - set db.ignore.external.links.mode=byDomain
>>
>> anymore? http://www.somewebsite.com becomes the same host as
>> somewebsite.com.
>>
>>
>> To make it as generic as possible I can create an issue/pull request for 
>> this, but I would like to hear your suggestion about the best way to do so.
>> 1) Do we have a config setting that we can use already?
>> 2) The domain discussion [1] is quite broad though. In my case I cover only
>> one issue, the mapping www -> _ . It looks more like a same-Host problem
>> rather than a same-Domain problem. What do you think about such host
>> resolution?
>> 3) Where should this problem be solved? Only in ParseOutputFormat.java or
>> somewhere else as well?

Re: Fetcher error when running on Amazon EMR with S3

2018-03-16 Thread Sebastian Nagel
Hi John,

the recent master has seen an upgrade to the new MapReduce API (NUTCH-2375),
a huge change which is already known to have introduced some issues.
For production it's recommended to use 1.14 and, if necessary, patch it.

Could you open a new issue on
  https://issues.apache.org/jira/projects/NUTCH
and provide the detailed stack trace there?
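
For background, this kind of "Wrong FS" error usually shows up when code asks
the default FileSystem (hdfs://...) to handle a path that lives on another
filesystem. A minimal illustration of the pattern (an assumed example with a
made-up bucket name, not the actual code path in Nutch):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WrongFsExample {

      public static void demo(Configuration conf) throws Exception {
        // Hypothetical segment path on S3 (bucket name is made up)
        Path segment = new Path("s3://my-bucket/crawl/segments/20180316");

        // Typical cause of "Wrong FS": the default FileSystem (hdfs://...)
        // is used for an s3:// path.
        FileSystem defaultFs = FileSystem.get(conf);
        // defaultFs.exists(segment); // IllegalArgumentException: Wrong FS

        // Usual fix: obtain the FileSystem from the path itself.
        FileSystem fs = segment.getFileSystem(conf);
        fs.exists(segment);
      }
    }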

Thanks,
Sebastian

On 03/16/2018 01:45 PM, John Thornton wrote:
> Hello,
> 
> I'm currently running Nutch under Amazon EMR 5.12.0 with Hadoop 2.8.3 using
> S3 (EMRFS) as the filesystem.  If I build the latest version from the
> master branch and run a crawl in distributed mode I get a fetcher error
> like fetcher.Fetcher: Fetcher: java.lang.IllegalArgumentException: Wrong
> FS: s3:..., expected: hdfs://...
> 
> This problem was reported in NUTCH-2494 and fixed in PR-274, and indeed, when
> I run the same crawl using a build of commit 87c7a2e, it works with no
> error. So my question is: has a regression been introduced, or am I missing
> something?
> 
> Regards,
> 
> John
> 



Fetcher error when running on Amazon EMR with S3

2018-03-16 Thread John Thornton
Hello,

I'm currently running Nutch under Amazon EMR 5.12.0 with Hadoop 2.8.3 using
S3 (EMRFS) as the filesystem.  If I build the latest version from the
master branch and run a crawl in distributed mode I get a fetcher error
like fetcher.Fetcher: Fetcher: java.lang.IllegalArgumentException: Wrong
FS: s3:..., expected: hdfs://...

This problem was reported in NUTCH-2494 and fixed in PR-274, and indeed, when
I run the same crawl using a build of commit 87c7a2e, it works with no
error. So my question is: has a regression been introduced, or am I missing
something?

Regards,

John