RE: Need Tutorial on Nutch

2018-03-06 Thread Markus Jelsma
Hi,

Yes, you are going to need code, and a lot more than just that, probably 
including dropping the 'every two hours' requirement.

For your case you need either site-specific price extraction, which is easy but 
a lot of work for 500+ sites, or a more complicated generic algorithm, which is 
also a lot of work. Both can be implemented as Nutch ParseFilter plugins and 
require Java code.
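
To give a rough idea of what site-specific extraction looks like, here is a minimal, hypothetical sketch against the HtmlParseFilter extension point of Nutch 1.x. The class name, the "class attribute contains price" rule, and the product.price metadata key are illustrative assumptions, not anything Nutch ships with:

// Hypothetical sketch of a site-specific price extractor as a Nutch 1.x
// HtmlParseFilter plugin. Selector logic and metadata key are assumptions.
package org.example.parse.price;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class PriceParseFilter implements HtmlParseFilter {

  private Configuration conf;

  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    // Site-specific rule: first element whose class attribute contains "price".
    String price = findPrice(doc);
    Parse parse = parseResult.get(content.getUrl());
    if (price != null && parse != null) {
      // Store the extracted value so an indexing filter can pick it up later.
      parse.getData().getParseMeta().set("product.price", price);
    }
    return parseResult;
  }

  private String findPrice(Node node) {
    if (node.getAttributes() != null) {
      Node cls = node.getAttributes().getNamedItem("class");
      if (cls != null && cls.getNodeValue().contains("price")) {
        return node.getTextContent().trim();
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      String found = findPrice(children.item(i));
      if (found != null) {
        return found;
      }
    }
    return null;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}

Wiring it in still needs the usual plugin descriptor (plugin.xml), the plugin build, and adding the plugin id to plugin.includes.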

Your next problem is daily volume: every product 12x per day for 500+ shops, 
times many products. You can ignore bandwidth and processing, that part is easy. 
But within a few days you are going to be blocked by a good number of the 
sites.

We once built a price-checker crawler too, but the client's requirement for 
very frequent checks could not be met easily without costly proxies to avoid 
being blocked, plus hardware and network costs. They dropped the requirement.

Good luck
Markus
 


Re: Need Tutorial on Nutch

2018-03-06 Thread Eric Valencia
Yash, well, I want to monitor the price of every item on the top 500
retail websites every two hours, 24/7/365. Is Java needed?



Re: Need Tutorial on Nutch

2018-03-06 Thread Yash Thenuan Thenuan
If you want simple crawling, then not at all.
But experience with Java will help you fulfil your specific
requirements.



Re: Need Tutorial on Nutch

2018-03-06 Thread Eric Valencia
Does this require knowing Java proficiently?



RE: Regarding Internal Links

2018-03-06 Thread Yossi Tamari
Regarding the configuration parameter, your Parse Filter should expose a 
setConf method that receives a conf parameter. Keep that as a member variable 
and pass it where necessary.
Regarding parsestatus, contentmeta and parsemeta, you're going to have to look 
at them yourself (probably in a debugger), but as a baseline, you can probably 
just use the values in the inbound ParseResult (of the whole document).
More specifically, parsestatus indicates whether parsing was successful; unless 
your parsing can fail even when the whole-document parsing succeeded, you don't 
need to change it. contentmeta is all the information that was gathered about 
this page before parsing, so again you probably just want to keep it. Finally, 
parsemeta is the metadata gathered during parsing that may be useful for 
indexing, so passing on the metadata from the original ParseResult makes sense, 
or just use the constructor that does not require it if you don't care about 
the metadata.
This should all be easier to understand if you look at what the HTML Parser 
does with each of these fields.
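
For illustration, a hedged sketch of the approach described above (not Nutch-provided code): the Configuration from setConf() is kept as a member, and each segment's ParseData reuses the status and metadata of the inbound whole-document parse. Names such as SegmentParseHelper, addSegment and segmentText are illustrative.

// Sketch only: builds one ParseText/ParseData pair per internal-link segment,
// reusing the whole-document parse for status and metadata.
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.OutlinkExtractor;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseText;

public class SegmentParseHelper {

  private Configuration conf;      // handed to the plugin by Nutch

  public void setConf(Configuration conf) {
    this.conf = conf;              // keep it as a member, pass it where needed
  }

  /** Adds one internal-link segment to the ParseResult of the whole page. */
  public void addSegment(ParseResult parseResult, String pageUrl,
      String segmentUrl, String segmentTitle, String segmentText) {

    // Parse of the whole document; its status and metadata are reused as-is.
    Parse whole = parseResult.get(pageUrl);

    // getOutlinks needs the Configuration kept in setConf().
    Outlink[] outlinks = OutlinkExtractor.getOutlinks(segmentText, conf);

    ParseData data = new ParseData(
        whole.getData().getStatus(),       // whole-document parsing succeeded
        segmentTitle,                      // title of this segment
        outlinks,                          // links found inside the segment
        whole.getData().getContentMeta(),  // pre-parse metadata, unchanged
        whole.getData().getParseMeta());   // keep, or use the 4-arg constructor

    // Key the ParseText/ParseData pair by the internal URL of the segment.
    parseResult.put(segmentUrl, new ParseText(segmentText), data);
  }
}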

> -Original Message-
> From: Yash Thenuan Thenuan 
> Sent: 06 March 2018 20:17
> To: user@nutch.apache.org
> Subject: RE: Regarding Internal Links
> 
> I am able to get the ParseText data structure.
> But I'm having trouble with ParseData, as its constructor is asking for
> parsestatus, outlinks, contentmeta and parsemeta.
> Outlinks I can get from the OutlinkExtractor, but what about the other parameters?
> And getOutlinks is asking for a Configuration, and I don't know where I
> can get that from.
> 



Re: Need Tutorial on Nutch

2018-03-06 Thread Semyon Semyonov
Here is an unpleasant truth: there is no up-to-date tutorial for Nutch. To 
make it even more interesting, the tutorial can sometimes contradict the real 
behavior of Nutch because of recently introduced features/bugs. If you find such 
cases, please try to fix them and contribute to the project.

Welcome to the open source world.

Still, my recommendations as a person who started with Nutch less than a year 
ago:
1) If you just need a simple crawl, you are in luck. Simply run the crawl script, 
or the individual steps, according to the Nutch crawl tutorial.
2) If it is a bit more complex, you start to face problems with either 
configuration or bugs. Therefore, first have a look at the Nutch list archives 
(http://nutch.apache.org/mailing_lists.html); if that doesn't work, try to figure 
it out yourself; if that doesn't work, ask here or on the developer list.
3) In most cases, you HAVE to open the code and fix/discover something. Nutch 
is a really complicated system, and to understand it properly you can easily spend 
2-3 months getting a full basic understanding of it. It gets even worse if you 
don't know Hadoop. If you don't, I recommend reading "Hadoop: The Definitive 
Guide", because, well, Nutch is Hadoop.

Here we are, no pain, no gain.
 
 



Re: Need Tutorial on Nutch

2018-03-06 Thread Yash Thenuan Thenuan
Start with Nutch 1.x if you are running into trouble. It's easier to
configure, and by following the Nutch 1.x tutorial you will be able to crawl
your first website easily.



Re: Need Tutorial on Nutch

2018-03-06 Thread Eric Valencia
Thank you kindly, Yash. Yes, I did try some of the tutorials, but
they seem to be missing the complete set of steps required to
successfully scrape with Nutch.



Re: Need Tutorial on Nutch

2018-03-06 Thread Yash Thenuan Thenuan
I would suggest starting with the documentation on Nutch's website.
You can get an idea of how to start crawling and so on.
Apart from that, there are no proper tutorials as such.
Just start crawling; if you get stuck somewhere, try to find something
related on Google and the Nutch mailing list archives.
Ask questions if nothing helps.



Need Tutorial on Nutch

2018-03-06 Thread Eric Valencia
I'm a beginner in Nutch and need the best tutorials to get started.  Can
you guys let me know how you would advise yourselves if starting today
(like me)?

Eric


RE: Regarding Internal Links

2018-03-06 Thread Yossi Tamari
You should go over each segment, and for each one produce a ParseText and a 
ParseData. This is basically what the HTML Parser does for the whole document, 
which is why I suggested you should dive into its code.
A ParseText is basically just a String containing the actual content of the 
segment (after stripping the HTML tags). This is usually the document you want 
to index.
The ParseData structure is a little more complex, but the main things it 
contains are the title of this segment, and the outlinks from the segment (for 
further crawling). Take a look at the code of both classes and it should be 
relatively clear.
Finally, you need to build one ParseResult object, with the original URL, and 
for each of the ParseText/ParseData pairs, call the put method, with the 
internal URL of the segment as the key.  




RE: Regarding Internal Links

2018-03-06 Thread Yash Thenuan Thenuan
> I am able to get the content corresponding to each internal link by
> writing a parse filter plugin. Now I am not sure how to proceed
> further. How can I parse them as separate documents, and what should
> my ParseResult filter return?


Re: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-03-06 Thread Sebastian Nagel
Hi Semyon,

> We apply a logical AND here, which is not really reasonable.

Until now there has been only a single exemption filter, so it made no difference.
But yes, it sounds plausible to change this to an OR, i.e. return true
as soon as one of the filters accepts/exempts the URL. Please open an issue
to change it.

Thanks,
Sebastian

Re: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-03-06 Thread Semyon Semyonov
I have proposed a solution for this problem: 
https://issues.apache.org/jira/browse/NUTCH-2522.

The other question is how the voting mechanism of UrlExemptionFilters should work.

UrlExemptionFilters.java : lines 60-65
//An URL is exempted when all the filters accept it to pass through
for (int i = 0; i < this.filters.length && exempted; i++) {
  exempted = this.filters[i].filter(fromUrl, toUrl);
}

We apply a logical AND here, which is not really reasonable.

I think that if one of the filters votes to exempt, then we should exempt it, 
therefore a logical OR instead.
For example, with the new filter, links such as http://www.website.com -> 
http://website.com/about can be exempted, but the standard filter will not exempt 
them because they are from different hosts. With the current logic, the URL will 
not be exempted, because of the logical AND.


Any ideas?
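
For concreteness, a small sketch of the proposed OR voting (an illustrative helper, not the actual UrlExemptionFilters code): the URL is exempted as soon as any single filter exempts it.

// Sketch of OR semantics for exemption voting: stop as soon as one filter
// exempts the link, instead of requiring all filters to agree.
import org.apache.nutch.net.URLExemptionFilter;

public class ExemptionVoting {
  /** Returns true if at least one filter exempts the link (logical OR). */
  public static boolean isExempted(URLExemptionFilter[] filters,
      String fromUrl, String toUrl) {
    boolean exempted = false;
    for (int i = 0; i < filters.length && !exempted; i++) {
      exempted = filters[i].filter(fromUrl, toUrl);
    }
    return exempted;
  }
}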

 
 

Sent: Wednesday, February 21, 2018 at 2:58 PM
From: "Sebastian Nagel" 
To: user@nutch.apache.org
Subject: Re: Internal links appear to be external in Parse. Improvement of the 
crawling quality
> 1) Do we have a config setting that we can use already?

Not out-of-the-box. But there is already an extension point for your use case 
[1]:
the filter method takes two arguments (fromURL and toURL).
Have a look at it, maybe you can fix it by implementing/contributing a plugin.

> 2) ... It looks more like same Host problem rather ...

To determine the host of a URL Nutch uses everywhere java.net.URL.getHost()
which implements RFC 1738 [2]. We cannot change Java but it would be possible
to modify URLUtil.getDomainName(...), at least, as a work-around.

> 3) Where should this problem be solved? Only in ParseOutputFormat.java or 
> somewhere else as well?

You may also want to fix it in FetcherThread.handleRedirect(...) which affects 
also your use case
of following only internal links (if db.ignore.also.redirects == true).

Best,
Sebastian


[1] 
https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/net/URLExemptionFilter.html

https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.html
[2] https://tools.ietf.org/html/rfc1738#section-3.1


On 02/21/2018 01:52 PM, Semyon Semyonov wrote:
> Hi Sebastian,
>
> If I
> - modify the method URLUtil.getDomainName(URL url)
>
> doesn't it mean that I don't need
>  - set db.ignore.external.links.mode=byDomain
>
> anymore? http://www.somewebsite.com becomes the same host as somewebsite.com.
>
>
> To make it as generic as possible I can create an issue/pull request for 
> this, but I would like to hear your suggestion about the best way to do so.
> 1) Do we have a config setting that we can use already?
> 2) The domain discussion [1] is quite wide, though. In my case I cover only one 
> issue, the mapping www -> _ . It looks more like a same-host problem rather 
> than a same-domain problem. What do you think about such host resolution?
> 3) Where should this problem be solved? Only in ParseOutputFormat.java or 
> somewhere else as well?
>
> Semyon.
>
>
>  
>
> Sent: Wednesday, February 21, 2018 at 11:51 AM
> From: "Sebastian Nagel" 
> To: user@nutch.apache.org
> Subject: Re: Internal links appear to be external in Parse. Improvement of 
> the crawling quality
> Hi Semyon,
>
>> interpret www.somewebsite.com and somewebsite.com as one host?
>
> Yes, that's a common problem. More because of external links which must
> include the host name - well-designed sites would use relative links
> for internal same-host links.
>
> For a quick work-around:
> - set db.ignore.external.links.mode=byDomain
> - modify the method URLUtil.getDomainName(URL url)
> so that it returns the hostname with www. stripped
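
As an illustration of the work-around quoted above, a minimal sketch of host resolution with a leading "www." stripped, so that http://www.example.com and http://example.com count as the same host; HostResolver and hostWithoutWww are illustrative names, not Nutch's actual URLUtil.

// Sketch: resolve a URL to its host name with any leading "www." removed.
import java.net.MalformedURLException;
import java.net.URL;

public class HostResolver {
  /** Returns the host name of a URL with a leading "www." stripped. */
  public static String hostWithoutWww(String url) throws MalformedURLException {
    String host = new URL(url).getHost().toLowerCase();
    return host.startsWith("www.") ? host.substring(4) : host;
  }

  public static void main(String[] args) throws MalformedURLException {
    // Both print "example.com", so the two URLs are grouped as one host.
    System.out.println(hostWithoutWww("http://www.example.com/about"));
    System.out.println(hostWithoutWww("http://example.com/about"));
  }
}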
>
> For a final solution we could make it configurable
> which method or class is called. Since the definition of "domain"
> is somewhat debatable [1], we could even provide alternative
> implementations.
>
>> PS. For me it is not really clear how ProtocolResolver works.
>
> It's only a heuristic to avoid duplicates by protocol (http and https).
> If you care about duplicates and cannot get rid of them afterwards by a 
> deduplication job,
> you may have a look at urlnormalizer-protocol and NUTCH-2447.
>
> Best,
> Sebastian
>
>
> [1] https://github.com/google/guava/wiki/InternetDomainNameExplained
>
> On 02/21/2018 10:44 AM, Semyon Semyonov wrote:
>> Thanks Yossi, Markus,
>>
>> I have an issue with the db.ignore.external.links.mode=byDomain solution.
>>
>> I crawl specific hosts only therefore I have a finite number of hosts to 
>> crawl.
>> Lets say,