Re: Internal links appear to be external in Parse. Improvement of the crawling quality
I found out that there is no direct way to do it; the problem was solved by calling the regex transformation one more time in IndexerMapReduce, before the Indexer gets the Doc for writing. Something like (IndexerMapReduce.java, line 369):

  doc.add("modifiedId", URLUtil.getHost(BidirectionalUrlExemptionFilter.transform(key.toString())));

Sent: Friday, March 16, 2018 at 7:20 PM
From: "Semyon Semyonov"
To: user@nutch.apache.org
Subject: Re: Internal links appear to be external in Parse. Improvement of the crawling quality
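The work-around described above can be sketched as follows. This is a minimal, self-contained illustration only: the class and method names (BidirectionalUrlExemptionFilter.transform) come from the mail text and are not part of stock Nutch, and the regex shown is an assumed stand-in for Semyon's transformation.

```java
import java.net.MalformedURLException;
import java.net.URL;

class ModifiedIdDemo {

    // Stand-in for the regex transformation mentioned in the thread:
    // strip a leading "www." so both host variants share one canonical form.
    static String transform(String url) {
        return url.replaceFirst("^(https?://)www\\.", "$1");
    }

    // Host extraction as Nutch does it, via java.net.URL.getHost().
    static String getHost(String url) {
        try {
            return new URL(url).getHost();
        } catch (MalformedURLException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        // Both pages now map to the same parent/host key for the indexer output.
        System.out.println(getHost(transform("http://www.website.com/page1"))); // website.com
        System.out.println(getHost(transform("http://website.com/page2")));     // website.com
    }
}
```

Applying the transform before taking the host yields one shared "modifiedId" for www and non-www variants, which restores a single parent per site in the indexer output.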
Re: Internal links appear to be external in Parse. Improvement of the crawling quality
Hi again,

Another issue has appeared with the introduction of the bidirectional URL exemption filter.

Having http://www.website.com/page1 and http://website.com/page2: before, as indexer output (let's say a text file), I had one parent/host (www.website.com) with children/pages (http://www.website.com/page1, http://www.website.com/, ...). Now I have two different hosts and therefore two different parents in my output. I would prefer to have the same hostname/alias for both hosts.

I checked the URL exemption filters and they don't allow adding metadata to the parsed data. Therefore, two questions:
1) What is the best way to do it?
2) Should I include it in the Nutch code, or is it not needed upstream and I should make a quick fix for myself?

Semyon.

Sent: Tuesday, March 06, 2018 at 11:08 AM
From: "Sebastian Nagel"
To: user@nutch.apache.org
Subject: Re: Internal links appear to be external in Parse. Improvement of the crawling quality
Re: Internal links appear to be external in Parse. Improvement of the crawling quality
Hi Semyon,

> We apply logical AND here, which is not really reasonable here.

Until now there was only a single exemption filter, so it made no difference. But yes, it sounds plausible to change this to an OR, i.e. return true as soon as one of the filters accepts/exempts the URL. Please open an issue to change it.

Thanks,
Sebastian

On 03/06/2018 10:28 AM, Semyon Semyonov wrote:
Re: Internal links appear to be external in Parse. Improvement of the crawling quality
I have proposed a solution for this problem: https://issues.apache.org/jira/browse/NUTCH-2522.

The other question is how the voting mechanism of UrlExemptionFilters should work.

UrlExemptionFilters.java, lines 60-65:

  // An URL is exempted when all the filters accept it to pass through
  for (int i = 0; i < this.filters.length && exempted; i++) {
    exempted = this.filters[i].filter(fromUrl, toUrl);
  }

We apply logical AND here, which is not really reasonable. I think if one of the filters votes for exempt then we should exempt it, therefore logical OR instead. For example, with the new filter, links such as http://www.website.com -> http://website.com/about can be exempted, but the standard filter will not exempt them because they are from different hosts. With the current logic the URL will not be exempted, because of the logical AND.

Any ideas?

Sent: Wednesday, February 21, 2018 at 2:58 PM
From: "Sebastian Nagel"
To: user@nutch.apache.org
Subject: Re: Internal links appear to be external in Parse. Improvement of the crawling quality
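The AND-vs-OR difference discussed in this message can be shown with a small self-contained sketch. The nested URLExemptionFilter interface below is a local stand-in for Nutch's org.apache.nutch.net.URLExemptionFilter extension point, not the real class; the loop bodies mirror the quoted UrlExemptionFilters code.

```java
class ExemptionVotingDemo {

    // Local stand-in for org.apache.nutch.net.URLExemptionFilter.
    interface URLExemptionFilter {
        boolean filter(String fromUrl, String toUrl);
    }

    // Current behavior (logical AND): exempted only if every filter agrees.
    static boolean exemptAll(URLExemptionFilter[] filters, String from, String to) {
        boolean exempted = filters.length > 0;
        for (int i = 0; i < filters.length && exempted; i++) {
            exempted = filters[i].filter(from, to);
        }
        return exempted;
    }

    // Proposed behavior (logical OR): exempted as soon as one filter accepts.
    static boolean exemptAny(URLExemptionFilter[] filters, String from, String to) {
        boolean exempted = false;
        for (int i = 0; i < filters.length && !exempted; i++) {
            exempted = filters[i].filter(from, to);
        }
        return exempted;
    }

    public static void main(String[] args) {
        URLExemptionFilter accepts = (f, t) -> true;  // e.g. the new bidirectional filter
        URLExemptionFilter rejects = (f, t) -> false; // e.g. a filter for a different rule
        URLExemptionFilter[] filters = { rejects, accepts };

        System.out.println(exemptAll(filters, "a", "b")); // false: AND suppresses the exemption
        System.out.println(exemptAny(filters, "a", "b")); // true: OR lets one vote suffice
    }
}
```

With two registered filters where only one exempts the URL, AND rejects while OR accepts, which is exactly the case described for the www/non-www link.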
Re: Internal links appear to be external in Parse. Improvement of the crawling quality
> 1) Do we have a config setting that we can use already?

Not out-of-the-box. But there is already an extension point for your use case [1]: the filter method takes two arguments (fromURL and toURL). Have a look at it; maybe you can fix it by implementing/contributing a plugin.

> 2) ... It looks more like same Host problem rather ...

To determine the host of a URL, Nutch everywhere uses java.net.URL.getHost(), which implements RFC 1738 [2]. We cannot change Java, but it would be possible to modify URLUtil.getDomainName(...), at least as a work-around.

> 3) Where this problem should be solved? Only in ParseOutputFormat.java or
> somewhere else as well?

You may also want to fix it in FetcherThread.handleRedirect(...), which also affects your use case of following only internal links (if db.ignore.also.redirects == true).

Best,
Sebastian

[1] https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/net/URLExemptionFilter.html
    https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.html
[2] https://tools.ietf.org/html/rfc1738#section-3.1

On 02/21/2018 01:52 PM, Semyon Semyonov wrote:
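A plugin built on the extension point mentioned above could decide exemption by comparing hosts with a leading "www." ignored. The sketch below is hypothetical: the nested interface is a local stand-in for org.apache.nutch.net.URLExemptionFilter (whose real implementation also deals with plugin configuration), and the rule itself is just one plausible policy.

```java
import java.net.MalformedURLException;
import java.net.URL;

class WwwExemptionFilterDemo {

    // Local stand-in for org.apache.nutch.net.URLExemptionFilter.
    interface URLExemptionFilter {
        boolean filter(String fromUrl, String toUrl);
    }

    // Host name with any leading "www." removed; null for unparsable URLs.
    static String hostSansWww(String url) {
        try {
            String host = new URL(url).getHost();
            return host.startsWith("www.") ? host.substring(4) : host;
        } catch (MalformedURLException e) {
            return null;
        }
    }

    // Exempt an outlink when source and target hosts differ only by "www.".
    static final URLExemptionFilter WWW_EXEMPT = (from, to) -> {
        String a = hostSansWww(from);
        String b = hostSansWww(to);
        return a != null && a.equals(b);
    };

    public static void main(String[] args) {
        // www.wincs.be -> wincs.be is treated as internal (exempted).
        System.out.println(WWW_EXEMPT.filter("http://www.wincs.be/", "http://wincs.be/lakindustrie.html"));
        // A genuinely different host is not exempted.
        System.out.println(WWW_EXEMPT.filter("http://www.wincs.be/", "http://example.com/"));
    }
}
```

Because the filter sees both fromURL and toURL, it can exempt the www/non-www pair without touching how Nutch computes hosts elsewhere.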
Re: Internal links appear to be external in Parse. Improvement of the crawling quality
Hi Sebastian,

If I
- modify the method URLUtil.getDomainName(URL url)

doesn't it mean that I don't need
- set db.ignore.external.links.mode=byDomain

anymore? http://www.somewebsite.com becomes the same host as somewebsite.com.

To make it as generic as possible I can create an issue/pull request for this, but I would like to hear your suggestion about the best way to do so.
1) Do we have a config setting that we can use already?
2) The domain discussion [1] is quite wide, though. In my case I cover only one issue, the mapping www -> _. It looks more like a same-Host problem than a same-Domain problem. What do you think about such host resolution?
3) Where should this problem be solved? Only in ParseOutputFormat.java or somewhere else as well?

Semyon.

Sent: Wednesday, February 21, 2018 at 11:51 AM
From: "Sebastian Nagel"
To: user@nutch.apache.org
Subject: Re: Internal links appear to be external in Parse. Improvement of the crawling quality
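For reference, the link-scoping settings discussed throughout the thread live in nutch-site.xml. A minimal sketch, with illustrative values matching Semyon's setup (stay on the seed host, follow no external links):

```xml
<!-- nutch-site.xml fragment; values are illustrative, not recommendations -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>db.ignore.external.links.mode</name>
  <value>byHost</value> <!-- byDomain would also admit subdomains -->
</property>
```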
Re: Internal links appear to be external in Parse. Improvement of the crawling quality
Hi Semyon, > interpret www.somewebsite.com and somewhebsite.com as one host? Yes, that's a common problem. More because of external links which must include the host name - well-designed sites would use relative links for internal same-host links. For a quick work-around: - set db.ignore.external.links.mode=byDomain - modify the method URLUtil.getDomainName(URL url) so that it returns the hostname with www. stripped For a final solution we could make it configurable which method or class is called. Since the definition of "domain" is somewhat debatable [1], we could even provide alternative implementations. > PS. For me it is not really clear how ProtocolResolver works. It's only a heuristics to avoid duplicates by protocol (http and https). If you care about duplicates and cannot get rid of them afterwards by a deduplication job, you may have a look at urlnormalizer-protocol and NUTCH-2447. Best, Sebastian [1] https://github.com/google/guava/wiki/InternetDomainNameExplained On 02/21/2018 10:44 AM, Semyon Semyonov wrote: > Thanks Yossi, Markus, > > I have an issue with the db.ignore.external.links.mode=byDomain solution. > > I crawl specific hosts only therefore I have a finite number of hosts to > crawl. > Lets say, www.somewebsite.com > > I want to stay limited with this host. In other words, neither > www.art.somewebsite.com nor www.sport.somewebsite.com. > That's why db.ignore.external.links.mode=byHost and db.ignore.external = > true(no external websites). > > Although, I want to get the links that seem to belong to the same > host(www.somewebsite.com -> somewebsite.com/games, without www). > The question is shouldn't we include it as a default behavior(or configured > behavior) in Nutch and interpret www.somewebsite.com and somewhebsite.com as > one host? > > > > PS. For me it is not really clear how ProtocolResolver works. 
> > Semyon > > > > > Sent: Tuesday, February 20, 2018 at 9:40 PM > From: "Markus Jelsma"> To: "user@nutch.apache.org" > Subject: RE: Internal links appear to be external in Parse. Improvement of > the crawling quality > Hello Semyon, > > Yossi is right, you can use the db.ignore.* set of directives to resolve the > problem. > > Regarding protocol, you can use urlnormalizer-protocol to set up per host > rules. This is, of course, a tedious job if you operate a crawl on an > indefinite amount of hosts, so use the uncommitted ProtocolResolver for that > to do it for you. > > See: https://issues.apache.org/jira/browse/NUTCH-2247 > > If i remember it tomorrow afternoon, i can probably schedule some time to > work on it the coming seven days or so, and commit. > > Regards, > Markus > > -Original message- >> From:Yossi Tamari >> Sent: Tuesday 20th February 2018 21:06 >> To: user@nutch.apache.org >> Subject: RE: Internal links appear to be external in Parse. Improvement of >> the crawling quality >> >> Hi Semyon, >> >> Wouldn't setting db.ignore.external.links.mode=byDomain solve your wincs.be >> issue? >> As far as I can see the protocol (HTTP/HTTPS) does not play any part in the >> decision if this is the same domain. >> >> Yossi. >> >>> -Original Message- >>> From: Semyon Semyonov [mailto:semyon.semyo...@mail.com] >>> Sent: 20 February 2018 20:43 >>> To: usernutch.apache.org >>> Subject: Internal links appear to be external in Parse. Improvement of the >>> crawling quality >>> >>> Dear All, >>> >>> I'm trying to increase quality of the crawling. A part of my database has >>> DB_FETCHED = 1. >>> >>> Example, http://www.wincs.be/[http://www.wincs.be/] in seed list. >>> >>> The root of the problem is in Parse/ParseOutputFormat.java line 364 - 374 >>> >>> Nutch considers one of the >>> link(http://wincs.be/lakindustrie.html[http://wincs.be/lakindustrie.html]) >>> as external >>> and therefore reject it. 
>>>
>>> If I insert http://wincs.be in the seed file, everything works fine.
>>>
>>> Do you think this is a good behavior? I mean, formally it is indeed two different
>>> domains, but from the user's perspective it is exactly the same.
>>>
>>> And if it is the default behavior, how can I fix it for my case? The same
>>> question for a similar switch http -> https, etc.
>>>
>>> Thanks.
>>>
>>> Semyon.
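A minimal sketch of the quick work-around Sebastian describes above - stripping a leading "www." from the host name so that www.somewebsite.com and somewebsite.com compare as equal. The class and method names here are hypothetical; this is not the actual URLUtil.getDomainName() code, only the idea behind modifying it:

```java
import java.net.URL;

// Hypothetical illustration of the work-around: return the host name
// with a leading "www." removed, so www-prefixed and bare hosts match.
// NOT the real org.apache.nutch.util.URLUtil implementation.
public class WwwStripper {

    public static String normalizedHost(URL url) {
        String host = url.getHost();
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    public static void main(String[] args) throws Exception {
        // Both print "somewebsite.com"
        System.out.println(normalizedHost(new URL("http://www.somewebsite.com/page1")));
        System.out.println(normalizedHost(new URL("http://somewebsite.com/games")));
    }
}
```

Note that a blanket strip like this would also merge subdomains such as www.art.somewebsite.com into art.somewebsite.com, which is why Sebastian suggests making the behavior configurable rather than hard-coding it.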
Re: RE: Internal links appear to be external in Parse. Improvement of the crawling quality
Thanks Yossi, Markus,

I have an issue with the db.ignore.external.links.mode=byDomain solution.

I crawl specific hosts only, therefore I have a finite number of hosts to crawl. Let's say, www.somewebsite.com

I want to stay limited to this host. In other words, neither www.art.somewebsite.com nor www.sport.somewebsite.com. That's why db.ignore.external.links.mode=byHost and db.ignore.external = true (no external websites).

Although, I want to get the links that seem to belong to the same host (www.somewebsite.com -> somewebsite.com/games, without www). The question is: shouldn't we include this as a default (or configurable) behavior in Nutch and interpret www.somewebsite.com and somewebsite.com as one host?

PS. For me it is not really clear how ProtocolResolver works.

Semyon

Sent: Tuesday, February 20, 2018 at 9:40 PM
From: "Markus Jelsma"
To: "user@nutch.apache.org"
Subject: RE: Internal links appear to be external in Parse. Improvement of the crawling quality

Hello Semyon,

Yossi is right, you can use the db.ignore.* set of directives to resolve the problem.

Regarding protocol, you can use urlnormalizer-protocol to set up per-host rules. This is, of course, a tedious job if you operate a crawl on an indefinite number of hosts, so use the uncommitted ProtocolResolver to do it for you.

See: https://issues.apache.org/jira/browse/NUTCH-2247

If I remember it tomorrow afternoon, I can probably schedule some time to work on it in the coming seven days or so, and commit.

Regards,
Markus

-Original message-
> From: Yossi Tamari
> Sent: Tuesday 20th February 2018 21:06
> To: user@nutch.apache.org
> Subject: RE: Internal links appear to be external in Parse. Improvement of the crawling quality
>
> Hi Semyon,
>
> Wouldn't setting db.ignore.external.links.mode=byDomain solve your wincs.be issue?
> As far as I can see, the protocol (HTTP/HTTPS) does not play any part in the
> decision whether this is the same domain.
>
> Yossi.
> > -Original Message-
> > From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
> > Sent: 20 February 2018 20:43
> > To: user@nutch.apache.org
> > Subject: Internal links appear to be external in Parse. Improvement of the crawling quality
> >
> > Dear All,
> >
> > I'm trying to increase the quality of the crawling. A part of my database has
> > DB_FETCHED = 1.
> >
> > Example: http://www.wincs.be/ in the seed list.
> >
> > The root of the problem is in ParseOutputFormat.java, lines 364-374.
> >
> > Nutch considers one of the links (http://wincs.be/lakindustrie.html) as external
> > and therefore rejects it.
> >
> > If I insert http://wincs.be in the seed file, everything works fine.
> >
> > Do you think this is a good behavior? I mean, formally it is indeed two different
> > domains, but from the user's perspective it is exactly the same.
> >
> > And if it is the default behavior, how can I fix it for my case? The same
> > question for a similar switch http -> https, etc.
> >
> > Thanks.
> >
> > Semyon.
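For reference, the db.ignore.* setup Semyon describes above would be expressed in nutch-site.xml roughly as follows. This is a sketch of his stated configuration, not a recommendation; check the property names and defaults against your Nutch version's nutch-default.xml:

```xml
<!-- Sketch of the setup described in the message above:
     restrict the crawl to the exact seed hosts. -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>db.ignore.external.links.mode</name>
  <value>byHost</value>
</property>
```

With mode byHost, an outlink from www.somewebsite.com to somewebsite.com is dropped as external; with byDomain it would be kept, but so would links to art.somewebsite.com, which is exactly the trade-off discussed in this thread.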
RE: Internal links appear to be external in Parse. Improvement of the crawling quality
Hi Semyon,

Wouldn't setting db.ignore.external.links.mode=byDomain solve your wincs.be issue? As far as I can see, the protocol (HTTP/HTTPS) does not play any part in the decision whether this is the same domain.

Yossi.

> -Original Message-
> From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
> Sent: 20 February 2018 20:43
> To: user@nutch.apache.org
> Subject: Internal links appear to be external in Parse. Improvement of the crawling quality
>
> Dear All,
>
> I'm trying to increase the quality of the crawling. A part of my database has
> DB_FETCHED = 1.
>
> Example: http://www.wincs.be/ in the seed list.
>
> The root of the problem is in ParseOutputFormat.java, lines 364-374.
>
> Nutch considers one of the links (http://wincs.be/lakindustrie.html) as external
> and therefore rejects it.
>
> If I insert http://wincs.be in the seed file, everything works fine.
>
> Do you think this is a good behavior? I mean, formally it is indeed two different
> domains, but from the user's perspective it is exactly the same.
>
> And if it is the default behavior, how can I fix it for my case? The same
> question for a similar switch http -> https, etc.
>
> Thanks.
>
> Semyon.
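Yossi's point that the protocol plays no part can be checked directly: java.net.URL.getHost(), which Nutch uses for the comparison, ignores the scheme entirely, so http vs. https never affects whether two URLs count as the same host or domain:

```java
import java.net.URL;

// Demonstration that getHost() drops the scheme: the http and https
// variants of the same URL yield an identical host string.
public class SchemeIgnored {
    public static void main(String[] args) throws Exception {
        // Both print "wincs.be"
        System.out.println(new URL("http://wincs.be/").getHost());
        System.out.println(new URL("https://wincs.be/").getHost());
    }
}
```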
Internal links appear to be external in Parse. Improvement of the crawling quality
Dear All,

I'm trying to increase the quality of the crawling. A part of my database has DB_FETCHED = 1.

Example: http://www.wincs.be/ in the seed list.

The root of the problem is in ParseOutputFormat.java, lines 364-374.

Nutch considers one of the links (http://wincs.be/lakindustrie.html) as external and therefore rejects it.

If I insert http://wincs.be in the seed file, everything works fine.

Do you think this is a good behavior? I mean, formally it is indeed two different domains, but from the user's perspective it is exactly the same.

And if it is the default behavior, how can I fix it for my case? The same question for a similar switch http -> https, etc.

Thanks.

Semyon.
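The behavior described above can be reproduced outside Nutch. Host comparison ultimately goes through java.net.URL.getHost(), which sees www.wincs.be and wincs.be as two different hosts, so with external links ignored by host the outlink is rejected. The helper below is a sketch of that comparison, not the actual ParseOutputFormat code:

```java
import java.net.URL;

// Minimal reproduction of the host comparison that causes the issue:
// www.wincs.be and wincs.be are distinct per java.net.URL.getHost().
public class HostCheck {

    // True if both URLs share exactly the same host string.
    public static boolean sameHost(String a, String b) throws Exception {
        return new URL(a).getHost().equals(new URL(b).getHost());
    }

    public static void main(String[] args) throws Exception {
        // Prints "false" - the outlink is treated as external.
        System.out.println(sameHost("http://www.wincs.be/",
                                    "http://wincs.be/lakindustrie.html"));
    }
}
```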