[jira] [Commented] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver
[ https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735780#comment-16735780 ]

Sebastian Nagel commented on NUTCH-2676:
----------------------------------------
Great! Thanks!

> Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver
> --------------------------------------------------------------------------------------------------------------
>                 Key: NUTCH-2676
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2676
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: 1.15
>            Reporter: Stas Batururimi
>            Priority: Major
>             Fix For: 1.16
>
>         Attachments: Screenshot 2018-11-16 at 18.15.52.png
>
> * Selenium needs to be updated
> * missing remote web driver for chrome
> * necessity to add headless mode for both remote WebDriverBase Firefox & Chrome
> * use case with Selenium grid using docker (1 hub docker container, several nodes in different docker containers, Nutch in another docker container, streaming to Apache Solr in a docker container, that is at least 4 different docker containers)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (NUTCH-2666) Increase default value for http.content.limit / ftp.content.limit / file.content.limit
[ https://issues.apache.org/jira/browse/NUTCH-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2666:
-----------------------------------
    Fix Version/s: 1.16

> Increase default value for http.content.limit / ftp.content.limit / file.content.limit
> ---------------------------------------------------------------------------------------
>                 Key: NUTCH-2666
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2666
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.15
>            Reporter: Marco Ebbinghaus
>            Priority: Minor
>             Fix For: 1.16
>
> The default value for http.content.limit in nutch-default.xml ("The length limit for downloaded content using the http:// protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the file.content.limit setting.") is set to 64 kB. Maybe this default value should be increased, as many pages today are greater than 64 kB.
> This fact hit me when trying to crawl a single website whose pages are much greater than 64 kB: with every crawl cycle the count of db_unfetched URLs decreased until it hit zero and the crawler became inactive (because the first 64 kB always contained the same set of navigation links).
> The description might also be updated, as this is not only the case for the http protocol but also for https.
[jira] [Commented] (NUTCH-2666) Increase default value for http.content.limit / ftp.content.limit / file.content.limit
[ https://issues.apache.org/jira/browse/NUTCH-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735709#comment-16735709 ]

ASF GitHub Bot commented on NUTCH-2666:
---------------------------------------
sebastian-nagel commented on pull request #427: NUTCH-2666 Increase default value for http.content.limit / ftp.content.limit / file.content.limit
URL: https://github.com/apache/nutch/pull/427

Increase the default content limit from 64 kB to 1024 kB for http, ftp and file protocol plugins.

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (NUTCH-2666) Increase default value for http.content.limit / ftp.content.limit / file.content.limit
[ https://issues.apache.org/jira/browse/NUTCH-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2666:
-----------------------------------
    Summary: Increase default value for http.content.limit / ftp.content.limit / file.content.limit  (was: increase default value for http.content.limit)
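The change discussed above can also be applied locally without waiting for a release: properties set in conf/nutch-site.xml override nutch-default.xml. A minimal sketch, using the 1024 kB (1048576 bytes) value proposed in PR #427 (adjust to your crawl's needs; a negative value disables truncation entirely):

```
<!-- conf/nutch-site.xml: raise the per-protocol content limits.
     1048576 bytes = 1024 kB, the new default proposed in PR #427. -->
<property>
  <name>http.content.limit</name>
  <value>1048576</value>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>1048576</value>
</property>
<property>
  <name>file.content.limit</name>
  <value>1048576</value>
</property>
```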
[jira] [Commented] (NUTCH-2673) EOFException protocol-http
[ https://issues.apache.org/jira/browse/NUTCH-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735692#comment-16735692 ]

Markus Jelsma commented on NUTCH-2673:
--------------------------------------
Yes, thanks Sebastian!

> EOFException protocol-http
> --------------------------
>                 Key: NUTCH-2673
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2673
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.14
>            Reporter: Markus Jelsma
>            Priority: Major
>             Fix For: 1.16
>
> Got an EOFException for some URL:
> {code}
> 2018-11-07 12:23:18,463 INFO indexer.IndexingFiltersChecker - fetching: https://www.misdaadjournalist.nl/2018/11/politie-kraakt-server-van-blackbox-265-000-criminele-berichten-onderschept/
> 2018-11-07 12:23:18,704 INFO protocol.RobotRulesParser - robots.txt whitelist not configured.
> 2018-11-07 12:23:18,704 INFO http.Http - http.proxy.host = null
> 2018-11-07 12:23:18,704 INFO http.Http - http.proxy.port = 8080
> 2018-11-07 12:23:18,704 INFO http.Http - http.proxy.exception.list = false
> 2018-11-07 12:23:18,704 INFO http.Http - http.timeout = 3
> 2018-11-07 12:23:18,704 INFO http.Http - http.content.limit = 32554432
> 2018-11-07 12:23:18,704 INFO http.Http - http.agent = Mozilla/5.0 (compatible; OpenindexSpider; +https://www.openindex.io/saas/about-our-spider/)
> 2018-11-07 12:23:18,704 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> 2018-11-07 12:23:18,704 INFO http.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 2018-11-07 12:23:18,704 INFO http.Http - http.enable.cookie.header = false
> 2018-11-07 12:23:18,911 ERROR http.Http - Failed to get protocol output
> java.io.EOFException
>         at org.apache.nutch.protocol.http.HttpResponse.readLine(HttpResponse.java:591)
>         at org.apache.nutch.protocol.http.HttpResponse.parseStatusLine(HttpResponse.java:482)
>         at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:249)
>         at org.apache.nutch.protocol.http.Http.getResponse(Http.java:72)
>         at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:276)
>         at org.apache.nutch.indexer.IndexingFiltersChecker.getProtocolOutput(IndexingFiltersChecker.java:270)
>         at org.apache.nutch.indexer.IndexingFiltersChecker.process(IndexingFiltersChecker.java:141)
>         at org.apache.nutch.util.AbstractChecker.processSingle(AbstractChecker.java:86)
>         at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:111)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:275)
> {code}
[jira] [Closed] (NUTCH-2673) EOFException protocol-http
[ https://issues.apache.org/jira/browse/NUTCH-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-2673.
--------------------------------
    Resolution: Not A Problem
[jira] [Commented] (NUTCH-2680) Documentation: https supported by multiple protocol plugins not only httpclient
[ https://issues.apache.org/jira/browse/NUTCH-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735686#comment-16735686 ]

ASF GitHub Bot commented on NUTCH-2680:
---------------------------------------
sebastian-nagel commented on pull request #426: NUTCH-2680 Documentation: https supported by multiple protocol plugins not only httpclient
URL: https://github.com/apache/nutch/pull/426

Improve description of property plugin.includes:
- https is supported by default
- no need to enable the stub plugin nutch-extensionpoints

> Documentation: https supported by multiple protocol plugins not only httpclient
> --------------------------------------------------------------------------------
>                 Key: NUTCH-2680
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2680
>             Project: Nutch
>          Issue Type: Bug
>          Components: documentation, plugin
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Priority: Trivial
>             Fix For: 1.16
>
> nutch-default.xml still states:
> ??In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.??
> Now https is supported by most protocol plugins and there is no need to activate protocol-httpclient to fetch https:// pages.
[jira] [Commented] (NUTCH-2683) DeduplicationJob: add option to prefer https:// over http://
[ https://issues.apache.org/jira/browse/NUTCH-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735682#comment-16735682 ]

ASF GitHub Bot commented on NUTCH-2683:
---------------------------------------
sebastian-nagel commented on pull request #425: NUTCH-2683 DeduplicationJob: add option to prefer https:// over http://
URL: https://github.com/apache/nutch/pull/425

- add optional value "httpsOverHttp" to the -compareOrder argument to prefer https:// over http:// if it comes before "urlLength" and neither "score" nor "fetchTime" takes precedence
- code improvements: remove nested loop, sort imports, add `@Override` annotations where applicable

Testing with one pair of https/http duplicates:

```
% cat seeds.txt
http://nutch.apache.org/
https://nutch.apache.org/
% nutch inject crawldb seeds.txt
...
% nutch generate crawldb/ segments
...
% nutch fetch segments/*
...
% nutch parse segments/*
...
% nutch updatedb crawldb/ segments/*
...
% nutch dedup crawldb -compareOrder httpsOverHttp,score,urlLength,fetchTime
...
Deduplication: 1 documents marked as duplicates
...
% nutch readdb crawldb/ -url https://nutch.apache.org/
URL: https://nutch.apache.org/
Version: 7
Status: 2 (db_fetched)
Fetch time: Wed Feb 06 11:55:33 CET 2019
Modified time: Mon Jan 07 11:55:33 CET 2019
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.181
Signature: da0ffbf19768ea2cab9ffa0fb4a778a7
Metadata:
...
% nutch readdb crawldb/ -url http://nutch.apache.org/
URL: http://nutch.apache.org/
Version: 7
Status: 7 (db_duplicate)
Fetch time: Wed Feb 06 11:55:39 CET 2019
Modified time: Mon Jan 07 11:55:39 CET 2019
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.181
Signature: da0ffbf19768ea2cab9ffa0fb4a778a7
Metadata:
...
```

The URL `https://nutch.apache.org/` is kept as expected if "httpsOverHttp" is configured.

> DeduplicationJob: add option to prefer https:// over http://
> -------------------------------------------------------------
>                 Key: NUTCH-2683
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2683
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.16
>
> The deduplication job allows keeping the shortest URL as the "best" URL of a set of duplicates, marking all longer ones as duplicates. Recently search engines started to penalize non-https pages by [giving https pages a higher rank|https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html] and [marking http as insecure|https://www.blog.google/products/chrome/milestone-chrome-security-marking-http-not-secure/].
> If URLs are identical except for the protocol, the deduplication job should be able to prefer https:// over http:// URLs, although the latter are shorter by one character. Of course, this should be configurable and in addition to the existing preferences (length, score and fetch time) used to select the "best" URL among duplicates.
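The selection order described above can be sketched independently of Nutch's actual DeduplicationJob code. The class and method names below are illustrative, not Nutch's API; the sketch only shows the idea of ranking https:// above http:// before falling back to URL length:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/**
 * Illustrative sketch of a "httpsOverHttp,urlLength" compare order for
 * picking the "best" URL among duplicates (NUTCH-2683). This is NOT
 * Nutch's DeduplicationJob implementation, just the idea behind it.
 */
public class DedupOrderSketch {

    // Lower rank = preferred. https:// URLs rank before http:// ones.
    static int protocolRank(String url) {
        if (url.startsWith("https://")) return 0;
        if (url.startsWith("http://")) return 1;
        return 2;
    }

    // First prefer https over http, then prefer the shorter URL.
    static final Comparator<String> BEST_URL = Comparator
            .<String>comparingInt(DedupOrderSketch::protocolRank)
            .thenComparingInt(String::length);

    // Returns the URL to keep; all others would be marked db_duplicate.
    static String keep(List<String> duplicates) {
        return duplicates.stream().min(BEST_URL)
                .orElseThrow(IllegalArgumentException::new);
    }

    public static void main(String[] args) {
        List<String> dups = Arrays.asList(
                "http://nutch.apache.org/",
                "https://nutch.apache.org/");
        // The https URL wins even though it is one character longer.
        System.out.println(keep(dups)); // prints "https://nutch.apache.org/"
    }
}
```

In the real job the "score" and "fetchTime" criteria can come before these two in the -compareOrder list; the sketch omits them for brevity.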
[jira] [Comment Edited] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver
[ https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735656#comment-16735656 ]

Stas Batururimi edited comment on NUTCH-2676 at 1/7/19 10:48 AM:
-----------------------------------------------------------------
[~wastl-nagel] Hi. Yes. I will provide it soon, somewhere between Jan 9 - Jan 13.

was (Author: virt):
[~wastl-nagel] Hi. Yes. I will provide it soo, somewhere between Jan 9 - Jan 13.
[jira] [Commented] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver
[ https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735656#comment-16735656 ]

Stas Batururimi commented on NUTCH-2676:
----------------------------------------
[~wastl-nagel] Hi. Yes. I will provide it soo, somewhere between Jan 9 - Jan 13.
[jira] [Resolved] (NUTCH-2670) org.apache.nutch.indexer.IndexerMapReduce does not read the value of "indexer.delete" from nutch-site.xml
[ https://issues.apache.org/jira/browse/NUTCH-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-2670.
------------------------------------
    Resolution: Not A Problem

Thanks for the feedback, [~aquaticwater]!

> org.apache.nutch.indexer.IndexerMapReduce does not read the value of "indexer.delete" from nutch-site.xml
> ----------------------------------------------------------------------------------------------------------
>                 Key: NUTCH-2670
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2670
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.14, 1.15
>        Environment: macOS Mojave and High Sierra
>                     MacBook Pro (Retina, 13-inch, Mid 2014)
>                     Oracle Java 1.8.0_144-b01 and previous versions
>            Reporter: Junqiang Zhang
>            Priority: Minor
>
> Inside org.apache.nutch.indexer.IndexerMapReduce.IndexerReducer, the setup() function should read the value of "indexer.delete" from nutch-site.xml and assign the value to the variable "delete". See the following line of code:
> (line 201) delete = conf.getBoolean(INDEXER_DELETE, false);
> However, the value of "indexer.delete" set in nutch-site.xml and nutch-default.xml is not assigned to the variable "delete". I put the following setting in one of nutch-site.xml and nutch-default.xml, or in both of them. The variable "delete" remains false.
> <property>
>   <name>indexer.delete</name>
>   <value>true</value>
>   <description>Whether the indexer will delete documents GONE or REDIRECTS by indexing filters</description>
> </property>
> I also changed the line of code to
> delete = conf.getBoolean(INDEXER_DELETE, true);
> Whatever value of "indexer.delete" is set in nutch-site.xml or nutch-default.xml, the value of "delete" remains false.
[jira] [Commented] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver
[ https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735629#comment-16735629 ]

Sebastian Nagel commented on NUTCH-2676:
----------------------------------------
Hi [~virt], was the upgrade of Selenium successful? If yes, could you provide a patch?
[jira] [Commented] (NUTCH-2673) EOFException protocol-http
[ https://issues.apache.org/jira/browse/NUTCH-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735626#comment-16735626 ]

Sebastian Nagel commented on NUTCH-2673:
----------------------------------------
[~markus17], can we close this?
[jira] [Updated] (NUTCH-2395) Cannot run job worker! - error while running multiple crawling jobs in parallel
[ https://issues.apache.org/jira/browse/NUTCH-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2395:
-----------------------------------
    Affects Version/s: (was: 1.14)

> Cannot run job worker! - error while running multiple crawling jobs in parallel
> --------------------------------------------------------------------------------
>                 Key: NUTCH-2395
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2395
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator, nutch server
>    Affects Versions: 2.3.1
>        Environment: Ubuntu 16.04 64-bit
>                     Oracle Java 8 64-bit
>                     Nutch 2.3.1 (standalone deployment)
>                     MongoDB 3.4
>            Reporter: Vyacheslav Pascarel
>            Priority: Major
>             Fix For: 2.4, 1.16
>
> My application is trying to execute multiple Nutch jobs in parallel using Nutch REST services. The application injects a seed URL and then repeats the GENERATE/FETCH/PARSE/UPDATEDB sequence a requested number of times to emulate continuous crawling (each step in the sequence is executed upon successful completion of the previous step, then the whole sequence is repeated again). Here is a brief description of the jobs:
> * Number of parallel jobs: 7
> * Each job has a unique crawl id and MongoDB collection
> * Seed URL for all jobs: http://www.cnn.com
> * Regex URL filters for all jobs:
> ** *"-^.\{1000,\}$"* - exclude very long URLs
> ** *"+."* - include the rest
> The jobs are started as expected but at some point some of them fail with a "Cannot run job worker!" error. For more details see the job status and hadoop.log lines below.
> In the debugger during the crash I noticed that a single instance of SelectorEntryComparator (its definition is nested in GeneratorJob) is shared across multiple reducer tasks. The class inherits from org.apache.hadoop.io.WritableComparator, which has a few members unprotected against concurrent usage. At some point multiple threads may access those members in a WritableComparator.compare call. I modified SelectorEntryComparator and it seems to have solved the problem, but I am not sure if the change is appropriate and/or sufficient (covers GENERATE only?).
> Original code:
> {code:java}
> public static class SelectorEntryComparator extends WritableComparator {
>   public SelectorEntryComparator() {
>     super(SelectorEntry.class, true);
>   }
> }
> {code}
> Modified code:
> {code:java}
> public static class SelectorEntryComparator extends WritableComparator {
>   public SelectorEntryComparator() {
>     super(SelectorEntry.class, true);
>   }
>
>   @Override
>   synchronized public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
>     return super.compare(b1, s1, l1, b2, s2, l2);
>   }
> }
> {code}
> Example of failed job status:
> {code}
> {
>   "id" : "parallel_0-65ff2f1b-382e-4eb2-a813-a0370b84d5b6-GENERATE-1961495833",
>   "type" : "GENERATE",
>   "confId" : "65ff2f1b-382e-4eb2-a813-a0370b84d5b6",
>   "args" : { "topN" : "100" },
>   "result" : null,
>   "state" : "FAILED",
>   "msg" : "ERROR: java.lang.RuntimeException: job failed: name=[parallel_0]generate: 1498059912-1448058551, jobid=job_local1142434549_0036",
>   "crawlId" : "parallel_0"
> }
> {code}
> Lines from hadoop.log:
> {code}
> 2017-06-21 11:45:13,021 WARN mapred.LocalJobRunner - job_local1142434549_0036
> java.lang.Exception: java.lang.RuntimeException: java.io.EOFException
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.RuntimeException: java.io.EOFException
>         at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:164)
>         at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:158)
>         at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:121)
>         at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:302)
>         at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:170)
>         at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> {code}
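The race described above can be modeled in miniature. The sketch below is illustrative, not Nutch or Hadoop code: like WritableComparator, the comparator deserializes through shared mutable state, and marking compare as synchronized (as the modified SelectorEntryComparator does) serializes access to that state so concurrent reducer threads cannot corrupt each other's reads:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Miniature model of the NUTCH-2395 race: a comparator that reuses
 * shared scratch state between calls, the way WritableComparator reuses
 * its internal buffer fields. Without synchronization, two threads can
 * overwrite each other's state mid-compare; the synchronized method
 * makes concurrent use safe. Illustrative sketch only.
 */
public class SharedBufferComparator {

    // Shared scratch state, standing in for WritableComparator's fields.
    private final int[] buf = new int[2];

    // "Deserialize" both keys into the shared buffer, then compare.
    public synchronized int compare(byte[] a, byte[] b) {
        buf[0] = a[0] & 0xFF;
        buf[1] = b[0] & 0xFF;
        return Integer.compare(buf[0], buf[1]);
    }

    public static void main(String[] args) throws Exception {
        SharedBufferComparator cmp = new SharedBufferComparator();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Boolean>> results = new ArrayList<>();
        // Interleave (5,9) and (9,5) comparisons across threads; without
        // the synchronized keyword, a concurrent write to buf could make
        // a compare return the wrong sign.
        for (int i = 0; i < 1000; i++) {
            final boolean forward = (i % 2 == 0);
            results.add(pool.submit(() -> forward
                    ? cmp.compare(new byte[]{5}, new byte[]{9}) < 0
                    : cmp.compare(new byte[]{9}, new byte[]{5}) > 0));
        }
        boolean allCorrect = true;
        for (Future<Boolean> f : results) allCorrect &= f.get();
        pool.shutdown();
        System.out.println(allCorrect); // with synchronized compare: true
    }
}
```

As the comment thread below notes, Nutch 1.x is unaffected because its comparator inherits from the stateless FloatWritable.Comparator; the synchronization is only needed where compare reads through shared fields.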
[jira] [Updated] (NUTCH-2395) Cannot run job worker! - error while running multiple crawling jobs in parallel
[ https://issues.apache.org/jira/browse/NUTCH-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2395: --- Fix Version/s: (was: 1.16)
> Cannot run job worker! - error while running multiple crawling jobs in parallel
> ---
>
> Key: NUTCH-2395
> URL: https://issues.apache.org/jira/browse/NUTCH-2395
> Project: Nutch
> Issue Type: Bug
> Components: generator, nutch server
> Affects Versions: 2.3.1
> Environment: Ubuntu 16.04 64-bit, Oracle Java 8 64-bit, Nutch 2.3.1 (standalone deployment), MongoDB 3.4
> Reporter: Vyacheslav Pascarel
> Priority: Major
> Fix For: 2.4
>
> My application is trying to execute multiple Nutch jobs in parallel using Nutch REST services. The application injects a seed URL and then repeats the GENERATE/FETCH/PARSE/UPDATEDB sequence a requested number of times to emulate continuous crawling (each step in the sequence is executed upon successful completion of the previous step, then the whole sequence is repeated again).
> Here is a brief description of the jobs:
> * Number of parallel jobs: 7
> * Each job has a unique crawl id and MongoDB collection
> * Seed URL for all jobs: http://www.cnn.com
> * Regex URL filters for all jobs:
> ** *"-^.\{1000,\}$"* - exclude very long URLs
> ** *"+."* - include the rest
> The jobs start as expected, but at some point some of them fail with a "Cannot run job worker!" error. For more details see the job status and hadoop.log lines below.
> In the debugger during the crash I noticed that a single instance of SelectorEntryComparator (nested in GeneratorJob) is shared across multiple reducer tasks. The class inherits from org.apache.hadoop.io.WritableComparator, which has a few members unprotected against concurrent usage. At some point multiple threads may access those members in a WritableComparator.compare call. I modified SelectorEntryComparator and it seems to have solved the problem, but I am not sure whether the change is appropriate and/or sufficient (does it cover GENERATE only?).
> Original code:
> {code:java}
> public static class SelectorEntryComparator extends WritableComparator {
>   public SelectorEntryComparator() {
>     super(SelectorEntry.class, true);
>   }
> }
> {code}
> Modified code:
> {code:java}
> public static class SelectorEntryComparator extends WritableComparator {
>   public SelectorEntryComparator() {
>     super(SelectorEntry.class, true);
>   }
>
>   @Override
>   public synchronized int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
>     return super.compare(b1, s1, l1, b2, s2, l2);
>   }
> }
> {code}
> Example of a failed job status:
> {code}
> {
>   "id" : "parallel_0-65ff2f1b-382e-4eb2-a813-a0370b84d5b6-GENERATE-1961495833",
>   "type" : "GENERATE",
>   "confId" : "65ff2f1b-382e-4eb2-a813-a0370b84d5b6",
>   "args" : { "topN" : "100" },
>   "result" : null,
>   "state" : "FAILED",
>   "msg" : "ERROR: java.lang.RuntimeException: job failed: name=[parallel_0]generate: 1498059912-1448058551, jobid=job_local1142434549_0036",
>   "crawlId" : "parallel_0"
> }
> {code}
> Lines from hadoop.log:
> {code}
> 2017-06-21 11:45:13,021 WARN mapred.LocalJobRunner - job_local1142434549_0036
> java.lang.Exception: java.lang.RuntimeException: java.io.EOFException
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.RuntimeException: java.io.EOFException
>   at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:164)
>   at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:158)
>   at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:121)
>   at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:302)
>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:170)
>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
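The fix quoted above wraps `compare` in `synchronized` because the shared comparator keeps mutable per-instance read buffers. The effect can be illustrated with a minimal standalone sketch (hypothetical class and field names, not Nutch or Hadoop code): a comparator holding shared scratch state, made safe for concurrent reducer threads the same way the modified SelectorEntryComparator is.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a comparator with shared mutable state (like
// WritableComparator's internal buffer), made thread-safe by synchronizing
// compare(), mirroring the modified SelectorEntryComparator above.
public class SharedComparator {
    // Shared mutable scratch buffer; without synchronization, concurrent
    // compare() calls could interleave writes here and corrupt reads
    // (manifesting as the EOFException in the stack trace above).
    private final byte[] scratch = new byte[8];

    public synchronized int compare(byte[] a, byte[] b) {
        System.arraycopy(a, 0, scratch, 0, a.length); // fill shared buffer
        int cached = scratch[0];                      // read it back
        return Integer.compare(cached, b[0]);
    }

    public static void main(String[] args) throws Exception {
        SharedComparator cmp = new SharedComparator();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        // Hammer the single shared instance from several threads, as the
        // parallel reducer tasks do with the shared SelectorEntryComparator.
        for (int i = 0; i < 4; i++) {
            pool.submit(() -> {
                for (int j = 0; j < 10_000; j++) {
                    cmp.compare(new byte[]{1}, new byte[]{2});
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("result: " + cmp.compare(new byte[]{1}, new byte[]{2}));
    }
}
```

With `synchronized` removed, the same sketch may read a half-written buffer under contention, which is the class of failure the reporter observed.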
[jira] [Commented] (NUTCH-2395) Cannot run job worker! - error while running multiple crawling jobs in parallel
[ https://issues.apache.org/jira/browse/NUTCH-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735620#comment-16735620 ] Sebastian Nagel commented on NUTCH-2395: Sorry, 1.x is safe because it inherits from FloatWritable.Comparator, which is thread-safe (unlike WritableComparator, it does not use any instance fields when reading values).
> Cannot run job worker! - error while running multiple crawling jobs in parallel
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2395) Cannot run job worker! - error while running multiple crawling jobs in parallel
[ https://issues.apache.org/jira/browse/NUTCH-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735613#comment-16735613 ] Sebastian Nagel commented on NUTCH-2395: Also affects 1.x when Generator is used from Nutch server in parallel.
> Cannot run job worker! - error while running multiple crawling jobs in parallel
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2395) Cannot run job worker! - error while running multiple crawling jobs in parallel
[ https://issues.apache.org/jira/browse/NUTCH-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2395: --- Component/s: generator
> Cannot run job worker! - error while running multiple crawling jobs in parallel
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2395) Cannot run job worker! - error while running multiple crawling jobs in parallel
[ https://issues.apache.org/jira/browse/NUTCH-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2395: --- Fix Version/s: 1.16
> Cannot run job worker! - error while running multiple crawling jobs in parallel
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2395) Cannot run job worker! - error while running multiple crawling jobs in parallel
[ https://issues.apache.org/jira/browse/NUTCH-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2395: --- Affects Version/s: 1.14
> Cannot run job worker! - error while running multiple crawling jobs in parallel
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-1623) Implement file.content.ignored function
[ https://issues.apache.org/jira/browse/NUTCH-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1623: --- Fix Version/s: 2.5
> Implement file.content.ignored function
> ---
>
> Key: NUTCH-1623
> URL: https://issues.apache.org/jira/browse/NUTCH-1623
> Project: Nutch
> Issue Type: New Feature
> Components: crawldb, fetcher
> Affects Versions: 2.2, 2.2.1
> Reporter: Osy
> Priority: Major
> Fix For: 2.5
>
> For Nutch 2.2.1, nutch-default.xml contains a description of this functionality (!! NOT IMPLEMENTED YET !!):
> "If true, no file content will be saved during fetch. And it is probably what we want to set most of time, since file:// URLs are meant to be local and we can always use them directly at parsing and indexing stages. Otherwise file contents will be saved."
> Exactly what I need. Thanks
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (NUTCH-2683) DeduplicationJob: add option to prefer https:// over http://
Sebastian Nagel created NUTCH-2683: -- Summary: DeduplicationJob: add option to prefer https:// over http:// Key: NUTCH-2683 URL: https://issues.apache.org/jira/browse/NUTCH-2683 Project: Nutch Issue Type: Improvement Components: crawldb Affects Versions: 1.15 Reporter: Sebastian Nagel Fix For: 1.16
The deduplication job allows keeping the shortest URL as the "best" URL of a set of duplicates, marking all longer ones as duplicates. Recently, search engines started to penalize non-https pages by [giving https pages a higher rank|https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html] and [marking http as insecure|https://www.blog.google/products/chrome/milestone-chrome-security-marking-http-not-secure/]. If URLs are identical except for the protocol, the deduplication job should be able to prefer https:// over http:// URLs, although the latter are shorter by one character. Of course, this should be configurable and in addition to the existing preferences (length, score and fetch time) used to select the "best" URL among duplicates.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
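The proposed selection rule can be sketched as a two-level comparison: protocol preference first, then the existing shortest-URL tiebreak. This is a hypothetical illustration (class and method names are invented, not the actual DeduplicationJob code):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the proposed preference: among duplicate URLs,
// prefer https:// even though it is one character longer than http://,
// then fall back to the existing shortest-URL rule.
public class PreferHttps {
    static String best(List<String> duplicates) {
        return duplicates.stream()
            .min(Comparator
                // rank https URLs ahead of everything else ...
                .comparing((String u) -> u.startsWith("https://") ? 0 : 1)
                // ... then apply the existing shortest-URL preference
                .thenComparing(String::length))
            .orElseThrow();
    }

    public static void main(String[] args) {
        List<String> dups = Arrays.asList(
            "http://example.com/page",
            "https://example.com/page");
        // https wins despite being one character longer
        System.out.println(best(dups));
    }
}
```

Keeping the protocol preference as the first key of a chained comparator makes it easy to toggle via configuration: when the option is off, simply omit that key and the selection degrades to the current length-based behavior.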