[jira] [Commented] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver

2019-01-07 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735780#comment-16735780
 ] 

Sebastian Nagel commented on NUTCH-2676:


Great! Thanks!

> Update to the latest selenium and add code to use chrome and firefox headless 
> mode with the remote web driver
> -
>
> Key: NUTCH-2676
> URL: https://issues.apache.org/jira/browse/NUTCH-2676
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.15
>Reporter: Stas Batururimi
>Priority: Major
> Fix For: 1.16
>
> Attachments: Screenshot 2018-11-16 at 18.15.52.png
>
>
> * Selenium needs to be updated
>  * missing remote web driver for chrome 
>  * necessity to add headless mode for both remote WebDriverBase Firefox & 
> Chrome
>  * use case with Selenium grid using docker (1 hub docker container, several 
> nodes in different docker containers, Nutch in another docker container, 
> streaming to Apache Solr in docker container, that is at least 4 different 
> docker containers)
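
For illustration: a minimal, hypothetical sketch of the headless remote-driver
setup the bullets above describe (Selenium 3.x API; the class name, helper
method and hub URL are assumptions, not the patch attached to this issue):

{code:java}
import java.net.URL;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.firefox.FirefoxOptions;
import org.openqa.selenium.remote.RemoteWebDriver;

public class HeadlessRemoteDriverSketch {

  // Create a RemoteWebDriver against a Selenium Grid hub (e.g. the hub
  // container of a dockerized grid), in headless mode for Chrome or Firefox.
  public static WebDriver create(String hubUrl, String browser) throws Exception {
    if ("chrome".equalsIgnoreCase(browser)) {
      ChromeOptions options = new ChromeOptions();
      options.setHeadless(true);                       // headless Chrome
      return new RemoteWebDriver(new URL(hubUrl), options);
    }
    FirefoxOptions options = new FirefoxOptions();
    options.setHeadless(true);                         // headless Firefox
    return new RemoteWebDriver(new URL(hubUrl), options);
  }
}
{code}

For example, create("http://selenium-hub:4444/wd/hub", "chrome") when the grid
hub runs in a docker container named selenium-hub.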



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-2666) Increase default value for http.content.limit / ftp.content.limit / file.content.limit

2019-01-07 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2666:
---
Fix Version/s: 1.16

> Increase default value for http.content.limit / ftp.content.limit / 
> file.content.limit
> --
>
> Key: NUTCH-2666
> URL: https://issues.apache.org/jira/browse/NUTCH-2666
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.15
>Reporter: Marco Ebbinghaus
>Priority: Minor
> Fix For: 1.16
>
>
> The default value for http.content.limit in nutch-default.xml ("The length 
> limit for downloaded content using the http:// protocol, in bytes. If this 
> value is nonnegative (>=0), content longer than it will be truncated; 
> otherwise, no truncation at all. Do not confuse this setting with the 
> file.content.limit setting.") is set to 64 kB. Maybe this default value 
> should be increased as many pages today are greater than 64 kB.
> This fact hit me when trying to crawl a single website whose pages are much 
> greater than 64 kB: with every crawl cycle the count of db_unfetched URLs 
> decreased until it hit zero and the crawler became inactive, because the 
> first 64 kB always contained the same set of navigation links.
> The description might also be updated as this is not only the case for the 
> http protocol, but also for https.
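
As an illustration of the requested change, the truncation thresholds can be
raised per installation in nutch-site.xml; this is only a sketch, and 1048576
bytes (1 MB) is an example value, not necessarily the new default:

{code:xml}
<property>
  <name>http.content.limit</name>
  <value>1048576</value>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>1048576</value>
</property>
<property>
  <name>file.content.limit</name>
  <value>1048576</value>
</property>
{code}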



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2666) Increase default value for http.content.limit / ftp.content.limit / file.content.limit

2019-01-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735709#comment-16735709
 ] 

ASF GitHub Bot commented on NUTCH-2666:
---

sebastian-nagel commented on pull request #427: NUTCH-2666 Increase default 
value for http.content.limit / ftp.content.limit / file.content.limit
URL: https://github.com/apache/nutch/pull/427
 
 
   Increase the default content limit from 64 kB to 1024 kB for http, ftp and 
file protocol plugins.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Increase default value for http.content.limit / ftp.content.limit / 
> file.content.limit
> --
>
> Key: NUTCH-2666
> URL: https://issues.apache.org/jira/browse/NUTCH-2666
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.15
>Reporter: Marco Ebbinghaus
>Priority: Minor
>
> The default value for http.content.limit in nutch-default.xml ("The length 
> limit for downloaded content using the http:// protocol, in bytes. If this 
> value is nonnegative (>=0), content longer than it will be truncated; 
> otherwise, no truncation at all. Do not confuse this setting with the 
> file.content.limit setting.") is set to 64 kB. Maybe this default value 
> should be increased as many pages today are greater than 64 kB.
> This fact hit me when trying to crawl a single website whose pages are much 
> greater than 64 kB: with every crawl cycle the count of db_unfetched URLs 
> decreased until it hit zero and the crawler became inactive, because the 
> first 64 kB always contained the same set of navigation links.
> The description might also be updated as this is not only the case for the 
> http protocol, but also for https.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-2666) Increase default value for http.content.limit / ftp.content.limit / file.content.limit

2019-01-07 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2666:
---
Summary: Increase default value for http.content.limit / ftp.content.limit 
/ file.content.limit  (was: increase default value for http.content.limit)

> Increase default value for http.content.limit / ftp.content.limit / 
> file.content.limit
> --
>
> Key: NUTCH-2666
> URL: https://issues.apache.org/jira/browse/NUTCH-2666
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.15
>Reporter: Marco Ebbinghaus
>Priority: Minor
>
> The default value for http.content.limit in nutch-default.xml ("The length 
> limit for downloaded content using the http:// protocol, in bytes. If this 
> value is nonnegative (>=0), content longer than it will be truncated; 
> otherwise, no truncation at all. Do not confuse this setting with the 
> file.content.limit setting.") is set to 64 kB. Maybe this default value 
> should be increased as many pages today are greater than 64 kB.
> This fact hit me when trying to crawl a single website whose pages are much 
> greater than 64 kB: with every crawl cycle the count of db_unfetched URLs 
> decreased until it hit zero and the crawler became inactive, because the 
> first 64 kB always contained the same set of navigation links.
> The description might also be updated as this is not only the case for the 
> http protocol, but also for https.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2673) EOFException protocol-http

2019-01-07 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735692#comment-16735692
 ] 

Markus Jelsma commented on NUTCH-2673:
--

Yes, thanks Sebastian!

> EOFException protocol-http
> --
>
> Key: NUTCH-2673
> URL: https://issues.apache.org/jira/browse/NUTCH-2673
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.16
>
>
> Got an EOFException for some URL:
> {code}
> 2018-11-07 12:23:18,463 INFO  indexer.IndexingFiltersChecker - fetching: 
> https://www.misdaadjournalist.nl/2018/11/politie-kraakt-server-van-blackbox-265-000-criminele-berichten-onderschept/
> 2018-11-07 12:23:18,704 INFO  protocol.RobotRulesParser - robots.txt 
> whitelist not configured.
> 2018-11-07 12:23:18,704 INFO  http.Http - http.proxy.host = null
> 2018-11-07 12:23:18,704 INFO  http.Http - http.proxy.port = 8080
> 2018-11-07 12:23:18,704 INFO  http.Http - http.proxy.exception.list = false
> 2018-11-07 12:23:18,704 INFO  http.Http - http.timeout = 3
> 2018-11-07 12:23:18,704 INFO  http.Http - http.content.limit = 32554432
> 2018-11-07 12:23:18,704 INFO  http.Http - http.agent = Mozilla/5.0 
> (compatible; OpenindexSpider; 
> +https://www.openindex.io/saas/about-our-spider/)
> 2018-11-07 12:23:18,704 INFO  http.Http - http.accept.language = 
> en-us,en-gb,en;q=0.7,*;q=0.3
> 2018-11-07 12:23:18,704 INFO  http.Http - http.accept = 
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 2018-11-07 12:23:18,704 INFO  http.Http - http.enable.cookie.header = false
> 2018-11-07 12:23:18,911 ERROR http.Http - Failed to get protocol output
> java.io.EOFException
> at 
> org.apache.nutch.protocol.http.HttpResponse.readLine(HttpResponse.java:591)
> at 
> org.apache.nutch.protocol.http.HttpResponse.parseStatusLine(HttpResponse.java:482)
> at 
> org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:249)
> at org.apache.nutch.protocol.http.Http.getResponse(Http.java:72)
> at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:276)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.getProtocolOutput(IndexingFiltersChecker.java:270)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.process(IndexingFiltersChecker.java:141)
> at 
> org.apache.nutch.util.AbstractChecker.processSingle(AbstractChecker.java:86)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:111)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:275)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (NUTCH-2673) EOFException protocol-http

2019-01-07 Thread Markus Jelsma (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-2673.

Resolution: Not A Problem

> EOFException protocol-http
> --
>
> Key: NUTCH-2673
> URL: https://issues.apache.org/jira/browse/NUTCH-2673
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.16
>
>
> Got an EOFException for some URL:
> {code}
> 2018-11-07 12:23:18,463 INFO  indexer.IndexingFiltersChecker - fetching: 
> https://www.misdaadjournalist.nl/2018/11/politie-kraakt-server-van-blackbox-265-000-criminele-berichten-onderschept/
> 2018-11-07 12:23:18,704 INFO  protocol.RobotRulesParser - robots.txt 
> whitelist not configured.
> 2018-11-07 12:23:18,704 INFO  http.Http - http.proxy.host = null
> 2018-11-07 12:23:18,704 INFO  http.Http - http.proxy.port = 8080
> 2018-11-07 12:23:18,704 INFO  http.Http - http.proxy.exception.list = false
> 2018-11-07 12:23:18,704 INFO  http.Http - http.timeout = 3
> 2018-11-07 12:23:18,704 INFO  http.Http - http.content.limit = 32554432
> 2018-11-07 12:23:18,704 INFO  http.Http - http.agent = Mozilla/5.0 
> (compatible; OpenindexSpider; 
> +https://www.openindex.io/saas/about-our-spider/)
> 2018-11-07 12:23:18,704 INFO  http.Http - http.accept.language = 
> en-us,en-gb,en;q=0.7,*;q=0.3
> 2018-11-07 12:23:18,704 INFO  http.Http - http.accept = 
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 2018-11-07 12:23:18,704 INFO  http.Http - http.enable.cookie.header = false
> 2018-11-07 12:23:18,911 ERROR http.Http - Failed to get protocol output
> java.io.EOFException
> at 
> org.apache.nutch.protocol.http.HttpResponse.readLine(HttpResponse.java:591)
> at 
> org.apache.nutch.protocol.http.HttpResponse.parseStatusLine(HttpResponse.java:482)
> at 
> org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:249)
> at org.apache.nutch.protocol.http.Http.getResponse(Http.java:72)
> at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:276)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.getProtocolOutput(IndexingFiltersChecker.java:270)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.process(IndexingFiltersChecker.java:141)
> at 
> org.apache.nutch.util.AbstractChecker.processSingle(AbstractChecker.java:86)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:111)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:275)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2680) Documentation: https supported by multiple protocol plugins not only httpclient

2019-01-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735686#comment-16735686
 ] 

ASF GitHub Bot commented on NUTCH-2680:
---

sebastian-nagel commented on pull request #426: NUTCH-2680 Documentation: https 
supported by multiple protocol plugins not only httpclient
URL: https://github.com/apache/nutch/pull/426
 
 
   Improve description of property plugin.includes:
   - https is supported by default
   - no need to enable the stub plugin nutch-extensionpoints
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Documentation: https supported by multiple protocol plugins not only 
> httpclient
> ---
>
> Key: NUTCH-2680
> URL: https://issues.apache.org/jira/browse/NUTCH-2680
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation, plugin
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.16
>
>
> nutch-default.xml still states:
> ??In order to use HTTPS please enable protocol-httpclient, but be aware of 
> possible intermittent problems with the underlying commons-httpclient 
> library.??
> Now https is supported by most protocol plugins and there is no need to 
> activate protocol-httpclient to fetch https:// pages.
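
For context, an illustrative plugin.includes (not the verbatim default shipped
with 1.x): protocol-http alone fetches both http:// and https:// URLs, so
neither protocol-httpclient nor the stub plugin nutch-extensionpoints needs to
be listed.

{code:xml}
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
{code}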



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2683) DeduplicationJob: add option to prefer https:// over http://

2019-01-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735682#comment-16735682
 ] 

ASF GitHub Bot commented on NUTCH-2683:
---

sebastian-nagel commented on pull request #425: NUTCH-2683 DeduplicationJob: 
add option to prefer https:// over http://
URL: https://github.com/apache/nutch/pull/425
 
 
   - add the optional value "httpsOverHttp" to the -compareOrder argument to prefer 
https:// over http:// if it comes before "urlLength" and neither "score" 
nor "fetchTime" takes precedence
   - code improvements: remove nested loop, sort imports, add `@Override` 
annotations where applicable
   
   Testing with one pair of https/http duplicates:
   ```
   % cat seeds.txt 
   http://nutch.apache.org/
   https://nutch.apache.org/
   
   % nutch inject crawldb seeds.txt
   ...
   
   % nutch generate crawldb/ segments
   ...
   
   % nutch fetch segments/*
   ...
   
   % nutch parse segments/*
   ...
   
   % nutch updatedb crawldb/ segments/*
   ...
   
   % nutch dedup crawldb -compareOrder httpsOverHttp,score,urlLength,fetchTime
   ...
   Deduplication: 1 documents marked as duplicates
   ...
   
   % nutch readdb crawldb/ -url https://nutch.apache.org/
   URL: https://nutch.apache.org/
   Version: 7
   Status: 2 (db_fetched)
   Fetch time: Wed Feb 06 11:55:33 CET 2019
   Modified time: Mon Jan 07 11:55:33 CET 2019
   Retries since fetch: 0
   Retry interval: 2592000 seconds (30 days)
   Score: 1.181
   Signature: da0ffbf19768ea2cab9ffa0fb4a778a7
   Metadata: 
   ...
   
   % nutch readdb crawldb/ -url http://nutch.apache.org/
   URL: http://nutch.apache.org/
   Version: 7
   Status: 7 (db_duplicate)
   Fetch time: Wed Feb 06 11:55:39 CET 2019
   Modified time: Mon Jan 07 11:55:39 CET 2019
   Retries since fetch: 0
   Retry interval: 2592000 seconds (30 days)
   Score: 1.181
   Signature: da0ffbf19768ea2cab9ffa0fb4a778a7
   Metadata: 
   ...
   ```
   The URL `https://nutch.apache.org/` is kept as expected if "httpsOverHttp" 
is configured.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> DeduplicationJob: add option to prefer https:// over http://
> 
>
> Key: NUTCH-2683
> URL: https://issues.apache.org/jira/browse/NUTCH-2683
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.16
>
>
> The deduplication job allows keeping the shortest URL as the "best" URL of a 
> set of duplicates, marking all longer ones as duplicates. Recently search 
> engines started to penalize non-https pages by [giving https pages a higher 
> rank|https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html] 
> and [marking http as 
> insecure|https://www.blog.google/products/chrome/milestone-chrome-security-marking-http-not-secure/].
> If URLs are identical except for the protocol the deduplication job should be 
> able to prefer https:// over http:// URLs, although the latter ones are 
> shorter by one character. Of course, this should be configurable and in 
> addition to existing preferences (length, score and fetch time) to select the 
> "best" URL among duplicates.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver

2019-01-07 Thread Stas Batururimi (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735656#comment-16735656
 ] 

Stas Batururimi edited comment on NUTCH-2676 at 1/7/19 10:48 AM:
-

[~wastl-nagel] Hi. Yes. I will provide it soon, somewhere between Jan 9 - Jan 
13.


was (Author: virt):
[~wastl-nagel] Hi. Yes. I will provide it soo, somewhere between Jan 9 - Jan 13.

> Update to the latest selenium and add code to use chrome and firefox headless 
> mode with the remote web driver
> -
>
> Key: NUTCH-2676
> URL: https://issues.apache.org/jira/browse/NUTCH-2676
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.15
>Reporter: Stas Batururimi
>Priority: Major
> Fix For: 1.16
>
> Attachments: Screenshot 2018-11-16 at 18.15.52.png
>
>
> * Selenium needs to be updated
>  * missing remote web driver for chrome 
>  * necessity to add headless mode for both remote WebDriverBase Firefox & 
> Chrome
>  * use case with Selenium grid using docker (1 hub docker container, several 
> nodes in different docker containers, Nutch in another docker container, 
> streaming to Apache Solr in docker container, that is at least 4 different 
> docker containers)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver

2019-01-07 Thread Stas Batururimi (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735656#comment-16735656
 ] 

Stas Batururimi commented on NUTCH-2676:


[~wastl-nagel] Hi. Yes. I will provide it soon, somewhere between Jan 9 - Jan 13.

> Update to the latest selenium and add code to use chrome and firefox headless 
> mode with the remote web driver
> -
>
> Key: NUTCH-2676
> URL: https://issues.apache.org/jira/browse/NUTCH-2676
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.15
>Reporter: Stas Batururimi
>Priority: Major
> Fix For: 1.16
>
> Attachments: Screenshot 2018-11-16 at 18.15.52.png
>
>
> * Selenium needs to be updated
>  * missing remote web driver for chrome 
>  * necessity to add headless mode for both remote WebDriverBase Firefox & 
> Chrome
>  * use case with Selenium grid using docker (1 hub docker container, several 
> nodes in different docker containers, Nutch in another docker container, 
> streaming to Apache Solr in docker container, that is at least 4 different 
> docker containers)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (NUTCH-2670) org.apache.nutch.indexer.IndexerMapReduce does not read the value of "indexer.delete" from nutch-site.xml

2019-01-07 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2670.

Resolution: Not A Problem

Thanks for the feedback, [~aquaticwater]!

> org.apache.nutch.indexer.IndexerMapReduce does not read the value of 
> "indexer.delete" from nutch-site.xml
> -
>
> Key: NUTCH-2670
> URL: https://issues.apache.org/jira/browse/NUTCH-2670
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.14, 1.15
> Environment: macOS Mojave and High Sierra
> MacBook Pro (Retina, 13-inch, Mid 2014)
> Oracle Java 1.8.0_144-b01 and previous versions
>Reporter: Junqiang Zhang
>Priority: Minor
>
> Inside org.apache.nutch.indexer.IndexerMapReduce.IndexerReducer, the setup() 
> function should read the value of "indexer.delete" from nutch-site.xml, and 
> assign the value to the variable of "delete". See the following line of code.
> (line 201)  delete = conf.getBoolean(INDEXER_DELETE, false);
> However, the value of "indexer.delete" set in nutch-site.xml and 
> nutch-default.xml is not assigned to the variable, "delete". I put the 
> following setting in one of nutch-site.xml and nutch-default.xml, or in both 
> of them. The variable of "delete" remains false.
> <property>
>   <name>indexer.delete</name>
>   <value>true</value>
>   <description>Whether the indexer will delete documents GONE or REDIRECTS by 
> indexing filters</description>
> </property>
> I also changed the line of code to
> delete = conf.getBoolean(INDEXER_DELETE, true);
> Whatever value of "indexer.delete" is set in nutch-site.xml or 
> nutch-default.xml, the value of "delete" remains false.
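
A minimal way to check which value the loaded configuration actually carries
(a sketch, not part of the issue; it assumes nutch-site.xml is on the
classpath of the process being tested):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class IndexerDeleteCheck {
  public static void main(String[] args) {
    // NutchConfiguration.create() loads nutch-default.xml and nutch-site.xml
    // from the classpath, so this prints the value the indexer reducer would see.
    Configuration conf = NutchConfiguration.create();
    System.out.println("indexer.delete = " + conf.getBoolean("indexer.delete", false));
  }
}
{code}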



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver

2019-01-07 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735629#comment-16735629
 ] 

Sebastian Nagel commented on NUTCH-2676:


Hi [~virt], was the upgrade of Selenium successful? If yes, could you provide a 
patch?

> Update to the latest selenium and add code to use chrome and firefox headless 
> mode with the remote web driver
> -
>
> Key: NUTCH-2676
> URL: https://issues.apache.org/jira/browse/NUTCH-2676
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.15
>Reporter: Stas Batururimi
>Priority: Major
> Fix For: 1.16
>
> Attachments: Screenshot 2018-11-16 at 18.15.52.png
>
>
> * Selenium needs to be updated
>  * missing remote web driver for chrome 
>  * necessity to add headless mode for both remote WebDriverBase Firefox & 
> Chrome
>  * use case with Selenium grid using docker (1 hub docker container, several 
> nodes in different docker containers, Nutch in another docker container, 
> streaming to Apache Solr in docker container, that is at least 4 different 
> docker containers)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2673) EOFException protocol-http

2019-01-07 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735626#comment-16735626
 ] 

Sebastian Nagel commented on NUTCH-2673:


[~markus17], can we close this?

> EOFException protocol-http
> --
>
> Key: NUTCH-2673
> URL: https://issues.apache.org/jira/browse/NUTCH-2673
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.16
>
>
> Got an EOFException for some URL:
> {code}
> 2018-11-07 12:23:18,463 INFO  indexer.IndexingFiltersChecker - fetching: 
> https://www.misdaadjournalist.nl/2018/11/politie-kraakt-server-van-blackbox-265-000-criminele-berichten-onderschept/
> 2018-11-07 12:23:18,704 INFO  protocol.RobotRulesParser - robots.txt 
> whitelist not configured.
> 2018-11-07 12:23:18,704 INFO  http.Http - http.proxy.host = null
> 2018-11-07 12:23:18,704 INFO  http.Http - http.proxy.port = 8080
> 2018-11-07 12:23:18,704 INFO  http.Http - http.proxy.exception.list = false
> 2018-11-07 12:23:18,704 INFO  http.Http - http.timeout = 3
> 2018-11-07 12:23:18,704 INFO  http.Http - http.content.limit = 32554432
> 2018-11-07 12:23:18,704 INFO  http.Http - http.agent = Mozilla/5.0 
> (compatible; OpenindexSpider; 
> +https://www.openindex.io/saas/about-our-spider/)
> 2018-11-07 12:23:18,704 INFO  http.Http - http.accept.language = 
> en-us,en-gb,en;q=0.7,*;q=0.3
> 2018-11-07 12:23:18,704 INFO  http.Http - http.accept = 
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 2018-11-07 12:23:18,704 INFO  http.Http - http.enable.cookie.header = false
> 2018-11-07 12:23:18,911 ERROR http.Http - Failed to get protocol output
> java.io.EOFException
> at 
> org.apache.nutch.protocol.http.HttpResponse.readLine(HttpResponse.java:591)
> at 
> org.apache.nutch.protocol.http.HttpResponse.parseStatusLine(HttpResponse.java:482)
> at 
> org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:249)
> at org.apache.nutch.protocol.http.Http.getResponse(Http.java:72)
> at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:276)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.getProtocolOutput(IndexingFiltersChecker.java:270)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.process(IndexingFiltersChecker.java:141)
> at 
> org.apache.nutch.util.AbstractChecker.processSingle(AbstractChecker.java:86)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:111)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:275)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-2395) Cannot run job worker! - error while running multiple crawling jobs in parallel

2019-01-07 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2395:
---
Affects Version/s: (was: 1.14)

> Cannot run job worker! - error while running multiple crawling jobs in 
> parallel
> ---
>
> Key: NUTCH-2395
> URL: https://issues.apache.org/jira/browse/NUTCH-2395
> Project: Nutch
>  Issue Type: Bug
>  Components: generator, nutch server
>Affects Versions: 2.3.1
> Environment: Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
>Reporter: Vyacheslav Pascarel
>Priority: Major
> Fix For: 2.4, 1.16
>
>
> Cannot run job worker! - error while running multiple crawling jobs in 
> parallel
> Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
> My application is trying to execute multiple Nutch jobs in parallel using 
> Nutch REST services. The application injects a seed URL and then repeats the 
> GENERATE/FETCH/PARSE/UPDATEDB sequence a requested number of times to emulate 
> continuous crawling (each step in the sequence is executed upon successful 
> completion of the previous step, then the whole sequence is repeated again). 
> Here is a brief description of the jobs:
> * Number of parallel jobs: 7
> * Each job has unique crawl id and MongoDB collection
> * Seed URL for all jobs: http://www.cnn.com
> * Regex URL filters for all jobs: 
> ** *"-^.\{1000,\}$"* - exclude very long URLs
> ** *"+."* - include the rest
> The jobs are started as expected but at some point some of them fail with 
> "Cannot run job worker!" error. For more details see job status and 
> hadoop.log lines below.
> In debugger during crash I noticed that a single instance of 
> SelectorEntryComparator (definition is nested in GeneratorJob) is shared 
> across multiple reducer tasks. The class inherits from 
> org.apache.hadoop.io.WritableComparator, which has a few members unprotected 
> against concurrent usage. At some point multiple threads may access those members 
> in a WritableComparator.compare call. I modified SelectorEntryComparator and it 
> seems to have solved the problem, but I am not sure if the change is appropriate 
> and/or sufficient (covers GENERATE only?)
> Original code:
> {code:java}
> public static class SelectorEntryComparator extends WritableComparator {
> public SelectorEntryComparator() {
>   super(SelectorEntry.class, true);
> }
> }
> {code}
> Modified code:
> {code:java}
> public static class SelectorEntryComparator extends WritableComparator {
> public SelectorEntryComparator() {
>   super(SelectorEntry.class, true);
> }
> 
> @Override
> synchronized public int compare(byte[] b1, int s1, int l1, byte[] b2, int 
> s2, int l2) {
>   return super.compare(b1, s1, l1, b2, s2, l2);
> }
> }
> {code}
> Example of failed job status:
> {code}
> {
> "id" : "parallel_0-65ff2f1b-382e-4eb2-a813-a0370b84d5b6-GENERATE-1961495833",
> "type" : "GENERATE",
> "confId" : "65ff2f1b-382e-4eb2-a813-a0370b84d5b6",
> "args" : { "topN" : "100" },
> "result" : null,
> "state" : "FAILED",
> "msg" : "ERROR: java.lang.RuntimeException: job failed: 
> name=[parallel_0]generate: 1498059912-1448058551, 
> jobid=job_local1142434549_0036",
> "crawlId" : "parallel_0"
> }
> {code}
> Lines from hadoop.log
> {code}
> 2017-06-21 11:45:13,021 WARN  mapred.LocalJobRunner - job_local1142434549_0036
> java.lang.Exception: java.lang.RuntimeException: java.io.EOFException
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.RuntimeException: java.io.EOFException
> at 
> org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:164)
> at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:158)
> at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:121)
> at 
> org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:302)
> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:170)
> at 
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
> at 
> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 

[jira] [Updated] (NUTCH-2395) Cannot run job worker! - error while running multiple crawling jobs in parallel

2019-01-07 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2395:
---
Fix Version/s: (was: 1.16)

> Cannot run job worker! - error while running multiple crawling jobs in 
> parallel
> ---
>
> Key: NUTCH-2395
> URL: https://issues.apache.org/jira/browse/NUTCH-2395
> Project: Nutch
>  Issue Type: Bug
>  Components: generator, nutch server
>Affects Versions: 2.3.1
> Environment: Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
>Reporter: Vyacheslav Pascarel
>Priority: Major
> Fix For: 2.4
>
>
> Cannot run job worker! - error while running multiple crawling jobs in 
> parallel
> Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
> My application is trying to execute multiple Nutch jobs in parallel using 
> Nutch REST services. The application injects a seed URL and then repeats the 
> GENERATE/FETCH/PARSE/UPDATEDB sequence a requested number of times to emulate 
> continuous crawling (each step in the sequence is executed upon successful 
> completion of the previous step, then the whole sequence is repeated again). 
> Here is a brief description of the jobs:
> * Number of parallel jobs: 7
> * Each job has unique crawl id and MongoDB collection
> * Seed URL for all jobs: http://www.cnn.com
> * Regex URL filters for all jobs: 
> ** *"-^.\{1000,\}$"* - exclude very long URLs
> ** *"+."* - include the rest
> The jobs are started as expected but at some point some of them fail with 
> "Cannot run job worker!" error. For more details see job status and 
> hadoop.log lines below.
> In debugger during crash I noticed that a single instance of 
> SelectorEntryComparator (definition is nested in GeneratorJob) is shared 
> across multiple reducer tasks. The class inherits from 
> org.apache.hadoop.io.WritableComparator, which has a few members unprotected 
> against concurrent usage. At some point multiple threads may access those members 
> in a WritableComparator.compare call. I modified SelectorEntryComparator and it 
> seems to have solved the problem, but I am not sure if the change is appropriate 
> and/or sufficient (covers GENERATE only?)
> Original code:
> {code:java}
> public static class SelectorEntryComparator extends WritableComparator {
> public SelectorEntryComparator() {
>   super(SelectorEntry.class, true);
> }
> }
> {code}
> Modified code:
> {code:java}
> public static class SelectorEntryComparator extends WritableComparator {
> public SelectorEntryComparator() {
>   super(SelectorEntry.class, true);
> }
> 
> @Override
> synchronized public int compare(byte[] b1, int s1, int l1, byte[] b2, int 
> s2, int l2) {
>   return super.compare(b1, s1, l1, b2, s2, l2);
> }
> }
> {code}
> Example of failed job status:
> {code}
> {
> "id" : "parallel_0-65ff2f1b-382e-4eb2-a813-a0370b84d5b6-GENERATE-1961495833",
> "type" : "GENERATE",
> "confId" : "65ff2f1b-382e-4eb2-a813-a0370b84d5b6",
> "args" : { "topN" : "100" },
> "result" : null,
> "state" : "FAILED",
> "msg" : "ERROR: java.lang.RuntimeException: job failed: 
> name=[parallel_0]generate: 1498059912-1448058551, 
> jobid=job_local1142434549_0036",
> "crawlId" : "parallel_0"
> }
> {code}
> Lines from hadoop.log
> {code}
> 2017-06-21 11:45:13,021 WARN  mapred.LocalJobRunner - job_local1142434549_0036
> java.lang.Exception: java.lang.RuntimeException: java.io.EOFException
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.RuntimeException: java.io.EOFException
> at 
> org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:164)
> at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:158)
> at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:121)
> at 
> org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:302)
> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:170)
> at 
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
> at 
> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> 

[jira] [Commented] (NUTCH-2395) Cannot run job worker! - error while running multiple crawling jobs in parallel

2019-01-07 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735620#comment-16735620
 ] 

Sebastian Nagel commented on NUTCH-2395:


Sorry, 1.x is safe because it inherits from FloatWritable.Comparator which is 
thread-safe (does not use any fields when reading values as does 
WritableComparator).

> Cannot run job worker! - error while running multiple crawling jobs in 
> parallel
> ---
>
> Key: NUTCH-2395
> URL: https://issues.apache.org/jira/browse/NUTCH-2395
> Project: Nutch
>  Issue Type: Bug
>  Components: generator, nutch server
>Affects Versions: 2.3.1, 1.14
> Environment: Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
>Reporter: Vyacheslav Pascarel
>Priority: Major
> Fix For: 2.4, 1.16
>
>
> Cannot run job worker! - error while running multiple crawling jobs in 
> parallel
> Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
> My application is trying to execute multiple Nutch jobs in parallel using 
> Nutch REST services. The application injects a seed URL and then repeats the 
> GENERATE/FETCH/PARSE/UPDATEDB sequence a requested number of times to emulate 
> continuous crawling (each step in the sequence is executed upon successful 
> completion of the previous step, then the whole sequence is repeated again). 
> Here is a brief description of the jobs:
> * Number of parallel jobs: 7
> * Each job has unique crawl id and MongoDB collection
> * Seed URL for all jobs: http://www.cnn.com
> * Regex URL filters for all jobs: 
> ** *"-^.\{1000,\}$"* - exclude very long URLs
> ** *"+."* - include the rest
> The jobs are started as expected but at some point some of them fail with 
> "Cannot run job worker!" error. For more details see job status and 
> hadoop.log lines below.
> In debugger during crash I noticed that a single instance of 
> SelectorEntryComparator (definition is nested in GeneratorJob) is shared 
> across multiple reducer tasks. The class inherits from 
> org.apache.hadoop.io.WritableComparator, which has a few members unprotected 
> against concurrent usage. At some point multiple threads may access those members 
> in a WritableComparator.compare call. I modified SelectorEntryComparator and it 
> seems to have solved the problem, but I am not sure if the change is appropriate 
> and/or sufficient (covers GENERATE only?)
> Original code:
> {code:java}
> public static class SelectorEntryComparator extends WritableComparator {
> public SelectorEntryComparator() {
>   super(SelectorEntry.class, true);
> }
> }
> {code}
> Modified code:
> {code:java}
> public static class SelectorEntryComparator extends WritableComparator {
> public SelectorEntryComparator() {
>   super(SelectorEntry.class, true);
> }
> 
> @Override
> synchronized public int compare(byte[] b1, int s1, int l1, byte[] b2, int 
> s2, int l2) {
>   return super.compare(b1, s1, l1, b2, s2, l2);
> }
> }
> {code}
> Example of failed job status:
> {code}
> {
> "id" : "parallel_0-65ff2f1b-382e-4eb2-a813-a0370b84d5b6-GENERATE-1961495833",
> "type" : "GENERATE",
> "confId" : "65ff2f1b-382e-4eb2-a813-a0370b84d5b6",
> "args" : { "topN" : "100" },
> "result" : null,
> "state" : "FAILED",
> "msg" : "ERROR: java.lang.RuntimeException: job failed: 
> name=[parallel_0]generate: 1498059912-1448058551, 
> jobid=job_local1142434549_0036",
> "crawlId" : "parallel_0"
> }
> {code}
> Lines from hadoop.log
> {code}
> 2017-06-21 11:45:13,021 WARN  mapred.LocalJobRunner - job_local1142434549_0036
> java.lang.Exception: java.lang.RuntimeException: java.io.EOFException
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.RuntimeException: java.io.EOFException
> at 
> org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:164)
> at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:158)
> at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:121)
> at 
> org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:302)
> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:170)
> at 
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
> at 
> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
> 

[jira] [Commented] (NUTCH-2395) Cannot run job worker! - error while running multiple crawling jobs in parallel

2019-01-07 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735613#comment-16735613
 ] 

Sebastian Nagel commented on NUTCH-2395:


Also affects 1.x when Generator is used from Nutch server in parallel.

> Cannot run job worker! - error while running multiple crawling jobs in 
> parallel
> ---
>
> Key: NUTCH-2395
> URL: https://issues.apache.org/jira/browse/NUTCH-2395
> Project: Nutch
>  Issue Type: Bug
>  Components: nutch server
>Affects Versions: 2.3.1, 1.14
> Environment: Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
>Reporter: Vyacheslav Pascarel
>Priority: Major
> Fix For: 2.4, 1.16
>
>
> Cannot run job worker! - error while running multiple crawling jobs in 
> parallel
> Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
> My application is trying to execute multiple Nutch jobs in parallel using 
> Nutch REST services. The application injects a seed URL and then repeats the 
> GENERATE/FETCH/PARSE/UPDATEDB sequence a requested number of times to emulate 
> continuous crawling (each step in the sequence is executed upon successful 
> completion of the previous step, then the whole sequence is repeated again). 
> Here is a brief description of the jobs:
> * Number of parallel jobs: 7
> * Each job has unique crawl id and MongoDB collection
> * Seed URL for all jobs: http://www.cnn.com
> * Regex URL filters for all jobs: 
> ** *"-^.\{1000,\}$"* - exclude very long URLs
> ** *"+."* - include the rest
> The jobs are started as expected but at some point some of them fail with 
> "Cannot run job worker!" error. For more details see job status and 
> hadoop.log lines below.
> In debugger during crash I noticed that a single instance of 
> SelectorEntryComparator (definition is nested in GeneratorJob) is shared 
> across multiple reducer tasks. The class inherits from 
> org.apache.hadoop.io.WritableComparator, which has a few members unprotected 
> against concurrent usage. At some point multiple threads may access those members 
> in a WritableComparator.compare call. I modified SelectorEntryComparator and it 
> seems to have solved the problem, but I am not sure if the change is appropriate 
> and/or sufficient (covers GENERATE only?)
> Original code:
> {code:java}
> public static class SelectorEntryComparator extends WritableComparator {
> public SelectorEntryComparator() {
>   super(SelectorEntry.class, true);
> }
> }
> {code}
> Modified code:
> {code:java}
> public static class SelectorEntryComparator extends WritableComparator {
> public SelectorEntryComparator() {
>   super(SelectorEntry.class, true);
> }
> 
> @Override
> synchronized public int compare(byte[] b1, int s1, int l1, byte[] b2, int 
> s2, int l2) {
>   return super.compare(b1, s1, l1, b2, s2, l2);
> }
> }
> {code}
> Example of failed job status:
> {code}
> {
> "id" : "parallel_0-65ff2f1b-382e-4eb2-a813-a0370b84d5b6-GENERATE-1961495833",
> "type" : "GENERATE",
> "confId" : "65ff2f1b-382e-4eb2-a813-a0370b84d5b6",
> "args" : { "topN" : "100" },
> "result" : null,
> "state" : "FAILED",
> "msg" : "ERROR: java.lang.RuntimeException: job failed: 
> name=[parallel_0]generate: 1498059912-1448058551, 
> jobid=job_local1142434549_0036",
> "crawlId" : "parallel_0"
> }
> {code}
> Lines from hadoop.log
> {code}
> 2017-06-21 11:45:13,021 WARN  mapred.LocalJobRunner - job_local1142434549_0036
> java.lang.Exception: java.lang.RuntimeException: java.io.EOFException
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.RuntimeException: java.io.EOFException
> at 
> org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:164)
> at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:158)
> at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:121)
> at 
> org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:302)
> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:170)
> at 
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
> at 
> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at 

[jira] [Updated] (NUTCH-2395) Cannot run job worker! - error while running multiple crawling jobs in parallel

2019-01-07 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2395:
---
Component/s: generator

> Cannot run job worker! - error while running multiple crawling jobs in 
> parallel
> ---
>
> Key: NUTCH-2395
> URL: https://issues.apache.org/jira/browse/NUTCH-2395
> Project: Nutch
>  Issue Type: Bug
>  Components: generator, nutch server
>Affects Versions: 2.3.1, 1.14
> Environment: Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
>Reporter: Vyacheslav Pascarel
>Priority: Major
> Fix For: 2.4, 1.16
>
>
> Cannot run job worker! - error while running multiple crawling jobs in 
> parallel
> Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
> My application is trying to execute multiple Nutch jobs in parallel using 
> Nutch REST services. The application injects a seed URL and then repeats the 
> GENERATE/FETCH/PARSE/UPDATEDB sequence a requested number of times to emulate 
> continuous crawling (each step in the sequence is executed upon successful 
> completion of the previous step, then the whole sequence is repeated again). 
> Here is a brief description of the jobs:
> * Number of parallel jobs: 7
> * Each job has unique crawl id and MongoDB collection
> * Seed URL for all jobs: http://www.cnn.com
> * Regex URL filters for all jobs: 
> ** *"-^.\{1000,\}$"* - exclude very long URLs
> ** *"+."* - include the rest
> The jobs are started as expected but at some point some of them fail with 
> "Cannot run job worker!" error. For more details see job status and 
> hadoop.log lines below.
> In debugger during crash I noticed that a single instance of 
> SelectorEntryComparator (definition is nested in GeneratorJob) is shared 
> across multiple reducer tasks. The class inherits from 
> org.apache.hadoop.io.WritableComparator, which has a few members unprotected 
> against concurrent usage. At some point multiple threads may access those members 
> in a WritableComparator.compare call. I modified SelectorEntryComparator and it 
> seems to have solved the problem, but I am not sure if the change is appropriate 
> and/or sufficient (covers GENERATE only?)
> Original code:
> {code:java}
> public static class SelectorEntryComparator extends WritableComparator {
> public SelectorEntryComparator() {
>   super(SelectorEntry.class, true);
> }
> }
> {code}
> Modified code:
> {code:java}
> public static class SelectorEntryComparator extends WritableComparator {
> public SelectorEntryComparator() {
>   super(SelectorEntry.class, true);
> }
> 
> @Override
> synchronized public int compare(byte[] b1, int s1, int l1, byte[] b2, int 
> s2, int l2) {
>   return super.compare(b1, s1, l1, b2, s2, l2);
> }
> }
> {code}
> Example of failed job status:
> {code}
> {
> "id" : "parallel_0-65ff2f1b-382e-4eb2-a813-a0370b84d5b6-GENERATE-1961495833",
> "type" : "GENERATE",
> "confId" : "65ff2f1b-382e-4eb2-a813-a0370b84d5b6",
> "args" : { "topN" : "100" },
> "result" : null,
> "state" : "FAILED",
> "msg" : "ERROR: java.lang.RuntimeException: job failed: 
> name=[parallel_0]generate: 1498059912-1448058551, 
> jobid=job_local1142434549_0036",
> "crawlId" : "parallel_0"
> }
> {code}
> Lines from hadoop.log
> {code}
> 2017-06-21 11:45:13,021 WARN  mapred.LocalJobRunner - job_local1142434549_0036
> java.lang.Exception: java.lang.RuntimeException: java.io.EOFException
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.RuntimeException: java.io.EOFException
> at 
> org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:164)
> at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:158)
> at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:121)
> at 
> org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:302)
> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:170)
> at 
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
> at 
> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 

[jira] [Updated] (NUTCH-2395) Cannot run job worker! - error while running multiple crawling jobs in parallel

2019-01-07 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2395:
---
Fix Version/s: 1.16

> Cannot run job worker! - error while running multiple crawling jobs in 
> parallel
> ---
>
> Key: NUTCH-2395
> URL: https://issues.apache.org/jira/browse/NUTCH-2395
> Project: Nutch
>  Issue Type: Bug
>  Components: nutch server
>Affects Versions: 2.3.1, 1.14
> Environment: Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
>Reporter: Vyacheslav Pascarel
>Priority: Major
> Fix For: 2.4, 1.16
>
>
> Cannot run job worker! - error while running multiple crawling jobs in 
> parallel
> Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
> My application is trying to execute multiple Nutch jobs in parallel using 
> Nutch REST services. The application injects a seed URL and then repeats the 
> GENERATE/FETCH/PARSE/UPDATEDB sequence a requested number of times to emulate 
> continuous crawling (each step in the sequence is executed upon successful 
> completion of the previous step, then the whole sequence is repeated again). 
> Here is a brief description of the jobs:
> * Number of parallel jobs: 7
> * Each job has unique crawl id and MongoDB collection
> * Seed URL for all jobs: http://www.cnn.com
> * Regex URL filters for all jobs: 
> ** *"-^.\{1000,\}$"* - exclude very long URLs
> ** *"+."* - include the rest
> The jobs are started as expected but at some point some of them fail with 
> "Cannot run job worker!" error. For more details see job status and 
> hadoop.log lines below.
> In debugger during crash I noticed that a single instance of 
> SelectorEntryComparator (definition is nested in GeneratorJob) is shared 
> across multiple reducer tasks. The class inherits from 
> org.apache.hadoop.io.WritableComparator, which has a few members unprotected 
> against concurrent usage. At some point multiple threads may access those members 
> in a WritableComparator.compare call. I modified SelectorEntryComparator and it 
> seems to have solved the problem, but I am not sure if the change is appropriate 
> and/or sufficient (covers GENERATE only?)
> Original code:
> {code:java}
> public static class SelectorEntryComparator extends WritableComparator {
> public SelectorEntryComparator() {
>   super(SelectorEntry.class, true);
> }
> }
> {code}
> Modified code:
> {code:java}
> public static class SelectorEntryComparator extends WritableComparator {
> public SelectorEntryComparator() {
>   super(SelectorEntry.class, true);
> }
> 
> @Override
> synchronized public int compare(byte[] b1, int s1, int l1, byte[] b2, int 
> s2, int l2) {
>   return super.compare(b1, s1, l1, b2, s2, l2);
> }
> }
> {code}
> Example of failed job status:
> {code}
> {
> "id" : "parallel_0-65ff2f1b-382e-4eb2-a813-a0370b84d5b6-GENERATE-1961495833",
> "type" : "GENERATE",
> "confId" : "65ff2f1b-382e-4eb2-a813-a0370b84d5b6",
> "args" : { "topN" : "100" },
> "result" : null,
> "state" : "FAILED",
> "msg" : "ERROR: java.lang.RuntimeException: job failed: 
> name=[parallel_0]generate: 1498059912-1448058551, 
> jobid=job_local1142434549_0036",
> "crawlId" : "parallel_0"
> }
> {code}
> Lines from hadoop.log
> {code}
> 2017-06-21 11:45:13,021 WARN  mapred.LocalJobRunner - job_local1142434549_0036
> java.lang.Exception: java.lang.RuntimeException: java.io.EOFException
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.RuntimeException: java.io.EOFException
> at 
> org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:164)
> at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:158)
> at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:121)
> at 
> org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:302)
> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:170)
> at 
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
> at 
> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> 

[jira] [Updated] (NUTCH-2395) Cannot run job worker! - error while running multiple crawling jobs in parallel

2019-01-07 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2395:
---
Affects Version/s: 1.14

> Cannot run job worker! - error while running multiple crawling jobs in 
> parallel
> ---
>
> Key: NUTCH-2395
> URL: https://issues.apache.org/jira/browse/NUTCH-2395
> Project: Nutch
>  Issue Type: Bug
>  Components: nutch server
>Affects Versions: 2.3.1, 1.14
> Environment: Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
>Reporter: Vyacheslav Pascarel
>Priority: Major
> Fix For: 2.4
>
>
> Cannot run job worker! - error while running multiple crawling jobs in 
> parallel
> Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
> My application is trying to execute multiple Nutch jobs in parallel using 
> Nutch REST services. The application injects a seed URL and then repeats the 
> GENERATE/FETCH/PARSE/UPDATEDB sequence a requested number of times to emulate 
> continuous crawling (each step in the sequence is executed upon successful 
> completion of the previous step, then the whole sequence is repeated again). 
> Here is a brief description of the jobs:
> * Number of parallel jobs: 7
> * Each job has unique crawl id and MongoDB collection
> * Seed URL for all jobs: http://www.cnn.com
> * Regex URL filters for all jobs: 
> ** *"-^.\{1000,\}$"* - exclude very long URLs
> ** *"+."* - include the rest
> The jobs are started as expected but at some point some of them fail with 
> "Cannot run job worker!" error. For more details see job status and 
> hadoop.log lines below.
> In debugger during crash I noticed that a single instance of 
> SelectorEntryComparator (definition is nested in GeneratorJob) is shared 
> across multiple reducer tasks. The class inherits from 
> org.apache.hadoop.io.WritableComparator, which has a few members unprotected 
> against concurrent usage. At some point multiple threads may access those members 
> in a WritableComparator.compare call. I modified SelectorEntryComparator and it 
> seems to have solved the problem, but I am not sure if the change is appropriate 
> and/or sufficient (covers GENERATE only?)
> Original code:
> {code:java}
> public static class SelectorEntryComparator extends WritableComparator {
> public SelectorEntryComparator() {
>   super(SelectorEntry.class, true);
> }
> }
> {code}
> Modified code:
> {code:java}
> public static class SelectorEntryComparator extends WritableComparator {
> public SelectorEntryComparator() {
>   super(SelectorEntry.class, true);
> }
> 
> @Override
> synchronized public int compare(byte[] b1, int s1, int l1, byte[] b2, int 
> s2, int l2) {
>   return super.compare(b1, s1, l1, b2, s2, l2);
> }
> }
> {code}
> Example of failed job status:
> {code}
> {
> "id" : "parallel_0-65ff2f1b-382e-4eb2-a813-a0370b84d5b6-GENERATE-1961495833",
> "type" : "GENERATE",
> "confId" : "65ff2f1b-382e-4eb2-a813-a0370b84d5b6",
> "args" : { "topN" : "100" },
> "result" : null,
> "state" : "FAILED",
> "msg" : "ERROR: java.lang.RuntimeException: job failed: 
> name=[parallel_0]generate: 1498059912-1448058551, 
> jobid=job_local1142434549_0036",
> "crawlId" : "parallel_0"
> }
> {code}
> Lines from hadoop.log
> {code}
> 2017-06-21 11:45:13,021 WARN  mapred.LocalJobRunner - job_local1142434549_0036
> java.lang.Exception: java.lang.RuntimeException: java.io.EOFException
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.RuntimeException: java.io.EOFException
> at 
> org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:164)
> at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:158)
> at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:121)
> at 
> org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:302)
> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:170)
> at 
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
> at 
> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> 

[jira] [Updated] (NUTCH-1623) Implement file.content.ignored function

2019-01-07 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1623:
---
Fix Version/s: 2.5

> Implement file.content.ignored function
> ---
>
> Key: NUTCH-1623
> URL: https://issues.apache.org/jira/browse/NUTCH-1623
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb, fetcher
>Affects Versions: 2.2, 2.2.1
>Reporter: Osy
>Priority: Major
> Fix For: 2.5
>
>
> For Nutch 2.2.1 in nutch-default.xml there is a description for this 
> functionality (!! NOT IMPLEMENTED YET !!):
> If true, no file content will be saved during fetch.
> And it is probably what we want to set most of the time, since file:// URLs
> are meant to be local and we can always use them directly at parsing
> and indexing stages. Otherwise file contents will be saved.
> Exactly what I need.
> Thanks
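
Once implemented, the property would presumably be toggled in nutch-site.xml
like any other setting; a hypothetical sketch:

{code:xml}
<property>
  <name>file.content.ignored</name>
  <value>true</value>
  <description>If true, no file content will be saved during fetch.</description>
</property>
{code}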



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (NUTCH-2683) DeduplicationJob: add option to prefer https:// over http://

2019-01-07 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2683:
--

 Summary: DeduplicationJob: add option to prefer https:// over 
http://
 Key: NUTCH-2683
 URL: https://issues.apache.org/jira/browse/NUTCH-2683
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb
Affects Versions: 1.15
Reporter: Sebastian Nagel
 Fix For: 1.16


The deduplication job allows keeping the shortest URL as the "best" URL of a 
set of duplicates, marking all longer ones as duplicates. Recently search 
engines started to penalize non-https pages by [giving https pages a higher 
rank|https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html] 
and [marking http as 
insecure|https://www.blog.google/products/chrome/milestone-chrome-security-marking-http-not-secure/].

If URLs are identical except for the protocol the deduplication job should be 
able to prefer https:// over http:// URLs, although the latter ones are shorter 
by one character. Of course, this should be configurable and in addition to 
existing preferences (length, score and fetch time) to select the "best" URL 
among duplicates.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)