[jira] [Commented] (NUTCH-2395) Cannot run job worker! - error while running multiple crawling jobs in parallel

2020-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018897#comment-17018897
 ] 

ASF GitHub Bot commented on NUTCH-2395:
---

sebastian-nagel commented on pull request #197: NUTCH-2395 Cannot run job 
worker! - error while running multiple crawling jobs in parallel
URL: https://github.com/apache/nutch/pull/197
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Cannot run job worker! - error while running multiple crawling jobs in 
> parallel
> ---
>
> Key: NUTCH-2395
> URL: https://issues.apache.org/jira/browse/NUTCH-2395
> Project: Nutch
>  Issue Type: Bug
>  Components: generator, nutch server
>Affects Versions: 2.3.1
> Environment: Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
>Reporter: Vyacheslav Pascarel
>Priority: Major
> Fix For: 2.5
>
>
> Cannot run job worker! - error while running multiple crawling jobs in 
> parallel
> Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
> My application is trying to execute multiple Nutch jobs in parallel using 
> Nutch REST services. The application injects a seed URL and then repeats the 
> GENERATE/FETCH/PARSE/UPDATEDB sequence a requested number of times to emulate 
> continuous crawling (each step in the sequence is executed upon successful 
> completion of the previous step, then the whole sequence is repeated again). 
> Here is a brief description of the jobs:
> * Number of parallel jobs: 7
> * Each job has a unique crawl id and MongoDB collection
> * Seed URL for all jobs: http://www.cnn.com
> * Regex URL filters for all jobs: 
> ** *"-^.\{1000,\}$"* - exclude very long URLs
> ** *"+."* - include the rest
> The jobs are started as expected but at some point some of them fail with a 
> "Cannot run job worker!" error. For more details see the job status and 
> hadoop.log lines below.
> In the debugger, during the crash, I noticed that a single instance of 
> SelectorEntryComparator (defined as a nested class in GeneratorJob) is shared 
> across multiple reducer tasks. The class inherits from 
> org.apache.hadoop.io.WritableComparator, which has a few members that are not 
> protected against concurrent access. At some point multiple threads may 
> access those members in a WritableComparator.compare call. I modified 
> SelectorEntryComparator and it seems to have solved the problem, but I am not 
> sure whether the change is appropriate and/or sufficient (does it cover 
> GENERATE only?).
> Original code:
> {code:java}
> public static class SelectorEntryComparator extends WritableComparator {
>   public SelectorEntryComparator() {
>     super(SelectorEntry.class, true);
>   }
> }
> {code}
> Modified code:
> {code:java}
> public static class SelectorEntryComparator extends WritableComparator {
>   public SelectorEntryComparator() {
>     super(SelectorEntry.class, true);
>   }
>
>   @Override
>   synchronized public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
>     return super.compare(b1, s1, l1, b2, s2, l2);
>   }
> }
> {code}
> Example of failed job status:
> {code}
> {
> "id" : "parallel_0-65ff2f1b-382e-4eb2-a813-a0370b84d5b6-GENERATE-1961495833",
> "type" : "GENERATE",
> "confId" : "65ff2f1b-382e-4eb2-a813-a0370b84d5b6",
> "args" : { "topN" : "100" },
> "result" : null,
> "state" : "FAILED",
> "msg" : "ERROR: java.lang.RuntimeException: job failed: 
> name=[parallel_0]generate: 1498059912-1448058551, 
> jobid=job_local1142434549_0036",
> "crawlId" : "parallel_0"
> }
> {code}
> Lines from hadoop.log
> {code}
> 2017-06-21 11:45:13,021 WARN  mapred.LocalJobRunner - job_local1142434549_0036
> java.lang.Exception: java.lang.RuntimeException: java.io.EOFException
> at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.RuntimeException: java.io.EOFException
> at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:164)
> at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:158)
> at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:121)
> at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:302)
> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:170)
> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
> at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> {code}

[jira] [Commented] (NUTCH-2395) Cannot run job worker! - error while running multiple crawling jobs in parallel

2020-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018896#comment-17018896
 ] 

ASF GitHub Bot commented on NUTCH-2395:
---

sebastian-nagel commented on issue #197: NUTCH-2395 Cannot run job worker! - 
error while running multiple crawling jobs in parallel
URL: https://github.com/apache/nutch/pull/197#issuecomment-575999412
 
 
   Closing unmerged PR for 2.x (NUTCH-2395 also has been closed). Thanks, 
@lewismc!
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Cannot run job worker! - error while running multiple crawling jobs in 
> parallel
> ---
>
> Key: NUTCH-2395
> URL: https://issues.apache.org/jira/browse/NUTCH-2395
> Project: Nutch
>  Issue Type: Bug
>  Components: generator, nutch server
>Affects Versions: 2.3.1
> Environment: Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
>Reporter: Vyacheslav Pascarel
>Priority: Major
> Fix For: 2.5
>

[jira] [Commented] (NUTCH-2395) Cannot run job worker! - error while running multiple crawling jobs in parallel

2019-01-07 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735620#comment-16735620
 ] 

Sebastian Nagel commented on NUTCH-2395:


Sorry, 1.x is safe because its comparator inherits from FloatWritable.Comparator, 
which is thread-safe (unlike WritableComparator, it does not use any instance 
fields when reading values).
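
For illustration, a rough sketch of why that comparator is thread-safe (paraphrased from memory of the Hadoop sources, not quoted verbatim): FloatWritable.Comparator decodes both values straight from the byte arrays and keeps no per-call state in instance fields, so a single shared instance can be called from many threads:

{code:java}
// Roughly what FloatWritable.Comparator does: read both floats directly from
// the raw bytes (readFloat is a static helper inherited from
// WritableComparator) and compare them, without touching any shared buffers
// or key objects.
public static class Comparator extends WritableComparator {
  public Comparator() {
    super(FloatWritable.class);
  }

  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    float thisValue = readFloat(b1, s1);
    float thatValue = readFloat(b2, s2);
    return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
  }
}
{code}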

> Cannot run job worker! - error while running multiple crawling jobs in 
> parallel
> ---
>
> Key: NUTCH-2395
> URL: https://issues.apache.org/jira/browse/NUTCH-2395
> Project: Nutch
>  Issue Type: Bug
>  Components: generator, nutch server
>Affects Versions: 2.3.1, 1.14
> Environment: Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
>Reporter: Vyacheslav Pascarel
>Priority: Major
> Fix For: 2.4, 1.16
>

[jira] [Commented] (NUTCH-2395) Cannot run job worker! - error while running multiple crawling jobs in parallel

2019-01-07 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735613#comment-16735613
 ] 

Sebastian Nagel commented on NUTCH-2395:


This also affects 1.x when the Generator is run in parallel from the Nutch server.

> Cannot run job worker! - error while running multiple crawling jobs in 
> parallel
> ---
>
> Key: NUTCH-2395
> URL: https://issues.apache.org/jira/browse/NUTCH-2395
> Project: Nutch
>  Issue Type: Bug
>  Components: nutch server
>Affects Versions: 2.3.1, 1.14
> Environment: Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
>Reporter: Vyacheslav Pascarel
>Priority: Major
> Fix For: 2.4, 1.16
>

[jira] [Commented] (NUTCH-2395) Cannot run job worker! - error while running multiple crawling jobs in parallel

2017-11-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273339#comment-16273339
 ] 

ASF GitHub Bot commented on NUTCH-2395:
---

sebastian-nagel commented on a change in pull request #197: NUTCH-2395 Cannot 
run job worker! - error while running multiple crawling jobs in parallel
URL: https://github.com/apache/nutch/pull/197#discussion_r154188741
 
 

 ##
 File path: src/java/org/apache/nutch/crawl/GeneratorJob.java
 ##
 @@ -153,6 +153,11 @@ public void set(String url, float score) {
 public SelectorEntryComparator() {
   super(SelectorEntry.class, true);
 }
+
+@Override
+synchronized public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
 
 Review comment:
   Making a method synchronized penalizes every call, even in the default case 
of a single reducer running per JVM. Shouldn't the right solution be to 
implement a thread-safe comparator? Implementing a thread-safe version of 
compare should be all that is needed:
   ```
   @Override
   public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
     SelectorEntry key1 = new SelectorEntry();
     SelectorEntry key2 = new SelectorEntry();
     DataInputBuffer buffer = new DataInputBuffer();

     try {
       buffer.reset(b1, s1, l1);
       key1.readFields(buffer);

       buffer.reset(b2, s2, l2);
       key2.readFields(buffer);
     } catch (IOException e) {
       throw new RuntimeException(e);
     }

     return key1.compareTo(key2);
   }
   ```
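
For reference, a rough sketch of how the complete nested class might look with the suggestion above applied (an illustration only, not the merged code; it assumes java.io.IOException and org.apache.hadoop.io.DataInputBuffer are imported in GeneratorJob.java):

```java
// Sketch: SelectorEntryComparator with a thread-safe raw-bytes compare().
// Fresh SelectorEntry and DataInputBuffer objects are created per call, so
// there is no shared mutable state between concurrently running reducers.
public static class SelectorEntryComparator extends WritableComparator {
  public SelectorEntryComparator() {
    super(SelectorEntry.class, true);
  }

  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    SelectorEntry key1 = new SelectorEntry();
    SelectorEntry key2 = new SelectorEntry();
    DataInputBuffer buffer = new DataInputBuffer();
    try {
      buffer.reset(b1, s1, l1);
      key1.readFields(buffer);
      buffer.reset(b2, s2, l2);
      key2.readFields(buffer);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    return key1.compareTo(key2);
  }
}
```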


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Cannot run job worker! - error while running multiple crawling jobs in 
> parallel
> ---
>
> Key: NUTCH-2395
> URL: https://issues.apache.org/jira/browse/NUTCH-2395
> Project: Nutch
>  Issue Type: Bug
>  Components: nutch server
>Affects Versions: 2.3.1
> Environment: Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
>Reporter: Vyacheslav Pascarel
>

[jira] [Commented] (NUTCH-2395) Cannot run job worker! - error while running multiple crawling jobs in parallel

2017-11-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273336#comment-16273336
 ] 

ASF GitHub Bot commented on NUTCH-2395:
---

sebastian-nagel commented on a change in pull request #197: NUTCH-2395 Cannot 
run job worker! - error while running multiple crawling jobs in parallel
URL: https://github.com/apache/nutch/pull/197#discussion_r154188741
 
 

 ##
 File path: src/java/org/apache/nutch/crawl/GeneratorJob.java
 ##
 @@ -153,6 +153,11 @@ public void set(String url, float score) {
 public SelectorEntryComparator() {
   super(SelectorEntry.class, true);
 }
+
+@Override
+synchronized public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
 
 Review comment:
   Making a method synchronized penalizes every call, even in the default case 
of a single reducer running per JVM. Shouldn't the right solution be to 
implement a thread-safe comparator? Implementing a thread-safe version of 
compare should be all that is needed:
   ```
   @Override
   public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
     SelectorEntry key1 = new SelectorEntry();
     SelectorEntry key2 = new SelectorEntry();
     DataInputBuffer buffer = new DataInputBuffer();

     try {
       buffer.reset(b1, s1, l1);
       key1.readFields(buffer);

       buffer.reset(b2, s2, l2);
       key2.readFields(buffer);
     } catch (IOException e) {
       throw new RuntimeException(e);
     }

     return compare(key1, key2);
   }
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Cannot run job worker! - error while running multiple crawling jobs in 
> parallel
> ---
>
> Key: NUTCH-2395
> URL: https://issues.apache.org/jira/browse/NUTCH-2395
> Project: Nutch
>  Issue Type: Bug
>  Components: nutch server
>Affects Versions: 2.3.1
> Environment: Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
>Reporter: Vyacheslav Pascarel
>

[jira] [Commented] (NUTCH-2395) Cannot run job worker! - error while running multiple crawling jobs in parallel

2017-11-30 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273319#comment-16273319
 ] 

Sebastian Nagel commented on NUTCH-2395:


Good catch, as {{WritableComparator.define(Class c, WritableComparator comparator)}} states:
{quote}Register an optimized comparator for a WritableComparable 
implementation. Comparators registered with this method must be thread-safe.
{quote}
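
For context, a minimal sketch (assumed, the exact registration in GeneratorJob may differ) of how such an optimized comparator is registered; the registered instance is cached JVM-wide, which is why all reducer tasks running in the same local JVM end up sharing it:

{code:java}
// Sketch: registering a raw-bytes comparator for the SelectorEntry key type.
// WritableComparator.define() keeps a single instance per key class for the
// whole JVM, so that instance must be thread-safe when several jobs run in
// one JVM (e.g. under LocalJobRunner via the Nutch server).
static {
  WritableComparator.define(SelectorEntry.class, new SelectorEntryComparator());
}
{code}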

> Cannot run job worker! - error while running multiple crawling jobs in 
> parallel
> ---
>
> Key: NUTCH-2395
> URL: https://issues.apache.org/jira/browse/NUTCH-2395
> Project: Nutch
>  Issue Type: Bug
>  Components: nutch server
>Affects Versions: 2.3.1
> Environment: Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
>Reporter: Vyacheslav Pascarel
>

[jira] [Commented] (NUTCH-2395) Cannot run job worker! - error while running multiple crawling jobs in parallel

2017-07-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16074041#comment-16074041
 ] 

ASF GitHub Bot commented on NUTCH-2395:
---

lewismc opened a new pull request #197: NUTCH-2395 Cannot run job worker! - 
error while running multiple crawling jobs in parallel
URL: https://github.com/apache/nutch/pull/197
 
 
   This PR addresses https://issues.apache.org/jira/browse/NUTCH-2395
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Cannot run job worker! - error while running multiple crawling jobs in 
> parallel
> ---
>
> Key: NUTCH-2395
> URL: https://issues.apache.org/jira/browse/NUTCH-2395
> Project: Nutch
>  Issue Type: Bug
>  Components: nutch server
>Affects Versions: 2.3.1
> Environment: Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
>Reporter: Vyacheslav Pascarel
>