Hello,

I am writing an application that performs web site crawling using Nutch REST 
services. The application:


1. Injects seed URLs once.

2. Repeats the GENERATE/FETCH/PARSE/UPDATEDB sequence a requested number of
times to emulate continuous crawling (each step in the sequence is executed
upon successful completion of the previous step, then the sequence is repeated).
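
For reference, the loop described above can be sketched roughly as follows. This is a minimal sketch, not the actual application code: `submit_job` and `get_job` are hypothetical callables standing in for whatever REST client wrapper submits a job and reads its status, and the "INJECT" job type and the "FINISHED" state name are assumptions (only "FAILED" appears in the status shown in this message).

```python
import time

# One crawl round; the order matters because each step consumes the previous one's output.
STEPS = ("GENERATE", "FETCH", "PARSE", "UPDATEDB")


def run_step(submit_job, get_job, job_type, crawl_id, conf_id, args=None, poll_seconds=1.0):
    """Submit one job and block until it finishes or fails.

    submit_job(payload) -> job id and get_job(job_id) -> status dict are
    hypothetical callables supplied by the caller (assumption).
    """
    payload = {"type": job_type, "crawlId": crawl_id, "confId": conf_id, "args": args or {}}
    job_id = submit_job(payload)
    while True:
        status = get_job(job_id)
        if status["state"] == "FINISHED":  # assumed name of the success state
            return status
        if status["state"] == "FAILED":
            raise RuntimeError(f"{job_type} failed for {crawl_id}: {status.get('msg')}")
        time.sleep(poll_seconds)


def crawl(submit_job, get_job, crawl_id, conf_id, rounds, step_args=None, poll_seconds=1.0):
    """Inject once, then repeat the four-step sequence `rounds` times."""
    step_args = step_args or {}
    run_step(submit_job, get_job, "INJECT", crawl_id, conf_id,
             step_args.get("INJECT"), poll_seconds)
    for _ in range(rounds):
        for step in STEPS:
            # A failed step raises, so the next step never starts after a failure.
            run_step(submit_job, get_job, step, crawl_id, conf_id,
                     step_args.get(step), poll_seconds)
```

A call such as `crawl(submit, get, "parallel_0", conf_id, rounds=10, step_args={"GENERATE": {"topN": "100"}})` would mirror the job arguments shown in the failed status further down.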

The application can also run multiple crawls with different crawl IDs at the
same time. That seems to put stress on Nutch, and it starts to fail with the
following error: "Cannot run job worker!". The parallel crawls start normally,
with the corresponding Nutch jobs finishing as expected, but eventually the
jobs start to fail.
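
To make the parallel setup concrete, here is a rough sketch of how several crawls could be driven at once. It is an illustration under assumptions rather than the real application: `run_crawl` is a hypothetical function that runs one complete crawl for a given crawl ID, and the thread-per-crawl layout is an assumption about how the crawls are scheduled.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(crawl_ids, run_crawl):
    """Run one crawl per ID concurrently and collect the outcome of each.

    run_crawl(crawl_id) is a hypothetical callable (assumption) that runs a
    whole crawl and raises if any of its jobs fails.
    """
    results = {}
    # One worker per crawl so that all crawls really overlap in time.
    with ThreadPoolExecutor(max_workers=max(1, len(crawl_ids))) as pool:
        futures = {cid: pool.submit(run_crawl, cid) for cid in crawl_ids}
        for cid, future in futures.items():
            try:
                results[cid] = ("ok", future.result())
            except Exception as exc:
                # Keep going so one failed crawl does not hide the others.
                results[cid] = ("failed", str(exc))
    return results
```

With seven crawls, the IDs could be built as `[f"parallel_{i}" for i in range(7)]`, matching the "parallel_0" crawl ID seen in the failed job below.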

Here are the crawl settings (they work fine for non-parallel crawls):


- Seed URL: http://www.cnn.com

- Regex URL filters: "-^.{1000,}$" and "+." (1. exclude very long URLs;
2. include the rest)

- fetcher.threads.fetch in nutch-site.xml: 2 (a smaller number seems to
reproduce the problem faster; a value of 100 takes longer)

- Number of parallel crawls: 7
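
As an illustration of how the two filter rules above are meant to combine, here is a small sketch that emulates first-match-wins filtering. It is not Nutch code: the `accepts` helper is hypothetical, and it uses Python's `re` module in place of Java regular expressions, which behave the same way for these two patterns.

```python
import re

# The two rules from the list above, in order: sign, then pattern.
RULES = [
    ("-", re.compile(r"^.{1000,}$")),  # exclude URLs of 1000 characters or more
    ("+", re.compile(r".")),           # include everything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    # No rule matched at all: treat the URL as rejected.
    return False
```

Under these rules the seed URL is accepted, while any URL of 1000 or more characters is rejected before the catch-all "+." rule is reached.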

Here is an example of a failed job status (in this case the GENERATE step
failed, but I have seen PARSE fail with the same error in other test runs):

{
"id" : "parallel_0-65ff2f1b-382e-4eb2-a813-a0370b84d5b6-GENERATE-1961495833",
"type" : "GENERATE",
"confId" : "65ff2f1b-382e-4eb2-a813-a0370b84d5b6",
"args" : { "topN" : "100" },
"result" : null,
"state" : "FAILED",
"msg" : "ERROR: java.lang.RuntimeException: job failed: name=[parallel_0]generate: 1498059912-1448058551, jobid=job_local1142434549_0036",
"crawlId" : "parallel_0"
}

Lines from hadoop.log:

2017-06-21 11:45:13,021 WARN  mapred.LocalJobRunner - job_local1142434549_0036
java.lang.Exception: java.lang.RuntimeException: java.io.EOFException
                at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
                at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.lang.RuntimeException: java.io.EOFException
                at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:164)
                at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:158)
                at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:121)
                at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:302)
                at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:170)
                at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
                at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
                at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
                at java.io.DataInputStream.readFully(DataInputStream.java:197)
                at org.apache.hadoop.io.Text.readString(Text.java:466)
                at org.apache.hadoop.io.Text.readString(Text.java:457)
                at org.apache.nutch.crawl.GeneratorJob$SelectorEntry.readFields(GeneratorJob.java:92)
                at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:158)
                ... 12 more
2017-06-21 11:45:13,058 WARN  mapred.LocalJobRunner - job_local1976432650_0038
java.lang.Exception: java.lang.RuntimeException: java.io.EOFException
                at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
                at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.RuntimeException: java.io.EOFException
                at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:164)
                at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1245)
                at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:99)
                at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:126)
                at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:63)
                at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1575)
                at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1462)
                at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:700)
                at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
                at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
                at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
                at java.io.DataInputStream.readByte(DataInputStream.java:267)
                at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
                at org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348)
                at org.apache.hadoop.io.Text.readString(Text.java:464)
                at org.apache.hadoop.io.Text.readString(Text.java:457)
                at org.apache.nutch.crawl.GeneratorJob$SelectorEntry.readFields(GeneratorJob.java:92)
                at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:158)
                ... 15 more

...

2017-06-21 11:45:13,372 ERROR impl.JobWorker - Cannot run job worker!
java.lang.RuntimeException: job failed: name=[parallel_0]generate: 1498059912-1448058551, jobid=job_local1142434549_0036
                at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
                at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:227)
                at org.apache.nutch.api.impl.JobWorker.run(JobWorker.java:64)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                at java.lang.Thread.run(Thread.java:745)


Regards,

Vyacheslav Pascarel
