Hi Jason,

this looks like a library dependency version conflict, probably between httpcore and httpclient. The classes on top of the stack belong to these libs:

  org.apache.http.impl.io.DefaultHttpRequestWriterFactory      -> httpcore
  org.apache.http.impl.conn.ManagedHttpClientConnectionFactory -> httpclient

You mentioned that indexing to Solr works in local mode. Is it possible that the mapreduce tasks pick up a wrong httpcore (or httpclient) lib? Strictly speaking, they should use the ones shipped in apache-nutch-1.11.job under classes/plugins/indexer-solr/. We know there are problems in this area because the plugin class loader asks its parent first; see [1] for the most recent discussion.

Can you try to add -verbose:class so that you can see in the logs which jar each class is loaded from? Sorry, I haven't tried this in (pseudo-)distributed mode yet. According to the documentation it should be possible to set this option via "mapred.child.java.opts" in your mapred-site.xml (check also the other *.java.opts properties).
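A minimal sketch of what that could look like in mapred-site.xml (the -Xmx values are only placeholders, keep whatever heap settings you already use; on Hadoop 2 the map/reduce-specific properties take precedence over the older mapred.child.java.opts):

  <!-- mapred-site.xml: add -verbose:class to the child JVM options -->
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1024m -verbose:class</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx1024m -verbose:class</value>
  </property>

The task logs should then contain lines roughly of the form
"[Loaded org.apache.http.impl.io.DefaultHttpRequestWriterFactory from file:/...jar]",
which tell you which jar the conflicting classes are actually coming from.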
Cheers,
Sebastian

[1] https://issues.apache.org/jira/browse/NUTCH-2191

On 01/23/2016 04:09 PM, Jason S wrote:
> I'm not sure if it is ok to attach files to a list email; if anyone wants
> to look at some log files, they're here:
>
> https://s3.amazonaws.com/nutch-hadoop-error/hadoop-nutch-error.tgz
>
> This crawl was done on Ubuntu 15.10 and OpenJDK 8; however, I have also
> had the error with Ubuntu 14, OpenJDK 7 and Oracle JDK 7, with Hadoop in
> single-server mode and on a cluster with a master and 5 slaves.
>
> This crawl had minimal changes made to the config files; only
> http.agent.name and solr.server.url were changed. Nutch was built with ant,
> "ant clean runtime".
>
> The entire log directory with a full
> inject/generate/fetch/parse/updatedb/index cycle is in there. As indicated
> in my previous messages, everything works fine until the indexer, and the
> same data indexes fine in local mode.
>
> Thanks in advance,
>
> Jason
>
> On Sat, Jan 23, 2016 at 11:43 AM, Jason S <[email protected]> wrote:
>
>> Bump.
>>
>> Is there anyone who can help me with this?
>>
>> I'm not familiar enough with the Nutch source code to label this as a bug,
>> but it seems to be the case, unless I have made some mistake being new to
>> Hadoop 2. I have been running Nutch on Hadoop 1.x for years and never had
>> any problems like this. Have I overlooked something in my setup?
>>
>> I believe the error I posted is the one causing the indexing job to fail,
>> and I can confirm quite a few things that are not causing the problem:
>>
>> -- I have used Nutch with minimal changes to the default configs, and Solr
>> with exactly the unmodified schema and solrindex-mapping files provided
>> in the config.
>>
>> -- The same error occurs on Hadoop 2.4.0, 2.4.1, and 2.7.1.
>>
>> -- Solr 4.10.2 and Solr 4.10.4 make no difference.
>>
>> -- Building Nutch and Solr with OpenJDK or Oracle JDK makes no difference.
>>
>> It seems like Nutch/Hadoop never connects to Solr before it fails; Solr
>> logging in verbose mode creates 0 lines of output when the indexer job
>> runs on Hadoop.
>>
>> With all data/settings/everything the same, it works fine in local mode.
>>
>> Short of dumping segments to local mode and indexing them that way, or
>> trying another indexer, I'm baffled.
>>
>> Many thanks if someone could help me out.
>>
>> Jason
>>
>> On Thu, Jan 21, 2016 at 10:29 PM, Jason S <[email protected]> wrote:
>>
>>> Hi Markus,
>>>
>>> I guess that is part of my question: from the data in the top-level logs,
>>> how can I tell where to look? I have spent a couple of days trying to
>>> understand Hadoop 2 logging, and I'm still not really very sure.
>>>
>>> For example, I found this error here:
>>>
>>> ~/hadoop-2.4.0/logs/userlogs/application_1453403905213_0001/container_1453403905213_0001_01_000041/syslog
>>>
>>> At first I thought the "no such field" error meant I was trying to put
>>> data into Solr where the field didn't exist in the schema, but the same
>>> data indexes fine in local mode. Also, there are no errors in the Solr logs.
>>>
>>> Thanks,
>>>
>>> Jason
>>>
>>> ### syslog error ###
>>>
>>> 2016-01-21 14:21:14,211 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
>>> 2016-01-21 14:21:14,211 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
>>> 2016-01-21 14:21:14,637 INFO [main] org.apache.nutch.indexer.anchor.AnchorIndexingFilter: Anchor deduplication is: on
>>> 2016-01-21 14:21:14,668 INFO [main] org.apache.nutch.indexer.IndexWriters: Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>> 2016-01-21 14:21:14,916 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoSuchFieldError: INSTANCE
>>>   at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<init>(DefaultHttpRequestWriterFactory.java:52)
>>>   at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<init>(DefaultHttpRequestWriterFactory.java:56)
>>>   at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<clinit>(DefaultHttpRequestWriterFactory.java:46)
>>>   at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<init>(ManagedHttpClientConnectionFactory.java:72)
>>>   at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<init>(ManagedHttpClientConnectionFactory.java:84)
>>>   at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<clinit>(ManagedHttpClientConnectionFactory.java:59)
>>>   at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$InternalConnectionFactory.<init>(PoolingHttpClientConnectionManager.java:493)
>>>   at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:149)
>>>   at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:138)
>>>   at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:114)
>>>   at org.apache.http.impl.client.HttpClientBuilder.build(HttpClientBuilder.java:726)
>>>   at org.apache.nutch.indexwriter.solr.SolrUtils.getSolrServer(SolrUtils.java:57)
>>>   at org.apache.nutch.indexwriter.solr.SolrIndexWriter.open(SolrIndexWriter.java:58)
>>>   at org.apache.nutch.indexer.IndexWriters.open(IndexWriters.java:75)
>>>   at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:39)
>>>   at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:484)
>>>   at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:414)
>>>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>>>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
>>>   at java.security.AccessController.doPrivileged(Native Method)
>>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
>>> 2016-01-21 14:21:14,927 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ReduceTask metrics system...
>>> 2016-01-21 14:21:14,928 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system stopped.
>>> 2016-01-21 14:21:14,928 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system shutdown complete.
>>>
>>>
>>> On Thu, Jan 21, 2016 at 9:47 PM, Markus Jelsma <[email protected]> wrote:
>>>
>>>> Hi Jason - these are the top-level job logs, but to really know what's
>>>> going on, we need the actual reducer task logs.
>>>> Markus
>>>>
>>>> -----Original message-----
>>>>> From: Jason S <[email protected]>
>>>>> Sent: Thursday 21st January 2016 20:35
>>>>> To: [email protected]
>>>>> Subject: Indexing Nutch 1.11 indexing Fails
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am having a problem indexing segments in Nutch 1.11 on Hadoop.
>>>>>
>>>>> The cluster seems to be configured correctly and every part of the crawl
>>>>> process is working flawlessly; however, this is my first attempt at Hadoop
>>>>> 2, so perhaps my memory settings aren't perfect. I'm also not sure where
>>>>> to look in the log files for more information.
>>>>>
>>>>> The same data can be indexed with Nutch in local mode, so I don't think it
>>>>> is a problem with the Solr configuration, and I have had Nutch 1.0.9 with
>>>>> Hadoop 1.2.1 on this same cluster and everything worked ok.
>>>>>
>>>>> Please let me know if I can send more information; I have spent several
>>>>> days working on this with no success or clue why it is happening.
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Jason
>>>>>
>>>>> ### Command ###
>>>>>
>>>>> /root/hadoop-2.4.0/bin/hadoop jar
>>>>> /root/src/apache-nutch-1.11/build/apache-nutch-1.11.job
>>>>> org.apache.nutch.indexer.IndexingJob crawl/crawldb -linkdb crawl/linkdb
>>>>> crawl/segments/20160121113335
>>>>>
>>>>> ### Error ###
>>>>>
>>>>> 16/01/21 14:20:47 INFO mapreduce.Job: map 100% reduce 19%
>>>>> 16/01/21 14:20:48 INFO mapreduce.Job: map 100% reduce 26%
>>>>> 16/01/21 14:20:48 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000001_0, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:20:48 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000002_0, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:20:48 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000000_0, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:20:49 INFO mapreduce.Job: map 100% reduce 0%
>>>>> 16/01/21 14:20:54 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000004_0, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:20:55 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000002_1, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:20:56 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000001_1, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:21:00 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000000_1, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:21:01 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000004_1, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:21:02 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000002_2, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:21:07 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000003_0, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:21:08 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000004_2, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:21:08 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000001_2, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:21:11 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000000_2, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:21:15 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000003_1, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:21:16 INFO mapreduce.Job: map 100% reduce 100%
>>>>> 16/01/21 14:21:16 INFO mapreduce.Job: Job job_1453403905213_0001 failed with state FAILED due to: Task failed task_1453403905213_0001_r_000004
>>>>> Job failed as tasks failed. failedMaps:0 failedReduces:1
>>>>>
>>>>> 16/01/21 14:21:16 INFO mapreduce.Job: Counters: 39
>>>>>   File System Counters
>>>>>     FILE: Number of bytes read=0
>>>>>     FILE: Number of bytes written=5578886
>>>>>     FILE: Number of read operations=0
>>>>>     FILE: Number of large read operations=0
>>>>>     FILE: Number of write operations=0
>>>>>     HDFS: Number of bytes read=2277523
>>>>>     HDFS: Number of bytes written=0
>>>>>     HDFS: Number of read operations=80
>>>>>     HDFS: Number of large read operations=0
>>>>>     HDFS: Number of write operations=0
>>>>>   Job Counters
>>>>>     Failed reduce tasks=15
>>>>>     Killed reduce tasks=2
>>>>>     Launched map tasks=20
>>>>>     Launched reduce tasks=17
>>>>>     Data-local map tasks=19
>>>>>     Rack-local map tasks=1
>>>>>     Total time spent by all maps in occupied slots (ms)=334664
>>>>>     Total time spent by all reduces in occupied slots (ms)=548199
>>>>>     Total time spent by all map tasks (ms)=167332
>>>>>     Total time spent by all reduce tasks (ms)=182733
>>>>>     Total vcore-seconds taken by all map tasks=167332
>>>>>     Total vcore-seconds taken by all reduce tasks=182733
>>>>>     Total megabyte-seconds taken by all map tasks=257021952
>>>>>     Total megabyte-seconds taken by all reduce tasks=561355776
>>>>>   Map-Reduce Framework
>>>>>     Map input records=18083
>>>>>     Map output records=18083
>>>>>     Map output bytes=3140643
>>>>>     Map output materialized bytes=3178436
>>>>>     Input split bytes=2812
>>>>>     Combine input records=0
>>>>>     Spilled Records=18083
>>>>>     Failed Shuffles=0
>>>>>     Merged Map outputs=0
>>>>>     GC time elapsed (ms)=1182
>>>>>     CPU time spent (ms)=56070
>>>>>     Physical memory (bytes) snapshot=6087245824
>>>>>     Virtual memory (bytes) snapshot=34655649792
>>>>>     Total committed heap usage (bytes)=5412749312
>>>>>   File Input Format Counters
>>>>>     Bytes Read=2274711
>>>>> 16/01/21 14:21:16 ERROR indexer.IndexingJob: Indexer: java.io.IOException: Job failed!
>>>>>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
>>>>>   at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
>>>>>   at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
>>>>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>>>   at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>   at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>   at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

