Hi Jason,

this looks like a library dependency version conflict, probably between httpcore and httpclient. The classes on top of the stack belong to these libs:

  org.apache.http.impl.io.DefaultHttpRequestWriterFactory      -> httpcore
  org.apache.http.impl.conn.ManagedHttpClientConnectionFactory -> httpclient

You mentioned that indexing to Solr works in local mode. Is it possible that the mapreduce tasks pick up a wrong httpcore (or httpclient) lib? Strictly speaking, they should use the ones shipped in apache-nutch-1.11.job under classes/plugins/indexer-solr/. We know there are problems in this area because the plugin class loader asks its parent first; see [1] for the most recent discussion.

Can you try to add -verbose:class so that you can see in the logs which jar each class is loaded from? Sorry, I haven't tried this in (pseudo-)distributed mode yet. According to the documentation it should be possible to set this option via "mapred.child.java.opts" in your mapred-site.xml (check also the other *.java.opts properties).
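A minimal sketch of what that could look like in mapred-site.xml (the -Xmx values are only placeholders, keep whatever heap settings you already use; on Hadoop 2 the map/reduce-specific properties take precedence over the older mapred.child.java.opts):

  <!-- mapred-site.xml: add -verbose:class to the child JVM options -->
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1024m -verbose:class</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx1024m -verbose:class</value>
  </property>

The task logs should then contain lines roughly of the form
"[Loaded org.apache.http.impl.io.DefaultHttpRequestWriterFactory from file:/...jar]",
which tell you which jar the conflicting classes are actually coming from.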
Cheers,
Sebastian

[1] https://issues.apache.org/jira/browse/NUTCH-2191

On 01/23/2016 04:09 PM, Jason S wrote:
> I'm not sure if it is ok to attach files to a list email; if anyone wants
> to look at some log files, they're here:
>
> https://s3.amazonaws.com/nutch-hadoop-error/hadoop-nutch-error.tgz
>
> This crawl was done on Ubuntu 15.10 and OpenJDK 8; however, I have also
> had the error with Ubuntu 14, OpenJDK 7 and Oracle JDK 7, with Hadoop in
> single-server mode and on a cluster with a master and 5 slaves.
>
> This crawl had minimal changes made to the config files; only
> http.agent.name and solr.server.url were changed. Nutch was built with ant,
> "ant clean runtime".
>
> The entire log directory with a full
> inject/generate/fetch/parse/updatedb/index cycle is in there. As indicated
> in my previous messages, everything works fine until the indexer, and the
> same data indexes fine in local mode.
>
> Thanks in advance,
>
> Jason
>
> On Sat, Jan 23, 2016 at 11:43 AM, Jason S <[email protected]> wrote:
>
>> Bump.
>>
>> Is there anyone who can help me with this?
>>
>> I'm not familiar enough with the Nutch source code to label this as a bug,
>> but it seems to be the case, unless I have made some mistake being new to
>> Hadoop 2. I have been running Nutch on Hadoop 1.x for years and never had
>> any problems like this. Have I overlooked something in my setup?
>>
>> I believe the error I posted is the one causing the indexing job to fail,
>> and I can confirm quite a few things that are not causing the problem:
>>
>> -- I have used Nutch with minimal changes to the default configs, and Solr
>> with exactly the unmodified schema and solrindex-mapping files provided
>> in the config.
>>
>> -- The same error occurs on Hadoop 2.4.0, 2.4.1, and 2.7.1.
>>
>> -- Solr 4.10.2 and Solr 4.10.4 make no difference.
>>
>> -- Building Nutch and Solr with OpenJDK or Oracle JDK makes no difference.
>>
>> It seems like Nutch/Hadoop never connects to Solr before it fails; Solr
>> logging in verbose mode creates 0 lines of output when the indexer job
>> runs on Hadoop.
>>
>> With all data/settings/everything the same, it works fine in local mode.
>>
>> Short of dumping segments to local mode and indexing them that way, or
>> trying another indexer, I'm baffled.
>>
>> Many thanks if someone could help me out.
>>
>> Jason
>>
>> On Thu, Jan 21, 2016 at 10:29 PM, Jason S <[email protected]> wrote:
>>
>>> Hi Markus,
>>>
>>> I guess that is part of my question: from the data in the top-level logs,
>>> how can I tell where to look? I have spent a couple of days trying to
>>> understand Hadoop 2 logging, and I'm still not really very sure.
>>>
>>> For example, I found this error here:
>>>
>>> ~/hadoop-2.4.0/logs/userlogs/application_1453403905213_0001/container_1453403905213_0001_01_000041/syslog
>>>
>>> At first I thought the "no such field" error meant I was trying to put
>>> data into Solr where the field didn't exist in the schema, but the same
>>> data indexes fine in local mode. Also, there are no errors in the Solr logs.
>>>
>>> Thanks,
>>>
>>> Jason
>>>
>>> ### syslog error ###
>>>
>>> 2016-01-21 14:21:14,211 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
>>> 2016-01-21 14:21:14,211 INFO [main] org.apache.nutch.plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
>>> 2016-01-21 14:21:14,637 INFO [main] org.apache.nutch.indexer.anchor.AnchorIndexingFilter: Anchor deduplication is: on
>>> 2016-01-21 14:21:14,668 INFO [main] org.apache.nutch.indexer.IndexWriters: Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>> 2016-01-21 14:21:14,916 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoSuchFieldError: INSTANCE
>>>   at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<init>(DefaultHttpRequestWriterFactory.java:52)
>>>   at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<init>(DefaultHttpRequestWriterFactory.java:56)
>>>   at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<clinit>(DefaultHttpRequestWriterFactory.java:46)
>>>   at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<init>(ManagedHttpClientConnectionFactory.java:72)
>>>   at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<init>(ManagedHttpClientConnectionFactory.java:84)
>>>   at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<clinit>(ManagedHttpClientConnectionFactory.java:59)
>>>   at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$InternalConnectionFactory.<init>(PoolingHttpClientConnectionManager.java:493)
>>>   at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:149)
>>>   at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:138)
>>>   at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:114)
>>>   at org.apache.http.impl.client.HttpClientBuilder.build(HttpClientBuilder.java:726)
>>>   at org.apache.nutch.indexwriter.solr.SolrUtils.getSolrServer(SolrUtils.java:57)
>>>   at org.apache.nutch.indexwriter.solr.SolrIndexWriter.open(SolrIndexWriter.java:58)
>>>   at org.apache.nutch.indexer.IndexWriters.open(IndexWriters.java:75)
>>>   at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:39)
>>>   at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:484)
>>>   at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:414)
>>>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>>>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
>>>   at java.security.AccessController.doPrivileged(Native Method)
>>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
>>> 2016-01-21 14:21:14,927 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ReduceTask metrics system...
>>> 2016-01-21 14:21:14,928 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system stopped.
>>> 2016-01-21 14:21:14,928 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system shutdown complete.
>>>
>>>
>>> On Thu, Jan 21, 2016 at 9:47 PM, Markus Jelsma <[email protected]> wrote:
>>>
>>>> Hi Jason - these are the top-level job logs, but to really know what's
>>>> going on, we need the actual reducer task logs.
>>>> Markus
>>>>
>>>> -----Original message-----
>>>>> From: Jason S <[email protected]>
>>>>> Sent: Thursday 21st January 2016 20:35
>>>>> To: [email protected]
>>>>> Subject: Indexing Nutch 1.11 indexing Fails
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am having a problem indexing segments in Nutch 1.11 on Hadoop.
>>>>>
>>>>> The cluster seems to be configured correctly and every part of the crawl
>>>>> process is working flawlessly; however, this is my first attempt at Hadoop
>>>>> 2, so perhaps my memory settings aren't perfect. I'm also not sure where
>>>>> to look in the log files for more information.
>>>>>
>>>>> The same data can be indexed with Nutch in local mode, so I don't think it
>>>>> is a problem with the Solr configuration, and I have had Nutch 1.0.9 with
>>>>> Hadoop 1.2.1 on this same cluster and everything worked ok.
>>>>>
>>>>> Please let me know if I can send more information; I have spent several
>>>>> days working on this with no success or clue why it is happening.
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Jason
>>>>>
>>>>> ### Command ###
>>>>>
>>>>> /root/hadoop-2.4.0/bin/hadoop jar
>>>>> /root/src/apache-nutch-1.11/build/apache-nutch-1.11.job
>>>>> org.apache.nutch.indexer.IndexingJob crawl/crawldb -linkdb crawl/linkdb
>>>>> crawl/segments/20160121113335
>>>>>
>>>>> ### Error ###
>>>>>
>>>>> 16/01/21 14:20:47 INFO mapreduce.Job: map 100% reduce 19%
>>>>> 16/01/21 14:20:48 INFO mapreduce.Job: map 100% reduce 26%
>>>>> 16/01/21 14:20:48 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000001_0, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:20:48 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000002_0, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:20:48 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000000_0, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:20:49 INFO mapreduce.Job: map 100% reduce 0%
>>>>> 16/01/21 14:20:54 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000004_0, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:20:55 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000002_1, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:20:56 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000001_1, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:21:00 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000000_1, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:21:01 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000004_1, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:21:02 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000002_2, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:21:07 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000003_0, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:21:08 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000004_2, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:21:08 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000001_2, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:21:11 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000000_2, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:21:15 INFO mapreduce.Job: Task Id : attempt_1453403905213_0001_r_000003_1, Status : FAILED
>>>>> Error: INSTANCE
>>>>> 16/01/21 14:21:16 INFO mapreduce.Job: map 100% reduce 100%
>>>>> 16/01/21 14:21:16 INFO mapreduce.Job: Job job_1453403905213_0001 failed with state FAILED due to: Task failed task_1453403905213_0001_r_000004
>>>>> Job failed as tasks failed. failedMaps:0 failedReduces:1
>>>>>
>>>>> 16/01/21 14:21:16 INFO mapreduce.Job: Counters: 39
>>>>>   File System Counters
>>>>>     FILE: Number of bytes read=0
>>>>>     FILE: Number of bytes written=5578886
>>>>>     FILE: Number of read operations=0
>>>>>     FILE: Number of large read operations=0
>>>>>     FILE: Number of write operations=0
>>>>>     HDFS: Number of bytes read=2277523
>>>>>     HDFS: Number of bytes written=0
>>>>>     HDFS: Number of read operations=80
>>>>>     HDFS: Number of large read operations=0
>>>>>     HDFS: Number of write operations=0
>>>>>   Job Counters
>>>>>     Failed reduce tasks=15
>>>>>     Killed reduce tasks=2
>>>>>     Launched map tasks=20
>>>>>     Launched reduce tasks=17
>>>>>     Data-local map tasks=19
>>>>>     Rack-local map tasks=1
>>>>>     Total time spent by all maps in occupied slots (ms)=334664
>>>>>     Total time spent by all reduces in occupied slots (ms)=548199
>>>>>     Total time spent by all map tasks (ms)=167332
>>>>>     Total time spent by all reduce tasks (ms)=182733
>>>>>     Total vcore-seconds taken by all map tasks=167332
>>>>>     Total vcore-seconds taken by all reduce tasks=182733
>>>>>     Total megabyte-seconds taken by all map tasks=257021952
>>>>>     Total megabyte-seconds taken by all reduce tasks=561355776
>>>>>   Map-Reduce Framework
>>>>>     Map input records=18083
>>>>>     Map output records=18083
>>>>>     Map output bytes=3140643
>>>>>     Map output materialized bytes=3178436
>>>>>     Input split bytes=2812
>>>>>     Combine input records=0
>>>>>     Spilled Records=18083
>>>>>     Failed Shuffles=0
>>>>>     Merged Map outputs=0
>>>>>     GC time elapsed (ms)=1182
>>>>>     CPU time spent (ms)=56070
>>>>>     Physical memory (bytes) snapshot=6087245824
>>>>>     Virtual memory (bytes) snapshot=34655649792
>>>>>     Total committed heap usage (bytes)=5412749312
>>>>>   File Input Format Counters
>>>>>     Bytes Read=2274711
>>>>> 16/01/21 14:21:16 ERROR indexer.IndexingJob: Indexer: java.io.IOException: Job failed!
>>>>>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
>>>>>   at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
>>>>>   at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
>>>>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>>>   at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>   at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>   at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

