I think I had the same problem. I just checked my schema.xml, and it looks like I simply commented out the copyField with source="url" and dest="id":
<!-- copyField source="url" dest="id"/ -->

|-----Original Message-----
|From: Markus Jelsma [mailto:[email protected]]
|Sent: Tuesday, May 25, 2010 12:04 PM
|To: [email protected]
|Subject: RE: Solr integration in nutch-1.1dev
|
|Hi Brian,
|
|Thanks for your reply. But as can be seen in the stacktrace, it's the ID
|field of a document. It cannot be set to accommodate multiple values and it
|wouldn't make sense either. The ID field should contain the URL of the
|fetched and parsed content. Also, you can clearly see the mapping in the
|included Nutch logs; it maps the URL field to Solr's ID field as well as
|mapping the URL to the URL field, which doesn't make sense, but it's still the
|example schema and mapping configuration. Also, i couldn't imagine that multiple
|values for a URL field in Nutch itself make any sense at all: how would a
|piece of content on a distinct URL have more than one URL?
|
|Do you or anybody else have an idea to solve this mystery? I'm also not
|getting much from Nutch's logs; they don't mention anything else except that
|sending the data over to a Solr instance failed.
|
|Cheers,
|
|-----Original message-----
|From: Brian Tingle <[email protected]>
|Sent: Tue 25-05-2010 20:47
|To: [email protected]; Markus Jelsma <[email protected]>;
|Subject: RE: Solr integration in nutch-1.1dev
|
|Update the solr schema.xml so that it allows multiple values for that field?
|
||-----Original Message-----
||From: Markus Jelsma [mailto:[email protected]]
||Sent: Tuesday, May 25, 2010 4:49 AM
||To: [email protected]
||Subject: Re: Solr integration in nutch-1.1dev
||
||Hello Julien,
||
||I picked today's build from your URL but the problem persists as reported
||earlier. Any more ideas on how to tackle this?
||
||Cheers,
||
||On Monday 17 May 2010 15:50:55 Julien Nioche wrote:
||> Hi Markus,
||>
||> This has been solved last week and is in the trunk of the SVN repository.
||> The nightly build has just been fixed after the move to the TLP so the
||> version you are using does not have the fix yet. Check
||> http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ to get the latest
||> build or check it out from SVN
||>
||> J.
||>
||> > Hi,
||> >
||> > I've got a copy of the nutch-2010-05-11_04-34-41 nightly build because i
||> > need Tika to parse JPEG images and that would be in 1.1 as i read somewhere
||> > [1].
||> >
||> > First i fetch only a single HTML page and send it to Solr as i did with
||> > 1.0 but it fails now. Here's what Solr thinks of the request:
||> >
||> > ---------------
||> > May 17, 2010 2:25:32 PM org.apache.solr.common.SolrException log
||> > SEVERE: org.apache.solr.common.SolrException: ERROR: multiple values
||> > encountered for non multiValued copy field id: <URL HERE>
||> >     at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:260)
||> >     at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
||> >     at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:94)
||> >     at org.apache.solr.update.processor.SignatureUpdateProcessorFactory$SignatureUpdateProcessor.processAdd(SignatureUpdateProcessorFactory.java:162)
||> >     at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
||> >     at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
||> >     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
||> >     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
||> >     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
||> >     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
||> >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
||> >     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
||> >     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
||> >     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
||> >     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
||> >     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
||> >     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
||> >     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
||> >     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
||> >     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
||> >     at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
||> >     at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
||> >     at java.lang.Thread.run(Thread.java:619)
||> > ---------------
||> >
||> > Well, this obviously is wrong. Although i am still using the old 1.0
||> > schema.xml, it still isn't multiValued in the nightly build's schema.xml
||> > file.
||> >
||> > Below Nutch's relevant log lines:
||> >
||> > ---------------
||> > 2010-05-17 14:25:31,776 INFO solr.SolrMappingReader - source: content dest: content
||> > 2010-05-17 14:25:31,776 INFO solr.SolrMappingReader - source: site dest: site
||> > 2010-05-17 14:25:31,776 INFO solr.SolrMappingReader - source: title dest: title
||> > 2010-05-17 14:25:31,776 INFO solr.SolrMappingReader - source: host dest: host
||> > 2010-05-17 14:25:31,776 INFO solr.SolrMappingReader - source: segment dest: segment
||> > 2010-05-17 14:25:31,777 INFO solr.SolrMappingReader - source: boost dest: boost
||> > 2010-05-17 14:25:31,777 INFO solr.SolrMappingReader - source: digest dest: digest
||> > 2010-05-17 14:25:31,777 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
||> > 2010-05-17 14:25:31,777 INFO solr.SolrMappingReader - source: url dest: id
||> > 2010-05-17 14:25:31,777 INFO solr.SolrMappingReader - source: url dest: url
||> > 2010-05-17 14:25:31,821 INFO collection.CollectionManager - Instantiating CollectionManager
||> > 2010-05-17 14:25:31,822 INFO collection.CollectionManager - initializing CollectionManager
||> > 2010-05-17 14:25:31,849 INFO collection.CollectionManager - file has1 elements
||> > 2010-05-17 14:25:32,474 WARN mapred.LocalJobRunner - job_local_0001
||> > org.apache.solr.common.SolrException: Bad Request
||> >
||> > Bad Request
||> >
||> > request: http://127.0.0.1:8080/solr/update?wt=javabin&version=1
||> >     at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:424)
||> >     at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
||> >     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
||> >     at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
||> >     at org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:74)
||> >     at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
||> >     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
||> >     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
||> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
||> > ---------------
||> >
||> > Because i still use my old 1.0 configuration files i get the following
||> > warning from Nutch, but it doesn't look like it's related to the Solr
||> > integration:
||> >
||> > ---------------
||> > 2010-05-17 14:34:11,529 WARN conf.Configuration - DEPRECATED:
||> > hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is
||> > deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml
||> > to override properties of core-default.xml, mapred-default.xml and
||> > hdfs-default.xml respectively
||> > ---------------
||> >
||> > Did i just stumble upon a regression in 1.1dev and should i file a bug or
||> > could something else spoil the fun?
||> >
||> > [1]: http://lucene.472066.n3.nabble.com/Adding-jpeg-parser-to-nutch-td710135.html
||> >
||> > Cheers,
||> >
||> > Markus Jelsma - Technisch Architect - Buyways BV
||> > http://www.linkedin.com/in/markus17
||> > 050-8536620 / 06-50258350
||>
||
||Markus Jelsma - Technisch Architect - Buyways BV
||http://www.linkedin.com/in/markus17
||050-8536620 / 06-50258350
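
P.S. For context, here is roughly what the relevant part of a schema.xml looks like with that copyField disabled. This is only a sketch: the attributes on the two field definitions and the uniqueKey are my assumptions about a typical setup, not copied from the schema that ships with Nutch, and the only change I actually made is commenting out the copyField. My reading (which may well be incomplete) is that the Nutch mapping in the log above already sends url to id ("source: url dest: id"), so a Solr-side copyField from url to id would hand a second value to a field that is not multiValued, which matches the error in the stack trace.

```xml
<!-- schema.xml (fragment); field attributes below are assumed for illustration -->
<fields>
  <!-- id holds one value per document, so it stays single-valued -->
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="url" type="string" indexed="true" stored="true"/>
</fields>

<!-- assumed: id is the unique key, as the quoted messages suggest -->
<uniqueKey>id</uniqueKey>

<!-- Disabled: the Nutch mapping already fills id from url, so copying url
     into id again on the Solr side would give id two values. -->
<!-- <copyField source="url" dest="id"/> -->
```

Solr only reads schema.xml at startup (or on a core reload), so the change does not take effect until then.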

