Hi Brian,

Thanks for your reply. But as the stack trace shows, the error concerns the ID field of a document. That field cannot be set to accommodate multiple values, and it wouldn't make sense either: the ID field should contain the URL of the fetched and parsed content.

You can also clearly see the mapping in the included Nutch logs: it maps the URL field to Solr's ID field as well as mapping the URL to the URL field. That doesn't seem to make sense, but it's still the example schema and mapping configuration. I also can't imagine that multiple values for a URL field would make any sense in Nutch itself. How would a piece of content on a distinct URL have more than one URL?

Do you or anybody else have an idea how to solve this mystery? I'm not getting much from Nutch's logs either; they don't mention anything except that sending the data over to a Solr instance failed.

Cheers,

-----Original message-----
From: Brian Tingle <[email protected]>
Sent: Tue 25-05-2010 20:47
To: [email protected]; Markus Jelsma <[email protected]>;
Subject: RE: Solr integration in nutch-1.1dev

Update the solr schema.xml so that it allows multiple values for that field?

|-----Original Message-----
|From: Markus Jelsma [mailto:[email protected]]
|Sent: Tuesday, May 25, 2010 4:49 AM
|To: [email protected]
|Subject: Re: Solr integration in nutch-1.1dev
|
|Hello Julien,
|
|I picked today's build from your URL but the problem persists as reported
|earlier. Any more ideas on how to tackle this?
|
|Cheers,
|
|On Monday 17 May 2010 15:50:55 Julien Nioche wrote:
|> Hi Markus,
|>
|> This has been solved last week and is in the trunk of the SVN repository.
|> The nightly build has just been fixed after the move to the TLP so the
|> version you are using does not have the fix yet. Check
|> http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ to get the latest
|> build or check it out from SVN
|>
|> J.
|>
|> > Hi,
|> >
|> > I've got a copy of the nutch-2010-05-11_04-34-41 nightly build because i
|> > need Tika to parse JPEG images and that would be in 1.1 as i read
|> > somewhere [1].
|> >
|> > First i fetch only a single HTML page and send it to Solr as i did with
|> > 1.0 but it fails now. Here's what Solr thinks of the request:
|> >
|> > ---------------
|> > May 17, 2010 2:25:32 PM org.apache.solr.common.SolrException log
|> > SEVERE: org.apache.solr.common.SolrException: ERROR: multiple values encountered for non multiValued copy field id: <URL HERE>
|> >     at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:260)
|> >     at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
|> >     at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:94)
|> >     at org.apache.solr.update.processor.SignatureUpdateProcessorFactory$SignatureUpdateProcessor.processAdd(SignatureUpdateProcessorFactory.java:162)
|> >     at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
|> >     at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
|> >     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
|> >     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
|> >     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
|> >     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
|> >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
|> >     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
|> >     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
|> >     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
|> >     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
|> >     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
|> >     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
|> >     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
|> >     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
|> >     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
|> >     at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
|> >     at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
|> >     at java.lang.Thread.run(Thread.java:619)
|> > ---------------
|> >
|> > Well, this obviously is wrong. Although i am still using the old 1.0
|> > schema.xml, it still isn't multiValued in the nightly build's schema.xml
|> > file.
|> >
|> > Below Nutch's relevant log lines:
|> >
|> > ---------------
|> > 2010-05-17 14:25:31,776 INFO solr.SolrMappingReader - source: content dest: content
|> > 2010-05-17 14:25:31,776 INFO solr.SolrMappingReader - source: site dest: site
|> > 2010-05-17 14:25:31,776 INFO solr.SolrMappingReader - source: title dest: title
|> > 2010-05-17 14:25:31,776 INFO solr.SolrMappingReader - source: host dest: host
|> > 2010-05-17 14:25:31,776 INFO solr.SolrMappingReader - source: segment dest: segment
|> > 2010-05-17 14:25:31,777 INFO solr.SolrMappingReader - source: boost dest: boost
|> > 2010-05-17 14:25:31,777 INFO solr.SolrMappingReader - source: digest dest: digest
|> > 2010-05-17 14:25:31,777 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
|> > 2010-05-17 14:25:31,777 INFO solr.SolrMappingReader - source: url dest: id
|> > 2010-05-17 14:25:31,777 INFO solr.SolrMappingReader - source: url dest: url
|> > 2010-05-17 14:25:31,821 INFO collection.CollectionManager - Instantiating CollectionManager
|> > 2010-05-17 14:25:31,822 INFO collection.CollectionManager - initializing CollectionManager
|> > 2010-05-17 14:25:31,849 INFO collection.CollectionManager - file has1 elements
|> > 2010-05-17 14:25:32,474 WARN mapred.LocalJobRunner - job_local_0001
|> > org.apache.solr.common.SolrException: Bad Request
|> >
|> > Bad Request
|> >
|> > request: http://127.0.0.1:8080/solr/update?wt=javabin&version=1
|> >     at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:424)
|> >     at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
|> >     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
|> >     at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
|> >     at org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:74)
|> >     at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
|> >     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
|> >     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
|> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
|> > ---------------
|> >
|> > Because i still use my old 1.0 configuration files i get the following
|> > warning from Nutch but doesn't look like it's related to the Solr
|> > integration:
|> >
|> > ---------------
|> > 2010-05-17 14:34:11,529 WARN conf.Configuration - DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
|> > ---------------
|> >
|> > Did i just stumble upon a regression in 1.1dev and should i file a bug or
|> > could something else spoil the fun?
|> >
|> > [1]: http://lucene.472066.n3.nabble.com/Adding-jpeg-parser-to-nutch-td710135.html
|> >
|> > Cheers,
|> >
|> > Markus Jelsma - Technisch Architect - Buyways BV
|> > http://www.linkedin.com/in/markus17
|> > 050-8536620 / 06-50258350
|>
|
|Markus Jelsma - Technisch Architect - Buyways BV
|http://www.linkedin.com/in/markus17
|050-8536620 / 06-50258350
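
For anyone who wants to check the field mappings in a longer Nutch log, here is a minimal Python sketch. It assumes the un-wrapped `solr.SolrMappingReader` log line format quoted in this thread. The names `read_mappings`, `fan_out` and `SAMPLE_LOG` are made up for illustration and are not part of Nutch or Solr. It only lists source fields that are mapped to more than one destination; it does not by itself explain why Solr receives two values for `id`.

```python
import re
from collections import defaultdict

# Matches un-wrapped lines such as:
# "2010-05-17 14:25:31,777 INFO solr.SolrMappingReader - source: url dest: id"
MAPPING_RE = re.compile(
    r"solr\.SolrMappingReader - source:\s*(\S+)\s+dest:\s*(\S+)"
)


def read_mappings(log_text):
    """Return {source_field: [dest_field, ...]} from SolrMappingReader lines."""
    mappings = defaultdict(list)
    for match in MAPPING_RE.finditer(log_text):
        source, dest = match.groups()
        mappings[source].append(dest)
    return dict(mappings)


def fan_out(mappings):
    """Keep only the source fields mapped to more than one destination."""
    return {src: dests for src, dests in mappings.items() if len(dests) > 1}


# A few lines copied from the log in this thread (hypothetical sample input).
SAMPLE_LOG = """\
2010-05-17 14:25:31,777 INFO solr.SolrMappingReader - source: digest dest: digest
2010-05-17 14:25:31,777 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2010-05-17 14:25:31,777 INFO solr.SolrMappingReader - source: url dest: id
2010-05-17 14:25:31,777 INFO solr.SolrMappingReader - source: url dest: url
"""

if __name__ == "__main__":
    for source, dests in fan_out(read_mappings(SAMPLE_LOG)).items():
        print(f"{source} -> {', '.join(dests)}")  # prints "url -> id, url"
```

On the sample lines this reports only `url`, which matches what the log shows: `url` goes to both `id` and `url`, and every other field is mapped once.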

