Hi Brian,

 

Thanks for your reply. But as can be seen in the stack trace, it's the ID field 
of a document. It cannot be set to accommodate multiple values, and it wouldn't 
make sense either: the ID field should contain the URL of the fetched and 
parsed content. Also, you can clearly see the mapping in the included Nutch 
logs: it maps the URL field to Solr's ID field as well as mapping the URL to 
the URL field, which doesn't make sense, but it's still the example schema and 
mapping configuration. Also, I can't imagine that multiple values for a URL 
field in Nutch itself would make any sense at all. How would a piece of content 
on a distinct URL have more than one URL?
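
For illustration, here is a minimal Python sketch of one way the error in the 
stack trace could come about. It is not Solr code: apply_copy_fields is a 
hypothetical helper, the example URL is made up, and the url -> id copy rule on 
the Solr side is an assumption that the included logs do not confirm.

```python
# Hypothetical model of a schema-side copy rule; this is NOT Solr's code.
def apply_copy_fields(doc, copy_fields, multi_valued):
    """Return a copy of doc with (source, dest) copy rules applied."""
    result = {name: list(values) for name, values in doc.items()}
    for source, dest in copy_fields:
        for value in doc.get(source, []):
            result.setdefault(dest, []).append(value)
            # A single-valued destination must not end up with two values.
            if dest not in multi_valued and len(result[dest]) > 1:
                raise ValueError(
                    "multiple values encountered for non multiValued "
                    "copy field %s: %s" % (dest, value)
                )
    return result


# What the mapping in the Nutch log produces: url is sent as both id and url.
nutch_doc = {"id": ["http://example.com/"], "url": ["http://example.com/"]}

# Assumption: the schema on the Solr side also copies url into id.
schema_copy_fields = [("url", "id")]

try:
    apply_copy_fields(nutch_doc, schema_copy_fields, multi_valued=set())
except ValueError as err:
    print(err)
```

If that assumption holds, the ID field would receive the same URL twice, once 
from the Nutch mapping and once from the copy rule, without any document ever 
having more than one URL.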

 

Do you or anybody else have an idea how to solve this mystery? I'm also not 
getting much from Nutch's logs; they don't mention anything else except that 
sending the data over to a Solr instance failed.

 

Cheers,
 
-----Original message-----
From: Brian Tingle <[email protected]>
Sent: Tue 25-05-2010 20:47
To: [email protected]; Markus Jelsma <[email protected]>; 
Subject: RE: Solr integration in nutch-1.1dev

Update the solr schema.xml so that it allows multiple values for that field?

|-----Original Message-----
|From: Markus Jelsma [mailto:[email protected]]
|Sent: Tuesday, May 25, 2010 4:49 AM
|To: [email protected]
|Subject: Re: Solr integration in nutch-1.1dev
|
|Hello Julien,
|
|
|I picked today's build from your URL but the problem persists as reported
|earlier. Any more ideas on how to tackle this?
|
|
|Cheers,
|
|On Monday 17 May 2010 15:50:55 Julien Nioche wrote:
|> Hi Markus,
|>
|> This has been solved last week and is in the trunk of the SVN repository.
|> The nightly build has just been fixed after the move to the TLP so the
|> version you are using does not have the fix yet. Check
|> http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ to get the latest
|> build or check it out from SVN
|>
|> J.
|>
|> > Hi,
|> >
|> >
|> > I've got a copy of the nutch-2010-05-11_04-34-41 nightly build because i
|> > need Tika to parse JPEG images and that would be in 1.1 as i read
|> > somewhere [1].
|> >
|> > First i fetch only a single HTML page and send it to Solr as i did with
|> > 1.0 but it fails now. Here's what Solr thinks of the request:
|> >
|> >
|> > ---------------
|> > May 17, 2010 2:25:32 PM org.apache.solr.common.SolrException log
|> > SEVERE: org.apache.solr.common.SolrException: ERROR: multiple values encountered for non multiValued copy field id: <URL HERE>
|> >        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:260)
|> >        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
|> >        at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:94)
|> >        at org.apache.solr.update.processor.SignatureUpdateProcessorFactory$SignatureUpdateProcessor.processAdd(SignatureUpdateProcessorFactory.java:162)
|> >        at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
|> >        at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
|> >        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
|> >        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
|> >        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
|> >        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
|> >        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
|> >        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
|> >        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
|> >        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
|> >        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
|> >        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
|> >        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
|> >        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
|> >        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
|> >        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
|> >        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
|> >        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
|> >        at java.lang.Thread.run(Thread.java:619)
|> > ---------------
|> >
|> >
|> > Well, this obviously is wrong. Although i am still using the old 1.0
|> > schema.xml, it still isn't multiValued in the nightly build's schema.xml
|> > file.
|> >
|> > Below Nutch's relevant log lines:
|> >
|> >
|> > ---------------
|> > 2010-05-17 14:25:31,776 INFO  solr.SolrMappingReader - source: content dest: content
|> > 2010-05-17 14:25:31,776 INFO  solr.SolrMappingReader - source: site dest: site
|> > 2010-05-17 14:25:31,776 INFO  solr.SolrMappingReader - source: title dest: title
|> > 2010-05-17 14:25:31,776 INFO  solr.SolrMappingReader - source: host dest: host
|> > 2010-05-17 14:25:31,776 INFO  solr.SolrMappingReader - source: segment dest: segment
|> > 2010-05-17 14:25:31,777 INFO  solr.SolrMappingReader - source: boost dest: boost
|> > 2010-05-17 14:25:31,777 INFO  solr.SolrMappingReader - source: digest dest: digest
|> > 2010-05-17 14:25:31,777 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
|> > 2010-05-17 14:25:31,777 INFO  solr.SolrMappingReader - source: url dest: id
|> > 2010-05-17 14:25:31,777 INFO  solr.SolrMappingReader - source: url dest: url
|> > 2010-05-17 14:25:31,821 INFO  collection.CollectionManager - Instantiating CollectionManager
|> > 2010-05-17 14:25:31,822 INFO  collection.CollectionManager - initializing CollectionManager
|> > 2010-05-17 14:25:31,849 INFO  collection.CollectionManager - file has1 elements
|> > 2010-05-17 14:25:32,474 WARN  mapred.LocalJobRunner - job_local_0001
|> > org.apache.solr.common.SolrException: Bad Request
|> >
|> > Bad Request
|> >
|> > request: http://127.0.0.1:8080/solr/update?wt=javabin&version=1
|> >        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:424)
|> >        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
|> >        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
|> >        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
|> >        at org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:74)
|> >        at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
|> >        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
|> >        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
|> >        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
|> > ---------------
|> >
|> > Because I still use my old 1.0 configuration files, I get the following
|> > warning from Nutch, but it doesn't look like it's related to the Solr
|> > integration:
|> >
|> > ---------------
|> > 2010-05-17 14:34:11,529 WARN  conf.Configuration - DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
|> > ---------------
|> >
|> > Did i just stumble upon a regression in 1.1dev and should i file a bug,
|> > or could something else spoil the fun?
|> >
|> >
|> >
|> > [1]: http://lucene.472066.n3.nabble.com/Adding-jpeg-parser-to-nutch-td710135.html
|> >
|> > Cheers,
|> >
|> > Markus Jelsma - Technisch Architect - Buyways BV
|> > http://www.linkedin.com/in/markus17
|> > 050-8536620 / 06-50258350
|>
|
|Markus Jelsma - Technisch Architect - Buyways BV
|http://www.linkedin.com/in/markus17
|050-8536620 / 06-50258350
