Hello,
I recently updated from Nutch 1.2 to Nutch 1.7 with Solr 1.4.1. I plan to
update Solr as well, but would like to have everything running smoothly
first.
I made some minor changes in Nutch 1.7 to get this working:
- replaced the solrj lib in runtime/local/lib/ with the jar from the Solr
  installation (apache-solr-solrj-1.4.1.jar)
- added indexer-solr to plugin.includes in nutch-site.xml
I am using a modified runtime/local/bin/crawl script. Everything works
fine, but solrdedup crashes. There is a NumberFormatException in
catalina.out. It seems the timestamp field (tstamp) changed from a
numbers-only format (20130425154536830) to a format like this:
"2013-07-12T10:16:14.783Z". I don't really see why. Every (minor) update
in Nutch so far has broken something.
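To make the mismatch concrete, here is a minimal Python sketch of the two
tstamp formats. It only illustrates why a field that expects a long rejects
the new value; it is not Nutch or Solr code. The helper names are
hypothetical, and reading the old numbers-only value as yyyyMMddHHmmssSSS
in UTC is an assumption.

```python
from datetime import datetime, timezone

OLD_TSTAMP = "20130425154536830"         # numbers-only value from the old index
NEW_TSTAMP = "2013-07-12T10:16:14.783Z"  # value reported in catalina.out


def parses_as_long(value: str) -> bool:
    """Mimic what a long-typed field has to do: read the text as an integer."""
    try:
        int(value)
        return True
    except ValueError:
        return False


def compact_to_iso(value: str) -> str:
    """Rewrite yyyyMMddHHmmssSSS as an ISO 8601 string (UTC is an assumption)."""
    parsed = datetime(
        int(value[0:4]), int(value[4:6]), int(value[6:8]),
        int(value[8:10]), int(value[10:12]), int(value[12:14]),
        int(value[14:17]) * 1000,  # milliseconds -> microseconds
        tzinfo=timezone.utc,
    )
    millis = parsed.microsecond // 1000
    return parsed.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"


print(parses_as_long(OLD_TSTAMP))
print(parses_as_long(NEW_TSTAMP))
print(compact_to_iso(OLD_TSTAMP))
```

The old value reads as an integer and the new one does not, which matches
the Long.parseLong failure in the trace at the end of this mail.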
I thought it might be a good idea to change the type of tstamp to string
in schema.xml, but this breaks as well.
However, dedup should not really need the timestamp; it needs the digest.
I changed the signature class in nutch-site.xml (for Nutch 1.2) because
this worked better.
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
  <description>The default implementation of a page signature. Signatures
  created with this implementation will be used for duplicate detection
  and removal.</description>
</property>
I commented this out but dedup still crashes.
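For reference, a minimal Python sketch of digest-based deduplication. It is
not the SolrDeleteDuplicates implementation. It assumes that, within a group
of documents sharing a digest, the tool keeps the one with the highest boost
and breaks ties by the newest tstamp, which would explain why the timestamp
is read at all. The document ids, boosts, and all records except the first
digest and tstamp are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical index contents; only the first digest and tstamp come from the log.
DOCS = [
    {"id": "http://example.com/a", "digest": "05f734d3795fdf1112a3d979686af57a",
     "boost": 1.0, "tstamp": "2013-07-12T10:16:14.783Z"},
    {"id": "http://example.com/b", "digest": "05f734d3795fdf1112a3d979686af57a",
     "boost": 1.0, "tstamp": "2013-07-11T09:00:00.000Z"},
    {"id": "http://example.com/c", "digest": "ffffffffffffffffffffffffffffffff",
     "boost": 1.0, "tstamp": "2013-07-10T08:00:00.000Z"},
]


def parse_tstamp(value: str) -> datetime:
    """Parse the ISO-style tstamp string seen in catalina.out."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")


def ids_to_delete(docs):
    """Group by digest, keep the best document per group, return the rest."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["digest"]].append(doc)

    doomed = []
    for group in groups.values():
        # Assumed ranking: highest boost first, then newest timestamp.
        ranked = sorted(
            group,
            key=lambda d: (d["boost"], parse_tstamp(d["tstamp"])),
            reverse=True,
        )
        doomed.extend(d["id"] for d in ranked[1:])
    return sorted(doomed)


print(ids_to_delete(DOCS))
```

If the real tool ranks duplicates this way, the digest alone is not enough:
every document's tstamp has to be readable, so a type mismatch on that field
would stop dedup even with a different signature class.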
What is the recommended Solr version for Nutch 1.7? How can I fix the
dedup problem? Does anyone have experience with this version setup and
can advise whether other things will break? Is it best to update to Solr
3.x.x or 4.x.x?
Thanks for any help
Sybille
solrdedup output:
--------------------------
apache-nutch-1.7/runtime/local/bin/nutch solrdedup http://localhost:8080/solr2
SolrDeleteDuplicates: starting at 2013-07-12 14:40:03
SolrDeleteDuplicates: Solr url: http://localhost:8080/solr2
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:390)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:395)
catalina.out
-----------------
12.07.2013 14:45:59 org.apache.solr.request.BinaryResponseWriter$Resolver getDoc
WARNUNG: Error reading a field from document : SolrDocument[{digest=05f734d3795fdf1112a3d979686af57a}]
java.lang.NumberFormatException: For input string: "2013-07-12T10:16:14.783Z"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Long.parseLong(Long.java:419)
    at java.lang.Long.valueOf(Long.java:525)
    at org.apache.solr.schema.LongField.toObject(LongField.java:82)
    at org.apache.solr.schema.LongField.toObject(LongField.java:33)
    at org.apache.solr.request.BinaryResponseWriter$Resolver.getDoc(BinaryResponseWriter.java:148)
    at org.apache.solr.request.BinaryResponseWriter$Resolver.writeDocList(BinaryResponseWriter.java:124)
    at org.apache.solr.request.BinaryResponseWriter$Resolver.resolve(BinaryResponseWriter.java:88)
    at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:143)
    at org.apache.solr.common.util.JavaBinCodec.writeNamedList(JavaBinCodec.java:133)
    at org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:221)
    at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:138)
    at org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:87)
    at org.apache.solr.request.BinaryResponseWriter.write(BinaryResponseWriter.java:48)
    at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:322)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:470)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
    at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:877)
    at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:594)
    at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1675)
    at java.lang.Thread.run(Thread.java:662)