Hello,

I recently updated from nutch 1.2 to nutch 1.7, still running against Solr 1.4.1. I plan to update Solr as well, but would like to have everything running smoothly first.

Made some minor changes in nutch 1.7 to get this working:
- replaced solrj lib in runtime/local/lib/ with jar from Solr installation (apache-solr-solrj-1.4.1.jar)
- added indexer-solr in plugin.includes in nutch-site.xml

I am using a modified runtime/local/bin/crawl script. Everything works fine, except that solrdedup crashes with a NumberFormatException in catalina.out. It seems the timestamp field (tstamp) changed from a numbers-only value (20130425154536830) to a format like "2013-07-12T10:16:14.783Z". I don't really see why. Every (minor) update in nutch so far has broken something.
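To illustrate what I think is happening, here is a minimal standalone Java sketch. It is not Solr or Nutch code: the class name TstampFormats and both helper methods are made up for this example. The first helper mimics the Long.valueOf call that appears in the catalina.out trace. The second assumes that the old numeric value follows the pattern yyyyMMddHHmmssSSS in UTC, which is only my reading of the example value above, and converts the new format back to it.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class TstampFormats {

    // Mimics how a long-typed field value is read (Long.valueOf, as in the trace).
    public static boolean parsesAsLong(String value) {
        try {
            Long.valueOf(value);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Hypothetical helper: converts the ISO-style value to the old numeric form.
    // Assumption: the old form is yyyyMMddHHmmssSSS in UTC.
    public static String isoToNumeric(String iso) {
        SimpleDateFormat in = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
        in.setTimeZone(TimeZone.getTimeZone("UTC"));
        SimpleDateFormat out = new SimpleDateFormat("yyyyMMddHHmmssSSS");
        out.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return out.format(in.parse(iso));
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not an ISO-style timestamp: " + iso, e);
        }
    }

    public static void main(String[] args) {
        // Old nutch 1.2 style value: readable as a long.
        System.out.println(parsesAsLong("20130425154536830"));
        // New nutch 1.7 style value: Long.valueOf throws NumberFormatException.
        System.out.println(parsesAsLong("2013-07-12T10:16:14.783Z"));
        // The same instant written in the assumed old numeric pattern.
        System.out.println(isoToNumeric("2013-07-12T10:16:14.783Z"));
    }
}
```

So as far as I can tell, any document indexed by nutch 1.7 into a field that Solr still treats as a long will fail in this way when it is read back.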

I thought it might be a good idea to change the type of tstamp to string in schema.xml but this breaks as well.

However, dedup should not really need the timestamp; it needs the digest.

I changed the signature class in nutch-site.xml (for nutch 1.2) because this worked better.

<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
  <description>The default implementation of a page signature. Signatures
  created with this implementation will be used for duplicate detection
  and removal.</description>
</property>


I commented this property out, but dedup still crashes.

What is the recommended Solr version with nutch 1.7? How can I fix the dedup problem? Does anyone have experience with this version setup and can advise if other things will break? Is it best to update to solr 3.x.x or 4.x.x?

Thanks for any help

Sybille

solrdedup output:
--------------------------

apache-nutch-1.7/runtime/local/bin/nutch solrdedup http://localhost:8080/solr2
SolrDeleteDuplicates: starting at 2013-07-12 14:40:03
SolrDeleteDuplicates: Solr url: http://localhost:8080/solr2
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:390)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:395)

catalina.out
-----------------

12.07.2013 14:45:59 org.apache.solr.request.BinaryResponseWriter$Resolver getDoc
WARNUNG: Error reading a field from document : SolrDocument[{digest=05f734d3795fdf1112a3d979686af57a}]
java.lang.NumberFormatException: For input string: "2013-07-12T10:16:14.783Z"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
        at java.lang.Long.parseLong(Long.java:419)
        at java.lang.Long.valueOf(Long.java:525)
        at org.apache.solr.schema.LongField.toObject(LongField.java:82)
        at org.apache.solr.schema.LongField.toObject(LongField.java:33)
        at org.apache.solr.request.BinaryResponseWriter$Resolver.getDoc(BinaryResponseWriter.java:148)
        at org.apache.solr.request.BinaryResponseWriter$Resolver.writeDocList(BinaryResponseWriter.java:124)
        at org.apache.solr.request.BinaryResponseWriter$Resolver.resolve(BinaryResponseWriter.java:88)
        at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:143)
        at org.apache.solr.common.util.JavaBinCodec.writeNamedList(JavaBinCodec.java:133)
        at org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:221)
        at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:138)
        at org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:87)
        at org.apache.solr.request.BinaryResponseWriter.write(BinaryResponseWriter.java:48)
        at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:322)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:470)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
        at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:877)
        at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:594)
        at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1675)
        at java.lang.Thread.run(Thread.java:662)



