Hi,

I have set up Solr and Nutch to crawl a directory of XML files.  I
intend to use this as a search engine for the XML data.

1. Set up Solr.  This is a simple configuration.
2. Set up Nutch.  Copied nutch/conf/schema-solr4.xml to
solr/collection1/conf/schema.xml
3. Added _version_ to the solr/schema.xml
4. Started Solr
5. Added http.agent.name to nutch-site.xml.  Added plugin.includes and
file.content.limit to the same file.
6. plugin.includes contains parse-(html|tika|xml)|lib-xml|indexer-solr
7. In nutch/schema.xml, set content to stored.
8. Started Nutch as  bin/nutch crawl urls -solr
http://localhost:8983/solr/ -depth 7 -topN 10
9. After a few minutes, it fails with Exception in thread "main"
java.io.IOException: Job failed!
10. Solr complains:
1423607 [qtp1239291892-18] INFO  org.apache.solr.update.processor.LogUpdateProcessor  ? [collection1] webapp=/solr path=/update params={wt=javabin&version=2} {} 0 1
1423607 [qtp1239291892-18] ERROR org.apache.solr.core.SolrCore  ? org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id
        at org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:92)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:582)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:435)
        at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
        at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:710)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
...
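
To make step 6 concrete, here is a minimal sketch of how that plugin.includes
fragment selects plugin ids. It assumes Nutch applies the value as a regular
expression that must match a whole plugin id; the candidate ids in the loop
are hypothetical and only exercise the pattern. The fragment is the part
quoted in step 6, not necessarily the complete value in my nutch-site.xml.

```python
import re

# Fragment of plugin.includes quoted in step 6 (not necessarily the full value).
PLUGIN_INCLUDES = r"parse-(html|tika|xml)|lib-xml|indexer-solr"


def is_included(plugin_id, pattern=PLUGIN_INCLUDES):
    """Return True if the whole plugin id matches the pattern.

    Assumption: Nutch matches plugin.includes against complete plugin ids,
    so a full match (not a substring search) is used here.
    """
    return re.fullmatch(pattern, plugin_id) is not None


# Hypothetical candidate ids, used only to exercise the pattern.
for plugin_id in ["parse-xml", "lib-xml", "indexer-solr", "parse-js"]:
    print(plugin_id, is_included(plugin_id))
```

Any plugin id that the full pattern does not match would be left out under
this assumption, so the same check can be run against the complete value.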

Is there anything I am missing here?  I have tried various things and followed
the tutorials I could find on Nutch configuration, but this doesn't work at all.
The URL of each XML file is unique, and it is mapped in
solrindex-mapping.xml in Nutch, so I would not expect a problem with id.
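
A minimal sketch of how the mapping can be checked, assuming
solrindex-mapping.xml consists of <field dest="..." source="..."/> entries
plus a <uniqueKey> element. SAMPLE_MAPPING below is an illustrative stand-in,
not a copy of my actual file, and unique_key_sources is a helper name made up
for this example.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for solrindex-mapping.xml (assumed layout, not my real file).
SAMPLE_MAPPING = """<mapping>
  <fields>
    <field dest="content" source="content"/>
    <field dest="id" source="url"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>"""


def unique_key_sources(mapping_xml):
    """Return (uniqueKey, [source fields mapped onto that key])."""
    root = ET.fromstring(mapping_xml)
    key = (root.findtext("uniqueKey") or "").strip()
    # Collect every <field> whose dest is the uniqueKey field.
    sources = [f.get("source") for f in root.iter("field") if f.get("dest") == key]
    return key, sources


key, sources = unique_key_sources(SAMPLE_MAPPING)
print("uniqueKey:", key, "mapped from:", sources)
```

An empty list of sources would mean that nothing in the mapping file feeds
the uniqueKey field, which is the first thing to rule out given the error in
step 10.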

I am running solr-4.6.0 and apache-nutch-1.7.

Any help in getting this sorted is much appreciated.

Thanks

Umapathy
