Hi, I have set up Solr and Nutch to crawl a directory of XML files. I intend to use this as a search engine for the XML data.

1. Set up Solr. This is a simple configuration.
2. Set up Nutch. Copied nutch/conf/schema-solr4.xml to solr/collection1/conf/schema.xml
3. Added _version_ to the solr/schema.xml
4. Started Solr
5. Added http.agent.name to nutch-site.xml. Added plugin.includes and file.content.limit to the same file.
6. plugin.includes has parse-(html|tika|xml)|lib-xml|indexer-solr
7. In nutch/schema.xml, set content to stored.
8. Started Nutch as bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 7 -topN 10
9. After a few minutes, it fails with
   Exception in thread "main" java.io.IOException: Job failed!
10. Solr complains:

1423607 [qtp1239291892-18] INFO org.apache.solr.update.processor.LogUpdateProcessor – [collection1] webapp=/solr path=/update params={wt=javabin&version=2} {} 0 1
1423607 [qtp1239291892-18] ERROR org.apache.solr.core.SolrCore – org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id
    at org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:92)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:582)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:435)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:710)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
    ...

Is there anything I am missing here? I have tried various settings and followed the tutorials I could find on Nutch configuration, but this doesn't work at all. The URL of each XML file is unique (and it is mapped in solrindex-mapping.xml in Nutch), so there should be no problem with the id.

I am running solr-4.6.0 and apache-nutch-1.7. Any help in getting this sorted is much appreciated.

Thanks
Umapathy

