Hi everybody

I have a small batch Nutch 1.2-based process that crawls a site and
then inserts the data into a Solr instance. After updating to version
1.3 it started causing problems with the Solr configuration: it seems
to be sending duplicate url, segment, boost and other fields that were
not configured as multiValued, and the duplicated fields carry
duplicate values as well.

hadoop.log:
org.apache.solr.common.SolrException: ERROR:
[http://www.marketshare.com/] multiple values encountered for non
multiValued field boost: [1.0985885, 1.0985885]

ERROR: [http://www.marketshare.com/] multiple values encountered for
non multiValued field boost: [1.0985885, 1.0985885]

request: http://localhost:8983/solr/market/update?wt=javabin&version=2
        at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:436)
        at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:245)
        at 
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
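If it helps: the `boost` field in the schema.xml that ships with Nutch is declared single-valued. The fragment below is quoted from memory (check the actual conf/schema.xml), but it shows why a second boost value is fatal:

```xml
<!-- Illustrative fragment of the Nutch-supplied Solr schema.xml:
     no multiValued="true" here, so Solr rejects any document that
     arrives with two boost values. -->
<field name="boost" type="float" stored="true" indexed="false"/>
```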

My process does the following:

# Clean up the previous run
rm -rf crawl
rm *.out

NUTCH_HOME="/home/nutch-nuevo/runtime/local"

# Crawl the seed URLs
${NUTCH_HOME}/bin/nutch crawl urls -dir crawl -depth 10 -topN 300000 > log.out

# Update the crawldb with the fetched segment(s)
segment=`ls -d crawl/segments/*`
${NUTCH_HOME}/bin/nutch updatedb crawl/crawldb $segment

# Build the linkdb
${NUTCH_HOME}/bin/nutch invertlinks crawl/linkdb -dir crawl/segments > invert.out

# Index everything into Solr
${NUTCH_HOME}/bin/nutch solrindex http://localhost:8983/solr/bolsa/ crawl/crawldb crawl/linkdb crawl/segments/* > index.out
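One thing I noticed: `crawl/segments/*` expands to every segment of the crawl, so a URL fetched again at a later depth would get its fields (url, segment, boost, ...) added once per segment, which would match the duplicate values in the error above. A small sketch (the directory names are made up, Nutch uses timestamps of this shape) of selecting only the newest segment to pass to solrindex:

```shell
# Illustrative demo: Nutch names segment directories with a timestamp,
# so they sort lexicographically in creation order.  Create two fake
# segments and pick only the newest one to hand to solrindex.
mkdir -p demo_segments/20110714120000 demo_segments/20110714130000
latest_segment=$(ls -d demo_segments/* | sort | tail -n 1)
echo "$latest_segment"
```

In the real script, `$latest_segment` would replace `crawl/segments/*` in the solrindex call.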


Additionally, the crawl process now starts by saying that solrUrl is
not configured. Does that mean I could index directly into Solr,
without the previous Lucene indexing step?

Any hint?
Thank you in advance
Germán
