Hey Safdar,

This question is best asked on the Apache Solr mailing lists. I believe you'll get better responses there, so I've redirected it to Solr's own list (solr-user[at]lucene.apache.org).
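In the meantime, one knob worth checking (a sketch, assuming a 1.x-era Hadoop where these property names apply): the non-DFS growth you describe is map-side spill, and compressing the intermediate map output usually shrinks it considerably. For example, in mapred-site.xml:

  <!-- compress intermediate map output before it is spilled to local disk -->
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <!-- SnappyCodec is faster, if the native libraries are installed -->
    <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  </property>

If the Nutch indexer tool runs through ToolRunner (I believe it does), you should also be able to pass this per-job:

  bin/nutch solrindex -Dmapred.compress.map.output=true \
      <solrurl> <crawldb> -linkdb <linkdb> -dir <segmentsdir>

But the Solr folks will have better answers on tuning the indexing job itself.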
I've BCC'd common-user[at]hadoop.apache.org and CC'd you in case you haven't subscribed to Solr's list.

On Sat, Jun 23, 2012 at 8:14 PM, Safdar Kureishy <safdar.kurei...@gmail.com> wrote:
> Hi,
>
> I couldn't find an answer to this question online, so I'm posting it to the
> mailing list.
>
> I've got a crawl of about 10M *fetched* pages (the crawldb has about 50M
> pages, since it includes the fetched + failed + unfetched pages). I've also
> got a freshly updated linkdb and webgraphdb (having run linkrank). I'm
> trying to index the fetched pages (content + anchor links) using solrindex.
>
> When I launch the "bin/nutch solrindex <solrurl> <crawldb> -linkdb <linkdb>
> -dir <segmentsdir>" command, disk space utilization jumps sharply.
> Before running the solrindex stage, I had about 50% of disk space remaining
> for HDFS on my nodes (5 nodes) -- I had consumed about 100G and had about
> 100G left over. However, while the solrindex phase runs, by the end of the
> map phase disk utilization nears 100% and the available HDFS space drops
> below 1%. Running "hadoop dfsadmin -report" shows that the jump in storage
> is for non-DFS data (i.e. intermediate data), and it happens during the map
> phase of the IndexerMapReduce job (solrindex).
>
> What can I do to reduce the intermediate data generated by solrindex? Are
> there any configuration settings I should change? I'm using all the
> defaults for the indexing phase, and I'm not using any custom plugins
> either.
>
> Thanks,
> Safdar

--
Harsh J