Hi,
I want to run two parallel Nutch crawls with two conf folders.
I am using the crawl command to do this. I have two separate conf folders; all
files in them are the same except crawl-urlfilter.txt, which holds
different filters (domain filters) in each copy.
e.g., the first conf has:
+.^http
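A minimal sketch of how this can be wired up, assuming Nutch 1.x and
hypothetical paths: the bin/nutch script honors the NUTCH_CONF_DIR
environment variable, so each crawl can be pointed at its own conf copy.

# conf-a/ and conf-b/ are full copies of conf/, differing only in
# crawl-urlfilter.txt (directory and seed-list names are hypothetical)
NUTCH_CONF_DIR=$NUTCH_HOME/conf-a bin/nutch crawl urls-a -dir crawl-a -depth 3 &
NUTCH_CONF_DIR=$NUTCH_HOME/conf-b bin/nutch crawl urls-b -dir crawl-b -depth 3 &
wait

Each crawl then sees only the domain filters from its own
crawl-urlfilter.txt, and the two runs write to separate crawl directories so
they don't clobber each other's data.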
Hi all,
I get an OutOfMemoryError when indexing with bin/nutch index crawl/indexes
crawl/crawldb crawl/linkdb crawl/segments/*
I have configured HADOOP_HEAPSIZE in hadoop-env.sh and
mapred.child.java.opts in mapred-site.xml up to the hardware limit:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2600m</value>
</property>
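For completeness, the matching hadoop-env.sh line (a sketch using the value
quoted above; HADOOP_HEAPSIZE is given in MB) would look like:

# hadoop-env.sh -- heap for the Hadoop daemons and CLI JVM, in MB
export HADOOP_HEAPSIZE=2600

Note that HADOOP_HEAPSIZE only sizes the daemons and client JVM; it is
mapred.child.java.opts that sizes the child JVMs actually running the index
tasks, so the -Xmx there is the setting that matters for this error.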
Hello
By merging segments with ...
nutch mergesegs $crawldir/MERGEDsegments $crawldir/segments/* -slice 5
... I got the following message in hadoop.log:
...
2010-03-03 03:27:01,849 INFO segment.SegmentMerger - Slice size: 5 URLs.
2010-03-03 22:47:43,130 INFO
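A note on -slice, grounded in that log line: SegmentMerger's -slice argument
is the number of URLs per output segment, not the number of output segments,
which is why hadoop.log reports "Slice size: 5 URLs." With -slice 5 a large
crawl gets chopped into a huge number of 5-URL segments. To split the merge
into fewer, bigger slices (50000 here is a hypothetical size), the call would
look like:

nutch mergesegs $crawldir/MERGEDsegments $crawldir/segments/* -slice 50000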