Hi,
I want to run two parallel Nutch crawls with two conf folders.
I am using the crawl command to do this. I have two separate conf folders; all
files in them are the same except crawl-urlfilter.txt, which holds
different filters (domain filters) in each copy.
e.g., the first conf has:
+.^http
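A minimal sketch of how this can be wired up, assuming Nutch 1.x and
hypothetical paths: the bin/nutch script honors the NUTCH_CONF_DIR
environment variable, so each crawl can be pointed at its own conf copy.

# conf-a/ and conf-b/ are full copies of conf/, differing only in
# crawl-urlfilter.txt (directory and seed-list names are hypothetical)
NUTCH_CONF_DIR=$NUTCH_HOME/conf-a bin/nutch crawl urls-a -dir crawl-a -depth 3 &
NUTCH_CONF_DIR=$NUTCH_HOME/conf-b bin/nutch crawl urls-b -dir crawl-b -depth 3 &
wait

Each crawl then sees only the domain filters from its own
crawl-urlfilter.txt, and the two runs write to separate crawl directories so
they don't clobber each other's data.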
Hi all,
I get an OutOfMemoryError when indexing with bin/nutch index crawl/indexes
crawl/crawldb crawl/linkdb crawl/segments/*
I have configured HADOOP_HEAPSIZE in hadoop-env.sh and
mapred.child.java.opts in mapred-site.xml up to the hardware limit:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2600m</value>
</property>
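For completeness, the matching hadoop-env.sh line (a sketch using the value
quoted above; HADOOP_HEAPSIZE is given in MB) would look like:

# hadoop-env.sh -- heap for the Hadoop daemons and CLI JVM, in MB
export HADOOP_HEAPSIZE=2600

Note that HADOOP_HEAPSIZE only sizes the daemons and client JVM; it is
mapred.child.java.opts that sizes the child JVMs actually running the index
tasks, so the -Xmx there is the setting that matters for this error.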
Hello
By merging segments with ...
nutch mergesegs $crawldir/MERGEDsegments $crawldir/segments/* -slice 5
... I got the following message in hadoop.log:
...
2010-03-03 03:27:01,849 INFO segment.SegmentMerger - Slice size: 5 URLs.
2010-03-03 22:47:43,130 INFO
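A note on -slice, grounded in that log line: SegmentMerger's -slice argument
is the number of URLs per output segment, not the number of output segments,
which is why hadoop.log reports "Slice size: 5 URLs." With -slice 5 a large
crawl gets chopped into a huge number of 5-URL segments. To split the merge
into fewer, bigger slices (50000 here is a hypothetical size), the call would
look like:

nutch mergesegs $crawldir/MERGEDsegments $crawldir/segments/* -slice 50000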