MilleBii wrote:
HELP!!!
I have been stuck for 3 days, unable to start any Nutch job.
HDFS works fine, i.e. I can put and look at files.
When I start a Nutch crawl, I get the following error:
Job initialization failed:
java.lang.IllegalArgumentException: Pathname
An error occurred in the reduce phase of crawldb TestDB/crawldb.
I get the error message: java.lang.OutOfMemoryError: Java heap space
My command:
bin/nutch crawl url -dir TestDB -depth 4 -threads 3
single fetchlist around in 20
My memory settings, in hadoop-env.sh:
export HADOOP_HEAPSIZE=800
Yes, I'm using both relative paths and Cygwin under Windows, so /d: is not
introduced by me, but by either Nutch or Hadoop.
Regarding the Cygwin path, you are right... that is actually where I lost quite
some time.
OK, I will try absolute paths and let you know.
-MilleBii-
2009/6/24 Andrzej Bialecki
We have made a crawler that visits various sites, and I want the crawler to
crawl sites as soon as they are updated. Can anyone help me find out how I
can know when a site has been updated and it is time to crawl it again?
Neeti,
I don't think there is a way to know when a regular web site has been updated.
You can issue GET or HEAD requests and look at the Last-Modified date, but this
is not 100% reliable. You can fetch and compare content, but that's not 100%
reliable either. If you are indexing blogs,
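
The HEAD / Last-Modified check mentioned above could be sketched roughly as follows. This is only an illustration, not something Nutch does for you: the function names are made up for this example, it assumes curl is installed, and, as noted, a server may omit the header or report a misleading date, so a missing date is treated as "changed".

```shell
#!/bin/sh
# Extract the Last-Modified value from raw HTTP response headers on stdin.
# Header names are case-insensitive, so the match ignores case.
last_modified_from_headers() {
  tr -d '\r' | awk 'tolower($0) ~ /^last-modified:/ {
    sub(/^[^:]*:[ \t]*/, "")   # drop the header name and leading blanks
    print
    exit
  }'
}

# Hypothetical helper: send a HEAD request and print the Last-Modified value.
# curl -s hides the progress meter, -I sends HEAD and prints the headers only.
fetch_last_modified() {
  curl -sI "$1" | last_modified_from_headers
}

# Succeeds when the page looks changed since the value saved after the last
# crawl. An empty current value means the server sent no date, so the page
# is treated as changed rather than skipped.
looks_changed() {
  previous=$1
  current=$2
  [ -z "$current" ] || [ "$previous" != "$current" ]
}
```

For example, one would store the output of `fetch_last_modified` for a URL after each crawl and recrawl only when `looks_changed "$stored" "$(fetch_last_modified "$url")"` succeeds.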
Actually, I tried and it fails, but this is what I found:
bin/hadoop-config.sh does the conversion from relative to absolute path:
this=$0
while [ -h $this ]; do
ls=`ls -ld $this`
link=`expr $ls : '.*-> \(.*\)$'`
if expr $link : '.*/.*' > /dev/null; then
this=$link
else
this=`dirname
What I have also discovered:
+ hadoop (script) works with Unix-like paths and works fine on Windows
+ nutch (script) works with Windows paths
Could it be that there is some incompatibility because one works with
Unix-like paths and the other does not?
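
To make the difference between the two path styles concrete, here is a small sketch of converting a Windows path into the /cygdrive form. It is an illustration only, and nothing in this thread says either script does it this way. On a real Cygwin install the cygpath tool does this properly (`cygpath -u` prints the Unix-style form); the function names below are made up, and the awk fallback only handles plain drive-letter paths.

```shell
#!/bin/sh
# Hypothetical fallback: turn D:\nutch\TestDB into /cygdrive/d/nutch/TestDB.
# Only backslashes and a leading drive letter are handled; relative paths
# just get their backslashes turned into forward slashes.
win_to_cygdrive() {
  printf '%s\n' "$1" | awk '{
    gsub(/\\/, "/")                     # backslashes to forward slashes
    if (match($0, /^[A-Za-z]:/)) {      # leading drive letter, e.g. D:
      $0 = "/cygdrive/" tolower(substr($0, 1, 1)) substr($0, 3)
    }
    print
  }'
}

# Prefer the real tool when it is available (it ships with Cygwin).
to_cygwin_path() {
  if command -v cygpath >/dev/null 2>&1; then
    cygpath -u "$1"
  else
    win_to_cygdrive "$1"
  fi
}
```

For example, `to_cygwin_path 'D:\nutch\TestDB'` gives a path that Cygwin tools can use, whereas handing the raw Windows form to a script that expects Unix-like paths is the kind of mismatch asked about above.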
2009/6/24 MilleBii mille...@gmail.com
Actually
MilleBii wrote:
What I have also discovered:
+ hadoop (script) works with Unix-like paths and works fine on Windows
+ nutch (script) works with Windows paths
bin/nutch works with Windows paths? I think this could happen only by
accident - both scripts work with Cygwin paths. On the other