Re: Nutch and Hadoop not working proper

2009-06-24 Thread Andrzej Bialecki

MilleBii wrote:

HLPPP !!!

Stuck for 3 days, unable to start any nutch job.

hdfs works fine, i.e. I can put and look at files.
When I start a nutch crawl, I get the following error:

Job initialization failed:
java.lang.IllegalArgumentException: Pathname
/d:/Bii/nutch/logs/history/user/_logs/history/localhost_1245788245191_job_200906232217_0001_pc-%5C_inject+urls

It is looking for the file at a wrong location. Indeed, in my case the
correct location is /d:/Bii/nutch/logs/history, so why is
*history/user/_logs* added, and how can I fix that?

2009/6/21 MilleBii mille...@gmail.com


Looks like I just needed to transfer from the local filesystem to hdfs:
Is it safe to transfer a crawl directory (and subs) from the local file
system to hdfs and start crawling again ?

1. hadoop fs -put crawl crawl
2. nutch generate crawl/crawldb crawl/segments -topN 500 (where now it
should use the hdfs)

-MilleBii-

2009/6/21 MilleBii mille...@gmail.com

 I have newly installed hadoop in a distributed single node configuration.

When I run nutch commands, it is looking for files in my user home directory
and not in the nutch directory.
How can I change this?


I suspect your hadoop-site.xml uses a relative path somewhere, and not an 
absolute path (with a leading slash). Also, /d: looks suspiciously like a 
Windows pathname, in which case you should either use a full URI 
(file:///d:/) or just the disk name d:/ without the leading slash. 
Please also note that if you are running this on Windows under cygwin, 
then in your config files you MUST NOT use cygwin paths (like 
/cygdrive/d/...), because Java can't see them.
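As a minimal sketch of what that advice looks like in hadoop-site.xml: the hadoop.tmp.dir and mapred.system.dir keys are standard Hadoop settings of that era, but the d:/Bii/... directories are hypothetical, following the thread's example, and both value styles (d:/ and file:///d:/) are shown only to illustrate the two forms suggested above.

```xml
<!-- Sketch only: absolute, non-cygwin paths. Directories are hypothetical. -->
<property>
  <name>hadoop.tmp.dir</name>
  <!-- drive name without a leading slash -->
  <value>d:/Bii/hadoop-tmp</value>
</property>
<property>
  <name>mapred.system.dir</name>
  <!-- or a full file: URI -->
  <value>file:///d:/Bii/mapred/system</value>
</property>
```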



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



How to run nutch on a 2G memory tasknode

2009-06-24 Thread SunGod
The error occurred in the crawldb TestDB/crawldb reduce phase.

I get this error message: java.lang.OutOfMemoryError: Java heap space

my command
 bin/nutch crawl url -dir TestDB -depth 4 -threads 3

 a single fetchlist is around 20

my settings on the memory

hadoop-env.sh
export HADOOP_HEAPSIZE=800

hadoop-site.xml
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
</property>
<property>
  <name>mapred.map.max.attempts</name>
  <value>4</value>
</property>
<property>
  <name>mapred.reduce.max.attempts</name>
  <value>4</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx250m</value>
</property>
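For reference, a rough worst-case heap budget for the settings above (the arithmetic is mine, not from the thread, and ignores the OS and non-heap JVM overhead): with up to 4 map and 4 reduce child JVMs at -Xmx250m each, plus HADOOP_HEAPSIZE for the daemon, the total can exceed a 2G node, which would be consistent with the OutOfMemoryError.

```shell
#!/bin/sh
# Rough worst-case heap budget for the quoted settings (MB).
daemon=800     # HADOOP_HEAPSIZE, per daemon
maps=4         # mapred.tasktracker.map.tasks.maximum
reduces=4      # mapred.tasktracker.reduce.tasks.maximum
child=250      # -Xmx250m in mapred.child.java.opts
total=$((daemon + (maps + reduces) * child))
echo "${total} MB"   # 2800 MB -- already past 2G before the OS is counted
```

Lowering the two tasks.maximum values (or the child heap) brings the sum back under the node's memory.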


Re: Nutch and Hadoop not working proper

2009-06-24 Thread MilleBii
Yes, I'm using both relative paths and cygwin under Windows, so /d: is not
introduced by me, but by either nutch or hadoop.

Regarding the cygwin path you are right... that is actually where I lost quite some
time.

OK will try absolute paths and let you know.

-MilleBii-



recrawling

2009-06-24 Thread Neeti Gupta

We have made a crawler that visits various sites, and I want the crawler to
crawl a site as soon as it is updated. Can anyone help me figure out how to
know when a site has been updated and it is time to crawl again?
-- 
View this message in context: 
http://www.nabble.com/recrawling-tp24183356p24183356.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: recrawling

2009-06-24 Thread Otis Gospodnetic

Neeti,

I don't think there is a way to know when a regular web site has been updated.  
You can issue GET or HEAD requests and look at the Last-Modified date, but this 
is not 100% reliable.  You can fetch and compare content, but that's not 100% 
reliable either.  If you are indexing blogs, then you can get pings when they 
update, or can rely on detecting changes in their feeds.
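Otis's Last-Modified suggestion can be sketched as a small comparison step (hypothetical example; the commented curl line marks where a real HEAD request would go, and as he notes the header is not 100% reliable, so this is a heuristic, not a guarantee):

```shell
#!/bin/sh
# Decide whether to recrawl by comparing Last-Modified headers.
# "prev" would be saved from the previous crawl; "cur" would come from a
# HEAD request, e.g.:
#   cur=`curl -sI http://example.com/ | sed -n 's/^Last-Modified: //p'`
prev="Wed, 24 Jun 2009 12:00:00 GMT"
cur="Thu, 25 Jun 2009 09:30:00 GMT"    # stand-in for the fetched header
if [ -n "$cur" ] && [ "$cur" != "$prev" ]; then
  recrawl=yes    # header changed -> schedule a recrawl
else
  recrawl=no     # unchanged (or no header) -> skip this round
fi
echo "$recrawl"
```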

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch






Re: Nutch and Hadoop not working proper

2009-06-24 Thread MilleBii
Actually I tried it and it failed, but here is what I found:

bin/hadoop-config.sh does the conversion from a relative to an absolute path:

this="$0"
while [ -h "$this" ]; do
  ls=`ls -ld "$this"`
  link=`expr "$ls" : '.*-> \(.*\)$'`
  if expr "$link" : '.*/.*' > /dev/null; then
    this="$link"
  else
    this=`dirname "$this"`/"$link"
  fi
done

# convert relative path to absolute path
bin=`dirname "$this"`
script=`basename "$this"`
bin=`cd "$bin"; pwd`
this="$bin/$script"

# the root of the Hadoop installation
export HADOOP_HOME=`dirname "$this"`/..

Now if you echo it out, the script uses the full cygwin path, i.e.:

/cygdrive/d/...

I tried to change the export into an absolute path file:///d:/..., but then it
does not work and hadoop does not even start.
Whereas in my case it will start, and works as long as you are using hadoop
commands; none of the nutch commands actually work.

It is as if the dfs was working but not the mapred part of hadoop.
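For illustration, the dirname/pwd idiom the script relies on can be exercised on its own (the "demo.sh" name is hypothetical): it always yields an absolute path, but under cygwin that absolute path is a POSIX /cygdrive/... path, not a Windows one, which matches the observation above.

```shell
#!/bin/sh
# The relative-to-absolute conversion used in hadoop-config.sh, in isolation.
this=./demo.sh                      # a relative path, as $0 often is
bin=`dirname "$this"`               # "."
bin=`cd "$bin"; pwd`                # absolute path of that directory
this="$bin/`basename "$this"`"
printf '%s\n' "$this"               # always starts with "/" -- on cygwin
                                    # that means a /cygdrive/... path
```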


2009/6/24 MilleBii mille...@gmail.com

 Yes I'm using both relative path  cygwin under windows. so /d: is not
 introduced by me, but either nutch or hadoop.

 Regarding the cygwin path you are righ... actually where I lost quite some
 time.

 OK will try absolute paths and let you know.

 -MilleBii-

 2009/6/24 Andrzej Bialecki a...@getopt.org

   MilleBii wrote:

 HLPPP !!!

 Stuck for 3 days on not able to start any nutch job.

 hdfs works fine, ie I can put  look at files.
 When i start nutch crawl, I get the following error

 Job initialization failed:
 java.lang.IllegalArgumentException: Pathname

 /d:/Bii/nutch/logs/history/user/_logs/history/localhost_1245788245191_job_200906232217_0001_pc-%5C_inject+urls

 It is looking for the file at a wrong location  Indeed in my case the
 correct location is /d:/Bii/nutch/logs/history, so why is *
 history/user/_logs* added and how can I fix that ?

 2009/6/21 MilleBii mille...@gmail.com

 Looks like I just needed to transfer from the local filesystem to hdfs:
 Is it safe to transfer a crawl directory (and subs) from the local file
 system to hdfs and start crawling again ?

 1. hadoop fs -put crawl crawl
 2. nutch generate crawl/crawldb crawl/segments -topN 500 (where now it
 should use the hdfs)

 -MilleBii-

 2009/6/21 MilleBii mille...@gmail.com

  I have newly installed hadoop in a distributed single node
 configuration.

 When I run nutch commands  it is looking for files my user home
 directory
 and not at the nutch directory ?
 How can I change this ?


 I suspect your hadoop-site.xml uses relative path somewhere, and not an
 absolute path (with leading slash). Also, /d: looks suspiciously like a
 Windows pathname, in which case you should either use a full URI
 (file:///d:/) or just the disk name d:/ without the leading slash. Please
 also note that if you are running this on Windows under cygwin then in your
 config files you MUST NOT use the cygwin paths (like /cygdrive/d/...)
 because Java can't see them.


 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




 --
 -MilleBii-




-- 
-MilleBii-


Re: Nutch and Hadoop not working proper

2009-06-24 Thread MilleBii
What I have also discovered:
+ the hadoop script works with unix-like paths and works fine on Windows
+ the nutch script works with Windows paths

Could it be that there is some incompatibility because one works with unix-like
paths and the other does not?
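On cygwin the usual bridge between the two conventions is cygpath (e.g. `cygpath -w /cygdrive/d/Bii/nutch`). As a self-contained illustration of what that conversion does, here is a sed rewrite of the thread's example path (illustration only; a real install should use cygpath itself):

```shell
#!/bin/sh
# Convert a cygwin-style path to the d:/ form Java understands.
cyg=/cygdrive/d/Bii/nutch
win=`echo "$cyg" | sed 's|^/cygdrive/\(.\)|\1:|'`
echo "$win"   # d:/Bii/nutch
```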





-- 
-MilleBii-


Re: Nutch and Hadoop not working proper

2009-06-24 Thread Andrzej Bialecki

MilleBii wrote:

What I have also discovered:
+ the hadoop script works with unix-like paths and works fine on Windows
+ the nutch script works with Windows paths


bin/nutch works with Windows paths? I think this could happen only by 
accident - both scripts work with Cygwin paths. On the other hand, 
arguments passed to the JVM must be regular Windows paths.




Could it be that there is some incompatibility because one works with unix-like
paths and the other does not?


Both scripts work fine for me on Windows XP + Cygwin, without any 
special settings - I suspect there is something strange in your 
environment or config...


Please note that the Hadoop and Nutch scripts are regular shell scripts, so 
they are aware of Cygwin path conventions; in fact they don't accept 
un-escaped Windows paths as arguments (i.e. you need to use forward 
slashes, or put double quotes around a Windows path).
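The quoting point can be seen in any POSIX shell (the d:\Bii\crawl path is hypothetical): unquoted, the backslashes in a Windows path are consumed as escape characters before the script ever sees them.

```shell
#!/bin/sh
# What the shell does to a Windows-style path before nutch/hadoop see it.
p=d:\Bii\crawl        # unquoted: \B and \c are treated as escapes
q="d:\Bii\crawl"      # double-quoted: the backslashes survive
printf '%s\n' "$p"    # prints d:Biicrawl -- the backslashes are gone
printf '%s\n' "$q"    # prints d:\Bii\crawl
# (printf is used because some echo builtins themselves interpret \c)
```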



--
Best regards,
Andrzej Bialecki 