Re: Nutch and Hadoop not working proper

2009-06-25 Thread MilleBii
2009/6/24 Andrzej Bialecki a...@getopt.org

 MilleBii wrote:

 What I have also discovered:
 + hadoop (script) works with Unix-like paths and works fine on Windows
 + nutch (script) works with Windows paths


 bin/nutch works with Windows paths? I think this could happen only by
 accident - both scripts work with Cygwin paths. On the other hand, arguments
 passed to JVM must be regular Windows paths.

That's what I meant: all paths on the JVM call are Windows paths...
Actually this raises a question: should paths in nutch-site.xml or hadoop-site.xml
be Unix-like or Windows-like?



 Could it be that there is some incompatibility because one works with Unix-like
 paths and the other doesn't?


 Both scripts work fine for me on Windows XP + Cygwin, without any special
 settings - I suspect there is something strange in your environment or
 config...

 Please note that Hadoop and Nutch scripts are regular shell scripts, so
 they are aware of Cygwin path conventions, in fact they don't accept
 un-escaped Windows paths as arguments (i.e. you need to use forward slashes,
 or you need to put double quotes around a Windows path).

Clear, but in a way I don't use any paths... since I'm using only relative
paths (at least I think).

The test command that I use is very simple, the following: nutch crawl urls
-dir hcrawl -depth 2

Both the urls and hcrawl directories do exist in the HDFS filesystem, yet I get a
job failed error, and when I look at where the problem was, I see this strange
path problem.




 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




-- 
-MilleBii-


Re: Nutch and Hadoop not working proper

2009-06-25 Thread MilleBii
Did another test and got this error:

2009-06-25 21:19:44,663 ERROR mapred.EagerTaskInitializationListener - Job
initialization failed:
java.lang.IllegalArgumentException: Pathname
/d:/Bii/nutch/logs/history/user/_logs/history/localhost_1245956549829_job_200906252102_0001_pc-xxx%xxx_inject+urls
from
d:/Bii/nutch/logs/history/user/_logs/history/localhost_1245956549829_job_200906252102_0001_pc-xxx%5Cxxx_inject+urls
*is not a valid DFS filename*

Some remarks which may help someone give a hint:
1. The log files are not in the DFS but in the local filesystem, so why is it
looking in the DFS for the logs?
2. Of course a Windows path... does not fit in the DFS.
3. Even in the local filesystem it is the wrong path; it should be
/d:/Bii/nutch/logs/history/localhost
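
One thing that might be worth trying (only a guess on my side, based on the
hadoop.job.history.location / hadoop.job.history.user.location properties that
exist in Hadoop 0.19/0.20; the directory below is just an example) is pinning
the history location explicitly in conf/hadoop-site.xml and disabling the
per-job copy:

<property>
  <name>hadoop.job.history.location</name>
  <!-- example directory only -->
  <value>file:///d:/Bii/nutch/logs/history</value>
</property>
<property>
  <name>hadoop.job.history.user.location</name>
  <!-- "none" disables the per-job copy of the history files -->
  <value>none</value>
</property>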



-- 
-MilleBii-


Re: Nutch and Hadoop not working proper

2009-06-24 Thread Andrzej Bialecki

MilleBii wrote:

HLPPP !!!

Stuck for 3 days, not able to start any nutch job.

HDFS works fine, i.e. I can put and look at files.
When I start nutch crawl, I get the following error:

Job initialization failed:
java.lang.IllegalArgumentException: Pathname
/d:/Bii/nutch/logs/history/user/_logs/history/localhost_1245788245191_job_200906232217_0001_pc-%5C_inject+urls

It is looking for the file at the wrong location. Indeed, in my case the
correct location is /d:/Bii/nutch/logs/history, so why is
*history/user/_logs* added and how can I fix that?

2009/6/21 MilleBii mille...@gmail.com


Looks like I just needed to transfer from the local filesystem to hdfs:
Is it safe to transfer a crawl directory (and subs) from the local file
system to hdfs and start crawling again ?

1. hadoop fs -put crawl crawl
2. nutch generate crawl/crawldb crawl/segments -topN 500 (where now it
should use the hdfs)

-MilleBii-

2009/6/21 MilleBii mille...@gmail.com

 I have newly installed hadoop in a distributed single-node configuration.

When I run nutch commands, it is looking for files in my user home directory
and not at the nutch directory.
How can I change this?


I suspect your hadoop-site.xml uses a relative path somewhere, and not an
absolute path (with a leading slash). Also, /d: looks suspiciously like a
Windows pathname, in which case you should either use a full URI
(file:///d:/) or just the disk name d:/ without the leading slash.
Please also note that if you are running this on Windows under Cygwin,
then in your config files you MUST NOT use Cygwin paths (like
/cygdrive/d/...) because Java can't see them.
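
For example, a local-filesystem property in conf/hadoop-site.xml would then
look something like this (d:/Bii/hadoop-tmp is only a placeholder directory):

<property>
  <name>hadoop.tmp.dir</name>
  <!-- plain Windows path: forward slashes, no leading slash -->
  <value>d:/Bii/hadoop-tmp</value>
</property>
<!-- the full-URI alternative would be file:///d:/Bii/hadoop-tmp -->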



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch and Hadoop not working proper

2009-06-24 Thread MilleBii
Yes, I'm using both relative paths and Cygwin under Windows, so /d: is not
introduced by me, but by either Nutch or Hadoop.

Regarding the Cygwin path you are right... that's actually where I lost quite some
time.

OK, I will try absolute paths and let you know.

-- 
-MilleBii-


Re: Nutch and Hadoop not working proper

2009-06-24 Thread MilleBii
Actually I tried it and it fails, but this is what I found:

bin/hadoop-config.sh does the conversion from relative to absolute paths:

this="$0"
while [ -h "$this" ]; do
  ls=`ls -ld "$this"`
  link=`expr "$ls" : '.*-> \(.*\)$'`
  if expr "$link" : '.*/.*' > /dev/null; then
    this="$link"
  else
    this=`dirname "$this"`/"$link"
  fi
done

# convert relative path to absolute path
bin=`dirname "$this"`
script=`basename "$this"`
bin=`cd "$bin"; pwd`
this="$bin/$script"

# the root of the Hadoop installation
export HADOOP_HOME=`dirname "$this"`/..

Now if you echo it out, the exported HADOOP_HOME is a full Cygwin path, i.e.:

/cygdrive/d/...

I tried to change the export into an absolute path (file:///d:/...); it does
not work and Hadoop does not even start.
Whereas in my case it will start and work as long as you are using hadoop
commands, but none of the nutch commands actually work.

It is as if the DFS was working but not the MapReduce part of Hadoop.
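
A side note for anyone debugging the same thing: cygpath converts between the
two notations, which makes it easy to check what a given directory looks like
from Java's point of view (d:/Bii/nutch below is only an example directory):

# run from a Cygwin shell
cygpath -u 'd:/Bii/nutch'            # /cygdrive/d/Bii/nutch -- Cygwin view, invisible to Java
cygpath -w '/cygdrive/d/Bii/nutch'   # D:\Bii\nutch          -- native Windows view
cygpath -m '/cygdrive/d/Bii/nutch'   # D:/Bii/nutch          -- Windows view with forward slashes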


-- 
-MilleBii-


Re: Nutch and Hadoop not working proper

2009-06-24 Thread MilleBii
What I have also discovered:
+ hadoop (script) works with Unix-like paths and works fine on Windows
+ nutch (script) works with Windows paths

Could it be that there is some incompatibility because one works with Unix-like
paths and the other doesn't?



-- 
-MilleBii-


Re: Nutch and Hadoop not working proper

2009-06-24 Thread Andrzej Bialecki

MilleBii wrote:

What I have also discovered:
+ hadoop (script) works with Unix-like paths and works fine on Windows
+ nutch (script) works with Windows paths


bin/nutch works with Windows paths? I think this could happen only by 
accident - both scripts work with Cygwin paths. On the other hand, 
arguments passed to JVM must be regular Windows paths.




Could it be that there is some incompatibility because one works with Unix-like
paths and the other doesn't?


Both scripts work fine for me on Windows XP + Cygwin, without any 
special settings - I suspect there is something strange in your 
environment or config...


Please note that the Hadoop and Nutch scripts are regular shell scripts, so
they are aware of Cygwin path conventions; in fact, they don't accept
un-escaped Windows paths as arguments (i.e. you need to use forward
slashes, or you need to put double quotes around a Windows path).
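
From memory, both launch scripts of that era contain a fragment roughly like
the sketch below (paraphrased, not copied from either script), which is why
the arguments the JVM finally sees are Windows-style even though the shell
itself works with Cygwin paths:

# rough paraphrase of the Cygwin handling in bin/hadoop and bin/nutch
cygwin=false
case "`uname`" in
CYGWIN*) cygwin=true;;
esac

CLASSPATH="$HADOOP_HOME/conf"   # the real scripts then append every jar under lib/

# just before invoking java, convert the path list to Windows form
if $cygwin; then
  CLASSPATH=`cygpath -p -w "$CLASSPATH"`
fi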



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch and Hadoop not working proper

2009-06-23 Thread MilleBii
HLPPP !!!

Stuck for 3 days, not able to start any nutch job.

HDFS works fine, i.e. I can put and look at files.
When I start nutch crawl, I get the following error:

Job initialization failed:
java.lang.IllegalArgumentException: Pathname
/d:/Bii/nutch/logs/history/user/_logs/history/localhost_1245788245191_job_200906232217_0001_pc-%5C_inject+urls

It is looking for the file at the wrong location. Indeed, in my case the
correct location is /d:/Bii/nutch/logs/history, so why is
*history/user/_logs* added and how can I fix that?

-- 
-MilleBii-


Nutch and Hadoop not working proper

2009-06-21 Thread MilleBii
I have newly installed hadoop in a distributed single-node configuration.
When I run nutch commands, it is looking for files in my user home directory
and not at the nutch directory.
How can I change this?
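
For what it's worth: when fs.default.name points at HDFS, relative paths on
hadoop and nutch commands resolve under the HDFS home directory /user/<your
login>, not under the directory the command was launched from, which may be
what is happening here. A quick way to see it (crawl is just an example name):

bin/hadoop fs -ls            # with no path, lists the home directory /user/<login> (if it exists)
bin/hadoop fs -ls crawl      # same as: bin/hadoop fs -ls /user/<login>/crawl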

-- 
-MilleBii-


Re: Nutch and Hadoop not working proper

2009-06-21 Thread MilleBii
Looks like I just needed to transfer from the local filesystem to HDFS.
Is it safe to transfer a crawl directory (and subdirectories) from the local
filesystem to HDFS and start crawling again?

1. hadoop fs -put crawl crawl
2. nutch generate crawl/crawldb crawl/segments -topN 500 (where it should now
use HDFS)
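
A quick way to double-check that the copy from step 1 landed where step 2 will
look (crawl is just the example directory name):

bin/hadoop fs -ls crawl           # expect crawldb/, segments/, etc. if the local crawl had them
bin/hadoop fs -ls crawl/crawldb
# step 2 then reads from HDFS as long as fs.default.name in hadoop-site.xml
# points at the namenode (e.g. hdfs://localhost:9000)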

-- 
-MilleBii-