Re: Nutch and Hadoop not working proper
2009/6/24 Andrzej Bialecki a...@getopt.org

> MilleBii wrote:
>> What I have also discovered:
>> + hadoop (script) works with unix-like paths and works fine on Windows
>> + nutch (script) works with Windows paths
>
> bin/nutch works with Windows paths? I think this could happen only by
> accident - both scripts work with Cygwin paths. On the other hand,
> arguments passed to the JVM must be regular Windows paths.

That's what I meant: all the paths on the JVM call are Windows paths... Actually, this is a question: should the paths in nutch-site.xml or hadoop-site.xml be Unix-like or Windows-like?

>> Could it be that there is some incompatibility because one works with
>> unix-like paths and the other does not?
>
> Both scripts work fine for me on Windows XP + Cygwin, without any special
> settings - I suspect there is something strange in your environment or
> config... Please note that the Hadoop and Nutch scripts are regular shell
> scripts, so they are aware of Cygwin path conventions; in fact they don't
> accept un-escaped Windows paths as arguments (i.e. you need to use forward
> slashes, or you need to put double quotes around a Windows path).

Clear, but in a way I don't use any path... since I'm only using relative paths (at least I think so). The test command that I use is very simple:

  nutch crawl urls -dir hcrawl -depth 2

Both the urls and hcrawl directories do exist in the HDFS filesystem, yet I get a job failed error, and when I look at where the problem was, I get this strange path problem.

> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web, Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

--
-MilleBii-
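For anyone trying to see which path form each side actually receives, Cygwin ships a cygpath utility that converts between the two conventions; a minimal check, using the D:/Bii/nutch directory from this thread purely as an example:

  # Cygwin form -> Windows form (what the JVM needs to see)
  cygpath -w /cygdrive/d/Bii/nutch      # prints D:\Bii\nutch
  # Windows form -> Cygwin form (what the shell scripts work with)
  cygpath -u 'D:\Bii\nutch'             # prints /cygdrive/d/Bii/nutch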
Re: Nutch and Hadoop not working proper
Did another test and got this error:

  2009-06-25 21:19:44,663 ERROR mapred.EagerTaskInitializationListener - Job initialization failed:
  java.lang.IllegalArgumentException: Pathname
  /d:/Bii/nutch/logs/history/user/_logs/history/localhost_1245956549829_job_200906252102_0001_pc-xxx%xxx_inject+urls
  from d:/Bii/nutch/logs/history/user/_logs/history/localhost_1245956549829_job_200906252102_0001_pc-xxx%5Cxxx_inject+urls
  *is not a valid DFS filename*

Some remarks which may help someone give a hint:
1. The log files are not in the DFS but in the local filesystem, so why is it looking in the DFS for the logs?
2. Of course a Windows path... does not fit in DFS.
3. Even in the local filesystem it is the wrong path; it should be /d:/Bii/nutch/logs/history/localhost

--
-MilleBii-
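One place a mixed-up pathname like that can come from is the job-history location in the Hadoop configuration. A quick way to see what the running configuration contains (the conf/ layout assumed below is the standard 0.19/0.20-era install, so adjust the filenames to your setup):

  # Look for the job-history and log-dir settings that feed into that pathname
  grep -n "HADOOP_LOG_DIR" "$HADOOP_HOME/conf/hadoop-env.sh"
  grep -n "history" "$HADOOP_HOME/conf/hadoop-default.xml" "$HADOOP_HOME/conf/hadoop-site.xml"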
Re: Nutch and Hadoop not working proper
MilleBii wrote:

> HLPPP !!! Stuck for 3 days, not able to start any nutch job.
>
> hdfs works fine, i.e. I can put and look at files. When I start a nutch crawl, I get the following error:
>
>   Job initialization failed: java.lang.IllegalArgumentException: Pathname
>   /d:/Bii/nutch/logs/history/user/_logs/history/localhost_1245788245191_job_200906232217_0001_pc-%5C_inject+urls
>
> It is looking for the file at a wrong location. Indeed, in my case the correct location is
> /d:/Bii/nutch/logs/history, so why is *history/user/_logs* added and how can I fix that?

I suspect your hadoop-site.xml uses a relative path somewhere, and not an absolute path (with a leading slash). Also, /d: looks suspiciously like a Windows pathname, in which case you should either use a full URI (file:///d:/) or just the disk name d:/ without the leading slash.

Please also note that if you are running this on Windows under Cygwin, then in your config files you MUST NOT use the Cygwin paths (like /cygdrive/d/...) because Java can't see them.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
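A concrete way to test whether a given path form will resolve from Java is to point a plain filesystem listing at it; the directory below is just the one from this thread, so substitute your own:

  # A file:/// URI that Java can resolve should list cleanly;
  # a /cygdrive/... form would not, since the JVM knows nothing about Cygwin mounts.
  bin/hadoop fs -ls file:///d:/Bii/nutch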
Re: Nutch and Hadoop not working proper
Yes, I'm using both relative paths and Cygwin under Windows, so the /d: is not introduced by me, but by either nutch or hadoop.

Regarding the Cygwin paths you are right... that is actually where I lost quite some time.

OK, I will try absolute paths and let you know.

--
-MilleBii-
Re: Nutch and Hadoop not working proper
Actually I tried it and it fails, but this is what I found: bin/hadoop-config.sh does the conversion from relative to absolute path:

  this=$0
  while [ -h $this ]; do
    ls=`ls -ld $this`
    link=`expr $ls : '.*-> \(.*\)$'`
    if expr $link : '.*/.*' > /dev/null; then
      this=$link
    else
      this=`dirname $this`/$link
    fi
  done

  # convert relative path to absolute path
  bin=`dirname $this`
  script=`basename $this`
  bin=`cd $bin; pwd`
  this=$bin/$script

  # the root of the Hadoop installation
  export HADOOP_HOME=`dirname $this`/..

Now if you echo it out, the resulting path is a full Cygwin path, i.e. /cygdrive/d/...

I tried to change the export into an absolute path file:///d:/..., but that does not work and hadoop does not even start. Whereas in my case it will start and work as long as you are using hadoop commands, but none of the nutch commands actually work. As if the DFS was working but not the mapred part of Hadoop.

--
-MilleBii-
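For what it's worth, the launch scripts of that era generally leave HADOOP_HOME in Cygwin form for the shell and only convert the values that are handed to the JVM. A rough sketch of that pattern (the variable names are illustrative, not copied from any particular release):

  # Detect Cygwin and convert only the JVM-facing values to Windows form
  cygwin=false
  case "`uname`" in
    CYGWIN*) cygwin=true ;;
  esac

  if $cygwin; then
    CLASSPATH=`cygpath -p -w "$CLASSPATH"`        # path list -> Windows form
    HADOOP_LOG_DIR=`cygpath -w "$HADOOP_LOG_DIR"` # single path -> Windows form
  fi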
Re: Nutch and Hadoop not working proper
What I have also discovered:
+ hadoop (script) works with unix-like paths and works fine on Windows
+ nutch (script) works with Windows paths

Could it be that there is some incompatibility because one works with unix-like paths and the other does not?

--
-MilleBii-
Re: Nutch and Hadoop not working proper
MilleBii wrote:

> What I have also discovered:
> + hadoop (script) works with unix-like paths and works fine on Windows
> + nutch (script) works with Windows paths

bin/nutch works with Windows paths? I think this could happen only by accident - both scripts work with Cygwin paths. On the other hand, arguments passed to the JVM must be regular Windows paths.

> Could it be that there is some incompatibility because one works with
> unix-like paths and the other does not?

Both scripts work fine for me on Windows XP + Cygwin, without any special settings - I suspect there is something strange in your environment or config...

Please note that the Hadoop and Nutch scripts are regular shell scripts, so they are aware of Cygwin path conventions; in fact they don't accept un-escaped Windows paths as arguments (i.e. you need to use forward slashes, or you need to put double quotes around a Windows path).

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Re: Nutch and Hadoop not working proper
HLPPP !!! Stuck for 3 days, not able to start any nutch job.

hdfs works fine, i.e. I can put and look at files. When I start a nutch crawl, I get the following error:

  Job initialization failed: java.lang.IllegalArgumentException: Pathname
  /d:/Bii/nutch/logs/history/user/_logs/history/localhost_1245788245191_job_200906232217_0001_pc-%5C_inject+urls

It is looking for the file at a wrong location. Indeed, in my case the correct location is /d:/Bii/nutch/logs/history, so why is *history/user/_logs* added and how can I fix that?

--
-MilleBii-
Nutch and Hadoop not working proper
I have newly installed Hadoop in a distributed single-node configuration. When I run nutch commands it is looking for files in my user home directory and not in the nutch directory. How can I change this?

--
-MilleBii-
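For context, path arguments given without a scheme are resolved against the default filesystem's working directory, which on HDFS is typically /user/<username>; that is why a bare directory name ends up under the home directory rather than under the local nutch install. One way to see where a relative name lands (the directory names below are just examples):

  # With HDFS as the default filesystem, a bare name like "urls" resolves
  # under the HDFS home directory, not under the local nutch directory
  bin/hadoop fs -ls /user/$USER
  bin/hadoop fs -ls urls          # same listing as /user/$USER/urls, if it exists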
Re: Nutch and Hadoop not working proper
Looks like I just needed to transfer from the local filesystem to HDFS. Is it safe to transfer a crawl directory (and its subdirectories) from the local filesystem to HDFS and start crawling again?

1. hadoop fs -put crawl crawl
2. nutch generate crawl/crawldb crawl/segments -topN 500 (where now it should use HDFS)

--
-MilleBii-
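As a sanity check after such a copy, listing the copied tree on HDFS before running generate shows whether the expected crawldb and segments layout made it across (the directory names are taken from the commands above):

  # Verify the copied crawl structure on HDFS before generating a new segment
  bin/hadoop fs -ls crawl
  bin/hadoop fs -ls crawl/crawldb
  bin/hadoop fs -lsr crawl/segments    # recursive listing of existing segments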