Re: Nutch and Hadoop not working proper
2009/6/24 Andrzej Bialecki a...@getopt.org

> MilleBii wrote:
>> What I have also discovered:
>> + hadoop (script) works with unix-like paths and works fine on Windows
>> + nutch (script) works with Windows paths
>
> bin/nutch works with Windows paths? I think this could happen only by
> accident - both scripts work with Cygwin paths. On the other hand,
> arguments passed to the JVM must be regular Windows paths.

That's what I meant: all the paths on the JVM call are Windows paths... Actually, this is a question: should the paths in nutch-site.xml or hadoop-site.xml be Unix-like or Windows-like?

>> Could it be that there is some incompatibility because one works with
>> unix-like paths and the other does not?
>
> Both scripts work fine for me on Windows XP + Cygwin, without any special
> settings - I suspect there is something strange in your environment or
> config... Please note that the Hadoop and Nutch scripts are regular shell
> scripts, so they are aware of Cygwin path conventions; in fact they don't
> accept un-escaped Windows paths as arguments (i.e. you need to use forward
> slashes, or you need to put double quotes around a Windows path).

Clear, but in a way I don't use any path... since I'm only using relative paths (at least I think so). The test command that I use is very simple:

  nutch crawl urls -dir hcrawl -depth 2

Both the urls and hcrawl directories do exist in the HDFS filesystem, yet I get a job failed error, and when I look at where the problem was, I get this strange path problem.

> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web, Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

--
-MilleBii-
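For anyone trying to see which path form each side actually receives, Cygwin ships a cygpath utility that converts between the two conventions; a minimal check, using the D:/Bii/nutch directory from this thread purely as an example:

  # Cygwin form -> Windows form (what the JVM needs to see)
  cygpath -w /cygdrive/d/Bii/nutch      # prints D:\Bii\nutch
  # Windows form -> Cygwin form (what the shell scripts work with)
  cygpath -u 'D:\Bii\nutch'             # prints /cygdrive/d/Bii/nutch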
Re: Nutch and Hadoop not working proper
Did another test and got this error:

  2009-06-25 21:19:44,663 ERROR mapred.EagerTaskInitializationListener - Job initialization failed:
  java.lang.IllegalArgumentException: Pathname
  /d:/Bii/nutch/logs/history/user/_logs/history/localhost_1245956549829_job_200906252102_0001_pc-xxx%xxx_inject+urls
  from d:/Bii/nutch/logs/history/user/_logs/history/localhost_1245956549829_job_200906252102_0001_pc-xxx%5Cxxx_inject+urls
  *is not a valid DFS filename*

Some remarks which may help someone give a hint:
1. The log files are not in the DFS but in the local filesystem, so why is it looking in the DFS for the logs?
2. Of course a Windows path... does not fit in DFS.
3. Even in the local filesystem it is the wrong path; it should be /d:/Bii/nutch/logs/history/localhost

--
-MilleBii-
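One place a mixed-up pathname like that can come from is the job-history location in the Hadoop configuration. A quick way to see what the running configuration contains (the conf/ layout assumed below is the standard 0.19/0.20-era install, so adjust the filenames to your setup):

  # Look for the job-history and log-dir settings that feed into that pathname
  grep -n "HADOOP_LOG_DIR" "$HADOOP_HOME/conf/hadoop-env.sh"
  grep -n "history" "$HADOOP_HOME/conf/hadoop-default.xml" "$HADOOP_HOME/conf/hadoop-site.xml"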
Re: Nutch and Hadoop not working proper
MilleBii wrote:

> HLPPP !!! Stuck for 3 days, not able to start any nutch job.
>
> hdfs works fine, i.e. I can put and look at files. When I start a nutch crawl, I get the following error:
>
>   Job initialization failed: java.lang.IllegalArgumentException: Pathname
>   /d:/Bii/nutch/logs/history/user/_logs/history/localhost_1245788245191_job_200906232217_0001_pc-%5C_inject+urls
>
> It is looking for the file at a wrong location. Indeed, in my case the correct location is
> /d:/Bii/nutch/logs/history, so why is *history/user/_logs* added and how can I fix that?

I suspect your hadoop-site.xml uses a relative path somewhere, and not an absolute path (with a leading slash). Also, /d: looks suspiciously like a Windows pathname, in which case you should either use a full URI (file:///d:/) or just the disk name d:/ without the leading slash.

Please also note that if you are running this on Windows under Cygwin, then in your config files you MUST NOT use the Cygwin paths (like /cygdrive/d/...) because Java can't see them.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
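A concrete way to test whether a given path form will resolve from Java is to point a plain filesystem listing at it; the directory below is just the one from this thread, so substitute your own:

  # A file:/// URI that Java can resolve should list cleanly;
  # a /cygdrive/... form would not, since the JVM knows nothing about Cygwin mounts.
  bin/hadoop fs -ls file:///d:/Bii/nutch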
Re: Nutch and Hadoop not working proper
Yes, I'm using both relative paths and Cygwin under Windows, so the /d: is not introduced by me, but by either nutch or hadoop.

Regarding the Cygwin paths you are right... that is actually where I lost quite some time.

OK, I will try absolute paths and let you know.

--
-MilleBii-
Re: Nutch and Hadoop not working proper
Actually I tried it and it fails, but this is what I found: bin/hadoop-config.sh does the conversion from relative to absolute path:

  this=$0
  while [ -h $this ]; do
    ls=`ls -ld $this`
    link=`expr $ls : '.*-> \(.*\)$'`
    if expr $link : '.*/.*' > /dev/null; then
      this=$link
    else
      this=`dirname $this`/$link
    fi
  done

  # convert relative path to absolute path
  bin=`dirname $this`
  script=`basename $this`
  bin=`cd $bin; pwd`
  this=$bin/$script

  # the root of the Hadoop installation
  export HADOOP_HOME=`dirname $this`/..

Now if you echo it out, the resulting path is a full Cygwin path, i.e. /cygdrive/d/...

I tried to change the export into an absolute path file:///d:/..., but that does not work and hadoop does not even start. Whereas in my case it will start and work as long as you are using hadoop commands, but none of the nutch commands actually work. As if the DFS was working but not the mapred part of Hadoop.

--
-MilleBii-
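For what it's worth, the launch scripts of that era generally leave HADOOP_HOME in Cygwin form for the shell and only convert the values that are handed to the JVM. A rough sketch of that pattern (the variable names are illustrative, not copied from any particular release):

  # Detect Cygwin and convert only the JVM-facing values to Windows form
  cygwin=false
  case "`uname`" in
    CYGWIN*) cygwin=true ;;
  esac

  if $cygwin; then
    CLASSPATH=`cygpath -p -w "$CLASSPATH"`        # path list -> Windows form
    HADOOP_LOG_DIR=`cygpath -w "$HADOOP_LOG_DIR"` # single path -> Windows form
  fi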
Re: Nutch and Hadoop not working proper
What I have also discovered:
+ hadoop (script) works with unix-like paths and works fine on Windows
+ nutch (script) works with Windows paths

Could it be that there is some incompatibility because one works with unix-like paths and the other does not?

--
-MilleBii-
Re: Nutch and Hadoop not working proper
MilleBii wrote:

> What I have also discovered:
> + hadoop (script) works with unix-like paths and works fine on Windows
> + nutch (script) works with Windows paths

bin/nutch works with Windows paths? I think this could happen only by accident - both scripts work with Cygwin paths. On the other hand, arguments passed to the JVM must be regular Windows paths.

> Could it be that there is some incompatibility because one works with
> unix-like paths and the other does not?

Both scripts work fine for me on Windows XP + Cygwin, without any special settings - I suspect there is something strange in your environment or config...

Please note that the Hadoop and Nutch scripts are regular shell scripts, so they are aware of Cygwin path conventions; in fact they don't accept un-escaped Windows paths as arguments (i.e. you need to use forward slashes, or you need to put double quotes around a Windows path).

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Re: Nutch and Hadoop not working proper
HLPPP !!! Stuck for 3 days, not able to start any nutch job.

hdfs works fine, i.e. I can put and look at files. When I start a nutch crawl, I get the following error:

  Job initialization failed: java.lang.IllegalArgumentException: Pathname
  /d:/Bii/nutch/logs/history/user/_logs/history/localhost_1245788245191_job_200906232217_0001_pc-%5C_inject+urls

It is looking for the file at a wrong location. Indeed, in my case the correct location is /d:/Bii/nutch/logs/history, so why is *history/user/_logs* added and how can I fix that?

--
-MilleBii-
Nutch and Hadoop not working proper
I have newly installed Hadoop in a distributed single-node configuration. When I run nutch commands it is looking for files in my user home directory and not in the nutch directory. How can I change this?

--
-MilleBii-
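For context, path arguments given without a scheme are resolved against the default filesystem's working directory, which on HDFS is typically /user/<username>; that is why a bare directory name ends up under the home directory rather than under the local nutch install. One way to see where a relative name lands (the directory names below are just examples):

  # With HDFS as the default filesystem, a bare name like "urls" resolves
  # under the HDFS home directory, not under the local nutch directory
  bin/hadoop fs -ls /user/$USER
  bin/hadoop fs -ls urls          # same listing as /user/$USER/urls, if it exists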
Re: Nutch and Hadoop not working proper
Looks like I just needed to transfer from the local filesystem to HDFS. Is it safe to transfer a crawl directory (and its subdirectories) from the local filesystem to HDFS and start crawling again?

1. hadoop fs -put crawl crawl
2. nutch generate crawl/crawldb crawl/segments -topN 500 (where now it should use HDFS)

--
-MilleBii-
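As a sanity check after such a copy, listing the copied tree on HDFS before running generate shows whether the expected crawldb and segments layout made it across (the directory names are taken from the commands above):

  # Verify the copied crawl structure on HDFS before generating a new segment
  bin/hadoop fs -ls crawl
  bin/hadoop fs -ls crawl/crawldb
  bin/hadoop fs -lsr crawl/segments    # recursive listing of existing segments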