Re: Nutch and Hadoop not working properly
MilleBii wrote:
> HELP!!! I've been stuck for 3 days, unable to start any Nutch job. HDFS works fine, i.e. I can put and look at files. When I start a Nutch crawl, I get the following error:
>
> Job initialization failed: java.lang.IllegalArgumentException: Pathname /d:/Bii/nutch/logs/history/user/_logs/history/localhost_1245788245191_job_200906232217_0001_pc-%5C_inject+urls
>
> It is looking for the file at the wrong location. In my case the correct location is /d:/Bii/nutch/logs/history, so why is history/user/_logs added, and how can I fix that?
>
> 2009/6/21 MilleBii mille...@gmail.com
>> Looks like I just needed to transfer from the local filesystem to HDFS. Is it safe to transfer a crawl directory (and its subdirectories) from the local file system to HDFS and start crawling again?
>> 1. hadoop fs -put crawl crawl
>> 2. nutch generate crawl/crawldb crawl/segments -topN 500 (where it should now use HDFS)
>> -MilleBii-
>>
>> 2009/6/21 MilleBii mille...@gmail.com
>>> I have newly installed Hadoop in a distributed single-node configuration. When I run Nutch commands it looks for files in my user home directory and not in the Nutch directory. How can I change this?

I suspect your hadoop-site.xml uses a relative path somewhere, and not an absolute path (with a leading slash). Also, /d: looks suspiciously like a Windows pathname, in which case you should either use a full URI (file:///d:/) or just the drive name d:/ without the leading slash.

Please also note that if you are running this on Windows under Cygwin, then in your config files you MUST NOT use Cygwin paths (like /cygdrive/d/...) because Java can't see them.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
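A minimal sketch of the fix Andrzej describes, assuming the relative path comes from a property such as hadoop.tmp.dir in hadoop-site.xml (the property names and paths in your setup may differ); use a plain d:/ path or a file:/// URI, never a /cygdrive/... path:

  <property>
    <name>hadoop.tmp.dir</name>
    <!-- assumed location: absolute Windows path written with forward slashes, no leading slash;
         a file:///d:/... URI is the alternative where a filesystem URI is expected -->
    <value>d:/Bii/nutch/hadoop-tmp</value>
  </property>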
How to run Nutch on a task node with 2 GB of memory
An error occurred in the crawldb TestDB/crawldb reduce phase; I get this error message:

java.lang.OutOfMemoryError: Java heap space

My command:

  bin/nutch crawl url -dir TestDB -depth 4 -threads 3

A single fetchlist is around 20.

My memory settings:

hadoop-env.sh:
  export HADOOP_HEAPSIZE=800

hadoop-site.xml:
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.map.max.attempts</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.reduce.max.attempts</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx250m</value>
  </property>
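With 4 map slots plus 4 reduce slots at -Xmx250m each, the task JVMs alone can reach 8 x 250 MB = 2000 MB, before the Hadoop daemons (each started with HADOOP_HEAPSIZE=800) take their share, so a 2 GB node is oversubscribed. A hedged sketch of one way to rebalance (the values are illustrative starting points, not tuned for this crawl):

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <!-- fewer concurrent tasks, each with more heap: 3 x 512 MB for the children -->
    <value>-Xmx512m</value>
  </property>

Lowering HADOOP_HEAPSIZE in hadoop-env.sh (for example to 400) also leaves more room for the task children on a 2 GB node.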
Re: Nutch and Hadoop not working properly
Yes, I'm using both relative paths and Cygwin under Windows, so /d: is not introduced by me but by either Nutch or Hadoop. Regarding the Cygwin path, you are right... that's actually where I lost quite some time. OK, I will try absolute paths and let you know.

-MilleBii-

2009/6/24 Andrzej Bialecki a...@getopt.org
> I suspect your hadoop-site.xml uses a relative path somewhere, and not an absolute path (with a leading slash). Also, /d: looks suspiciously like a Windows pathname [...]
recrawling
We have made a crawler that visits various sites, and I want the crawler to re-crawl a site as soon as it is updated. Can anyone help me figure out how to know when a site has been updated and it is time to crawl it again?
Re: recrawling
Neeti,

I don't think there is a way to know when a regular web site has been updated. You can issue GET or HEAD requests and look at the Last-Modified date, but this is not 100% reliable. You can fetch and compare content, but that's not 100% reliable either. If you are indexing blogs, then you can get pings when they update, or can rely on detecting changes in their feeds.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Neeti Gupta neeti_gupt...@yahoo.com
To: nutch-user@lucene.apache.org
Sent: Wednesday, June 24, 2009 7:52:47 AM
Subject: recrawling

> We have made a crawler that visits various sites, and I want the crawler to re-crawl a site as soon as it is updated [...]
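A minimal standalone sketch of the HEAD / Last-Modified check Otis mentions (plain java.net, not the Nutch API; the URL is just a placeholder):

  import java.net.HttpURLConnection;
  import java.net.URL;
  import java.util.Date;

  public class LastModifiedCheck {
      public static void main(String[] args) throws Exception {
          URL url = new URL("http://www.example.com/");      // placeholder URL
          HttpURLConnection conn = (HttpURLConnection) url.openConnection();
          conn.setRequestMethod("HEAD");                      // headers only, no body transferred
          long lastModified = conn.getLastModified();         // 0 if the server sent no Last-Modified header
          if (lastModified == 0) {
              System.out.println("No Last-Modified header; fall back to fetching and comparing content");
          } else {
              System.out.println("Last-Modified: " + new Date(lastModified));
          }
          conn.disconnect();
      }
  }

As Otis says, many servers send a missing or wrong Last-Modified, so treat the result as a hint, not a guarantee.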
Re: Nutch and Hadoop not working properly
Actually I tried it and it fails, but this is what I found: bin/hadoop-config.sh does the conversion from a relative to an absolute path:

  this=$0
  while [ -h $this ]; do
    ls=`ls -ld $this`
    link=`expr $ls : '.*-> \(.*\)$'`
    if expr $link : '.*/.*' > /dev/null; then
      this=$link
    else
      this=`dirname $this`/$link
    fi
  done

  # convert relative path to absolute path
  bin=`dirname $this`
  script=`basename $this`
  bin=`cd $bin; pwd`
  this=$bin/$script

  # the root of the Hadoop installation
  export HADOOP_HOME=`dirname $this`/..

Now if you echo it out, the script uses the full Cygwin path, i.e. /cygdrive/d/... I tried to change the export into an absolute path, file:///d:/..., but that does not work and Hadoop does not even start. Whereas in my case it will start and work as long as you are using hadoop commands, but none of the nutch commands actually work. It is as if the DFS was working but not the MapReduce part of Hadoop.

2009/6/24 MilleBii mille...@gmail.com
> Yes, I'm using both relative paths and Cygwin under Windows, so /d: is not introduced by me but by either Nutch or Hadoop. [...]

--
-MilleBii-
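For what it's worth, a Cygwin path can be turned into a Windows path that the JVM understands by running it through cygpath before it is handed to Java; a sketch (the variable names are illustrative, and this assumes cygpath is on the PATH):

  case "`uname`" in
  CYGWIN*) cygwin=true ;;
  *) cygwin=false ;;
  esac

  if $cygwin; then
    # /cygdrive/d/Bii/nutch -> d:\Bii\nutch, a form Java can open
    NUTCH_HOME=`cygpath -w "$NUTCH_HOME"`
    # -p converts a whole colon-separated list, e.g. a classpath
    CLASSPATH=`cygpath -p -w "$CLASSPATH"`
  fi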
Re: Nutch and Hadoop not working properly
What I have also discovered:
+ hadoop (script) works with unix-like paths and works fine on Windows
+ nutch (script) works with Windows paths

Could it be that there is some incompatibility because one works with unix-like paths and the other does not?

2009/6/24 MilleBii mille...@gmail.com
> Actually I tried it and it fails, but this is what I found: bin/hadoop-config.sh does the conversion from a relative to an absolute path. [...]

--
-MilleBii-
Re: Nutch and Hadoop not working properly
MilleBii wrote:
> What I have also discovered:
> + hadoop (script) works with unix-like paths and works fine on Windows
> + nutch (script) works with Windows paths

bin/nutch works with Windows paths? I think this could happen only by accident - both scripts work with Cygwin paths. On the other hand, arguments passed to the JVM must be regular Windows paths.

> Could it be that there is some incompatibility because one works with unix-like paths and the other does not?

Both scripts work fine for me on Windows XP + Cygwin, without any special settings - I suspect there is something strange in your environment or config... Please note that the Hadoop and Nutch scripts are regular shell scripts, so they are aware of Cygwin path conventions; in fact they don't accept un-escaped Windows paths as arguments (i.e. you need to use forward slashes, or you need to put double quotes around a Windows path).

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
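For example (the paths are illustrative), these are the forms of a Windows path that the scripts accept on the command line:

  bin/nutch crawl urls -dir d:/Bii/nutch/crawl -depth 3 -topN 500
  bin/nutch crawl urls -dir "d:\Bii\nutch\crawl" -depth 3 -topN 500

The forward-slash form needs no quoting; the backslash form must be double-quoted (or the backslashes escaped), otherwise the shell strips the backslashes before Java ever sees the path.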