Re: Please, unsubscribe me
2009/10/29 Le Manh Cuong cuong...@gmail.com
> Me too. Could you please help to remove me (cuong09m@gmail.com) from the Nutch and Hadoop mailing lists?

-----Original Message-----
From: caoyuzhong [mailto:caoyuzh...@hotmail.com]
Sent: Thursday, October 29, 2009 9:49 AM
To: nutch-user@lucene.apache.org
Subject: RE: Please, unsubscribe me

The unsubscribe message does not work for me either. Could you please help to remove me (caoyuzh...@hotmail.com) from the Nutch and Hadoop mailing lists?

Subject: Please, unsubscribe me
From: nsa...@officinedigitali.it
To: nutch-user@lucene.apache.org
Date: Wed, 28 Oct 2009 16:43:05 +0100

Hi, the unsubscribe message doesn't work. Please remove me from the list. Thanks.
Re: how to crawl a page but not index it
1. Create the work dir "test" first.
2. Inject the URLs: ../bin/nutch inject test -urlfile urls
3. Create a fetchlist: ../bin/nutch generate test test/segments
4. Fetch the newest segment:
   s1=`ls -d test/segments/2* | tail -1`
   echo $s1
   ../bin/nutch fetch test/segments/20090628160619
5. Update the crawldb: ../bin/nutch updatedb test test/segments/20090628160619

Loop steps 3-5; writing a bash script to run this is best (a sketch follows below). Next time please use a Google search first.

2009/7/13 Beats tarun_agrawal...@yahoo.com
> Can anyone help me on this? I am using Solr to index the Nutch docs, so I think the prune tool will not work. I do not want to index the documents taken from a particular set of sites.
> With regards, Beats

--
View this message in context: http://www.nabble.com/how-to-crawl-a-page-but-not-index-it-tp24437901p24459435.html
Sent from the Nutch - User mailing list archive at Nabble.com.
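A minimal bash sketch of the generate/fetch/updatedb loop above, assuming the same "test" crawl directory and "urls" seed file; the DEPTH value is an assumption, and the exact command syntax (e.g. the -urlfile option) follows the commands quoted above and may differ between Nutch versions:

#!/bin/bash
# Sketch only: repeat the generate -> fetch -> updatedb cycle DEPTH times.
# "test", "urls" and DEPTH are assumed names/values, not taken from the thread.
DEPTH=4

../bin/nutch inject test -urlfile urls

for i in `seq 1 $DEPTH`; do
  # Generate a new fetchlist into test/segments
  ../bin/nutch generate test test/segments
  # Pick the newest segment directory that was just created
  s1=`ls -d test/segments/2* | tail -1`
  echo "fetching $s1"
  ../bin/nutch fetch $s1
  # Fold the fetched segment back into the crawldb
  ../bin/nutch updatedb test $s1
done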
Re: how to crawl a page but not index it
PS: these command lines are for Nutch 0.8; they change a little in Nutch 1.0, but they are similar.

2009/7/13 SunGod sun...@cheemer.org
> [quotes steps 1-5 of the generate/fetch/updatedb procedure from the previous message]
Re: Job failed help
If you run Nutch on Hadoop, please add this property to your hadoop-site.xml:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/youtempfs/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

2009/7/13 Jake Jacobson jakecjacob...@gmail.com
> Hi,
> I have tried to run Nutch 1.0 several times and it fails due to lack of disk space. I have defined the crawl to place all files on a disk that has plenty of space, but when it starts building the linkdb it wants to put temp files in the home dir, which doesn't have enough space. How can I force Nutch not to do this?
>
> Jake Jacobson
> http://www.linkedin.com/in/jakejacobson
> http://www.facebook.com/jakecjacobson
> http://twitter.com/jakejacobson
>
> "Our greatest fear should not be of failure, but of succeeding at something that doesn't really matter." -- ANONYMOUS
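For context, a minimal hadoop-site.xml sketch with that property in place; /data/hadoop-tmp is only an example path and should point at a filesystem with plenty of free space:

<?xml version="1.0"?>
<configuration>
  <!-- Example path only: pick a location on a disk with enough free space. -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop-tmp/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>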
Re: Favorite Linux Distribution for Nutch
CentOS.

2009/7/4 schroedi schroedi2...@gmail.com
> What is your favorite Linux distribution? Debian, Ubuntu, or Gentoo? IMHO: I'm checking Gentoo out.
>
> --
> Mario Schröder | http://www.finanz-checks.de
Fwd: cluster crawldb error
Next error message:

2009-06-28 18:59:24,035 WARN mapred.TaskTracker - Error running child
java.lang.OutOfMemoryError: Java heap space

The last fetch was 9 pages. The mapred child max memory is 1500M, with 3 data nodes and 3 reduce nodes. Does the mapred child need more memory?

---------- Forwarded message ----------
From: SunGod sun...@cheemer.org
Date: 2009/6/28
Subject: cluster crawldb error
To: nutch-user@lucene.apache.org

At the reduce phase I get this error message:

java.io.IOException: Task process exit with nonzero status of 65.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425)
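If the child JVM really is short on heap, the usual knob is mapred.child.java.opts in hadoop-site.xml. A minimal sketch, where -Xmx1500m only mirrors the 1500M figure mentioned above and should be sized to what each node can actually spare:

<property>
  <name>mapred.child.java.opts</name>
  <!-- Heap for each map/reduce child JVM; 1500m mirrors the figure above, adjust to your nodes. -->
  <value>-Xmx1500m</value>
</property>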
How to run Nutch on a 2 GB memory tasknode
The error occurred in the crawldb TestDB/crawldb reduce phase; I get this error message:

java.lang.OutOfMemoryError: Java heap space

My command:

bin/nutch crawl url -dir TestDB -depth 4 -threads 3

A single fetchlist is around 20 URLs. My memory settings:

hadoop-env.sh:

export HADOOP_HEAPSIZE=800

hadoop-site.xml:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
</property>
<property>
  <name>mapred.map.max.attempts</name>
  <value>4</value>
</property>
<property>
  <name>mapred.reduce.max.attempts</name>
  <value>4</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx250m</value>
</property>
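For a rough sense of the memory budget in these settings: HADOOP_HEAPSIZE=800 gives each Hadoop daemon an 800 MB heap, and 4 map + 4 reduce child slots at -Xmx250m can add another 8 x 250 MB = 2000 MB, so a 2 GB task node is easily over-committed while each child still has a fairly small heap for the crawldb reduce. A minimal sketch of one way to stay inside 2 GB, with all values assumed rather than taken from the thread, is to trade slots for per-child heap:

hadoop-env.sh (assumed value, sized for a 2 GB node):

export HADOOP_HEAPSIZE=400

hadoop-site.xml (sketch: fewer concurrent children, each with a larger heap):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx500m</value>
</property>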