Re: Please, unsubscribe me

2009-10-28 Thread SunGod

2009/10/29 Le Manh Cuong cuong...@gmail.com

 Me too. Could you please help remove me (cuong09m @gmail.com) from the
 Nutch and Hadoop mailing lists?

 -Original Message-
 From: caoyuzhong [mailto:caoyuzh...@hotmail.com]
 Sent: Thursday, October 29, 2009 9:49 AM
 To: nutch-user@lucene.apache.org
  Subject: RE: Please, unsubscribe me


 The unsubscribe message does not work for me either.
 Could you please help remove me (caoyuzh...@hotmail.com) from the Nutch
 and Hadoop mailing lists?

  Subject: Please, unsubscribe me
  From: nsa...@officinedigitali.it
  To: nutch-user@lucene.apache.org
  Date: Wed, 28 Oct 2009 16:43:05 +0100
 
  Hi,
  the unsubscribe message doesn't work. Please remove me from the
  list.
 
  Thanks.
 
 

 _
 全新 Windows 7:寻找最适合您的 PC。了解详情。
 http://www.microsoft.com/china/windows/buy/




Re: how to crawl a page but not index it

2009-07-13 Thread SunGod
1. Create the work dir "test" first.

2. Inject the URLs:
../bin/nutch inject test -urlfile urls

3. Create a fetchlist:
../bin/nutch generate test test/segments

4. Fetch the URLs (pick up the newest segment):
s1=`ls -d test/segments/2* | tail -1`
echo $s1
../bin/nutch fetch $s1

5. Update the crawldb:
../bin/nutch updatedb test $s1

Loop steps 3-5; writing a bash script to run them is best! (See the sketch below.)
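A minimal sketch of such a script, assuming the "test" work dir and the command forms above (the iteration count of 3 is an arbitrary choice):

#!/bin/bash
# repeat the generate / fetch / updatedb cycle a few times
for i in 1 2 3; do
  # step 3: create a fetchlist from the current crawldb
  ../bin/nutch generate test test/segments
  # step 4: fetch the newest segment
  s1=`ls -d test/segments/2* | tail -1`
  echo "fetching $s1"
  ../bin/nutch fetch $s1
  # step 5: fold the fetch results back into the crawldb
  ../bin/nutch updatedb test $s1
done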

Next time, please use Google search first.

2009/7/13 Beats tarun_agrawal...@yahoo.com


 Can anyone help me with this?

 I am using Solr to index the Nutch docs,
 so I think the prune tool will not work.

 I do not want to index documents taken from a particular set of sites.

 With regards, Beats
 --
 View this message in context:
 http://www.nabble.com/how-to-crawl-a-page-but-not-index-it-tp24437901p24459435.html
  Sent from the Nutch - User mailing list archive at Nabble.com.




Re: how to crawl a page but not index it

2009-07-13 Thread SunGod
PS:
these command lines are like Nutch 0.8.

Nutch 1.0 changed them, but they are similar; a sketch of the 1.0 forms is below.
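A rough sketch of the equivalent Nutch 1.0 invocations, assuming a crawldb kept at test/crawldb (that layout is an assumption for illustration, not taken from this thread):

# inject seed URLs from the "urls" dir into the crawldb
../bin/nutch inject test/crawldb urls

# generate a fetchlist into a new segment
../bin/nutch generate test/crawldb test/segments

# fetch the newest segment, then update the crawldb from it
s1=`ls -d test/segments/2* | tail -1`
../bin/nutch fetch $s1
../bin/nutch updatedb test/crawldb $s1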




Re: Job failed help

2009-07-13 Thread SunGod
If you use Hadoop to run Nutch, please add

<property>
  <name>hadoop.tmp.dir</name>
  <value>/youtempfs/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

to your hadoop-site.xml.
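
A quick way to confirm that the filesystem behind that path really has room (/youtempfs is the placeholder path from the snippet above):

# show free space where Hadoop's temp files will land
df -h /youtempfs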

2009/7/13 Jake Jacobson jakecjacob...@gmail.com

 Hi,

 I have tried to run Nutch 1.0 several times, and it fails due to lack
 of disk space.  I have defined the crawl to place all files on a disk
 that has plenty of space, but when it starts building the linkdb it
 wants to put temp files in the home dir, which doesn't have enough
 space.  How can I force Nutch not to do this?

 Jake Jacobson

 http://www.linkedin.com/in/jakejacobson
 http://www.facebook.com/jakecjacobson
 http://twitter.com/jakejacobson

 Our greatest fear should not be of failure,
 but of succeeding at something that doesn't really matter.
   -- ANONYMOUS



Re: Favorite Linux Distribution for Nutch

2009-07-04 Thread SunGod
CentOS

2009/7/4 schroedi schroedi2...@gmail.com

 What is your favorite Linux distro?

 Debian, Ubuntu, or Gentoo?

 IMHO: I'm checking Gentoo out.

 --

 Mario Schröder | http://www.finanz-checks.de




Fwd: cluster crawldb error

2009-06-28 Thread SunGod
The next error msg:

2009-06-28 18:59:24,035 WARN  mapred.TaskTracker - Error running child
java.lang.OutOfMemoryError: Java heap space

The last fetch got 9 pages.

mapred child max memory = 1500M

3 data nodes, 3 reduce nodes

Does the mapred child need more memory?



-- Forwarded message --
From: SunGod sun...@cheemer.org
Date: 2009/6/28
Subject: cluster crawldb error
To: nutch-user@lucene.apache.org


At the reduce phase, the error msg is:

java.io.IOException: Task process exit with nonzero status of 65. at
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425)


How to run Nutch on a 2G-memory task node

2009-06-24 Thread SunGod
An error occurred in the crawldb TestDB/crawldb reduce phase.

I get the error msg: java.lang.OutOfMemoryError: Java heap space

My command:
bin/nutch crawl url -dir TestDB -depth 4 -threads 3

A single fetchlist is around 20.

My memory settings:

hadoop-env.sh
export HADOOP_HEAPSIZE=800

hadoop-site.xml
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
</property>
<property>
  <name>mapred.map.max.attempts</name>
  <value>4</value>
</property>
<property>
  <name>mapred.reduce.max.attempts</name>
  <value>4</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx250m</value>
</property>
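
For what it's worth, these settings allow up to 8 task JVMs at once (4 map + 4 reduce) at -Xmx250m each, on top of the 800M daemon heaps, which oversubscribes a 2G node while leaving each reduce task little heap. A sketch of one way to rebalance within 2G is to run fewer concurrent tasks, each with a larger heap (the numbers below are illustrative assumptions, not tested values):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>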