Re: recrawl.sh stopped at depth 7/10 without error

2009-12-07 Thread yangfeng
I still want to know the reason.

2009/12/2 BELLINI ADAM mbel...@msn.com


 hi,

 Any ideas, guys?

 Thanks

  From: mbel...@msn.com
  To: nutch-user@lucene.apache.org
  Subject: RE: recrawl.sh stopped at depth 7/10 without error
  Date: Fri, 27 Nov 2009 20:11:12 +
 
 
 
  hi,
 
  This is the main loop of my recrawl.sh:
 
 
  do
 
echo --- Beginning crawl at depth `expr $i + 1` of $depth ---
$NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN \
-adddays $adddays
if [ $? -ne 0 ]
then
  echo runbot: Stopping at depth `expr $i + 1`. No more URLs to fetch.
  break
fi
segment=`ls -d $crawl/segments/* | tail -1`
 
$NUTCH_HOME/bin/nutch fetch $segment -threads $threads
if [ $? -ne 0 ]
then
  echo runbot: fetch $segment at depth `expr $i + 1` failed.
  echo runbot: Deleting segment $segment.
  rm $RMARGS $segment
  continue
fi
 
$NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
 
  done
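One thing worth checking in a loop like this: in Nutch 1.0, `bin/nutch generate` can exit non-zero both on a genuine failure and when no URLs are currently due for fetch (e.g. because of `-adddays` and the fetch interval), and the loop above treats both cases identically. The sketch below shows one way to tell them apart by inspecting the generator's captured output; it assumes the Generator logs a "0 records selected for fetching" message when the frontier is empty, so treat that string as an assumption to verify against your own hadoop.log.

```shell
#!/bin/sh
# Sketch: distinguish "no URLs due for fetch" from a real generate failure
# by inspecting the captured output of "bin/nutch generate".
# ASSUMPTION: the Generator prints "... 0 records selected for fetching ..."
# when nothing is eligible; check the exact wording in your logs.

generate_exhausted() {
  # $1 = captured stdout/stderr of the generate step
  case "$1" in
    *"0 records selected for fetching"*) return 0 ;;  # frontier empty: normal end
    *) return 1 ;;                                     # some other failure
  esac
}

# Hypothetical usage inside the recrawl loop:
#   out=`$NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN \
#        -adddays $adddays 2>&1`
#   if [ $? -ne 0 ]; then
#     if generate_exhausted "$out"; then
#       echo "runbot: no more URLs to fetch -- normal end of crawl"
#     else
#       echo "runbot: generate failed: $out" >&2
#     fi
#     break
#   fi

# Self-contained demo with canned log lines:
generate_exhausted "Generator: 0 records selected for fetching, exiting ..." \
  && echo "frontier exhausted"
generate_exhausted "Exception in thread main java.io.IOException" \
  || echo "real failure"
```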
 
  echo - Merge Segments (Step 3 of $steps) -
 
 
 
  In my log file I never find the message - Merge Segments (Step 3 of
 $steps) -, so something is breaking out of the loop and stopping the process.

  I don't understand why it stops at depth 7 without any errors!
 
 
   From: mbel...@msn.com
   To: nutch-user@lucene.apache.org
   Subject: recrawl.sh stopped at depth 7/10 without error
   Date: Wed, 25 Nov 2009 15:43:33 +
  
  
  
   hi,

   I'm running recrawl.sh and every time it stops at depth 7/10 without
 any error! But when I run bin/crawl with the same crawl-urlfilter and the
 same seeds file, it finishes smoothly in 1h50.

   I checked hadoop.log and don't find any error there... I just find
 the last URL it was parsing.
   Does fetching or crawling have a timeout?
   My recrawl takes 2 hours before it stops. I set the fetch time interval
 to 24 hours and I'm running generate with adddays = 1.

   Best regards
  


Re: How to successfully crawl and index office 2007 documents in Nutch 1.0

2009-12-07 Thread yangfeng
.docx can be parsed; a plugin is needed to parse .docx files. You can get
some guidance from the parse-html plugin and the other parse plugins.
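Some context on why the zip plugin "works" but nothing is searchable: Nutch selects a parser by the detected contentType, and the server here reports the .docx as application/zip, so parse-zip unpacks the archive apparently without extracting searchable text. A hedged nutch-site.xml sketch follows; `parse-docx` is a hypothetical plugin id used purely for illustration (Nutch 1.0 ships no Office 2007 parser, and later releases handle these formats through the Tika-based parser):

```xml
<!-- nutch-site.xml sketch: register a parser for the .docx content type.
     "parse-docx" is a HYPOTHETICAL plugin id for illustration only;
     Nutch 1.0 does not ship one. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|docx)|index-basic|query-(basic|site|url)</value>
</property>
```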

2009/12/4 Rupesh Mankar rupesh_man...@persistent.co.in

 Hi,

 I am new to Nutch. I want to crawl and search office 2007 documents (.docx,
 .pptx etc) from Nutch. But when I try to crawl, crawler throws following
 error:

 fetching http://10.88.45.140:8081/tutorial/Office-2007-document.docx
 Error parsing: http://10.88.45.140:8081/tutorial/Office-2007-document.docx:
 org.apache.nutch.parse.ParseException: parser not found for
 contentType=application/zip url=
 http://10.88.45.140:8081/tutorial/Office-2007-document.docx
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
at
 org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
at
 org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)

 When I add the zip plugin in nutch-site.xml under plugin.includes, crawling
 succeeds but nothing becomes searchable.

 How can we successfully crawl and search contents of office 2007 documents?

 Thanks,
 Rupesh




Nutch 1.0 wml plugin

2009-12-07 Thread yangfeng
I have completed a plugin for parsing WML (Wireless Markup Language). I
hope to contribute it to Lucene; what should I do?


add parse-wml plugin to Nutch!

2009-11-26 Thread yangfeng
hi,
  I want to add a parse-wml plugin to Nutch. If one has already been
written, please give me some advice.

   Thanks!


Re: nutch 1.0 Question

2009-08-29 Thread yangfeng
You should use a JDK, not a JRE, and please change the JDK version to 1.6.
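A quick way to verify the first point: a JDK ships `bin/javac`, while a JRE does not. A minimal sketch, where the `is_jdk` helper is illustrative and not part of Nutch, and the demo uses a throwaway fake directory instead of a real installation:

```shell
#!/bin/sh
# Sketch: check that a Java home directory is a full JDK (has javac),
# not just a JRE, before building Nutch with ant.

is_jdk() {
  # $1 = candidate Java home directory
  [ -x "$1/bin/javac" ]
}

# Demo with a fake layout (real usage would be: is_jdk "$JAVA_HOME"):
fake=$(mktemp -d)
mkdir -p "$fake/bin"
is_jdk "$fake" || echo "JRE-like: no javac"
touch "$fake/bin/javac" && chmod +x "$fake/bin/javac"
is_jdk "$fake" && echo "JDK found"
rm -rf "$fake"
```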

2009/8/29 磊 stone54321...@mac.com

 Dears,

 I came across a problem when using Eclipse to import Nutch 1.0.

 The problem source file is DistributedSegmentBean.java.

 The problem code is RPC.getProxy(RPCSegmentBean.class,
 FetchedSegments.VERSION, addr, conf);

 I cannot compile this java file with eclipse.

 Moreover, when I use ant to rebuild Nutch 1.0, ant also throws an
 error.

 It is: ${conf.dir} not found.

 I am a new user for nutch.

 Please help me to solve the two problems.

 Thank you very much.

 Rai Kan