Re: recrawl.sh stopped at depth 7/10 without error
I still want to know the reason.

2009/12/2 BELLINI ADAM mbel...@msn.com

hi, any idea guys? thanks

From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: RE: recrawl.sh stopped at depth 7/10 without error
Date: Fri, 27 Nov 2009 20:11:12 +

hi, this is the main loop of my recrawl.sh:

  do
    echo --- Beginning crawl at depth `expr $i + 1` of $depth ---
    $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN \
        -adddays $adddays
    if [ $? -ne 0 ]
    then
      echo runbot: Stopping at depth $depth. No more URLs to fetch.
      break
    fi
    segment=`ls -d $crawl/segments/* | tail -1`
    $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
    if [ $? -ne 0 ]
    then
      echo runbot: fetch $segment at depth `expr $i + 1` failed.
      echo runbot: Deleting segment $segment.
      rm $RMARGS $segment
      continue
    fi
    $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
  done
  echo - Merge Segments (Step 3 of $steps) -

In my log file I never find the message "- Merge Segments (Step 3 of $steps) -", so something breaks the loop and stops the process. I don't understand why it stops at depth 7 without any errors!

From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: recrawl.sh stopped at depth 7/10 without error
Date: Wed, 25 Nov 2009 15:43:33 +

hi, I'm running recrawl.sh and it stops every time at depth 7/10 without any error! But when I run bin/crawl with the same crawl-urlfilter and the same seeds file, it finishes smoothly in 1h50. I checked hadoop.log and don't find any error there; I just find the last URL it was parsing. Do fetching or crawling have a timeout? My recrawl takes 2 hours before it stops. I set the fetch interval to 24 hours and I'm running generate with adddays = 1.

best regards
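Note that the loop above exits whenever `nutch generate` returns a non-zero status and prints the same "No more URLs to fetch" message regardless of the actual cause. A minimal sketch for debugging this (the variable names NUTCH_HOME, crawl, topN, adddays are placeholders taken from the script above, not a working configuration): record the exact exit code of generate, so a normal "nothing is due for fetching within the fetch interval" stop can be told apart from a real failure.

```shell
#!/bin/sh
# Sketch: log the precise exit code of "nutch generate" instead of
# discarding it. With a 24h fetch interval and adddays=1, a clean stop
# mid-crawl is expected once no URLs are due; the exit code confirms it.
depth=10
i=0
while [ $i -lt $depth ]
do
  "$NUTCH_HOME/bin/nutch" generate "$crawl/crawldb" "$crawl/segments" \
      $topN -adddays $adddays
  rc=$?
  if [ $rc -ne 0 ]
  then
    # Per the script's own assumption, a non-zero code here means
    # "no more URLs to fetch" rather than a crash -- but now the
    # actual code is preserved in the log for inspection.
    echo "generate exited with code $rc at depth `expr $i + 1`" >> generate.log
    break
  fi
  i=`expr $i + 1`
done
```

Checking generate.log after a stopped run shows whether the loop ended on its "no URLs" path or on an unexpected code.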
Re: How to successfully crawl and index office 2007 documents in Nutch 1.0
.docx can be parsed, but a plugin is needed to parse .docx files. You can get some guidance from the parse-html plugin and the other parse plugins.

2009/12/4 Rupesh Mankar rupesh_man...@persistent.co.in

Hi, I am new to Nutch. I want to crawl and search Office 2007 documents (.docx, .pptx etc.) with Nutch, but when I try to crawl, the crawler throws the following error:

  fetching http://10.88.45.140:8081/tutorial/Office-2007-document.docx
  Error parsing: http://10.88.45.140:8081/tutorial/Office-2007-document.docx:
  org.apache.nutch.parse.ParseException: parser not found for contentType=application/zip url=http://10.88.45.140:8081/tutorial/Office-2007-document.docx
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)

When I add the zip plugin in nutch-site.xml under plugin.includes, crawling succeeds but nothing is searchable. How can we successfully crawl and search the contents of Office 2007 documents?

Thanks, Rupesh
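For reference, the plugin.includes property in conf/nutch-site.xml controls which plugins Nutch activates; the value below is only an illustrative sketch (the exact plugin list is an assumption, not a recommended configuration). As the error above shows, the .docx file is served as contentType=application/zip, so adding parse-zip makes fetching succeed, but the zip parser does not understand the OOXML structure inside the archive, which is why nothing useful ends up in the index.

```xml
<!-- nutch-site.xml sketch (example value only): adding "zip" to the
     parse-(...) group stops the ParseException, because application/zip
     now has a parser -- but the extracted text is not the document body. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|zip)|index-basic|query-(basic|site|url)</value>
</property>
```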
Nutch 1.0 wml plugin
I have completed a plugin for parsing WML (Wireless Markup Language). I would like to contribute it to Lucene; what should I do?
add parse-wml plugin to Nutch!
hi, I need to add a parse-wml plugin to Nutch. If one has already been finished, please give me some advice. Thanks!
Re: nutch 1.0 Question
You should use a JDK, not a JRE, and please change the JDK version to 1.6.

2009/8/29 磊 stone54321...@mac.com

Dears, I came across a problem when I use Eclipse to import Nutch 1.0. The problematic source file is DistributedSegmentBean.java, at this line:

  RPC.getProxy(RPCSegmentBean.class, FetchedSegments.VERSION, addr, conf);

I cannot compile this Java file with Eclipse. Moreover, when I use ant to rebuild Nutch 1.0, ant also throws an error: ${conf.dir} not found. I am a new Nutch user; please help me solve these two problems. Thank you very much.

Rai Kan
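The JDK-versus-JRE advice above can be checked from the command line: a JRE ships java but not javac, and both ant and Eclipse need the compiler from a full JDK. A minimal sketch, assuming only a POSIX shell:

```shell
#!/bin/sh
# A JRE provides "java" but no "javac"; the ant build of Nutch needs the
# compiler, which only a JDK provides. This reports which case applies.
if command -v javac >/dev/null 2>&1
then
  msg="JDK found: `javac -version 2>&1`"
else
  msg="javac not found: install a JDK and point JAVA_HOME at it"
fi
echo "$msg"
```

If javac is missing, installing a 1.6 JDK and setting JAVA_HOME to it (rather than to the JRE directory) is the usual fix for both the Eclipse compile error and the ant build failure.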