Re: Deleting indexes

2009-07-14 Thread Beats
Doğacan Güney-3 wrote: On Mon, Jul 13, 2009 at 10:10, Beats tarun_agrawal...@yahoo.com wrote: Hi, I want to delete index entries whose url field contains a certain character. Are you using Solr or Lucene for your indexes? I'm using Solr. Any help would be appreciated. Thanks in advance.
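For the Solr case asked about here, one possible route is Solr's XML delete-by-query update message. The following is a minimal Python sketch, not a tested recipe for this index: the helper name build_delete_by_query and the example query are hypothetical, the field name url is taken from the post, and whether a wildcard query like this one matches anything depends on the Solr version and on how the field is analyzed, so try it on a copy of the index first.

```python
from xml.sax.saxutils import escape


def build_delete_by_query(query: str) -> str:
    """Return a Solr XML update message that deletes every document matching `query`."""
    # Escape &, < and > so the query text cannot break the XML envelope.
    return "<delete><query>" + escape(query) + "</query></delete>"


# Hypothetical example: documents whose url field contains "sessionid".
payload = build_delete_by_query("url:*sessionid*")
print(payload)
```

The resulting message would be POSTed to the index's /update handler with a text/xml content type, followed by a separate <commit/> message so that the deletion becomes visible to searches.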

url normalizer

2009-07-14 Thread Neeti Gupta
Hi, I want to use the URL normalizer in my project, as the project requires it. Can anyone guide me on how to get it from Nutch and add it? Thanks, Regards, Neeti -- View this message in context: http://www.nabble.com/url-normalizer-tp24474519p24474519.html Sent from the Nutch - User mailing list

Re: recrawling

2009-07-14 Thread Neeti Gupta
But are there any rules by which we can decide when to crawl a website so that we get its updated contents as soon as possible? Otis Gospodnetic-2 wrote: Neeti, I don't think there is a way to know when a regular web site has been updated. You can issue GET or HEAD requests and look at the
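The quoted reply is cut off, so what follows is not necessarily what it goes on to recommend. One common way to use HEAD requests for this is to compare the ETag and Last-Modified response headers between visits. A minimal Python sketch under that assumption; the function names fetch_validators and probably_changed are hypothetical, and many servers send neither header, in which case the sketch simply reports "changed":

```python
import urllib.request


def fetch_validators(url: str, timeout: float = 10.0) -> dict:
    """Send a HEAD request and return the ETag and Last-Modified headers (None if absent)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return {
            "etag": response.headers.get("ETag"),
            "last_modified": response.headers.get("Last-Modified"),
        }


def probably_changed(previous: dict, current: dict) -> bool:
    """Compare two validator snapshots taken at different times."""
    for key in ("etag", "last_modified"):
        # Use the first validator that is present in both snapshots.
        if previous.get(key) is not None and current.get(key) is not None:
            return previous[key] != current[key]
    # No validator is present on both sides, so nothing can be concluded; refetch.
    return True
```

A recrawl scheduler could store the dictionary returned by fetch_validators after each fetch and only refetch the page when probably_changed returns True on the next visit.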

Re: recrawling

2009-07-14 Thread Sjaiful Bahri
You have to detect changes in the web content. http://zipclue.com --- On Tue, 7/14/09, Neeti Gupta neeti_gupt...@yahoo.com wrote: From: Neeti Gupta neeti_gupt...@yahoo.com Subject: Re: recrawling To: nutch-user@lucene.apache.org Date: Tuesday, July 14, 2009, 6:50 AM But are there any rules

Re: How To Generate the JavaDoc

2009-07-14 Thread Neeti Gupta
Are you using Eclipse or the command line to generate the Javadoc? schroedi wrote: How may I generate the Javadoc in HTML format for all Nutch and Hadoop classes and everything else? -- Mario Schröder | http://www.finanz-checks.de

Ignoring robots.txt

2009-07-14 Thread Beats
Hi all, I am trying to make the Nutch crawler ignore the robots.txt file. I have tried changing fetcher.java and RobotsRulesParser.java, but a NullPointerException is reported. Can anybody give the correct changes that need to be made? With regards, Beats be...@yahoo.com

Re: Just getting started w/tutorial- errors in crawl.log

2009-07-14 Thread Alex McLintock
but I get a number of messages in crawl.log, like: Error parsing: http://lucene.apache.org/skin/getMenu.js: org.apache.nutch.parse.ParseException: parser not found for contentType=application/javascript url=http://lucene.apache.org/skin/getMenu.js        at

Re: Just getting started w/tutorial- errors in crawl.log

2009-07-14 Thread Beats
Hi Jim, I think your error message says that it couldn't find a plugin for parsing a particular content type. Go to parse-plugins.xml in the conf directory. There you will find the plugin IDs defined for the different content types. Add the particular plugin ID in the nutch-site.xml file under plugin.includes

Re: Just getting started w/tutorial- errors in crawl.log

2009-07-14 Thread xiao yang
Hi Jim, I got the second error too. It's because the previous crawl failed abnormally. There should be the following sub-directories in /segments/20090713171413: content, crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text. My solution is deleting the corrupted directory and
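The check described above can be sketched in a few lines of Python. This is a minimal illustration that uses only the six sub-directory names listed in the message; the function name missing_segment_parts is hypothetical, and the demo builds a throwaway directory rather than touching a real segment:

```python
import os
import tempfile

# Sub-directories the message says a complete segment should contain.
EXPECTED_PARTS = ("content", "crawl_fetch", "crawl_generate",
                  "crawl_parse", "parse_data", "parse_text")


def missing_segment_parts(segment_dir: str) -> list:
    """Return the expected sub-directories that are absent from `segment_dir`."""
    return [part for part in EXPECTED_PARTS
            if not os.path.isdir(os.path.join(segment_dir, part))]


# Demo on a temporary directory that imitates a segment left behind by a failed crawl.
with tempfile.TemporaryDirectory() as segment:
    for part in ("content", "crawl_generate"):
        os.mkdir(os.path.join(segment, part))
    demo_missing = missing_segment_parts(segment)
print(demo_missing)
```

An empty list would mean all six sub-directories are present; anything else suggests the segment is incomplete and is a candidate for the clean-up the message goes on to describe.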

job failed for java.io.IOException: Task process exit with nonzero status of 255.

2009-07-14 Thread lei wang
I run Nutch to convert ARC files to segments. It works well for 1 million pages, but when I increase the page count to 500 million, it fails with the error messages below. Can anyone help me? java.io.IOException: Task process exit with nonzero status of 255. at

Re: Nutch Tutorial 1.0 based off of the French Version

2009-07-14 Thread Jake Jacobson
I did attach it. Jake Jacobson http://www.linkedin.com/in/jakejacobson http://www.facebook.com/jakecjacobson http://twitter.com/jakejacobson Our greatest fear should not be of failure, but of succeeding at something that doesn't really matter. -- ANONYMOUS On Mon, Jul 13, 2009 at 9:04 PM,

Re: Nutch Tutorial 1.0 based off of the French Version

2009-07-14 Thread Alex McLintock
2009/7/14 Jake Jacobson jakecjacob...@gmail.com: I did attach it. I am afraid that I can't see anything either. Can you perhaps upload it somewhere and link to it? I'd like to say thank you for your effort. We could do with more tutorials which look at it in different ways. Alex

Re: Nutch Tutorial 1.0 based off of the French Version

2009-07-14 Thread Jake Jacobson
Posted it to my blog, http://jakecjacobson.blogspot.com/2009/07/nutch10installationguide.html Jake Jacobson

A few questions about crawl-urlfilter.txt

2009-07-14 Thread Hrishikesh Agashe
Here are a few questions I had about crawl-urlfilter.txt. - Does Nutch obey crawl-urlfilter.txt properly? By default, it is set to not download CSS, but when I do the crawl, I do see parse.ParseUtil exceptions in my Hadoop.log (org.apache.nutch.parse.ParseException: parser not found
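When reasoning about questions like this one, it can help to model how a rule file of this kind is evaluated. The Python sketch below assumes the usual behavior of Nutch's regex URL filters: each rule line starts with + (accept) or - (reject) followed by a regular expression, rules are tried from top to bottom, the first rule that matches decides, and a URL that matches no rule is rejected. It is an illustration of that assumption, not Nutch's Java implementation, and the rules shown are simplified, hypothetical examples rather than the shipped defaults.

```python
import re


def parse_rules(text: str) -> list:
    """Parse '+regex' / '-regex' lines; blank lines and '#' comments are skipped."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules: list, url: str) -> bool:
    """First matching rule wins; a URL that matches no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


EXAMPLE_RULES = r"""
# skip stylesheets and a few image types (simplified, hypothetical rule)
-\.(css|gif|jpg|png)$
# accept anything under apache.org
+^http://([a-z0-9]*\.)*apache.org/
"""

rules = parse_rules(EXAMPLE_RULES)
print(accepts(rules, "http://lucene.apache.org/skin/page.css"))
```

Because order matters under this model, a broad + rule placed above the - rule for .css would let stylesheet URLs through, which is one thing worth checking in the actual file.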

Re: how to crawl a page but not index it

2009-07-14 Thread Beats
Hi, actually what I want is to crawl a web page, say 'page A', and all its outlinks. I want to index all the content gathered by crawling the outlinks, but not 'page A' itself. Is there any way to do it in a single run? With regards, Beats be...@yahoo.com SunGod wrote: 1. create work dir test

How to crawl page displayed as response to search query in solr

2009-07-14 Thread Beats
Hi everyone, I want to crawl the search results displayed by Solr. The source code of the search page doesn't give the content type, so an Invalid XML error is shown. Does anyone know how to crawl this kind of page? I am using the 'Feed' plugin and also tried parse-rss. Page source code:

Re: Just getting started w/tutorial- errors in crawl.log

2009-07-14 Thread ohaya
Alex (et al), There was/is plenty of space on the drive (3GB). I was trying the command line from the tutorial: bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log I'm re-running it to see what happens. If I get that error again, I'll delete the dirs, as you and xiao yang

Re: A few questions about crawl-urlfilter.txt

2009-07-14 Thread Ken Krugler
Here are a few questions I had about crawl-urlfilter.txt. Some very quick responses - others will know better. - Does Nutch obey crawl-urlfilter.txt properly? By default, it is set to not download CSS, but when I do the crawl, I do see parse.ParseUtil exceptions in my Hadoop.log

Tutorial followup - Nutch webapp not seeing stuff?

2009-07-14 Thread ohaya
Hi, I'm still following the Tutorial, per earlier post, and I think that I've gotten past the earlier errors with the intranet crawl (it's still running), so I wanted to try to get the web app running. I had Tomcat installed and working, so I deployed the nutch WAR file (I put it in

Re: Tutorial followup - Nutch webapp not seeing stuff?

2009-07-14 Thread ohaya
Hi, I noticed that in tomcat/webapps/nutch-1.0/WEB-INF/classes, there was a crawl-urlfilters.txt, which still had MY.DOMAIN in it, so I tried changing the parameter to *apache.org to match what was in the same file in /opt/nutch-1.0. But, even after that, and bouncing Tomcat, I get just the

Re: Tutorial followup - Nutch webapp not seeing stuff?

2009-07-14 Thread ohaya
Hi, I think that there must've been something messed up. I tried running a new crawl: bin/nutch crawl urls -dir crawl3.test -depth 2 >& crawl3.log and I modified the nutch-site.xml file to point to the crawl3.test directory. Then, after I bounce Tomcat, I can search successfully. However, I then

Re: Tutorial followup - Nutch webapp not seeing stuff?

2009-07-14 Thread ohaya
Hi All, I'm getting totally frustrated with this nutch web app :(. I re-installed Nutch 1.0 completely. I created the urls file in /opt/nutch-1.0. I added http.agent.name of test1 and modified http.robots.agent in nutch-default.xml. I modified /opt/nutch-1.0/conf/nutch-site.xml to: <?xml

Re: Tutorial followup - Nutch webapp not seeing stuff?

2009-07-14 Thread Doğacan Güney
On Tue, Jul 14, 2009 at 21:17, oh...@cox.net wrote: Hi All, I'm getting totally frustrated with this nutch web app :(. I re-installed Nutch 1.0 completely. I created the urls file in /opt/nutch-1.0 I added http.agent.name of test1 and modified http.robots.agent in nutch-default.xml. I

Re: Tutorial followup - Nutch webapp not seeing stuff?

2009-07-14 Thread ohaya
Doğacan Güney doga...@gmail.com wrote: On Tue, Jul 14, 2009 at 21:17, oh...@cox.net wrote: Hi All, I'm getting totally frustrated with this nutch web app :(. I re-installed Nutch 1.0 completely. I created the urls file in /opt/nutch-1.0 I added http.agent.name of test1

Re: Search History and Top Searches

2009-07-14 Thread Kenan Azam
Any ideas? On Mon, Jul 13, 2009 at 11:58 AM, Kenan Azam azam.ke...@gmail.com wrote: Hi there, I am using Nutch 0.8.1. Is there a log or mechanism to see what the users are searching for? Something like a Search History or Top Searches. Thanks much.

Re: job failed for java.io.IOException: Task process exit with nonzero status of 255.

2009-07-14 Thread lei wang
Can anyone help me? On Tue, Jul 14, 2009 at 7:05 PM, lei wang nutchmaill...@gmail.com wrote: I run Nutch to convert ARC files to segments. It works well for 1 million pages, but when I increase the page count to 500 million, it fails with the error messages below. Can anyone help me?