How come I have so many retries listed in stats?

2010-01-09 Thread Jesse Hires
In nutch-default.xml I have the following:

  db.fetch.retry.max  3
  "The maximum number of times a url that has encountered recoverable
  errors is generated for fetch."

Yet after letting things run for some time, if I look at the stats I have the following... Is there some other setting I shoul
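For reference, this is how that property appears in nutch-default.xml (the description text is the one quoted above; the value shown is the Nutch 1.0 default):

  <property>
    <name>db.fetch.retry.max</name>
    <value>3</value>
    <description>The maximum number of times a url that has encountered
    recoverable errors is generated for fetch.</description>
  </property>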

Re: How to use multiple indexes

2010-01-09 Thread ravi chintakunta
ravi chintakunta wrote:
> My reply to this feature of searching multiple indexes with a single
> instance of Nutch has bounced because of an attachment.
>
> To search multiple indexes with a single instance of Nutch:
>
> - I modified web.xml to include the paths to various search indexes
> -
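The attachment was lost, but as a rough sketch of the same idea: stock Nutch 1.0 reads its index location from the searcher.dir property, and a directory whose indexes/ subdirectory contains several part indexes is searched as one. The paths below are hypothetical, and the web.xml approach described above may differ in its details:

  <!-- nutch-site.xml inside the webapp; paths are hypothetical -->
  <property>
    <name>searcher.dir</name>
    <value>/data/nutch/crawl</value>
    <description>Directory holding index/ or an indexes/ subdirectory
    with multiple part indexes, searched by one Nutch instance.</description>
  </property>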

Re: regex-urlfilter.txt: only crawl .com tld

2010-01-09 Thread reinhard schwab
Ken Ken wrote:
> /nutch-1.0/conf/regex-urlfilter.txt
>
> Hello,
>
> I just want to fetch/crawl all .com domain names, so what should I put in
> the /nutch-1.0/conf/regex-urlfilter.txt file
>
> e.g.
> +^http://([a-z0-9]*\.)*apache.org/
>
> Correct me if I am wrong. I think the above only crawl/f
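A sketch of what the .com restriction could look like in conf/regex-urlfilter.txt (untested; assumes lowercase hostnames and relies on the file's first-match-wins rule order):

  # accept http URLs whose hostname ends in .com
  +^http://([a-z0-9-]+\.)+com(:[0-9]+)?(/|$)
  # skip everything else
  -.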

Re: Purging from Nutch after indexing with Solr

2010-01-09 Thread Andrzej Bialecki
On 2010-01-09 10:18, MilleBii wrote:
> @Andrzej,
> To be more specific: if one uses cached content (which I do), what is the
> "minimal" stuff to keep? I guess:
> + crawl_fetch
> + parse_data
> + parse_text
> the rest is not used, I guess ... before I start testing, could you confirm?

crawl_fetch you can i

Re: Purging from Nutch after indexing with Solr

2010-01-09 Thread MilleBii
@Andrzej,
To be more specific: if one uses cached content (which I do), what is the "minimal" stuff to keep? I guess:
+ crawl_fetch
+ parse_data
+ parse_text
the rest is not used, I guess ... before I start testing, could you confirm?

@Ulysse,
The other reason to keep all data is if you will ne
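A sketch of the purge this implies, assuming the standard segment layout and the keep-list above (the crawl directory path is hypothetical and this is untested; note that the Nutch 1.0 cached-page view is served from the content/ subdirectory, so verify that before deleting it):

  # remove segment subdirectories not on the keep list above
  for seg in crawl/segments/*; do
    rm -rf "$seg/content" "$seg/crawl_generate" "$seg/crawl_parse"
  done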

Re: Crawl specific urls and depth argument

2010-01-09 Thread MilleBii
I agree, it is misleading at first.

2010/1/9 Kumar Krishnasami
> Thanks, MilleBii. That explains it. All the docs I came across mentioned
> something like "-depth /depth/ indicates the link depth from the root page
> that should be crawled" (from
> http://lucene.apache.org/nutch/tutorial8.htm
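In practice, -depth is the number of generate/fetch/update rounds the crawl command runs, not a distance measured from a root page. A typical Nutch 1.0 invocation (directory names are examples):

  $ bin/nutch crawl urls -dir crawl -depth 3 -topN 1000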

Re: regex-urlfilter.txt: only crawl .com tld

2010-01-09 Thread James Todd
here's how i test regex-urlfilter entries:

  $ echo "[url]" | java -cp ./nutch-1.0.jar:./plugins/urlfilter-regex/urlfilter-regex.jar:./plugins/lib-regex-filter/lib-regex-filter.jar:./lib/hadoop-0.19.1-core.jar:./lib/commons-logging-1.0.4.jar:./lib/commons-logging-api-1.0.4.jar:./conf org.apache.nut
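The class name is cut off above. As an assumed alternative, Nutch ships a checker class that reads URLs from stdin and can be run through the bin/nutch wrapper (class name and flag taken from the Nutch 1.0 sources, untested here):

  $ echo "http://www.example.com/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined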

regex-urlfilter.txt: only crawl .com tld

2010-01-09 Thread Ken Ken
/nutch-1.0/conf/regex-urlfilter.txt

Hello,

I just want to fetch/crawl all .com domain names, so what should I put in the /nutch-1.0/conf/regex-urlfilter.txt file

e.g.
+^http://([a-z0-9]*\.)*apache.org/

Correct me if I am wrong. I think the above only crawl/fetch apache.org and apache.org's s