In nutch-default.xml I have the following:
<property>
  <name>db.fetch.retry.max</name>
  <value>3</value>
  <description>The maximum number of times a url that has encountered
  recoverable errors is generated for fetch.</description>
</property>
Yet after letting things run for some time, if I look at the stats I have
the following... Is there some other setting I shoul
ravi chintakunta wrote:
>
> My reply to this feature of searching multiple indexes with a single
> instance of Nutch has bounced because of an attachment.
>
> To search multiple indexes with a single instance of Nutch:
>
> - I modified web.xml to include the paths to various search indexes
> -
Ken Ken wrote:
> /nutch-1.0/conf/regex-urlfilter.txt
>
> Hello,
>
> I just want to fetch/crawl all .com domain names, so what should I put in the
> /nutch-1.0/conf/regex-urlfilter.txt file
>
> e.g.
> +^http://([a-z0-9]*\.)*apache.org/
>
> Correct me if I am wrong. I think the above only crawl/f
On 2010-01-09 10:18, MilleBii wrote:
@Andrzej,
To be more specific, if one uses cached content (which I do), what is the
"minimal" stuff to keep? I guess:
+ crawl_fetch
+ parse_data
+ parse_text
The rest is not used, I guess. Before I start testing, could you confirm?
crawl_fetch you can i
@Andrzej,
To be more specific, if one uses cached content (which I do), what is the
"minimal" stuff to keep? I guess:
+ crawl_fetch
+ parse_data
+ parse_text
The rest is not used, I guess. Before I start testing, could you confirm?
@Ulysse,
The other reason to keep all data is if you will ne
I agree it is misleading at first.
2010/1/9 Kumar Krishnasami
> Thanks, MilleBii. That explains it. All the docs I came across mentioned
> something like "-depth /depth/ indicates the link depth from the root page
> that should be crawled" (from
> http://lucene.apache.org/nutch/tutorial8.htm
Here's how I test regex-urlfilter entries:
$ echo "[url]" | java -cp
./nutch-1.0.jar:./plugins/urlfilter-regex/urlfilter-regex.jar:./plugins/lib-regex-filter/lib-regex-filter.jar:./lib/hadoop-0.19.1-core.jar:./lib/commons-logging-1.0.4.jar:./lib/commons-logging-api-1.0.4.jar:./conf
org.apache.nut
/nutch-1.0/conf/regex-urlfilter.txt
Hello,
I just want to fetch/crawl all .com domain names, so what should I put in the
/nutch-1.0/conf/regex-urlfilter.txt file?
e.g.
+^http://([a-z0-9]*\.)*apache.org/
Correct me if I am wrong. I think the above only crawls/fetches apache.org and
apache.org's s
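
For what it's worth, here is a rough sketch of how the pattern quoted above behaves, and of what a ".com only" pattern might look like. It is written in Python only to make the matching easy to try; Nutch itself evaluates these entries as Java regular expressions, and for patterns this simple the two should agree. The leading "+" is the include marker of regex-urlfilter.txt, not part of the expression. ANY_DOT_COM and accepted() are hypothetical names made up for this illustration, and the .com pattern is an assumption that has not been tested against a real crawl.

```python
import re

# Pattern from the message, without the leading "+" include marker.
APACHE_ONLY = re.compile(r"^http://([a-z0-9]*\.)*apache.org/")

# Hypothetical pattern (an assumption, not from the thread): any http URL
# whose host name ends in ".com", with an optional port number.
ANY_DOT_COM = re.compile(r"^http://([a-z0-9-]+\.)+com(:[0-9]+)?(/|$)")


def accepted(pattern, url):
    """Return True if the pattern is found in the URL.

    The "^" in both patterns anchors the match to the start of the URL.
    """
    return pattern.search(url) is not None


if __name__ == "__main__":
    for url in (
        "http://apache.org/",
        "http://lucene.apache.org/nutch/",
        "http://www.example.com/",
        "http://example.com.evil.org/",
    ):
        print(url, accepted(APACHE_ONLY, url), accepted(ANY_DOT_COM, url))
```

If the sketch holds, the corresponding entry in regex-urlfilter.txt would be the second pattern with a "+" in front of it. Its position relative to the other rules in the file would still need checking, since the first matching rule is the one that decides.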