I agree it is misleading at first.
2010/1/9 Kumar Krishnasami kumara...@vembu.com
Thanks, MilleBii. That explains it. All the docs I came across mentioned
something like -depth /depth/ indicates the link depth from the root page
that should be crawled (from
Hi,
I am a newbie to Nutch and have just started looking at it. I have a requirement to
crawl and index only urls that are specified under the urls folder. I do
not want nutch to crawl to any depth beyond the ones that are listed in
the urls folder.
Can I accomplish this by setting the depth argument
Hello Kumar,
There is a config property you can set in conf/nutch-site.xml, as follows:
<!--
<property>
<name>db.max.outlinks.per.page</name>
<value>0</value>
<description>The maximum number of outlinks that we'll process for a page.
If this value is nonnegative (>=0), at most
Thanks, Mischa. That worked!!
So, it looks like once this config property is set, crawl ignores the
'depth' argument. Even if I set 'depth' to 2, 3 etc., it will never
crawl any of the outlinks. Is that correct?
Regards,
Kumar.
Mischa Tuffield wrote:
Hello Kumar,
There is a config
Hi Kumar,
I am happy that was of use to you. Sadly, I have no feel for what the depth
argument does. I don't tend to use it; I use Nutch's more specific
commands instead: inject, generate, fetch, updatedb, merge, etc.
Perhaps someone else could shed light on the crawl command.
The depth argument is only used by the crawl command and is basically the
number of crawl/fetch/update/index cycles that get run.
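
To make the point above concrete, here is a toy Python model of "depth as a number of cycles". It is only a sketch and not Nutch code: the function name toy_crawl, the LINKS graph, and the max_outlinks parameter (standing in for db.max.outlinks.per.page, with None meaning "no cap") are all made up for illustration. It assumes each cycle fetches every known URL that has not been fetched yet and then adds the outlinks it found to the list of known URLs.

```python
# Hypothetical link graph: page -> outlinks found on that page.
LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": [],
    "d": ["e"],
}


def toy_crawl(seeds, links, depth, max_outlinks=None):
    """Toy model: run `depth` generate/fetch/update rounds.

    max_outlinks caps how many outlinks are kept per page
    (None = no cap). This mimics the idea of the config property,
    not its actual implementation.
    """
    known = list(seeds)   # "inject": the URLs we start with
    fetched = []
    for _ in range(depth):
        # "generate": known URLs that have not been fetched yet
        segment = [u for u in known if u not in fetched]
        if not segment:
            break         # nothing left to fetch, extra rounds are no-ops
        for url in segment:
            fetched.append(url)               # "fetch"
            outlinks = links.get(url, [])
            if max_outlinks is not None:
                outlinks = outlinks[:max_outlinks]
            for out in outlinks:              # "update": record new URLs
                if out not in known:
                    known.append(out)
    return fetched
```

In this model, toy_crawl(["a"], LINKS, depth=2) fetches a, b and c, while toy_crawl(["a"], LINKS, depth=3, max_outlinks=0) fetches only a: with no outlinks recorded, later rounds have nothing new to fetch, which would explain why raising the depth makes no difference once the property is set to 0.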
2010/1/8, Mischa Tuffield mischa.tuffi...@garlik.com:
Hi Kumar,
Am happy that that was of use to you. Sadly I have no feel for what the
depth argument does, I don't tend to ever
Thanks, MilleBii. That explains it. All the docs I came across mentioned
something like -depth /depth/ indicates the link depth from the root
page that should be crawled (from
http://lucene.apache.org/nutch/tutorial8.html).
MilleBii wrote:
Depth argument is only used for the crawl command