Hi All,
I am trying to decide whether I can use Nutch for a project I am working
on with the following requirements:
1. I need to build the ability to search a bunch of urls.
2. These urls are given to me and there is no need to crawl links from
or to these urls.
3. From time to time new urls
Hi,
I am a newbie to nutch. Just started looking at it. I have a requirement to
crawl and index only urls that are specified under the urls folder. I do
not want nutch to crawl to any depth beyond the ones that are listed in
the urls folder.
Can I accomplish this by setting the depth argument
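The depth argument in question is presumably the one taken by Nutch's one-shot crawl command (Nutch 1.x style; the directory names here are illustrative, not from the thread):

```shell
# Crawl only the seed urls in ./urls, one round deep.
# -depth 1 means a single generate/fetch cycle from the seed list.
bin/nutch crawl urls -dir crawl -depth 1
```

Note that depth 1 alone does not stop outlinks discovered in that round from entering the crawldb, which is what the rest of the thread addresses.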
for a page; otherwise, all outlinks will be processed.
</description>
</property>
--
This will force Nutch to fetch only items at depth 0, i.e. it won't
attempt to follow any of the outlinks from the pages you tell it to go
and fetch.
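The truncated property above looks like Nutch's db.max.outlinks.per.page; assuming that is the one being set (an assumption on my part, since the property name was cut off), the full nutch-site.xml entry would read roughly:

```xml
<property>
  <name>db.max.outlinks.per.page</name>
  <value>0</value>
  <description>The maximum number of outlinks that will be processed
  for a page; otherwise, all outlinks will be processed.</description>
</property>
```

With the value set to 0, no outlinks are added to the crawldb, so only the injected seed urls ever get fetched.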
Regards,
Mischa
On 8 Jan 2010, at 10:59, Kumar
Hi All,
I have some urls with query strings in them that need to be crawled.
I've commented out the appropriate line in crawl_urlfilter.txt and
regex-urlfilter.txt to enable crawling of urls that contain a '?'.
If I crawl urls like: http://queue.acm.org/detail.cfm?id=988409
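For context, the line being commented out is presumably the default character filter shipped in conf/regex-urlfilter.txt (the stock rule; the exact file contents may differ in your install):

```
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]
```

With the `-[?*!@=]` rule commented out as above, urls containing '?' such as http://queue.acm.org/detail.cfm?id=988409 are no longer rejected by the filter.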
commands: inject, generate, fetch, updatedb, merge, etc ...
Perhaps someone else could shed light on the crawl command.
Regards, and happy new year!
Mischa
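The step-by-step alternative to the crawl command that Mischa alludes to would look roughly like this (Nutch 1.x tooling; directory names are illustrative):

```
# one manual crawl round, equivalent to what "crawl" does internally
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
s=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $s
bin/nutch updatedb crawl/crawldb $s
```

Running the cycle by hand gives finer control than the one-shot crawl command, which is why it is often recommended once the defaults stop being enough.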
On 8 Jan 2010, at 11:49, Kumar Krishnasami wrote:
Thanks, Mischa. That worked!!
So, it looks like once this config property is set, crawl
Not sure if Peano's sixth axiom has any specific meaning in the
context of nutch.
I did try using a depth of 1 and it retrieved the root url as well as
urls under subfolders of the root url.
Godmar Back wrote:
Have you tried using Peano's sixth axiom?
On Fri, Jan 8, 2010 at 5:41 AM, Kumar