Using Nutch to crawl and use it as input to Solr

2010-01-22 Thread Kumar Krishnasami
Hi All, I am trying to decide if I could use Nutch for a project I am working on with the following requirements: 1. I need to build the ability to search a bunch of urls. 2. These urls are given to me and there is no need to crawl links from or to these urls. 3. From time to time new urls

Crawling only specific urls and depth

2010-01-08 Thread Kumar Krishnasami
Hi, I am a newbie to nutch and have just started looking at it. I have a requirement to crawl and index only urls that are specified under the urls folder. I do not want nutch to crawl to any depth beyond the ones that are listed in the urls folder. Can I accomplish this by setting the depth argument

Crawl specific urls and depth argument

2010-01-08 Thread Kumar Krishnasami
Hi, I am a newbie to nutch and have just started looking at it. I have a requirement to crawl and index only urls that are specified under the urls folder. I do not want nutch to crawl to any depth beyond the ones that are listed in the urls folder. Can I accomplish this by setting the depth argument

Re: Crawl specific urls and depth argument

2010-01-08 Thread Kumar Krishnasami
for a page; otherwise, all outlinks will be processed. </description> </property> -- This will force nutch to only fetch items of depth 0, i.e. it won't attempt to follow any of the outlinks from pages you tell it to go and fetch. Regards, Mischa On 8 Jan 2010, at 10:59, Kumar
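
The property name is cut off in the snippet above, so the following is only a rough sketch of the behaviour the quoted description suggests. It assumes the setting is a per-page cap on outlinks: a nonnegative value limits how many are processed, and a negative value processes all of them. The function name `limit_outlinks` and its parameters are hypothetical and are not part of Nutch.

```python
def limit_outlinks(outlinks, max_outlinks):
    """Return the outlinks that would be kept under an assumed per-page cap.

    Hypothetical helper, not Nutch code. Assumption: a nonnegative cap
    keeps at most that many outlinks; a negative cap keeps all of them.
    """
    if max_outlinks >= 0:
        return list(outlinks[:max_outlinks])
    return list(outlinks)


if __name__ == "__main__":
    links = ["http://example.com/a", "http://example.com/b"]
    # With a cap of 0 no outlinks survive, so only the seed urls are fetched.
    print(limit_outlinks(links, 0))
    # With a negative cap every outlink is kept.
    print(limit_outlinks(links, -1))
```

Under that assumption, a cap of 0 matches the "depth 0" effect described in the reply: the listed pages are fetched, but none of their links are queued.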

Enabling Query Strings in *filter.txt files

2010-01-08 Thread Kumar Krishnasami
Hi All, I have some urls that need to be crawled that have a query string in them. I've commented out the appropriate line in crawl_urlfilter.txt and regex-urlfilter.txt to enable crawling of urls that contain a '?' in them. If I crawl urls like: http://queue.acm.org/detail.cfm?id=988409
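
As a rough illustration of why commenting out that line matters, here is a simplified model of a `+`/`-` regex rule list in the style of regex-urlfilter.txt. It assumes the first matching rule decides, that a url matching no rule is rejected, and that the stock rule in question is `-[?*!@=]`. The helper names `parse_rules` and `accepts` are made up for this sketch and it is not the actual Nutch filter implementation.

```python
import re

# Rules as they might appear in the filter file (assumed defaults).
DEFAULT_RULES = [
    "-[?*!@=]",  # reject urls containing characters typical of query strings
    "+.",        # accept everything else
]


def parse_rules(lines):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and commented-out rules are ignored
        if line[0] not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """Apply the rules in order; the first pattern found in the url decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a url that matches no rule is dropped


if __name__ == "__main__":
    url = "http://queue.acm.org/detail.cfm?id=988409"
    print(accepts(parse_rules(DEFAULT_RULES), url))
    print(accepts(parse_rules(["# -[?*!@=]", "+."]), url))
```

With the reject rule active the sample url is filtered out because it contains '?' and '='. Once that line is commented out, the catch-all `+.` rule accepts it.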

Re: Crawl specific urls and depth argument

2010-01-08 Thread Kumar Krishnasami
commands: inject, generate, fetch, updatedb, merge, etc ... Perhaps someone else could shed light on the crawl command. Regards, and happy New Year! Mischa On 8 Jan 2010, at 11:49, Kumar Krishnasami wrote: Thanks, Mischa. That worked!! So, it looks like once this config property is set, crawl

Re: Crawling only specific urls and depth

2010-01-08 Thread Kumar Krishnasami
Not sure if Peano's sixth axiom has any specific meaning in the context of nutch. I did try using a depth of 1 and it retrieved the root url as well as urls under subfolders of the root url. Godmar Back wrote: Have you tried using Peano's sixth axiom? On Fri, Jan 8, 2010 at 5:41 AM, Kumar