Re: NutchBean refresh index problem

2009-10-05 Thread Marko Bauhardt
On Oct 2, 2009, at 3:38 PM, Haris Papadopoulos wrote: Hi, Hi Haris. Maybe you can use some code snippets from the Nutch GUI v0.2 (http://github.com/101tec/nutch). This version has an API to reload the searcher (only NutchBeans are supported). For example: SearcherFactory

Nutch - DFS environment. Is it stable?

2009-10-05 Thread tittutomen
Hi, I've been trying to set up a Nutch-Hadoop distributed environment to crawl a 3 million URL list. My experience so far is: 1. Nutch is working fine in a single-machine environment. Here I wrote a script file which first calls the nutch crawl command to crawl 1000 URLs. Then it crawls the next
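A script along those lines might look roughly like this minimal sketch (the paths, -depth value, and batch count are assumptions, not from the original post):

  #!/bin/sh
  # Sketch: run the one-shot crawl command in batches of 1000 URLs.
  # The crawl-$i output directories and -depth 3 are illustrative.
  for i in 1 2 3; do
    bin/nutch crawl urls -dir crawl-$i -depth 3 -topN 1000
  done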

Targeting Specific Links for Crawling

2009-10-05 Thread Eric
Does anyone know if it is possible to target only certain links for crawling dynamically during a crawl? My goal would be to write a plugin for this functionality, but I don't know where to start. Thanks, EO

Re: Targeting Specific Links for Crawling

2009-10-05 Thread Andrzej Bialecki
Eric wrote: Does anyone know if it is possible to target only certain links for crawling dynamically during a crawl? My goal would be to write a plugin for this functionality, but I don't know where to start. URLFilter plugins may be what you want. -- Best regards, Andrzej Bialecki
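A URLFilter returns the URL to keep it, or null to drop it. A minimal sketch (the class name and the matching criterion are made up; assumes the Nutch 1.x interface, which also requires the Hadoop Configurable methods):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  // Hypothetical example: keep only URLs containing a target substring.
  public class TargetLinkFilter implements URLFilter {
    private Configuration conf;

    public String filter(String urlString) {
      // Return the URL to keep it, null to discard it.
      return urlString.contains("/products/") ? urlString : null;
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }

The plugin also needs a plugin.xml descriptor and an entry in the plugin.includes property; the bundled urlfilter-regex plugin is a good template to copy from.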

Incremental Whole Web Crawling

2009-10-05 Thread Eric
My plan is to crawl ~1.6M TLDs to a depth of 2. Is there a way I can crawl it in increments of 100K? E.g., crawl 100K 16 times for the TLDs, then crawl the links generated from the TLDs in increments of 100K? Thanks, EO

RE: Targeting Specific Links for Crawling

2009-10-05 Thread BELLINI ADAM
How do you target certain links? Do you know how the links are made, i.e. their format? You can just set a regular expression to accept only those kinds of links.
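For example, with the regex-urlfilter plugin enabled, conf/regex-urlfilter.txt can accept only the known link format and reject everything else. The pattern below is a made-up placeholder; rules are applied top-down and the first match wins:

  # accept only links matching the known format (placeholder pattern)
  +^http://www\.example\.com/articles/.*
  # reject everything else
  -.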

Re: Targeting Specific Links for Crawling

2009-10-05 Thread Eric
Adam, Yes, I have a list of strings I would look for in the link. My plan is to look for X number of links on the site: first looking for the links I want and adding them if they exist; if they don't exist, add X links from the site. I am planning to start with the URL Filter plugin. Eric

Re: indexing just certain content

2009-10-05 Thread Eric
Adam, You could turn off all the indexing plugins and write your own plugin that indexes only certain meta content from your intranet, giving you complete control of the fields indexed. Eric On Oct 5, 2009, at 1:06 PM, BELLINI ADAM wrote: Hi, does anybody know if it's possible to
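A sketch of such a plugin, assuming the Nutch 1.x NutchDocument-based IndexingFilter API (the meta tag and field name are invented for illustration, and the exact interface varies a little across versions):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.crawl.Inlinks;
  import org.apache.nutch.indexer.IndexingException;
  import org.apache.nutch.indexer.IndexingFilter;
  import org.apache.nutch.indexer.NutchDocument;
  import org.apache.nutch.parse.Parse;

  // Hypothetical example: index only one meta tag into a custom field.
  public class MetaOnlyIndexer implements IndexingFilter {
    private Configuration conf;

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
        CrawlDatum datum, Inlinks inlinks) throws IndexingException {
      String value = parse.getData().getMeta("department"); // made-up tag
      if (value != null) {
        doc.add("department", value);
      }
      return doc;
    }

    // Declared by some 1.x versions of the interface; no-op here.
    public void addIndexBackendOptions(Configuration conf) { }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }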

RE: Targeting Specific Links for Crawling

2009-10-05 Thread BELLINI ADAM
But you start by injecting your starting point from your seed; after that Nutch will fetch URLs and will bypass those filtered out by the URL filter (regular expression). So to calculate the number X of those URLs you have to crawl your whole site! So for sure, if you will not have any

Re: Incremental Whole Web Crawling

2009-10-05 Thread Andrzej Bialecki
Eric wrote: My plan is to crawl ~1.6M TLDs to a depth of 2. Is there a way I can crawl it in increments of 100K? E.g., crawl 100K 16 times for the TLDs, then crawl the links generated from the TLDs in increments of 100K? Yes. Make sure that you have the generate.update.db property set to true.
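The property can be overridden in conf/nutch-site.xml; the description below is a paraphrase of its documented behavior:

  <property>
    <name>generate.update.db</name>
    <value>true</value>
    <description>If true, updates the crawldb after each generate so that
    URLs already placed on a pending fetchlist are not selected again.
    </description>
  </property>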

Re: Incremental Whole Web Crawling

2009-10-05 Thread Eric
Andrzej, Just to make sure I have this straight: set the generate.update.db property to true, then run bin/nutch generate crawl/crawldb crawl/segments -topN 100000, 16 times? Thanks, Eric On Oct 5, 2009, at 1:27 PM, Andrzej Bialecki wrote: Eric wrote: My plan is to crawl ~1.6M TLDs to a

generate, fetch- nutch commands

2009-10-05 Thread Gaurang Patel
All, I am a master's student and want to crawl the whole web for my master's project. While trying to generate, fetch, and crawl the whole web using Nutch (I am following the steps from http://lucene.apache.org/nutch/tutorial8.html), I got confused by various Nutch terms and usage: 1) What is the

Re: Incremental Whole Web Crawling

2009-10-05 Thread Andrzej Bialecki
Eric wrote: Andrzej, Just to make sure I have this straight: set the generate.update.db property to true, then bin/nutch generate crawl/crawldb crawl/segments -topN 100000, 16 times? Yes. When this property is set to true, each fetchlist will be different, because the records for URLs already generated are marked in the crawldb and are not selected again.
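Put together, the incremental schedule would look roughly like this sketch (the paths are assumptions; -topN 100000 gives the 100K increments, and generate.update.db=true makes each generate skip URLs already on a pending fetchlist):

  #!/bin/sh
  # Sketch: 16 incremental generate/fetch/updatedb cycles of 100K URLs each.
  for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 100000
    s=`ls -d crawl/segments/2* | tail -1`   # pick the newest segment
    bin/nutch fetch $s
    bin/nutch updatedb crawl/crawldb $s
  done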

Number of urls in the crawl database.

2009-10-05 Thread Gaurang Patel
All- At any point in time, is there a way to know how many URLs there are in my *crawldb*? Regards, Gaurang
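One way to get that count is the readdb tool's -stats option, which prints the total number of URLs along with per-status counts (crawl/crawldb below is a placeholder path):

  bin/nutch readdb crawl/crawldb -stats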

Re: Incremental Whole Web Crawling

2009-10-05 Thread Gaurang Patel
Hey Andrzej, Can you tell me where to set this property (generate.update.db)? I am trying to run a similar kind of crawl scenario to the one Eric is running. -Gaurang 2009/10/5 Andrzej Bialecki a...@getopt.org Eric wrote: Andrzej, Just to make sure I have this straight: set the generate.update.db

Re: whole web crawl

2009-10-05 Thread Gaurang Patel
Hey Jack, *One concern:* I am not sure where I can get 0.1 billion page URLs. I am using the DMOZ Open Directory (which has around 3M URLs) to inject the crawldb. Please help. Regards, Gaurang 2009/10/4 Jack Yu jackyu...@gmail.com 0.1 billion pages for 1.5TB On 10/5/09, Gaurang Patel

Re: Incremental Whole Web Crawling

2009-10-05 Thread Gaurang Patel
Hey, Never mind. I found *generate.update.db* in *nutch-default.xml* and set it to true. Regards, Gaurang 2009/10/5 Gaurang Patel gaurangtpa...@gmail.com Hey Andrzej, Can you tell me where to set this property (generate.update.db)? I am trying to run a similar kind of crawl scenario to the one Eric is

Re: whole web crawl

2009-10-05 Thread Jack Yu
0.1 billion is pages, not URLs; sorry for that. It should be 4TB for 0.1 billion pages. On 10/6/09, Gaurang Patel gaurangtpa...@gmail.com wrote: Hey Jack, *One concern:* I am not sure where I can get 0.1 billion page URLs. I am using the DMOZ Open Directory (which has around 3M URLs) to inject the