On Oct 2, 2009, at 3:38 PM, Haris Papadopoulos wrote:
Hi,
Hi Haris,
Maybe you can use some code snippets from the Nutch GUI v0.2 (http://github.com/101tec/nutch). This version has an API to reload the searcher (only NutchBeans are supported).
For example: SearcherFactory
Hi,
I've been trying to set up a Nutch-hadoop distributed environment to crawl a
3 Million URL list.
My experience so far is:
1. Nutch is working fine in a single-machine environment. Here I wrote a script
file which calls the nutch crawl command first to crawl 1000 URLs. Then it
crawls the next
Does anyone know if it is possible to target only certain links for
crawling, dynamically, during a crawl? My goal would be to write a
plugin for this functionality, but I don't know where to start.
Thanks,
EO
Eric wrote:
Does anyone know if it is possible to target only certain links for
crawling, dynamically, during a crawl? My goal would be to write a plugin
for this functionality, but I don't know where to start.
URLFilter plugins may be what you want.
--
Best regards,
Andrzej Bialecki
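For reference, the stock URLFilter implementation is the urlfilter-regex plugin, configured via conf/regex-urlfilter.txt. A minimal filter file along these lines (the URL patterns below are made-up examples, not from this thread) would accept only the targeted links:

```
# Example regex-urlfilter.txt rules (hypothetical patterns).
# Rules are applied top-down; the first matching +/- rule wins.

# accept only links under the sections we care about
+^http://www\.example\.com/products/
+^http://www\.example\.com/news/

# reject everything else
-.
```

For fully dynamic targeting (decisions made per-link at crawl time), a custom plugin implementing the URLFilter extension point would be the place to start, as suggested above.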
My plan is to crawl ~1.6M TLD's to a depth of 2. Is there a way I can
crawl it in increments of 100K? e.g. crawl 100K 16 times for the TLD's
then crawl the links generated from the TLD's in increments of 100K?
Thanks,
EO
How do you want to target certain links? Do you know how the links are made, i.e.
their format?
If so, you can just set a regular expression to accept only those kinds of links.
Date: Mon, 5 Oct 2009 21:39:52 +0200
From: a...@getopt.org
To: nutch-user@lucene.apache.org
Subject: Re: Targeting
Adam,
Yes, I have a list of strings I would look for in the link. My plan is
to look for X number of links on the site: first looking for the
links I want and adding them if they exist; if they don't exist, adding X
links from the site. I am planning to start with the URLFilter plugin.
Eric
Adam,
You could turn off all the indexing plugins and write your own plugin
that only indexes certain meta content from your intranet, giving you
complete control of the fields indexed.
Eric
On Oct 5, 2009, at 1:06 PM, BELLINI ADAM wrote:
hi
does anybody know if it's possible to
But when you start, you inject your starting points from your seed; after
that Nutch will fetch URLs and will bypass those filtered out by the URL filter
(regular expression). So to calculate the number X of those URLs you would have to
crawl your whole site!
So for sure, if you will not have any
Eric wrote:
My plan is to crawl ~1.6M TLD's to a depth of 2. Is there a way I can
crawl it in increments of 100K? e.g. crawl 100K 16 times for the TLD's
then crawl the links generated from the TLD's in increments of 100K?
Yes. Make sure that you have the generate.update.db property set to true.
Andrzej,
Just to make sure I have this straight, set the generate.update.db
property to true then
bin/nutch generate crawl/crawldb crawl/segments -topN 10: 16 times?
Thanks,
Eric
On Oct 5, 2009, at 1:27 PM, Andrzej Bialecki wrote:
Eric wrote:
My plan is to crawl ~1.6M TLD's to a
All,
I am a masters student and want to crawl the whole web for my masters
project.
While trying to generate, fetch, and crawl the whole web using Nutch (I am
following the steps from http://lucene.apache.org/nutch/tutorial8.html), I got
confused by various Nutch terms and their usage:
1) What is the
Eric wrote:
Andrzej,
Just to make sure I have this straight, set the generate.update.db
property to true then
bin/nutch generate crawl/crawldb crawl/segments -topN 10: 16 times?
Yes. When this property is set to true, then each fetchlist will be
different, because the records for URLs already placed on a fetchlist are
marked in the crawldb, so subsequent generate runs skip them.
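Putting the pieces of this thread together, the incremental crawl could be scripted roughly as follows. This is only a sketch: the paths and the 100K/16-round figures come from the messages above, but the script has not been tested against any particular Nutch version.

```shell
#!/bin/sh
# Sketch: with generate.update.db=true, run 16 rounds of
# generate/fetch/updatedb, 100K URLs per round (1.6M total).
for round in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 100000
  # the segment just created is the newest directory under crawl/segments
  segment=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment
done
```

Because generate marks the selected records in the crawldb, each round's fetchlist is disjoint from the previous ones.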
All,
At any point in time, is there a way to know how many URLs are there
in my *crawldb*?
Regards,
Gaurang
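For what it's worth, the readdb tool can print CrawlDb statistics, including a total URL count, with something like the following (the crawldb path here is an example):

```shell
bin/nutch readdb crawl/crawldb -stats
```

Among the statistics it prints is a "TOTAL urls" line, along with counts per fetch status.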
Hey Andrzej,
Can you tell me where to set this property (generate.update.db)? I am trying
to run similar kind of crawl scenario that Eric is running.
-Gaurang
2009/10/5 Andrzej Bialecki a...@getopt.org
Eric wrote:
Andrzej,
Just to make sure I have this straight, set the generate.update.db
Hey Jack,
*One concern:*
I am not sure where I can get 0.1 billion page URLs. I am using the DMOZ Open
Directory (which has around 3M URLs) to inject the crawldb.
Please help.
Regards,
Gaurang
2009/10/4 Jack Yu jackyu...@gmail.com
0.1 billion pages for 1.5TB
On 10/5/09, Gaurang Patel
Hey,
Never mind. I found *generate.update.db* in *nutch-default.xml* and set it
to true.
Regards,
Gaurang
2009/10/5 Gaurang Patel gaurangtpa...@gmail.com
Hey Andrzej,
Can you tell me where to set this property (generate.update.db)? I am
trying to run similar kind of crawl scenario that Eric is
0.1 billion is pages, not URLs.
Sorry for that; it should be 4TB for 0.1 billion pages.
On 10/6/09, Gaurang Patel gaurangtpa...@gmail.com wrote:
Hey Jack,
*One concern:*
I am not sure where I can get 0.1 billion page URLs. I am using the DMOZ Open
Directory (which has around 3M URLs) to inject the