RE: domain crawl using bin/nutch

2009-12-21 Thread Jun Mao
But how can we tell Nutch to crawl this way every time?
I do not want to edit the *-urlfilter.txt files every time.

Thanks,
 
Jun

-Original Message-
From: Jesse Hires [mailto:jhi...@gmail.com] 
Sent: December 22, 2009 9:23
To: nutch-user@lucene.apache.org
Subject: Re: domain crawl using bin/nutch

You should be able to do this using one of the variations of the *-urlfilter.txt
files. Instead of putting a + in front of the regex, you can tell it to
exclude URLs that match the regex by using a -.

Just a guess, I haven't actually tried it, but you could probably use
something like the following (I'm sure you would have to fiddle with it to
get it to work correctly):

-.*/(pagename1\.php|pagename2\.php)
+^http://([a-z0-9]*\.)*mydomain\.com/
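(Note the exclude rule probably has to come before the include rule, since the
filter takes the first rule that matches a URL.)

If you want to sanity-check the rules before running a real crawl, something
like this should work from the Nutch install directory (untested here; the
URLFilterChecker tool reads URLs on stdin and echoes each one back prefixed
with + or - depending on whether the active filters accept it, and its options
may vary between versions):

# should print a leading - if the exclude rule works
echo "http://www.mydomain.com/pagename1.php" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
# should print a leading +
echo "http://www.mydomain.com/index.html" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined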



Jesse

int GetRandomNumber()
{
    return 4; // Chosen by fair roll of dice
              // Guaranteed to be random
} // xkcd.com



On Mon, Dec 21, 2009 at 2:14 PM, Ted Yu yuzhih...@gmail.com wrote:

 Hi,
 I found the db.ignore.external.links property.
 How do I further limit the crawl by also excluding some links within the
 same domain?

 Thanks



RE: Multiple Nutch instances for crawling?

2009-12-17 Thread Jun Mao
In my case, I am running many Nutch instances; I call them a spider pool.
From the client side, someone will submit URLs frequently. Each time my
server receives a URL, it will use that URL as a seed, send out a Nutch
crawler to that site (limited only to that site), crawl a few hundred pages
and analyze them. I guess I could not do this from the command line, so I
wrote some code myself.
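In case it helps to see it concretely, this is roughly what one worker in the
pool does per submitted URL (just a sketch; the paths and variables are made
up, and in reality we drive it from our own code rather than a shell script):

# one small crawl per submitted URL, confined to that site by the
# regex-urlfilter.txt and db.ignore.external.links in this instance's conf/
mkdir -p seeds/$JOB_ID
echo "$SUBMITTED_URL" > seeds/$JOB_ID/urls.txt
bin/nutch crawl seeds/$JOB_ID -dir crawls/$JOB_ID -depth 3 -topN 200
# the analysis step then reads the segments under crawls/$JOB_ID/segments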

Thanks,
 
Jun
-Original Message-
From: Felix Zimmermann [mailto:feliz...@gmx.de] 
Sent: December 17, 2009 5:26
To: Nutch Mailinglist
Subject: Multiple Nutch instances for crawling?

Hi,

I would like to run at least two instances of Nutch, ONLY for crawling, at
the same time: one for very frequently updated sites and one for other sites.
Will the Nutch instances get in trouble when running several crawl scripts,
especially with regard to the Nutch conf dir variable?

Thanks!
Felix.





RE: Multiple Nutch instances for crawling?

2009-12-17 Thread Jun Mao
Is that still true if I start two jobs (they will not share a crawldb or
linkdb) and write the indexes to two different locations?

Thanks,
 
Jun

-Original Message-
From: MilleBii [mailto:mille...@gmail.com] 
Sent: December 17, 2009 16:57
To: nutch-user@lucene.apache.org
Subject: Re: Multiple Nutch instances for crawling?

I guess it won't work because of the different nutch-site.xml and URL filters
you want to use... but you could try installing Nutch twice and running the
crawl/fetch/parse from those two locations, then join the segments to
recreate a unified searchable index (make sure you put all your segments
under the same location).
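Roughly like this (untested, and the paths are only examples); each install
keeps its own conf/ and crawl directories, and the segments get merged
afterwards:

# two completely separate installs, each with its own conf/ and crawl dir
/opt/nutch-frequent/bin/nutch crawl /opt/nutch-frequent/urls -dir /opt/nutch-frequent/crawl -depth 2
/opt/nutch-other/bin/nutch crawl /opt/nutch-other/urls -dir /opt/nutch-other/crawl -depth 5

# afterwards, collect all segments under one location and merge them,
# then rebuild the index from the merged segments
mkdir -p /opt/merged/all-segments
cp -r /opt/nutch-frequent/crawl/segments/* /opt/nutch-other/crawl/segments/* /opt/merged/all-segments/
bin/nutch mergesegs /opt/merged/segments -dir /opt/merged/all-segments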

Just one comment, though: I think Hadoop will serialize your jobs anyhow, so
you won't get parallel execution of your Hadoop jobs unless you run them on
different hardware.

2009/12/16 Christopher Bader cbba...@gmail.com

 Felix,

 I've had trouble running multiple instances.  I would be interested in
 hearing from anyone who has done it successfully.

 CB


 On Wed, Dec 16, 2009 at 4:26 PM, Felix Zimmermann feliz...@gmx.de wrote:

  Hi,
 
  I would like to run at least two instances of Nutch, ONLY for crawling, at
  the same time: one for very frequently updated sites and one for other sites.
  Will the Nutch instances get in trouble when running several crawl scripts,
  especially with regard to the Nutch conf dir variable?
 
  Thanks!
  Felix.
 
 
 
 




-- 
-MilleBii-