RE: domain crawl using bin/nutch
But how could I tell Nutch to crawl this way every time? I do not want to edit *-urlfilter.txt every time.

Thanks,
Jun

-----Original Message-----
From: Jesse Hires [mailto:jhi...@gmail.com]
Sent: December 22, 2009 9:23
To: nutch-user@lucene.apache.org
Subject: Re: domain crawl using bin/nutch

You should be able to do this using one of the variations of the *-urlfilter.txt files. Instead of using + in front of the regex, you can tell Nutch to exclude URLs that match the regex with a -. Just a guess, I haven't actually tried it, but you could probably use something like the following (I'm sure you would have to fiddle with it to get it to work correctly):

+^http://([a-z0-9]*\.)*mydomain\.com/
-.*/(pagename1\.php|pagename2\.php)

Jesse

int GetRandomNumber() {
    return 4; // Chosen by fair roll of dice.
              // Guaranteed to be random.
} // xkcd.com

On Mon, Dec 21, 2009 at 2:14 PM, Ted Yu yuzhih...@gmail.com wrote:

> Hi,
> I found the db.ignore.external.links property. How do I limit the crawl by also excluding certain links within the same domain?
> Thanks
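Since regex-urlfilter.txt rules are ordinary regular expressions, the two filter lines suggested in Jesse's reply can be sanity-checked outside Nutch before a crawl. A minimal sketch using grep, where the sample URLs and mydomain.com are just the placeholders from the thread:

```shell
# Accept pattern: any subdomain of mydomain.com.
# Exclude pattern: the two unwanted pages, anywhere under the site.
accept='^http://([a-z0-9]*\.)*mydomain\.com/'
exclude='.*/(pagename1\.php|pagename2\.php)'

for url in "http://www.mydomain.com/index.html" \
           "http://www.mydomain.com/dir/pagename1.php"; do
  if echo "$url" | grep -Eq "$exclude"; then
    echo "excluded: $url"
  elif echo "$url" | grep -Eq "$accept"; then
    echo "accepted: $url"
  fi
done
# prints:
# accepted: http://www.mydomain.com/index.html
# excluded: http://www.mydomain.com/dir/pagename1.php
```

One caveat: if I remember Nutch's RegexURLFilter correctly, the first rule that matches a URL wins, so the - exclusion line would likely need to appear before the broad + acceptance line in the actual filter file, not after it.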
RE: Multiple Nutch instances for crawling?
In my case, I am running many Nutch instances; I call them a spider pool. On the client side, someone submits URLs frequently. Each time my server receives a URL, it uses that URL as a seed and sends out a Nutch crawler to that site (limited to that site only), crawls a few hundred pages, and analyzes them. I guess I could not do this from the command line, so I wrote some code myself.

Thanks,
Jun

-----Original Message-----
From: Felix Zimmermann [mailto:feliz...@gmx.de]
Sent: December 17, 2009 5:26
To: Nutch Mailinglist
Subject: Multiple Nutch instances for crawling?

Hi,

I would like to run at least two instances of Nutch, ONLY for crawling, at the same time: one for very frequently updated sites and one for other sites. Will the Nutch instances get in trouble when running several crawl scripts, especially with the Nutch confdir variable?

Thanks!
Felix.
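For what it's worth, a per-URL, site-limited crawl along the lines Jun describes can be approximated from the command line with the legacy all-in-one crawl job. This is only a sketch: it assumes a Nutch 1.x install with db.ignore.external.links set to true in conf/nutch-site.xml, and the directory layout and budget numbers are invented for illustration:

```shell
# Sketch: give each submitted URL its own seed dir and crawl dir,
# then run the legacy one-shot crawl job with a page budget.
url="http://example.com/"        # the submitted URL (placeholder)
job="jobs/$(date +%s)"           # per-job workspace, named by timestamp
mkdir -p "$job/urls"
echo "$url" > "$job/urls/seed.txt"

# -topN caps pages fetched per round; a small -depth keeps the whole
# crawl to a few hundred pages for later analysis.
bin/nutch crawl "$job/urls" -dir "$job/crawl" -depth 3 -topN 200
```

Driving one such script per submitted URL from the server avoids editing a shared configuration for every job, since each crawl gets its own seed list and output directory.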
RE: Multiple Nutch instances for crawling?
Is that still true if I start two jobs (they will not share a crawldb or linkdb) and write the indexes to two different locations?

Thanks,
Jun

-----Original Message-----
From: MilleBii [mailto:mille...@gmail.com]
Sent: December 17, 2009 16:57
To: nutch-user@lucene.apache.org
Subject: Re: Multiple Nutch instances for crawling?

I guess it won't work because of the different nutch-site.xml URL filters that you want to use... but you could try installing Nutch twice and running the crawl/fetch/parse from those two locations, then join the segments to recreate a unified searchable index (make sure you put all your segments under the same location). Just one comment, though: I think Hadoop will serialize your jobs anyhow, so you won't get parallel execution of your Hadoop jobs unless you run them from different hardware.

2009/12/16 Christopher Bader cbba...@gmail.com

> Felix,
> I've had trouble running multiple instances. I would be interested in hearing from anyone who has done it successfully.
> CB
>
> On Wed, Dec 16, 2009 at 4:26 PM, Felix Zimmermann feliz...@gmx.de wrote:
>
>> Hi,
>> I would like to run at least two instances of Nutch, ONLY for crawling, at the same time: one for very frequently updated sites and one for other sites. Will the Nutch instances get in trouble when running several crawl scripts, especially with the Nutch confdir variable?
>> Thanks!
>> Felix.

--
-MilleBii-
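MilleBii's "install twice, then unify the segments" suggestion could look roughly like the following. To be clear, this is a sketch, not a tested recipe: the install paths and seed directories are invented, and gathering segments by copying is just one way to "put all your segments under the same location" as suggested above:

```shell
# Two fully separate installs, each with its own conf/ (and therefore
# its own nutch-site.xml and URL filters) and its own crawl directory.
~/nutch-frequent/bin/nutch crawl ~/seeds-frequent -dir ~/crawl-frequent -depth 2
~/nutch-other/bin/nutch    crawl ~/seeds-other    -dir ~/crawl-other    -depth 2

# Gather all segments under one location so a single searchable index
# can be built over both crawls.
mkdir -p ~/crawl-all/segments
cp -r ~/crawl-frequent/segments/. ~/crawl-all/segments/
cp -r ~/crawl-other/segments/.    ~/crawl-all/segments/
```

Note this does not address MilleBii's other point: on a single Hadoop installation the two crawl jobs would still run serially rather than in parallel.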