Got it, I will try that out; that's an excellent feature. Thank you for the help. To make sure I've understood it correctly, here is roughly what I am planning to set up. These are sketches only, so please correct me if I have any of the details wrong.
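First, the per-domain filter and the copy to the data nodes. The hostnames and the /etc/hadoop/conf path below are placeholders for my own cluster, and the filter entry just uses the standard regex-urlfilter.txt syntax:

    # regex-urlfilter.txt -- keep only pages under the current base domain,
    # reject everything else (the first matching pattern wins)
    +^http://www\.nutch\.org/
    -.

    # Copy the updated filter into the Hadoop conf directory on every data
    # node, so it is found there instead of the copy inside the .job file.
    # Hostnames and the conf path are placeholders for my own setup.
    for node in datanode1 datanode2 datanode3; do
        scp conf/regex-urlfilter.txt "$node":/etc/hadoop/conf/
    done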
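And here is the per-domain loop I have in mind for the whole workflow (grab a base domain, restrict the filter, crawl with the existing job file, index to Solr). Again just a sketch: the Solr URL, crawl depth, domains.txt input, and paths are placeholders, and I realize the exact solrindex arguments vary a little between Nutch versions:

    #!/bin/bash
    # Per-domain crawl loop (sketch): rewrite the URL filter, push it to the
    # data nodes, crawl with the already-built .job file, then index to Solr.
    SOLR_URL=http://localhost:8983/solr     # placeholder Solr instance
    HADOOP_CONF=/etc/hadoop/conf            # assumed conf dir on the nodes
    NODES="datanode1 datanode2"             # placeholder hostnames

    while read -r DOMAIN; do                # e.g. http://www.nutch.org
        # (1) restrict the URL filter to this domain only
        ESCAPED=$(printf '%s' "$DOMAIN" | sed 's/\./\\./g')
        printf '+^%s\n-.\n' "$ESCAPED" > regex-urlfilter.txt
        for n in $NODES; do
            scp regex-urlfilter.txt "$n":"$HADOOP_CONF"/
        done

        # (2) crawl with the existing job file -- no rebuild step
        printf '%s\n' "$DOMAIN" > seed.txt
        hadoop fs -rmr seeds > /dev/null 2>&1   # clear the previous seed dir
        hadoop fs -mkdir seeds
        hadoop fs -put seed.txt seeds/
        bin/nutch inject crawl/crawldb seeds
        SEGS=""
        for round in 1 2 3; do              # assumed depth of three rounds
            bin/nutch generate crawl/crawldb crawl/segments -topN 10000
            SEG=$(hadoop fs -ls crawl/segments | tail -1 | awk '{print $NF}')
            bin/nutch fetch "$SEG"
            bin/nutch parse "$SEG"
            bin/nutch updatedb crawl/crawldb "$SEG"
            SEGS="$SEGS $SEG"
        done

        # (3) build the linkdb and send this domain's segments to Solr
        #     (exact solrindex arguments differ a bit between Nutch versions)
        bin/nutch invertlinks crawl/linkdb $SEGS
        bin/nutch solrindex "$SOLR_URL" crawl/crawldb crawl/linkdb $SEGS
    done < domains.txt                      # base domains exported from Solr

If that matches what you had in mind, the ~25 second rebuild drops out entirely, since only the small filter file changes between domains.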
On Thu, Nov 29, 2012 at 4:06 AM, Markus Jelsma <[email protected]> wrote:

> As I said, you don't rebuild; you just overwrite the config file in the
> Hadoop config directory on the data nodes. Config files are looked up
> there as well. Just copy the file to the data nodes.
>
> -----Original message-----
> > From: AC Nutch <[email protected]>
> > Sent: Thu 29-Nov-2012 05:38
> > To: [email protected]
> > Subject: Re: Nutch efficiency and multiple single URL crawls
> >
> > Thanks for the help. Perhaps I am misunderstanding: what would be the
> > proper way to leverage this? I am a bit new to Nutch 1.5.1; I have been
> > using 1.4 and have generally been using runtime/deploy/bin/nutch with a
> > .job file. I notice things are done a bit differently in 1.5.1 with the
> > lack of the nutch runtime and deploy directories. How can I run a crawl
> > that leverages this functionality without having to rebuild the job
> > file for each new crawl? More specifically, I'm picturing the following
> > workflow:
> >
> > (1) update the config file to restrict the crawl to a domain -> (2) run
> > a command that crawls that domain, picking up the changed config file
> > without rebuilding the job file -> (3) index to Solr
> >
> > My question is: what would the (general) command be for step (2)?
> >
> > On Mon, Nov 26, 2012 at 5:16 AM, Markus Jelsma
> > <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Rebuilding the job file for each domain is indeed not a good idea,
> > > plus it adds the Hadoop overhead. But you don't have to: we write
> > > dynamic config files to each node's Hadoop configuration directory
> > > and they are picked up instead of the embedded configuration files.
> > >
> > > Cheers,
> > >
> > > -----Original message-----
> > > > From: AC Nutch <[email protected]>
> > > > Sent: Mon 26-Nov-2012 06:50
> > > > To: [email protected]
> > > > Subject: Nutch efficiency and multiple single URL crawls
> > > >
> > > > Hello,
> > > >
> > > > I am using Nutch 1.5.1 and I am looking to do something specific
> > > > with it. I have a few million base domains in a Solr index, for
> > > > example http://www.nutch.org, http://www.apache.org,
> > > > http://www.whatever.com, etc. I am trying to crawl each of these
> > > > base domains in deploy mode and retrieve all of the sub-URLs
> > > > associated with that domain in the most efficient way possible. To
> > > > give you an example of the workflow I am trying to achieve:
> > > > (1) grab a base domain, let's say http://www.nutch.org; (2) crawl
> > > > the base domain for all URLs in that domain, let's say
> > > > http://www.nutch.org/page1, http://www.nutch.org/page2,
> > > > http://www.nutch.org/page3, etc.; (3) store these results somewhere
> > > > (perhaps another Solr instance); and (4) move on to the next base
> > > > domain in my Solr index and repeat the process. Essentially I am
> > > > just trying to grab all links associated with a page and then move
> > > > on to the next page.
> > > >
> > > > The part I am having trouble with is ensuring that this workflow is
> > > > efficient. The only way I can think to do this would be: (1) grab a
> > > > base domain from Solr in my shell script (simple enough); (2) add
> > > > an entry to regex-urlfilter with the domain I am looking to
> > > > restrict the crawl to, which in the example above would be an entry
> > > > that says to only keep sub-pages of http://www.nutch.org/;
> > > > (3) recreate the Nutch job file (~25 sec.); (4) start the crawl for
> > > > pages associated with the domain and do the indexing.
> > > >
> > > > My issue is with step #3. AFAIK, if I want to restrict a crawl to a
> > > > specific domain, I have to change regex-urlfilter and rebuild the
> > > > job file. This is a pretty significant problem, since adding 25
> > > > seconds every single time I start a new base domain is going to add
> > > > far too much time to my workflow (25 sec x a few million = way too
> > > > much time). Finally, the question: is there a way to add URL
> > > > filters on the fly when I start a crawl and/or restrict a crawl to
> > > > a particular domain on the fly? Or can you think of a decent
> > > > solution to the problem / am I missing something?

--
___
Alejandro Caceres
Hyperion Gray, LLC
Owner/CTO

