Got it, I will try that out; that's an excellent feature. Thank you for the help. To make sure I've understood it correctly, here is roughly what I am planning to set up. These are sketches only, so please correct me if I have any of the details wrong.
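First, the per-domain filter and the copy to the data nodes. The hostnames and the /etc/hadoop/conf path below are placeholders for my own cluster, and the filter entry just uses the standard regex-urlfilter.txt syntax:

    # regex-urlfilter.txt -- keep only pages under the current base domain,
    # reject everything else (the first matching pattern wins)
    +^http://www\.nutch\.org/
    -.

    # Copy the updated filter into the Hadoop conf directory on every data
    # node, so it is found there instead of the copy inside the .job file.
    # Hostnames and the conf path are placeholders for my own setup.
    for node in datanode1 datanode2 datanode3; do
        scp conf/regex-urlfilter.txt "$node":/etc/hadoop/conf/
    done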
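And here is the per-domain loop I have in mind for the whole workflow (grab a base domain, restrict the filter, crawl with the existing job file, index to Solr). Again just a sketch: the Solr URL, crawl depth, domains.txt input, and paths are placeholders, and I realize the exact solrindex arguments vary a little between Nutch versions:

    #!/bin/bash
    # Per-domain crawl loop (sketch): rewrite the URL filter, push it to the
    # data nodes, crawl with the already-built .job file, then index to Solr.
    SOLR_URL=http://localhost:8983/solr     # placeholder Solr instance
    HADOOP_CONF=/etc/hadoop/conf            # assumed conf dir on the nodes
    NODES="datanode1 datanode2"             # placeholder hostnames

    while read -r DOMAIN; do                # e.g. http://www.nutch.org
        # (1) restrict the URL filter to this domain only
        ESCAPED=$(printf '%s' "$DOMAIN" | sed 's/\./\\./g')
        printf '+^%s\n-.\n' "$ESCAPED" > regex-urlfilter.txt
        for n in $NODES; do
            scp regex-urlfilter.txt "$n":"$HADOOP_CONF"/
        done

        # (2) crawl with the existing job file -- no rebuild step
        printf '%s\n' "$DOMAIN" > seed.txt
        hadoop fs -rmr seeds > /dev/null 2>&1   # clear the previous seed dir
        hadoop fs -mkdir seeds
        hadoop fs -put seed.txt seeds/
        bin/nutch inject crawl/crawldb seeds
        SEGS=""
        for round in 1 2 3; do              # assumed depth of three rounds
            bin/nutch generate crawl/crawldb crawl/segments -topN 10000
            SEG=$(hadoop fs -ls crawl/segments | tail -1 | awk '{print $NF}')
            bin/nutch fetch "$SEG"
            bin/nutch parse "$SEG"
            bin/nutch updatedb crawl/crawldb "$SEG"
            SEGS="$SEGS $SEG"
        done

        # (3) build the linkdb and send this domain's segments to Solr
        #     (exact solrindex arguments differ a bit between Nutch versions)
        bin/nutch invertlinks crawl/linkdb $SEGS
        bin/nutch solrindex "$SOLR_URL" crawl/crawldb crawl/linkdb $SEGS
    done < domains.txt                      # base domains exported from Solr

If that matches what you had in mind, the ~25 second rebuild drops out entirely, since only the small filter file changes between domains.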
On Thu, Nov 29, 2012 at 4:06 AM, Markus Jelsma <[email protected]> wrote:

> As I said, you don't rebuild; you just overwrite the config file in the
> Hadoop config directory on the data nodes. Config files are looked up
> there as well. Just copy the file to the data nodes.
>
> -----Original message-----
> > From: AC Nutch <[email protected]>
> > Sent: Thu 29-Nov-2012 05:38
> > To: [email protected]
> > Subject: Re: Nutch efficiency and multiple single URL crawls
> >
> > Thanks for the help. Perhaps I am misunderstanding: what would be the
> > proper way to leverage this? I am a bit new to Nutch 1.5.1; I have been
> > using 1.4 and have generally been using runtime/deploy/bin/nutch with a
> > .job file. I notice things are done a bit differently in 1.5.1 with the
> > lack of the nutch runtime and deploy directories. How can I run a crawl
> > that leverages this functionality without having to rebuild the job
> > file for each new crawl? More specifically, I'm picturing the following
> > workflow:
> >
> > (1) update the config file to restrict the crawl to a domain -> (2) run
> > a command that crawls that domain, picking up the changed config file
> > without rebuilding the job file -> (3) index to Solr
> >
> > My question is: what would the (general) command be for step (2)?
> >
> > On Mon, Nov 26, 2012 at 5:16 AM, Markus Jelsma
> > <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Rebuilding the job file for each domain is indeed not a good idea,
> > > plus it adds the Hadoop overhead. But you don't have to: we write
> > > dynamic config files to each node's Hadoop configuration directory
> > > and they are picked up instead of the embedded configuration files.
> > >
> > > Cheers,
> > >
> > > -----Original message-----
> > > > From: AC Nutch <[email protected]>
> > > > Sent: Mon 26-Nov-2012 06:50
> > > > To: [email protected]
> > > > Subject: Nutch efficiency and multiple single URL crawls
> > > >
> > > > Hello,
> > > >
> > > > I am using Nutch 1.5.1 and I am looking to do something specific
> > > > with it. I have a few million base domains in a Solr index, for
> > > > example http://www.nutch.org, http://www.apache.org,
> > > > http://www.whatever.com, etc. I am trying to crawl each of these
> > > > base domains in deploy mode and retrieve all of the sub-URLs
> > > > associated with that domain in the most efficient way possible. To
> > > > give you an example of the workflow I am trying to achieve:
> > > > (1) grab a base domain, let's say http://www.nutch.org; (2) crawl
> > > > the base domain for all URLs in that domain, let's say
> > > > http://www.nutch.org/page1, http://www.nutch.org/page2,
> > > > http://www.nutch.org/page3, etc.; (3) store these results somewhere
> > > > (perhaps another Solr instance); and (4) move on to the next base
> > > > domain in my Solr index and repeat the process. Essentially I am
> > > > just trying to grab all links associated with a page and then move
> > > > on to the next page.
> > > >
> > > > The part I am having trouble with is ensuring that this workflow is
> > > > efficient. The only way I can think to do this would be: (1) grab a
> > > > base domain from Solr in my shell script (simple enough); (2) add
> > > > an entry to regex-urlfilter with the domain I am looking to
> > > > restrict the crawl to, which in the example above would be an entry
> > > > that says to only keep sub-pages of http://www.nutch.org/;
> > > > (3) recreate the Nutch job file (~25 sec.); (4) start the crawl for
> > > > pages associated with the domain and do the indexing.
> > > >
> > > > My issue is with step #3. AFAIK, if I want to restrict a crawl to a
> > > > specific domain, I have to change regex-urlfilter and rebuild the
> > > > job file. This is a pretty significant problem, since adding 25
> > > > seconds every single time I start a new base domain is going to add
> > > > far too much time to my workflow (25 sec x a few million = way too
> > > > much time). Finally, the question: is there a way to add URL
> > > > filters on the fly when I start a crawl and/or restrict a crawl to
> > > > a particular domain on the fly? Or can you think of a decent
> > > > solution to the problem / am I missing something?

--
___
Alejandro Caceres
Hyperion Gray, LLC
Owner/CTO

