You can use the segment reader to read downloaded content.
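A minimal sketch of how that looks on the command line, assuming the default `crawl/segments` layout from the tutorial steps quoted below (the output directory name `segdump` is arbitrary):

```shell
# Pick the most recent segment, using the same convention as the tutorial.
s1=`ls -d crawl/segments/2* | tail -1`

# Dump the segment's stored records (fetched content, parse text, parse data)
# as plain text into the segdump directory for inspection.
bin/nutch readseg -dump $s1 segdump

# The dump ends up in a file named "dump" inside that directory.
less segdump/dump
```

readseg also accepts flags to limit what gets dumped (e.g. -nocontent, -noparsetext), which helps when you only want one part of the segment.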

> this is helpful -- can someone also explain whether there is a mechanism to
> extract the full text of pages from where they are stored in mapreduce?
> 
> On Tue, Sep 27, 2011 at 11:24, Bai Shen <[email protected]> wrote:
> > I'm trying to understand exactly what the Nutch workflow is and I have a
> > few questions.  From the tutorial:
> > 
> > bin/nutch inject crawldb urls
> > 
> > This takes a list of urls and creates a database of urls for nutch to
> > fetch.
> > 
> > bin/nutch generate crawldb segments
> > 
> > This generates a segment which contains all of the urls that need to be
> > fetched.  From what I understand, you can generate multiple segments, but
> > I'm not sure I see the value in that as the fetching is primarily limited
> > by your connection, not any one machine.
> > 
> > s1=`ls -d crawl/segments/2* | tail -1`
> > echo $s1
> > bin/nutch fetch $s1
> > 
> > This takes the segment, fetches all of the content, and stores it in
> > Hadoop for the mapreduce job.  I'm not quite sure how that works, as
> > when I ran the fetch, my connection showed 12GB of data downloaded, yet
> > the Hadoop directory was using over 40GB of space.  Is this normal?
> > 
> > bin/nutch parse $s1
> > 
> > This parses the fetched data using hadoop in order to extract more urls
> > to fetch.  It doesn't do any actual indexing, however.  Is this correct?
> > 
> > bin/nutch updatedb crawldb $s1
> > 
> > Now the parsed urls are added back to the initial database of urls.
> > 
> > bin/nutch invertlinks linkdb -dir segments
> > 
> > I'm not exactly sure what this does.  The tutorial says "Before indexing
> > we first invert all of the links, so that we may index incoming anchor
> > text with the pages."  Does that mean that if there's a link such as
> > <a href="url">click here for more info</a>, it adds the "click here for
> > more info" text to the database for indexing in addition to the actual
> > link content?
> > 
> > bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
> > crawl/linkdb crawl/segments/*
> > 
> > This is where the actual indexing takes place, correct?  Or is Nutch just
> > posting the various documents to Solr and leaving Solr to do the actual
> > indexing?  Is this the only step that uses the schema.xml file?
> > 
> > 
> > Thanks.
