This is helpful -- can someone also explain whether there is a mechanism to extract the full text of pages from where they are stored by the MapReduce jobs?
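For what it's worth, the stock SegmentReader tool (`bin/nutch readseg`) can dump what a fetch/parse round stored in a segment, including the extracted plain text. A minimal sketch, assuming the same `crawl/segments` layout as the script quoted below (the output directory names are just examples):

```shell
# Pick the most recent segment, same convention as in the quoted mail.
s1=`ls -d crawl/segments/2* | tail -1`

# Dump everything Nutch stored for that segment (raw content, fetch status,
# parse data, parse text) as plain text under dump_all/.
bin/nutch readseg -dump $s1 dump_all

# To get only the extracted full text of each page, suppress the other
# parts so that just parse_text is written out.
bin/nutch readseg -dump $s1 dump_text \
  -nocontent -nofetch -nogenerate -noparse -noparsedata
```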
On Tue, Sep 27, 2011 at 11:24, Bai Shen <[email protected]> wrote:
> I'm trying to understand exactly what the Nutch workflow is and I have a
> few questions. From the tutorial:
>
> bin/nutch inject crawldb urls
>
> This takes a list of urls and creates a database of urls for Nutch to
> fetch.
>
> bin/nutch generate crawldb segments
>
> This generates a segment which contains all of the urls that need to be
> fetched. From what I understand, you can generate multiple segments, but
> I'm not sure I see the value in that, as the fetching is primarily limited
> by your connection, not any one machine.
>
> s1=`ls -d crawl/segments/2* | tail -1`
> echo $s1
> bin/nutch fetch $s1
>
> This takes the segment and fetches all of the content and stores it in
> Hadoop for the MapReduce job. I'm not quite sure how that works, as when
> I ran the fetch, my connection showed 12GB of data downloaded, yet the
> Hadoop directory was using over 40GB of space. Is this normal?
>
> bin/nutch parse $s1
>
> This parses the fetched data using Hadoop in order to extract more urls
> to fetch. It doesn't do any actual indexing, however. Is this correct?
>
> bin/nutch updatedb crawldb $s1
>
> Now the parsed urls are added back to the initial database of urls.
>
> bin/nutch invertlinks linkdb -dir segments
>
> I'm not exactly sure what this does. The tutorial says "Before indexing we
> first invert all of the links, so that we may index incoming anchor text
> with the pages." Does that mean that if there's a link such as
> <A HREF="url">click here for more info</A>, it adds the "click here for
> more info" text to the database for indexing in addition to the actual
> linked content?
>
> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb \
>   crawl/linkdb crawl/segments/*
>
> This is where the actual indexing takes place, correct? Or is Nutch just
> posting the various documents to Solr and leaving Solr to do the actual
> indexing? Is this the only step that uses the schema.xml file?
>
> Thanks.
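The quoted steps can be strung together into a single crawl round. A minimal sketch of one round as a script; note I've normalized everything under a `crawl/` directory (the quoted mail mixes `crawldb` and `crawl/crawldb`), so adjust the paths to your own layout:

```shell
#!/bin/sh
# One Nutch crawl round, following the tutorial steps quoted above.

bin/nutch inject crawl/crawldb urls              # seed the crawldb with the url list
bin/nutch generate crawl/crawldb crawl/segments  # create a new fetch list (segment)

s1=`ls -d crawl/segments/2* | tail -1`           # newest segment directory
bin/nutch fetch $s1                              # download the pages
bin/nutch parse $s1                              # extract text and outlinks
bin/nutch updatedb crawl/crawldb $s1             # feed discovered urls back into the crawldb

bin/nutch invertlinks crawl/linkdb -dir crawl/segments  # build incoming-anchor-text map
bin/nutch solrindex http://127.0.0.1:8983/solr/ \
  crawl/crawldb crawl/linkdb crawl/segments/*           # push documents to Solr
```

Repeating the generate/fetch/parse/updatedb steps deepens the crawl by one link hop per round.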

