> I'm trying to understand exactly what the Nutch workflow is and I have a
> few questions.  From the tutorial:
> 
> bin/nutch inject crawldb urls
> 
> This takes a list of urls and creates a database of urls for nutch to
> fetch.

Yes, but it also merges the injected URLs into the crawldb if one already exists.
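
For example, a minimal seed setup might look like this (the directory and
file names here are only an illustration, not required by Nutch):

  mkdir urls
  echo "http://nutch.apache.org/" > urls/seed.txt
  bin/nutch inject crawldb urls

Running inject again later with a different seed list adds the new URLs to
the existing crawldb rather than replacing it.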

> 
> bin/nutch generate crawldb segments
> 
> This generates a segment which contains all of the urls that need to be
> fetched.  From what I understand, you can generate multiple segments, but
> I'm not sure I see the value in that as the fetching is primarily limited
> by your connection, not any one machine.

It does not contain all URLs due for fetch. The generator is limited by many 
options; check nutch-default.xml for the settings and their descriptions. 
Generating multiple segments is useful. Due to hardware limitations we prefer 
segments of no more than 500,000 URLs each, so we create many small segments; 
they are easier to handle.
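
For example, the segment size can be capped on the command line (the -topN
value below is just an illustration):

  bin/nutch generate crawldb segments -topN 500000

Properties such as generate.max.count in nutch-default.xml can further limit
how many URLs per host or domain end up in a single segment.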

> 
> s1=`ls -d crawl/segments/2* | tail -1`
> echo $s1
> bin/nutch fetch $s1
> 
> This takes the segment and fetchs all of the content and stores it in
> hadoop for the mapreduce job.  I'm not quite sure how that works, as when
> I ran the fetch, my connection showed 12GB of data downloaded, yet the
> hadoop directory was using over 40GB of space.  Is this normal?

The content directory in the segments contains the data actually downloaded, 
with some overhead. The rest is generated by the various jobs.
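
One quick way to see where the space goes is to check the sizes of the
per-segment subdirectories, e.g. with du -sh on a local filesystem or
hadoop fs -du on HDFS:

  du -sh crawl/segments/2*/*

The content subdirectory holds the downloaded data; crawl_generate,
crawl_fetch and (after parsing) crawl_parse, parse_data and parse_text are
produced by the jobs themselves.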

> 
> bin/nutch parse $s1
> 
> This parses the fetched data using hadoop in order to extract more urls to
> fetch.  It doesn't do any actual indexing, however.  Is this correct?

Correct. It also executes optional parse filter plugins and normalizes and 
filters all extracted URLs.
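
The filtering is driven by the URL filter plugins, typically configured in
conf/regex-urlfilter.txt; the rules below are only an illustration:

  # skip images and other binary resources
  -\.(gif|jpg|png|css|js|zip|gz)$
  # accept everything else
  +.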

> 
> bin/nutch updatedb crawldb $s1
> 
> Now the parsed urls are added back to the initial database of urls.

Correct.
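
You can verify that the newly discovered URLs made it in by looking at the
crawldb statistics:

  bin/nutch readdb crawldb -stats

which reports the total number of URLs and a breakdown by status
(db_unfetched, db_fetched, and so on).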

> 
> bin/nutch invertlinks linkdb -dir segments
> 
> I'm not exactly sure what this does.  The tutorial says "Before indexing we
> first invert all of the links, so that we may index incoming anchor text
> with the pages."  Does that mean that if there's a link such as < A
> HREF="url" > click here for more info < /A > that it adds the "click here
> for more info" to the database for indexing in addition to the actual link
> content?

It's not required anymore in current Nutch 1.4-dev. It builds a data structure 
of all URLs with all their inlinks and anchors. You can use this to do better 
relevance scoring in your search engine.
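
You can inspect the result with the linkdb reader (the output directory name
below is just an example):

  bin/nutch readlinkdb linkdb -dump linkdb_dump

Each record lists a URL together with the URLs that link to it and the anchor
text of those links, e.g. "click here for more info" from your example.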

> 
> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
> crawl/linkdb crawl/segments/*
> 
> This is where the actual indexing takes place, correct?  Or is Nutch just
> posting the various documents to Solr and leaving Solr to do the actual
> indexing?  Is this the only step that uses the schema.xml file?

Correct, but this step does not use the schema.xml itself; Nutch posts the 
documents and Solr does the actual indexing against its own schema. The fields 
sent are dictated by your indexing filter plugins. Check the schema.xml 
shipped with Nutch: it lists the fields and the plugins used for those fields.
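
For illustration, the shipped schema.xml typically contains field definitions
along these lines (exact types and attributes vary by version):

  <field name="title" type="text" stored="true" indexed="true"/>
  <field name="anchor" type="text" stored="false" indexed="true" multiValued="true"/>

where, for example, the anchor field is filled by the index-anchor plugin and
title/content by index-basic.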

> 
> 
> Thanks.
