> On Tue, Sep 27, 2011 at 11:52 AM, Markus Jelsma
> <[email protected]> wrote:
> > > I'm trying to understand exactly what the Nutch workflow is and I have
> > > a few questions. From the tutorial:
> > >
> > > bin/nutch inject crawldb urls
> > >
> > > This takes a list of urls and creates a database of urls for nutch to
> > > fetch.
> >
> > Yes, but it also merges if crawldb already exists.
>
> Ah, right.
>
> > > bin/nutch generate crawldb segments
> > >
> > > This generates a segment which contains all of the urls that need to
> > > be fetched. From what I understand, you can generate multiple
> > > segments, but I'm not sure I see the value in that as the fetching is
> > > primarily limited by your connection, not any one machine.
> >
> > It does not contain all URL's due for fetch. The generator is limited
> > by many options. Check nutch-default for settings and descriptions.
> > Generating multiple segments is useful. We prefer, due to hardware
> > limitations, segments of no more than 500.000 URL's each. So we create
> > many small segments, it's easier to handle.
>
> I didn't mean that the segment would contain every unfetched url that
> was in the db, if that's what you mean.
>
> I don't think I've hit more than 5000 urls in my current segments. At
> least that's the highest I've seen the queue. Is there a way to
> determine how many urls are in a segment?
Sure, segment X contains the same number of URL's as there are reduce
output records in the partitioner job for X. You can see that statistic
in the output of every mapred job.

> What kind of connection do you use to fetch 500k urls? What are your
> fetcher threads set to?

We usually don't exceed 30mbit/second in short bursts per node with 128
threads. This only happens for many small fetch queues, e.g. a few URL's
(e.g. 2) for 250.000 domains. Then it's fast.

> > > s1=`ls -d crawl/segments/2* | tail -1`
> > > echo $s1
> > > bin/nutch fetch $s1
> > >
> > > This takes the segment and fetches all of the content and stores it
> > > in hadoop for the mapreduce job. I'm not quite sure how that works,
> > > as when I ran the fetch, my connection showed 12GB of data
> > > downloaded, yet the hadoop directory was using over 40GB of space.
> > > Is this normal?
> >
> > The content dir in the segments contains actually downloaded data with
> > some overhead. The rest is generated by the various jobs.
>
> So the downloaded data gets stored in the segment directories, not the
> mapreduce temp files? Why does mapreduce get so large then?

It is stored in the tmp during the job and written to the segment in the
reducer.

> > > bin/nutch parse $s1
> > >
> > > This parses the fetched data using hadoop in order to extract more
> > > urls to fetch. It doesn't do any actual indexing, however. Is this
> > > correct?
> >
> > Correct. It also executes optional parse filter plugins and normalizes
> > and filters all extracted URL's.
>
> And any parse filter plugins are only used to search for urls, right?
> So if I'm worried about additional indexing, this is not the place to
> be looking, correct?

No, a parse filter can, for instance, extract information from the parsed
DOM such as headings, meta elements or whatever and output it as a field.

> > > bin/nutch invertlinks linkdb -dir segments
> > >
> > > I'm not exactly sure what this does.
> > > The tutorial says "Before indexing we first invert all of the
> > > links, so that we may index incoming anchor text with the pages."
> > > Does that mean that if there's a link such as
> > > < A HREF="url" > click here for more info < /A >
> > > that it adds the "click here for more info" to the database for
> > > indexing in addition to the actual link content?
> >
> > It's not required anymore in current Nutch 1.4-dev. It builds a data
> > structure of all URL's with all their inlinks and anchors. You can use
> > this to do better scoring of relevance in your search engine.
>
> Gotcha. I'm still on 1.3, however, so I'll need to keep it in the
> process.

Sure. :)

> > > bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
> > > crawl/linkdb crawl/segments/*
> > >
> > > This is where the actual indexing takes place, correct? Or is Nutch
> > > just posting the various documents to Solr and leaving Solr to do
> > > the actual indexing? Is this the only step that uses the schema.xml
> > > file?
> >
> > Correct, but it does not use the schema.xml. It will index all fields
> > as dictated by your index filter plugins. Check the current schema, it
> > lists the fields and the plugins used for those fields.
>
> What do you mean? What is the current schema if not schema.xml? My
> understanding is that the schema.xml file in the Nutch conf dir should
> be the same as the schema.xml file in Solr.

The provided schema file is only an example, Nutch does not use it but
Solr does. You must copy the schema from Nutch to Solr, that's all. We
ship it for completeness. Later we might ship other Solr files for better
integration on the Solr side such as Velocity template files.

> If I want to modify and add additional indexing, how would I set that
> up? I swapped out the schema.xml file, but wasn't able to get the
> solrindex command to work. It kicked back the error that I was missing
> the site field.
If you want to add new fields, you must create or modify indexing plugins
such as index-basic, index-more, or index-anchor.
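For example, the rough sequence for wiring in an extra index filter looks
like this (the plugin.includes value should be checked against your own
nutch-default.xml, and $NUTCH_HOME/$SOLR_HOME are placeholder paths):

```shell
# 1. Enable the extra indexing filter (index-more here) by overriding
#    plugin.includes in conf/nutch-site.xml, e.g. a value along the
#    lines of the 1.3 default with index-more added:
#    protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)

# 2. Copy the matching schema into Solr and restart Solr so the new
#    fields exist on the Solr side:
cp $NUTCH_HOME/conf/schema.xml $SOLR_HOME/example/solr/conf/schema.xml

# 3. Re-run the indexing step:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb \
  crawl/linkdb crawl/segments/*
```

The "missing the site field" error usually means the schema Solr is
running with doesn't match the fields the index filters emit, which step
2 should fix.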

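On segment sizing, for the archives: the generator takes a -topN cap, so
keeping segments small (the 500.000 figure mentioned above is just our
preference) is a matter of:

```shell
# Cap each generated segment at 500k URLs; repeat the
# generate/fetch/parse cycle once per segment.
bin/nutch generate crawl/crawldb crawl/segments -topN 500000
```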

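And to answer the segment-counting question more directly: besides
reading the reduce output record counter, you can inspect segments after
the fact with the readseg tool (available in Nutch 1.x; the glob reuses
the s1 idiom from earlier in the thread):

```shell
# List per-segment statistics (generated/fetched/parsed counts) for all
# segments under crawl/segments:
bin/nutch readseg -list -dir crawl/segments

# Or for a single segment:
s1=`ls -d crawl/segments/2* | tail -1`
bin/nutch readseg -list $s1
```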