> Not sure why gmail keeps sending my replies to people instead of back to
> the list.  Have to keep a better eye out for it.
>
> ---------- Forwarded message ----------
> From: Bai Shen <[email protected]>
> Date: Tue, Sep 27, 2011 at 1:38 PM
> Subject: Re: Understanding Nutch workflow
> To: [email protected]
>
> > > I didn't mean that the segment would contain every unfetched url that
> > > was in the db, if that's what you mean.
> > >
> > > I don't think I've hit more than 5000 urls in my current segments.  At
> > > least that's the highest I've seen the queue.  Is there a way to
> > > determine how many urls are in a segment?
> >
> > Sure, segment X contains the same number of URLs as there are reduce
> > output records in the partitioner job for X. You can see that statistic
> > in the output of every mapred job.
>
> How do I see the output of the mapred job?  I don't recall seeing anything
> like that in the log file.
This output goes to stdout and can also be viewed in realtime using the web
GUI:

11/09/27 16:54:35 INFO mapred.JobClient: Job complete: job_201109261414_0039
11/09/27 16:54:37 INFO mapred.JobClient: Counters: 27
11/09/27 16:54:37 INFO mapred.JobClient:   Job Counters
11/09/27 16:54:37 INFO mapred.JobClient:     Launched reduce tasks=9
11/09/27 16:54:37 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4561078
11/09/27 16:54:37 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
11/09/27 16:54:37 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
11/09/27 16:54:37 INFO mapred.JobClient:     Rack-local map tasks=2
11/09/27 16:54:37 INFO mapred.JobClient:     Launched map tasks=417
11/09/27 16:54:37 INFO mapred.JobClient:     Data-local map tasks=415
11/09/27 16:54:37 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=6166304
11/09/27 16:54:37 INFO mapred.JobClient:   File Input Format Counters
11/09/27 16:54:37 INFO mapred.JobClient:     Bytes Read=10396521777
11/09/27 16:54:37 INFO mapred.JobClient:   File Output Format Counters
11/09/27 16:54:37 INFO mapred.JobClient:     Bytes Written=917655979
11/09/27 16:54:37 INFO mapred.JobClient:   FileSystemCounters
11/09/27 16:54:37 INFO mapred.JobClient:     FILE_BYTES_READ=3278262704
11/09/27 16:54:37 INFO mapred.JobClient:     HDFS_BYTES_READ=10396613577
11/09/27 16:54:37 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=6539342397
11/09/27 16:54:37 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=917655979
11/09/27 16:54:37 INFO mapred.JobClient:   Map-Reduce Framework
11/09/27 16:54:37 INFO mapred.JobClient:     Map output materialized bytes=3250364133
11/09/27 16:54:37 INFO mapred.JobClient:     Map input records=7494536
11/09/27 16:54:37 INFO mapred.JobClient:     Reduce shuffle bytes=3250360919
11/09/27 16:54:37 INFO mapred.JobClient:     Spilled Records=18455792
11/09/27 16:54:37 INFO mapred.JobClient:     Map output bytes=4421256434
11/09/27 16:54:37 INFO mapred.JobClient:     Map input bytes=10396451841
11/09/27 16:54:37 INFO mapred.JobClient:     Combine input records=42643906
11/09/27 16:54:37 INFO mapred.JobClient:     SPLIT_RAW_BYTES=64218
11/09/27 16:54:37 INFO mapred.JobClient:     Reduce input records=6966070
11/09/27 16:54:37 INFO mapred.JobClient:     Reduce input groups=3065036
11/09/27 16:54:37 INFO mapred.JobClient:     Combine output records=13178184
11/09/27 16:54:37 INFO mapred.JobClient:     Reduce output records=3065036
11/09/27 16:54:37 INFO mapred.JobClient:     Map output records=36431792

> > > What kind of connection do you use to fetch 500k urls?  What are your
> > > fetcher threads set to?
> >
> > We usually don't exceed 30mbit/second in short bursts per node with 128
> > threads. This only happens for many small fetch queues, e.g. a few URLs
> > (e.g. 2) for 250.000 domains. Then it's fast.
>
> I see.  I've been using 100 threads with 10 per host and it seems to
> saturate the current connection pretty well, and that's just from one
> machine.  Which is why I was wondering about your splitting of segments.
> What machine limitations have you run into?

10 threads per host? That's a lot; it doesn't seem polite to me. Segment
size doesn't affect bandwidth.

> > > So the downloaded data gets stored in the segment directories, not the
> > > mapreduce temp files?  Why does mapreduce get so large then?
> >
> > It is stored in the tmp directory during the job and written to the
> > segment in the reducer.
>
> The mapred jobs require a factor of four for overhead?  The fetch
> downloaded 12GB of data, but the mapred dir was around 50GB (I think).
> Just trying to understand what it's doing to use all that space.
>
> > > And any parse filter plugins are only used to search for urls, right?
> > > So if I'm worried about additional indexing, this is not the place to
> > > be looking, correct?
> >
> > No no, a parse filter can, for instance, extract information from the
> > parsed DOM such as headings, meta elements or whatever and output it as
> > a field.
>
> I see.
> I don't think I'll need that, but we'll see once I get the rest
> working.
>
> > > > What do you mean?  What is the current schema if not schema.xml?  My
> > > > understanding is that the schema.xml file in the Nutch conf dir
> > > > should be the same as the schema.xml file in Solr.
> > >
> > > The provided schema file is only an example; Nutch does not use it but
> > > Solr does. You must copy the schema from Nutch to Solr, that's all. We
> > > ship it for completeness. Later we might ship other Solr files for
> > > better integration on the Solr side, such as Velocity template files.
>
> Ah, okay.  So I just need to add the Nutch fields to my current Solr
> schema.

Yes.

> > > If I want to modify and add additional indexing, how would I set that
> > > up?  I swapped out the schema.xml file, but wasn't able to get the
> > > solrindex command to work.  It kicked back the error that I was
> > > missing the site field.
> >
> > If you want to add new fields you must create or modify indexing
> > plugins such as index-basic, index-more or index-anchor.
>
> Okay.  I'll take a look at the plugin documentation and see if I can
> figure that out.
>
> BTW, do you know what the timeline is to have the documentation updated
> for 1.3?

It is being updated as we speak. Lewis did quite a good job on the wiki docs
for 1.3.
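Since the thread points at indexing plugins as the way to get new fields
into Solr, here is a rough, self-contained sketch of what such a filter
does. Note this is not the real org.apache.nutch.indexer.IndexingFilter
interface (the real filter() also receives the Parse, CrawlDatum and
Inlinks for the page); a plain Map stands in for NutchDocument, and the
"heading" field name is made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for a Nutch indexing filter: copy a value that a
// parse filter extracted into the document as an extra index field.
public class HeadingFilterSketch {

    // Adds the heading as a "heading" field, skipping blank values,
    // much like index-basic copies title and content into the doc.
    static Map<String, String> filter(Map<String, String> doc, String heading) {
        if (heading != null && !heading.trim().isEmpty()) {
            doc.put("heading", heading.trim());
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<String, String>();
        doc.put("url", "http://example.com/");
        filter(doc, "Understanding Nutch workflow");
        System.out.println(doc.get("heading")); // prints the heading
    }
}
```

In a real plugin this logic would live in a class registered through the
plugin's plugin.xml, and the new field must of course also be declared in
the Solr schema.xml, or solrindex will reject the documents.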

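As an aside: rather than eyeballing the JobClient output above for the
"Reduce output records" counter (which equals the number of URLs in the
segment), the line can be picked out programmatically. A minimal sketch;
the class and method names are made up, and it only assumes the log format
shown above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReduceOutputRecords {
    // Matches the counter line printed by JobClient, e.g.
    // "11/09/27 16:54:37 INFO mapred.JobClient:     Reduce output records=3065036"
    private static final Pattern COUNTER =
        Pattern.compile("Reduce output records=(\\d+)");

    // Returns the counter value found in one log line, or -1 if absent.
    static long parse(String line) {
        Matcher m = COUNTER.matcher(line);
        return m.find() ? Long.parseLong(m.group(1)) : -1L;
    }

    public static void main(String[] args) {
        String line = "11/09/27 16:54:37 INFO mapred.JobClient:"
            + "     Reduce output records=3065036";
        System.out.println(parse(line)); // prints 3065036
    }
}
```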
