> > > How do I see the output of the mapred job?  I don't recall seeing
> > > anything like that in the log file.
> > 
> > This output is on stdout, and can be viewed in real time using the web gui:
> > 11/09/27 16:54:35 INFO mapred.JobClient: Job complete: job_201109261414_0039
> > 11/09/27 16:54:37 INFO mapred.JobClient: Counters: 27
> > 11/09/27 16:54:37 INFO mapred.JobClient:   Job Counters
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Launched reduce tasks=9
> > 11/09/27 16:54:37 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4561078
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Rack-local map tasks=2
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Launched map tasks=417
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Data-local map tasks=415
> > 11/09/27 16:54:37 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=6166304
> > 11/09/27 16:54:37 INFO mapred.JobClient:   File Input Format Counters
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Bytes Read=10396521777
> > 11/09/27 16:54:37 INFO mapred.JobClient:   File Output Format Counters
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Bytes Written=917655979
> > 11/09/27 16:54:37 INFO mapred.JobClient:   FileSystemCounters
> > 11/09/27 16:54:37 INFO mapred.JobClient:     FILE_BYTES_READ=3278262704
> > 11/09/27 16:54:37 INFO mapred.JobClient:     HDFS_BYTES_READ=10396613577
> > 11/09/27 16:54:37 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=6539342397
> > 11/09/27 16:54:37 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=917655979
> > 11/09/27 16:54:37 INFO mapred.JobClient:   Map-Reduce Framework
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Map output materialized bytes=3250364133
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Map input records=7494536
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce shuffle bytes=3250360919
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Spilled Records=18455792
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Map output bytes=4421256434
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Map input bytes=10396451841
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Combine input records=42643906
> > 11/09/27 16:54:37 INFO mapred.JobClient:     SPLIT_RAW_BYTES=64218
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce input records=6966070
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce input groups=3065036
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Combine output records=13178184
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce output records=3065036
> > 11/09/27 16:54:37 INFO mapred.JobClient:     Map output records=36431792
> > 
> web gui?  Is that something that's only available in deploy mode, or can
> you access it in local?

Perhaps in pseudo-distributed mode.
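
In pseudo-distributed mode you'd point mapred.job.tracker at a local
JobTracker in mapred-site.xml, something like this (localhost:9001 is just
the usual convention, adjust to your setup):

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>

The JobTracker web gui should then be reachable at http://localhost:50030/
by default.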

> 
> > > I see.  I've been using 100 threads with 10 per host and it seems to
> > > saturate the current connection pretty well, and that's just from one
> > > machine.  Which is why I was wondering about your splitting of
> > > segments. What machine limitations have you run into?
> > 
> > 10 threads per host? That's a lot, doesn't seem polite to me. Segment
> > size doesn't affect bandwidth.
> > 
> When I run with fewer threads per host I get hung up in the fetching.  Some
> sites have more urls than others and the nice value kicks in and I end up
> with idle threads.  I was looking at the docs, but it seems the max urls
> per host setting is deprecated, so I'm not sure what settings to use in
> order to get them to distribute across the fetcher threads more evenly.

It's replaced by a new switch; check the config. This is current in 1.4:

  <property>
    <name>generate.count.mode</name>
    <value>domain</value>
  </property>
  <property>
    <name>generate.max.count</name>
    <value>10</value>
  </property>
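
As far as I know generate.count.mode also accepts host (and ip) if you want
the limit applied per host rather than per domain:

  <property>
    <name>generate.count.mode</name>
    <value>host</value>
  </property>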

1.4 also has a feature to abort the fetcher when it wastes too much time on
single hosts.
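
If I read the changelog right, the switch for that is
fetcher.throughput.threshold.pages (minimum pages per second before the
fetcher gives up; -1 disables it), roughly:

  <property>
    <name>fetcher.throughput.threshold.pages</name>
    <value>-1</value>
  </property>

Double-check nutch-default.xml in your build for the exact name and default.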

> 
> > BTW, do you know what the timeline is to have the documentation updated
> > for 1.3?
> > 
> > It is as we speak. Lewis did quite a good job for the wiki docs on 1.3.
> 
> Okay. Sounds good.
