Re: Fwd: Understanding Nutch workflow

Bai Shen Tue, 27 Sep 2011 11:31:40 -0700

>

> > How do I see the output of the mapred job?  I don't recall seeing anything
> > like that in the log file.
>
> This output is on stdout; it can also be viewed in real time using the web GUI:
> 11/09/27 16:54:35 INFO mapred.JobClient: Job complete: job_201109261414_0039
> 11/09/27 16:54:37 INFO mapred.JobClient: Counters: 27
> 11/09/27 16:54:37 INFO mapred.JobClient:   Job Counters
> 11/09/27 16:54:37 INFO mapred.JobClient:     Launched reduce tasks=9
> 11/09/27 16:54:37 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4561078
> 11/09/27 16:54:37 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
> 11/09/27 16:54:37 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
> 11/09/27 16:54:37 INFO mapred.JobClient:     Rack-local map tasks=2
> 11/09/27 16:54:37 INFO mapred.JobClient:     Launched map tasks=417
> 11/09/27 16:54:37 INFO mapred.JobClient:     Data-local map tasks=415
> 11/09/27 16:54:37 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=6166304
> 11/09/27 16:54:37 INFO mapred.JobClient:   File Input Format Counters
> 11/09/27 16:54:37 INFO mapred.JobClient:     Bytes Read=10396521777
> 11/09/27 16:54:37 INFO mapred.JobClient:   File Output Format Counters
> 11/09/27 16:54:37 INFO mapred.JobClient:     Bytes Written=917655979
> 11/09/27 16:54:37 INFO mapred.JobClient:   FileSystemCounters
> 11/09/27 16:54:37 INFO mapred.JobClient:     FILE_BYTES_READ=3278262704
> 11/09/27 16:54:37 INFO mapred.JobClient:     HDFS_BYTES_READ=10396613577
> 11/09/27 16:54:37 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=6539342397
> 11/09/27 16:54:37 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=917655979
> 11/09/27 16:54:37 INFO mapred.JobClient:   Map-Reduce Framework
> 11/09/27 16:54:37 INFO mapred.JobClient:     Map output materialized bytes=3250364133
> 11/09/27 16:54:37 INFO mapred.JobClient:     Map input records=7494536
> 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce shuffle bytes=3250360919
> 11/09/27 16:54:37 INFO mapred.JobClient:     Spilled Records=18455792
> 11/09/27 16:54:37 INFO mapred.JobClient:     Map output bytes=4421256434
> 11/09/27 16:54:37 INFO mapred.JobClient:     Map input bytes=10396451841
> 11/09/27 16:54:37 INFO mapred.JobClient:     Combine input records=42643906
> 11/09/27 16:54:37 INFO mapred.JobClient:     SPLIT_RAW_BYTES=64218
> 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce input records=6966070
> 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce input groups=3065036
> 11/09/27 16:54:37 INFO mapred.JobClient:     Combine output records=13178184
> 11/09/27 16:54:37 INFO mapred.JobClient:     Reduce output records=3065036
> 11/09/27 16:54:37 INFO mapred.JobClient:     Map output records=36431792
>

Web GUI?  Is that something that's only available in deploy mode, or can
you access it in local mode?

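For anyone who only has the stdout capture and not the web GUI, the counter
lines quoted above are regular enough to parse. The following is a minimal
sketch, not part of Nutch or Hadoop: the regular expression, the function name
parse_counters, and the two ratios are assumptions made for illustration, and
it expects each counter on a single unwrapped line.

```python
import re

# Matches JobClient counter lines such as:
#   11/09/27 16:54:37 INFO mapred.JobClient:     Launched map tasks=417
# The pattern is an assumption based on the log excerpt quoted in this thread.
COUNTER_RE = re.compile(
    r"INFO mapred\.JobClient:\s+(?P<name>[^=]+)=(?P<value>\d+)\s*$"
)


def parse_counters(log_text):
    """Return {counter name: int value} for every 'name=value' JobClient line."""
    counters = {}
    for line in log_text.splitlines():
        match = COUNTER_RE.search(line)
        if match:
            counters[match.group("name").strip()] = int(match.group("value"))
    return counters


# Four lines copied from the quoted job output (wrapped lines rejoined).
sample = """\
11/09/27 16:54:37 INFO mapred.JobClient:     Launched map tasks=417
11/09/27 16:54:37 INFO mapred.JobClient:     Data-local map tasks=415
11/09/27 16:54:37 INFO mapred.JobClient:     Combine input records=42643906
11/09/27 16:54:37 INFO mapred.JobClient:     Combine output records=13178184
"""

counters = parse_counters(sample)

# Share of map tasks that read their input from a local disk.
data_local_share = counters["Data-local map tasks"] / counters["Launched map tasks"]
# How much the combiner shrank the map output (smaller means more reduction).
combine_ratio = counters["Combine output records"] / counters["Combine input records"]

print(f"data-local maps: {data_local_share:.1%}")
print(f"combine output/input: {combine_ratio:.2f}")
```

Header lines without an "=" (for example "Counters: 27" or "Job Counters") are
skipped by the pattern, so a whole captured log can be passed in as-is.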


> > I see.  I've been using 100 threads with 10 per host and it seems to
> > saturate the current connection pretty well, and that's just from one
> > machine.  Which is why I was wondering about your splitting of segments.
> > What machine limitations have you run into?
>
> 10 threads per host? That's a lot, doesn't seem polite to me. Segment size
> doesn't affect bandwidth.
>

When I run with fewer threads per host, the fetch gets hung up.  Some
sites have more URLs than others, so the nice value kicks in and I end up
with idle threads.  I was looking at the docs, but it seems the max urls per
host setting is deprecated, so I'm not sure what settings to use in order to
get the URLs to distribute across the fetcher threads more evenly.
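The idle-thread effect described above can be illustrated with a deliberately
simplified model: if each host is fetched with a fixed delay between requests,
the host with the most URLs sets the minimum fetch time, no matter how many
total threads are available. The sketch below is not Nutch code; the function
names, host names, URL counts, and the 5 second delay are all hypothetical. In
Nutch 1.x the per-host cap is, as far as I know, set with generate.max.count
and generate.count.mode, but that should be checked against the
nutch-default.xml of the version in use.

```python
def min_fetch_seconds(urls_per_host, delay_s, threads_per_host):
    """Lower bound on fetch time: the busiest host dictates the finish time.

    Simplified model: each host serves `threads_per_host` requests per
    `delay_s` seconds, and download time itself is ignored.
    """
    return max(urls_per_host.values()) * delay_s / threads_per_host


def cap_per_host(urls_per_host, max_count):
    """Keep at most `max_count` URLs per host, as a per-host generate cap would."""
    return {host: min(count, max_count) for host, count in urls_per_host.items()}


# Hypothetical segment with one host that dominates the URL list.
segment = {"big.example": 5000, "mid.example": 300, "small.example": 20}

uncapped = min_fetch_seconds(segment, delay_s=5.0, threads_per_host=1)
capped = min_fetch_seconds(cap_per_host(segment, 500), delay_s=5.0, threads_per_host=1)

print(f"uncapped lower bound: {uncapped:.0f} s")
print(f"capped at 500 per host: {capped:.0f} s")
```

Under this model, capping URLs per host at generate time shortens the tail
where most threads sit idle waiting on one large host, without raising the
number of threads per host.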

> > BTW, do you know what the timeline is to have the documentation updated
> > for 1.3?
>
> It is being updated as we speak. Lewis did quite a good job for the wiki docs on 1.3.
>

Okay. Sounds good.
