> > > How do I see the output of the mapred job? I don't recall seeing
> > > anything like that in the log file.
> >
> > This output is on stdout, and can be viewed in real time via the web gui:
> > 11/09/27 16:54:35 INFO mapred.JobClient: Job complete: job_201109261414_0039
> > 11/09/27 16:54:37 INFO mapred.JobClient: Counters: 27
> > 11/09/27 16:54:37 INFO mapred.JobClient: Job Counters
> > 11/09/27 16:54:37 INFO mapred.JobClient: Launched reduce tasks=9
> > 11/09/27 16:54:37 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=4561078
> > 11/09/27 16:54:37 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
> > 11/09/27 16:54:37 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
> > 11/09/27 16:54:37 INFO mapred.JobClient: Rack-local map tasks=2
> > 11/09/27 16:54:37 INFO mapred.JobClient: Launched map tasks=417
> > 11/09/27 16:54:37 INFO mapred.JobClient: Data-local map tasks=415
> > 11/09/27 16:54:37 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=6166304
> > 11/09/27 16:54:37 INFO mapred.JobClient: File Input Format Counters
> > 11/09/27 16:54:37 INFO mapred.JobClient: Bytes Read=10396521777
> > 11/09/27 16:54:37 INFO mapred.JobClient: File Output Format Counters
> > 11/09/27 16:54:37 INFO mapred.JobClient: Bytes Written=917655979
> > 11/09/27 16:54:37 INFO mapred.JobClient: FileSystemCounters
> > 11/09/27 16:54:37 INFO mapred.JobClient: FILE_BYTES_READ=3278262704
> > 11/09/27 16:54:37 INFO mapred.JobClient: HDFS_BYTES_READ=10396613577
> > 11/09/27 16:54:37 INFO mapred.JobClient: FILE_BYTES_WRITTEN=6539342397
> > 11/09/27 16:54:37 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=917655979
> > 11/09/27 16:54:37 INFO mapred.JobClient: Map-Reduce Framework
> > 11/09/27 16:54:37 INFO mapred.JobClient: Map output materialized bytes=3250364133
> > 11/09/27 16:54:37 INFO mapred.JobClient: Map input records=7494536
> > 11/09/27 16:54:37 INFO mapred.JobClient: Reduce shuffle bytes=3250360919
> > 11/09/27 16:54:37 INFO mapred.JobClient: Spilled Records=18455792
> > 11/09/27 16:54:37 INFO mapred.JobClient: Map output bytes=4421256434
> > 11/09/27 16:54:37 INFO mapred.JobClient: Map input bytes=10396451841
> > 11/09/27 16:54:37 INFO mapred.JobClient: Combine input records=42643906
> > 11/09/27 16:54:37 INFO mapred.JobClient: SPLIT_RAW_BYTES=64218
> > 11/09/27 16:54:37 INFO mapred.JobClient: Reduce input records=6966070
> > 11/09/27 16:54:37 INFO mapred.JobClient: Reduce input groups=3065036
> > 11/09/27 16:54:37 INFO mapred.JobClient: Combine output records=13178184
> > 11/09/27 16:54:37 INFO mapred.JobClient: Reduce output records=3065036
> > 11/09/27 16:54:37 INFO mapred.JobClient: Map output records=36431792
> >
> Web gui? Is that something that's only available in deploy mode, or can
> you access it in local mode?
Perhaps in pseudo-distributed mode.
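If it helps: for the Hadoop line current at the time (0.20/1.x), the JobTracker web UI in pseudo-distributed mode is normally at http://localhost:50030/. The bind address is controlled by this property in mapred-site.xml (default shown from memory, so double-check against your release's mapred-default.xml):

```xml
<property>
  <name>mapred.job.tracker.http.address</name>
  <value>0.0.0.0:50030</value>
</property>
```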
>
> > > I see. I've been using 100 threads with 10 per host and it seems to
> > > saturate the current connection pretty well, and that's just from one
> > > machine. Which is why I was wondering about your splitting of
> > > segments. What machine limitations have you run into?
> >
> > 10 threads per host? That's a lot; it doesn't seem polite to me. Segment
> > size doesn't affect bandwidth.
> >
> When I run with fewer threads per host I get hung up in the fetching. Some
> sites have more URLs than others, the nice value kicks in, and I end up
> with idle threads. I was looking at the docs, but it seems the max-URLs-
> per-host setting is deprecated, so I'm not sure what settings to use to
> get them to distribute across the fetcher threads more evenly.
It's replaced by a new switch; check the config. This is current in 1.4:

<property>
  <name>generate.count.mode</name>
  <value>domain</value>
</property>
<property>
  <name>generate.max.count</name>
  <value>10</value>
</property>
1.4 also gained a feature to kill fetcher threads when they waste too much
time on a single host.
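To illustrate the effect of the two settings above (this is just a toy sketch, not actual Nutch code): with generate.count.mode=domain and generate.max.count=10, each generated segment keeps at most 10 URLs per domain, so no single site can monopolize the fetch queues.

```python
from urllib.parse import urlparse

def cap_per_domain(urls, max_count=10):
    """Toy sketch of generate.max.count with generate.count.mode=domain:
    keep at most max_count URLs per domain when building a segment.
    (Nutch's real generator also sorts by score, partitions, etc.)"""
    taken = {}    # domain -> number of URLs already selected
    segment = []
    for url in urls:
        domain = urlparse(url).netloc
        if taken.get(domain, 0) < max_count:
            taken[domain] = taken.get(domain, 0) + 1
            segment.append(url)
    return segment
```

With a cap like this, a host with thousands of pending URLs contributes only 10 to the segment, which keeps the fetcher threads spread across hosts instead of idling behind one polite per-host queue.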
>
> > > BTW, do you know what the timeline is to have the documentation
> > > updated for 1.3?
> >
> > It is being updated as we speak. Lewis did quite a good job on the wiki
> > docs for 1.3.
>
> Okay. Sounds good.