Hello - I'm working with Nutch 2.3.1, using HBase as the backing store for the
webpage table. I have all the phases (inject, generate, fetch, parse, and
updatedb) working fine. Nutch is a crawling beast!
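For context, I'm running the standard Nutch 2.x cycle by hand, roughly like
this (the -topN value is just a placeholder for my setting, and I use -all
rather than an explicit batch id - treat this as a sketch of my workflow):

```shell
# One crawl cycle in Nutch 2.x against the HBase-backed webpage table
bin/nutch inject urls/               # seed the webpage table (first run only)
bin/nutch generate -topN 10000000    # mark a batch of URLs to fetch
bin/nutch fetch -all                 # fetch the generated batch
bin/nutch parse -all                 # parse the fetched content
bin/nutch updatedb -all              # fold parse results back into the table
```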
On our cluster, the generate phase uses around 60 mappers and 128 reducers,
but the fetch phase always uses just 2 reducers. In a recent test, the
fetch phase used 60 mappers and 2 reducers.
My current configuration sets:
generate.max.count=250
fetcher.threads.fetch=256
fetcher.server.min.delay=1
fetcher.threads.per.queue=5
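For reference, these are set in conf/nutch-site.xml; a sketch of the relevant
excerpt (values as above, comments reflect my understanding of each property):

```xml
<!-- conf/nutch-site.xml: generate/fetch settings (excerpt) -->
<property>
  <name>generate.max.count</name>
  <value>250</value>
  <!-- cap on URLs per host/domain (per generate.count.mode) in one batch -->
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>256</value>
  <!-- total fetcher threads per fetch task -->
</property>
<property>
  <name>fetcher.server.min.delay</name>
  <value>1</value>
  <!-- min seconds between requests to one host when threads.per.queue > 1 -->
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>5</value>
  <!-- concurrent threads allowed per host queue -->
</property>
```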
Output from the generate phase:
---------------
16/05/02 18:10:57 INFO mapreduce.Job: Job job_1461352180552_0008 completed
successfully
16/05/02 18:10:57 INFO mapreduce.Job: Counters: 52
File System Counters
FILE: Number of bytes read=534466703
FILE: Number of bytes written=1093638467
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=45663
HDFS: Number of bytes written=0
HDFS: Number of read operations=60
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=60
Launched reduce tasks=128
Data-local map tasks=47
Rack-local map tasks=13
Total time spent by all maps in occupied slots (ms)=41550640
Total time spent by all reduces in occupied slots (ms)=482838568
Total time spent by all map tasks (ms)=10387660
Total time spent by all reduce tasks (ms)=60354821
Total vcore-seconds taken by all map tasks=10387660
Total vcore-seconds taken by all reduce tasks=60354821
Total megabyte-seconds taken by all map tasks=42547855360
Total megabyte-seconds taken by all reduce tasks=494426693632
Map-Reduce Framework
Map input records=22514605
Map output records=21459377
Map output bytes=2302304271
Map output materialized bytes=532738342
Input split bytes=45663
Combine input records=0
Combine output records=0
Reduce input groups=21458913
Reduce shuffle bytes=532738342
Reduce input records=21459377
Reduce output records=7506045
Spilled Records=42918754
Shuffled Maps =7680
Failed Shuffles=0
Merged Map outputs=7680
GC time elapsed (ms)=100632
CPU time spent (ms)=16005360
Physical memory (bytes) snapshot=205304303616
Virtual memory (bytes) snapshot=1838431825920
Total committed heap usage (bytes)=365396230144
Generator
GENERATE_MARK=7506045
MALFORMED_URL=1
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
16/05/02 18:10:57 INFO crawl.GeneratorJob: GeneratorJob: finished at
2016-05-02 18:10:57, time elapsed: 00:28:19
16/05/02 18:10:57 INFO crawl.GeneratorJob: GeneratorJob: generated batch
id: 1462225358-1352746578 containing 7506045 URLs
---------------
Output from the fetch phase:
---------------
16/05/02 19:18:09 INFO mapreduce.Job: Job job_1461352180552_0009 completed
successfully
16/05/02 19:18:09 INFO mapreduce.Job: Counters: 60
File System Counters
FILE: Number of bytes read=483484507
FILE: Number of bytes written=942430295
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=51243
HDFS: Number of bytes written=0
HDFS: Number of read operations=60
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=60
Launched reduce tasks=2
Data-local map tasks=47
Rack-local map tasks=13
Total time spent by all maps in occupied slots (ms)=11392544
Total time spent by all reduces in occupied slots (ms)=61953544
Total time spent by all map tasks (ms)=2848136
Total time spent by all reduce tasks (ms)=7744193
Total vcore-seconds taken by all map tasks=2848136
Total vcore-seconds taken by all reduce tasks=7744193
Total megabyte-seconds taken by all map tasks=11665965056
Total megabyte-seconds taken by all reduce tasks=63440429056
Map-Reduce Framework
Map input records=7503906
Map output records=7503906
Map output bytes=1081616122
Map output materialized bytes=450300347
Input split bytes=51243
Combine input records=0
Combine output records=0
Reduce input groups=131072
Reduce shuffle bytes=450300347
Reduce input records=7503906
Reduce output records=609920
Spilled Records=15007812
Shuffled Maps =120
Failed Shuffles=0
Merged Map outputs=120
GC time elapsed (ms)=132204
CPU time spent (ms)=19741790
Physical memory (bytes) snapshot=107981033472
Virtual memory (bytes) snapshot=336697593856
Total committed heap usage (bytes)=158064443392
FetcherStatus
ACCESS_DENIED=131
EXCEPTION=36676
GONE=295
HitByTimeLimit-QueueFeeder=6883654
HitByTimeLimit-Queues=10291
MOVED=37141
NOTFOUND=10490
NOTMODIFIED=732
SUCCESS=485083
TEMP_MOVED=14589
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
16/05/02 19:18:09 INFO fetcher.FetcherJob: FetcherJob: finished at
2016-05-02 19:18:09, time elapsed: 01:06:23
Any idea what I need to adjust so the fetch phase uses more reducers (and
therefore more of the cluster's nodes)? Are there any other issues in the
output above that I should be aware of? I'm very new to Nutch.
Thank you!
-Joe Obernberger