Thanks Sebastian,
The output with set -x is below. I'm new to Nutch and was not aware that 1.13
requires Hadoop 2.7.2 specifically. While I can see it now in pom.xml, it may be
a good idea to document this on the download page and provide a download link,
since the Hadoop releases page lists 2.7.3 but not 2.7.2. I will try to
install 2.7.2 and retest tomorrow.
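In case it helps anyone else in the meantime: older releases still appear to be
available from the Apache archive, so fetching 2.7.2 should be as simple as
(assuming the usual archive layout):

    wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
    tar -xzf hadoop-2.7.2.tar.gz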
root@crawler001:/data/apache-nutch-1.13/runtime/deploy/bin# ./crawl urls crawl 2
Injecting seed URLs
/data/apache-nutch-1.13/runtime/deploy/bin/nutch inject crawl/crawldb urls
+ cygwin=false
+ case "`uname`" in
++ uname
+ THIS=/data/apache-nutch-1.13/runtime/deploy/bin/nutch
+ '[' -h /data/apache-nutch-1.13/runtime/deploy/bin/nutch ']'
+ '[' 3 = 0 ']'
+ COMMAND=inject
+ shift
++ dirname /data/apache-nutch-1.13/runtime/deploy/bin/nutch
+ THIS_DIR=/data/apache-nutch-1.13/runtime/deploy/bin
++ cd /data/apache-nutch-1.13/runtime/deploy/bin/..
++ pwd
+ NUTCH_HOME=/data/apache-nutch-1.13/runtime/deploy
+ '[' '' '!=' '' ']'
+ '[' /usr/lib/jvm/java-8-oracle/jre/ = '' ']'
+ local=true
+ '[' -f /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job ']'
+ local=false
+ for f in '"$NUTCH_HOME"/*nutch*.job'
+ NUTCH_JOB=/data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job
+ false
+ JAVA=/usr/lib/jvm/java-8-oracle/jre//bin/java
+ JAVA_HEAP_MAX=-Xmx1000m
+ '[' '' '!=' '' ']'
+ CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf
+
CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf:/usr/lib/jvm/java-8-oracle/jre//lib/tools.jar
+ IFS=
+ false
+ false
+ JAVA_LIBRARY_PATH=
+ '[' -d /data/apache-nutch-1.13/runtime/deploy/lib/native ']'
+ '[' false = true -a X '!=' X ']'
+ unset IFS
+ '[' '' = '' ']'
+ NUTCH_LOG_DIR=/data/apache-nutch-1.13/runtime/deploy/logs
+ '[' '' = '' ']'
+ NUTCH_LOGFILE=hadoop.log
+ false
+ NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR")
+ NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Dhadoop.log.file="$NUTCH_LOGFILE")
+ '[' x '!=' x ']'
+ '[' inject = crawl ']'
+ '[' inject = inject ']'
+ CLASS=org.apache.nutch.crawl.Injector
+ EXEC_CALL=(hadoop jar "$NUTCH_JOB")
+ false
++ which hadoop
++ wc -l
+ '[' 1 -eq 0 ']'
+ exec hadoop jar /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job
org.apache.nutch.crawl.Injector crawl/crawldb urls
17/05/02 06:00:24 INFO crawl.Injector: Injector: starting at 2017-05-02 06:00:24
17/05/02 06:00:24 INFO crawl.Injector: Injector: crawlDb: crawl/crawldb
17/05/02 06:00:24 INFO crawl.Injector: Injector: urlDir: urls
17/05/02 06:00:24 INFO crawl.Injector: Injector: Converting injected urls to
crawl db entries.
17/05/02 06:00:25 INFO Configuration.deprecation: session.id is deprecated.
Instead, use dfs.metrics.session-id
17/05/02 06:00:25 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
17/05/02 06:00:26 INFO input.FileInputFormat: Total input files to process : 1
17/05/02 06:00:26 INFO input.FileInputFormat: Total input files to process : 1
17/05/02 06:00:26 INFO mapreduce.JobSubmitter: number of splits:2
17/05/02 06:00:26 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_local307378419_0001
17/05/02 06:00:26 INFO mapreduce.Job: The url to track the job:
http://localhost:8080/
17/05/02 06:00:26 INFO mapreduce.Job: Running job: job_local307378419_0001
17/05/02 06:00:26 INFO mapred.LocalJobRunner: OutputCommitter set in config null
17/05/02 06:00:26 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
17/05/02 06:00:26 INFO output.FileOutputCommitter: FileOutputCommitter skip
cleanup _temporary folders under output directory:false, ignore cleanup
failures: false
17/05/02 06:00:26 INFO mapred.LocalJobRunner: OutputCommitter is
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
17/05/02 06:00:26 INFO mapred.LocalJobRunner: Waiting for map tasks
17/05/02 06:00:26 INFO mapred.LocalJobRunner: Starting task:
attempt_local307378419_0001_m_000000_0
17/05/02 06:00:26 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
17/05/02 06:00:26 INFO output.FileOutputCommitter: FileOutputCommitter skip
cleanup _temporary folders under output directory:false, ignore cleanup
failures: false
17/05/02 06:00:26 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:26 INFO mapred.MapTask: Processing split:
hdfs://localhost:9000/user/root/crawl/crawldb/current/part-r-00000/data:0+148
17/05/02 06:00:26 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
17/05/02 06:00:26 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
17/05/02 06:00:26 INFO mapred.MapTask: soft limit at 83886080
17/05/02 06:00:26 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
17/05/02 06:00:26 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
17/05/02 06:00:26 INFO mapred.MapTask: Map output collector class =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
17/05/02 06:00:26 INFO plugin.PluginRepository: Plugins: looking in:
/tmp/hadoop-unjar333276722181778867/classes/plugins
17/05/02 06:00:26 INFO plugin.PluginRepository: Plugin Auto-activation mode:
[true]
17/05/02 06:00:26 INFO plugin.PluginRepository: Registered Plugins:
17/05/02 06:00:26 INFO plugin.PluginRepository: Regex URL Filter
(urlfilter-regex)
17/05/02 06:00:26 INFO plugin.PluginRepository: Html Parse Plug-in
(parse-html)
17/05/02 06:00:26 INFO plugin.PluginRepository: HTTP Framework
(lib-http)
17/05/02 06:00:26 INFO plugin.PluginRepository: the nutch core
extension points (nutch-extensionpoints)
17/05/02 06:00:26 INFO plugin.PluginRepository: Basic Indexing Filter
(index-basic)
17/05/02 06:00:26 INFO plugin.PluginRepository: Anchor Indexing Filter
(index-anchor)
17/05/02 06:00:26 INFO plugin.PluginRepository: Tika Parser Plug-in
(parse-tika)
17/05/02 06:00:26 INFO plugin.PluginRepository: Basic URL Normalizer
(urlnormalizer-basic)
17/05/02 06:00:26 INFO plugin.PluginRepository: Regex URL Filter
Framework (lib-regex-filter)
17/05/02 06:00:26 INFO plugin.PluginRepository: Regex URL Normalizer
(urlnormalizer-regex)
17/05/02 06:00:26 INFO plugin.PluginRepository: CyberNeko HTML Parser
(lib-nekohtml)
17/05/02 06:00:26 INFO plugin.PluginRepository: OPIC Scoring Plug-in
(scoring-opic)
17/05/02 06:00:26 INFO plugin.PluginRepository: Pass-through URL
Normalizer (urlnormalizer-pass)
17/05/02 06:00:26 INFO plugin.PluginRepository: Http Protocol Plug-in
(protocol-http)
17/05/02 06:00:26 INFO plugin.PluginRepository: ElasticIndexWriter
(indexer-elastic)
17/05/02 06:00:26 INFO plugin.PluginRepository: Registered Extension-Points:
17/05/02 06:00:26 INFO plugin.PluginRepository: Nutch Content Parser
(org.apache.nutch.parse.Parser)
17/05/02 06:00:26 INFO plugin.PluginRepository: Nutch URL Filter
(org.apache.nutch.net.URLFilter)
17/05/02 06:00:26 INFO plugin.PluginRepository: HTML Parse Filter
(org.apache.nutch.parse.HtmlParseFilter)
17/05/02 06:00:26 INFO plugin.PluginRepository: Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
17/05/02 06:00:26 INFO plugin.PluginRepository: Nutch URL Normalizer
(org.apache.nutch.net.URLNormalizer)
17/05/02 06:00:26 INFO plugin.PluginRepository: Nutch Publisher
(org.apache.nutch.publisher.NutchPublisher)
17/05/02 06:00:26 INFO plugin.PluginRepository: Nutch Protocol
(org.apache.nutch.protocol.Protocol)
17/05/02 06:00:26 INFO plugin.PluginRepository: Nutch URL Ignore
Exemption Filter (org.apache.nutch.net.URLExemptionFilter)
17/05/02 06:00:26 INFO plugin.PluginRepository: Nutch Index Writer
(org.apache.nutch.indexer.IndexWriter)
17/05/02 06:00:26 INFO plugin.PluginRepository: Nutch Segment Merge
Filter (org.apache.nutch.segment.SegmentMergeFilter)
17/05/02 06:00:26 INFO plugin.PluginRepository: Nutch Indexing Filter
(org.apache.nutch.indexer.IndexingFilter)
17/05/02 06:00:26 INFO conf.Configuration: found resource regex-normalize.xml
at file:/tmp/hadoop-unjar333276722181778867/regex-normalize.xml
17/05/02 06:00:26 INFO conf.Configuration: found resource regex-urlfilter.txt
at file:/tmp/hadoop-unjar333276722181778867/regex-urlfilter.txt
17/05/02 06:00:26 INFO regex.RegexURLNormalizer: can't find rules for scope
'inject', using default
17/05/02 06:00:26 INFO mapred.LocalJobRunner:
17/05/02 06:00:26 INFO mapred.MapTask: Starting flush of map output
17/05/02 06:00:26 INFO mapred.MapTask: Spilling map output
17/05/02 06:00:26 INFO mapred.MapTask: bufstart = 0; bufend = 54; bufvoid =
104857600
17/05/02 06:00:26 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend =
26214396(104857584); length = 1/6553600
17/05/02 06:00:26 INFO mapred.MapTask: Finished spill 0
17/05/02 06:00:26 INFO mapred.Task: Task:attempt_local307378419_0001_m_000000_0
is done. And is in the process of committing
17/05/02 06:00:26 INFO mapred.LocalJobRunner: map
17/05/02 06:00:26 INFO mapred.Task: Task
'attempt_local307378419_0001_m_000000_0' done.
17/05/02 06:00:26 INFO mapred.LocalJobRunner: Finishing task:
attempt_local307378419_0001_m_000000_0
17/05/02 06:00:26 INFO mapred.LocalJobRunner: Starting task:
attempt_local307378419_0001_m_000001_0
17/05/02 06:00:26 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
17/05/02 06:00:26 INFO output.FileOutputCommitter: FileOutputCommitter skip
cleanup _temporary folders under output directory:false, ignore cleanup
failures: false
17/05/02 06:00:26 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:26 INFO mapred.MapTask: Processing split:
hdfs://localhost:9000/user/root/urls/seed.txt:0+24
17/05/02 06:00:26 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
17/05/02 06:00:26 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
17/05/02 06:00:26 INFO mapred.MapTask: soft limit at 83886080
17/05/02 06:00:26 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
17/05/02 06:00:26 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
17/05/02 06:00:26 INFO mapred.MapTask: Map output collector class =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
17/05/02 06:00:26 INFO conf.Configuration: found resource regex-normalize.xml
at file:/tmp/hadoop-unjar333276722181778867/regex-normalize.xml
17/05/02 06:00:26 INFO regex.RegexURLNormalizer: can't find rules for scope
'inject', using default
17/05/02 06:00:26 INFO mapred.LocalJobRunner:
17/05/02 06:00:26 INFO mapred.MapTask: Starting flush of map output
17/05/02 06:00:26 INFO mapred.MapTask: Spilling map output
17/05/02 06:00:26 INFO mapred.MapTask: bufstart = 0; bufend = 54; bufvoid =
104857600
17/05/02 06:00:26 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend =
26214396(104857584); length = 1/6553600
17/05/02 06:00:26 INFO mapred.MapTask: Finished spill 0
17/05/02 06:00:26 INFO mapred.Task: Task:attempt_local307378419_0001_m_000001_0
is done. And is in the process of committing
17/05/02 06:00:26 INFO mapred.LocalJobRunner:
hdfs://localhost:9000/user/root/urls/seed.txt:0+24
17/05/02 06:00:26 INFO mapred.Task: Task
'attempt_local307378419_0001_m_000001_0' done.
17/05/02 06:00:26 INFO mapred.LocalJobRunner: Finishing task:
attempt_local307378419_0001_m_000001_0
17/05/02 06:00:26 INFO mapred.LocalJobRunner: map task executor complete.
17/05/02 06:00:26 INFO mapred.LocalJobRunner: Waiting for reduce tasks
17/05/02 06:00:26 INFO mapred.LocalJobRunner: Starting task:
attempt_local307378419_0001_r_000000_0
17/05/02 06:00:26 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
17/05/02 06:00:26 INFO output.FileOutputCommitter: FileOutputCommitter skip
cleanup _temporary folders under output directory:false, ignore cleanup
failures: false
17/05/02 06:00:26 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:26 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin:
org.apache.hadoop.mapreduce.task.reduce.Shuffle@504b0ec4
17/05/02 06:00:26 INFO reduce.MergeManagerImpl: MergerManager:
memoryLimit=334338464, maxSingleShuffleLimit=83584616,
mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
17/05/02 06:00:26 INFO reduce.EventFetcher:
attempt_local307378419_0001_r_000000_0 Thread started: EventFetcher for
fetching Map Completion Events
17/05/02 06:00:26 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle
output of map attempt_local307378419_0001_m_000001_0 decomp: 58 len: 62 to
MEMORY
17/05/02 06:00:26 INFO reduce.InMemoryMapOutput: Read 58 bytes from map-output
for attempt_local307378419_0001_m_000001_0
17/05/02 06:00:26 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output
of size: 58, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->58
17/05/02 06:00:26 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle
output of map attempt_local307378419_0001_m_000000_0 decomp: 58 len: 62 to
MEMORY
17/05/02 06:00:26 INFO reduce.InMemoryMapOutput: Read 58 bytes from map-output
for attempt_local307378419_0001_m_000000_0
17/05/02 06:00:26 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output
of size: 58, inMemoryMapOutputs.size() -> 2, commitMemory -> 58, usedMemory
->116
17/05/02 06:00:26 INFO reduce.EventFetcher: EventFetcher is interrupted..
Returning
17/05/02 06:00:26 INFO mapred.LocalJobRunner: 2 / 2 copied.
17/05/02 06:00:26 INFO reduce.MergeManagerImpl: finalMerge called with 2
in-memory map-outputs and 0 on-disk map-outputs
17/05/02 06:00:26 INFO mapred.Merger: Merging 2 sorted segments
17/05/02 06:00:26 INFO mapred.Merger: Down to the last merge-pass, with 2
segments left of total size: 62 bytes
17/05/02 06:00:26 INFO reduce.MergeManagerImpl: Merged 2 segments, 116 bytes to
disk to satisfy reduce memory limit
17/05/02 06:00:26 INFO reduce.MergeManagerImpl: Merging 1 files, 118 bytes from
disk
17/05/02 06:00:26 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes
from memory into reduce
17/05/02 06:00:26 INFO mapred.Merger: Merging 1 sorted segments
17/05/02 06:00:26 INFO mapred.Merger: Down to the last merge-pass, with 1
segments left of total size: 87 bytes
17/05/02 06:00:26 INFO mapred.LocalJobRunner: 2 / 2 copied.
17/05/02 06:00:27 INFO zlib.ZlibFactory: Successfully loaded & initialized
native-zlib library
17/05/02 06:00:27 INFO compress.CodecPool: Got brand-new compressor [.deflate]
17/05/02 06:00:27 INFO Configuration.deprecation: mapred.skip.on is deprecated.
Instead, use mapreduce.job.skiprecords
17/05/02 06:00:27 INFO crawl.Injector: Injector: overwrite: false
17/05/02 06:00:27 INFO crawl.Injector: Injector: update: false
17/05/02 06:00:27 INFO mapreduce.Job: Job job_local307378419_0001 running in
uber mode : false
17/05/02 06:00:27 INFO mapreduce.Job: map 100% reduce 0%
17/05/02 06:00:27 INFO mapred.Task: Task:attempt_local307378419_0001_r_000000_0
is done. And is in the process of committing
17/05/02 06:00:27 INFO mapred.LocalJobRunner: 2 / 2 copied.
17/05/02 06:00:27 INFO mapred.Task: Task attempt_local307378419_0001_r_000000_0
is allowed to commit now
17/05/02 06:00:27 INFO output.FileOutputCommitter: Saved output of task
'attempt_local307378419_0001_r_000000_0' to
hdfs://localhost:9000/user/root/crawl/crawldb/crawldb-921346783/_temporary/0/task_local307378419_0001_r_000000
17/05/02 06:00:27 INFO mapred.LocalJobRunner: reduce > reduce
17/05/02 06:00:27 INFO mapred.Task: Task
'attempt_local307378419_0001_r_000000_0' done.
17/05/02 06:00:27 INFO mapred.LocalJobRunner: Finishing task:
attempt_local307378419_0001_r_000000_0
17/05/02 06:00:27 INFO mapred.LocalJobRunner: reduce task executor complete.
17/05/02 06:00:28 INFO mapreduce.Job: map 100% reduce 100%
17/05/02 06:00:28 INFO mapreduce.Job: Job job_local307378419_0001 completed
successfully
17/05/02 06:00:28 INFO mapreduce.Job: Counters: 37
File System Counters
FILE: Number of bytes read=652298479
FILE: Number of bytes written=658557993
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=492
HDFS: Number of bytes written=365
HDFS: Number of read operations=46
HDFS: Number of large read operations=0
HDFS: Number of write operations=13
Map-Reduce Framework
Map input records=2
Map output records=2
Map output bytes=108
Map output materialized bytes=124
Input split bytes=570
Combine input records=0
Combine output records=0
Reduce input groups=1
Reduce shuffle bytes=124
Reduce input records=2
Reduce output records=1
Spilled Records=4
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=15
Total committed heap usage (bytes)=1044381696
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
injector
urls_injected=1
urls_merged=1
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=365
17/05/02 06:00:28 INFO crawl.Injector: Injector: Total urls rejected by
filters: 0
17/05/02 06:00:28 INFO crawl.Injector: Injector: Total urls injected after
normalization and filtering: 1
17/05/02 06:00:28 INFO crawl.Injector: Injector: Total urls injected but
already in CrawlDb: 1
17/05/02 06:00:28 INFO crawl.Injector: Injector: Total new urls injected: 0
17/05/02 06:00:28 INFO crawl.Injector: Injector: finished at 2017-05-02
06:00:28, elapsed: 00:00:04
Tue May 2 06:00:28 CDT 2017 : Iteration 1 of 2
Generating a new segment
/data/apache-nutch-1.13/runtime/deploy/bin/nutch generate -D
mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D
mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN 50000
-numFetchers 1 -noFilter
+ cygwin=false
+ case "`uname`" in
++ uname
+ THIS=/data/apache-nutch-1.13/runtime/deploy/bin/nutch
+ '[' -h /data/apache-nutch-1.13/runtime/deploy/bin/nutch ']'
+ '[' 18 = 0 ']'
+ COMMAND=generate
+ shift
++ dirname /data/apache-nutch-1.13/runtime/deploy/bin/nutch
+ THIS_DIR=/data/apache-nutch-1.13/runtime/deploy/bin
++ cd /data/apache-nutch-1.13/runtime/deploy/bin/..
++ pwd
+ NUTCH_HOME=/data/apache-nutch-1.13/runtime/deploy
+ '[' '' '!=' '' ']'
+ '[' /usr/lib/jvm/java-8-oracle/jre/ = '' ']'
+ local=true
+ '[' -f /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job ']'
+ local=false
+ for f in '"$NUTCH_HOME"/*nutch*.job'
+ NUTCH_JOB=/data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job
+ false
+ JAVA=/usr/lib/jvm/java-8-oracle/jre//bin/java
+ JAVA_HEAP_MAX=-Xmx1000m
+ '[' '' '!=' '' ']'
+ CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf
+
CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf:/usr/lib/jvm/java-8-oracle/jre//lib/tools.jar
+ IFS=
+ false
+ false
+ JAVA_LIBRARY_PATH=
+ '[' -d /data/apache-nutch-1.13/runtime/deploy/lib/native ']'
+ '[' false = true -a X '!=' X ']'
+ unset IFS
+ '[' '' = '' ']'
+ NUTCH_LOG_DIR=/data/apache-nutch-1.13/runtime/deploy/logs
+ '[' '' = '' ']'
+ NUTCH_LOGFILE=hadoop.log
+ false
+ NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR")
+ NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Dhadoop.log.file="$NUTCH_LOGFILE")
+ '[' x '!=' x ']'
+ '[' generate = crawl ']'
+ '[' generate = inject ']'
+ '[' generate = generate ']'
+ CLASS=org.apache.nutch.crawl.Generator
+ EXEC_CALL=(hadoop jar "$NUTCH_JOB")
+ false
++ which hadoop
++ wc -l
+ '[' 1 -eq 0 ']'
+ exec hadoop jar /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job
org.apache.nutch.crawl.Generator -D mapreduce.job.reduces=2 -D
mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D
mapreduce.map.speculative=false -D mapreduce.map.output.compress=true
crawl/crawldb crawl/segments -topN 50000 -numFetchers 1 -noFilter
17/05/02 06:00:32 INFO crawl.Generator: Generator: starting at 2017-05-02
06:00:32
17/05/02 06:00:32 INFO crawl.Generator: Generator: Selecting best-scoring urls
due for fetch.
17/05/02 06:00:32 INFO crawl.Generator: Generator: filtering: false
17/05/02 06:00:32 INFO crawl.Generator: Generator: normalizing: true
17/05/02 06:00:32 INFO crawl.Generator: Generator: topN: 50000
17/05/02 06:00:32 INFO Configuration.deprecation: session.id is deprecated.
Instead, use dfs.metrics.session-id
17/05/02 06:00:32 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
17/05/02 06:00:32 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with
processName=JobTracker, sessionId= - already initialized
17/05/02 06:00:33 INFO mapred.FileInputFormat: Total input files to process : 1
17/05/02 06:00:33 INFO mapreduce.JobSubmitter: number of splits:1
17/05/02 06:00:33 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_local1706016672_0001
17/05/02 06:00:33 INFO mapreduce.Job: The url to track the job:
http://localhost:8080/
17/05/02 06:00:33 INFO mapred.LocalJobRunner: OutputCommitter set in config null
17/05/02 06:00:33 INFO mapreduce.Job: Running job: job_local1706016672_0001
17/05/02 06:00:33 INFO mapred.LocalJobRunner: OutputCommitter is
org.apache.hadoop.mapred.FileOutputCommitter
17/05/02 06:00:33 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
17/05/02 06:00:33 INFO output.FileOutputCommitter: FileOutputCommitter skip
cleanup _temporary folders under output directory:false, ignore cleanup
failures: false
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Waiting for map tasks
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Starting task:
attempt_local1706016672_0001_m_000000_0
17/05/02 06:00:34 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
17/05/02 06:00:34 INFO output.FileOutputCommitter: FileOutputCommitter skip
cleanup _temporary folders under output directory:false, ignore cleanup
failures: false
17/05/02 06:00:34 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:34 INFO mapred.MapTask: Processing split:
hdfs://localhost:9000/user/root/crawl/crawldb/current/part-r-00000/data:0+148
17/05/02 06:00:34 INFO mapred.MapTask: numReduceTasks: 2
17/05/02 06:00:34 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
17/05/02 06:00:34 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
17/05/02 06:00:34 INFO mapred.MapTask: soft limit at 83886080
17/05/02 06:00:34 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
17/05/02 06:00:34 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
17/05/02 06:00:34 INFO mapred.MapTask: Map output collector class =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
17/05/02 06:00:34 INFO plugin.PluginRepository: Plugins: looking in:
/tmp/hadoop-unjar7886623985863993949/classes/plugins
17/05/02 06:00:34 INFO plugin.PluginRepository: Plugin Auto-activation mode:
[true]
17/05/02 06:00:34 INFO plugin.PluginRepository: Registered Plugins:
17/05/02 06:00:34 INFO plugin.PluginRepository: Regex URL Filter
(urlfilter-regex)
17/05/02 06:00:34 INFO plugin.PluginRepository: Html Parse Plug-in
(parse-html)
17/05/02 06:00:34 INFO plugin.PluginRepository: HTTP Framework
(lib-http)
17/05/02 06:00:34 INFO plugin.PluginRepository: the nutch core
extension points (nutch-extensionpoints)
17/05/02 06:00:34 INFO plugin.PluginRepository: Basic Indexing Filter
(index-basic)
17/05/02 06:00:34 INFO plugin.PluginRepository: Anchor Indexing Filter
(index-anchor)
17/05/02 06:00:34 INFO plugin.PluginRepository: Tika Parser Plug-in
(parse-tika)
17/05/02 06:00:34 INFO plugin.PluginRepository: Basic URL Normalizer
(urlnormalizer-basic)
17/05/02 06:00:34 INFO plugin.PluginRepository: Regex URL Filter
Framework (lib-regex-filter)
17/05/02 06:00:34 INFO plugin.PluginRepository: Regex URL Normalizer
(urlnormalizer-regex)
17/05/02 06:00:34 INFO plugin.PluginRepository: CyberNeko HTML Parser
(lib-nekohtml)
17/05/02 06:00:34 INFO plugin.PluginRepository: OPIC Scoring Plug-in
(scoring-opic)
17/05/02 06:00:34 INFO plugin.PluginRepository: Pass-through URL
Normalizer (urlnormalizer-pass)
17/05/02 06:00:34 INFO plugin.PluginRepository: Http Protocol Plug-in
(protocol-http)
17/05/02 06:00:34 INFO plugin.PluginRepository: ElasticIndexWriter
(indexer-elastic)
17/05/02 06:00:34 INFO plugin.PluginRepository: Registered Extension-Points:
17/05/02 06:00:34 INFO plugin.PluginRepository: Nutch Content Parser
(org.apache.nutch.parse.Parser)
17/05/02 06:00:34 INFO plugin.PluginRepository: Nutch URL Filter
(org.apache.nutch.net.URLFilter)
17/05/02 06:00:34 INFO plugin.PluginRepository: HTML Parse Filter
(org.apache.nutch.parse.HtmlParseFilter)
17/05/02 06:00:34 INFO plugin.PluginRepository: Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
17/05/02 06:00:34 INFO plugin.PluginRepository: Nutch URL Normalizer
(org.apache.nutch.net.URLNormalizer)
17/05/02 06:00:34 INFO plugin.PluginRepository: Nutch Publisher
(org.apache.nutch.publisher.NutchPublisher)
17/05/02 06:00:34 INFO plugin.PluginRepository: Nutch Protocol
(org.apache.nutch.protocol.Protocol)
17/05/02 06:00:34 INFO plugin.PluginRepository: Nutch URL Ignore
Exemption Filter (org.apache.nutch.net.URLExemptionFilter)
17/05/02 06:00:34 INFO plugin.PluginRepository: Nutch Index Writer
(org.apache.nutch.indexer.IndexWriter)
17/05/02 06:00:34 INFO plugin.PluginRepository: Nutch Segment Merge
Filter (org.apache.nutch.segment.SegmentMergeFilter)
17/05/02 06:00:34 INFO plugin.PluginRepository: Nutch Indexing Filter
(org.apache.nutch.indexer.IndexingFilter)
17/05/02 06:00:34 INFO conf.Configuration: found resource regex-urlfilter.txt
at file:/tmp/hadoop-unjar7886623985863993949/regex-urlfilter.txt
17/05/02 06:00:34 INFO conf.Configuration: found resource regex-normalize.xml
at file:/tmp/hadoop-unjar7886623985863993949/regex-normalize.xml
17/05/02 06:00:34 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl:
org.apache.nutch.crawl.DefaultFetchSchedule
17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: defaultInterval=2592000
17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: maxInterval=7776000
17/05/02 06:00:34 INFO regex.RegexURLNormalizer: can't find rules for scope
'partition', using default
17/05/02 06:00:34 INFO mapred.LocalJobRunner:
17/05/02 06:00:34 INFO mapred.MapTask: Starting flush of map output
17/05/02 06:00:34 INFO mapred.MapTask: Spilling map output
17/05/02 06:00:34 INFO mapred.MapTask: bufstart = 0; bufend = 83; bufvoid =
104857600
17/05/02 06:00:34 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend =
26214396(104857584); length = 1/6553600
17/05/02 06:00:34 INFO zlib.ZlibFactory: Successfully loaded & initialized
native-zlib library
17/05/02 06:00:34 INFO compress.CodecPool: Got brand-new compressor [.deflate]
17/05/02 06:00:34 INFO mapred.MapTask: Finished spill 0
17/05/02 06:00:34 INFO mapred.Task:
Task:attempt_local1706016672_0001_m_000000_0 is done. And is in the process of
committing
17/05/02 06:00:34 INFO mapred.LocalJobRunner:
hdfs://localhost:9000/user/root/crawl/crawldb/current/part-r-00000/data:0+148
17/05/02 06:00:34 INFO mapred.Task: Task
'attempt_local1706016672_0001_m_000000_0' done.
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Finishing task:
attempt_local1706016672_0001_m_000000_0
17/05/02 06:00:34 INFO mapred.LocalJobRunner: map task executor complete.
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Waiting for reduce tasks
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Starting task:
attempt_local1706016672_0001_r_000000_0
17/05/02 06:00:34 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
17/05/02 06:00:34 INFO output.FileOutputCommitter: FileOutputCommitter skip
cleanup _temporary folders under output directory:false, ignore cleanup
failures: false
17/05/02 06:00:34 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:34 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin:
org.apache.hadoop.mapreduce.task.reduce.Shuffle@2fd7e5ad
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: MergerManager:
memoryLimit=334338464, maxSingleShuffleLimit=83584616,
mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
17/05/02 06:00:34 INFO reduce.EventFetcher:
attempt_local1706016672_0001_r_000000_0 Thread started: EventFetcher for
fetching Map Completion Events
17/05/02 06:00:34 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
17/05/02 06:00:34 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle
output of map attempt_local1706016672_0001_m_000000_0 decomp: 87 len: 83 to
MEMORY
17/05/02 06:00:34 INFO reduce.InMemoryMapOutput: Read 87 bytes from map-output
for attempt_local1706016672_0001_m_000000_0
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output
of size: 87, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->87
17/05/02 06:00:34 INFO reduce.EventFetcher: EventFetcher is interrupted..
Returning
17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: finalMerge called with 1
in-memory map-outputs and 0 on-disk map-outputs
17/05/02 06:00:34 INFO mapred.Merger: Merging 1 sorted segments
17/05/02 06:00:34 INFO mapred.Merger: Down to the last merge-pass, with 1
segments left of total size: 81 bytes
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merged 1 segments, 87 bytes to
disk to satisfy reduce memory limit
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merging 1 files, 91 bytes from
disk
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes
from memory into reduce
17/05/02 06:00:34 INFO mapred.Merger: Merging 1 sorted segments
17/05/02 06:00:34 INFO mapred.Merger: Down to the last merge-pass, with 1
segments left of total size: 81 bytes
17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:34 INFO conf.Configuration: found resource regex-normalize.xml
at file:/tmp/hadoop-unjar7886623985863993949/regex-normalize.xml
17/05/02 06:00:34 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl:
org.apache.nutch.crawl.DefaultFetchSchedule
17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: defaultInterval=2592000
17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: maxInterval=7776000
17/05/02 06:00:34 INFO regex.RegexURLNormalizer: can't find rules for scope
'generate_host_count', using default
17/05/02 06:00:34 INFO mapred.Task:
Task:attempt_local1706016672_0001_r_000000_0 is done. And is in the process of
committing
17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:34 INFO mapred.Task: Task
attempt_local1706016672_0001_r_000000_0 is allowed to commit now
17/05/02 06:00:34 INFO output.FileOutputCommitter: Saved output of task
'attempt_local1706016672_0001_r_000000_0' to
hdfs://localhost:9000/user/root/generate-temp-ca817ccd-332b-4fa3-afe3-dab7d80ea711/_temporary/0/task_local1706016672_0001_r_000000
17/05/02 06:00:34 INFO mapred.LocalJobRunner: reduce > reduce
17/05/02 06:00:34 INFO mapred.Task: Task
'attempt_local1706016672_0001_r_000000_0' done.
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Finishing task:
attempt_local1706016672_0001_r_000000_0
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Starting task:
attempt_local1706016672_0001_r_000001_0
17/05/02 06:00:34 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
17/05/02 06:00:34 INFO output.FileOutputCommitter: FileOutputCommitter skip
cleanup _temporary folders under output directory:false, ignore cleanup
failures: false
17/05/02 06:00:34 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:34 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin:
org.apache.hadoop.mapreduce.task.reduce.Shuffle@29cfa49
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: MergerManager:
memoryLimit=334338464, maxSingleShuffleLimit=83584616,
mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
17/05/02 06:00:34 INFO reduce.EventFetcher:
attempt_local1706016672_0001_r_000001_0 Thread started: EventFetcher for
fetching Map Completion Events
17/05/02 06:00:34 INFO reduce.LocalFetcher: localfetcher#2 about to shuffle
output of map attempt_local1706016672_0001_m_000000_0 decomp: 2 len: 14 to
MEMORY
17/05/02 06:00:34 INFO reduce.InMemoryMapOutput: Read 2 bytes from map-output
for attempt_local1706016672_0001_m_000000_0
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output
of size: 2, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->2
17/05/02 06:00:34 INFO reduce.EventFetcher: EventFetcher is interrupted..
Returning
17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: finalMerge called with 1
in-memory map-outputs and 0 on-disk map-outputs
17/05/02 06:00:34 INFO mapred.Merger: Merging 1 sorted segments
17/05/02 06:00:34 INFO mapred.Merger: Down to the last merge-pass, with 0
segments left of total size: 0 bytes
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merged 1 segments, 2 bytes to
disk to satisfy reduce memory limit
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merging 1 files, 22 bytes from
disk
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes
from memory into reduce
17/05/02 06:00:34 INFO mapred.Merger: Merging 1 sorted segments
17/05/02 06:00:34 INFO mapred.Merger: Down to the last merge-pass, with 0
segments left of total size: 0 bytes
17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:34 INFO conf.Configuration: found resource regex-normalize.xml
at file:/tmp/hadoop-unjar7886623985863993949/regex-normalize.xml
17/05/02 06:00:34 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl:
org.apache.nutch.crawl.DefaultFetchSchedule
17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: defaultInterval=2592000
17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: maxInterval=7776000
17/05/02 06:00:34 INFO mapred.Task:
Task:attempt_local1706016672_0001_r_000001_0 is done. And is in the process of
committing
17/05/02 06:00:34 INFO mapred.LocalJobRunner: reduce > reduce
17/05/02 06:00:34 INFO mapred.Task: Task
'attempt_local1706016672_0001_r_000001_0' done.
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Finishing task:
attempt_local1706016672_0001_r_000001_0
17/05/02 06:00:34 INFO mapred.LocalJobRunner: reduce task executor complete.
17/05/02 06:00:34 INFO mapreduce.Job: Job job_local1706016672_0001 running in
uber mode : false
17/05/02 06:00:34 INFO mapreduce.Job: map 100% reduce 100%
17/05/02 06:00:34 INFO mapreduce.Job: Job job_local1706016672_0001 completed
successfully
17/05/02 06:00:35 INFO mapreduce.Job: Counters: 35
File System Counters
FILE: Number of bytes read=652296139
FILE: Number of bytes written=658571046
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=444
HDFS: Number of bytes written=398
HDFS: Number of read operations=37
HDFS: Number of large read operations=0
HDFS: Number of write operations=13
Map-Reduce Framework
Map input records=1
Map output records=1
Map output bytes=83
Map output materialized bytes=97
Input split bytes=123
Combine input records=0
Combine output records=0
Reduce input groups=1
Reduce shuffle bytes=97
Reduce input records=1
Reduce output records=1
Spilled Records=2
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=8
Total committed heap usage (bytes)=1036517376
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=148
File Output Format Counters
Bytes Written=199
17/05/02 06:00:35 INFO crawl.Generator: Generator: Partitioning selected urls
for politeness.
17/05/02 06:00:36 INFO crawl.Generator: Generator: segment:
crawl/segments/20170502060036
17/05/02 06:00:36 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with
processName=JobTracker, sessionId= - already initialized
17/05/02 06:00:36 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with
processName=JobTracker, sessionId= - already initialized
17/05/02 06:00:36 INFO mapred.FileInputFormat: Total input files to process : 1
17/05/02 06:00:36 INFO mapreduce.JobSubmitter: number of splits:1
17/05/02 06:00:36 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_local1332900929_0002
17/05/02 06:00:36 INFO mapreduce.Job: The url to track the job:
http://localhost:8080/
17/05/02 06:00:36 INFO mapreduce.Job: Running job: job_local1332900929_0002
17/05/02 06:00:36 INFO mapred.LocalJobRunner: OutputCommitter set in config null
17/05/02 06:00:36 INFO mapred.LocalJobRunner: OutputCommitter is
org.apache.hadoop.mapred.FileOutputCommitter
17/05/02 06:00:36 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
17/05/02 06:00:36 INFO output.FileOutputCommitter: FileOutputCommitter skip
cleanup _temporary folders under output directory:false, ignore cleanup
failures: false
17/05/02 06:00:36 INFO mapred.LocalJobRunner: Waiting for map tasks
17/05/02 06:00:36 INFO mapred.LocalJobRunner: Starting task:
attempt_local1332900929_0002_m_000000_0
17/05/02 06:00:36 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
17/05/02 06:00:36 INFO output.FileOutputCommitter: FileOutputCommitter skip
cleanup _temporary folders under output directory:false, ignore cleanup
failures: false
17/05/02 06:00:36 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:36 INFO mapred.MapTask: Processing split:
hdfs://localhost:9000/user/root/generate-temp-ca817ccd-332b-4fa3-afe3-dab7d80ea711/fetchlist-1/part-00000:0+199
17/05/02 06:00:36 INFO mapred.MapTask: numReduceTasks: 1
17/05/02 06:00:36 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
17/05/02 06:00:36 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
17/05/02 06:00:36 INFO mapred.MapTask: soft limit at 83886080
17/05/02 06:00:36 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
17/05/02 06:00:36 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
17/05/02 06:00:36 INFO mapred.MapTask: Map output collector class =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
17/05/02 06:00:36 INFO mapred.LocalJobRunner:
17/05/02 06:00:36 INFO mapred.MapTask: Starting flush of map output
17/05/02 06:00:36 INFO mapred.MapTask: Spilling map output
17/05/02 06:00:36 INFO mapred.MapTask: bufstart = 0; bufend = 104; bufvoid =
104857600
17/05/02 06:00:36 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend =
26214396(104857584); length = 1/6553600
17/05/02 06:00:36 INFO mapred.MapTask: Finished spill 0
17/05/02 06:00:36 INFO mapred.Task:
Task:attempt_local1332900929_0002_m_000000_0 is done. And is in the process of
committing
17/05/02 06:00:36 INFO mapred.LocalJobRunner:
hdfs://localhost:9000/user/root/generate-temp-ca817ccd-332b-4fa3-afe3-dab7d80ea711/fetchlist-1/part-00000:0+199
17/05/02 06:00:36 INFO mapred.Task: Task
'attempt_local1332900929_0002_m_000000_0' done.
17/05/02 06:00:36 INFO mapred.LocalJobRunner: Finishing task:
attempt_local1332900929_0002_m_000000_0
17/05/02 06:00:36 INFO mapred.LocalJobRunner: map task executor complete.
17/05/02 06:00:36 INFO mapred.LocalJobRunner: Waiting for reduce tasks
17/05/02 06:00:36 INFO mapred.LocalJobRunner: Starting task:
attempt_local1332900929_0002_r_000000_0
17/05/02 06:00:36 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
17/05/02 06:00:36 INFO output.FileOutputCommitter: FileOutputCommitter skip
cleanup _temporary folders under output directory:false, ignore cleanup
failures: false
17/05/02 06:00:36 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:36 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin:
org.apache.hadoop.mapreduce.task.reduce.Shuffle@57dcd1f6
17/05/02 06:00:36 INFO reduce.MergeManagerImpl: MergerManager:
memoryLimit=334338464, maxSingleShuffleLimit=83584616,
mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
17/05/02 06:00:36 INFO reduce.EventFetcher:
attempt_local1332900929_0002_r_000000_0 Thread started: EventFetcher for
fetching Map Completion Events
17/05/02 06:00:36 INFO reduce.LocalFetcher: localfetcher#3 about to shuffle
output of map attempt_local1332900929_0002_m_000000_0 decomp: 108 len: 82 to
MEMORY
17/05/02 06:00:36 INFO reduce.InMemoryMapOutput: Read 108 bytes from map-output
for attempt_local1332900929_0002_m_000000_0
17/05/02 06:00:36 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output
of size: 108, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory
->108
17/05/02 06:00:36 INFO reduce.EventFetcher: EventFetcher is interrupted..
Returning
17/05/02 06:00:36 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:36 INFO reduce.MergeManagerImpl: finalMerge called with 1
in-memory map-outputs and 0 on-disk map-outputs
17/05/02 06:00:36 INFO mapred.Merger: Merging 1 sorted segments
17/05/02 06:00:36 INFO mapred.Merger: Down to the last merge-pass, with 1
segments left of total size: 81 bytes
17/05/02 06:00:36 INFO reduce.MergeManagerImpl: Merged 1 segments, 108 bytes to
disk to satisfy reduce memory limit
17/05/02 06:00:36 INFO reduce.MergeManagerImpl: Merging 1 files, 90 bytes from
disk
17/05/02 06:00:36 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes
from memory into reduce
17/05/02 06:00:36 INFO mapred.Merger: Merging 1 sorted segments
17/05/02 06:00:36 INFO mapred.Merger: Down to the last merge-pass, with 1
segments left of total size: 81 bytes
17/05/02 06:00:36 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:36 INFO mapred.Task:
Task:attempt_local1332900929_0002_r_000000_0 is done. And is in the process of
committing
17/05/02 06:00:36 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:36 INFO mapred.Task: Task
attempt_local1332900929_0002_r_000000_0 is allowed to commit now
17/05/02 06:00:36 INFO output.FileOutputCommitter: Saved output of task
'attempt_local1332900929_0002_r_000000_0' to
hdfs://localhost:9000/user/root/crawl/segments/20170502060036/crawl_generate/_temporary/0/task_local1332900929_0002_r_000000
17/05/02 06:00:36 INFO mapred.LocalJobRunner: reduce > reduce
17/05/02 06:00:36 INFO mapred.Task: Task
'attempt_local1332900929_0002_r_000000_0' done.
17/05/02 06:00:36 INFO mapred.LocalJobRunner: Finishing task:
attempt_local1332900929_0002_r_000000_0
17/05/02 06:00:36 INFO mapred.LocalJobRunner: reduce task executor complete.
17/05/02 06:00:37 INFO mapreduce.Job: Job job_local1332900929_0002 running in
uber mode : false
17/05/02 06:00:37 INFO mapreduce.Job: map 100% reduce 100%
17/05/02 06:00:37 INFO mapreduce.Job: Job job_local1332900929_0002 completed
successfully
17/05/02 06:00:37 INFO mapreduce.Job: Counters: 35
File System Counters
FILE: Number of bytes read=869728356
FILE: Number of bytes written=878093356
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=694
HDFS: Number of bytes written=567
HDFS: Number of read operations=53
HDFS: Number of large read operations=0
HDFS: Number of write operations=18
Map-Reduce Framework
Map input records=1
Map output records=1
Map output bytes=104
Map output materialized bytes=82
Input split bytes=157
Combine input records=0
Combine output records=0
Reduce input groups=1
Reduce shuffle bytes=82
Reduce input records=1
Reduce output records=1
Spilled Records=2
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=0
Total committed heap usage (bytes)=901775360
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=199
File Output Format Counters
Bytes Written=169
17/05/02 06:00:37 INFO crawl.Generator: Generator: finished at 2017-05-02
06:00:37, elapsed: 00:00:05
Operating on segment : 20170502060036
Fetching : 20170502060036
/data/apache-nutch-1.13/runtime/deploy/bin/nutch fetch -D
mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D
mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180
crawl/segments/20170502060036 -noParsing -threads 50
+ cygwin=false
+ case "`uname`" in
++ uname
+ THIS=/data/apache-nutch-1.13/runtime/deploy/bin/nutch
+ '[' -h /data/apache-nutch-1.13/runtime/deploy/bin/nutch ']'
+ '[' 17 = 0 ']'
+ COMMAND=fetch
+ shift
++ dirname /data/apache-nutch-1.13/runtime/deploy/bin/nutch
+ THIS_DIR=/data/apache-nutch-1.13/runtime/deploy/bin
++ cd /data/apache-nutch-1.13/runtime/deploy/bin/..
++ pwd
+ NUTCH_HOME=/data/apache-nutch-1.13/runtime/deploy
+ '[' '' '!=' '' ']'
+ '[' /usr/lib/jvm/java-8-oracle/jre/ = '' ']'
+ local=true
+ '[' -f /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job ']'
+ local=false
+ for f in '"$NUTCH_HOME"/*nutch*.job'
+ NUTCH_JOB=/data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job
+ false
+ JAVA=/usr/lib/jvm/java-8-oracle/jre//bin/java
+ JAVA_HEAP_MAX=-Xmx1000m
+ '[' '' '!=' '' ']'
+ CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf
+
CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf:/usr/lib/jvm/java-8-oracle/jre//lib/tools.jar
+ IFS=
+ false
+ false
+ JAVA_LIBRARY_PATH=
+ '[' -d /data/apache-nutch-1.13/runtime/deploy/lib/native ']'
+ '[' false = true -a X '!=' X ']'
+ unset IFS
+ '[' '' = '' ']'
+ NUTCH_LOG_DIR=/data/apache-nutch-1.13/runtime/deploy/logs
+ '[' '' = '' ']'
+ NUTCH_LOGFILE=hadoop.log
+ false
+ NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR")
+ NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Dhadoop.log.file="$NUTCH_LOGFILE")
+ '[' x '!=' x ']'
+ '[' fetch = crawl ']'
+ '[' fetch = inject ']'
+ '[' fetch = generate ']'
+ '[' fetch = freegen ']'
+ '[' fetch = fetch ']'
+ CLASS=org.apache.nutch.fetcher.Fetcher
+ EXEC_CALL=(hadoop jar "$NUTCH_JOB")
+ false
++ which hadoop
++ wc -l
+ '[' 1 -eq 0 ']'
+ exec hadoop jar /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job
org.apache.nutch.fetcher.Fetcher -D mapreduce.job.reduces=2 -D
mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D
mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D
fetcher.timelimit.mins=180 crawl/segments/20170502060036 -noParsing -threads 50
17/05/02 06:00:43 INFO fetcher.Fetcher: Fetcher: starting at 2017-05-02 06:00:43
17/05/02 06:00:43 INFO fetcher.Fetcher: Fetcher: segment:
crawl/segments/20170502060036
17/05/02 06:00:43 INFO fetcher.Fetcher: Fetcher Timelimit set for :
1493733643194
17/05/02 06:00:44 INFO Configuration.deprecation: session.id is deprecated.
Instead, use dfs.metrics.session-id
17/05/02 06:00:44 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
17/05/02 06:00:44 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with
processName=JobTracker, sessionId= - already initialized
17/05/02 06:00:44 ERROR fetcher.Fetcher: Fetcher:
java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:9000/user/root/crawl/segments/20170502060036/crawl_fetch, expected: file:///
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665)
    at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:86)
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:630)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:861)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:625)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:435)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1436)
    at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:55)
    at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:270)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:141)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:870)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:486)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:521)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:495)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
Error running:
/data/apache-nutch-1.13/runtime/deploy/bin/nutch fetch -D
mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D
mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180
crawl/segments/20170502060036 -noParsing -threads 50
Failed with exit value 255.
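A note on the error itself: "Wrong FS: hdfs://... expected: file:///" is what
Hadoop's FileSystem.checkPath throws when an hdfs:// path is handed to the local
filesystem implementation, and the trace shows this happening inside
FetcherOutputFormat.checkOutputSpecs while it tests whether crawl_fetch already
exists. I also notice that all of the jobs above run through
mapred.LocalJobRunner even though the data is on HDFS. Before downgrading I will
sanity-check what the client resolves as the default filesystem; a rough sketch,
with paths from my setup:

    # Should print hdfs://localhost:9000 if core-site.xml is visible to the client:
    hdfs getconf -confKey fs.defaultFS
    # The segment the Fetcher refuses to touch does exist on HDFS:
    hadoop fs -ls hdfs://localhost:9000/user/root/crawl/segments/20170502060036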
-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]]
Sent: 02 May 2017 13:54
To: [email protected]
Subject: Re: Wrong FS exception in Fetcher
Hi Yossi,
strange error, indeed. Is it also reproducible in pseudo-distributed mode using
Hadoop 2.7.2, the version Nutch depends on?
Could you also add the line
set -x
to bin/nutch and run bin/crawl again to see how all steps are executed.
Thanks,
Sebastian
On 04/30/2017 04:04 PM, Yossi Tamari wrote:
> Hi,
>
>
>
> I'm trying to run Nutch 1.13 on Hadoop 2.8.0 in pseudo-distributed mode.
>
> Running the command:
>
> deploy/bin/crawl urls crawl 2
>
> The Injector and Generator run successfully, but in the Fetcher I get the
> following error:
>
> 17/04/30 08:43:48 ERROR fetcher.Fetcher: Fetcher:
> java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:9000/user/root/crawl/segments/20170430084337/crawl_fetch, expected: file:///
>     at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665)
>     at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:86)
>     at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:630)
>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:861)
>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:625)
>     at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:435)
>     at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1436)
>     at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:55)
>     at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:270)
>     at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:141)
>     at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
>     at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338)
>     at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
>     at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:870)
>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:486)
>     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:521)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>     at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:495)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
>
>
>
> Error running:
>
> /data/apache-nutch-1.13/runtime/deploy/bin/nutch fetch -D
> mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D
> mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
> mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180
> crawl/segments/20170430084337 -noParsing -threads 50
>
> Failed with exit value 255.
>
>
>
>
>
> Any ideas how to fix this?
>
>
>
> Thanks,
>
> Yossi.
>
>