Thanks Sebastian,
The output with set -x is below. I'm new to Nutch and was not aware that 1.13
requires Hadoop 2.7.2 specifically. While I can see it now in pom.xml, it may be
a good idea to document this on the download page and provide a download link,
since the Hadoop releases page lists 2.7.3 but not 2.7.2. I will try to
install 2.7.2 and retest tomorrow.
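In case it helps anyone else in the meantime: older releases still appear to be
available from the Apache archive, so fetching 2.7.2 should be as simple as
(assuming the usual archive layout):

    wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
    tar -xzf hadoop-2.7.2.tar.gz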
root@crawler001:/data/apache-nutch-1.13/runtime/deploy/bin# ./crawl urls crawl 2
Injecting seed URLs
/data/apache-nutch-1.13/runtime/deploy/bin/nutch inject crawl/crawldb urls
+ cygwin=false
+ case "`uname`" in
++ uname
+ THIS=/data/apache-nutch-1.13/runtime/deploy/bin/nutch
+ '[' -h /data/apache-nutch-1.13/runtime/deploy/bin/nutch ']'
+ '[' 3 = 0 ']'
+ COMMAND=inject
+ shift
++ dirname /data/apache-nutch-1.13/runtime/deploy/bin/nutch
+ THIS_DIR=/data/apache-nutch-1.13/runtime/deploy/bin
++ cd /data/apache-nutch-1.13/runtime/deploy/bin/..
++ pwd
+ NUTCH_HOME=/data/apache-nutch-1.13/runtime/deploy
+ '[' '' '!=' '' ']'
+ '[' /usr/lib/jvm/java-8-oracle/jre/ = '' ']'
+ local=true
+ '[' -f /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job ']'
+ local=false
+ for f in '"$NUTCH_HOME"/*nutch*.job'
+ NUTCH_JOB=/data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job
+ false
+ JAVA=/usr/lib/jvm/java-8-oracle/jre//bin/java
+ JAVA_HEAP_MAX=-Xmx1000m
+ '[' '' '!=' '' ']'
+ CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf
+
CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf:/usr/lib/jvm/java-8-oracle/jre//lib/tools.jar
+ IFS=
+ false
+ false
+ JAVA_LIBRARY_PATH=
+ '[' -d /data/apache-nutch-1.13/runtime/deploy/lib/native ']'
+ '[' false = true -a X '!=' X ']'
+ unset IFS
+ '[' '' = '' ']'
+ NUTCH_LOG_DIR=/data/apache-nutch-1.13/runtime/deploy/logs
+ '[' '' = '' ']'
+ NUTCH_LOGFILE=hadoop.log
+ false
+ NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR")
+ NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Dhadoop.log.file="$NUTCH_LOGFILE")
+ '[' x '!=' x ']'
+ '[' inject = crawl ']'
+ '[' inject = inject ']'
+ CLASS=org.apache.nutch.crawl.Injector
+ EXEC_CALL=(hadoop jar "$NUTCH_JOB")
+ false
++ which hadoop
++ wc -l
+ '[' 1 -eq 0 ']'
+ exec hadoop jar /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job
org.apache.nutch.crawl.Injector crawl/crawldb urls
17/05/02 06:00:24 INFO crawl.Injector: Injector: starting at 2017-05-02 06:00:24
17/05/02 06:00:24 INFO crawl.Injector: Injector: crawlDb: crawl/crawldb
17/05/02 06:00:24 INFO crawl.Injector: Injector: urlDir: urls
17/05/02 06:00:24 INFO crawl.Injector: Injector: Converting injected urls to
crawl db entries.
17/05/02 06:00:25 INFO Configuration.deprecation: session.id is deprecated.
Instead, use dfs.metrics.session-id
17/05/02 06:00:25 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
17/05/02 06:00:26 INFO input.FileInputFormat: Total input files to process : 1
17/05/02 06:00:26 INFO input.FileInputFormat: Total input files to process : 1
17/05/02 06:00:26 INFO mapreduce.JobSubmitter: number of splits:2
17/05/02 06:00:26 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_local307378419_0001
17/05/02 06:00:26 INFO mapreduce.Job: The url to track the job:
http://localhost:8080/
17/05/02 06:00:26 INFO mapreduce.Job: Running job: job_local307378419_0001
17/05/02 06:00:26 INFO mapred.LocalJobRunner: OutputCommitter set in config null
17/05/02 06:00:26 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
17/05/02 06:00:26 INFO output.FileOutputCommitter: FileOutputCommitter skip
cleanup _temporary folders under output directory:false, ignore cleanup
failures: false
17/05/02 06:00:26 INFO mapred.LocalJobRunner: OutputCommitter is
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
17/05/02 06:00:26 INFO mapred.LocalJobRunner: Waiting for map tasks
17/05/02 06:00:26 INFO mapred.LocalJobRunner: Starting task:
attempt_local307378419_0001_m_000000_0
17/05/02 06:00:26 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
17/05/02 06:00:26 INFO output.FileOutputCommitter: FileOutputCommitter skip
cleanup _temporary folders under output directory:false, ignore cleanup
failures: false
17/05/02 06:00:26 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:26 INFO mapred.MapTask: Processing split:
hdfs://localhost:9000/user/root/crawl/crawldb/current/part-r-00000/data:0+148
17/05/02 06:00:26 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
17/05/02 06:00:26 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
17/05/02 06:00:26 INFO mapred.MapTask: soft limit at 83886080
17/05/02 06:00:26 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
17/05/02 06:00:26 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
17/05/02 06:00:26 INFO mapred.MapTask: Map output collector class =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
17/05/02 06:00:26 INFO plugin.PluginRepository: Plugins: looking in:
/tmp/hadoop-unjar333276722181778867/classes/plugins
17/05/02 06:00:26 INFO plugin.PluginRepository: Plugin Auto-activation mode:
[true]
17/05/02 06:00:26 INFO plugin.PluginRepository: Registered Plugins:
17/05/02 06:00:26 INFO plugin.PluginRepository: Regex URL Filter
(urlfilter-regex)
17/05/02 06:00:26 INFO plugin.PluginRepository: Html Parse Plug-in
(parse-html)
17/05/02 06:00:26 INFO plugin.PluginRepository: HTTP Framework
(lib-http)
17/05/02 06:00:26 INFO plugin.PluginRepository: the nutch core
extension points (nutch-extensionpoints)
17/05/02 06:00:26 INFO plugin.PluginRepository: Basic Indexing Filter
(index-basic)
17/05/02 06:00:26 INFO plugin.PluginRepository: Anchor Indexing Filter
(index-anchor)
17/05/02 06:00:26 INFO plugin.PluginRepository: Tika Parser Plug-in
(parse-tika)
17/05/02 06:00:26 INFO plugin.PluginRepository: Basic URL Normalizer
(urlnormalizer-basic)
17/05/02 06:00:26 INFO plugin.PluginRepository: Regex URL Filter
Framework (lib-regex-filter)
17/05/02 06:00:26 INFO plugin.PluginRepository: Regex URL Normalizer
(urlnormalizer-regex)
17/05/02 06:00:26 INFO plugin.PluginRepository: CyberNeko HTML Parser
(lib-nekohtml)
17/05/02 06:00:26 INFO plugin.PluginRepository: OPIC Scoring Plug-in
(scoring-opic)
17/05/02 06:00:26 INFO plugin.PluginRepository: Pass-through URL
Normalizer (urlnormalizer-pass)
17/05/02 06:00:26 INFO plugin.PluginRepository: Http Protocol Plug-in
(protocol-http)
17/05/02 06:00:26 INFO plugin.PluginRepository: ElasticIndexWriter
(indexer-elastic)
17/05/02 06:00:26 INFO plugin.PluginRepository: Registered Extension-Points:
17/05/02 06:00:26 INFO plugin.PluginRepository: Nutch Content Parser
(org.apache.nutch.parse.Parser)
17/05/02 06:00:26 INFO plugin.PluginRepository: Nutch URL Filter
(org.apache.nutch.net.URLFilter)
17/05/02 06:00:26 INFO plugin.PluginRepository: HTML Parse Filter
(org.apache.nutch.parse.HtmlParseFilter)
17/05/02 06:00:26 INFO plugin.PluginRepository: Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
17/05/02 06:00:26 INFO plugin.PluginRepository: Nutch URL Normalizer
(org.apache.nutch.net.URLNormalizer)
17/05/02 06:00:26 INFO plugin.PluginRepository: Nutch Publisher
(org.apache.nutch.publisher.NutchPublisher)
17/05/02 06:00:26 INFO plugin.PluginRepository: Nutch Protocol
(org.apache.nutch.protocol.Protocol)
17/05/02 06:00:26 INFO plugin.PluginRepository: Nutch URL Ignore
Exemption Filter (org.apache.nutch.net.URLExemptionFilter)
17/05/02 06:00:26 INFO plugin.PluginRepository: Nutch Index Writer
(org.apache.nutch.indexer.IndexWriter)
17/05/02 06:00:26 INFO plugin.PluginRepository: Nutch Segment Merge
Filter (org.apache.nutch.segment.SegmentMergeFilter)
17/05/02 06:00:26 INFO plugin.PluginRepository: Nutch Indexing Filter
(org.apache.nutch.indexer.IndexingFilter)
17/05/02 06:00:26 INFO conf.Configuration: found resource regex-normalize.xml
at file:/tmp/hadoop-unjar333276722181778867/regex-normalize.xml
17/05/02 06:00:26 INFO conf.Configuration: found resource regex-urlfilter.txt
at file:/tmp/hadoop-unjar333276722181778867/regex-urlfilter.txt
17/05/02 06:00:26 INFO regex.RegexURLNormalizer: can't find rules for scope
'inject', using default
17/05/02 06:00:26 INFO mapred.LocalJobRunner:
17/05/02 06:00:26 INFO mapred.MapTask: Starting flush of map output
17/05/02 06:00:26 INFO mapred.MapTask: Spilling map output
17/05/02 06:00:26 INFO mapred.MapTask: bufstart = 0; bufend = 54; bufvoid =
104857600
17/05/02 06:00:26 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend =
26214396(104857584); length = 1/6553600
17/05/02 06:00:26 INFO mapred.MapTask: Finished spill 0
17/05/02 06:00:26 INFO mapred.Task: Task:attempt_local307378419_0001_m_000000_0
is done. And is in the process of committing
17/05/02 06:00:26 INFO mapred.LocalJobRunner: map
17/05/02 06:00:26 INFO mapred.Task: Task
'attempt_local307378419_0001_m_000000_0' done.
17/05/02 06:00:26 INFO mapred.LocalJobRunner: Finishing task:
attempt_local307378419_0001_m_000000_0
17/05/02 06:00:26 INFO mapred.LocalJobRunner: Starting task:
attempt_local307378419_0001_m_000001_0
17/05/02 06:00:26 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
17/05/02 06:00:26 INFO output.FileOutputCommitter: FileOutputCommitter skip
cleanup _temporary folders under output directory:false, ignore cleanup
failures: false
17/05/02 06:00:26 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:26 INFO mapred.MapTask: Processing split:
hdfs://localhost:9000/user/root/urls/seed.txt:0+24
17/05/02 06:00:26 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
17/05/02 06:00:26 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
17/05/02 06:00:26 INFO mapred.MapTask: soft limit at 83886080
17/05/02 06:00:26 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
17/05/02 06:00:26 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
17/05/02 06:00:26 INFO mapred.MapTask: Map output collector class =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
17/05/02 06:00:26 INFO conf.Configuration: found resource regex-normalize.xml
at file:/tmp/hadoop-unjar333276722181778867/regex-normalize.xml
17/05/02 06:00:26 INFO regex.RegexURLNormalizer: can't find rules for scope
'inject', using default
17/05/02 06:00:26 INFO mapred.LocalJobRunner:
17/05/02 06:00:26 INFO mapred.MapTask: Starting flush of map output
17/05/02 06:00:26 INFO mapred.MapTask: Spilling map output
17/05/02 06:00:26 INFO mapred.MapTask: bufstart = 0; bufend = 54; bufvoid =
104857600
17/05/02 06:00:26 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend =
26214396(104857584); length = 1/6553600
17/05/02 06:00:26 INFO mapred.MapTask: Finished spill 0
17/05/02 06:00:26 INFO mapred.Task: Task:attempt_local307378419_0001_m_000001_0
is done. And is in the process of committing
17/05/02 06:00:26 INFO mapred.LocalJobRunner:
hdfs://localhost:9000/user/root/urls/seed.txt:0+24
17/05/02 06:00:26 INFO mapred.Task: Task
'attempt_local307378419_0001_m_000001_0' done.
17/05/02 06:00:26 INFO mapred.LocalJobRunner: Finishing task:
attempt_local307378419_0001_m_000001_0
17/05/02 06:00:26 INFO mapred.LocalJobRunner: map task executor complete.
17/05/02 06:00:26 INFO mapred.LocalJobRunner: Waiting for reduce tasks
17/05/02 06:00:26 INFO mapred.LocalJobRunner: Starting task:
attempt_local307378419_0001_r_000000_0
17/05/02 06:00:26 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
17/05/02 06:00:26 INFO output.FileOutputCommitter: FileOutputCommitter skip
cleanup _temporary folders under output directory:false, ignore cleanup
failures: false
17/05/02 06:00:26 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:26 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin:
org.apache.hadoop.mapreduce.task.reduce.Shuffle@504b0ec4
17/05/02 06:00:26 INFO reduce.MergeManagerImpl: MergerManager:
memoryLimit=334338464, maxSingleShuffleLimit=83584616,
mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
17/05/02 06:00:26 INFO reduce.EventFetcher:
attempt_local307378419_0001_r_000000_0 Thread started: EventFetcher for
fetching Map Completion Events
17/05/02 06:00:26 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle
output of map attempt_local307378419_0001_m_000001_0 decomp: 58 len: 62 to
MEMORY
17/05/02 06:00:26 INFO reduce.InMemoryMapOutput: Read 58 bytes from map-output
for attempt_local307378419_0001_m_000001_0
17/05/02 06:00:26 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output
of size: 58, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->58
17/05/02 06:00:26 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle
output of map attempt_local307378419_0001_m_000000_0 decomp: 58 len: 62 to
MEMORY
17/05/02 06:00:26 INFO reduce.InMemoryMapOutput: Read 58 bytes from map-output
for attempt_local307378419_0001_m_000000_0
17/05/02 06:00:26 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output
of size: 58, inMemoryMapOutputs.size() -> 2, commitMemory -> 58, usedMemory
->116
17/05/02 06:00:26 INFO reduce.EventFetcher: EventFetcher is interrupted..
Returning
17/05/02 06:00:26 INFO mapred.LocalJobRunner: 2 / 2 copied.
17/05/02 06:00:26 INFO reduce.MergeManagerImpl: finalMerge called with 2
in-memory map-outputs and 0 on-disk map-outputs
17/05/02 06:00:26 INFO mapred.Merger: Merging 2 sorted segments
17/05/02 06:00:26 INFO mapred.Merger: Down to the last merge-pass, with 2
segments left of total size: 62 bytes
17/05/02 06:00:26 INFO reduce.MergeManagerImpl: Merged 2 segments, 116 bytes to
disk to satisfy reduce memory limit
17/05/02 06:00:26 INFO reduce.MergeManagerImpl: Merging 1 files, 118 bytes from
disk
17/05/02 06:00:26 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes
from memory into reduce
17/05/02 06:00:26 INFO mapred.Merger: Merging 1 sorted segments
17/05/02 06:00:26 INFO mapred.Merger: Down to the last merge-pass, with 1
segments left of total size: 87 bytes
17/05/02 06:00:26 INFO mapred.LocalJobRunner: 2 / 2 copied.
17/05/02 06:00:27 INFO zlib.ZlibFactory: Successfully loaded & initialized
native-zlib library
17/05/02 06:00:27 INFO compress.CodecPool: Got brand-new compressor [.deflate]
17/05/02 06:00:27 INFO Configuration.deprecation: mapred.skip.on is deprecated.
Instead, use mapreduce.job.skiprecords
17/05/02 06:00:27 INFO crawl.Injector: Injector: overwrite: false
17/05/02 06:00:27 INFO crawl.Injector: Injector: update: false
17/05/02 06:00:27 INFO mapreduce.Job: Job job_local307378419_0001 running in
uber mode : false
17/05/02 06:00:27 INFO mapreduce.Job: map 100% reduce 0%
17/05/02 06:00:27 INFO mapred.Task: Task:attempt_local307378419_0001_r_000000_0
is done. And is in the process of committing
17/05/02 06:00:27 INFO mapred.LocalJobRunner: 2 / 2 copied.
17/05/02 06:00:27 INFO mapred.Task: Task attempt_local307378419_0001_r_000000_0
is allowed to commit now
17/05/02 06:00:27 INFO output.FileOutputCommitter: Saved output of task
'attempt_local307378419_0001_r_000000_0' to
hdfs://localhost:9000/user/root/crawl/crawldb/crawldb-921346783/_temporary/0/task_local307378419_0001_r_000000
17/05/02 06:00:27 INFO mapred.LocalJobRunner: reduce > reduce
17/05/02 06:00:27 INFO mapred.Task: Task
'attempt_local307378419_0001_r_000000_0' done.
17/05/02 06:00:27 INFO mapred.LocalJobRunner: Finishing task:
attempt_local307378419_0001_r_000000_0
17/05/02 06:00:27 INFO mapred.LocalJobRunner: reduce task executor complete.
17/05/02 06:00:28 INFO mapreduce.Job: map 100% reduce 100%
17/05/02 06:00:28 INFO mapreduce.Job: Job job_local307378419_0001 completed
successfully
17/05/02 06:00:28 INFO mapreduce.Job: Counters: 37
File System Counters
FILE: Number of bytes read=652298479
FILE: Number of bytes written=658557993
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=492
HDFS: Number of bytes written=365
HDFS: Number of read operations=46
HDFS: Number of large read operations=0
HDFS: Number of write operations=13
Map-Reduce Framework
Map input records=2
Map output records=2
Map output bytes=108
Map output materialized bytes=124
Input split bytes=570
Combine input records=0
Combine output records=0
Reduce input groups=1
Reduce shuffle bytes=124
Reduce input records=2
Reduce output records=1
Spilled Records=4
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=15
Total committed heap usage (bytes)=1044381696
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
injector
urls_injected=1
urls_merged=1
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=365
17/05/02 06:00:28 INFO crawl.Injector: Injector: Total urls rejected by
filters: 0
17/05/02 06:00:28 INFO crawl.Injector: Injector: Total urls injected after
normalization and filtering: 1
17/05/02 06:00:28 INFO crawl.Injector: Injector: Total urls injected but
already in CrawlDb: 1
17/05/02 06:00:28 INFO crawl.Injector: Injector: Total new urls injected: 0
17/05/02 06:00:28 INFO crawl.Injector: Injector: finished at 2017-05-02
06:00:28, elapsed: 00:00:04
Tue May 2 06:00:28 CDT 2017 : Iteration 1 of 2
Generating a new segment
/data/apache-nutch-1.13/runtime/deploy/bin/nutch generate -D
mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D
mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN 50000
-numFetchers 1 -noFilter
+ cygwin=false
+ case "`uname`" in
++ uname
+ THIS=/data/apache-nutch-1.13/runtime/deploy/bin/nutch
+ '[' -h /data/apache-nutch-1.13/runtime/deploy/bin/nutch ']'
+ '[' 18 = 0 ']'
+ COMMAND=generate
+ shift
++ dirname /data/apache-nutch-1.13/runtime/deploy/bin/nutch
+ THIS_DIR=/data/apache-nutch-1.13/runtime/deploy/bin
++ cd /data/apache-nutch-1.13/runtime/deploy/bin/..
++ pwd
+ NUTCH_HOME=/data/apache-nutch-1.13/runtime/deploy
+ '[' '' '!=' '' ']'
+ '[' /usr/lib/jvm/java-8-oracle/jre/ = '' ']'
+ local=true
+ '[' -f /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job ']'
+ local=false
+ for f in '"$NUTCH_HOME"/*nutch*.job'
+ NUTCH_JOB=/data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job
+ false
+ JAVA=/usr/lib/jvm/java-8-oracle/jre//bin/java
+ JAVA_HEAP_MAX=-Xmx1000m
+ '[' '' '!=' '' ']'
+ CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf
+
CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf:/usr/lib/jvm/java-8-oracle/jre//lib/tools.jar
+ IFS=
+ false
+ false
+ JAVA_LIBRARY_PATH=
+ '[' -d /data/apache-nutch-1.13/runtime/deploy/lib/native ']'
+ '[' false = true -a X '!=' X ']'
+ unset IFS
+ '[' '' = '' ']'
+ NUTCH_LOG_DIR=/data/apache-nutch-1.13/runtime/deploy/logs
+ '[' '' = '' ']'
+ NUTCH_LOGFILE=hadoop.log
+ false
+ NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR")
+ NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Dhadoop.log.file="$NUTCH_LOGFILE")
+ '[' x '!=' x ']'
+ '[' generate = crawl ']'
+ '[' generate = inject ']'
+ '[' generate = generate ']'
+ CLASS=org.apache.nutch.crawl.Generator
+ EXEC_CALL=(hadoop jar "$NUTCH_JOB")
+ false
++ which hadoop
++ wc -l
+ '[' 1 -eq 0 ']'
+ exec hadoop jar /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job
org.apache.nutch.crawl.Generator -D mapreduce.job.reduces=2 -D
mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D
mapreduce.map.speculative=false -D mapreduce.map.output.compress=true
crawl/crawldb crawl/segments -topN 50000 -numFetchers 1 -noFilter
17/05/02 06:00:32 INFO crawl.Generator: Generator: starting at 2017-05-02
06:00:32
17/05/02 06:00:32 INFO crawl.Generator: Generator: Selecting best-scoring urls
due for fetch.
17/05/02 06:00:32 INFO crawl.Generator: Generator: filtering: false
17/05/02 06:00:32 INFO crawl.Generator: Generator: normalizing: true
17/05/02 06:00:32 INFO crawl.Generator: Generator: topN: 50000
17/05/02 06:00:32 INFO Configuration.deprecation: session.id is deprecated.
Instead, use dfs.metrics.session-id
17/05/02 06:00:32 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
17/05/02 06:00:32 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with
processName=JobTracker, sessionId= - already initialized
17/05/02 06:00:33 INFO mapred.FileInputFormat: Total input files to process : 1
17/05/02 06:00:33 INFO mapreduce.JobSubmitter: number of splits:1
17/05/02 06:00:33 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_local1706016672_0001
17/05/02 06:00:33 INFO mapreduce.Job: The url to track the job:
http://localhost:8080/
17/05/02 06:00:33 INFO mapred.LocalJobRunner: OutputCommitter set in config null
17/05/02 06:00:33 INFO mapreduce.Job: Running job: job_local1706016672_0001
17/05/02 06:00:33 INFO mapred.LocalJobRunner: OutputCommitter is
org.apache.hadoop.mapred.FileOutputCommitter
17/05/02 06:00:33 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
17/05/02 06:00:33 INFO output.FileOutputCommitter: FileOutputCommitter skip
cleanup _temporary folders under output directory:false, ignore cleanup
failures: false
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Waiting for map tasks
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Starting task:
attempt_local1706016672_0001_m_000000_0
17/05/02 06:00:34 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
17/05/02 06:00:34 INFO output.FileOutputCommitter: FileOutputCommitter skip
cleanup _temporary folders under output directory:false, ignore cleanup
failures: false
17/05/02 06:00:34 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:34 INFO mapred.MapTask: Processing split:
hdfs://localhost:9000/user/root/crawl/crawldb/current/part-r-00000/data:0+148
17/05/02 06:00:34 INFO mapred.MapTask: numReduceTasks: 2
17/05/02 06:00:34 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
17/05/02 06:00:34 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
17/05/02 06:00:34 INFO mapred.MapTask: soft limit at 83886080
17/05/02 06:00:34 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
17/05/02 06:00:34 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
17/05/02 06:00:34 INFO mapred.MapTask: Map output collector class =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
17/05/02 06:00:34 INFO plugin.PluginRepository: Plugins: looking in:
/tmp/hadoop-unjar7886623985863993949/classes/plugins
17/05/02 06:00:34 INFO plugin.PluginRepository: Plugin Auto-activation mode:
[true]
17/05/02 06:00:34 INFO plugin.PluginRepository: Registered Plugins:
17/05/02 06:00:34 INFO plugin.PluginRepository: Regex URL Filter
(urlfilter-regex)
17/05/02 06:00:34 INFO plugin.PluginRepository: Html Parse Plug-in
(parse-html)
17/05/02 06:00:34 INFO plugin.PluginRepository: HTTP Framework
(lib-http)
17/05/02 06:00:34 INFO plugin.PluginRepository: the nutch core
extension points (nutch-extensionpoints)
17/05/02 06:00:34 INFO plugin.PluginRepository: Basic Indexing Filter
(index-basic)
17/05/02 06:00:34 INFO plugin.PluginRepository: Anchor Indexing Filter
(index-anchor)
17/05/02 06:00:34 INFO plugin.PluginRepository: Tika Parser Plug-in
(parse-tika)
17/05/02 06:00:34 INFO plugin.PluginRepository: Basic URL Normalizer
(urlnormalizer-basic)
17/05/02 06:00:34 INFO plugin.PluginRepository: Regex URL Filter
Framework (lib-regex-filter)
17/05/02 06:00:34 INFO plugin.PluginRepository: Regex URL Normalizer
(urlnormalizer-regex)
17/05/02 06:00:34 INFO plugin.PluginRepository: CyberNeko HTML Parser
(lib-nekohtml)
17/05/02 06:00:34 INFO plugin.PluginRepository: OPIC Scoring Plug-in
(scoring-opic)
17/05/02 06:00:34 INFO plugin.PluginRepository: Pass-through URL
Normalizer (urlnormalizer-pass)
17/05/02 06:00:34 INFO plugin.PluginRepository: Http Protocol Plug-in
(protocol-http)
17/05/02 06:00:34 INFO plugin.PluginRepository: ElasticIndexWriter
(indexer-elastic)
17/05/02 06:00:34 INFO plugin.PluginRepository: Registered Extension-Points:
17/05/02 06:00:34 INFO plugin.PluginRepository: Nutch Content Parser
(org.apache.nutch.parse.Parser)
17/05/02 06:00:34 INFO plugin.PluginRepository: Nutch URL Filter
(org.apache.nutch.net.URLFilter)
17/05/02 06:00:34 INFO plugin.PluginRepository: HTML Parse Filter
(org.apache.nutch.parse.HtmlParseFilter)
17/05/02 06:00:34 INFO plugin.PluginRepository: Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
17/05/02 06:00:34 INFO plugin.PluginRepository: Nutch URL Normalizer
(org.apache.nutch.net.URLNormalizer)
17/05/02 06:00:34 INFO plugin.PluginRepository: Nutch Publisher
(org.apache.nutch.publisher.NutchPublisher)
17/05/02 06:00:34 INFO plugin.PluginRepository: Nutch Protocol
(org.apache.nutch.protocol.Protocol)
17/05/02 06:00:34 INFO plugin.PluginRepository: Nutch URL Ignore
Exemption Filter (org.apache.nutch.net.URLExemptionFilter)
17/05/02 06:00:34 INFO plugin.PluginRepository: Nutch Index Writer
(org.apache.nutch.indexer.IndexWriter)
17/05/02 06:00:34 INFO plugin.PluginRepository: Nutch Segment Merge
Filter (org.apache.nutch.segment.SegmentMergeFilter)
17/05/02 06:00:34 INFO plugin.PluginRepository: Nutch Indexing Filter
(org.apache.nutch.indexer.IndexingFilter)
17/05/02 06:00:34 INFO conf.Configuration: found resource regex-urlfilter.txt
at file:/tmp/hadoop-unjar7886623985863993949/regex-urlfilter.txt
17/05/02 06:00:34 INFO conf.Configuration: found resource regex-normalize.xml
at file:/tmp/hadoop-unjar7886623985863993949/regex-normalize.xml
17/05/02 06:00:34 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl:
org.apache.nutch.crawl.DefaultFetchSchedule
17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: defaultInterval=2592000
17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: maxInterval=7776000
17/05/02 06:00:34 INFO regex.RegexURLNormalizer: can't find rules for scope
'partition', using default
17/05/02 06:00:34 INFO mapred.LocalJobRunner:
17/05/02 06:00:34 INFO mapred.MapTask: Starting flush of map output
17/05/02 06:00:34 INFO mapred.MapTask: Spilling map output
17/05/02 06:00:34 INFO mapred.MapTask: bufstart = 0; bufend = 83; bufvoid =
104857600
17/05/02 06:00:34 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend =
26214396(104857584); length = 1/6553600
17/05/02 06:00:34 INFO zlib.ZlibFactory: Successfully loaded & initialized
native-zlib library
17/05/02 06:00:34 INFO compress.CodecPool: Got brand-new compressor [.deflate]
17/05/02 06:00:34 INFO mapred.MapTask: Finished spill 0
17/05/02 06:00:34 INFO mapred.Task:
Task:attempt_local1706016672_0001_m_000000_0 is done. And is in the process of
committing
17/05/02 06:00:34 INFO mapred.LocalJobRunner:
hdfs://localhost:9000/user/root/crawl/crawldb/current/part-r-00000/data:0+148
17/05/02 06:00:34 INFO mapred.Task: Task
'attempt_local1706016672_0001_m_000000_0' done.
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Finishing task:
attempt_local1706016672_0001_m_000000_0
17/05/02 06:00:34 INFO mapred.LocalJobRunner: map task executor complete.
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Waiting for reduce tasks
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Starting task:
attempt_local1706016672_0001_r_000000_0
17/05/02 06:00:34 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
17/05/02 06:00:34 INFO output.FileOutputCommitter: FileOutputCommitter skip
cleanup _temporary folders under output directory:false, ignore cleanup
failures: false
17/05/02 06:00:34 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:34 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin:
org.apache.hadoop.mapreduce.task.reduce.Shuffle@2fd7e5ad
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: MergerManager:
memoryLimit=334338464, maxSingleShuffleLimit=83584616,
mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
17/05/02 06:00:34 INFO reduce.EventFetcher:
attempt_local1706016672_0001_r_000000_0 Thread started: EventFetcher for
fetching Map Completion Events
17/05/02 06:00:34 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
17/05/02 06:00:34 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle
output of map attempt_local1706016672_0001_m_000000_0 decomp: 87 len: 83 to
MEMORY
17/05/02 06:00:34 INFO reduce.InMemoryMapOutput: Read 87 bytes from map-output
for attempt_local1706016672_0001_m_000000_0
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output
of size: 87, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->87
17/05/02 06:00:34 INFO reduce.EventFetcher: EventFetcher is interrupted..
Returning
17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: finalMerge called with 1
in-memory map-outputs and 0 on-disk map-outputs
17/05/02 06:00:34 INFO mapred.Merger: Merging 1 sorted segments
17/05/02 06:00:34 INFO mapred.Merger: Down to the last merge-pass, with 1
segments left of total size: 81 bytes
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merged 1 segments, 87 bytes to
disk to satisfy reduce memory limit
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merging 1 files, 91 bytes from
disk
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes
from memory into reduce
17/05/02 06:00:34 INFO mapred.Merger: Merging 1 sorted segments
17/05/02 06:00:34 INFO mapred.Merger: Down to the last merge-pass, with 1
segments left of total size: 81 bytes
17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:34 INFO conf.Configuration: found resource regex-normalize.xml
at file:/tmp/hadoop-unjar7886623985863993949/regex-normalize.xml
17/05/02 06:00:34 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl:
org.apache.nutch.crawl.DefaultFetchSchedule
17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: defaultInterval=2592000
17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: maxInterval=7776000
17/05/02 06:00:34 INFO regex.RegexURLNormalizer: can't find rules for scope
'generate_host_count', using default
17/05/02 06:00:34 INFO mapred.Task:
Task:attempt_local1706016672_0001_r_000000_0 is done. And is in the process of
committing
17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:34 INFO mapred.Task: Task
attempt_local1706016672_0001_r_000000_0 is allowed to commit now
17/05/02 06:00:34 INFO output.FileOutputCommitter: Saved output of task
'attempt_local1706016672_0001_r_000000_0' to
hdfs://localhost:9000/user/root/generate-temp-ca817ccd-332b-4fa3-afe3-dab7d80ea711/_temporary/0/task_local1706016672_0001_r_000000
17/05/02 06:00:34 INFO mapred.LocalJobRunner: reduce > reduce
17/05/02 06:00:34 INFO mapred.Task: Task
'attempt_local1706016672_0001_r_000000_0' done.
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Finishing task:
attempt_local1706016672_0001_r_000000_0
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Starting task:
attempt_local1706016672_0001_r_000001_0
17/05/02 06:00:34 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
17/05/02 06:00:34 INFO output.FileOutputCommitter: FileOutputCommitter skip
cleanup _temporary folders under output directory:false, ignore cleanup
failures: false
17/05/02 06:00:34 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:34 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin:
org.apache.hadoop.mapreduce.task.reduce.Shuffle@29cfa49
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: MergerManager:
memoryLimit=334338464, maxSingleShuffleLimit=83584616,
mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
17/05/02 06:00:34 INFO reduce.EventFetcher:
attempt_local1706016672_0001_r_000001_0 Thread started: EventFetcher for
fetching Map Completion Events
17/05/02 06:00:34 INFO reduce.LocalFetcher: localfetcher#2 about to shuffle
output of map attempt_local1706016672_0001_m_000000_0 decomp: 2 len: 14 to
MEMORY
17/05/02 06:00:34 INFO reduce.InMemoryMapOutput: Read 2 bytes from map-output
for attempt_local1706016672_0001_m_000000_0
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output
of size: 2, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->2
17/05/02 06:00:34 INFO reduce.EventFetcher: EventFetcher is interrupted..
Returning
17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: finalMerge called with 1
in-memory map-outputs and 0 on-disk map-outputs
17/05/02 06:00:34 INFO mapred.Merger: Merging 1 sorted segments
17/05/02 06:00:34 INFO mapred.Merger: Down to the last merge-pass, with 0
segments left of total size: 0 bytes
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merged 1 segments, 2 bytes to
disk to satisfy reduce memory limit
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merging 1 files, 22 bytes from
disk
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes
from memory into reduce
17/05/02 06:00:34 INFO mapred.Merger: Merging 1 sorted segments
17/05/02 06:00:34 INFO mapred.Merger: Down to the last merge-pass, with 0
segments left of total size: 0 bytes
17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:34 INFO conf.Configuration: found resource regex-normalize.xml
at file:/tmp/hadoop-unjar7886623985863993949/regex-normalize.xml
17/05/02 06:00:34 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl:
org.apache.nutch.crawl.DefaultFetchSchedule
17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: defaultInterval=2592000
17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: maxInterval=7776000
17/05/02 06:00:34 INFO mapred.Task:
Task:attempt_local1706016672_0001_r_000001_0 is done. And is in the process of
committing
17/05/02 06:00:34 INFO mapred.LocalJobRunner: reduce > reduce
17/05/02 06:00:34 INFO mapred.Task: Task
'attempt_local1706016672_0001_r_000001_0' done.
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Finishing task:
attempt_local1706016672_0001_r_000001_0
17/05/02 06:00:34 INFO mapred.LocalJobRunner: reduce task executor complete.
17/05/02 06:00:34 INFO mapreduce.Job: Job job_local1706016672_0001 running in
uber mode : false
17/05/02 06:00:34 INFO mapreduce.Job: map 100% reduce 100%
17/05/02 06:00:34 INFO mapreduce.Job: Job job_local1706016672_0001 completed
successfully
17/05/02 06:00:35 INFO mapreduce.Job: Counters: 35
File System Counters
FILE: Number of bytes read=652296139
FILE: Number of bytes written=658571046
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=444
HDFS: Number of bytes written=398
HDFS: Number of read operations=37
HDFS: Number of large read operations=0
HDFS: Number of write operations=13
Map-Reduce Framework
Map input records=1
Map output records=1
Map output bytes=83
Map output materialized bytes=97
Input split bytes=123
Combine input records=0
Combine output records=0
Reduce input groups=1
Reduce shuffle bytes=97
Reduce input records=1
Reduce output records=1
Spilled Records=2
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=8
Total committed heap usage (bytes)=1036517376
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=148
File Output Format Counters
Bytes Written=199
17/05/02 06:00:35 INFO crawl.Generator: Generator: Partitioning selected urls
for politeness.
17/05/02 06:00:36 INFO crawl.Generator: Generator: segment:
crawl/segments/20170502060036
17/05/02 06:00:36 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with
processName=JobTracker, sessionId= - already initialized
17/05/02 06:00:36 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with
processName=JobTracker, sessionId= - already initialized
17/05/02 06:00:36 INFO mapred.FileInputFormat: Total input files to process : 1
17/05/02 06:00:36 INFO mapreduce.JobSubmitter: number of splits:1
17/05/02 06:00:36 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_local1332900929_0002
17/05/02 06:00:36 INFO mapreduce.Job: The url to track the job:
http://localhost:8080/
17/05/02 06:00:36 INFO mapreduce.Job: Running job: job_local1332900929_0002
17/05/02 06:00:36 INFO mapred.LocalJobRunner: OutputCommitter set in config null
17/05/02 06:00:36 INFO mapred.LocalJobRunner: OutputCommitter is
org.apache.hadoop.mapred.FileOutputCommitter
17/05/02 06:00:36 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
17/05/02 06:00:36 INFO output.FileOutputCommitter: FileOutputCommitter skip
cleanup _temporary folders under output directory:false, ignore cleanup
failures: false
17/05/02 06:00:36 INFO mapred.LocalJobRunner: Waiting for map tasks
17/05/02 06:00:36 INFO mapred.LocalJobRunner: Starting task:
attempt_local1332900929_0002_m_000000_0
17/05/02 06:00:36 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
17/05/02 06:00:36 INFO output.FileOutputCommitter: FileOutputCommitter skip
cleanup _temporary folders under output directory:false, ignore cleanup
failures: false
17/05/02 06:00:36 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:36 INFO mapred.MapTask: Processing split:
hdfs://localhost:9000/user/root/generate-temp-ca817ccd-332b-4fa3-afe3-dab7d80ea711/fetchlist-1/part-00000:0+199
17/05/02 06:00:36 INFO mapred.MapTask: numReduceTasks: 1
17/05/02 06:00:36 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
17/05/02 06:00:36 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
17/05/02 06:00:36 INFO mapred.MapTask: soft limit at 83886080
17/05/02 06:00:36 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
17/05/02 06:00:36 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
17/05/02 06:00:36 INFO mapred.MapTask: Map output collector class =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
17/05/02 06:00:36 INFO mapred.LocalJobRunner:
17/05/02 06:00:36 INFO mapred.MapTask: Starting flush of map output
17/05/02 06:00:36 INFO mapred.MapTask: Spilling map output
17/05/02 06:00:36 INFO mapred.MapTask: bufstart = 0; bufend = 104; bufvoid =
104857600
17/05/02 06:00:36 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend =
26214396(104857584); length = 1/6553600
17/05/02 06:00:36 INFO mapred.MapTask: Finished spill 0
17/05/02 06:00:36 INFO mapred.Task:
Task:attempt_local1332900929_0002_m_000000_0 is done. And is in the process of
committing
17/05/02 06:00:36 INFO mapred.LocalJobRunner:
hdfs://localhost:9000/user/root/generate-temp-ca817ccd-332b-4fa3-afe3-dab7d80ea711/fetchlist-1/part-00000:0+199
17/05/02 06:00:36 INFO mapred.Task: Task
'attempt_local1332900929_0002_m_000000_0' done.
17/05/02 06:00:36 INFO mapred.LocalJobRunner: Finishing task:
attempt_local1332900929_0002_m_000000_0
17/05/02 06:00:36 INFO mapred.LocalJobRunner: map task executor complete.
17/05/02 06:00:36 INFO mapred.LocalJobRunner: Waiting for reduce tasks
17/05/02 06:00:36 INFO mapred.LocalJobRunner: Starting task:
attempt_local1332900929_0002_r_000000_0
17/05/02 06:00:36 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
17/05/02 06:00:36 INFO output.FileOutputCommitter: FileOutputCommitter skip
cleanup _temporary folders under output directory:false, ignore cleanup
failures: false
17/05/02 06:00:36 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:36 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin:
org.apache.hadoop.mapreduce.task.reduce.Shuffle@57dcd1f6
17/05/02 06:00:36 INFO reduce.MergeManagerImpl: MergerManager:
memoryLimit=334338464, maxSingleShuffleLimit=83584616,
mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
17/05/02 06:00:36 INFO reduce.EventFetcher:
attempt_local1332900929_0002_r_000000_0 Thread started: EventFetcher for
fetching Map Completion Events
17/05/02 06:00:36 INFO reduce.LocalFetcher: localfetcher#3 about to shuffle
output of map attempt_local1332900929_0002_m_000000_0 decomp: 108 len: 82 to
MEMORY
17/05/02 06:00:36 INFO reduce.InMemoryMapOutput: Read 108 bytes from map-output
for attempt_local1332900929_0002_m_000000_0
17/05/02 06:00:36 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output
of size: 108, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory
->108
17/05/02 06:00:36 INFO reduce.EventFetcher: EventFetcher is interrupted..
Returning
17/05/02 06:00:36 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:36 INFO reduce.MergeManagerImpl: finalMerge called with 1
in-memory map-outputs and 0 on-disk map-outputs
17/05/02 06:00:36 INFO mapred.Merger: Merging 1 sorted segments
17/05/02 06:00:36 INFO mapred.Merger: Down to the last merge-pass, with 1
segments left of total size: 81 bytes
17/05/02 06:00:36 INFO reduce.MergeManagerImpl: Merged 1 segments, 108 bytes to
disk to satisfy reduce memory limit
17/05/02 06:00:36 INFO reduce.MergeManagerImpl: Merging 1 files, 90 bytes from
disk
17/05/02 06:00:36 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes
from memory into reduce
17/05/02 06:00:36 INFO mapred.Merger: Merging 1 sorted segments
17/05/02 06:00:36 INFO mapred.Merger: Down to the last merge-pass, with 1
segments left of total size: 81 bytes
17/05/02 06:00:36 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:36 INFO mapred.Task:
Task:attempt_local1332900929_0002_r_000000_0 is done. And is in the process of
committing
17/05/02 06:00:36 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:36 INFO mapred.Task: Task
attempt_local1332900929_0002_r_000000_0 is allowed to commit now
17/05/02 06:00:36 INFO output.FileOutputCommitter: Saved output of task
'attempt_local1332900929_0002_r_000000_0' to
hdfs://localhost:9000/user/root/crawl/segments/20170502060036/crawl_generate/_temporary/0/task_local1332900929_0002_r_000000
17/05/02 06:00:36 INFO mapred.LocalJobRunner: reduce > reduce
17/05/02 06:00:36 INFO mapred.Task: Task
'attempt_local1332900929_0002_r_000000_0' done.
17/05/02 06:00:36 INFO mapred.LocalJobRunner: Finishing task:
attempt_local1332900929_0002_r_000000_0
17/05/02 06:00:36 INFO mapred.LocalJobRunner: reduce task executor complete.
17/05/02 06:00:37 INFO mapreduce.Job: Job job_local1332900929_0002 running in
uber mode : false
17/05/02 06:00:37 INFO mapreduce.Job: map 100% reduce 100%
17/05/02 06:00:37 INFO mapreduce.Job: Job job_local1332900929_0002 completed
successfully
17/05/02 06:00:37 INFO mapreduce.Job: Counters: 35
File System Counters
FILE: Number of bytes read=869728356
FILE: Number of bytes written=878093356
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=694
HDFS: Number of bytes written=567
HDFS: Number of read operations=53
HDFS: Number of large read operations=0
HDFS: Number of write operations=18
Map-Reduce Framework
Map input records=1
Map output records=1
Map output bytes=104
Map output materialized bytes=82
Input split bytes=157
Combine input records=0
Combine output records=0
Reduce input groups=1
Reduce shuffle bytes=82
Reduce input records=1
Reduce output records=1
Spilled Records=2
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=0
Total committed heap usage (bytes)=901775360
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=199
File Output Format Counters
Bytes Written=169
17/05/02 06:00:37 INFO crawl.Generator: Generator: finished at 2017-05-02
06:00:37, elapsed: 00:00:05
Operating on segment : 20170502060036
Fetching : 20170502060036
/data/apache-nutch-1.13/runtime/deploy/bin/nutch fetch -D
mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D
mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180
crawl/segments/20170502060036 -noParsing -threads 50
+ cygwin=false
+ case "`uname`" in
++ uname
+ THIS=/data/apache-nutch-1.13/runtime/deploy/bin/nutch
+ '[' -h /data/apache-nutch-1.13/runtime/deploy/bin/nutch ']'
+ '[' 17 = 0 ']'
+ COMMAND=fetch
+ shift
++ dirname /data/apache-nutch-1.13/runtime/deploy/bin/nutch
+ THIS_DIR=/data/apache-nutch-1.13/runtime/deploy/bin
++ cd /data/apache-nutch-1.13/runtime/deploy/bin/..
++ pwd
+ NUTCH_HOME=/data/apache-nutch-1.13/runtime/deploy
+ '[' '' '!=' '' ']'
+ '[' /usr/lib/jvm/java-8-oracle/jre/ = '' ']'
+ local=true
+ '[' -f /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job ']'
+ local=false
+ for f in '"$NUTCH_HOME"/*nutch*.job'
+ NUTCH_JOB=/data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job
+ false
+ JAVA=/usr/lib/jvm/java-8-oracle/jre//bin/java
+ JAVA_HEAP_MAX=-Xmx1000m
+ '[' '' '!=' '' ']'
+ CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf
+
CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf:/usr/lib/jvm/java-8-oracle/jre//lib/tools.jar
+ IFS=
+ false
+ false
+ JAVA_LIBRARY_PATH=
+ '[' -d /data/apache-nutch-1.13/runtime/deploy/lib/native ']'
+ '[' false = true -a X '!=' X ']'
+ unset IFS
+ '[' '' = '' ']'
+ NUTCH_LOG_DIR=/data/apache-nutch-1.13/runtime/deploy/logs
+ '[' '' = '' ']'
+ NUTCH_LOGFILE=hadoop.log
+ false
+ NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR")
+ NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Dhadoop.log.file="$NUTCH_LOGFILE")
+ '[' x '!=' x ']'
+ '[' fetch = crawl ']'
+ '[' fetch = inject ']'
+ '[' fetch = generate ']'
+ '[' fetch = freegen ']'
+ '[' fetch = fetch ']'
+ CLASS=org.apache.nutch.fetcher.Fetcher
+ EXEC_CALL=(hadoop jar "$NUTCH_JOB")
+ false
++ which hadoop
++ wc -l
+ '[' 1 -eq 0 ']'
+ exec hadoop jar /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job
org.apache.nutch.fetcher.Fetcher -D mapreduce.job.reduces=2 -D
mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D
mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D
fetcher.timelimit.mins=180 crawl/segments/20170502060036 -noParsing -threads 50
17/05/02 06:00:43 INFO fetcher.Fetcher: Fetcher: starting at 2017-05-02 06:00:43
17/05/02 06:00:43 INFO fetcher.Fetcher: Fetcher: segment:
crawl/segments/20170502060036
17/05/02 06:00:43 INFO fetcher.Fetcher: Fetcher Timelimit set for :
1493733643194
17/05/02 06:00:44 INFO Configuration.deprecation: session.id is deprecated.
Instead, use dfs.metrics.session-id
17/05/02 06:00:44 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
17/05/02 06:00:44 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with
processName=JobTracker, sessionId= - already initialized
17/05/02 06:00:44 ERROR fetcher.Fetcher: Fetcher:
java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:9000/user/root/crawl/segments/20170502060036/crawl_fetch, expected: file:///
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665)
    at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:86)
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:630)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:861)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:625)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:435)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1436)
    at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:55)
    at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:270)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:141)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:870)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:486)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:521)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:495)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
Error running:
/data/apache-nutch-1.13/runtime/deploy/bin/nutch fetch -D
mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D
mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180
crawl/segments/20170502060036 -noParsing -threads 50
Failed with exit value 255.
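A note on the error itself: "Wrong FS: hdfs://... expected: file:///" is what
Hadoop's FileSystem.checkPath throws when an hdfs:// path is handed to the local
filesystem implementation, and the trace shows this happening inside
FetcherOutputFormat.checkOutputSpecs while it tests whether crawl_fetch already
exists. I also notice that all of the jobs above run through
mapred.LocalJobRunner even though the data is on HDFS. Before downgrading I will
sanity-check what the client resolves as the default filesystem; a rough sketch,
with paths from my setup:

    # Should print hdfs://localhost:9000 if core-site.xml is visible to the client:
    hdfs getconf -confKey fs.defaultFS
    # The segment the Fetcher refuses to touch does exist on HDFS:
    hadoop fs -ls hdfs://localhost:9000/user/root/crawl/segments/20170502060036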
-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]]
Sent: 02 May 2017 13:54
To: [email protected]
Subject: Re: Wrong FS exception in Fetcher
Hi Yossi,
strange error, indeed. Is it also reproducible in pseudo-distributed mode using
Hadoop 2.7.2, the version Nutch depends on?
Could you also add the line
set -x
to bin/nutch and run bin/crawl again to see how all steps are executed.
Thanks,
Sebastian
On 04/30/2017 04:04 PM, Yossi Tamari wrote:
> Hi,
>
>
>
> I'm trying to run Nutch 1.13 on Hadoop 2.8.0 in pseudo-distributed mode.
>
> Running the command:
>
> deploy/bin/crawl urls crawl 2
>
> The Injector and Generator run successfully, but in the Fetcher I get the
> following error:
>
> 17/04/30 08:43:48 ERROR fetcher.Fetcher: Fetcher:
> java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:9000/user/root/crawl/segments/20170430084337/crawl_fetch, expected: file:///
>     at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665)
>     at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:86)
>     at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:630)
>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:861)
>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:625)
>     at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:435)
>     at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1436)
>     at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:55)
>     at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:270)
>     at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:141)
>     at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
>     at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338)
>     at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
>     at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:870)
>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:486)
>     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:521)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>     at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:495)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
>
>
>
> Error running:
>
> /data/apache-nutch-1.13/runtime/deploy/bin/nutch fetch -D
> mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D
> mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
> mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180
> crawl/segments/20170430084337 -noParsing -threads 50
>
> Failed with exit value 255.
>
>
>
>
>
> Any ideas how to fix this?
>
>
>
> Thanks,
>
> Yossi.
>
>