Hi all,

When I run Nutch's CrawlDb reader via Hadoop to get stats for my crawl database, I get the error below. Is this a known issue?

Thanks,
Viksit
sudo -u hdfs hadoop jar /opt/nutch-build/build/nutch-1.2.job
org.apache.nutch.crawl.CrawlDbReader
/crawl/crawl-dir-1305167589/crawldb -stats
11/05/12 19:48:08 INFO crawl.CrawlDbReader: CrawlDb statistics start: /crawl/crawl-dir-1305167589/crawldb
11/05/12 19:48:08 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/05/12 19:48:09 INFO mapred.FileInputFormat: Total input paths to process : 10
11/05/12 19:48:09 INFO mapred.JobClient: Running job: job_201105120113_0202
11/05/12 19:48:10 INFO mapred.JobClient: map 0% reduce 0%
11/05/12 19:48:18 INFO mapred.JobClient: map 10% reduce 0%
11/05/12 19:48:19 INFO mapred.JobClient: map 20% reduce 0%
11/05/12 19:48:20 INFO mapred.JobClient: map 30% reduce 0%
11/05/12 19:48:23 INFO mapred.JobClient: map 40% reduce 0%
11/05/12 19:48:24 INFO mapred.JobClient: map 50% reduce 0%
11/05/12 19:48:25 INFO mapred.JobClient: map 60% reduce 0%
11/05/12 19:48:27 INFO mapred.JobClient: map 70% reduce 0%
11/05/12 19:48:28 INFO mapred.JobClient: map 80% reduce 0%
11/05/12 19:48:30 INFO mapred.JobClient: map 90% reduce 0%
11/05/12 19:48:31 INFO mapred.JobClient: map 100% reduce 0%
11/05/12 19:52:22 INFO mapred.JobClient: map 100% reduce 3%
11/05/12 19:52:23 INFO mapred.JobClient: map 100% reduce 10%
11/05/12 19:52:38 INFO mapred.JobClient: map 100% reduce 13%
11/05/12 19:52:39 INFO mapred.JobClient: map 100% reduce 20%
11/05/12 19:52:48 INFO mapred.JobClient: map 100% reduce 30%
11/05/12 19:53:01 INFO mapred.JobClient: map 100% reduce 33%
11/05/12 19:53:02 INFO mapred.JobClient: map 100% reduce 40%
11/05/12 19:53:20 INFO mapred.JobClient: map 100% reduce 43%
11/05/12 19:53:21 INFO mapred.JobClient: map 100% reduce 50%
11/05/12 19:53:36 INFO mapred.JobClient: map 100% reduce 53%
11/05/12 19:53:38 INFO mapred.JobClient: map 100% reduce 60%
11/05/12 19:53:44 INFO mapred.JobClient: map 100% reduce 63%
11/05/12 19:53:46 INFO mapred.JobClient: map 100% reduce 70%
11/05/12 19:53:54 INFO mapred.JobClient: map 100% reduce 73%
11/05/12 19:53:55 INFO mapred.JobClient: map 100% reduce 80%
11/05/12 19:53:57 INFO mapred.JobClient: map 100% reduce 90%
11/05/12 19:54:05 INFO mapred.JobClient: map 100% reduce 100%
11/05/12 19:54:07 INFO mapred.JobClient: Job complete: job_201105120113_0202
11/05/12 19:54:07 INFO mapred.JobClient: Counters: 23
11/05/12 19:54:07 INFO mapred.JobClient: Job Counters
11/05/12 19:54:07 INFO mapred.JobClient: Launched reduce tasks=10
11/05/12 19:54:07 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=46180
11/05/12 19:54:07 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/05/12 19:54:07 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/05/12 19:54:07 INFO mapred.JobClient: Launched map tasks=10
11/05/12 19:54:07 INFO mapred.JobClient: Data-local map tasks=10
11/05/12 19:54:07 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=87373
11/05/12 19:54:07 INFO mapred.JobClient: FileSystemCounters
11/05/12 19:54:07 INFO mapred.JobClient: FILE_BYTES_READ=34517
11/05/12 19:54:07 INFO mapred.JobClient: HDFS_BYTES_READ=111602383
11/05/12 19:54:07 INFO mapred.JobClient: FILE_BYTES_WRITTEN=1395398
11/05/12 19:54:07 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1871
11/05/12 19:54:07 INFO mapred.JobClient: Map-Reduce Framework
11/05/12 19:54:07 INFO mapred.JobClient: Reduce input groups=49
11/05/12 19:54:07 INFO mapred.JobClient: Combine output records=219
11/05/12 19:54:07 INFO mapred.JobClient: Map input records=808925
11/05/12 19:54:07 INFO mapred.JobClient: Reduce shuffle bytes=3161
11/05/12 19:54:07 INFO mapred.JobClient: Reduce output records=49
11/05/12 19:54:07 INFO mapred.JobClient: Spilled Records=657
11/05/12 19:54:07 INFO mapred.JobClient: Map output bytes=42873025
11/05/12 19:54:07 INFO mapred.JobClient: Map input bytes=111599813
11/05/12 19:54:07 INFO mapred.JobClient: Combine input records=3235700
11/05/12 19:54:07 INFO mapred.JobClient: Map output records=3235700
11/05/12 19:54:07 INFO mapred.JobClient: SPLIT_RAW_BYTES=1710
11/05/12 19:54:07 INFO mapred.JobClient: Reduce input records=219
Exception in thread "main" java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at java.io.DataInputStream.readFully(DataInputStream.java:152)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1465)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1437)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1419)
    at org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:89)
    at org.apache.nutch.crawl.CrawlDbReader.processStatJob(CrawlDbReader.java:320)
    at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:502)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
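P.S. My guess (unconfirmed) is that the EOFException inside SequenceFile.Reader's init means readFully couldn't even get the file header, i.e. one of the part files under the stats output directory is zero-length. A quick local sketch of that check, against a throwaway directory standing in for the job output dir (file names and the "part-" prefix are just illustrative):

```python
import os
import tempfile

def find_empty_parts(dirpath):
    """Return names of zero-length part files (candidates for the header EOF)."""
    return sorted(
        name for name in os.listdir(dirpath)
        if name.startswith("part-")
        and os.path.getsize(os.path.join(dirpath, name)) == 0
    )

# Demo: fake output dir with one non-empty and one empty part file.
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "part-00000"), "wb") as f:
    f.write(b"SEQ...header...")          # has at least header bytes
open(os.path.join(demo, "part-00001"), "wb").close()  # empty: would trip EOFException

print(find_empty_parts(demo))  # -> ['part-00001']
```

On the cluster the equivalent would be listing the job's output directory with `hadoop fs -ls` and looking for 0-byte files.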