I fetched 3 segments and then ran updatedb on those 3 segments. The updatedb
job completes, but the crawldb is not updated (I checked URLs in the crawldb).
The lock file and temp directory are still left in the crawldb directory.
Apparently updatedb stops before the merge is done. There is only one error
message:
2010-09-26 16:02:48,122 WARN mapred.TaskTracker - Error running child
java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:226)
    at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:67)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.close(DFSClient.java:1678)
    at java.io.FilterInputStream.close(FilterInputStream.java:155)
    at org.apache.hadoop.io.SequenceFile$Reader.close(SequenceFile.java:1584)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.close(SequenceFileRecordReader.java:125)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:198)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:362)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
What causes updatedb to stop?
The updatedb job status is here:
map() completion: 1.0
reduce() completion: 1.0
Counters: 18
  Job Counters
    Launched reduce tasks=8
    Launched map tasks=171
    Data-local map tasks=171
  FileSystemCounters
    FILE_BYTES_READ=31437553119
    HDFS_BYTES_READ=17803591396
    FILE_BYTES_WRITTEN=47532638653
    HDFS_BYTES_WRITTEN=7460484375
  Map-Reduce Framework
    Reduce input groups=45926460
    Combine output records=0
    Map input records=139824153
    Reduce shuffle bytes=16103425228
    Reduce output records=45926460
    Spilled Records=409880576
    Map output bytes=15813496752
    Map input bytes=17803440460
    Combine input records=0
    Map output records=137962810
    Reduce input records=137962810
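
My only guess so far: Hadoop caches FileSystem instances, so FileSystem.get(conf)
hands the same object to every caller in the task JVM, and if any code (a
plugin, a mapper, whatever) closes its handle, every other reader in that JVM
then fails with "Filesystem closed". A minimal standalone sketch of that
behavior (the path is made up, and it assumes fs.default.name points at HDFS):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsCacheDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();

    // FileSystem.get() returns a cached instance keyed by scheme and
    // authority, so both handles below are the very same object.
    FileSystem fs1 = FileSystem.get(conf);
    FileSystem fs2 = FileSystem.get(conf);
    System.out.println("same instance: " + (fs1 == fs2)); // prints true

    // If any code path closes "its" handle...
    fs1.close();

    // ...every other user of the cached instance is broken too. With HDFS
    // as the default filesystem, this throws
    // java.io.IOException: Filesystem closed -- the same error as above.
    fs2.open(new Path("/tmp/some-file"));
  }
}

If that is what's happening here, the map/reduce work itself can still finish
(hence the 1.0 completion above) while the record-reader close at task
teardown fails, and updatedb would bail out before installing the merged
crawldb, leaving the lock file and temp directory behind. What I can't tell
is which component closes the filesystem.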
thanks
aj
--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA