Hi,

I've suddenly been having the JobTracker freeze up every couple of hours when it goes into a loop trying to write job history files.
The error shows up in various jobs, but it's always on writing the "_logs/history" files. I'm running MRv1 on Hadoop 2.0.0-cdh4.4.0. Here's a sample error:

2013-10-25 01:59:54,445 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /user/etl/pipeline/stage02/b0c6fc02-1729-4a57-8799-553f4dd789a4/_logs/history/job_201310242314_0013_1382663618303_gxetl_GX-ETL.Bucketer retrying..

I have to stop and restart the JobTracker, and then it happens again; the intervals between errors have been getting shorter.

I found this ticket: https://issues.apache.org/jira/browse/HDFS-1059 but I ran fsck and the report says 0 corrupt and 0 under-replicated blocks.

I also found this thread: http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201110.mbox/%3ccaf8-mnf7p_kr8snhbng1cdj70vget58_v+jnma21owymrc1...@mail.gmail.com%3E

I'm not familiar with the different IO schedulers, so before I change this on all our datanodes - *does anyone recommend using deadline instead of CFQ?*

Our datanodes use the ext4 filesystem and have 24 drives each. We checked for bad drives and found one that wasn't responding; we pulled it from that machine's configuration, but the errors keep happening.

Any other advice on addressing this infinite loop, beyond the IO scheduler, is much appreciated.

Thanks,
Alex
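
P.S. In case it's useful context, below is roughly what I was planning to run on one datanode to check (and, if deadline is the recommendation, switch) the scheduler per drive before touching the whole cluster. It's just a sketch and assumes the drives show up as /sys/block/sd*; I haven't rolled it out yet.

#!/usr/bin/env python
# Sketch: print the active I/O scheduler for each disk, and optionally
# switch all of them (e.g. to "deadline"). Assumes disks appear as /sys/block/sd*.
import glob, sys

new_sched = sys.argv[1] if len(sys.argv) > 1 else None  # e.g. "deadline"

for path in sorted(glob.glob('/sys/block/sd*/queue/scheduler')):
    with open(path) as f:
        # the currently active scheduler is shown in [brackets]
        print('%s: %s' % (path, f.read().strip()))
    if new_sched:
        with open(path, 'w') as f:  # needs root
            f.write(new_sched)

As I understand it a sysfs change like this doesn't survive a reboot, so if deadline is the way to go I'd also set it permanently (e.g. elevator=deadline on the kernel command line) - but I'd like to hear whether it's worth doing at all first.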
