This is the log ...

2014-02-06 17:29:19,087 [Thread-42] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Reduce phase detected, estimating # of required reducers.
2014-02-06 17:29:19,087 [Thread-42] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
2014-02-06 17:29:19,087 [Thread-42] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=100000000 maxReducers=999 totalInputFileSize=-1
2014-02-06 17:29:19,087 [Thread-42] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Could not estimate number of reducers and no requested or default parallelism set. Defaulting to 1 reducer.
2014-02-06 17:29:19,087 [Thread-42] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
2014-02-06 17:29:19,104 [Thread-42] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.

InputSizeReducerEstimator cannot determine the total size of the input files, so it does not estimate the number of reducers.

But I believe I gave the correct Hadoop file path. I tried many possible paths, such as:

relative-path/to/file
/user/myuser/absolute-path/to/file
hdfs://host:8020/user/myuser/absolute-path/to/file
hdfs://host:9000/user/myuser/absolute-path/to/file/change-the-hdfs-port
etc...

but Pig still failed to estimate the number of reducers.

I am almost defeated by this problem.

2014-02-06 21:31 GMT+09:00 최종원 <[email protected]>:

> Hello.
>
> My Pig job always runs with a single reducer in version 0.12.0-h2,
> because the InputSizeReducerEstimator class always returns -1 as the
> input file size.
>
> I'm not sure of the reason, but the PlanHelper.getPhysicalOperators
> method always returns an empty list.
>
>> public int estimateNumberOfReducers(Job job, MapReduceOper mapReduceOper) throws IOException {
>>     Configuration conf = job.getConfiguration();
>>     long bytesPerReducer = conf.getLong(BYTES_PER_REDUCER_PARAM, DEFAULT_BYTES_PER_REDUCER);
>>     int maxReducers = conf.getInt(MAX_REDUCER_COUNT_PARAM, DEFAULT_MAX_REDUCER_COUNT_PARAM);
>>     List<POLoad> poLoads = PlanHelper.getPhysicalOperators(mapReduceOper.mapPlan, POLoad.class);
>>     long totalInputFileSize = getTotalInputFileSize(conf, poLoads, job);
>>     log.info("BytesPerReducer=" + bytesPerReducer + " maxReducers="
>>         + maxReducers + " totalInputFileSize=" + totalInputFileSize);
>>     // if totalInputFileSize == -1, we couldn't get the input size so we can't estimate.
>>     if (totalInputFileSize == -1) { return -1; }
>>     int reducers = (int)Math.ceil((double)totalInputFileSize / bytesPerReducer);
>>     reducers = Math.max(1, reducers);
>>     reducers = Math.min(maxReducers, reducers);
>>     return reducers;
>> }
>
> The Pig job itself ends successfully.
>
> But because the reduce phase is planned as a single task, it takes a very
> long time.
>
> I tried this with Apache Hadoop 2.2.0 and Pig 0.12.0 (h2).
>
> I also tried another version, installed via Ambari 1.4.3.
>
> The result is always the same.
>
> What is wrong?
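
For reference, the arithmetic in the quoted estimateNumberOfReducers method can be tried in isolation. This is only a minimal sketch: the class name ReducerEstimateSketch and the sample input sizes are made up for illustration, the values 100000000 and 999 are the BytesPerReducer and maxReducers figures from the log, and nothing here calls Pig or Hadoop.

```java
public class ReducerEstimateSketch {

    // Same arithmetic as the quoted method, with the configuration values
    // passed in as plain arguments instead of being read from a Job.
    static int estimate(long totalInputFileSize, long bytesPerReducer, int maxReducers) {
        // -1 means the input size is unknown, so no estimate is possible.
        if (totalInputFileSize == -1) {
            return -1;
        }
        int reducers = (int) Math.ceil((double) totalInputFileSize / bytesPerReducer);
        reducers = Math.max(1, reducers);           // at least one reducer
        reducers = Math.min(maxReducers, reducers); // never more than the cap
        return reducers;
    }

    public static void main(String[] args) {
        long bytesPerReducer = 100000000L; // value shown in the log
        int maxReducers = 999;             // value shown in the log

        // Unknown input size, the case reported in this thread.
        System.out.println(estimate(-1L, bytesPerReducer, maxReducers));
        // Hypothetical 250 MB input: ceil(2.5) = 3 reducers.
        System.out.println(estimate(250000000L, bytesPerReducer, maxReducers));
        // Hypothetical tiny input: rounded up to 1 reducer.
        System.out.println(estimate(1L, bytesPerReducer, maxReducers));
        // Hypothetical 200 GB input: 2000 reducers, capped at 999.
        System.out.println(estimate(200000000000L, bytesPerReducer, maxReducers));
    }
}
```

The first call returns -1, which corresponds to the "Could not estimate number of reducers" line in the log: with no requested or default parallelism set, the job then falls back to 1 reducer regardless of how large the input really is.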
