Hi all, here is the architecture I'm using: a local Java client (Pentaho) that interacts with a remote Hadoop cluster and a remote MongoDB database server.
When I execute a Pig script from my Java client (one script loads data from a file on HDFS, the other directly from the MongoDB server via the mongo-hadoop connector), the job fails every time, reporting that it cannot read the data it is supposed to load, even though the data definitely exists.

Here is the first Pig script, which loads data from an HDFS file:

*********************************************************************
weblogs = LOAD 'hdfs://sigma-server:54310/user/hduser/pdi/weblogs/parse/part/weblogs_parse.txt'
    USING PigStorage('\t')
    AS (
        client_ip:chararray,
        full_request_date:chararray,
        day:int,
        month:chararray,
        month_num:int,
        year:int,
        hour:int,
        minute:int,
        second:int,
        timezone:chararray,
        http_verb:chararray,
        uri:chararray,
        http_status_code:chararray,
        bytes_returned:chararray,
        referrer:chararray,
        user_agent:chararray
    );

weblog_group = GROUP weblogs BY (client_ip, year, month_num);

weblog_count = FOREACH weblog_group GENERATE
    group.client_ip,
    group.year,
    group.month_num,
    COUNT_STAR(weblogs) AS pageviews;

STORE weblog_count INTO 'hdfs://sigma-server:54310/user/hduser/pdi/weblogs/parse/part/mustapha-pentaho.txt';
*********************************************************************

Here is the output of the Java client (startup messages translated from French):

*********************************************************************
2016/05/02 11:04:28 - Pentaho Data Integration - Starting job ...
2016/05/02 11:04:28 - Pig_script_executor - Starting job
2016/05/02 11:04:28 - Pig_script_executor - Starting execution of entry [Pig Script Executor]
2016/05/02 11:04:28 - Pig_script_executor - Finished execution of job entry [Pig Script Executor] (result=[true])
2016/05/02 11:04:28 - Pig_script_executor - Finished job execution
2016/05/02 11:04:28 - Pentaho Data Integration - Job execution has completed.
2016/05/02 11:04:28 - Pig Script Executor - Pig Script Executor in Pig_script_executor has been started asynchronously. Pig_script_executor has been finished and logs from Pig Script Executor can be lost
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - Connecting to hadoop file system at: hdfs://sigma-server:54310
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - Connecting to map-reduce job tracker at: sigma-server:8032
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - Empty string specified for jar path
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - Pig features used in the script: GROUP_BY
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, DuplicateForEachColumnRewrite, FilterLogicExpressionSimplifier, GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[PartitionFilterOptimizer]}
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - File concatenation threshold: 100 optimistic? false
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - Choosing to move algebraic foreach to combiner
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - MR plan size before optimization: 1
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - MR plan size after optimization: 1
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - Pig script settings are added to the job
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - Reduce phase detected, estimating # of required reducers.
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=81468050
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - Setting Parallelism to 1
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - creating jar file Job6383680088751493933.jar
2016/05/02 11:04:30 - Pig Script Executor - 2016/05/02 11:04:30 - jar file Job6383680088751493933.jar created
2016/05/02 11:04:30 - Pig Script Executor - 2016/05/02 11:04:30 - Setting up single store job
2016/05/02 11:04:30 - Pig Script Executor - 2016/05/02 11:04:30 - Key [pig.schematuple] is false, will not generate code.
2016/05/02 11:04:30 - Pig Script Executor - 2016/05/02 11:04:30 - Starting process to move generated code to distributed cache
2016/05/02 11:04:30 - Pig Script Executor - 2016/05/02 11:04:30 - Setting key [pig.schematuple.classes] with classes to deserialize []
2016/05/02 11:04:30 - Pig Script Executor - 2016/05/02 11:04:30 - 1 map-reduce job(s) waiting for submission.
2016/05/02 11:04:31 - Pig Script Executor - 2016/05/02 11:04:31 - Total input paths to process : 1
2016/05/02 11:04:31 - Pig Script Executor - 2016/05/02 11:04:31 - Total input paths (combined) to process : 1
2016/05/02 11:04:31 - Pig Script Executor - 2016/05/02 11:04:31 - HadoopJobId: job_1462181691937_0009
2016/05/02 11:04:31 - Pig Script Executor - 2016/05/02 11:04:31 - Processing aliases weblog_count,weblog_group,weblogs
2016/05/02 11:04:31 - Pig Script Executor - 2016/05/02 11:04:31 - detailed locations: M: weblogs[1,10],weblogs[-1,-1],weblog_count[22,15],weblog_group[21,15] C: weblog_count[22,15],weblog_group[21,15] R: weblog_count[22,15]
2016/05/02 11:04:31 - Pig Script Executor - 2016/05/02 11:04:31 - More information at: http://sigma-server:50030/jobdetails.jsp?jobid=job_1462181691937_0009
2016/05/02 11:04:31 - Pig Script Executor - 2016/05/02 11:04:31 - 0% complete
2016/05/02 11:04:36 - Pig Script Executor - 2016/05/02 11:04:36 - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2016/05/02 11:04:36 - Pig Script Executor - 2016/05/02 11:04:36 - job job_1462181691937_0009 has failed! Stop running all dependent jobs
2016/05/02 11:04:36 - Pig Script Executor - 2016/05/02 11:04:36 - 100% complete
2016/05/02 11:04:36 - Pig Script Executor - 2016/05/02 11:04:36 - 1 map reduce job(s) failed!
2016/05/02 11:04:36 - Pig Script Executor - 2016/05/02 11:04:36 - Script Statistics:

HadoopVersion    PigVersion       UserId     StartedAt            FinishedAt           Features
2.6.0-cdh5.5.0   0.12.0-cdh5.5.0  msoufiani  2016-05-02 11:04:28  2016-05-02 11:04:36  GROUP_BY

Failed!

Failed Jobs:
JobId                   Alias                              Feature            Message               Outputs
job_1462181691937_0009  weblog_count,weblog_group,weblogs  GROUP_BY,COMBINER  Message: Job failed!  hdfs://sigma-server:54310/user/hduser/pdi/weblogs/parse/part/mustapha-pentaho.txt,

Input(s):
Failed to read data from "hdfs://sigma-server:54310/user/hduser/pdi/weblogs/parse/part/weblogs_parse.txt"

Output(s):
Failed to produce result in "hdfs://sigma-server:54310/user/hduser/pdi/weblogs/parse/part/mustapha-pentaho.txt"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1462181691937_0009

2016/05/02 11:04:36 - Pig Script Executor - 2016/05/02 11:04:36 - Failed!
2016/05/02 11:04:36 - Pig Script Executor - 2016/05/02 11:04:36 - ERROR 2244: Job failed, hadoop does not return any error message
2016/05/02 11:04:36 - Pig Script Executor - 2016/05/02 11:04:36 - There is no log file to write to.
2016/05/02 11:04:36 - Pig Script Executor - 2016/05/02 11:04:36 - org.apache.pig.backend.executionengine.ExecException: ERROR 2244: Job failed, hadoop does not return any error message
2016/05/02 11:04:36 - Pig Script Executor - at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:148)
2016/05/02 11:04:36 - Pig Script Executor - at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:202)
2016/05/02 11:04:36 - Pig Script Executor - at org.pentaho.hadoop.shim.common.PigShimImpl.executeScript(PigShimImpl.java:46)
2016/05/02 11:04:36 - Pig Script Executor - at org.pentaho.hadoop.shim.common.delegating.DelegatingPigShim.executeScript(DelegatingPigShim.java:65)
2016/05/02 11:04:36 - Pig Script Executor - at org.pentaho.big.data.impl.shim.pig.PigServiceImpl.executeScript(PigServiceImpl.java:103)
2016/05/02 11:04:36 - Pig Script Executor - at org.pentaho.big.data.kettle.plugins.pig.JobEntryPigScriptExecutor$1.run(JobEntryPigScriptExecutor.java:499)
2016/05/02 11:04:36 - Pig Script Executor - Num successful jobs: 0 num failed jobs: 1
*********************************************************************

Here is the Pig script that loads data from the MongoDB server:

*********************************************************************
REGISTER C:\Users\msoufiani\Desktop\pig.mongo.connector\mongo-java-driver-3.2.3-SNAPSHOT.jar;
REGISTER C:\Users\msoufiani\Desktop\pig.mongo.connector\mongo-hadoop-pig-1.5.2.jar;
REGISTER C:\Users\msoufiani\Desktop\pig.mongo.connector\mongo-hadoop-core-1.5.2.jar;

raw = LOAD 'mongodb://sigma-server:27017/mongo_hadoop.MapReduce_test_in'
    USING com.mongodb.hadoop.pig.MongoLoader('id, CONTRACT ,PL_PRODUCT_AMC', 'id');

raw_limited = LIMIT raw 100;

--DUMP raw_limited;

STORE raw_limited INTO 'mongodb://sigma-server:27017/mongo_hadoop.MapReduce_test_out'
    USING com.mongodb.hadoop.pig.MongoInsertStorage('');
*********************************************************************

Here is the output (startup messages again translated from French):

*********************************************************************
2016/05/02 11:08:49 - Pentaho Data Integration - Starting job ...
2016/05/02 11:08:49 - Pig_script_executor - Starting job
2016/05/02 11:08:49 - Pig_script_executor - Starting execution of entry [Pig Script Executor]
2016/05/02 11:08:49 - Pig_script_executor - Finished execution of job entry [Pig Script Executor] (result=[true])
2016/05/02 11:08:49 - Pig_script_executor - Finished job execution
2016/05/02 11:08:49 - Pentaho Data Integration - Job execution has completed.
2016/05/02 11:08:49 - Pig Script Executor - Pig Script Executor in Pig_script_executor has been started asynchronously.
Pig_script_executor has been finished and logs from Pig Script Executor can be lost
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 - Connecting to hadoop file system at: hdfs://sigma-server:54310
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 - Connecting to map-reduce job tracker at: sigma-server:8032
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 - Empty string specified for jar path
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 - Pig features used in the script: LIMIT
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, DuplicateForEachColumnRewrite, FilterLogicExpressionSimplifier, GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[PartitionFilterOptimizer]}
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 - File concatenation threshold: 100 optimistic? false
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 - MR plan size before optimization: 2
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 - MR plan size after optimization: 2
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 - Pig script settings are added to the job
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 - Reduce phase detected, estimating # of required reducers.
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 - Setting Parallelism to 1
2016/05/02 11:08:50 - Pig Script Executor - 2016/05/02 11:08:50 - creating jar file Job8094356474794659191.jar
2016/05/02 11:08:52 - Pig Script Executor - 2016/05/02 11:08:52 - jar file Job8094356474794659191.jar created
2016/05/02 11:08:52 - Pig Script Executor - 2016/05/02 11:08:52 - Setting up single store job
2016/05/02 11:08:52 - Pig Script Executor - 2016/05/02 11:08:52 - Key [pig.schematuple] is false, will not generate code.
2016/05/02 11:08:52 - Pig Script Executor - 2016/05/02 11:08:52 - Starting process to move generated code to distributed cache
2016/05/02 11:08:52 - Pig Script Executor - 2016/05/02 11:08:52 - Setting key [pig.schematuple.classes] with classes to deserialize []
2016/05/02 11:08:52 - Pig Script Executor - 2016/05/02 11:08:52 - 1 map-reduce job(s) waiting for submission.
2016/05/02 11:08:53 - Pig Script Executor - 2016/05/02 11:08:53 - Total input paths (combined) to process : 52
2016/05/02 11:08:53 - Pig Script Executor - 2016/05/02 11:08:53 - HadoopJobId: job_1462181691937_0011
2016/05/02 11:08:53 - Pig Script Executor - 2016/05/02 11:08:53 - Processing aliases raw,raw_limited
2016/05/02 11:08:53 - Pig Script Executor - 2016/05/02 11:08:53 - detailed locations: M: raw[5,6],raw_limited[6,14] C: R:
2016/05/02 11:08:53 - Pig Script Executor - 2016/05/02 11:08:53 - More information at: http://sigma-server:50030/jobdetails.jsp?jobid=job_1462181691937_0011
2016/05/02 11:08:53 - Pig Script Executor - 2016/05/02 11:08:53 - 0% complete
2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 - job job_1462181691937_0011 has failed! Stop running all dependent jobs
2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 - 100% complete
2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 - 1 map reduce job(s) failed!
2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 - Script Statistics:

HadoopVersion    PigVersion       UserId     StartedAt            FinishedAt           Features
2.6.0-cdh5.5.0   0.12.0-cdh5.5.0  msoufiani  2016-05-02 11:08:49  2016-05-02 11:08:58  LIMIT

Failed!

Failed Jobs:
JobId                   Alias            Feature  Message               Outputs
job_1462181691937_0011  raw,raw_limited           Message: Job failed!

Input(s):
Failed to read data from "mongodb://sigma-server:27017/mongo_hadoop.MapReduce_test_in"

Output(s):

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1462181691937_0011 -> null, null

2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 - Failed!
2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 - ERROR 2244: Job failed, hadoop does not return any error message
2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 - There is no log file to write to.
2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 - org.apache.pig.backend.executionengine.ExecException: ERROR 2244: Job failed, hadoop does not return any error message
2016/05/02 11:08:58 - Pig Script Executor - at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:148)
2016/05/02 11:08:58 - Pig Script Executor - at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:202)
2016/05/02 11:08:58 - Pig Script Executor - at org.pentaho.hadoop.shim.common.PigShimImpl.executeScript(PigShimImpl.java:46)
2016/05/02 11:08:58 - Pig Script Executor - at org.pentaho.hadoop.shim.common.delegating.DelegatingPigShim.executeScript(DelegatingPigShim.java:65)
2016/05/02 11:08:58 - Pig Script Executor - at org.pentaho.big.data.impl.shim.pig.PigServiceImpl.executeScript(PigServiceImpl.java:103)
2016/05/02 11:08:58 - Pig Script Executor - at org.pentaho.big.data.kettle.plugins.pig.JobEntryPigScriptExecutor$1.run(JobEntryPigScriptExecutor.java:499)
2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 - ERROR 2244: Job failed, hadoop does not return any error message
2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 - There is no log file to write to.
2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 - org.apache.pig.backend.executionengine.ExecException: ERROR 2244: Job failed, hadoop does not return any error message
2016/05/02 11:08:58 - Pig Script Executor - at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:148)
2016/05/02 11:08:58 - Pig Script Executor - at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:202)
2016/05/02 11:08:58 - Pig Script Executor - at org.pentaho.hadoop.shim.common.PigShimImpl.executeScript(PigShimImpl.java:46)
2016/05/02 11:08:58 - Pig Script Executor - at org.pentaho.hadoop.shim.common.delegating.DelegatingPigShim.executeScript(DelegatingPigShim.java:65)
2016/05/02 11:08:58 - Pig Script Executor - at org.pentaho.big.data.impl.shim.pig.PigServiceImpl.executeScript(PigServiceImpl.java:103)
2016/05/02 11:08:58 - Pig Script Executor - at org.pentaho.big.data.kettle.plugins.pig.JobEntryPigScriptExecutor$1.run(JobEntryPigScriptExecutor.java:499)
2016/05/02 11:08:58 - Pig Script Executor - Num successful jobs: 0 num failed jobs: 2
*********************************************************************

Both of these scripts run successfully via the Pig shell on the server itself. Can anyone help me with this, please? Thanks in advance.
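EDIT: in case it helps with diagnosis, here is a minimal sketch of the sanity checks I can run from the client machine before launching the jobs. It is not Pentaho code, just a plain Java snippet; the endpoints (sigma-server:54310 for the NameNode, sigma-server:27017 for MongoDB) and the jar paths are taken from the scripts and logs above, and the helper names are my own.

```java
import java.io.File;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ConnectivityCheck {

    // Try to open a plain TCP connection to host:port within timeoutMs.
    // Returns false if the host cannot be resolved or the port is unreachable.
    static boolean reachable(String host, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Endpoints as they appear in the logs above.
        System.out.println("NameNode (sigma-server:54310): " + reachable("sigma-server", 54310, 3000));
        System.out.println("MongoDB  (sigma-server:27017): " + reachable("sigma-server", 27017, 3000));

        // The REGISTER statements point at local Windows paths; if any of these
        // jars is missing on the client, the MongoDB job has nothing to ship.
        String dir = "C:\\Users\\msoufiani\\Desktop\\pig.mongo.connector\\";
        String[] jars = {
            "mongo-java-driver-3.2.3-SNAPSHOT.jar",
            "mongo-hadoop-pig-1.5.2.jar",
            "mongo-hadoop-core-1.5.2.jar"
        };
        for (String jar : jars) {
            System.out.println(jar + " exists: " + new File(dir + jar).exists());
        }
    }
}
```

Both checks pass from my machine, so plain TCP connectivity does not seem to be the problem.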