Instead of using

    using 'python2.6 user_id_output.py hbase'

try something like this:

    using 'user_id_output.py'

with a #! line in the script pointing at the location of the python binary. I think you can still include a parameter in the call, e.g.:

    using 'user_id_output.py hbase'

Cheers,
Ajo.

On Tue, Mar 1, 2011 at 8:22 AM, Irfan Mohammed <irfan...@gmail.com> wrote:
> Hi,
>
> I have a hive script [given below] which calls a python script using
> transform. For large datasets [ > 100M rows ], the reducer is not able to
> start the python process and the error message is "argument list too long".
> The detailed error stack is given below.
>
> The python script takes only 1 static argument, "hbase". For small datasets
> [ 1M rows ], the script works fine.
>
> 1. Is this problem related to the number of open file handles on the
>    reducer box?
> 2. How do I get the correct error message?
> 3. Is there a way to get at the actual unix process, and the arguments
>    it is instantiated with?
>
> Thanks,
> Irfan
>
> >>>>>>>>>>>>>> script >>>>>>>>>>>>>>>>>>>>>>>>>>>
>
> true && echo "
> set hive.exec.compress.output=true;
> set io.seqfile.compression.type=BLOCK;
> set mapred.output.compression.type=BLOCK;
> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
>
> set hive.merge.mapfiles=false;
> set hive.exec.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.exec.max.dynamic.partitions=20000;
>
> set hive.map.aggr=true;
> set mapred.reduce.tasks=200;
>
> add file /home/irfan/scripts/user_id_output.py;
>
> from (
>     select
>         user_id,
>         source_id,
>         load_id,
>         user_id_json,
>         pid
>     from
>         user_id_bucket
>     where
>         1 = 1
>         and length(user_id) = 40
>     distribute by pid
>     sort by user_id, load_id desc
> ) T1
> insert overwrite table user_id_hbase_stg1 partition (pid)
> SELECT
>     transform
>     (
>         T1.user_id
>         , T1.source_id
>         , T1.load_id
>         , T1.user_id_json
>         , T1.pid
>     )
>     using 'python2.6 user_id_output.py hbase'
>     as user_id, user_id_info1, user_id_info2, user_id_info3, user_id_info4,
>        user_id_info5, user_id_info3, pid
> ;
>
> " > ${SQL_FILE};
>
> true && ${HHIVE} -f ${SQL_FILE} 1>${TXT_FILE} 2>${LOG_FILE};
>
> >>>>>>>>>>>>>> error log >>>>>>>>>>>>>>>>>>>>>>>>>
>
> 2011-03-01 14:46:13,705 INFO org.apache.hadoop.hive.ql.exec.ExtractOperator: Initializing Self 5 OP
> 2011-03-01 14:46:13,711 INFO org.apache.hadoop.hive.ql.exec.ExtractOperator: Operator 5 OP initialized
> 2011-03-01 14:46:13,711 INFO org.apache.hadoop.hive.ql.exec.ExtractOperator: Initializing children of 5 OP
> 2011-03-01 14:46:13,711 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: Initializing child 6 SEL
> 2011-03-01 14:46:13,711 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: Initializing Self 6 SEL
> 2011-03-01 14:46:13,711 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: SELECT struct<_col0:string,_col1:string,_col2:string,_col3:string,_col4:string>
> 2011-03-01 14:46:13,711 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: Operator 6 SEL initialized
> 2011-03-01 14:46:13,711 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: Initializing children of 6 SEL
> 2011-03-01 14:46:13,711 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator: Initializing child 7 SCR
> 2011-03-01 14:46:13,711 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator: Initializing Self 7 SCR
> 2011-03-01 14:46:13,728 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator: Operator 7 SCR initialized
> 2011-03-01 14:46:13,728 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator: Initializing children of 7 SCR
> 2011-03-01 14:46:13,728 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: Initializing child 8 FS
> 2011-03-01 14:46:13,728 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: Initializing Self 8 FS
> 2011-03-01 14:46:13,730 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: Operator 8 FS initialized
> 2011-03-01 14:46:13,730 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: Initialization Done 8 FS
> 2011-03-01 14:46:13,730 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator: Initialization Done 7 SCR
> 2011-03-01 14:46:13,730 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: Initialization Done 6 SEL
> 2011-03-01 14:46:13,730 INFO org.apache.hadoop.hive.ql.exec.ExtractOperator: Initialization Done 5 OP
> 2011-03-01 14:46:13,733 INFO ExecReducer: ExecReducer: processing 1 rows: used memory = 89690888
> 2011-03-01 14:46:13,733 INFO org.apache.hadoop.hive.ql.exec.ExtractOperator: 5 forwarding 1 rows
> 2011-03-01 14:46:13,733 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 1 rows
> 2011-03-01 14:46:13,733 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator: Executing [/usr/bin/python2.6, user_id_output.py, hbase]
> 2011-03-01 14:46:13,733 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator: tablename=null
> 2011-03-01 14:46:13,733 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator: partname=null
> 2011-03-01 14:46:13,733 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator: alias=null
> 2011-03-01 14:46:13,777 FATAL ExecReducer: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"AA11223344","reducesinkkey1":"20110210_02"},"value":{"_col0":"xxxxx","_col1":"m1","_col2":"20110210_02","_col3":"{'m07': 'x12', 'm02': 'x34', 'm01': 'm45'}","_col4":"0A9"},"alias":0}
>     at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:253)
>     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:467)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:415)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
>     at org.apache.hadoop.mapred.Child.main(Child.java:211)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Cannot initialize ScriptOperator
>     at org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:320)
>     at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:457)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:697)
>     at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
>     at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:457)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:697)
>     at org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
>     at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:457)
>     at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:244)
>     ... 7 more
> Caused by: java.io.IOException: Cannot run program "/usr/bin/python2.6": java.io.IOException: error=7, Argument list too long
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
>     at org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:279)
>     ... 15 more
> Caused by: java.io.IOException: java.io.IOException: error=7, Argument list too long
>     at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
>     at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
>     ... 16 more
>
> 2011-03-01 14:46:13,779 WARN org.apache.hadoop.mapred.Child: Error running child
> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"AA11223344","reducesinkkey1":"20110210_02"},"value":{"_col0":"xxxxx","_col1":"m1","_col2":"20110210_02","_col3":"{'m07': 'x12', 'm02': 'x34', 'm01': 'm45'}","_col4":"0A9"},"alias":0}
>     at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:265)
>     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:467)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:415)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
>     at org.apache.hadoop.mapred.Child.main(Child.java:211)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"AA11223344","reducesinkkey1":"20110210_02"},"value":{"_col0":"xxxxx","_col1":"m1","_col2":"20110210_02","_col3":"{'m07': 'x12', 'm02': 'x34', 'm01': 'm45'}","_col4":"0A9"},"alias":0}
>     at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:253)
>     ... 7 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Cannot initialize ScriptOperator
>     at org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:320)
>     at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:457)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:697)
>     at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
>     at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:457)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:697)
>     at org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
>     at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:457)
>     at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:244)
>     ... 7 more
> Caused by: java.io.IOException: Cannot run program "/usr/bin/python2.6": java.io.IOException: error=7, Argument list too long
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
>     at org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:279)
>     ... 15 more
> Caused by: java.io.IOException: java.io.IOException: error=7, Argument list too long
>     at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
>     at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
>     ... 16 more
> 2011-03-01 14:46:13,784 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
>
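
P.S. For illustration, a self-executing TRANSFORM script along the lines suggested above might look like the sketch below. The body of the real user_id_output.py isn't shown in this thread, so the row-processing logic here is a made-up stand-in; only the shebang line, the stdin/stdout protocol, and the single positional argument reflect the actual setup. The script would need to be executable (chmod +x user_id_output.py) before being shipped with `add file`.

```python
#!/usr/bin/env python2.6
# Sketch of a Hive TRANSFORM script run directly via its shebang line,
# so the SQL can say: using 'user_id_output.py hbase'
# instead of spawning an explicit interpreter with 'python2.6 ...'.
import sys


def process_row(line, mode):
    """Illustrative only: split one tab-separated input row from Hive
    and append the mode argument as an extra output column."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(fields + [mode])


if __name__ == "__main__":
    # Hive passes the static argument ("hbase") on the command line and
    # streams rows on stdin, one tab-separated row per line.
    mode = sys.argv[1] if len(sys.argv) > 1 else "hbase"
    for row in sys.stdin:
        sys.stdout.write(process_row(row, mode) + "\n")
```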