Hi there, I'm trying to use ACID transactions in Hive but I have a problem when the data are added with Spark.
First, I created a table with the following statement : ____________________________________________________________ __________________________ CREATE TABLE testdb.test(id string, col1 string) CLUSTERED BY (id) INTO 4 BUCKETS STORED AS ORC TBLPROPERTIES('transactional'='true'); ____________________________________________________________ __________________________ Then I added data with those queries : ____________________________________________________________ __________________________ INSERT INTO testdb.test VALUES("1", "A"); INSERT INTO testdb.test VALUES("2", "B"); INSERT INTO testdb.test VALUES("3", "C"); ____________________________________________________________ __________________________ And I've been able to delete rows with this query : ____________________________________________________________ __________________________ DELETE FROM testdb.test WHERE id="1"; ____________________________________________________________ __________________________ All that worked perfectly, but a problem occurs when I try to delete rows that were added with Spark. What I do in Spark (iPython) : ____________________________________________________________ __________________________ hc = HiveContext(sc) data = sc.parallelize([["1", "A"], ["2", "B"], ["3", "C"]]) data_df = hc.createDataFrame(data) data_df.registerTempTable(data_df) hc.sql("INSERT INTO testdb.test SELECT * FROM data_df"); ____________________________________________________________ __________________________ Then, when I come back to Hive, I'm able to run a SELECT query on this the "test" table. However, when I try to run the exact same DELETE query as before, I have the following error (it happens after the reduce phase) : ____________________________________________________________ __________________________ Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":{" transactionid":0,"bucketid":-1,"rowid":0}},"value":null} at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce( ExecReducer.java:265) at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs( UserGroupInformation.java:1671) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":{" transactionid":0,"bucketid":-1,"rowid":0}},"value":null} at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce( ExecReducer.java:253) ... 7 more Caused by: java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.hadoop.hive.ql.exec.FileSinkOperator. processOp(FileSinkOperator.java:723) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815) at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp( SelectOperator.java:84) at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce( ExecReducer.java:244) ... 7 more ____________________________________________________________ __________________________ I have no idea where this is coming from, that is why I'm looking for advices on this mailing list. I'm using the Cloudera Quickstart VM (5.4.2). Hive version : 1.1.0 Spark Version : 1.3.0 And here is the complete output of the Hive DELETE command : ____________________________________________________________ __________________________ hive> delete from testdb.test where id="1"; Query ID = cloudera_20160914090303_795e40b7-ab6a-45b0-8391-6d41d1cfe7bd Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks determined at compile time: 4 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1473858545651_0036, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1473858545651_0036/ Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1473858545651_0036 Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 4 2016-09-14 09:03:55,571 Stage-1 map = 0%, reduce = 0% 2016-09-14 09:04:14,898 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 1.66 sec 2016-09-14 09:04:15,944 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.33 sec 2016-09-14 09:04:44,101 Stage-1 map = 100%, reduce = 17%, Cumulative CPU 4.21 sec 2016-09-14 09:04:46,523 Stage-1 map = 100%, reduce = 25%, Cumulative CPU 4.79 sec 2016-09-14 09:04:47,673 Stage-1 map = 100%, reduce = 42%, Cumulative CPU 5.8 sec 2016-09-14 09:04:50,041 Stage-1 map = 100%, reduce = 75%, Cumulative CPU 7.45 sec 2016-09-14 09:05:18,486 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.69 sec MapReduce Total cumulative CPU time: 7 seconds 690 msec Ended Job = job_1473858545651_0036 with errors Error during job, obtaining debugging information... Job Tracking URL: http://quickstart.cloudera:8088/proxy/application_ 1473858545651_0036/ Examining task ID: task_1473858545651_0036_m_000000 (and more) from job job_1473858545651_0036 Task with the most failures(4): ----- Task ID: task_1473858545651_0036_r_000001 URL: http://0.0.0.0:8088/taskdetails.jsp?jobid=job_ 1473858545651_0036&tipid=task_1473858545651_0036_r_000001 ----- Diagnostic Messages for this Task: Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":{" transactionid":0,"bucketid":-1,"rowid":0}},"value":null} at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce( ExecReducer.java:265) at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs( UserGroupInformation.java:1671) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":{" transactionid":0,"bucketid":-1,"rowid":0}},"value":null} at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce( ExecReducer.java:253) ... 7 more Caused by: java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.hadoop.hive.ql.exec.FileSinkOperator. processOp(FileSinkOperator.java:723) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815) at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp( SelectOperator.java:84) at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce( ExecReducer.java:244) ... 7 more FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql. exec.mr.MapRedTask MapReduce Jobs Launched: Stage-Stage-1: Map: 2 Reduce: 4 Cumulative CPU: 7.69 sec HDFS Read: 21558 HDFS Write: 114 FAIL Total MapReduce CPU Time Spent: 7 seconds 690 msec ____________________________________________________________ __________________________ Thanks !