Re: ACID transactions on data added from Spark not working

2016-09-14 Thread Mich Talebzadeh
Hi,

I believe this is an issue with Spark handling transactional tables in Hive.

When you add rows from Spark to an ORC transactional table, the Hive
metastore tables HIVE_LOCKS and TXNS are not updated. This does not happen
when the rows are inserted through Hive itself. As a result, these new rows
are left in an inconsistent state.
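
One way to see this on your setup (assuming you can open a second Hive
session while the statements run) is to watch what the transaction manager
records during an insert issued from Hive versus one issued from Spark:

-- Run from a second Hive CLI/beeline session while the INSERT is in flight.
-- For an INSERT issued from Hive you should see an open transaction and a
-- lock on the table; for the Spark insert you will typically see neither.
SHOW TRANSACTIONS;
SHOW LOCKS;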

Also, if you delete rows from such a table using Hive, many delta files will
be produced. Hive will eventually compact them (roll them into the base file)
and can read the data in the meantime. Spark, however, will not be able to
read the delta files until the compaction is complete.
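
If you end up with a long tail of delta files you can also request the
compaction yourself instead of waiting for the background compactor (a
sketch, assuming the compactor is enabled on your metastore):

-- Ask for a major compaction (deltas rolled into a new base file) and
-- watch its progress.
ALTER TABLE testdb.test COMPACT 'major';
SHOW COMPACTIONS;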

Anyway this is my experience.
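
If the immediate goal is just to get DELETE working on the rows that Spark
wrote, one workaround I would try (a sketch only, not tested on your Hive
1.1 / Spark 1.3 combination, and the table name test_repaired is made up)
is to rewrite the data through Hive so that Hive itself assigns the
transaction, bucket and row ids:

-- New transactional table with the same layout as testdb.test.
CREATE TABLE testdb.test_repaired (id string, col1 string)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC TBLPROPERTIES ('transactional'='true');

-- Copying the rows with a Hive INSERT gives them proper ACID metadata,
-- so DELETE should then behave as it does for rows inserted from Hive.
INSERT INTO testdb.test_repaired SELECT id, col1 FROM testdb.test;

DELETE FROM testdb.test_repaired WHERE id = '1';

Once you have checked the copy you can drop the old table and rename the
new one.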

HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




ACID transactions on data added from Spark not working

2016-09-14 Thread Jack Wenger
Hi there,

I'm trying to use ACID transactions in Hive, but I have a problem when the
data is added from Spark.

First, I created a table with the following statement:

__
CREATE TABLE testdb.test(id string, col1 string)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC TBLPROPERTIES('transactional'='true');

__

Then I added data with these queries:

__
INSERT INTO testdb.test VALUES("1", "A");
INSERT INTO testdb.test VALUES("2", "B");
INSERT INTO testdb.test VALUES("3", "C");

__

And I've been able to delete rows with this query:

__
DELETE FROM testdb.test WHERE id="1";

__

All that worked perfectly, but a problem occurs when I try to delete rows
that were added with Spark.

What I do in Spark (IPython):

__
from pyspark.sql import HiveContext

hc = HiveContext(sc)
data = sc.parallelize([["1", "A"], ["2", "B"], ["3", "C"]])
data_df = hc.createDataFrame(data)
# register the DataFrame under the name used in the SQL below
data_df.registerTempTable("data_df")
hc.sql("INSERT INTO testdb.test SELECT * FROM data_df")

__

Then, when I come back to Hive, I'm able to run a SELECT query on the
"test" table.
However, when I try to run the exact same DELETE query as before, I get
the following error (it happens after the reduce phase):


__

Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException:
Hive Runtime Error while processing row (tag=0)
{"key":{"reducesinkkey0":{"transactionid":0,"bucketid":-1,"rowid":0}},"value":null}
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:265)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime
Error while processing row (tag=0)
{"key":{"reducesinkkey0":{"transactionid":0,"bucketid":-1,"rowid":0}},"value":null}
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:253)
... 7 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:723)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:244)
... 7 more


__

I have no idea where this is coming from, which is why I'm looking for
advice on this mailing list.

I'm using the Cloudera Quickstart VM (5.4.2).
Hive version : 1.1.0
Spark Version : 1.3.0

And here is the complete output of the Hive DELETE command:

__

hive> delete from testdb.test where id="1";

Query ID = cloudera_20160914090303_795e40b7-ab6a-45b0-8391-6d41d1cfe7bd
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1473858545651_0036, Tracking URL =
http://quickstart.cloudera:8088/proxy/application_1473858545651_0036/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1473858545651_0036
Hadoop job information for Stage-1: number of mappers: 2; number of
reducers: 4
2016-09-14 09:03:55,571 Stage-1 map = 0%,  reduce = 0%
2016-09-14 09:04:14,898 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU
1.66 sec
2016-09-14 09:04:15,944 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
3.33 sec
2016-09-14 09:04:44,101 Stage-1 map = 100%,  reduce = 17%, Cumulative CPU
4.21 sec
2016-09-14 09:04:46,523 Stage-1 map = 100%,  reduce = 25%, Cumulative CPU
4.79 sec
2016-09-14 09:04:47,673 Stage-1 map = 100%,  reduce = 42%, Cumulative CPU