[jira] [Commented] (HIVE-17280) Data loss in CONCATENATE ORC created by Spark

2017-09-05 Thread Prasanth Jayachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16154377#comment-16154377
 ] 

Prasanth Jayachandran commented on HIVE-17280:
--

[~mgaido] Posted a patch to HIVE-17280 that will fix the issue (along with 
adding restrictions). Tested this locally and it worked. If concatenation finds 
an incompatible file, it will rename it to Hive's convention to avoid the issue 
I mentioned above. 
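As a rough illustration of the renaming idea (a hypothetical sketch, not the actual HIVE-17280 patch; the class and method names here are made up): any file that does not already match Hive's <taskId>_<attempt> pattern (e.g. 000000_0) gets a fresh, unique name in that convention, so the move task can no longer mistake distinct Spark-written files for duplicate attempts of one task.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch of the fix described above (not the real patch):
// rename files that do not follow Hive's <taskId>_<attempt> naming
// convention into that convention, giving each a unique task id.
class ConcatenateRename {
    // Hive-style names: six digits, underscore, attempt number.
    private static final Pattern HIVE_NAME = Pattern.compile("\\d{6}_\\d+");

    static List<String> renameIncompatible(List<String> fileNames) {
        Set<String> used = new HashSet<>(fileNames);
        List<String> result = new ArrayList<>();
        int nextId = 0;
        for (String name : fileNames) {
            if (HIVE_NAME.matcher(name).matches()) {
                result.add(name); // already Hive-style, keep as-is
            } else {
                String candidate;
                do { // pick the next task id not already taken
                    candidate = String.format("%06d_0", nextId++);
                } while (used.contains(candidate));
                used.add(candidate);
                result.add(candidate);
            }
        }
        return result;
    }
}
```

With input like part-0, part-0_copy_1 this yields 000000_0, 000001_0: two distinct task ids, so neither file can be dropped as a losing speculative attempt of the other.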

> Data loss in CONCATENATE ORC created by Spark
> ---------------------------------------------
>
> Key: HIVE-17280
> URL: https://issues.apache.org/jira/browse/HIVE-17280
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, Spark
>Affects Versions: 1.2.1
> Environment: Spark 1.6.3
>Reporter: Marco Gaido
>Priority: Critical
>
> Hive concatenation causes data loss if the ORC files in the table were 
> written by Spark.
> Here are the steps to reproduce the problem:
>  - create a table;
> {code:java}
> hive
> hive> create table aa (a string, b int) stored as orc;
> {code}
>  - insert 2 rows using Spark;
> {code:java}
> spark-shell
> scala> case class AA(a:String, b:Int)
> scala> val df = sc.parallelize(Array(AA("b",2),AA("c",3) )).toDF
> scala> df.write.insertInto("aa")
> {code}
>  - change table schema;
> {code:java}
> hive
> hive> alter table aa add columns(aa string, bb int);
> {code}
>  - insert 2 more rows with Spark;
> {code:java}
> spark-shell
> scala> case class BB(a:String, b:Int, aa:String, bb:Int)
> scala> val df = sc.parallelize(Array(BB("b",2,"b",2),BB("c",3,"c",3) )).toDF
> scala> df.write.insertInto("aa")
> {code}
>  - at this point, running a select statement with Hive correctly returns *4 
> rows*; then run the concatenation
> {code:java}
> hive
> hive> alter table aa concatenate;
> {code}
> At this point, a select returns only *3 rows, i.e. a row is missing*.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17280) Data loss in CONCATENATE ORC created by Spark

2017-08-30 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146933#comment-16146933
 ] 

Marco Gaido commented on HIVE-17280:


I see, but this won't fix the problem with files written by Spark. This is the 
way Spark names files when writing to managed tables, so the issue will still 
be there.



[jira] [Commented] (HIVE-17280) Data loss in CONCATENATE ORC created by Spark

2017-08-30 Thread Prasanth Jayachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146914#comment-16146914
 ] 

Prasanth Jayachandran commented on HIVE-17280:
--

That is certainly not the format that Hive expects. After concatenation, merged 
and unmerged (incompatible) files get moved to a staging directory. Then 
MoveTask moves the files from the staging directory to the final destination 
directory (which is also the source directory in the case of concatenation). 
MoveTask makes certain assumptions about filenames for bucketing, speculative 
execution, etc. In the example files you provided, part-0_copy_1 and 
part-1_copy_1 will be considered the same file written by different tasks (from 
speculative execution), and the largest file will be picked as the winner of 
the speculative execution. This is the same issue as HIVE-17403. Hive usually 
writes files in the format 000000_0, where 000000 is the task id/bucket id and 
the digit after _ is the task attempt. I am working on a patch that will 
restrict concatenation for external tables. For Hive managed tables, the load 
data command will make sure the filenames conform to Hive's expectations. 
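The deduplication described above can be modeled with a short sketch (illustrative only, not Hive's actual MoveTask code; the regex and method names are assumptions): if the "task id" is derived from the trailing digit group of a filename, then part-0_copy_1 and part-1_copy_1 both map to task id 1, so they look like two attempts at the same task and only the largest file survives.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative model of the move-task behavior described above
// (not Hive's actual code): group files by an extracted "task id"
// and keep only the largest file per group, as if the rest were
// losing speculative-execution attempts.
class MoveTaskModel {
    // Take the last run of digits in the name as the task id.
    private static final Pattern LAST_DIGITS = Pattern.compile("(\\d+)(?!.*\\d)");

    static String taskId(String fileName) {
        Matcher m = LAST_DIGITS.matcher(fileName);
        return m.find() ? m.group(1) : fileName;
    }

    // Map each task id to the name of its largest file; smaller
    // files with the same extracted id are silently discarded.
    static Map<String, String> dedupe(Map<String, Long> fileSizes) {
        Map<String, String> winners = new HashMap<>();
        Map<String, Long> winnerSize = new HashMap<>();
        for (Map.Entry<String, Long> e : fileSizes.entrySet()) {
            String id = taskId(e.getKey());
            Long best = winnerSize.get(id);
            if (best == null || e.getValue() > best) {
                winners.put(id, e.getKey());
                winnerSize.put(id, e.getValue());
            }
        }
        return winners;
    }
}
```

Under this model the two Spark files collide on task id 1 and one of them is dropped, which matches the missing-row symptom; Hive-style names like 000000_0 and 000001_0 would map to distinct ids and both survive.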



[jira] [Commented] (HIVE-17280) Data loss in CONCATENATE ORC created by Spark

2017-08-30 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146882#comment-16146882
 ] 

Marco Gaido commented on HIVE-17280:


The names of the files are:

{noformat}
/apps/hive/warehouse/aa/part-0
/apps/hive/warehouse/aa/part-0_copy_1
/apps/hive/warehouse/aa/part-1
/apps/hive/warehouse/aa/part-1_copy_1
/apps/hive/warehouse/aa/part-1_copy_2
/apps/hive/warehouse/aa/part-2
/apps/hive/warehouse/aa/part-2_copy_1
/apps/hive/warehouse/aa/part-3
/apps/hive/warehouse/aa/part-3_copy_1
{noformat}




[jira] [Commented] (HIVE-17280) Data loss in CONCATENATE ORC created by Spark

2017-08-29 Thread Prasanth Jayachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16145785#comment-16145785
 ] 

Prasanth Jayachandran commented on HIVE-17280:
--

Do you happen to know the filenames generated by Spark? 
Hive makes some assumptions about filenames when moving files from the staging 
directory to the final target directory. I recently encountered a similar 
issue (HIVE-17403) that could be related to this as well.
