[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233578#comment-16233578
 ] 

xinzhang edited comment on SPARK-21725 at 11/1/17 5:25 AM:
---

[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, pointing the Hive metastore at MySQL (local metastore on 9083 or not; keep the metastore unchanged, that is not the point).
2. Copy that hive-site.xml into the Spark conf directory so spark-sql picks it up.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline (a minimal command sketch follows below).
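
For steps 2-4 the commands look roughly like this (a sketch only; the Hive path and the JDBC connect string are placeholders for my environment):

{code}
# step 2: let the Spark Thrift Server use the same Hive metastore configuration
cp /path/to/apache-hive-1.2.1-bin/conf/hive-site.xml $SPARK_HOME/conf/

# step 3: start the Spark Thrift Server
$SPARK_HOME/sbin/start-thriftserver.sh

# step 4: connect with beeline and run the statements from the issue description
$SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10000 \
  -e "SET hive.default.fileformat=Parquet; insert overwrite table tmp_10 partition(pt='1') select count(1) count from tmp_11;"
{code}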

!https://user-images.githubusercontent.com/8244097/32257548-af789d42-bef0-11e7-8c04-99137c50fbbf.png!

{color:red}*The metastore has been changed from Derby to MySQL. My suggestion is to try it in a new environment, separate from your current existing one, which you could rebuild.*{color}
As you say, this might be related to the metastore. I tested the case on CDH 5.7 (Hadoop 2.6) and on Hadoop 2.8 (a new environment); the problem always appears no matter what I do. I hope you can help. Thanks.

{color:red}Hi. Please do not test it from the Spark source code directory! Test it with MySQL (and, if you can, Derby), using the built package spark-2.3.0-SNAPSHOT-bin-custom-spark.tgz, produced roughly as sketched below.

built package spark-2.3.0-SNAPSHOT-bin-custom-spark.tgz + MySQL: Thrift Server fails
built package spark-2.3.0-SNAPSHOT-bin-custom-spark.tgz + Derby: Thrift Server fails
spark source code directory + Derby: Thrift Server works
spark source code directory + MySQL: Thrift Server fails

Under these conditions the problem always appears. Could you test it?{color}
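
For reference, the built package above comes from Spark's standard distribution script, something like the following (the Maven profiles are just what I happen to enable; only --name matters for the file name):

{code}
# run from the Spark source tree; produces spark-2.3.0-SNAPSHOT-bin-custom-spark.tgz
./dev/make-distribution.sh --name custom-spark --tgz -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn
{code}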


was (Author: zhangxin0112zx):
[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, pointing the Hive metastore at MySQL (local metastore on 9083 or not; keep the metastore unchanged, that is not the point).
2. Copy that hive-site.xml into the Spark conf directory so spark-sql picks it up.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

!https://user-images.githubusercontent.com/8244097/32257548-af789d42-bef0-11e7-8c04-99137c50fbbf.png!

{color:red}*The metastore has been changed from Derby to MySQL. My suggestion is to try it in a new environment, separate from your current existing one, which you could rebuild.*{color}
As you say, this might be related to the metastore. I tested the case on CDH 5.7 (Hadoop 2.6) and on Hadoop 2.8 (a new environment); the problem always appears no matter what I do. I hope you can help. Thanks.

{color:red}Hi. Please do not test it from the Spark source code directory! Test it with MySQL (and, if you can, Derby), using the built package spark-2.3.0-SNAPSHOT-bin-custom-spark.tgz.

built package spark-2.3.0-SNAPSHOT-bin-custom-spark.tgz + MySQL: Thrift Server fails
built package spark-2.3.0-SNAPSHOT-bin-custom-spark.tgz + Derby: Thrift Server fails
spark source code directory + Derby: Thrift Server works

Under these conditions the problem always appears. Could you test it?{color}

> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>Priority: Major
>  Labels: spark-sql
>
> use thriftserver create table with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> 

[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233578#comment-16233578
 ] 

xinzhang edited comment on SPARK-21725 at 11/1/17 3:51 AM:
---

[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, pointing the Hive metastore at MySQL (local metastore on 9083 or not; keep the metastore unchanged, that is not the point).
2. Copy that hive-site.xml into the Spark conf directory so spark-sql picks it up.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

!https://user-images.githubusercontent.com/8244097/32257548-af789d42-bef0-11e7-8c04-99137c50fbbf.png!

{color:red}*The metastore has been changed from Derby to MySQL. My suggestion is to try it in a new environment, separate from your current existing one, which you could rebuild.*{color}
As you say, this might be related to the metastore. I tested the case on CDH 5.7 (Hadoop 2.6) and on Hadoop 2.8 (a new environment); the problem always appears no matter what I do. I hope you can help. Thanks.

{color:red}Hi. Please do not test it from the Spark source code directory! Test it with MySQL (and, if you can, Derby), using the built package spark-2.3.0-SNAPSHOT-bin-custom-spark.tgz.

built package spark-2.3.0-SNAPSHOT-bin-custom-spark.tgz + MySQL: Thrift Server fails
built package spark-2.3.0-SNAPSHOT-bin-custom-spark.tgz + Derby: Thrift Server fails
spark source code directory + Derby: Thrift Server works

Under these conditions the problem always appears. Could you test it?{color}


was (Author: zhangxin0112zx):
[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, pointing the Hive metastore at MySQL (local metastore on 9083 or not; keep the metastore unchanged, that is not the point).
2. Copy that hive-site.xml into the Spark conf directory so spark-sql picks it up.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

!https://user-images.githubusercontent.com/8244097/32257548-af789d42-bef0-11e7-8c04-99137c50fbbf.png!

{color:red}*The metastore has been changed from Derby to MySQL. My suggestion is to try it in a new environment, separate from your current existing one, which you could rebuild.*{color}
As you say, this might be related to the metastore. I tested the case on CDH 5.7 (Hadoop 2.6) and on Hadoop 2.8 (a new environment); the problem always appears no matter what I do. I hope you can help. Thanks.

{color:red}Hi. Please do not test it from the Spark source code directory! Test it with MySQL (or possibly Derby) and with the built package spark-2.3.0-SNAPSHOT-bin-custom-spark.tgz. Under those two conditions the problem always appears. Could you test it?{color}

> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>Priority: Major
>  Labels: spark-sql
>
> use thriftserver create table with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
> ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> 
> -
> the doc about the parquet table desc here 
> 

[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233578#comment-16233578
 ] 

xinzhang edited comment on SPARK-21725 at 11/1/17 3:49 AM:
---

[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, pointing the Hive metastore at MySQL (local metastore on 9083 or not; keep the metastore unchanged, that is not the point).
2. Copy that hive-site.xml into the Spark conf directory so spark-sql picks it up.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

!https://user-images.githubusercontent.com/8244097/32257548-af789d42-bef0-11e7-8c04-99137c50fbbf.png!

{color:red}*The metastore has been changed from Derby to MySQL. My suggestion is to try it in a new environment, separate from your current existing one, which you could rebuild.*{color}
As you say, this might be related to the metastore. I tested the case on CDH 5.7 (Hadoop 2.6) and on Hadoop 2.8 (a new environment); the problem always appears no matter what I do. I hope you can help. Thanks.

{color:red}Hi. Please do not test it from the Spark source code directory! Test it with MySQL (or possibly Derby) and with the built package spark-2.3.0-SNAPSHOT-bin-custom-spark.tgz. Under those two conditions the problem always appears. Could you test it?{color}


was (Author: zhangxin0112zx):
[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, pointing the Hive metastore at MySQL (local metastore on 9083 or not; keep the metastore unchanged, that is not the point).
2. Copy that hive-site.xml into the Spark conf directory so spark-sql picks it up.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

!https://user-images.githubusercontent.com/8244097/32257548-af789d42-bef0-11e7-8c04-99137c50fbbf.png!

{color:red}*The metastore has been changed from Derby to MySQL. My suggestion is to try it in a new environment, separate from your current existing one, which you could rebuild.*{color}
As you say, this might be related to the metastore. I tested the case on CDH 5.7 (Hadoop 2.6) and on Hadoop 2.8 (a new environment); the problem always appears no matter what I do. I hope you can help. Thanks.

{color:red}Hi. Please do not test it from the Spark source code directory! Test it with MySQL and with the built package spark-2.3.0-SNAPSHOT-bin-custom-spark.tgz. Under those two conditions the problem always appears. Could you test it?{color}

> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>Priority: Major
>  Labels: spark-sql
>
> use thriftserver create table with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
> ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> 
> -
> the doc about the parquet table desc here 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
> Hive metastore Parquet table conversion
> When reading from and writing to Hive metastore Parquet tables, Spark SQL 
> will try to use its own Parquet support instead of Hive 

[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233578#comment-16233578
 ] 

xinzhang edited comment on SPARK-21725 at 11/1/17 3:48 AM:
---

[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, pointing the Hive metastore at MySQL (local metastore on 9083 or not; keep the metastore unchanged, that is not the point).
2. Copy that hive-site.xml into the Spark conf directory so spark-sql picks it up.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

!https://user-images.githubusercontent.com/8244097/32257548-af789d42-bef0-11e7-8c04-99137c50fbbf.png!

{color:red}*The metastore has been changed from Derby to MySQL. My suggestion is to try it in a new environment, separate from your current existing one, which you could rebuild.*{color}
As you say, this might be related to the metastore. I tested the case on CDH 5.7 (Hadoop 2.6) and on Hadoop 2.8 (a new environment); the problem always appears no matter what I do. I hope you can help. Thanks.

{color:red}Hi. Please do not test it from the Spark source code directory! Test it with MySQL and with the built package spark-2.3.0-SNAPSHOT-bin-custom-spark.tgz. Under those two conditions the problem always appears. Could you test it?{color}


was (Author: zhangxin0112zx):
[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, pointing the Hive metastore at MySQL (local metastore on 9083 or not; keep the metastore unchanged, that is not the point).
2. Copy that hive-site.xml into the Spark conf directory so spark-sql picks it up.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

!https://user-images.githubusercontent.com/8244097/32257548-af789d42-bef0-11e7-8c04-99137c50fbbf.png!

{color:red}*The metastore has been changed from Derby to MySQL. My suggestion is to try it in a new environment, separate from your current existing one, which you could rebuild.*{color}
As you say, this might be related to the metastore. I tested the case on CDH 5.7 (Hadoop 2.6) and on Hadoop 2.8 (a new environment); the problem always appears no matter what I do. I hope you can help. Thanks.

{color:red}Hi. Please do not test it from the Spark source code directory! Test it with MySQL and with the built spark-2.3.0-SNAPSHOT-bin-custom-spark.tgz. Under those two conditions the problem always appears. Could you test it?{color}

> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>Priority: Major
>  Labels: spark-sql
>
> use thriftserver create table with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
> ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> 
> -
> the doc about the parquet table desc here 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
> Hive metastore Parquet table conversion
> When reading from and writing to Hive metastore Parquet tables, Spark SQL 
> will try to use its own Parquet support instead of Hive SerDe for better 
> 

[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233578#comment-16233578
 ] 

xinzhang edited comment on SPARK-21725 at 11/1/17 3:46 AM:
---

[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, pointing the Hive metastore at MySQL (local metastore on 9083 or not; keep the metastore unchanged, that is not the point).
2. Copy that hive-site.xml into the Spark conf directory so spark-sql picks it up.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

!https://user-images.githubusercontent.com/8244097/32257548-af789d42-bef0-11e7-8c04-99137c50fbbf.png!

{color:red}*The metastore has been changed from Derby to MySQL. My suggestion is to try it in a new environment, separate from your current existing one, which you could rebuild.*{color}
As you say, this might be related to the metastore. I tested the case on CDH 5.7 (Hadoop 2.6) and on Hadoop 2.8 (a new environment); the problem always appears no matter what I do. I hope you can help. Thanks.

{color:red}Hi. Please do not test it from the Spark source code directory! Test it with MySQL and with the built spark-2.3.0-SNAPSHOT-bin-custom-spark.tgz. Under those two conditions the problem always appears. Could you test it?{color}


was (Author: zhangxin0112zx):
[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, pointing the Hive metastore at MySQL (local metastore on 9083 or not; keep the metastore unchanged, that is not the point).
2. Copy that hive-site.xml into the Spark conf directory so spark-sql picks it up.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

!https://user-images.githubusercontent.com/8244097/32257548-af789d42-bef0-11e7-8c04-99137c50fbbf.png!

{color:red}*The metastore has been changed from Derby to MySQL. My suggestion is to try it in a new environment, separate from your current existing one, which you could rebuild.*{color}
As you say, this might be related to the metastore. I tested the case on CDH 5.7 (Hadoop 2.6) and on Hadoop 2.8 (a new environment); the problem always appears no matter what I do. I hope you can help. Thanks.

{color:red}Hi. It is the metastore! I tested with Derby and the Thrift Server is fine; when I switch to MySQL, the problem always appears. Could you test it?{color}

> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>Priority: Major
>  Labels: spark-sql
>
> use thriftserver create table with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
> ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> 
> -
> the doc about the parquet table desc here 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
> Hive metastore Parquet table conversion
> When reading from and writing to Hive metastore Parquet tables, Spark SQL 
> will try to use its own Parquet support instead of Hive SerDe for better 
> performance. This behavior is controlled by the 
> spark.sql.hive.convertMetastoreParquet 

[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233578#comment-16233578
 ] 

xinzhang edited comment on SPARK-21725 at 11/1/17 3:16 AM:
---

[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, pointing the Hive metastore at MySQL (local metastore on 9083 or not; keep the metastore unchanged, that is not the point).
2. Copy that hive-site.xml into the Spark conf directory so spark-sql picks it up.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

!https://user-images.githubusercontent.com/8244097/32257548-af789d42-bef0-11e7-8c04-99137c50fbbf.png!

{color:red}*The metastore has been changed from Derby to MySQL. My suggestion is to try it in a new environment, separate from your current existing one, which you could rebuild.*{color}
As you say, this might be related to the metastore. I tested the case on CDH 5.7 (Hadoop 2.6) and on Hadoop 2.8 (a new environment); the problem always appears no matter what I do. I hope you can help. Thanks.

{color:red}Hi. It is the metastore! I tested with Derby and the Thrift Server is fine; when I switch to MySQL, the problem always appears. Could you test it?{color}


was (Author: zhangxin0112zx):
[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, pointing the Hive metastore at MySQL (local metastore on 9083 or not; keep the metastore unchanged, that is not the point).
2. Copy that hive-site.xml into the Spark conf directory so spark-sql picks it up.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

!https://user-images.githubusercontent.com/8244097/32257548-af789d42-bef0-11e7-8c04-99137c50fbbf.png!

{color:red}*The metastore has been changed from Derby to MySQL. My suggestion is to try it in a new environment, separate from your current existing one, which you could rebuild.*{color}
As you say, this might be related to the metastore. I tested the case on CDH 5.7 (Hadoop 2.6) and on Hadoop 2.8 (a new environment); the problem always appears no matter what I do. I hope you can help. Thanks.



> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>Priority: Major
>  Labels: spark-sql
>
> use thriftserver create table with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
> ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> 
> -
> the doc about the parquet table desc here 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
> Hive metastore Parquet table conversion
> When reading from and writing to Hive metastore Parquet tables, Spark SQL 
> will try to use its own Parquet support instead of Hive SerDe for better 
> performance. This behavior is controlled by the 
> spark.sql.hive.convertMetastoreParquet configuration, and is turned on by 
> default.
> I am confused the problem appear in the table(partitions)  but it is ok with 
> table(with out partitions) . It means spark do not use its own parquet ?
> Maybe someone give any suggest how could I 

[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233578#comment-16233578
 ] 

xinzhang edited comment on SPARK-21725 at 11/1/17 3:08 AM:
---

[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, pointing the Hive metastore at MySQL (local metastore on 9083 or not; keep the metastore unchanged, that is not the point).
2. Copy that hive-site.xml into the Spark conf directory so spark-sql picks it up.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

!https://user-images.githubusercontent.com/8244097/32257548-af789d42-bef0-11e7-8c04-99137c50fbbf.png!

{color:red}*The metastore has been changed from Derby to MySQL. My suggestion is to try it in a new environment, separate from your current existing one, which you could rebuild.*{color}
As you say, this might be related to the metastore. I tested the case on CDH 5.7 (Hadoop 2.6) and on Hadoop 2.8 (a new environment); the problem always appears no matter what I do. I hope you can help. Thanks.




was (Author: zhangxin0112zx):
[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, pointing the Hive metastore at MySQL (local metastore on 9083 or not; keep the metastore unchanged, that is not the point).
2. Copy that hive-site.xml into the Spark conf directory so spark-sql picks it up.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

!https://user-images.githubusercontent.com/8244097/32257548-af789d42-bef0-11e7-8c04-99137c50fbbf.png!

{color:red}*The metastore has been changed from Derby to MySQL. My suggestion is to try it in a new environment, separate from your current existing one, which you could rebuild.*{color}
As you say, this might be related to the metastore. I tested the case on CDH 5.7 (Hadoop 2.6) and on Hadoop 2.8 (a new environment); the problem always appears no matter what I do. I hope you can help. Thanks.

{color:red}Hi. It is the metastore! I tested with Derby and the Thrift Server is fine; when I switch to MySQL, the problem always appears. Could you test it?{color}

> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>Priority: Major
>  Labels: spark-sql
>
> use thriftserver create table with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
> ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> 
> -
> the doc about the parquet table desc here 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
> Hive metastore Parquet table conversion
> When reading from and writing to Hive metastore Parquet tables, Spark SQL 
> will try to use its own Parquet support instead of Hive SerDe for better 
> performance. This behavior is controlled by the 
> spark.sql.hive.convertMetastoreParquet configuration, and is turned on by 
> default.
> I am confused the problem appear in the table(partitions)  but it is ok with 
> table(with out partitions) . It means spark do not use its own parquet ?
> Maybe someone give any suggest how could I 

[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233578#comment-16233578
 ] 

xinzhang edited comment on SPARK-21725 at 11/1/17 2:59 AM:
---

[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, pointing the Hive metastore at MySQL (local metastore on 9083 or not; keep the metastore unchanged, that is not the point).
2. Copy that hive-site.xml into the Spark conf directory so spark-sql picks it up.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

!https://user-images.githubusercontent.com/8244097/32257548-af789d42-bef0-11e7-8c04-99137c50fbbf.png!

{color:red}*The metastore has been changed from Derby to MySQL. My suggestion is to try it in a new environment, separate from your current existing one, which you could rebuild.*{color}
As you say, this might be related to the metastore. I tested the case on CDH 5.7 (Hadoop 2.6) and on Hadoop 2.8 (a new environment); the problem always appears no matter what I do. I hope you can help. Thanks.

{color:red}Hi. It is the metastore! I tested with Derby and the Thrift Server is fine; when I switch to MySQL, the problem always appears. Could you test it?{color}


was (Author: zhangxin0112zx):
[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, pointing the Hive metastore at MySQL (local metastore on 9083 or not; keep the metastore unchanged, that is not the point).
2. Copy that hive-site.xml into the Spark conf directory so spark-sql picks it up.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

!https://user-images.githubusercontent.com/8244097/32257548-af789d42-bef0-11e7-8c04-99137c50fbbf.png!

{color:red}*The metastore has been changed from Derby to MySQL. My suggestion is to try it in a new environment, separate from your current existing one, which you could rebuild.*{color}
As you say, this might be related to the metastore. I tested the case on CDH 5.7 (Hadoop 2.6) and on Hadoop 2.8 (a new environment); the problem always appears no matter what I do. I hope you can help. Thanks.

{color:red}Hi. The metastore may be the key point: I tested with Derby and the Thrift Server is fine; when I switch to MySQL, the problem always appears.{color}

> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>Priority: Major
>  Labels: spark-sql
>
> use thriftserver create table with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
> ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> 
> -
> the doc about the parquet table desc here 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
> Hive metastore Parquet table conversion
> When reading from and writing to Hive metastore Parquet tables, Spark SQL 
> will try to use its own Parquet support instead of Hive SerDe for better 
> performance. This behavior is controlled by the 
> spark.sql.hive.convertMetastoreParquet configuration, and is turned on by 
> default.
> I am confused the problem appear in the 

[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233578#comment-16233578
 ] 

xinzhang edited comment on SPARK-21725 at 11/1/17 2:52 AM:
---

[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, pointing the Hive metastore at MySQL (local metastore on 9083 or not; keep the metastore unchanged, that is not the point).
2. Copy that hive-site.xml into the Spark conf directory so spark-sql picks it up.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

!https://user-images.githubusercontent.com/8244097/32257548-af789d42-bef0-11e7-8c04-99137c50fbbf.png!

{color:red}*The metastore has been changed from Derby to MySQL. My suggestion is to try it in a new environment, separate from your current existing one, which you could rebuild.*{color}
As you say, this might be related to the metastore. I tested the case on CDH 5.7 (Hadoop 2.6) and on Hadoop 2.8 (a new environment); the problem always appears no matter what I do. I hope you can help. Thanks.

{color:red}Hi. The metastore may be the key point: I tested with Derby and the Thrift Server is fine; when I switch to MySQL, the problem always appears.{color}


was (Author: zhangxin0112zx):
[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, pointing the Hive metastore at MySQL (local metastore on 9083 or not; keep the metastore unchanged, that is not the point).
2. Copy that hive-site.xml into the Spark conf directory so spark-sql picks it up.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

!https://user-images.githubusercontent.com/8244097/32257548-af789d42-bef0-11e7-8c04-99137c50fbbf.png!

{color:red}*The metastore has been changed from Derby to MySQL. My suggestion is to try it in a new environment, separate from your current existing one, which you could rebuild.*{color}
As you say, this might be related to the metastore. I tested the case on CDH 5.7 (Hadoop 2.6) and on Hadoop 2.8 (a new environment); the problem always appears no matter what I do. I hope you can help. Thanks.

> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>Priority: Major
>  Labels: spark-sql
>
> use thriftserver create table with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
> ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> 
> -
> the doc about the parquet table desc here 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
> Hive metastore Parquet table conversion
> When reading from and writing to Hive metastore Parquet tables, Spark SQL 
> will try to use its own Parquet support instead of Hive SerDe for better 
> performance. This behavior is controlled by the 
> spark.sql.hive.convertMetastoreParquet configuration, and is turned on by 
> default.
> I am confused the problem appear in the table(partitions)  but it is ok with 
> table(with out partitions) . It means spark do not use its own parquet ?
> Maybe someone give any suggest how could I avoid 

[jira] [Comment Edited] (SPARK-21827) Task fail due to executor exception when enable Sasl Encryption

2017-10-31 Thread Mario Molina (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233587#comment-16233587
 ] 

Mario Molina edited comment on SPARK-21827 at 11/1/17 2:50 AM:
---

Are you trying to read/write data from/to some db or HDFS or something like 
that? If so, which one? How many cores do you have assigned to each executor?


was (Author: mmolimar):
Are you trying to read/write data to some db or HDFS or something like that? If 
so, which one? How many cores do you have assigned to each executor?

> Task fail due to executor exception when enable Sasl Encryption
> ---
>
> Key: SPARK-21827
> URL: https://issues.apache.org/jira/browse/SPARK-21827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.6.1, 2.1.1, 2.2.0
> Environment: OS: RedHat 7.1 64bit
>Reporter: Yishan Jiang
>Priority: Major
>
> We met authentication and Sasl encryption on many versions, just append 161 
> version like this:
> spark.local.dir /tmp/test-161
> spark.shuffle.service.enabled true
> *spark.authenticate true*
> *spark.authenticate.enableSaslEncryption true*
> *spark.network.sasl.serverAlwaysEncrypt true*
> spark.authenticate.secret e25d4369-bec3-4266-8fc5-fb6d4fcee66f
> spark.history.ui.port 18089
> spark.shuffle.service.port 7347
> spark.master.rest.port 6076
> spark.deploy.recoveryMode NONE
> spark.ssl.enabled true
> spark.executor.extraJavaOptions -Djava.security.egd=file:/dev/./urandom
> We run an Spark example and task fail with Exception messages:
> 17/08/22 03:56:52 INFO BlockManager: external shuffle service port = 7347
> 17/08/22 03:56:52 INFO BlockManagerMaster: Trying to register BlockManager
> 17/08/22 03:56:52 INFO sasl: DIGEST41:Unmatched MACs
> 17/08/22 03:56:52 WARN TransportChannelHandler: Exception in connection from 
> cws57n6.ma.platformlab.ibm.com/172.29.8.66:49394
> java.lang.IllegalArgumentException: Frame length should be positive: 
> -5594407078713290673   
> at 
> org.spark-project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:135)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:82)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:785)
> 17/08/22 03:56:52 ERROR TransportResponseHandler: Still have 1 requests 
> outstanding when connection from 
> cws57n6.ma.platformlab.ibm.com/172.29.8.66:49394 is closed
> 17/08/22 03:56:52 WARN NettyRpcEndpointRef: Error sending message [message = 
> RegisterBlockManager(BlockManagerId(fe9d31da-f70c-40a2-9032-05a5af4ba4c5, 
> cws58n1.ma.platformlab.ibm.com, 45852),2985295872,NettyRpcEn
> dpointRef(null))] in 1 attempts
> java.lang.IllegalArgumentException: Frame length should be positive: 
> -5594407078713290673
> at 
> org.spark-project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:135)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:82)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> 

[jira] [Comment Edited] (SPARK-21827) Task fail due to executor exception when enable Sasl Encryption

2017-10-31 Thread Mario Molina (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233587#comment-16233587
 ] 

Mario Molina edited comment on SPARK-21827 at 11/1/17 2:49 AM:
---

Are you trying to read/write data to some db or HDFS or something like that? If 
so, which one? How many cores do you have assigned to each executor?


was (Author: mmolimar):
Are you trying to read/write data to some DB or HDFS or something like that? If so, which one? How many cores do you have assigned to each executor?

> Task fail due to executor exception when enable Sasl Encryption
> ---
>
> Key: SPARK-21827
> URL: https://issues.apache.org/jira/browse/SPARK-21827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.6.1, 2.1.1, 2.2.0
> Environment: OS: RedHat 7.1 64bit
>Reporter: Yishan Jiang
>Priority: Major
>
> We met authentication and Sasl encryption on many versions, just append 161 
> version like this:
> spark.local.dir /tmp/test-161
> spark.shuffle.service.enabled true
> *spark.authenticate true*
> *spark.authenticate.enableSaslEncryption true*
> *spark.network.sasl.serverAlwaysEncrypt true*
> spark.authenticate.secret e25d4369-bec3-4266-8fc5-fb6d4fcee66f
> spark.history.ui.port 18089
> spark.shuffle.service.port 7347
> spark.master.rest.port 6076
> spark.deploy.recoveryMode NONE
> spark.ssl.enabled true
> spark.executor.extraJavaOptions -Djava.security.egd=file:/dev/./urandom
> We run an Spark example and task fail with Exception messages:
> 17/08/22 03:56:52 INFO BlockManager: external shuffle service port = 7347
> 17/08/22 03:56:52 INFO BlockManagerMaster: Trying to register BlockManager
> 17/08/22 03:56:52 INFO sasl: DIGEST41:Unmatched MACs
> 17/08/22 03:56:52 WARN TransportChannelHandler: Exception in connection from 
> cws57n6.ma.platformlab.ibm.com/172.29.8.66:49394
> java.lang.IllegalArgumentException: Frame length should be positive: 
> -5594407078713290673   
> at 
> org.spark-project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:135)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:82)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:785)
> 17/08/22 03:56:52 ERROR TransportResponseHandler: Still have 1 requests 
> outstanding when connection from 
> cws57n6.ma.platformlab.ibm.com/172.29.8.66:49394 is closed
> 17/08/22 03:56:52 WARN NettyRpcEndpointRef: Error sending message [message = 
> RegisterBlockManager(BlockManagerId(fe9d31da-f70c-40a2-9032-05a5af4ba4c5, 
> cws58n1.ma.platformlab.ibm.com, 45852),2985295872,NettyRpcEn
> dpointRef(null))] in 1 attempts
> java.lang.IllegalArgumentException: Frame length should be positive: 
> -5594407078713290673
> at 
> org.spark-project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:135)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:82)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> 

[jira] [Commented] (SPARK-21827) Task fail due to executor exception when enable Sasl Encryption

2017-10-31 Thread Mario Molina (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233587#comment-16233587
 ] 

Mario Molina commented on SPARK-21827:
--

Are you trying to read/write data to some DB or HDFS or something like that? If so, which one? How many cores do you have assigned to each executor?

> Task fail due to executor exception when enable Sasl Encryption
> ---
>
> Key: SPARK-21827
> URL: https://issues.apache.org/jira/browse/SPARK-21827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.6.1, 2.1.1, 2.2.0
> Environment: OS: RedHat 7.1 64bit
>Reporter: Yishan Jiang
>Priority: Major
>
> We met authentication and Sasl encryption on many versions, just append 161 
> version like this:
> spark.local.dir /tmp/test-161
> spark.shuffle.service.enabled true
> *spark.authenticate true*
> *spark.authenticate.enableSaslEncryption true*
> *spark.network.sasl.serverAlwaysEncrypt true*
> spark.authenticate.secret e25d4369-bec3-4266-8fc5-fb6d4fcee66f
> spark.history.ui.port 18089
> spark.shuffle.service.port 7347
> spark.master.rest.port 6076
> spark.deploy.recoveryMode NONE
> spark.ssl.enabled true
> spark.executor.extraJavaOptions -Djava.security.egd=file:/dev/./urandom
> We run an Spark example and task fail with Exception messages:
> 17/08/22 03:56:52 INFO BlockManager: external shuffle service port = 7347
> 17/08/22 03:56:52 INFO BlockManagerMaster: Trying to register BlockManager
> 17/08/22 03:56:52 INFO sasl: DIGEST41:Unmatched MACs
> 17/08/22 03:56:52 WARN TransportChannelHandler: Exception in connection from 
> cws57n6.ma.platformlab.ibm.com/172.29.8.66:49394
> java.lang.IllegalArgumentException: Frame length should be positive: 
> -5594407078713290673   
> at 
> org.spark-project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:135)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:82)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:785)
> 17/08/22 03:56:52 ERROR TransportResponseHandler: Still have 1 requests 
> outstanding when connection from 
> cws57n6.ma.platformlab.ibm.com/172.29.8.66:49394 is closed
> 17/08/22 03:56:52 WARN NettyRpcEndpointRef: Error sending message [message = 
> RegisterBlockManager(BlockManagerId(fe9d31da-f70c-40a2-9032-05a5af4ba4c5, 
> cws58n1.ma.platformlab.ibm.com, 45852),2985295872,NettyRpcEn
> dpointRef(null))] in 1 attempts
> java.lang.IllegalArgumentException: Frame length should be positive: 
> -5594407078713290673
> at 
> org.spark-project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:135)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:82)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at 
> 

[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233578#comment-16233578
 ] 

xinzhang edited comment on SPARK-21725 at 11/1/17 2:38 AM:
---

[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, pointing the Hive metastore at MySQL (local metastore on 9083 or not; keep the metastore unchanged, that is not the point).
2. Copy that hive-site.xml into the Spark conf directory so spark-sql picks it up.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

!https://user-images.githubusercontent.com/8244097/32257548-af789d42-bef0-11e7-8c04-99137c50fbbf.png!

{color:red}*The metastore has been changed from Derby to MySQL. My suggestion is to try it in a new environment, separate from your current existing one, which you could rebuild.*{color}
As you say, this might be related to the metastore. I tested the case on CDH 5.7 (Hadoop 2.6) and on Hadoop 2.8 (a new environment); the problem always appears no matter what I do. I hope you can help. Thanks.


was (Author: zhangxin0112zx):
[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, 
   pointing the Hive metastore at MySQL (local 9083 or not; keep the 
   metastore as it is, that is not the point here).
2. spark-sql: copy in the same hive-site.xml.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

!https://user-images.githubusercontent.com/8244097/32257548-af789d42-bef0-11e7-8c04-99137c50fbbf.png!

The metastore has been changed from Derby to MySQL. My suggestion is that you 
try it in a brand-new environment rather than your current one; you could 
rebuild it.
As you say, it might be related to the metastore. I tested the case on 
cdh5.7 (hadoop2.6) and on hadoop2.8 (a new environment), and the problem always 
appears, no matter what I did. Hoping for your help. Thanks.

> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>Priority: Major
>  Labels: spark-sql
>
> Use the Thrift Server to create tables with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
> ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> 
> -
> The documentation for Parquet tables is here: 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
> Hive metastore Parquet table conversion
> When reading from and writing to Hive metastore Parquet tables, Spark SQL 
> will try to use its own Parquet support instead of Hive SerDe for better 
> performance. This behavior is controlled by the 
> spark.sql.hive.convertMetastoreParquet configuration, and is turned on by 
> default.
> I am confused: the problem appears with partitioned tables, but it is fine with 
> unpartitioned tables. Does that mean Spark is not using its own Parquet support here?
> Could someone suggest how I can avoid this issue?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: 

[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233578#comment-16233578
 ] 

xinzhang edited comment on SPARK-21725 at 11/1/17 2:37 AM:
---

[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, 
   pointing the Hive metastore at MySQL (local 9083 or not; keep the 
   metastore as it is, that is not the point here).
2. spark-sql: copy in the same hive-site.xml.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

!https://user-images.githubusercontent.com/8244097/32257548-af789d42-bef0-11e7-8c04-99137c50fbbf.png!

The metastore has been changed from Derby to MySQL. My suggestion is that you 
try it in a brand-new environment rather than your current one; you could 
rebuild it.
As you say, it might be related to the metastore. I tested the case on 
cdh5.7 (hadoop2.6) and on hadoop2.8 (a new environment), and the problem always 
appears, no matter what I did. Hoping for your help. Thanks.


was (Author: zhangxin0112zx):
[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, 
   pointing the Hive metastore at MySQL (local 9083).
2. spark-sql: copy in the same hive-site.xml.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

The metastore has been changed from Derby to MySQL. My suggestion is that you 
try it in a brand-new environment rather than your current one; you could 
rebuild it.
As you say, it might be related to the metastore. I tested the case on 
cdh5.7 (hadoop2.6) and on hadoop2.8 (a new environment), and the problem always 
appears, no matter what I did. Hoping for your help. Thanks.

> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>Priority: Major
>  Labels: spark-sql
>
> Use the Thrift Server to create tables with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
> ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> 
> -
> The documentation for Parquet tables is here: 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
> Hive metastore Parquet table conversion
> When reading from and writing to Hive metastore Parquet tables, Spark SQL 
> will try to use its own Parquet support instead of Hive SerDe for better 
> performance. This behavior is controlled by the 
> spark.sql.hive.convertMetastoreParquet configuration, and is turned on by 
> default.
> I am confused: the problem appears with partitioned tables, but it is fine with 
> unpartitioned tables. Does that mean Spark is not using its own Parquet support here?
> Could someone suggest how I can avoid this issue?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233578#comment-16233578
 ] 

xinzhang edited comment on SPARK-21725 at 11/1/17 2:32 AM:
---

[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, 
   pointing the Hive metastore at MySQL (local 9083).
2. spark-sql: copy in the same hive-site.xml.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

The metastore has been changed from Derby to MySQL. My suggestion is that you 
try it in a brand-new environment rather than your current one; you could 
rebuild it.
As you say, it might be related to the metastore. I tested the case on 
cdh5.7 (hadoop2.6) and on hadoop2.8 (a new environment), and the problem always 
appears, no matter what I did. Hoping for your help. Thanks.


was (Author: zhangxin0112zx):
[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, 
   pointing the Hive metastore at MySQL (local 9083).
2. spark-sql: copy in the same hive-site.xml.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

The metastore has been changed from Derby to MySQL. My suggestion is that you 
try it in a brand-new environment rather than your current one.
As you say, it might be related to the metastore. I tested the case on 
cdh5.7 (hadoop2.6) and on hadoop2.8 (a new environment), and the problem always 
appears, no matter what I did. Hoping for your help. Thanks.

> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>Priority: Major
>  Labels: spark-sql
>
> Use the Thrift Server to create tables with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
> ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> 
> -
> The documentation for Parquet tables is here: 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
> Hive metastore Parquet table conversion
> When reading from and writing to Hive metastore Parquet tables, Spark SQL 
> will try to use its own Parquet support instead of Hive SerDe for better 
> performance. This behavior is controlled by the 
> spark.sql.hive.convertMetastoreParquet configuration, and is turned on by 
> default.
> I am confused: the problem appears with partitioned tables, but it is fine with 
> unpartitioned tables. Does that mean Spark is not using its own Parquet support here?
> Could someone suggest how I can avoid this issue?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233578#comment-16233578
 ] 

xinzhang edited comment on SPARK-21725 at 11/1/17 2:31 AM:
---

[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, 
   pointing the Hive metastore at MySQL (local 9083).
2. spark-sql: copy in the same hive-site.xml.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

The metastore has been changed from Derby to MySQL. My suggestion is that you 
try it in a brand-new environment rather than your current one.
As you say, it might be related to the metastore. I tested the case on 
cdh5.7 (hadoop2.6) and on hadoop2.8 (a new environment), and the problem always 
appears, no matter what I did. Hoping for your help. Thanks.


was (Author: zhangxin0112zx):
[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, 
   pointing the Hive metastore at MySQL (local 9083).
2. spark-sql: copy in the same hive-site.xml.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

The metastore has been changed from Derby to MySQL. My suggestion is that you 
try it in a new environment, without your current one.
As you say, it might be related to the metastore. I tested the case on 
cdh5.7 (hadoop2.6) and on hadoop2.8 (a new environment), and the problem always 
appears, no matter what I did. Hoping for your help. Thanks.

> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>Priority: Major
>  Labels: spark-sql
>
> Use the Thrift Server to create tables with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
> ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> 
> -
> The documentation for Parquet tables is here: 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
> Hive metastore Parquet table conversion
> When reading from and writing to Hive metastore Parquet tables, Spark SQL 
> will try to use its own Parquet support instead of Hive SerDe for better 
> performance. This behavior is controlled by the 
> spark.sql.hive.convertMetastoreParquet configuration, and is turned on by 
> default.
> I am confused: the problem appears with partitioned tables, but it is fine with 
> unpartitioned tables. Does that mean Spark is not using its own Parquet support here?
> Could someone suggest how I can avoid this issue?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233578#comment-16233578
 ] 

xinzhang edited comment on SPARK-21725 at 11/1/17 2:30 AM:
---

[~mgaido]

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, 
   pointing the Hive metastore at MySQL (local 9083).
2. spark-sql: copy in the same hive-site.xml.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

The metastore has been changed from Derby to MySQL. My suggestion is that you 
try it in a new environment, without your current one.
As you say, it might be related to the metastore. I tested the case on 
cdh5.7 (hadoop2.6) and on hadoop2.8 (a new environment), and the problem always 
appears, no matter what I did. Hoping for your help. Thanks.


was (Author: zhangxin0112zx):
1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, 
   pointing the Hive metastore at MySQL (local 9083).
2. spark-sql: copy in the same hive-site.xml.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

The metastore has been changed from Derby to MySQL. My suggestion is that you 
try it in a new environment, without your current one.
As you say, it might be related to the metastore. I tested the case on 
cdh5.7 (hadoop2.6) and on hadoop2.8 (a new environment), and the problem always 
appears, no matter what I did. Hoping for your help. Thanks.

> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>Priority: Major
>  Labels: spark-sql
>
> Use the Thrift Server to create tables with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
> ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> 
> -
> The documentation for Parquet tables is here: 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
> Hive metastore Parquet table conversion
> When reading from and writing to Hive metastore Parquet tables, Spark SQL 
> will try to use its own Parquet support instead of Hive SerDe for better 
> performance. This behavior is controlled by the 
> spark.sql.hive.convertMetastoreParquet configuration, and is turned on by 
> default.
> I am confused: the problem appears with partitioned tables, but it is fine with 
> unpartitioned tables. Does that mean Spark is not using its own Parquet support here?
> Could someone suggest how I can avoid this issue?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233578#comment-16233578
 ] 

xinzhang commented on SPARK-21725:
--

1. Hive 1.2.1: download a fresh tarball and change only hive-site.xml, 
   pointing the Hive metastore at MySQL (local 9083).
2. spark-sql: copy in the same hive-site.xml.
3. Start the Spark Thrift Server.
4. Connect to the Thrift Server with beeline.

The metastore has been changed from Derby to MySQL. My suggestion is that you 
try it in a new environment, without your current one.
As you say, it might be related to the metastore. I tested the case on 
cdh5.7 (hadoop2.6) and on hadoop2.8 (a new environment), and the problem always 
appears, no matter what I did. Hoping for your help. Thanks.
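
For anyone trying to narrow this down: the documentation quoted below says that 
Parquet handling for Hive metastore tables is controlled by 
{{spark.sql.hive.convertMetastoreParquet}}. The following is only a minimal 
diagnostic sketch, assuming that flag and the tmp_10/tmp_11 tables from this 
report; it is not a confirmed fix, just a way to see which write path triggers 
the error.

{code:scala}
import org.apache.spark.sql.SparkSession

object ConvertMetastoreParquetCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("convertMetastoreParquet-check")
      .enableHiveSupport()
      .getOrCreate()

    // Run the failing INSERT OVERWRITE once with Hive's SerDe writer
    // (conversion off) and once with Spark's native Parquet writer
    // (conversion on, the default), to see which path hits the
    // "Unable to move source ... Filesystem closed" error.
    for (useNativeParquet <- Seq(false, true)) {
      spark.conf.set("spark.sql.hive.convertMetastoreParquet", useNativeParquet)
      spark.sql(
        "insert overwrite table tmp_10 partition(pt='1') " +
          "select count(1) count from tmp_11")
      println(s"convertMetastoreParquet=$useNativeParquet: insert succeeded")
    }
  }
}
{code}

The same toggle can also be issued from a beeline session against the Thrift 
Server with {{SET spark.sql.hive.convertMetastoreParquet=false;}} before the 
INSERT OVERWRITE.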

> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>Priority: Major
>  Labels: spark-sql
>
> Use the Thrift Server to create tables with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
> ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> 
> -
> The documentation for Parquet tables is here: 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
> Hive metastore Parquet table conversion
> When reading from and writing to Hive metastore Parquet tables, Spark SQL 
> will try to use its own Parquet support instead of Hive SerDe for better 
> performance. This behavior is controlled by the 
> spark.sql.hive.convertMetastoreParquet configuration, and is turned on by 
> default.
> I am confused: the problem appears with partitioned tables, but it is fine with 
> unpartitioned tables. Does that mean Spark is not using its own Parquet support here?
> Could someone suggest how I can avoid this issue?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22406) pyspark version tag is wrong on PyPi

2017-10-31 Thread Kerrick Staley (JIRA)
Kerrick Staley created SPARK-22406:
--

 Summary: pyspark version tag is wrong on PyPi
 Key: SPARK-22406
 URL: https://issues.apache.org/jira/browse/SPARK-22406
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.0
Reporter: Kerrick Staley
Priority: Minor


On pypi.python.org, the pyspark package is tagged with version {{2.2.0.post0}}: 
https://pypi.python.org/pypi/pyspark/2.2.0

However, when you install the package, it has version {{2.2.0}}.

This has really annoying consequences: if you try {{pip install 
pyspark==2.2.0}}, it won't work. Instead you have to do {{pip install 
pyspark==2.2.0.post0}}. Then, if you later run the same command ({{pip install 
pyspark==2.2.0.post0}}), it won't recognize the existing pyspark installation 
(because it has version {{2.2.0}}) and instead will reinstall it, which is very 
slow because pyspark is a large package.

This can happen if you add a new package to a {{requirements.txt}} file; you 
end up waiting a lot longer than necessary because every time you run {{pip 
install -r requirements.txt}} it reinstalls pyspark.

Can you please change the package on PyPi to have the version {{2.2.0}}?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22405) Enrich the event information and add new event of ExternalCatalogEvent

2017-10-31 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233569#comment-16233569
 ] 

Saisai Shao commented on SPARK-22405:
-

Thanks [~hvanhovell] for your comments.

bq. When implementing this we just wanted to have a way to know that some 
metadata was about to change. A consumer could always retrieve more information 
about the (to-be-)changed object by querying the catalog

I think this is a feasible approach for my needs, but it still requires more 
events to be posted, such as "AlterTableEvent" or "AlterDatabaseEvent", so that 
the user can query the catalog in response to the posted events; without them, 
the user does not know when a table or database has been altered. What do you 
think?
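
To make the consumer side concrete, here is a minimal sketch of a listener that 
reacts to the catalog events that already exist and then queries the catalog for 
richer metadata, along the lines suggested above. It assumes the Spark 2.2 event 
classes {{CreateTableEvent}} and {{DropTableEvent}} (which carry only database 
and table names), and the class name LineageListener is made up for the example.

{code:scala}
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.catalog.{CreateTableEvent, DropTableEvent}

// Listens for catalog events on the listener bus and, for creations, looks up
// richer metadata through the public Catalog API.
class LineageListener(spark: SparkSession) extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case CreateTableEvent(db, table) =>
      val t = spark.catalog.getTable(db, table)
      println(s"created ${t.database}.${t.name}: type=${t.tableType}, temporary=${t.isTemporary}")
    case DropTableEvent(db, table) =>
      println(s"dropped $db.$table")
    case _ => // ignore everything else
  }
}

// Registration, e.g. right after the SparkSession is created:
// spark.sparkContext.addSparkListener(new LineageListener(spark))
{code}

An "AlterTableEvent" would slot into the same match once it exists, which is 
exactly why the extra events would help.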

> Enrich the event information and add new event of ExternalCatalogEvent
> --
>
> Key: SPARK-22405
> URL: https://issues.apache.org/jira/browse/SPARK-22405
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Priority: Minor
>
> We're building a data lineage tool in which we need to monitor the metadata 
> changes in {{ExternalCatalog}}. The current {{ExternalCatalog}} already provides 
> several useful events, like "CreateDatabaseEvent", for a custom SparkListener to 
> consume, but the information carried by such events is not rich enough; for 
> example, {{CreateTablePreEvent}} only provides the "database" name and the "table" 
> name, not the full table metadata, which makes it hard for users to get all of the 
> useful table-related information.
> So here I propose to add a new {{ExternalCatalogEvent}} and to enrich the 
> existing events for all catalog-related updates.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory

2017-10-31 Thread Darron Fuller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233524#comment-16233524
 ] 

Darron Fuller commented on SPARK-2984:
--

I am seeing a similar issue as well. It is not related to S3 or HDFS, as I am 
reading files directly from the Linux file system.

{code:java}
Caused by: java.io.FileNotFoundException: 
/tmp/spark-b24b4f13-d1d3-4d4b-986e-420f64febac3/executor-7cbd0dcc-0b8f-4210-8b35-ea72efd3d2a7/blockmgr-81fe3028-4834-40e4-b5ca-e731a03286f1/03/shuffle_33528_20_0.index.a80d3f03-224d-4915-a7a1-26de8b20b59b
 (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.(FileOutputStream.java:213)
at java.io.FileOutputStream.(FileOutputStream.java:162)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:144)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
{code}
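
Since the quoted description below suspects {{spark.speculation}}, here is a 
minimal sketch for testing that hypothesis; it assumes only the standard 
configuration key (speculation is off by default, so this only matters if 
something else enabled it).

{code:scala}
import org.apache.spark.sql.SparkSession

// One run with speculative execution explicitly disabled, to check whether the
// _temporary / shuffle-file FileNotFoundException still occurs without speculation.
// Equivalently: spark-submit --conf spark.speculation=false ...
val spark = SparkSession.builder()
  .appName("speculation-off-check")
  .config("spark.speculation", "false")
  .getOrCreate()
{code}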

> FileNotFoundException on _temporary directory
> -
>
> Key: SPARK-2984
> URL: https://issues.apache.org/jira/browse/SPARK-2984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.0
>
>
> We've seen several stacktraces and threads on the user mailing list where 
> people are having issues with a {{FileNotFoundException}} stemming from an 
> HDFS path containing {{_temporary}}.
> I ([~aash]) think this may be related to {{spark.speculation}}.  I think the 
> error condition might manifest in this circumstance:
> 1) task T starts on a executor E1
> 2) it takes a long time, so task T' is started on another executor E2
> 3) T finishes in E1 so moves its data from {{_temporary}} to the final 
> destination and deletes the {{_temporary}} directory during cleanup
> 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but 
> those files no longer exist!  exception
> Some samples:
> {noformat}
> 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 
> 140774430 ms.0
> java.io.FileNotFoundException: File 
> hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07
>  does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
> at 
> org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
> at 
> org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at 
> 

[jira] [Reopened] (SPARK-22243) streaming job failed to restart from checkpoint

2017-10-31 Thread StephenZou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StephenZou reopened SPARK-22243:


Reopening: the fix for this issue still needs to be merged.

> streaming job failed to restart from checkpoint
> ---
>
> Key: SPARK-22243
> URL: https://issues.apache.org/jira/browse/SPARK-22243
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0, 2.2.0
>Reporter: StephenZou
>Priority: Major
> Attachments: CheckpointTest.scala
>
>
> My spark-defaults.conf has a setting related to the issue; I upload all jars in 
> Spark's jars folder to an HDFS path:
> spark.yarn.jars  hdfs:///spark/cache/spark2.2/* 
> The streaming job fails to restart from the checkpoint: the ApplicationMaster throws 
> "Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ExecutorLauncher". The problem is always 
> reproducible.
> I examined the SparkConf object recovered from the checkpoint and found that 
> spark.yarn.jars is set to empty, so none of the jars exist on the AM side. The 
> solution is that spark.yarn.jars should be reloaded from the properties files when 
> recovering from the checkpoint. 
> Attached is a demo that reproduces the issue.
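
For context, a minimal sketch of the checkpoint-recovery entry point involved; 
the paths and the trivial socket job are placeholders, and the comment only 
restates the observation above about spark.yarn.jars.

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointRecoveryDemo {
  val checkpointDir = "hdfs:///user/demo/checkpoint"  // placeholder path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpoint-recovery-demo")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    // Trivial job so the context has at least one output operation.
    ssc.socketTextStream("localhost", 9999).count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, getOrCreate rebuilds the StreamingContext (and its SparkConf)
    // from the checkpoint; per this report, spark.yarn.jars comes back empty
    // from that recovered conf, so the AM cannot find ExecutorLauncher.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}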



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22242) streaming job failed to restart from checkpoint

2017-10-31 Thread StephenZou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StephenZou resolved SPARK-22242.

Resolution: Duplicate

> streaming job failed to restart from checkpoint
> ---
>
> Key: SPARK-22242
> URL: https://issues.apache.org/jira/browse/SPARK-22242
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0, 2.2.0
>Reporter: StephenZou
>Priority: Major
>
> My spark-defaults.conf has a setting related to the issue; I upload all jars in 
> Spark's jars folder to an HDFS path:
> spark.yarn.jars  hdfs:///spark/cache/spark2.2/* 
> The streaming job fails to restart from the checkpoint: the ApplicationMaster throws 
> "Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ExecutorLauncher". The problem is always 
> reproducible.
> I examined the SparkConf object recovered from the checkpoint and found that 
> spark.yarn.jars is set to empty, so none of the jars exist on the AM side. The 
> solution is that spark.yarn.jars should be reloaded from the properties files when 
> recovering from the checkpoint. 
> Attached is a demo that reproduces the issue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21930) When the number of attempting to restart receiver greater than 0,spark do nothing in 'else'

2017-10-31 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-21930:
-
Component/s: (was: Structured Streaming)
 DStreams

> When the number of  attempting to restart receiver greater than 0,spark do 
> nothing in 'else'
> 
>
> Key: SPARK-21930
> URL: https://issues.apache.org/jira/browse/SPARK-21930
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: liuxianjiao
>Priority: Trivial
>
> When the number of attempts to restart a receiver is greater than 0, Spark does 
> nothing in the 'else' branch, so I think we should log a trace message to let users know why.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17556) Executor side broadcast for broadcast joins

2017-10-31 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17556:

Target Version/s:   (was: 2.3.0)

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22315) Check for version match between R package and JVM

2017-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227450#comment-16227450
 ] 

Apache Spark commented on SPARK-22315:
--

User 'shivaram' has created a pull request for this issue:
https://github.com/apache/spark/pull/19624

> Check for version match between R package and JVM
> -
>
> Key: SPARK-22315
> URL: https://issues.apache.org/jira/browse/SPARK-22315
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Shivaram Venkataraman
>
> With the release of SparkR on CRAN we could have scenarios where users have a 
> newer version of the package than the Spark cluster they are 
> connecting to.
> We should print appropriate warnings when either (a) connecting to a different 
> version of the R backend, or (b) connecting to a Spark master running a different 
> version of Spark (this should ideally happen inside Scala?).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22315) Check for version match between R package and JVM

2017-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22315:


Assignee: (was: Apache Spark)

> Check for version match between R package and JVM
> -
>
> Key: SPARK-22315
> URL: https://issues.apache.org/jira/browse/SPARK-22315
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Shivaram Venkataraman
>
> With the release of SparkR on CRAN we could have scenarios where users have a 
> newer version of the package than the Spark cluster they are 
> connecting to.
> We should print appropriate warnings when either (a) connecting to a different 
> version of the R backend, or (b) connecting to a Spark master running a different 
> version of Spark (this should ideally happen inside Scala?).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22315) Check for version match between R package and JVM

2017-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22315:


Assignee: Apache Spark

> Check for version match between R package and JVM
> -
>
> Key: SPARK-22315
> URL: https://issues.apache.org/jira/browse/SPARK-22315
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Shivaram Venkataraman
>Assignee: Apache Spark
>
> With the release of SparkR on CRAN we could have scenarios where users have a 
> newer version of the package than the Spark cluster they are 
> connecting to.
> We should print appropriate warnings when either (a) connecting to a different 
> version of the R backend, or (b) connecting to a Spark master running a different 
> version of Spark (this should ideally happen inside Scala?).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22305) HDFSBackedStateStoreProvider fails with StackOverflowException when attempting to recover state

2017-10-31 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-22305.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

> HDFSBackedStateStoreProvider fails with StackOverflowException when 
> attempting to recover state
> ---
>
> Key: SPARK-22305
> URL: https://issues.apache.org/jira/browse/SPARK-22305
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Yuval Itzchakov
>Assignee: Jose Torres
> Fix For: 2.3.0
>
>
> Environment:
> Spark: 2.2.0
> Java version: 1.8.0_112
> spark.sql.streaming.minBatchesToRetain: 100
> After an application failure due to OOM exceptions, restarting the 
> application with the existing state produces the following StackOverflowError:
> {code:java}
> java.io.IOException: com.google.protobuf.ServiceException: 
> java.lang.StackOverflowError
>   at 
> org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:260)
>   at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1240)
>   at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227)
>   at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1215)
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:303)
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:269)
>   at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:261)
>   at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1540)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:405)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:295)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:297)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:296)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> 

[jira] [Assigned] (SPARK-22305) HDFSBackedStateStoreProvider fails with StackOverflowException when attempting to recover state

2017-10-31 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-22305:


Assignee: Jose Torres

> HDFSBackedStateStoreProvider fails with StackOverflowException when 
> attempting to recover state
> ---
>
> Key: SPARK-22305
> URL: https://issues.apache.org/jira/browse/SPARK-22305
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Yuval Itzchakov
>Assignee: Jose Torres
> Fix For: 2.3.0
>
>
> Environment:
> Spark: 2.2.0
> Java version: 1.8.0_112
> spark.sql.streaming.minBatchesToRetain: 100
> After an application failure due to OOM exceptions, restarting the 
> application with the existing state produces the following StackOverflowError:
> {code:java}
> java.io.IOException: com.google.protobuf.ServiceException: 
> java.lang.StackOverflowError
>   at 
> org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:260)
>   at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1240)
>   at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227)
>   at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1215)
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:303)
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:269)
>   at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:261)
>   at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1540)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:405)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:295)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:297)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:296)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> 

[jira] [Commented] (SPARK-22403) StructuredKafkaWordCount example fails in YARN cluster mode

2017-10-31 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227321#comment-16227321
 ] 

Shixiong Zhu commented on SPARK-22403:
--

Yeah, feel free to submit a PR to improve the example.

> StructuredKafkaWordCount example fails in YARN cluster mode
> ---
>
> Key: SPARK-22403
> URL: https://issues.apache.org/jira/browse/SPARK-22403
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Wing Yew Poon
>
> When I run the StructuredKafkaWordCount example in YARN client mode, it runs 
> fine. However, when I run it in YARN cluster mode, the application errors 
> during initialization, and dies after the default number of YARN application 
> attempts. In the AM log, I see
> {noformat}
> 17/10/30 11:34:52 INFO execution.SparkSqlParser: Parsing command: CAST(value 
> AS STRING)
> 17/10/30 11:34:53 ERROR streaming.StreamMetadata: Error writing stream 
> metadata StreamMetadata(b71ca714-a7a1-467f-96aa-023375964429) to 
> /data/yarn/nm/usercache/systest/appcache/application_1508800814252_0047/container_1508800814252_0047_01_01/tmp/temporary-b5ced4ae-32e0-4725-b905-aad679aec9b5/metadata
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=systest, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:397)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:256)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:194)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1842)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1826)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1785)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.resolvePathForStartFile(FSDirWriteFileOp.java:315)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2313)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2257)
> ...
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:280)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1235)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1214)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1152)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:458)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:455)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:469)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:396)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1103)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1083)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:972)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:960)
>   at 
> org.apache.spark.sql.execution.streaming.StreamMetadata$.write(StreamMetadata.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:116)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:114)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.(StreamExecution.scala:114)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:240)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:282)
>   at 
> org.apache.spark.examples.sql.streaming.StructuredKafkaWordCount$.main(StructuredKafkaWordCount.scala:79)
>   at 
> org.apache.spark.examples.sql.streaming.StructuredKafkaWordCount.main(StructuredKafkaWordCount.scala)
> {noformat}
> Looking at StreamingQueryManager#createQuery, we have
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala#L198
> {code}
> val checkpointLocation = userSpecifiedCheckpointLocation.map { ...
>   ...
> }.orElse {
>   ...
> }.getOrElse 

[jira] [Commented] (SPARK-22403) StructuredKafkaWordCount example fails in YARN cluster mode

2017-10-31 Thread Wing Yew Poon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227298#comment-16227298
 ] 

Wing Yew Poon commented on SPARK-22403:
---

I realize that in a production application, one would set checkpointLocation 
and avoid this issue. However, there is evidently a problem in the code that 
handles the case when checkpointLocation is not set and a temporary checkpoint 
location is created. Also, the StructuredKafkaWordCount example does not accept 
a parameter for setting the checkpointLocation.
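
To make that last point concrete, here is a minimal sketch of how the example 
might accept an explicit checkpoint location; the fourth argument is hypothetical 
(the current example takes only <bootstrap-servers> <subscribe-type> <topics>), 
and everything else follows the usual structured-streaming Kafka word-count shape.

{code:scala}
import org.apache.spark.sql.SparkSession

object StructuredKafkaWordCountWithCheckpoint {
  def main(args: Array[String]): Unit = {
    // Hypothetical argument order: <bootstrap-servers> <subscribe-type> <topics> <checkpoint-dir>
    val Array(bootstrapServers, subscribeType, topics, checkpointDir) = args

    val spark = SparkSession.builder.appName("StructuredKafkaWordCount").getOrCreate()
    import spark.implicits._

    val lines = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", bootstrapServers)
      .option(subscribeType, topics)
      .load()
      .selectExpr("CAST(value AS STRING)")
      .as[String]

    val wordCounts = lines.flatMap(_.split(" ")).groupBy("value").count()

    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      // Explicit checkpoint location on a path the submitting user can write to,
      // instead of the temporary one derived from the driver's working directory.
      .option("checkpointLocation", checkpointDir)
      .start()

    query.awaitTermination()
  }
}
{code}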


> StructuredKafkaWordCount example fails in YARN cluster mode
> ---
>
> Key: SPARK-22403
> URL: https://issues.apache.org/jira/browse/SPARK-22403
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Wing Yew Poon
>
> When I run the StructuredKafkaWordCount example in YARN client mode, it runs 
> fine. However, when I run it in YARN cluster mode, the application errors 
> during initialization, and dies after the default number of YARN application 
> attempts. In the AM log, I see
> {noformat}
> 17/10/30 11:34:52 INFO execution.SparkSqlParser: Parsing command: CAST(value 
> AS STRING)
> 17/10/30 11:34:53 ERROR streaming.StreamMetadata: Error writing stream 
> metadata StreamMetadata(b71ca714-a7a1-467f-96aa-023375964429) to 
> /data/yarn/nm/usercache/systest/appcache/application_1508800814252_0047/container_1508800814252_0047_01_01/tmp/temporary-b5ced4ae-32e0-4725-b905-aad679aec9b5/metadata
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=systest, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:397)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:256)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:194)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1842)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1826)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1785)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.resolvePathForStartFile(FSDirWriteFileOp.java:315)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2313)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2257)
> ...
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:280)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1235)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1214)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1152)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:458)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:455)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:469)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:396)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1103)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1083)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:972)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:960)
>   at 
> org.apache.spark.sql.execution.streaming.StreamMetadata$.write(StreamMetadata.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:116)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:114)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.(StreamExecution.scala:114)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:240)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:282)
>   at 
> org.apache.spark.examples.sql.streaming.StructuredKafkaWordCount$.main(StructuredKafkaWordCount.scala:79)
>   at 
> org.apache.spark.examples.sql.streaming.StructuredKafkaWordCount.main(StructuredKafkaWordCount.scala)
> {noformat}
> Looking at 

[jira] [Commented] (SPARK-22403) StructuredKafkaWordCount example fails in YARN cluster mode

2017-10-31 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227266#comment-16227266
 ] 

Shixiong Zhu commented on SPARK-22403:
--

Yeah, Spark creates a temp directory for you. You can set "checkpointLocation" 
yourself to avoid this issue. I don't know whether there is an API to create a 
temp directory that works for all types of file systems.
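
If such an API were added, one possible shape (purely a sketch, not an existing
Spark helper) would be to create the temporary checkpoint directory on the
configured default FileSystem rather than on the driver's local disk:

{code}
import java.util.UUID

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Create a unique temporary checkpoint directory under the user's home directory
// on whatever default FileSystem is configured (HDFS in the YARN case).
def createTempCheckpointDir(spark: SparkSession): String = {
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val dir = new Path(fs.getHomeDirectory, s"temporary-${UUID.randomUUID()}")
  fs.mkdirs(dir)
  dir.toString
}
{code}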

> StructuredKafkaWordCount example fails in YARN cluster mode
> ---
>
> Key: SPARK-22403
> URL: https://issues.apache.org/jira/browse/SPARK-22403
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Wing Yew Poon
>
> When I run the StructuredKafkaWordCount example in YARN client mode, it runs 
> fine. However, when I run it in YARN cluster mode, the application errors 
> during initialization, and dies after the default number of YARN application 
> attempts. In the AM log, I see
> {noformat}
> 17/10/30 11:34:52 INFO execution.SparkSqlParser: Parsing command: CAST(value 
> AS STRING)
> 17/10/30 11:34:53 ERROR streaming.StreamMetadata: Error writing stream 
> metadata StreamMetadata(b71ca714-a7a1-467f-96aa-023375964429) to 
> /data/yarn/nm/usercache/systest/appcache/application_1508800814252_0047/container_1508800814252_0047_01_01/tmp/temporary-b5ced4ae-32e0-4725-b905-aad679aec9b5/metadata
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=systest, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:397)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:256)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:194)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1842)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1826)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1785)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.resolvePathForStartFile(FSDirWriteFileOp.java:315)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2313)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2257)
> ...
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:280)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1235)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1214)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1152)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:458)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:455)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:469)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:396)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1103)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1083)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:972)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:960)
>   at 
> org.apache.spark.sql.execution.streaming.StreamMetadata$.write(StreamMetadata.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:116)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:114)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.(StreamExecution.scala:114)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:240)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:282)
>   at 
> org.apache.spark.examples.sql.streaming.StructuredKafkaWordCount$.main(StructuredKafkaWordCount.scala:79)
>   at 
> org.apache.spark.examples.sql.streaming.StructuredKafkaWordCount.main(StructuredKafkaWordCount.scala)
> {noformat}
> Looking at StreamingQueryManager#createQuery, we have
> 

[jira] [Updated] (SPARK-22333) ColumnReference should get higher priority than timeFunctionCall(CURRENT_DATE, CURRENT_TIMESTAMP)

2017-10-31 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22333:

Fix Version/s: 2.2.1

> ColumnReference should get higher priority than 
> timeFunctionCall(CURRENT_DATE, CURRENT_TIMESTAMP)
> -
>
> Key: SPARK-22333
> URL: https://issues.apache.org/jira/browse/SPARK-22333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.2.0
>Reporter: Feng Zhu
>Assignee: Feng Zhu
> Fix For: 2.2.1, 2.3.0
>
>
> In our cluster, there is a table "T" with a column named "current_date". 
> When we select data from this column with SQL:
> {code:sql}
> select current_date from T
> {code}
> We get the wrong answer, because the column is translated to the 
> CURRENT_DATE() function.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21997) Spark shows different results on char/varchar columns on Parquet

2017-10-31 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227154#comment-16227154
 ] 

Dongjoon Hyun commented on SPARK-21997:
---

Thank you, [~cloud_fan]. Yes, I concur.

> Spark shows different results on char/varchar columns on Parquet
> 
>
> Key: SPARK-21997
> URL: https://issues.apache.org/jira/browse/SPARK-21997
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Dongjoon Hyun
>
> SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows 
> different results according to the SQL configuration, 
> *spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, 
> the default of `spark.sql.hive.convertMetastoreParquet` is true, so the 
> result is wrong by default.
> {code}
> scala> sql("CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet")
> scala> sql("INSERT INTO TABLE t_char SELECT 'a', 'b'")
> scala> sql("SELECT * FROM t_char").show
> +---+---+
> |  a|  b|
> +---+---+
> |  a|  b|
> +---+---+
> scala> sql("set spark.sql.hive.convertMetastoreParquet=false")
> scala> sql("SELECT * FROM t_char").show
> +--+---+
> | a|  b|
> +--+---+
> |a |  b|
> +--+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-21125) PySpark context missing function to set Job Description.

2017-10-31 Thread Shane Jarvie (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shane Jarvie closed SPARK-21125.


This has been patched and will be live in Spark 2.3.0

> PySpark context missing function to set Job Description.
> 
>
> Key: SPARK-21125
> URL: https://issues.apache.org/jira/browse/SPARK-21125
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.1
>Reporter: Shane Jarvie
>Assignee: Shane Jarvie
>Priority: Trivial
>  Labels: beginner
> Fix For: 2.3.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The PySpark API is missing a convenient function, currently found in the 
> Scala API, that sets the Job Description for display in the Spark UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11334) numRunningTasks can't be less than 0, or it will affect executor allocation

2017-10-31 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-11334.

  Resolution: Fixed
Assignee: Sital Kedia  (was: meiyoula)
Target Version/s: 2.3.0

> numRunningTasks can't be less than 0, or it will affect executor allocation
> ---
>
> Key: SPARK-11334
> URL: https://issues.apache.org/jira/browse/SPARK-11334
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: meiyoula
>Assignee: Sital Kedia
>
> With the *Dynamic Allocation* feature, when a task has failed more than 
> *maxFailure* times, all the dependent jobs, stages, and tasks are killed or 
> aborted. In this process, the *SparkListenerTaskEnd* event can arrive after 
> *SparkListenerStageCompleted* and *SparkListenerJobEnd*, as in the event log 
> below:
> {code}
> {"Event":"SparkListenerStageCompleted","Stage Info":{"Stage ID":20,"Stage 
> Attempt ID":0,"Stage Name":"run at AccessController.java:-2","Number of 
> Tasks":200}
> {"Event":"SparkListenerJobEnd","Job ID":9,"Completion Time":1444914699829}
> {"Event":"SparkListenerTaskEnd","Stage ID":20,"Stage Attempt ID":0,"Task 
> Type":"ResultTask","Task End Reason":{"Reason":"TaskKilled"},"Task 
> Info":{"Task ID":1955,"Index":88,"Attempt":2,"Launch 
> Time":1444914699763,"Executor 
> ID":"5","Host":"linux-223","Locality":"PROCESS_LOCAL","Speculative":false,"Getting
>  Result Time":0,"Finish Time":1444914699864,"Failed":true,"Accumulables":[]}}
> {code}
> Because of that, *numRunningTasks* in the *ExecutorAllocationManager* class 
> can drop below 0, which affects executor allocation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator

2017-10-31 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227094#comment-16227094
 ] 

yuhao yang commented on SPARK-13030:


I see. Thanks for the response [~mlnick].

The Estimator is necessary if we want to automatically infer the size.

As for whether to add the extra size param, I think it will be useful in cases 
where automatic inference should not be used (e.g. when sampling before 
training). I would vote for adding it.
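
To illustrate why the size matters, here is a minimal sketch, assuming the
transformer-only OneHotEncoder (Spark <= 2.2) and an input column without ML
attribute metadata, so the vector size is inferred from whatever data the
encoder happens to see:

{code}
import org.apache.spark.ml.feature.OneHotEncoder

val train = spark.range(0, 4).selectExpr("cast(id as double) as categoryIndex") // indices 0..3
val test  = spark.range(0, 3).selectExpr("cast(id as double) as categoryIndex") // indices 0..2

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")

// The two outputs end up with different vector sizes, so a model fit on `train`
// cannot be applied to `test`. A fixed size param (or an Estimator that locks
// the size in at fit time) would avoid this.
encoder.transform(train).show(false)
encoder.transform(test).show(false)
{code}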



> Change OneHotEncoder to Estimator
> -
>
> Key: SPARK-13030
> URL: https://issues.apache.org/jira/browse/SPARK-13030
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Wojciech Jurczyk
>
> OneHotEncoder should be an Estimator, just like in scikit-learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when the number of categories 
> differs between the training dataset and the test dataset.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22405) Enrich the event information and add new event of ExternalCatalogEvent

2017-10-31 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227039#comment-16227039
 ] 

Herman van Hovell commented on SPARK-22405:
---

For some context: when implementing this, we just wanted a way to know that 
some metadata was about to change. A consumer can always retrieve more 
information about the (to-be) changed object by querying the catalog (assuming 
that a pre-event does not need to inspect the change itself). Propagating the 
full definition is very heavyweight and more or less implies that we would 
have to stabilize that class (hierarchy), so I opted not to do that.

An additional problem with tracking metadata is that, if you use multiple 
clusters, you need to be able to track the metadata changes in every running 
cluster.
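
A sketch of the consumer-side pattern described above (the event and catalog
class names reflect my reading of the 2.2 internals, so treat them as
assumptions): react to the lightweight post-event, then pull the full
definition from the catalog yourself.

{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.catalog.CreateTableEvent

class LineageListener(spark: SparkSession) extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case CreateTableEvent(db, table) =>
      // The event only carries names; the full definition is retrieved on demand.
      val fullMetadata = spark.sharedState.externalCatalog.getTable(db, table)
      println(s"table created: $db.$table, schema = ${fullMetadata.schema.simpleString}")
    case _ => // not a catalog event we care about
  }
}

// spark.sparkContext.addSparkListener(new LineageListener(spark))
{code}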

> Enrich the event information and add new event of ExternalCatalogEvent
> --
>
> Key: SPARK-22405
> URL: https://issues.apache.org/jira/browse/SPARK-22405
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Priority: Minor
>
> We're building a data lineage tool in which we need to monitor the metadata 
> changes in {{ExternalCatalog}}. The current {{ExternalCatalog}} already 
> provides several useful events, such as "CreateDatabaseEvent", for a custom 
> SparkListener to consume. But the information carried by these events is not 
> rich enough; for example, {{CreateTablePreEvent}} only provides the 
> "database" name and the "table" name, not the full table metadata, which 
> makes it hard for users to get all the useful table-related information.
> So here I propose to add a new {{ExternalCatalogEvent}} and enrich the 
> existing events for all the catalog-related updates.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22403) StructuredKafkaWordCount example fails in YARN cluster mode

2017-10-31 Thread Wing Yew Poon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wing Yew Poon updated SPARK-22403:
--
Description: 
When I run the StructuredKafkaWordCount example in YARN client mode, it runs 
fine. However, when I run it in YARN cluster mode, the application errors 
during initialization, and dies after the default number of YARN application 
attempts. In the AM log, I see
{noformat}
17/10/30 11:34:52 INFO execution.SparkSqlParser: Parsing command: CAST(value AS 
STRING)
17/10/30 11:34:53 ERROR streaming.StreamMetadata: Error writing stream metadata 
StreamMetadata(b71ca714-a7a1-467f-96aa-023375964429) to 
/data/yarn/nm/usercache/systest/appcache/application_1508800814252_0047/container_1508800814252_0047_01_01/tmp/temporary-b5ced4ae-32e0-4725-b905-aad679aec9b5/metadata
org.apache.hadoop.security.AccessControlException: Permission denied: 
user=systest, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:397)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:256)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:194)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1842)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1826)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1785)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.resolvePathForStartFile(FSDirWriteFileOp.java:315)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2313)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2257)
...
at 
org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:280)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1235)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1214)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1152)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:458)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:455)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:469)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:396)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1103)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1083)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:972)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:960)
at 
org.apache.spark.sql.execution.streaming.StreamMetadata$.write(StreamMetadata.scala:76)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:116)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:114)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.(StreamExecution.scala:114)
at 
org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:240)
at 
org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
at 
org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:282)
at 
org.apache.spark.examples.sql.streaming.StructuredKafkaWordCount$.main(StructuredKafkaWordCount.scala:79)
at 
org.apache.spark.examples.sql.streaming.StructuredKafkaWordCount.main(StructuredKafkaWordCount.scala)
{noformat}
Looking at StreamingQueryManager#createQuery, we have
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala#L198
{code}
val checkpointLocation = userSpecifiedCheckpointLocation.map { ...
  ...
}.orElse {
  ...
}.getOrElse {
  if (useTempCheckpointLocation) {
// Delete the temp checkpoint when a query is being stopped without 
errors.
deleteCheckpointOnStop = true
Utils.createTempDir(namePrefix = s"temporary").getCanonicalPath
  } else {
...
  }
}
{code}
And Utils.createTempDir has
{code}
  def createTempDir(
  root: String = System.getProperty("java.io.tmpdir"),
  namePrefix: String = "spark"): File = {
val dir = createDirectory(root, namePrefix)
ShutdownHookManager.registerShutdownDeleteDir(dir)
dir
  }
{code}
In 

[jira] [Updated] (SPARK-22403) StructuredKafkaWordCount example fails in YARN cluster mode

2017-10-31 Thread Wing Yew Poon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wing Yew Poon updated SPARK-22403:
--
Description: 
When I run the StructuredKafkaWordCount example in YARN client mode, it runs 
fine. However, when I run it in YARN cluster mode, the application errors 
during initialization, and dies after the default number of YARN application 
attempts. In the AM log, I see
{noformat}
17/10/30 11:34:52 INFO execution.SparkSqlParser: Parsing command: CAST(value AS 
STRING)
17/10/30 11:34:53 ERROR streaming.StreamMetadata: Error writing stream metadata 
StreamMetadata(b71ca714-a7a1-467f-96aa-023375964429) to 
/data/yarn/nm/usercache/systest/appcache/application_1508800814252_0047/container_1508800814252_0047_01_01/tmp/temporary-b5ced4ae-32e0-4725-b905-aad679aec9b5/metadata
org.apache.hadoop.security.AccessControlException: Permission denied: 
user=systest, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:397)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:256)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:194)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1842)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1826)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1785)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.resolvePathForStartFile(FSDirWriteFileOp.java:315)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2313)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2257)
...
at 
org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:280)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1235)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1214)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1152)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:458)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:455)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:469)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:396)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1103)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1083)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:972)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:960)
at 
org.apache.spark.sql.execution.streaming.StreamMetadata$.write(StreamMetadata.scala:76)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:116)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:114)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.(StreamExecution.scala:114)
at 
org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:240)
at 
org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
at 
org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:282)
at 
org.apache.spark.examples.sql.streaming.StructuredKafkaWordCount$.main(StructuredKafkaWordCount.scala:79)
at 
org.apache.spark.examples.sql.streaming.StructuredKafkaWordCount.main(StructuredKafkaWordCount.scala)
{noformat}
Looking at StreamingQueryManager#createQuery, we have
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala#L198
{code}
val checkpointLocation = userSpecifiedCheckpointLocation.map { ...
  ...
}.orElse {
  ...
}.getOrElse {
  if (useTempCheckpointLocation) {
// Delete the temp checkpoint when a query is being stopped without 
errors.
deleteCheckpointOnStop = true
Utils.createTempDir(namePrefix = s"temporary").getCanonicalPath
  } else {
...
  }
}
{code}
And Utils.createTempDir has
{code}
  def createTempDir(
  root: String = System.getProperty("java.io.tmpdir"),
  namePrefix: String = "spark"): File = {
val dir = createDirectory(root, namePrefix)
ShutdownHookManager.registerShutdownDeleteDir(dir)
dir
  }
{code}
In 

[jira] [Commented] (SPARK-22078) clarify exception behaviors for all data source v2 interfaces

2017-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227024#comment-16227024
 ] 

Apache Spark commented on SPARK-22078:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/19623

> clarify exception behaviors for all data source v2 interfaces
> -
>
> Key: SPARK-22078
> URL: https://issues.apache.org/jira/browse/SPARK-22078
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22078) clarify exception behaviors for all data source v2 interfaces

2017-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22078:


Assignee: Apache Spark

> clarify exception behaviors for all data source v2 interfaces
> -
>
> Key: SPARK-22078
> URL: https://issues.apache.org/jira/browse/SPARK-22078
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22078) clarify exception behaviors for all data source v2 interfaces

2017-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22078:


Assignee: (was: Apache Spark)

> clarify exception behaviors for all data source v2 interfaces
> -
>
> Key: SPARK-22078
> URL: https://issues.apache.org/jira/browse/SPARK-22078
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14540) Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner

2017-10-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226995#comment-16226995
 ] 

Sean Owen commented on SPARK-14540:
---

[~joshrosen] was right that this is actually the hard part. A few notes from 
working on this:

Almost all tests pass with no change to the closure cleaner, except to not 
attempt to treat lambdas as inner class closures. That was kind of surprising. 
I assume that their implementation as a lambda means many of the synthetic 
links the cleaner had to snip just don't exist.

I am still not clear whether you can extract referenced fields from the 
synthetic lambda class itself; the "bsmArgs" (bootstrap method args) aren't 
quite that. However, it looks like you can manually serialize the lambda, get 
this info from the resulting SerializedLambda, and examine its captured args. 
That's the next thing to try.

Still, without this change, I find a lot of code just works already.
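
For reference, a small sketch of the manual-serialization idea, assuming a
Java 8 / Scala 2.12 serializable lambda (the helper name below is made up for
illustration): serializable lambdas carry a synthetic writeReplace method that
returns a java.lang.invoke.SerializedLambda, whose captured arguments are
exactly the references a cleaner would want to examine.

{code}
import java.lang.invoke.SerializedLambda

def inspectLambda(closure: AnyRef): Option[SerializedLambda] = {
  try {
    // Serializable lambdas generated by LambdaMetafactory declare a private writeReplace.
    val writeReplace = closure.getClass.getDeclaredMethod("writeReplace")
    writeReplace.setAccessible(true)
    writeReplace.invoke(closure) match {
      case sl: SerializedLambda => Some(sl)
      case _ => None
    }
  } catch {
    case _: NoSuchMethodException => None // an ordinary (pre-2.12) anonymous-class closure
  }
}

def makeClosure(): Int => Int = {
  val offset = 41
  x => x + offset // captures the local `offset`
}

inspectLambda(makeClosure()).foreach { sl =>
  println(s"impl method: ${sl.getImplMethodName}")
  (0 until sl.getCapturedArgCount).foreach(i => println(s"captured arg: ${sl.getCapturedArg(i)}"))
}
{code}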

> Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner
> 
>
> Key: SPARK-14540
> URL: https://issues.apache.org/jira/browse/SPARK-14540
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Josh Rosen
>
> Using https://github.com/JoshRosen/spark/tree/build-for-2.12, I tried running 
> ClosureCleanerSuite with Scala 2.12 and ran into two bad test failures:
> {code}
> [info] - toplevel return statements in closures are identified at cleaning 
> time *** FAILED *** (32 milliseconds)
> [info]   Expected exception 
> org.apache.spark.util.ReturnStatementInClosureException to be thrown, but no 
> exception was thrown. (ClosureCleanerSuite.scala:57)
> {code}
> and
> {code}
> [info] - user provided closures are actually cleaned *** FAILED *** (56 
> milliseconds)
> [info]   Expected ReturnStatementInClosureException, but got 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task not 
> serializable: java.io.NotSerializableException: java.lang.Object
> [info]- element of array (index: 0)
> [info]- array (class "[Ljava.lang.Object;", size: 1)
> [info]- field (class "java.lang.invoke.SerializedLambda", name: 
> "capturedArgs", type: "class [Ljava.lang.Object;")
> [info]- object (class "java.lang.invoke.SerializedLambda", 
> SerializedLambda[capturingClass=class 
> org.apache.spark.util.TestUserClosuresActuallyCleaned$, 
> functionalInterfaceMethod=scala/runtime/java8/JFunction1$mcII$sp.apply$mcII$sp:(I)I,
>  implementation=invokeStatic 
> org/apache/spark/util/TestUserClosuresActuallyCleaned$.org$apache$spark$util$TestUserClosuresActuallyCleaned$$$anonfun$69:(Ljava/lang/Object;I)I,
>  instantiatedMethodType=(I)I, numCaptured=1])
> [info]- element of array (index: 0)
> [info]- array (class "[Ljava.lang.Object;", size: 1)
> [info]- field (class "java.lang.invoke.SerializedLambda", name: 
> "capturedArgs", type: "class [Ljava.lang.Object;")
> [info]- object (class "java.lang.invoke.SerializedLambda", 
> SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, 
> functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;,
>  implementation=invokeStatic 
> org/apache/spark/rdd/RDD.org$apache$spark$rdd$RDD$$$anonfun$20$adapted:(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  
> instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  numCaptured=1])
> [info]- field (class "org.apache.spark.rdd.MapPartitionsRDD", name: 
> "f", type: "interface scala.Function3")
> [info]- object (class "org.apache.spark.rdd.MapPartitionsRDD", 
> MapPartitionsRDD[2] at apply at Transformer.scala:22)
> [info]- field (class "scala.Tuple2", name: "_1", type: "class 
> java.lang.Object")
> [info]- root object (class "scala.Tuple2", (MapPartitionsRDD[2] at 
> apply at 
> Transformer.scala:22,org.apache.spark.SparkContext$$Lambda$957/431842435@6e803685)).
> [info]   This means the closure provided by user is not actually cleaned. 
> (ClosureCleanerSuite.scala:78)
> {code}
> We'll need to figure out a closure cleaning strategy which works for 2.12 
> lambdas.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22398) Partition directories with leading 0s cause wrong results

2017-10-31 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226991#comment-16226991
 ] 

Marco Gaido commented on SPARK-22398:
-

you just need to set `spark.sql.sources.partitionColumnTypeInference.enabled` 
to `false`.
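
For example, applied at session level against the repro in the description
below (a sketch; the SQL `SET` form works as well):

{code}
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

// With inference disabled, the partition value "01" stays a string, so the IN
// filter and the equality filter return the same row.
spark.read.parquet("/tmp/bug1").where("id in ('01')").show
spark.read.parquet("/tmp/bug1").where("id = '01'").show
{code}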

> Partition directories with leading 0s cause wrong results
> -
>
> Key: SPARK-22398
> URL: https://issues.apache.org/jira/browse/SPARK-22398
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bogdan Raducanu
>
> Repro case:
> {code}
> spark.range(8).selectExpr("'0' || cast(id as string) as id", "id as 
> b").write.mode("overwrite").partitionBy("id").parquet("/tmp/bug1")
> spark.read.parquet("/tmp/bug1").where("id in ('01')").show
> +---+---+
> |  b| id|
> +---+---+
> +---+---+
> spark.read.parquet("/tmp/bug1").where("id = '01'").show
> +---+---+
> |  b| id|
> +---+---+
> |  1|  1|
> +---+---+
> {code}
> I think somewhere there is some special handling of this case for equals but 
> not the same for IN.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21997) Spark shows different results on char/varchar columns on Parquet

2017-10-31 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226957#comment-16226957
 ] 

Wenchen Fan commented on SPARK-21997:
-

A better fix would be to add special handling of the varchar type in the read 
path, appending blanks to the string value to satisfy the varchar length. But 
this may be hard, as you would need to fix both the normal reader and the 
columnar reader.
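
As a toy illustration of the padding idea only (not Spark's actual read path;
it assumes the declared length, e.g. 10 for the CHAR(10) column `a` in the
repro below, is available from the table schema at read time):

{code}
// Pad a value out to its declared length with trailing blanks, the way the
// Hive reader presents the CHAR(10) column in the repro.
def padToDeclaredLength(value: String, declaredLength: Int): String =
  if (value == null || value.length >= declaredLength) value
  else value + " " * (declaredLength - value.length)

padToDeclaredLength("a", 10) // "a" followed by nine trailing blanks
{code}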

> Spark shows different results on char/varchar columns on Parquet
> 
>
> Key: SPARK-21997
> URL: https://issues.apache.org/jira/browse/SPARK-21997
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Dongjoon Hyun
>
> SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows 
> different results according to the SQL configuration, 
> *spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, 
> the default of `spark.sql.hive.convertMetastoreParquet` is true, so the 
> result is wrong by default.
> {code}
> scala> sql("CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet")
> scala> sql("INSERT INTO TABLE t_char SELECT 'a', 'b'")
> scala> sql("SELECT * FROM t_char").show
> +---+---+
> |  a|  b|
> +---+---+
> |  a|  b|
> +---+---+
> scala> sql("set spark.sql.hive.convertMetastoreParquet=false")
> scala> sql("SELECT * FROM t_char").show
> +--+---+
> | a|  b|
> +--+---+
> |a |  b|
> +--+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21997) Spark shows different results on char/varchar columns on Parquet

2017-10-31 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226953#comment-16226953
 ] 

Wenchen Fan commented on SPARK-21997:
-

I think a simple fix is to disable `convertMetastoreParquet` if there are 
varchar type columns.

> Spark shows different results on char/varchar columns on Parquet
> 
>
> Key: SPARK-21997
> URL: https://issues.apache.org/jira/browse/SPARK-21997
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Dongjoon Hyun
>
> SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows 
> different results according to the SQL configuration, 
> *spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, 
> the default of `spark.sql.hive.convertMetastoreParquet` is true, so the 
> result is wrong by default.
> {code}
> scala> sql("CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet")
> scala> sql("INSERT INTO TABLE t_char SELECT 'a', 'b'")
> scala> sql("SELECT * FROM t_char").show
> +---+---+
> |  a|  b|
> +---+---+
> |  a|  b|
> +---+---+
> scala> sql("set spark.sql.hive.convertMetastoreParquet=false")
> scala> sql("SELECT * FROM t_char").show
> +--+---+
> | a|  b|
> +--+---+
> |a |  b|
> +--+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Parquet Metastore table

2017-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22306:


Assignee: Apache Spark

> INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
> ---
>
> Key: SPARK-22306
> URL: https://issues.apache.org/jira/browse/SPARK-22306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet)
> Spark 2.2.0
>Reporter: David Malinge
>Assignee: Apache Spark
>Priority: Critical
>
> I noticed some critical changes on my hive tables and realized that they were 
> caused by a simple select on SparkSQL. Looking at the logs, I found out that 
> this select was actually performing an update on the database "Saving 
> case-sensitive schema for table". 
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - changes the owner to the current user
> - removes bucketing metadata (BUCKETING_COLS, SDS)
> - removes sorting metadata (SORT_COLS)
> Switching the property to: NEVER_INFER prevents the issue.
> Also, note that the damage can be fixed manually in Hive with e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}
> *REPRODUCE (branch-2.2)*
> In Spark 2.1.x (branch-2.1), NEVER_INFER is used. Spark 2.3 (master) branch 
> is good due to SPARK-17729. This is a regression on Spark 2.2 only. By 
> default, Parquet Hive table is affected and only Hive may suffer from this.
> {code}
> hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) 
> INTO 10 BUCKETS STORED AS PARQUET;
> hive> INSERT INTO t VALUES('a','b');
> hive> DESC FORMATTED t;
> ...
> Num Buckets:  10
> Bucket Columns:   [a, b]
> Sort Columns: [Order(col:a, order:1), Order(col:b, order:1)]
> scala> sql("SELECT * FROM t").show(false)
> hive> DESC FORMATTED t;
> Num Buckets:  -1
> Bucket Columns:   []
> Sort Columns: []
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Parquet Metastore table

2017-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22306:


Assignee: (was: Apache Spark)

> INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
> ---
>
> Key: SPARK-22306
> URL: https://issues.apache.org/jira/browse/SPARK-22306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet)
> Spark 2.2.0
>Reporter: David Malinge
>Priority: Critical
>
> I noticed some critical changes on my hive tables and realized that they were 
> caused by a simple select on SparkSQL. Looking at the logs, I found out that 
> this select was actually performing an update on the database "Saving 
> case-sensitive schema for table". 
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - changes the owner to the current user
> - removes bucketing metadata (BUCKETING_COLS, SDS)
> - removes sorting metadata (SORT_COLS)
> Switching the property to: NEVER_INFER prevents the issue.
> Also, note that the damage can be fixed manually in Hive with e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}
> *REPRODUCE (branch-2.2)*
> In Spark 2.1.x (branch-2.1), NEVER_INFER is used. Spark 2.3 (master) branch 
> is good due to SPARK-17729. This is a regression on Spark 2.2 only. By 
> default, Parquet Hive table is affected and only Hive may suffer from this.
> {code}
> hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) 
> INTO 10 BUCKETS STORED AS PARQUET;
> hive> INSERT INTO t VALUES('a','b');
> hive> DESC FORMATTED t;
> ...
> Num Buckets:  10
> Bucket Columns:   [a, b]
> Sort Columns: [Order(col:a, order:1), Order(col:b, order:1)]
> scala> sql("SELECT * FROM t").show(false)
> hive> DESC FORMATTED t;
> Num Buckets:  -1
> Bucket Columns:   []
> Sort Columns: []
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Parquet Metastore table

2017-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226944#comment-16226944
 ] 

Apache Spark commented on SPARK-22306:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/19622

> INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
> ---
>
> Key: SPARK-22306
> URL: https://issues.apache.org/jira/browse/SPARK-22306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet)
> Spark 2.2.0
>Reporter: David Malinge
>Priority: Critical
>
> I noticed some critical changes on my hive tables and realized that they were 
> caused by a simple select on SparkSQL. Looking at the logs, I found out that 
> this select was actually performing an update on the database "Saving 
> case-sensitive schema for table". 
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - changes the owner to the current user
> - removes bucketing metadata (BUCKETING_COLS, SDS)
> - removes sorting metadata (SORT_COLS)
> Switching the property to: NEVER_INFER prevents the issue.
> Also, note that the damage can be fixed manually in Hive with e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}
> *REPRODUCE (branch-2.2)*
> In Spark 2.1.x (branch-2.1), NEVER_INFER is used. Spark 2.3 (master) branch 
> is good due to SPARK-17729. This is a regression on Spark 2.2 only. By 
> default, Parquet Hive table is affected and only Hive may suffer from this.
> {code}
> hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) 
> INTO 10 BUCKETS STORED AS PARQUET;
> hive> INSERT INTO t VALUES('a','b');
> hive> DESC FORMATTED t;
> ...
> Num Buckets:  10
> Bucket Columns:   [a, b]
> Sort Columns: [Order(col:a, order:1), Order(col:b, order:1)]
> scala> sql("SELECT * FROM t").show(false)
> hive> DESC FORMATTED t;
> Num Buckets:  -1
> Bucket Columns:   []
> Sort Columns: []
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11215) Add multiple columns support to StringIndexer

2017-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226915#comment-16226915
 ] 

Apache Spark commented on SPARK-11215:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/19621

> Add multiple columns support to StringIndexer
> -
>
> Key: SPARK-11215
> URL: https://issues.apache.org/jira/browse/SPARK-11215
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Add multiple columns support to StringIndexer, then users can transform 
> multiple input columns to multiple output columns simultaneously. See 
> discussion SPARK-8418.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22405) Enrich the event information and add new event of ExternalCatalogEvent

2017-10-31 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226724#comment-16226724
 ] 

Saisai Shao commented on SPARK-22405:
-

Thanks [~cloud_fan] for your comments. Using a fake {{ExternalCatalog}} to 
delegate might be one solution, but as far as I know it would also force users 
to use only this custom {{ExternalCatalog}}. That might be acceptable for an 
end user, but for those of us who deliver packages to users, such a 
restriction does not seem feasible.

> Enrich the event information and add new event of ExternalCatalogEvent
> --
>
> Key: SPARK-22405
> URL: https://issues.apache.org/jira/browse/SPARK-22405
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Priority: Minor
>
> We're building a data lineage tool in which we need to monitor the metadata 
> changes in {{ExternalCatalog}}. The current {{ExternalCatalog}} already 
> provides several useful events, such as "CreateDatabaseEvent", for a custom 
> SparkListener to consume. But the information carried by these events is not 
> rich enough; for example, {{CreateTablePreEvent}} only provides the 
> "database" name and the "table" name, not the full table metadata, which 
> makes it hard for users to get all the useful table-related information.
> So here I propose to add a new {{ExternalCatalogEvent}} and enrich the 
> existing events for all the catalog-related updates.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226720#comment-16226720
 ] 

Marco Gaido commented on SPARK-21725:
-

[~zhangxin0112zx] I am sorry, but I am still unable to reproduce it locally.
Here are the steps I performed. It might be related to the metastore. Could 
you provide more details about your installation and the logs of the Spark 
thriftserver?


{code:java}
➜  spark git:(SPARK-21725) ✗ ./bin/beeline -u "jdbc:hive2://localhost:1"
Connecting to jdbc:hive2://localhost:1
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.
Connected to: Spark SQL (version 2.3.0-SNAPSHOT)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://localhost:1> set hive.default.fileformat=Parquet; 
+--+--+--+
|   key|  value   |
+--+--+--+
| hive.default.fileformat  | Parquet  |
+--+--+--+
1 row selected (0.434 seconds)
0: jdbc:hive2://localhost:1> create table default.test_e(name string) 
partitioned by (pt string);
+-+--+
| Result  |
+-+--+
+-+--+
No rows selected (0.472 seconds)
0: jdbc:hive2://localhost:1> create table default.test_f(name string) 
partitioned by (pt string);
+-+--+
| Result  |
+-+--+
+-+--+
No rows selected (0.067 seconds)
0: jdbc:hive2://localhost:1> !quit
Closing: 0: jdbc:hive2://localhost:1
➜  spark git:(SPARK-21725) ✗ ./bin/beeline -u "jdbc:hive2://localhost:1"
Connecting to jdbc:hive2://localhost:1
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.
Connected to: Spark SQL (version 2.3.0-SNAPSHOT)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://localhost:1> insert overwrite table default.test_e 
partition(pt="1") select count(1) from default.test_f;
+-+--+
| Result  |
+-+--+
+-+--+
No rows selected (2.351 seconds)
0: jdbc:hive2://localhost:1> !quit
Closing: 0: jdbc:hive2://localhost:1
➜  spark git:(SPARK-21725) ✗ ./bin/beeline -u "jdbc:hive2://localhost:1"
Connecting to jdbc:hive2://localhost:1
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.
Connected to: Spark SQL (version 2.3.0-SNAPSHOT)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://localhost:1> insert overwrite table default.test_e 
partition(pt="1") select count(1) from default.test_f;
+-+--+
| Result  |
+-+--+
+-+--+
No rows selected (0.612 seconds)
0: jdbc:hive2://localhost:1> 
{code}


> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>  Labels: spark-sql
>
> use thriftserver create table with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> 

[jira] [Commented] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Parquet Metastore table

2017-10-31 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226640#comment-16226640
 ] 

Wenchen Fan commented on SPARK-22306:
-

This is a known issue before Spark 2.3: an ALTER TABLE issued from the Spark 
side erases the bucketing information of a Hive table. In this particular 
case, however, the ALTER TABLE is triggered automatically, which makes the bug 
much worse for users.

I'm going to handle this case specially and suggest that users upgrade to 
Spark 2.3.
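
Until then, a sketch of the workaround from the description, applied at
session level (setting it in spark-defaults.conf or via --conf should work
equally well):

{code}
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
// With NEVER_INFER, a plain SELECT no longer writes an inferred schema back to
// the metastore, so the bucketing and sorting metadata are left untouched.
spark.sql("SELECT * FROM t").show(false)
{code}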

> INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
> ---
>
> Key: SPARK-22306
> URL: https://issues.apache.org/jira/browse/SPARK-22306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet)
> Spark 2.2.0
>Reporter: David Malinge
>Priority: Critical
>
> I noticed some critical changes on my hive tables and realized that they were 
> caused by a simple select on SparkSQL. Looking at the logs, I found out that 
> this select was actually performing an update on the database "Saving 
> case-sensitive schema for table". 
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - changes the owner to the current user
> - removes bucketing metadata (BUCKETING_COLS, SDS)
> - removes sorting metadata (SORT_COLS)
> Switching the property to: NEVER_INFER prevents the issue.
> Also, note that the damage can be fixed manually in Hive with e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}
> *REPRODUCE (branch-2.2)*
> In Spark 2.1.x (branch-2.1), NEVER_INFER is used. Spark 2.3 (master) branch 
> is good due to SPARK-17729. This is a regression on Spark 2.2 only. By 
> default, Parquet Hive table is affected and only Hive may suffer from this.
> {code}
> hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) 
> INTO 10 BUCKETS STORED AS PARQUET;
> hive> INSERT INTO t VALUES('a','b');
> hive> DESC FORMATTED t;
> ...
> Num Buckets:  10
> Bucket Columns:   [a, b]
> Sort Columns: [Order(col:a, order:1), Order(col:b, order:1)]
> scala> sql("SELECT * FROM t").show(false)
> hive> DESC FORMATTED t;
> Num Buckets:  -1
> Bucket Columns:   []
> Sort Columns: []
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-19039) UDF ClosureCleaner bug when UDF, col applied in paste mode in REPL

2017-10-31 Thread Jen-Ming Chung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jen-Ming Chung updated SPARK-19039:
---
Comment: was deleted

(was: It's weird... you will not get error messages if you paste the code 
line-by-line.

{code}
17/10/31 09:37:42 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Spark context Web UI available at 
http://ip-172-31-9-112.ap-northeast-1.compute.internal:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1509442670084).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df = spark.createDataFrame(Seq(
 |   ("hi", 1),
 |   ("there", 2),
 |   ("the", 3),
 |   ("end", 4)
 | )).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: string, b: int]

scala> val myNumbers = Set(1,2,3)
myNumbers: scala.collection.immutable.Set[Int] = Set(1, 2, 3)

scala> val tmpUDF = udf { (n: Int) => myNumbers.contains(n) }
tmpUDF: org.apache.spark.sql.expressions.UserDefinedFunction = 
UserDefinedFunction(,BooleanType,Some(List(IntegerType)))

scala> val rowHasMyNumber = tmpUDF($"b")
rowHasMyNumber: org.apache.spark.sql.Column = UDF(b)

scala> df.where(rowHasMyNumber).show()
+-----+---+
|    a|  b|
+-----+---+
|   hi|  1|
|there|  2|
|  the|  3|
+-----+---+
{code} )

> UDF ClosureCleaner bug when UDF, col applied in paste mode in REPL
> --
>
> Key: SPARK-19039
> URL: https://issues.apache.org/jira/browse/SPARK-19039
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.3.0
>Reporter: Joseph K. Bradley
>
> When I try this:
> * Define UDF
> * Apply UDF to get Column
> * Use Column in a DataFrame
> I can find weird behavior in the spark-shell when using paste mode.
> To reproduce this, paste this into the spark-shell:
> {code}
> import org.apache.spark.sql.functions._
> val df = spark.createDataFrame(Seq(
>   ("hi", 1),
>   ("there", 2),
>   ("the", 3),
>   ("end", 4)
> )).toDF("a", "b")
> val myNumbers = Set(1,2,3)
> val tmpUDF = udf { (n: Int) => myNumbers.contains(n) }
> val rowHasMyNumber = tmpUDF($"b")
> df.where(rowHasMyNumber).show()
> {code}
> Stack trace for Spark 2.0 (similar for other versions):
> {code}
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2057)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:817)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:816)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
>   at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:816)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:364)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
>   at 
> 

[jira] [Commented] (SPARK-22405) Enrich the event information and add new event of ExternalCatalogEvent

2017-10-31 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226551#comment-16226551
 ] 

Wenchen Fan commented on SPARK-22405:
-

If you really want very detailed information about metadata operations, you 
should probably create a fake `ExternalCatalog` which delegates all the 
requests to the real `ExternalCatalog`. Then you can get everything you want.
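
To make the suggestion concrete, here is a minimal sketch of the delegation 
pattern using a simplified stand-in trait; Spark's real {{ExternalCatalog}} has 
many more operations, and all type and method names below are illustrative only:

{code}
// Simplified stand-in for the catalog interface; the real ExternalCatalog
// exposes many more operations, but the wrapping pattern is identical.
trait SimpleCatalog {
  def createTable(db: String, table: String): Unit
  def dropTable(db: String, table: String): Unit
}

// Decorator that forwards every call to the real catalog while emitting
// events as rich as the caller needs, before and after each operation.
class AuditingCatalog(delegate: SimpleCatalog, onEvent: String => Unit)
    extends SimpleCatalog {

  override def createTable(db: String, table: String): Unit = {
    onEvent(s"about to create $db.$table")
    delegate.createTable(db, table)
    onEvent(s"created $db.$table")
  }

  override def dropTable(db: String, table: String): Unit = {
    onEvent(s"about to drop $db.$table")
    delegate.dropTable(db, table)
    onEvent(s"dropped $db.$table")
  }
}
{code}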

> Enrich the event information and add new event of ExternalCatalogEvent
> --
>
> Key: SPARK-22405
> URL: https://issues.apache.org/jira/browse/SPARK-22405
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Priority: Minor
>
> We're building a data lineage tool in which we need to monitor the metadata 
> changes in {{ExternalCatalog}}. The current {{ExternalCatalog}} already provides 
> several useful events, such as "CreateDatabaseEvent", for a custom SparkListener 
> to use, but the information carried by these events is not rich enough. For 
> example, {{CreateTablePreEvent}} only provides the "database" name and the 
> "table" name, not the full table metadata, which makes it hard for users to get 
> all the useful table-related information.
> So here we propose to add a new {{ExternalCatalogEvent}} and to enrich the 
> existing events for all catalog-related updates.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19628) Duplicate Spark jobs in 2.1.0

2017-10-31 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226559#comment-16226559
 ] 

Jork Zijlstra commented on SPARK-19628:
---

Hello [~guilhermeslucas],

I'm currently no longer employed at the company where we encountered the 
problem. 
[~skoning] Do you still have the problem and could you help?

How much more code do you need? Usually you want to scale down the test to find 
the problem and this is pretty much the minimal version.

{code}
spark.read.orc(...).show(20) or spark.read.orc(...).collect()
{code}
Both trigger the duplicate jobs.

Regards, Jork
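
For anyone trying to confirm the duplication, a small sketch that counts how many 
jobs an action triggers from the shell; it assumes the spark-shell's {{spark}} 
session, and the ORC path is a placeholder:

{code}
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Count job submissions so the duplication shows up as a number,
// not just as extra entries in the web UI.
val jobCount = new AtomicInteger(0)
spark.sparkContext.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    jobCount.incrementAndGet()
  }
})

spark.read.orc("/path/to/orc").show(20)   // placeholder path
println(s"jobs triggered: ${jobCount.get()}")
{code}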

> Duplicate Spark jobs in 2.1.0
> -
>
> Key: SPARK-19628
> URL: https://issues.apache.org/jira/browse/SPARK-19628
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jork Zijlstra
> Attachments: spark2.0.1.png, spark2.1.0-examplecode.png, 
> spark2.1.0.png
>
>
> After upgrading to Spark 2.1.0 we noticed that there are duplicate jobs 
> being executed. Going back to Spark 2.0.1, they are gone again.
> {code}
> import org.apache.spark.sql._
> object DoubleJobs {
>   def main(args: Array[String]) {
> System.setProperty("hadoop.home.dir", "/tmp");
> val sparkSession: SparkSession = SparkSession.builder
>   .master("local[4]")
>   .appName("spark session example")
>   .config("spark.driver.maxResultSize", "6G")
>   .config("spark.sql.orc.filterPushdown", true)
>   .config("spark.sql.hive.metastorePartitionPruning", true)
>   .getOrCreate()
> sparkSession.sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
> val paths = Seq(
>   ""//some orc source
> )
> def dataFrame(path: String): DataFrame = {
>   sparkSession.read.orc(path)
> }
> paths.foreach(path => {
>   dataFrame(path).show(20)
> })
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22310) Refactor join estimation to incorporate estimation logic for different kinds of statistics

2017-10-31 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22310.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Refactor join estimation to incorporate estimation logic for different kinds 
> of statistics
> --
>
> Key: SPARK-22310
> URL: https://issues.apache.org/jira/browse/SPARK-22310
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.3.0
>
>
> The current join estimation logic is based only on basic column statistics 
> (such as ndv). If we want to add estimation for other kinds of statistics 
> (such as histograms), it is not easy to incorporate them into the current 
> algorithm (a simplified sketch of the ndv-based formula follows this list):
> 1. When we have multiple pairs of join keys, the current algorithm computes 
> cardinality in a single formula. But if different join keys have different 
> kinds of stats, the computation logic for each pair of join keys becomes 
> different, so the previous formula does not apply.
> 2. Currently it computes cardinality and updates the join keys' column stats 
> separately. It's better to do these two steps together, since both the 
> computation and the update logic differ for different kinds of stats.
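
For readers unfamiliar with the current approach, here is a simplified 
illustration of ndv-based equi-join cardinality estimation. It is the textbook 
single-formula version, not the exact Spark implementation; the function and 
variable names are illustrative only.

{code}
// Textbook ndv-based estimate for an equi-join on keys k1..kn:
//   |A join B| ~= |A| * |B| / prod_i max(ndv(A.ki), ndv(B.ki))
// This is the kind of "single formula" the description refers to; there is no
// natural place to plug in histograms, which is what motivates the refactoring.
def estimateJoinRows(rowsA: BigInt, rowsB: BigInt,
                     ndvPairs: Seq[(BigInt, BigInt)]): BigInt = {
  val denominator = ndvPairs
    .map { case (ndvA, ndvB) => ndvA.max(ndvB) }
    .foldLeft(BigInt(1))(_ * _)
  (rowsA * rowsB) / denominator.max(BigInt(1))
}

// Example: 1M x 2M rows joined on a single key with ndv 100k vs 150k.
val estimate = estimateJoinRows(BigInt(1000000), BigInt(2000000),
                                Seq((BigInt(100000), BigInt(150000))))
// estimate == 13333333 (roughly 13.3M rows)
{code}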



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22310) Refactor join estimation to incorporate estimation logic for different kinds of statistics

2017-10-31 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22310:
---

Assignee: Zhenhua Wang

> Refactor join estimation to incorporate estimation logic for different kinds 
> of statistics
> --
>
> Key: SPARK-22310
> URL: https://issues.apache.org/jira/browse/SPARK-22310
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.3.0
>
>
> The current join estimation logic is based only on basic column statistics 
> (such as ndv). If we want to add estimation for other kinds of statistics 
> (such as histograms), it is not easy to incorporate them into the current 
> algorithm:
> 1. When we have multiple pairs of join keys, the current algorithm computes 
> cardinality in a single formula. But if different join keys have different 
> kinds of stats, the computation logic for each pair of join keys becomes 
> different, so the previous formula does not apply.
> 2. Currently it computes cardinality and updates the join keys' column stats 
> separately. It's better to do these two steps together, since both the 
> computation and the update logic differ for different kinds of stats.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19039) UDF ClosureCleaner bug when UDF, col applied in paste mode in REPL

2017-10-31 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226528#comment-16226528
 ] 

Jen-Ming Chung commented on SPARK-19039:


It's weird... you will not get error messages if you paste the code line-by-line.

{code}
17/10/31 09:37:42 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Spark context Web UI available at 
http://ip-172-31-9-112.ap-northeast-1.compute.internal:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1509442670084).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df = spark.createDataFrame(Seq(
 |   ("hi", 1),
 |   ("there", 2),
 |   ("the", 3),
 |   ("end", 4)
 | )).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: string, b: int]

scala> val myNumbers = Set(1,2,3)
myNumbers: scala.collection.immutable.Set[Int] = Set(1, 2, 3)

scala> val tmpUDF = udf { (n: Int) => myNumbers.contains(n) }
tmpUDF: org.apache.spark.sql.expressions.UserDefinedFunction = 
UserDefinedFunction(,BooleanType,Some(List(IntegerType)))

scala> val rowHasMyNumber = tmpUDF($"b")
rowHasMyNumber: org.apache.spark.sql.Column = UDF(b)

scala> df.where(rowHasMyNumber).show()
+-----+---+
|    a|  b|
+-----+---+
|   hi|  1|
|there|  2|
|  the|  3|
+-----+---+
{code} 

> UDF ClosureCleaner bug when UDF, col applied in paste mode in REPL
> --
>
> Key: SPARK-19039
> URL: https://issues.apache.org/jira/browse/SPARK-19039
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.3.0
>Reporter: Joseph K. Bradley
>
> When I try this:
> * Define UDF
> * Apply UDF to get Column
> * Use Column in a DataFrame
> I can find weird behavior in the spark-shell when using paste mode.
> To reproduce this, paste this into the spark-shell:
> {code}
> import org.apache.spark.sql.functions._
> val df = spark.createDataFrame(Seq(
>   ("hi", 1),
>   ("there", 2),
>   ("the", 3),
>   ("end", 4)
> )).toDF("a", "b")
> val myNumbers = Set(1,2,3)
> val tmpUDF = udf { (n: Int) => myNumbers.contains(n) }
> val rowHasMyNumber = tmpUDF($"b")
> df.where(rowHasMyNumber).show()
> {code}
> Stack trace for Spark 2.0 (similar for other versions):
> {code}
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2057)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:817)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:816)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
>   at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:816)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:364)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
>   at 
> 

[jira] [Commented] (SPARK-22405) Enrich the event information and add new event of ExternalCatalogEvent

2017-10-31 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226474#comment-16226474
 ] 

Saisai Shao commented on SPARK-22405:
-

CC [~smilegator], what's your opinion about this proposal? Thanks!

> Enrich the event information and add new event of ExternalCatalogEvent
> --
>
> Key: SPARK-22405
> URL: https://issues.apache.org/jira/browse/SPARK-22405
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Priority: Minor
>
> We're building a data lineage tool in which we need to monitor the metadata 
> changes in {{ExternalCatalog}}. The current {{ExternalCatalog}} already provides 
> several useful events, such as "CreateDatabaseEvent", for a custom SparkListener 
> to use, but the information carried by these events is not rich enough. For 
> example, {{CreateTablePreEvent}} only provides the "database" name and the 
> "table" name, not the full table metadata, which makes it hard for users to get 
> all the useful table-related information.
> So here we propose to add a new {{ExternalCatalogEvent}} and to enrich the 
> existing events for all catalog-related updates.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22405) Enrich the event information and add new event of ExternalCatalogEvent

2017-10-31 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226468#comment-16226468
 ] 

Saisai Shao commented on SPARK-22405:
-

This is the WIP branch 
(https://github.com/jerryshao/apache-spark/tree/SPARK-22405).

[~cloud_fan], do you think this proposal is feasible or not?

> Enrich the event information and add new event of ExternalCatalogEvent
> --
>
> Key: SPARK-22405
> URL: https://issues.apache.org/jira/browse/SPARK-22405
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Priority: Minor
>
> We're building a data lineage tool in which we need to monitor the metadata 
> changes in {{ExternalCatalog}}. The current {{ExternalCatalog}} already provides 
> several useful events, such as "CreateDatabaseEvent", for a custom SparkListener 
> to use, but the information carried by these events is not rich enough. For 
> example, {{CreateTablePreEvent}} only provides the "database" name and the 
> "table" name, not the full table metadata, which makes it hard for users to get 
> all the useful table-related information.
> So here we propose to add a new {{ExternalCatalogEvent}} and to enrich the 
> existing events for all catalog-related updates.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2017-10-31 Thread junzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226457#comment-16226457
 ] 

junzhang edited comment on SPARK-21918 at 10/31/17 8:36 AM:


[~huLiu] How about the patch? We are suffering from DDL and DML problems in the STS. It 
would be much appreciated if you could provide the patch as soon as possible.


was (Author: junzhang):
[~huLiu] how about the patch? we are suffering DDL and DML problems for STS. It 
will be much appreciated if provide the patch as soon as possible.

> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>
> I'm testing the Spark thrift server and found that all the DDL statements are 
> run by the user hive even if hive.server2.enable.doAs=true.
> The root cause is that the Hive object is shared between different threads in 
> HiveClientImpl:
> {code:java}
>   private def client: Hive = {
> if (clientLoader.cachedHive != null) {
>   clientLoader.cachedHive.asInstanceOf[Hive]
> } else {
>   val c = Hive.get(conf)
>   clientLoader.cachedHive = c
>   c
> }
>   }
> {code}
> But in impersonation mode, we should only share the Hive object within a 
> thread, so that the metastore client in Hive is associated with the right 
> user.
> To fix it, we can pass the Hive object of the parent thread to the child 
> thread when running the SQL.
> I already have an initial patch for review and I'm glad to work on it if 
> anyone could assign it to me.
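
A minimal sketch of the per-thread caching idea described above, using a stand-in 
handle type rather than the real {{Hive}} class; this is illustrative only and not 
the actual patch:

{code}
// Stand-in for the Hive client handle; the real code caches
// org.apache.hadoop.hive.ql.metadata.Hive instances.
class HiveHandle(val user: String)

object PerThreadHiveCache {
  // Each thread gets its own handle, so the metastore client is associated
  // with the impersonated user of that thread rather than with whichever
  // thread happened to create a shared handle first.
  private val cached = new ThreadLocal[HiveHandle] {
    override def initialValue(): HiveHandle =
      new HiveHandle(System.getProperty("user.name"))   // illustrative user lookup
  }

  def client: HiveHandle = cached.get()
}
{code}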



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2017-10-31 Thread junzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226457#comment-16226457
 ] 

junzhang commented on SPARK-21918:
--

[~huLiu] How about the patch? We are suffering from DDL and DML problems in the STS. It 
would be much appreciated if you could provide the patch as soon as possible.

> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>
> I'm testing the Spark thrift server and found that all the DDL statements are 
> run by the user hive even if hive.server2.enable.doAs=true.
> The root cause is that the Hive object is shared between different threads in 
> HiveClientImpl:
> {code:java}
>   private def client: Hive = {
> if (clientLoader.cachedHive != null) {
>   clientLoader.cachedHive.asInstanceOf[Hive]
> } else {
>   val c = Hive.get(conf)
>   clientLoader.cachedHive = c
>   c
> }
>   }
> {code}
> But in impersonation mode, we should only share the Hive object within a 
> thread, so that the metastore client in Hive is associated with the right 
> user.
> To fix it, we can pass the Hive object of the parent thread to the child 
> thread when running the SQL.
> I already have an initial patch for review and I'm glad to work on it if 
> anyone could assign it to me.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22405) Enrich the event information of ExternalCatalogEvent

2017-10-31 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-22405:
---

 Summary: Enrich the event information of ExternalCatalogEvent
 Key: SPARK-22405
 URL: https://issues.apache.org/jira/browse/SPARK-22405
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Saisai Shao
Priority: Minor


We're building a data lineage tool in which we need to monitor the metadata 
changes in {{ExternalCatalog}}. The current {{ExternalCatalog}} already provides 
several useful events, such as "CreateDatabaseEvent", for a custom SparkListener 
to use, but the information carried by these events is not rich enough. For 
example, {{CreateTablePreEvent}} only provides the "database" name and the 
"table" name, not the full table metadata, which makes it hard for users to get 
all the useful table-related information.

So here we propose to add a new {{ExternalCatalogEvent}} and to enrich the 
existing events for all catalog-related updates.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22405) Enrich the event information and add new event of ExternalCatalogEvent

2017-10-31 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-22405:

Summary: Enrich the event information and add new event of 
ExternalCatalogEvent  (was: Enrich the event information of 
ExternalCatalogEvent)

> Enrich the event information and add new event of ExternalCatalogEvent
> --
>
> Key: SPARK-22405
> URL: https://issues.apache.org/jira/browse/SPARK-22405
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Priority: Minor
>
> We're building a data lineage tool in which we need to monitor the metadata 
> changes in {{ExternalCatalog}}. The current {{ExternalCatalog}} already provides 
> several useful events, such as "CreateDatabaseEvent", for a custom SparkListener 
> to use, but the information carried by these events is not rich enough. For 
> example, {{CreateTablePreEvent}} only provides the "database" name and the 
> "table" name, not the full table metadata, which makes it hard for users to get 
> all the useful table-related information.
> So here we propose to add a new {{ExternalCatalogEvent}} and to enrich the 
> existing events for all catalog-related updates.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator

2017-10-31 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226448#comment-16226448
 ] 

Nick Pentreath commented on SPARK-13030:


I just think it makes sense for OHE to be an Estimator (as it is in sklearn). 
It really should have been from the beginning. The fact that it is not is 
actually a bug, IMO.

The proposal to have a size param could fix the issue, but it is a bit of a 
band-aid: it requires the user to specify the size (number of categories) 
manually. That doesn't really feel like the right workflow to me; the OHE 
should be able to figure that out itself. So it adds one more "speed bump", 
albeit a small one, to using the component in a pipeline.

It is possible to use a sort of "hack" for {{fit}}, i.e. set the param during the 
first transform call if it is not set already. But that just argues for the fact 
that it should be an {{Estimator/Model}} pair. Sure, we could wait until {{3.0}}, 
but if the work is already done I don't see a compelling reason not to do it now.
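
A tiny, self-contained illustration of why the fit/transform split matters for 
one-hot encoding: the category space is learned from the training data only and 
then applied unchanged to later datasets, so train and test vectors always have 
the same width. This is not the Spark ML API; the names are illustrative:

{code}
// Model: holds the category space learned at fit time.
class SimpleOneHotModel(categories: Seq[String]) {
  def transform(value: String): Array[Double] =
    categories.map(c => if (c == value) 1.0 else 0.0).toArray
}

// Estimator: learns the categories from the training data.
object SimpleOneHotEstimator {
  def fit(train: Seq[String]): SimpleOneHotModel =
    new SimpleOneHotModel(train.distinct.sorted)
}

val model = SimpleOneHotEstimator.fit(Seq("a", "b", "c"))
model.transform("b")   // Array(0.0, 1.0, 0.0)
model.transform("z")   // Array(0.0, 0.0, 0.0) -- unseen category, same width as training
{code}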

> Change OneHotEncoder to Estimator
> -
>
> Key: SPARK-13030
> URL: https://issues.apache.org/jira/browse/SPARK-13030
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Wojciech Jurczyk
>
> OneHotEncoder should be an Estimator, just like in scikit-learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when number of categories is 
> different between training dataset and test dataset.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22401) Missing 2.1.2 tag in git

2017-10-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226445#comment-16226445
 ] 

Sean Owen commented on SPARK-22401:
---

Normally the release plugin would tag and push the tag. I think it's easy 
enough to push a tag manually (should be commit 
2abaea9e40fce81cd4626498e0f5c28a70917499 at 
https://github.com/apache/spark/tree/2abaea9e40fce81cd4626498e0f5c28a70917499 ) 
but will wait to see if anyone knows the right-er way to push this to both 
Apache and github.

> Missing 2.1.2 tag in git
> 
>
> Key: SPARK-22401
> URL: https://issues.apache.org/jira/browse/SPARK-22401
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Deploy
>Affects Versions: 2.1.2
>Reporter: Brian Barker
>Priority: Minor
>
> We only saw a 2.1.2-rc4 tag in git, not an official release tag. The releases web 
> page shows 2.1.2 was released on October 9.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22399) reference in mllib-clustering.html is out of date

2017-10-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-22399:
-

Assignee: Bo Meng

> reference in mllib-clustering.html is out of date
> -
>
> Key: SPARK-22399
> URL: https://issues.apache.org/jira/browse/SPARK-22399
> Project: Spark
>  Issue Type: Documentation
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Nathan Kronenfeld
>Assignee: Bo Meng
>Priority: Minor
> Fix For: 2.3.0
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> Under "Power Iteration Clustering", the reference to the paper describing the 
> method redirects to the web site for the conference at which it was 
> originally published.
> The correct reference, as taken from Lin's web page at CMU, should be:
> http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf
> (at least, I think that's the one - someone who knows the algorithm should 
> probably check it's the right one - I'm trying to follow it because I don't 
> know it, of course :-)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22399) reference in mllib-clustering.html is out of date

2017-10-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22399.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19614
[https://github.com/apache/spark/pull/19614]

> reference in mllib-clustering.html is out of date
> -
>
> Key: SPARK-22399
> URL: https://issues.apache.org/jira/browse/SPARK-22399
> Project: Spark
>  Issue Type: Documentation
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Nathan Kronenfeld
>Priority: Minor
> Fix For: 2.3.0
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> Under "Power Iteration Clustering", the reference to the paper describing the 
> method redirects to the web site for the conference at which it was 
> originally published.
> The correct reference, as taken from Lin's web page at CMU, should be:
> http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf
> (at least, I think that's the one - someone who knows the algorithm should 
> probably check it's the right one - I'm trying to follow it because I don't 
> know it, of course :-)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20077) Documentation for ml.stats.Correlation

2017-10-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20077:
--
Priority: Minor  (was: Major)

I don't see any docs at https://spark.apache.org/docs/latest/ml-guide.html 
though, which is what this appears to be about.

> Documentation for ml.stats.Correlation
> --
>
> Key: SPARK-20077
> URL: https://issues.apache.org/jira/browse/SPARK-20077
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Priority: Minor
>
> Now that (Pearson) correlations are available in spark.ml, we need to write 
> some documentation to go along with this feature. For now, it can simply point 
> to the unit tests as examples.
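
A minimal usage sketch along the lines of the existing unit tests, assuming an 
active SparkSession named {{spark}} (e.g. the spark-shell); the sample vectors 
are arbitrary:

{code}
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

import spark.implicits._

// Arbitrary sample feature vectors.
val data = Seq(
  Vectors.dense(1.0, 0.0, 3.0),
  Vectors.dense(4.0, 5.0, 0.0),
  Vectors.dense(6.0, 7.0, 2.0)
).map(Tuple1.apply)

val df = data.toDF("features")

// Pearson correlation matrix of the feature column.
val Row(pearson: Matrix) = Correlation.corr(df, "features").head
println(s"Pearson correlation matrix:\n$pearson")
{code}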



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226337#comment-16226337
 ] 

xinzhang edited comment on SPARK-21725 at 10/31/17 7:09 AM:


[~mgaido]
[~srowen]
Now I have tried with the master branch.
The problem is still here. (Important: TextFile is the default value of 
hive.default.fileformat. If I set hive.default.fileformat=Parquet, the problem 
goes away, {color:red}but it still appears with partitioned tables. Do not miss 
the last picture, which is the core of the problem!{color})
Steps:
1. Download, install, and run Hive SQL (hive-1.2.1; this proves my Hive is OK).
!https://user-images.githubusercontent.com/8244097/32210043-7554300e-be46-11e7-8ce0-f61bc0bfa998.png!

2. Download, install, and run spark-sql (spark master, built from the latest 
commit 44c4003155c1d243ffe0f73d5537b4c8b3f3b564).
First time, spark-sql result: GOOD
!https://user-images.githubusercontent.com/8244097/32210200-5b02de20-be47-11e7-8eac-e0228a7cf7f5.png!

Second time, spark-sql result: GOOD
!https://user-images.githubusercontent.com/8244097/32210320-f518aa12-be47-11e7-9a86-a16819583748.png!

3. Use the spark-sql thriftserver.
First time, result: *{color:red}GOOD{color}*
Second time, result: *{color:red}BAD{color}*
!https://user-images.githubusercontent.com/8244097/32210560-47d431da-be49-11e7-8279-7dd88dda42a6.png!

{color:red}---
---{color}
1. set hive.default.fileformat=Parquet;
2. Create a partitioned table; the problem appears again.

!https://user-images.githubusercontent.com/8244097/32211152-3a4fe52e-be4c-11e7-9a8e-7a2b8f52ac6b.png!


was (Author: zhangxin0112zx):
[~mgaido]
[~srowen]
Now I try with the master branch.
The problem is still here.(Important: hive.default.fileformat  Text file is the 
parameter's default value. If I tried set hive.default.fileformat=Parquet; The 
problem has gone!! {color:red}Do not Miss the last pic that is the problem 
core!!{color})
Steps:
1.download . install . exec hivesql  (hive-1.2.1 . Here prove my hive is OK)
!https://user-images.githubusercontent.com/8244097/32210043-7554300e-be46-11e7-8ce0-f61bc0bfa998.png!

2.download . install . exec spark-sql  (spark-master I build it with master the 
lastest commit 44c4003155c1d243ffe0f73d5537b4c8b3f3b564)
First time . Spark-sql  result: GOOD
!https://user-images.githubusercontent.com/8244097/32210200-5b02de20-be47-11e7-8eac-e0228a7cf7f5.png!

Second time . Spark-sql  result: GOOD
!https://user-images.githubusercontent.com/8244097/32210320-f518aa12-be47-11e7-9a86-a16819583748.png!

3.use spark-sql thriftserver
First time . Spark-sql  result: *{color:red}GOOD{color}*
Second time .Spark-sql result: *{color:red}BAD{color}*
!https://user-images.githubusercontent.com/8244097/32210560-47d431da-be49-11e7-8279-7dd88dda42a6.png!

{color:red}---
---{color}
1.set hive.default.fileformat=Parquet; 
2.create partition table the problem again 

!https://user-images.githubusercontent.com/8244097/32211152-3a4fe52e-be4c-11e7-9a8e-7a2b8f52ac6b.png!

> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>  Labels: spark-sql
>
> use thriftserver create table with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: 

[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226337#comment-16226337
 ] 

xinzhang edited comment on SPARK-21725 at 10/31/17 7:07 AM:


[~mgaido]
[~srowen]
Now I try with the master branch.
The problem is still here.(Important: hive.default.fileformat  Text file is the 
parameter's default value. If I tried set hive.default.fileformat=Parquet; The 
problem has gone!! {color:red}Do not Miss the last pic that is the problem 
core!!{color})
Steps:
1.download . install . exec hivesql  (hive-1.2.1 . Here prove my hive is OK)
!https://user-images.githubusercontent.com/8244097/32210043-7554300e-be46-11e7-8ce0-f61bc0bfa998.png!

2.download . install . exec spark-sql  (spark-master I build it with master the 
lastest commit 44c4003155c1d243ffe0f73d5537b4c8b3f3b564)
First time . Spark-sql  result: GOOD
!https://user-images.githubusercontent.com/8244097/32210200-5b02de20-be47-11e7-8eac-e0228a7cf7f5.png!

Second time . Spark-sql  result: GOOD
!https://user-images.githubusercontent.com/8244097/32210320-f518aa12-be47-11e7-9a86-a16819583748.png!

3.use spark-sql thriftserver
First time . Spark-sql  result: *{color:red}GOOD{color}*
Second time .Spark-sql result: *{color:red}BAD{color}*
!https://user-images.githubusercontent.com/8244097/32210560-47d431da-be49-11e7-8279-7dd88dda42a6.png!

{color:red}---
---{color}
1.set hive.default.fileformat=Parquet; 
2.create partition table the problem again 

!https://user-images.githubusercontent.com/8244097/32211152-3a4fe52e-be4c-11e7-9a8e-7a2b8f52ac6b.png!


was (Author: zhangxin0112zx):
[~mgaido]
[~srowen]
Now I try with the master branch.
The problem is still here.(Important: hive.default.fileformat  Text file is the 
parameter's default value. If I tried set hive.default.fileformat=Parquet; The 
problem has gone!! {color:red}Do not Miss the last pic that is the problem 
core!!{color})
Steps:
1.download . install . exec hivesql  (hive-1.2.1 . Here prove my hive is OK)
!https://user-images.githubusercontent.com/8244097/32210043-7554300e-be46-11e7-8ce0-f61bc0bfa998.png!

2.download . install . exec spark-sql  (spark-master I build it with master the 
lastest commit 44c4003155c1d243ffe0f73d5537b4c8b3f3b564)
First time . Spark-sql  result: GOOD
!https://user-images.githubusercontent.com/8244097/32210200-5b02de20-be47-11e7-8eac-e0228a7cf7f5.png!

Second time . Spark-sql  result: GOOD
!https://user-images.githubusercontent.com/8244097/32210320-f518aa12-be47-11e7-9a86-a16819583748.png!

3.use spark-sql thriftserver
First time . Spark-sql  result: GOOD
Second time .Spark-sql result: BAD
!https://user-images.githubusercontent.com/8244097/32210560-47d431da-be49-11e7-8279-7dd88dda42a6.png!

---
1.set hive.default.fileformat=Parquet; 
2.create partition table the problem again 

!https://user-images.githubusercontent.com/8244097/32211152-3a4fe52e-be4c-11e7-9a8e-7a2b8f52ac6b.png!

> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>  Labels: spark-sql
>
> use thriftserver create table with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to 

[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226337#comment-16226337
 ] 

xinzhang edited comment on SPARK-21725 at 10/31/17 7:07 AM:


[~mgaido]
[~srowen]
Now I try with the master branch.
The problem is still here.(Important: hive.default.fileformat  Text file is the 
parameter's default value. If I tried set hive.default.fileformat=Parquet; The 
problem has gone!! {color:red}Do not Miss the last pic that is the problem 
core!!{color})
Steps:
1.download . install . exec hivesql  (hive-1.2.1 . Here prove my hive is OK)
!https://user-images.githubusercontent.com/8244097/32210043-7554300e-be46-11e7-8ce0-f61bc0bfa998.png!

2.download . install . exec spark-sql  (spark-master I build it with master the 
lastest commit 44c4003155c1d243ffe0f73d5537b4c8b3f3b564)
First time . Spark-sql  result: GOOD
!https://user-images.githubusercontent.com/8244097/32210200-5b02de20-be47-11e7-8eac-e0228a7cf7f5.png!

Second time . Spark-sql  result: GOOD
!https://user-images.githubusercontent.com/8244097/32210320-f518aa12-be47-11e7-9a86-a16819583748.png!

3.use spark-sql thriftserver
First time . Spark-sql  result: *{color:red}GOOD{color}*
Second time .Spark-sql result: *{color:red}BAD{color}*
!https://user-images.githubusercontent.com/8244097/32210560-47d431da-be49-11e7-8279-7dd88dda42a6.png!

{color:red}---
---{color}
1.set hive.default.fileformat=Parquet; 
2.create partition table the problem again 

!https://user-images.githubusercontent.com/8244097/32211152-3a4fe52e-be4c-11e7-9a8e-7a2b8f52ac6b.png!


was (Author: zhangxin0112zx):
[~mgaido]
[~srowen]
Now I try with the master branch.
The problem is still here.(Important: hive.default.fileformat  Text file is the 
parameter's default value. If I tried set hive.default.fileformat=Parquet; The 
problem has gone!! {color:red}Do not Miss the last pic that is the problem 
core!!{color})
Steps:
1.download . install . exec hivesql  (hive-1.2.1 . Here prove my hive is OK)
!https://user-images.githubusercontent.com/8244097/32210043-7554300e-be46-11e7-8ce0-f61bc0bfa998.png!

2.download . install . exec spark-sql  (spark-master I build it with master the 
lastest commit 44c4003155c1d243ffe0f73d5537b4c8b3f3b564)
First time . Spark-sql  result: GOOD
!https://user-images.githubusercontent.com/8244097/32210200-5b02de20-be47-11e7-8eac-e0228a7cf7f5.png!

Second time . Spark-sql  result: GOOD
!https://user-images.githubusercontent.com/8244097/32210320-f518aa12-be47-11e7-9a86-a16819583748.png!

3.use spark-sql thriftserver
First time . Spark-sql  result: *{color:red}GOOD{color}*
Second time .Spark-sql result: *{color:red}BAD{color}*
!https://user-images.githubusercontent.com/8244097/32210560-47d431da-be49-11e7-8279-7dd88dda42a6.png!

{color:red}---
---{color}
1.set hive.default.fileformat=Parquet; 
2.create partition table the problem again 

!https://user-images.githubusercontent.com/8244097/32211152-3a4fe52e-be4c-11e7-9a8e-7a2b8f52ac6b.png!

> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>  Labels: spark-sql
>
> use thriftserver create table with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: 

[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226337#comment-16226337
 ] 

xinzhang edited comment on SPARK-21725 at 10/31/17 7:03 AM:


[~mgaido]
[~srowen]
Now I try with the master branch.
The problem is still here.(Important: hive.default.fileformat  Text file is the 
parameter's default value. If I tried set hive.default.fileformat=Parquet; The 
problem has gone!! {color:red}Do not Miss the last pic that is the problem 
core!!{color})
Steps:
1.download . install . exec hivesql  (hive-1.2.1 . Here prove my hive is OK)
!https://user-images.githubusercontent.com/8244097/32210043-7554300e-be46-11e7-8ce0-f61bc0bfa998.png!

2.download . install . exec spark-sql  (spark-master I build it with master the 
lastest commit 44c4003155c1d243ffe0f73d5537b4c8b3f3b564)
First time . Spark-sql  result: GOOD
!https://user-images.githubusercontent.com/8244097/32210200-5b02de20-be47-11e7-8eac-e0228a7cf7f5.png!

Second time . Spark-sql  result: GOOD
!https://user-images.githubusercontent.com/8244097/32210320-f518aa12-be47-11e7-9a86-a16819583748.png!

3.use spark-sql thriftserver
First time . Spark-sql  result: GOOD
Second time .Spark-sql result: BAD
!https://user-images.githubusercontent.com/8244097/32210560-47d431da-be49-11e7-8279-7dd88dda42a6.png!

---
1.set hive.default.fileformat=Parquet; 
2.create partition table the problem again 

!https://user-images.githubusercontent.com/8244097/32211152-3a4fe52e-be4c-11e7-9a8e-7a2b8f52ac6b.png!


was (Author: zhangxin0112zx):
[~mgaido]
[~srowen]
Now I try with the master branch.
The problem is still here.(Important: hive.default.fileformat  Text file is the 
parameter's default value. If I tried set hive.default.fileformat=Parquet; The 
problem has gone!!)
Steps:
1.download . install . exec hivesql  (hive-1.2.1 . Here prove my hive is OK)
!https://user-images.githubusercontent.com/8244097/32210043-7554300e-be46-11e7-8ce0-f61bc0bfa998.png!

2.download . install . exec spark-sql  (spark-master I build it with master the 
lastest commit 44c4003155c1d243ffe0f73d5537b4c8b3f3b564)
First time . Spark-sql  result: GOOD
!https://user-images.githubusercontent.com/8244097/32210200-5b02de20-be47-11e7-8eac-e0228a7cf7f5.png!

Second time . Spark-sql  result: GOOD
!https://user-images.githubusercontent.com/8244097/32210320-f518aa12-be47-11e7-9a86-a16819583748.png!

3.use spark-sql thriftserver
First time . Spark-sql  result: GOOD
Second time .Spark-sql result: BAD
!https://user-images.githubusercontent.com/8244097/32210560-47d431da-be49-11e7-8279-7dd88dda42a6.png!

---
1.set hive.default.fileformat=Parquet; 
2.create partition table the problem again 

!https://user-images.githubusercontent.com/8244097/32211152-3a4fe52e-be4c-11e7-9a8e-7a2b8f52ac6b.png!

> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>  Labels: spark-sql
>
> use thriftserver create table with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at 

[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226337#comment-16226337
 ] 

xinzhang edited comment on SPARK-21725 at 10/31/17 7:01 AM:


[~mgaido]
[~srowen]
Now I try with the master branch.
The problem is still here.(Important: hive.default.fileformat  Text file is the 
parameter's default value. If I tried set hive.default.fileformat=Parquet; The 
problem has gone!!)
Steps:
1.download . install . exec hivesql  (hive-1.2.1 . Here prove my hive is OK)
!https://user-images.githubusercontent.com/8244097/32210043-7554300e-be46-11e7-8ce0-f61bc0bfa998.png!

2.download . install . exec spark-sql  (spark-master I build it with master the 
lastest commit 44c4003155c1d243ffe0f73d5537b4c8b3f3b564)
First time . Spark-sql  result: GOOD
!https://user-images.githubusercontent.com/8244097/32210200-5b02de20-be47-11e7-8eac-e0228a7cf7f5.png!

Second time . Spark-sql  result: GOOD
!https://user-images.githubusercontent.com/8244097/32210320-f518aa12-be47-11e7-9a86-a16819583748.png!

3.use spark-sql thriftserver
First time . Spark-sql  result: GOOD
Second time .Spark-sql result: BAD
!https://user-images.githubusercontent.com/8244097/32210560-47d431da-be49-11e7-8279-7dd88dda42a6.png!

---
1.set hive.default.fileformat=Parquet; 
2.create partition table the problem again 

!https://user-images.githubusercontent.com/8244097/32211152-3a4fe52e-be4c-11e7-9a8e-7a2b8f52ac6b.png!


was (Author: zhangxin0112zx):
[~mgaido]
[~srowen]
Now I try with the master branch.
The problem is still here.(Important: hive.default.fileformat  Text file is the 
parameter's default value. If I tried set hive.default.fileformat=Parquet; The 
problem has gone!!)
Steps:
1.download . install . exec hivesql  (hive-1.2.1 . Here prove my hive is OK)
!https://user-images.githubusercontent.com/8244097/32210043-7554300e-be46-11e7-8ce0-f61bc0bfa998.png!

2.download . install . exec spark-sql  (spark-master I build it with master the 
lastest commit 44c4003155c1d243ffe0f73d5537b4c8b3f3b564)
First time . Spark-sql  result: GOOD
!https://user-images.githubusercontent.com/8244097/32210200-5b02de20-be47-11e7-8eac-e0228a7cf7f5.png!

Second time . Spark-sql  result: GOOD
!https://user-images.githubusercontent.com/8244097/32210320-f518aa12-be47-11e7-9a86-a16819583748.png!

3.use spark-sql thriftserver
First time . Spark-sql  result: GOOD
Second time .Spark-sql result: BAD
!https://user-images.githubusercontent.com/8244097/32210560-47d431da-be49-11e7-8279-7dd88dda42a6.png!



> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>  Labels: spark-sql
>
> use thriftserver create table with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
> ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> 
> 

[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226337#comment-16226337
 ] 

xinzhang edited comment on SPARK-21725 at 10/31/17 6:55 AM:


[~mgaido]
[~srowen]
Now I try with the master branch.
The problem is still here.(Important: hive.default.fileformat  Text file is the 
parameter's default value. If I tried set hive.default.fileformat=Parquet; The 
problem has gone!!)
Steps:
1.download . install . exec hivesql  (hive-1.2.1 . Here prove my hive is OK)
!https://user-images.githubusercontent.com/8244097/32210043-7554300e-be46-11e7-8ce0-f61bc0bfa998.png!

2.download . install . exec spark-sql  (spark-master I build it with master the 
lastest commit 44c4003155c1d243ffe0f73d5537b4c8b3f3b564)
First time . Spark-sql  result: GOOD
!https://user-images.githubusercontent.com/8244097/32210200-5b02de20-be47-11e7-8eac-e0228a7cf7f5.png!

Second time . Spark-sql  result: GOOD
!https://user-images.githubusercontent.com/8244097/32210320-f518aa12-be47-11e7-9a86-a16819583748.png!

3.use spark-sql thriftserver
First time . Spark-sql  result: GOOD
Second time .Spark-sql result: BAD
!https://user-images.githubusercontent.com/8244097/32210560-47d431da-be49-11e7-8279-7dd88dda42a6.png!




was (Author: zhangxin0112zx):
[~mgaido]
[~srowen]
I have now tried with the master branch.
The problem is still there.
Steps:
1. Download and install Hive 1.2.1 and run a query through the Hive CLI (this shows my Hive installation is OK).
!https://user-images.githubusercontent.com/8244097/32210043-7554300e-be46-11e7-8ce0-f61bc0bfa998.png!

2. Download, build, and run spark-sql (built from the master branch at the 
latest commit, 44c4003155c1d243ffe0f73d5537b4c8b3f3b564).
First run of spark-sql: GOOD
!https://user-images.githubusercontent.com/8244097/32210200-5b02de20-be47-11e7-8eac-e0228a7cf7f5.png!

Second run of spark-sql: GOOD
!https://user-images.githubusercontent.com/8244097/32210320-f518aa12-be47-11e7-9a86-a16819583748.png!

3. Use the Spark SQL Thrift server.
First run: GOOD
Second run: BAD
!https://user-images.githubusercontent.com/8244097/32210560-47d431da-be49-11e7-8279-7dd88dda42a6.png!



> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>  Labels: spark-sql
>
> use thriftserver create table with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
> ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> 
> -
> the doc about the parquet table desc here 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
> Hive metastore Parquet table conversion
> When reading from and writing to Hive metastore Parquet tables, Spark SQL 
> will try to use its own Parquet support instead of Hive SerDe for better 
> performance. This behavior is controlled by the 
> spark.sql.hive.convertMetastoreParquet 

[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226337#comment-16226337
 ] 

xinzhang edited comment on SPARK-21725 at 10/31/17 6:43 AM:


[~mgaido]
[~srowen]
I have now tried with the master branch.
The problem is still there.
Steps:
1. Download and install Hive 1.2.1 and run a query through the Hive CLI (this shows my Hive installation is OK).
!https://user-images.githubusercontent.com/8244097/32210043-7554300e-be46-11e7-8ce0-f61bc0bfa998.png!

2. Download, build, and run spark-sql (built from the master branch at the 
latest commit, 44c4003155c1d243ffe0f73d5537b4c8b3f3b564).
First run of spark-sql: GOOD
!https://user-images.githubusercontent.com/8244097/32210200-5b02de20-be47-11e7-8eac-e0228a7cf7f5.png!

Second run of spark-sql: GOOD
!https://user-images.githubusercontent.com/8244097/32210320-f518aa12-be47-11e7-9a86-a16819583748.png!

3. Use the Spark SQL Thrift server.
First run: GOOD
Second run: BAD
!https://user-images.githubusercontent.com/8244097/32210560-47d431da-be49-11e7-8279-7dd88dda42a6.png!




was (Author: zhangxin0112zx):
I have now tried with the master branch.
The problem is still there.
Steps:
1. Download and install Hive 1.2.1 and run a query through the Hive CLI (this shows my Hive installation is OK).
!https://user-images.githubusercontent.com/8244097/32210043-7554300e-be46-11e7-8ce0-f61bc0bfa998.png!

2. Download, build, and run spark-sql (built from the master branch at the 
latest commit, 44c4003155c1d243ffe0f73d5537b4c8b3f3b564).
First run of spark-sql: GOOD
!https://user-images.githubusercontent.com/8244097/32210200-5b02de20-be47-11e7-8eac-e0228a7cf7f5.png!

Second run of spark-sql: GOOD
!https://user-images.githubusercontent.com/8244097/32210320-f518aa12-be47-11e7-9a86-a16819583748.png!

3. Use the Spark SQL Thrift server.
First run: GOOD
Second run: BAD
!https://user-images.githubusercontent.com/8244097/32210560-47d431da-be49-11e7-8279-7dd88dda42a6.png!



> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>  Labels: spark-sql
>
> use thriftserver create table with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
> ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> 
> -
> the doc about the parquet table desc here 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
> Hive metastore Parquet table conversion
> When reading from and writing to Hive metastore Parquet tables, Spark SQL 
> will try to use its own Parquet support instead of Hive SerDe for better 
> performance. This behavior is controlled by the 
> spark.sql.hive.convertMetastoreParquet configuration, and is turned on by 
> default.
> I am confused the problem appear in the table(partitions)  but it is ok with 
> table(with out partitions) . It means spark do 

[jira] [Commented] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226337#comment-16226337
 ] 

xinzhang commented on SPARK-21725:
--

I have now tried with the master branch.
The problem is still there.
Steps:
1. Download and install Hive 1.2.1 and run a query through the Hive CLI (this shows my Hive installation is OK).
!https://user-images.githubusercontent.com/8244097/32210043-7554300e-be46-11e7-8ce0-f61bc0bfa998.png!

2. Download, build, and run spark-sql (built from the master branch at the 
latest commit, 44c4003155c1d243ffe0f73d5537b4c8b3f3b564).
First run of spark-sql: GOOD
!https://user-images.githubusercontent.com/8244097/32210200-5b02de20-be47-11e7-8eac-e0228a7cf7f5.png!

Second run of spark-sql: GOOD
!https://user-images.githubusercontent.com/8244097/32210320-f518aa12-be47-11e7-9a86-a16819583748.png!

3. Use the Spark SQL Thrift server.
First run: GOOD
Second run: BAD
!https://user-images.githubusercontent.com/8244097/32210560-47d431da-be49-11e7-8279-7dd88dda42a6.png!



> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>  Labels: spark-sql
>
> use thriftserver create table with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
> ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> 
> -
> the doc about the parquet table desc here 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
> Hive metastore Parquet table conversion
> When reading from and writing to Hive metastore Parquet tables, Spark SQL 
> will try to use its own Parquet support instead of Hive SerDe for better 
> performance. This behavior is controlled by the 
> spark.sql.hive.convertMetastoreParquet configuration, and is turned on by 
> default.
> I am confused: the problem appears with a partitioned table but everything is OK 
> with a non-partitioned table. Does that mean Spark is not using its own Parquet 
> support here? (One way to check this is sketched right after this quote.)
> Any suggestion on how to avoid the issue would be appreciated.
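
A possible way to probe the question in the quoted paragraph (my suggestion, not part of the original report): flip the conversion flag in the beeline session and rerun the failing insert. If your build does not accept the flag at session level, set it in spark-defaults.conf or with --conf when starting the Thrift server.

-- spark.sql.hive.convertMetastoreParquet is on by default; turning it off forces
-- the Hive SerDe write path for metastore Parquet tables
SET spark.sql.hive.convertMetastoreParquet=false;
insert overwrite table tmp_10 partition(pt='1') select count(1) count from tmp_11;
-- if the second run now behaves differently from the default setting, the
-- Parquet conversion path is involved in the failure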






[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator

2017-10-31 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226307#comment-16226307
 ] 

yuhao yang commented on SPARK-13030:


Sorry for jumping in so late. I can see there has been a lot of effort here.

As far as I understand, making OneHotEncoder an Estimator essentially serves the 
requirement that the dimension and the index mapping stay consistent between 
training and prediction.

To achieve the same goal, could we instead add an optional numCategory: IntParam 
(or call it size) to OneHotEncoder? If it is set, every output vector would have 
size numCategory, and any index outside that bound could be handled by 
handleInvalid. IMO this is a much simpler and more robust solution, and it is 
fully backwards compatible.


> Change OneHotEncoder to Estimator
> -
>
> Key: SPARK-13030
> URL: https://issues.apache.org/jira/browse/SPARK-13030
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Wojciech Jurczyk
>
> OneHotEncoder should be an Estimator, just like in scikit-learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when number of categories is 
> different between training dataset and test dataset.


