[jira] [Created] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL

2019-01-04 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-26540:


 Summary: Requirement failed when reading numeric arrays from 
PostgreSQL
 Key: SPARK-26540
 URL: https://issues.apache.org/jira/browse/SPARK-26540
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 2.3.2, 2.2.2
Reporter: Takeshi Yamamuro


This bug was reported in spark-user: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html

To reproduce this:
{code}
// Creates a table in a PostgreSQL shell
postgres=# CREATE TABLE t (v numeric[], d  numeric);
CREATE TABLE
postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
INSERT 0 1
postgres=# SELECT * FROM t;
      v      |    d     
-------------+----------
 {.222,.332} | 222.4555
(1 row)

postgres=# \d t
Table "public.t"
 Column |   Type    | Modifiers 
--------+-----------+-----------
 v  | numeric[] | 
 d  | numeric   | 

// Then, reads it in Spark
./bin/spark-shell --jars=postgresql-42.2.4.jar -v

scala> import java.util.Properties
scala> val options = new Properties();
scala> options.setProperty("driver", "org.postgresql.Driver")
scala> options.setProperty("user", "maropu")
scala> options.setProperty("password", "")
scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
scala> pgTable.printSchema
root
 |-- v: array (nullable = true)
 |    |-- element: decimal(0,0) (containsNull = true)
 |-- d: decimal(38,18) (nullable = true)

scala> pgTable.show
19/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
exceeds max precision 0
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
...
{code}
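
For background on the failure: the JDBC metadata for an unconstrained numeric[] column 
reports precision 0 and scale 0, so the array element type is inferred as decimal(0,0), 
and the first real value (precision 4 here) then fails the precision check. Below is a 
minimal, hypothetical sketch of the kind of dialect-level handling that would avoid 
this, using the public JdbcDialect hook; it is not the actual fix, and the "_numeric" 
type name and the fallback to decimal(38,18) are assumptions.

{code:scala}
import java.sql.Types

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types._

// Hypothetical sketch, not the actual Spark change: fall back to the system
// default decimal when PostgreSQL reports an unconstrained numeric[] column
// (precision 0), instead of building an unusable decimal(0,0) element type.
object UnconstrainedNumericArrayDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    // PostgreSQL reports array types with a leading underscore, e.g. "_numeric".
    if (sqlType == Types.ARRAY && typeName == "_numeric" && size == 0) {
      Some(ArrayType(DecimalType.SYSTEM_DEFAULT)) // decimal(38,18) elements
    } else {
      None // defer to the built-in PostgresDialect for everything else
    }
  }
}

// Registering it makes spark.read.jdbc consult it before the built-in dialect:
// JdbcDialects.registerDialect(UnconstrainedNumericArrayDialect)
{code}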



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL

2019-01-04 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734715#comment-16734715
 ] 

Dongjoon Hyun edited comment on SPARK-26540 at 1/5/19 12:58 AM:


I'm checking the information from the PostgresJDBC ResultSet. I'll post the result 
here, [~maropu].


was (Author: dongjoon):
I'm checking the information PostgresJDBC ResultSet. I'll post the result here, 
[~maropu].

> Requirement failed when reading numeric arrays from PostgreSQL
> --
>
> Key: SPARK-26540
> URL: https://issues.apache.org/jira/browse/SPARK-26540
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This bug was reported in spark-user: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this;
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d  numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>   v  |d 
> -+--
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
> Table "public.t"
>  Column |   Type| Modifiers 
> +---+---
>  v  | numeric[] | 
>  d  | numeric   | 
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  ||-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
> exceeds max precision 0
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
> ...
> {code}
> I looked over the related code and then I think we need more logics to handle 
> numeric arrays;
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL

2019-01-04 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734715#comment-16734715
 ] 

Dongjoon Hyun commented on SPARK-26540:
---

I'm checking the information from the PostgresJDBC ResultSet. I'll post the result 
here, [~maropu].

> Requirement failed when reading numeric arrays from PostgreSQL
> --
>
> Key: SPARK-26540
> URL: https://issues.apache.org/jira/browse/SPARK-26540
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This bug was reported in spark-user: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this;
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d  numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>   v  |d 
> -+--
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
> Table "public.t"
>  Column |   Type| Modifiers 
> +---+---
>  v  | numeric[] | 
>  d  | numeric   | 
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  ||-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
> exceeds max precision 0
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
> ...
> {code}
> I looked over the related code and then I think we need more logics to handle 
> numeric arrays;
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-04 Thread chenliang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenliang updated SPARK-26543:
--
Description: 
For Spark SQL, when we enable adaptive execution with 'set 
spark.sql.adaptive.enabled=true', the ExchangeCoordinator is introduced to 
determine the number of post-shuffle partitions. But under certain conditions 
the coordinator does not perform well: there are always some retained tasks 
whose Shuffle Read Size / Records is 0.0B/0. We could increase 
spark.sql.adaptive.shuffle.targetPostShuffleInputSize to work around this, but 
that is unreasonable because targetPostShuffleInputSize should not be set too 
large. As shown below:

 !15_24_38__12_27_2018.jpg! 

We can filter out the useless (0 B) partitions in the ExchangeCoordinator 
automatically.



  was:
For SparkSQL ,when we  open AE by 'set spark.sql.adapative.enable=true',the 
ExchangeCoordinator will  introduced to determine the number of post-shuffle 
partitions. But in some certain conditions,the coordinator performed not very 
well, there are always some tasks retained and they worked with Shuffle Read 
Size / Records  0.0B/0 ,We could increase the 
spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this,but  this 
action is  unreasonable as targetPostShuffleInputSize Should not be set too 
large. 


We can filter the useless partition(0B) with ExchangeCoorditinator 
automatically 




> Support the coordinator to determine post-shuffle partitions more reasonably
> -
>
> Key: SPARK-26543
> URL: https://issues.apache.org/jira/browse/SPARK-26543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: chenliang
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: 15_24_38__12_27_2018.jpg
>
>
> For SparkSQL ,when we  open AE by 'set spark.sql.adapative.enable=true',the 
> ExchangeCoordinator will  introduced to determine the number of post-shuffle 
> partitions. But in some certain conditions,the coordinator performed not very 
> well, there are always some tasks retained and they worked with Shuffle Read 
> Size / Records  0.0B/0 ,We could increase the 
> spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this,but  this 
> action is  unreasonable as targetPostShuffleInputSize Should not be set too 
> large. As follow:
>  !15_24_38__12_27_2018.jpg! 
> We can filter the useless partition(0B) with ExchangeCoorditinator 
> automatically 
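
For reference, the two settings discussed above as they would be set from a Spark 
session (a sketch only; the config keys assume Spark 2.x adaptive execution, and the 
proposal here is precisely to avoid this manual tuning):

{code:scala}
// Enable adaptive execution so the ExchangeCoordinator decides the number
// of post-shuffle partitions.
spark.conf.set("spark.sql.adaptive.enabled", "true")

// Current workaround: raise the per-partition target input size so fewer
// near-empty (0.0B/0) partitions survive. Setting it too large reduces
// parallelism, which is why the issue proposes filtering out 0-byte
// partitions in the coordinator automatically instead.
spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize",
  (128L * 1024 * 1024).toString) // 128 MB, an arbitrary example value
{code}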



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-04 Thread chenliang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenliang updated SPARK-26543:
--
Description: 


For Spark SQL, when we enable adaptive execution with 'set 
spark.sql.adaptive.enabled=true', the ExchangeCoordinator is introduced to 
determine the number of post-shuffle partitions. But under certain conditions 
the coordinator does not perform well: there are always some retained tasks 
whose Shuffle Read Size / Records is 0.0B/0. We could increase 
spark.sql.adaptive.shuffle.targetPostShuffleInputSize to work around this, but 
that is unreasonable because targetPostShuffleInputSize should not be set too 
large. As shown below:

!15_24_38__12_27_2018.jpg!

We can filter out the useless (0 B) partitions in the ExchangeCoordinator automatically.

  was:
For SparkSQL ,when we  open AE by 'set spark.sql.adapative.enable=true',the 
ExchangeCoordinator will  introduced to determine the number of post-shuffle 
partitions. But in some certain conditions,the coordinator performed not very 
well, there are always some tasks retained and they worked with Shuffle Read 
Size / Records  0.0B/0 ,We could increase the 
spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this,but  this 
action is  unreasonable as targetPostShuffleInputSize Should not be set too 
large. As follow:

 !15_24_38__12_27_2018.jpg! 

We can filter the useless partition(0B) with ExchangeCoorditinator 
automatically 




> Support the coordinator to determine post-shuffle partitions more reasonably
> -
>
> Key: SPARK-26543
> URL: https://issues.apache.org/jira/browse/SPARK-26543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: chenliang
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: 15_24_38__12_27_2018.jpg
>
>
> For SparkSQL ,when we open AE by 'set spark.sql.adapative.enable=true',the 
> ExchangeCoordinator will introduced to determine the number of post-shuffle 
> partitions. But in some certain conditions,the coordinator performed not very 
> well, there are always some tasks retained and they worked with Shuffle Read 
> Size / Records 0.0B/0 ,We could increase the 
> spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this,but this 
> action is unreasonable as targetPostShuffleInputSize Should not be set too 
> large. As follow:
> !15_24_38__12_27_2018.jpg!
> We can filter the useless partition(0B) with ExchangeCoorditinator 
> automatically



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL

2019-01-04 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734708#comment-16734708
 ] 

Dongjoon Hyun edited comment on SPARK-26540 at 1/5/19 12:41 AM:


For me, this is neither a correctness issue nor a regression. Could you check 
older Spark versions like 2.0/2.1? I guess this is also not a data loss issue.


was (Author: dongjoon):
For me, this is not a correctness issue and not a regression. Could you check 
the older Spark like 2.0/2.1?

> Requirement failed when reading numeric arrays from PostgreSQL
> --
>
> Key: SPARK-26540
> URL: https://issues.apache.org/jira/browse/SPARK-26540
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This bug was reported in spark-user: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this;
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d  numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>   v  |d 
> -+--
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
> Table "public.t"
>  Column |   Type| Modifiers 
> +---+---
>  v  | numeric[] | 
>  d  | numeric   | 
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  ||-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
> exceeds max precision 0
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
> ...
> {code}
> I looked over the related code and then I think we need more logics to handle 
> numeric arrays;
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL

2019-01-04 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734708#comment-16734708
 ] 

Dongjoon Hyun commented on SPARK-26540:
---

For me, this is neither a correctness issue nor a regression. Could you check 
older Spark versions like 2.0/2.1?

> Requirement failed when reading numeric arrays from PostgreSQL
> --
>
> Key: SPARK-26540
> URL: https://issues.apache.org/jira/browse/SPARK-26540
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This bug was reported in spark-user: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this;
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d  numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>   v  |d 
> -+--
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
> Table "public.t"
>  Column |   Type| Modifiers 
> +---+---
>  v  | numeric[] | 
>  d  | numeric   | 
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  ||-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
> exceeds max precision 0
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
> ...
> {code}
> I looked over the related code and then I think we need more logics to handle 
> numeric arrays;
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL

2019-01-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26540:
--
Priority: Minor  (was: Major)

> Requirement failed when reading numeric arrays from PostgreSQL
> --
>
> Key: SPARK-26540
> URL: https://issues.apache.org/jira/browse/SPARK-26540
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This bug was reported in spark-user: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this;
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d  numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>   v  |d 
> -+--
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
> Table "public.t"
>  Column |   Type| Modifiers 
> +---+---
>  v  | numeric[] | 
>  d  | numeric   | 
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  ||-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
> exceeds max precision 0
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
> ...
> {code}
> I looked over the related code and then I think we need more logics to handle 
> numeric arrays;
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL

2019-01-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26540:
--
Issue Type: Improvement  (was: Bug)

> Requirement failed when reading numeric arrays from PostgreSQL
> --
>
> Key: SPARK-26540
> URL: https://issues.apache.org/jira/browse/SPARK-26540
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This bug was reported in spark-user: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this;
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d  numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>   v  |d 
> -+--
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
> Table "public.t"
>  Column |   Type| Modifiers 
> +---+---
>  v  | numeric[] | 
>  d  | numeric   | 
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  ||-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
> exceeds max precision 0
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
> ...
> {code}
> I looked over the related code and then I think we need more logics to handle 
> numeric arrays;
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL

2019-01-04 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734718#comment-16734718
 ] 

Dongjoon Hyun commented on SPARK-26540:
---

[~maropu].

Please specify the precision and scale. Then, it will work.

{code}
CREATE TABLE t (v numeric(7,3)[], d  numeric);
{code}

{code}
scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
pgTable: org.apache.spark.sql.DataFrame = [v: array<decimal(7,3)>, d: 
decimal(38,18)]

scala> pgTable.show
+------------+------------+
|           v|           d|
+------------+------------+
|[.222, .332]|222.45550...|
+------------+------------+
{code}
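
An alternative workaround, sketched below under the assumption that a parenthesized, 
aliased subquery is passed as the table argument: cast the unconstrained array on the 
PostgreSQL side so the driver reports a concrete precision and scale.

{code:scala}
// Hypothetical workaround sketch (not from this ticket): push the cast down
// to PostgreSQL so the element type is reported as numeric(7,3).
val casted = spark.read.jdbc(
  "jdbc:postgresql:postgres",
  "(SELECT v::numeric(7,3)[] AS v, d FROM t) AS t_casted",
  options)
casted.printSchema() // v should come back as array<decimal(7,3)>
casted.show()
{code}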

> Requirement failed when reading numeric arrays from PostgreSQL
> --
>
> Key: SPARK-26540
> URL: https://issues.apache.org/jira/browse/SPARK-26540
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This bug was reported in spark-user: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this;
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d  numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>   v  |d 
> -+--
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
> Table "public.t"
>  Column |   Type| Modifiers 
> +---+---
>  v  | numeric[] | 
>  d  | numeric   | 
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  ||-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
> exceeds max precision 0
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
> ...
> {code}
> I looked over the related code and then I think we need more logics to handle 
> numeric arrays;
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL

2019-01-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26540:


Assignee: (was: Apache Spark)

> Requirement failed when reading numeric arrays from PostgreSQL
> --
>
> Key: SPARK-26540
> URL: https://issues.apache.org/jira/browse/SPARK-26540
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This bug was reported in spark-user: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this;
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d  numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>   v  |d 
> -+--
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
> Table "public.t"
>  Column |   Type| Modifiers 
> +---+---
>  v  | numeric[] | 
>  d  | numeric   | 
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  ||-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
> exceeds max precision 0
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
> ...
> {code}
> I looked over the related code and then I think we need more logics to handle 
> numeric arrays;
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL

2019-01-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26540:


Assignee: Apache Spark

> Requirement failed when reading numeric arrays from PostgreSQL
> --
>
> Key: SPARK-26540
> URL: https://issues.apache.org/jira/browse/SPARK-26540
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> This bug was reported in spark-user: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this;
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d  numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>   v  |d 
> -+--
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
> Table "public.t"
>  Column |   Type| Modifiers 
> +---+---
>  v  | numeric[] | 
>  d  | numeric   | 
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  ||-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
> exceeds max precision 0
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
> ...
> {code}
> I looked over the related code and then I think we need more logics to handle 
> numeric arrays;
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL

2019-01-04 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734731#comment-16734731
 ] 

Dongjoon Hyun commented on SPARK-26540:
---

Thank you for checking, [~maropu]. I made a PR and will add a PostgreSQL 
integration test for this.
- https://github.com/apache/spark/pull/23458
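
For context, the kind of end-to-end check such an integration test makes, sketched 
with a plain spark.read.jdbc call against the table above (this is not the 
docker-based suite added in the PR):

{code:scala}
import org.apache.spark.sql.types.ArrayType

// After the fix, an unconstrained numeric[] column should be readable instead
// of failing the decimal precision check.
val df = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
assert(df.schema("v").dataType.isInstanceOf[ArrayType]) // numeric[] maps to an array type
assert(df.count() == 1L)                                // the row can actually be read
df.show(false)
{code}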

> Requirement failed when reading numeric arrays from PostgreSQL
> --
>
> Key: SPARK-26540
> URL: https://issues.apache.org/jira/browse/SPARK-26540
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This bug was reported in spark-user: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this;
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d  numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>   v  |d 
> -+--
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
> Table "public.t"
>  Column |   Type| Modifiers 
> +---+---
>  v  | numeric[] | 
>  d  | numeric   | 
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  ||-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
> exceeds max precision 0
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
> ...
> {code}
> I looked over the related code and then I think we need more logics to handle 
> numeric arrays;
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26544) escape string when serialize map/array make it a valid json (keep alignment with hive)

2019-01-04 Thread EdisonWang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-26544:
---
Summary: escape string when serialize map/array make it a valid json (keep 
alignment with hive)  (was: the string serialized from map/array type is not a 
valid json (while hive is))

> escape string when serialize map/array make it a valid json (keep alignment 
> with hive)
> --
>
> Key: SPARK-26544
> URL: https://issues.apache.org/jira/browse/SPARK-26544
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: EdisonWang
>Priority: Major
>
> when reading a hive table with a map/array column, the string serialized by 
> the spark thrift server is not valid json, while hive's is. 
> For example, when selecting a field whose type is map, the spark 
> thrift server returns 
>  
> {code:java}
> {"author_id":"123","log_pb":"{"impr_id":"20181231"}","request_id":"001"}
> {code}
>  
> while hive thriftserver returns
>  
> {code:java}
> {"author_id":"123", "log_pb":"{\"impr_id\":\"20181231\"}","request_id":"001"}
> {code}
>  
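
The difference is easy to verify with any JSON parser; the sketch below uses Jackson 
and the two example rows from the description (the values are the reporter's, the 
check itself is illustrative only):

{code:scala}
import com.fasterxml.jackson.databind.ObjectMapper

import scala.util.Try

val mapper = new ObjectMapper()

// What the Spark thrift server currently returns (inner quotes not escaped):
val sparkRow = """{"author_id":"123","log_pb":"{"impr_id":"20181231"}","request_id":"001"}"""
// What the Hive thrift server returns (inner quotes escaped):
val hiveRow  = """{"author_id":"123","log_pb":"{\"impr_id\":\"20181231\"}","request_id":"001"}"""

println(Try(mapper.readTree(sparkRow)).isSuccess) // false: not valid JSON
println(Try(mapper.readTree(hiveRow)).isSuccess)  // true: valid JSON
{code}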



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26544) escape string when serialize map/array to make it a valid json (keep alignment with hive)

2019-01-04 Thread EdisonWang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-26544:
---
Summary: escape string when serialize map/array to make it a valid json 
(keep alignment with hive)  (was: escape string when serialize map/array make 
it a valid json (keep alignment with hive))

> escape string when serialize map/array to make it a valid json (keep 
> alignment with hive)
> -
>
> Key: SPARK-26544
> URL: https://issues.apache.org/jira/browse/SPARK-26544
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: EdisonWang
>Priority: Major
>
> when reading a hive table with map/array type, the string serialized by 
> thrift server is not a valid json, while hive is. 
> For example, select a field whose type is map, the spark 
> thrift server returns 
>  
> {code:java}
> {"author_id":"123","log_pb":"{"impr_id":"20181231"}","request_id":"001"}
> {code}
>  
> while hive thriftserver returns
>  
> {code:java}
> {"author_id":"123", "log_pb":"{\"impr_id\":\"20181231\"}","request_id":"001"}
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL

2019-01-04 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734701#comment-16734701
 ] 

Takeshi Yamamuro commented on SPARK-26540:
--

[~dongjoon] [~smilegator] could you check whether this issue is worth blocking 
the release process?

> Requirement failed when reading numeric arrays from PostgreSQL
> --
>
> Key: SPARK-26540
> URL: https://issues.apache.org/jira/browse/SPARK-26540
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This bug was reported in spark-user: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this;
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d  numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>   v  |d 
> -+--
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
> Table "public.t"
>  Column |   Type| Modifiers 
> +---+---
>  v  | numeric[] | 
>  d  | numeric   | 
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  ||-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
> exceeds max precision 0
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
> ...
> {code}
> I looked over the related code and then I think we need more logics to handle 
> numeric arrays;
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL

2019-01-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26540:
--
Issue Type: Bug  (was: Improvement)

> Requirement failed when reading numeric arrays from PostgreSQL
> --
>
> Key: SPARK-26540
> URL: https://issues.apache.org/jira/browse/SPARK-26540
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This bug was reported in spark-user: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this;
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d  numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>   v  |d 
> -+--
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
> Table "public.t"
>  Column |   Type| Modifiers 
> +---+---
>  v  | numeric[] | 
>  d  | numeric   | 
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  ||-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
> exceeds max precision 0
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
> ...
> {code}
> I looked over the related code and then I think we need more logics to handle 
> numeric arrays;
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26536) Upgrade Mockito to 2.23.4

2019-01-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26536.
---
   Resolution: Fixed
 Assignee: Dongjoon Hyun  (was: Apache Spark)
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/23452

> Upgrade Mockito to 2.23.4
> -
>
> Key: SPARK-26536
> URL: https://issues.apache.org/jira/browse/SPARK-26536
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> This issue upgrades Mockito from 1.10.19 to 2.23.4. The following changes are 
> required.
> - Replace `org.mockito.Matchers` with `org.mockito.ArgumentMatchers`
> - Replace `anyObject` with `any`
> - Replace `getArgumentAt` with `getArgument` and add a type annotation.
> - Use the `isNull` matcher where `null` is passed in.
> {code}
>  saslHandler.channelInactive(null);
> -verify(handler).channelInactive(any(TransportClient.class));
> +verify(handler).channelInactive(isNull());
> {code}
> - Make and use `doReturn` wrapper to avoid 
> [SI-4775|https://issues.scala-lang.org/browse/SI-4775]
> {code}
> private def doReturn(value: Any) = org.mockito.Mockito.doReturn(value, 
> Seq.empty: _*)
> {code}
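
A self-contained sketch of the migration patterns listed above (the Handler trait and 
the values are made up for illustration; this is not code from the Spark PR):

{code:scala}
import org.mockito.ArgumentMatchers.any            // Mockito 2: was org.mockito.Matchers
import org.mockito.Mockito.{mock, verify, when}
import org.mockito.invocation.InvocationOnMock
import org.mockito.stubbing.Answer

trait Handler { def handle(msg: String): String }  // hypothetical trait for illustration

val handler = mock(classOf[Handler])
when(handler.handle(any[String]())).thenAnswer(new Answer[String] {
  override def answer(inv: InvocationOnMock): String = {
    // Mockito 2: getArgument replaces getArgumentAt and takes a type annotation.
    val msg = inv.getArgument[String](0)
    s"handled: $msg"
  }
})

assert(handler.handle("ping") == "handled: ping")
verify(handler).handle(any[String]())              // `any` replaces the removed `anyObject`
{code}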



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-04 Thread chenliang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenliang updated SPARK-26543:
--
Description: 
For Spark SQL, when we enable adaptive execution with 'set 
spark.sql.adaptive.enabled=true', the ExchangeCoordinator is introduced to 
determine the number of post-shuffle partitions. But under certain conditions 
the coordinator does not perform well: there are always some retained tasks 
whose Shuffle Read Size / Records is 0.0B/0. We could increase 
spark.sql.adaptive.shuffle.targetPostShuffleInputSize to work around this, but 
that is unreasonable because targetPostShuffleInputSize should not be set too 
large. As shown below:

{noformat}
 !screenshot-1.png! 
{noformat}

We can filter out the useless (0 B) partitions in the ExchangeCoordinator automatically.

  was:
For SparkSQL ,when we open AE by 'set spark.sql.adapative.enable=true',the 
ExchangeCoordinator will introduced to determine the number of post-shuffle 
partitions. But in some certain conditions,the coordinator performed not very 
well, there are always some tasks retained and they worked with Shuffle Read 
Size / Records 0.0B/0 ,We could increase the 
spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this,but this 
action is unreasonable as targetPostShuffleInputSize Should not be set too 
large. As follow:

We can filter the useless partition(0B) with ExchangeCoorditinator automatically


> Support the coordinator to determine post-shuffle partitions more reasonably
> -
>
> Key: SPARK-26543
> URL: https://issues.apache.org/jira/browse/SPARK-26543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: chenliang
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: screenshot-1.png
>
>
> For SparkSQL ,when we open AE by 'set spark.sql.adapative.enable=true',the 
> ExchangeCoordinator will introduced to determine the number of post-shuffle 
> partitions. But in some certain conditions,the coordinator performed not very 
> well, there are always some tasks retained and they worked with Shuffle Read 
> Size / Records 0.0B/0 ,We could increase the 
> spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this,but this 
> action is unreasonable as targetPostShuffleInputSize Should not be set too 
> large. As follow:
> {noformat}
>  !screenshot-1.png! 
> {noformat}
> We can filter the useless partition(0B) with ExchangeCoorditinator 
> automatically



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-04 Thread chenliang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenliang updated SPARK-26543:
--
Description: 
For Spark SQL, when we enable adaptive execution with 'set 
spark.sql.adaptive.enabled=true', the ExchangeCoordinator is introduced to 
determine the number of post-shuffle partitions. But under certain conditions 
the coordinator does not perform well: there are always some retained tasks 
whose Shuffle Read Size / Records is 0.0B/0. We could increase 
spark.sql.adaptive.shuffle.targetPostShuffleInputSize to work around this, but 
that is unreasonable because targetPostShuffleInputSize should not be set too 
large.

We can filter out the useless (0 B) partitions in the ExchangeCoordinator automatically.

  was:
For SparkSQL ,when we open AE by 'set spark.sql.adapative.enable=true',the 
ExchangeCoordinator will introduced to determine the number of post-shuffle 
partitions. But in some certain conditions,the coordinator performed not very 
well, there are always some tasks retained and they worked with Shuffle Read 
Size / Records 0.0B/0 ,We could increase the 
spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this,but this 
action is unreasonable as targetPostShuffleInputSize Should not be set too 
large. As follow:

 !screenshot-1.png! !15_24_38__12_27_2018.jpg!

We can filter the useless partition(0B) with ExchangeCoorditinator automatically


> Support the coordinator to determine post-shuffle partitions more reasonably
> -
>
> Key: SPARK-26543
> URL: https://issues.apache.org/jira/browse/SPARK-26543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: chenliang
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: screenshot-1.png
>
>
> For SparkSQL ,when we open AE by 'set spark.sql.adapative.enable=true',the 
> ExchangeCoordinator will introduced to determine the number of post-shuffle 
> partitions. But in some certain conditions,the coordinator performed not very 
> well, there are always some tasks retained and they worked with Shuffle Read 
> Size / Records 0.0B/0 ,We could increase the 
> spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this,but this 
> action is unreasonable as targetPostShuffleInputSize Should not be set too 
> large. As follow:
> We can filter the useless partition(0B) with ExchangeCoorditinator 
> automatically



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL

2019-01-04 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734727#comment-16734727
 ] 

Takeshi Yamamuro edited comment on SPARK-26540 at 1/5/19 1:42 AM:
--

I checked that the query fails in v2.0 but passes in v1.6.
{code}
// spark-1.6.3
scala> val pgTable = sqlContext.read.jdbc("jdbc:postgresql:postgres", "t", 
options)
scala> pgTable.printSchema
root
 |-- v: array (nullable = true)
 |    |-- element: decimal(38,18) (containsNull = true)
 |-- d: decimal(38,18) (nullable = true)

scala> pgTable.show
+--------+------------+
|       v|           d|
+--------+------------+
|[.222...|222.45550...|
+--------+------------+
{code}


was (Author: maropu):
yea, the workaround is good.
The query failed in v2.0, but passed in v1.6.
{code}
// spark-1.6.3
scala> val pgTable = sqlContext.read.jdbc("jdbc:postgresql:postgres", "t", 
options)
scala> pgTable.printSchema
root
 |-- v: array (nullable = true)
 ||-- element: decimal(38,18) (containsNull = true)
 |-- d: decimal(38,18) (nullable = true)

scala> pgTable.show
+++
|   v|   d|
+++
|[.222...|222.45550...|
+++
{code}

> Requirement failed when reading numeric arrays from PostgreSQL
> --
>
> Key: SPARK-26540
> URL: https://issues.apache.org/jira/browse/SPARK-26540
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This bug was reported in spark-user: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this;
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d  numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>   v  |d 
> -+--
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
> Table "public.t"
>  Column |   Type| Modifiers 
> +---+---
>  v  | numeric[] | 
>  d  | numeric   | 
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  ||-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
> exceeds max precision 0
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
> ...
> {code}
> I looked over the related code and then I think we need more logics to handle 
> numeric arrays;
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL

2019-01-04 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734719#comment-16734719
 ] 

Dongjoon Hyun edited comment on SPARK-26540 at 1/5/19 1:34 AM:
---

Given that Apache Spark handles the decimal type with precision and scale, the 
reported case looks like a corner case.


was (Author: dongjoon):
Given that Apache Spark handles the decimal type with precision and scale, the 
reported case looks like a corner case. And I believe this will be a minor 
improvement issue to cover that corner case.

> Requirement failed when reading numeric arrays from PostgreSQL
> --
>
> Key: SPARK-26540
> URL: https://issues.apache.org/jira/browse/SPARK-26540
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This bug was reported in spark-user: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this;
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d  numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>   v  |d 
> -+--
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
> Table "public.t"
>  Column |   Type| Modifiers 
> +---+---
>  v  | numeric[] | 
>  d  | numeric   | 
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  ||-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
> exceeds max precision 0
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
> ...
> {code}
> I looked over the related code and then I think we need more logics to handle 
> numeric arrays;
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26541) Add `-Pdocker-integration-tests` to `dev/scalastyle`

2019-01-04 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-26541:
-

 Summary: Add `-Pdocker-integration-tests` to `dev/scalastyle`
 Key: SPARK-26541
 URL: https://issues.apache.org/jira/browse/SPARK-26541
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


This issue makes `scalastyle` check the `docker-integration-tests` module and 
fixes one error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26541) Add `-Pdocker-integration-tests` to `dev/scalastyle`

2019-01-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26541:


Assignee: (was: Apache Spark)

> Add `-Pdocker-integration-tests` to `dev/scalastyle`
> 
>
> Key: SPARK-26541
> URL: https://issues.apache.org/jira/browse/SPARK-26541
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue makes `scalastyle` to check `docker-integration-tests` module and 
> fixes one error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26541) Add `-Pdocker-integration-tests` to `dev/scalastyle`

2019-01-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26541:


Assignee: Apache Spark

> Add `-Pdocker-integration-tests` to `dev/scalastyle`
> 
>
> Key: SPARK-26541
> URL: https://issues.apache.org/jira/browse/SPARK-26541
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> This issue makes `scalastyle` to check `docker-integration-tests` module and 
> fixes one error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26542) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-04 Thread chenliang (JIRA)
chenliang created SPARK-26542:
-

 Summary: Support the coordinator to determine post-shuffle 
partitions more reasonably
 Key: SPARK-26542
 URL: https://issues.apache.org/jira/browse/SPARK-26542
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.2, 2.3.1, 2.3.0, 2.2.2, 2.2.1, 2.2.0
Reporter: chenliang
 Fix For: 2.3.0


For Spark SQL, when we enable adaptive execution with 'set 
spark.sql.adaptive.enabled=true', the ExchangeCoordinator is introduced to 
determine the number of post-shuffle partitions. But under certain conditions 
the coordinator does not perform very well: there are always some tasks 
retained that work with a Shuffle Read Size / Records of 0.0B/0. We could 
increase spark.sql.adaptive.shuffle.targetPostShuffleInputSize to work around 
this, but that is unreasonable, as targetPostShuffleInputSize should not be set 
too large.
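
For reference, a minimal sketch of the two settings discussed above as they 
might be used from a Spark 2.3 shell; the aggregation below is made up purely 
to trigger a shuffle, and this only illustrates the knobs, it is not a fix for 
the 0.0B/0 partitions:

{code:scala}
// Enable adaptive execution so the ExchangeCoordinator decides the number of
// post-shuffle partitions.
spark.conf.set("spark.sql.adaptive.enabled", "true")

// Raising the target input size packs more map output into each post-shuffle
// partition, which reduces the number of empty 0.0B/0 partitions, but as noted
// above this knob is too blunt to rely on and should not be set too large.
spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "134217728") // 128 MB

// Hypothetical aggregation whose post-shuffle partition count is then chosen
// by the coordinator.
spark.range(0L, 1000000L).selectExpr("id % 16 AS k").groupBy("k").count().show()
{code}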




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-04 Thread chenliang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenliang updated SPARK-26543:
--
Description: 
For Spark SQL, when we enable adaptive execution with 'set 
spark.sql.adaptive.enabled=true', the ExchangeCoordinator is introduced to 
determine the number of post-shuffle partitions. But under certain conditions 
the coordinator does not perform very well: there are always some tasks 
retained that work with a Shuffle Read Size / Records of 0.0B/0. We could 
increase spark.sql.adaptive.shuffle.targetPostShuffleInputSize to work around 
this, but that is unreasonable, as targetPostShuffleInputSize should not be set 
too large. As shown below:

 !screenshot-1.png! !15_24_38__12_27_2018.jpg!

We can have the ExchangeCoordinator filter out the useless (0B) partitions 
automatically.

  was:


For Spark SQL, when we enable adaptive execution with 'set 
spark.sql.adaptive.enabled=true', the ExchangeCoordinator is introduced to 
determine the number of post-shuffle partitions. But under certain conditions 
the coordinator does not perform very well: there are always some tasks 
retained that work with a Shuffle Read Size / Records of 0.0B/0. We could 
increase spark.sql.adaptive.shuffle.targetPostShuffleInputSize to work around 
this, but that is unreasonable, as targetPostShuffleInputSize should not be set 
too large. As shown below:

!15_24_38__12_27_2018.jpg!

We can have the ExchangeCoordinator filter out the useless (0B) partitions 
automatically.


> Support the coordinator to demerminte post-shuffle partitions more reasonably
> -
>
> Key: SPARK-26543
> URL: https://issues.apache.org/jira/browse/SPARK-26543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: chenliang
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: screenshot-1.png
>
>
> For SparkSQL ,when we open AE by 'set spark.sql.adapative.enable=true',the 
> ExchangeCoordinator will introduced to determine the number of post-shuffle 
> partitions. But in some certain conditions,the coordinator performed not very 
> well, there are always some tasks retained and they worked with Shuffle Read 
> Size / Records 0.0B/0 ,We could increase the 
> spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this,but this 
> action is unreasonable as targetPostShuffleInputSize Should not be set too 
> large. As follow:
>  !screenshot-1.png! !15_24_38__12_27_2018.jpg!
> We can filter the useless partition(0B) with ExchangeCoorditinator 
> automatically



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26544) the string serialized from map/array type is not a valid json (while hive is)

2019-01-04 Thread EdisonWang (JIRA)
EdisonWang created SPARK-26544:
--

 Summary: the string serialized from map/array type is not a valid 
json (while hive is)
 Key: SPARK-26544
 URL: https://issues.apache.org/jira/browse/SPARK-26544
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: EdisonWang


When reading a Hive table with a map/array type column, the string serialized 
by the Spark thrift server is not valid JSON, while Hive's is.

For example, when selecting a field whose type is map, the Spark thrift 
server returns

 
{code:java}
{"author_id":"123","log_pb":"{"impr_id":"20181231"}","request_id":"001"}
{code}
 

while the Hive thrift server returns

 
{code:java}
{"author_id":"123", "log_pb":"{\"impr_id\":\"20181231\"}","request_id":"001"}
{code}
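
For what it's worth, serializing the column explicitly on the Spark side does 
produce a valid JSON string. A minimal sketch of that workaround, where the 
table name `logs` and the map-typed column `log_pb` are made up for 
illustration:

{code:scala}
import org.apache.spark.sql.functions.{col, to_json}

// Serialize the map column to a JSON string explicitly instead of relying on
// the default string rendering that the thrift server applies to complex types.
val df = spark.table("logs") // assumed Hive table with a map-typed column `log_pb`
df.withColumn("log_pb", to_json(col("log_pb")))
  .show(false) // the inner quotes come out escaped, so the value is valid JSON
{code}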
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-04 Thread chenliang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenliang updated SPARK-26543:
--
Attachment: (was: 15_24_38__12_27_2018.jpg)

> Support the coordinator to demerminte post-shuffle partitions more reasonably
> -
>
> Key: SPARK-26543
> URL: https://issues.apache.org/jira/browse/SPARK-26543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: chenliang
>Priority: Major
> Fix For: 2.3.0
>
>
> For SparkSQL ,when we open AE by 'set spark.sql.adapative.enable=true',the 
> ExchangeCoordinator will introduced to determine the number of post-shuffle 
> partitions. But in some certain conditions,the coordinator performed not very 
> well, there are always some tasks retained and they worked with Shuffle Read 
> Size / Records 0.0B/0 ,We could increase the 
> spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this,but this 
> action is unreasonable as targetPostShuffleInputSize Should not be set too 
> large. As follow:
> !15_24_38__12_27_2018.jpg!
> We can filter the useless partition(0B) with ExchangeCoorditinator 
> automatically



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL

2019-01-04 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734719#comment-16734719
 ] 

Dongjoon Hyun edited comment on SPARK-26540 at 1/5/19 1:41 AM:
---

Given that Apache Spark handles the decimal type with precision and scale, the 
reported case looks like a corner case.

Sorry, I found that the original email also described this situation. I missed 
that. Yep, I agree. I'll make a PR for this one.


was (Author: dongjoon):
Given that Apache Spark handles the decimal type with precision and scale, the 
reported case looks like a corner case.

> Requirement failed when reading numeric arrays from PostgreSQL
> --
>
> Key: SPARK-26540
> URL: https://issues.apache.org/jira/browse/SPARK-26540
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This bug was reported in spark-user: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this;
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d  numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>   v  |d 
> -+--
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
> Table "public.t"
>  Column |   Type| Modifiers 
> +---+---
>  v  | numeric[] | 
>  d  | numeric   | 
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  ||-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
> exceeds max precision 0
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
> ...
> {code}
> I looked over the related code and then I think we need more logics to handle 
> numeric arrays;
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL

2019-01-04 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734727#comment-16734727
 ] 

Takeshi Yamamuro commented on SPARK-26540:
--

yea, the workaround is good.
The query failed in v2.0, but passed in v1.6.
{code}
// spark-1.6.3
scala> val pgTable = sqlContext.read.jdbc("jdbc:postgresql:postgres", "t", 
options)
scala> pgTable.printSchema
root
 |-- v: array (nullable = true)
 ||-- element: decimal(38,18) (containsNull = true)
 |-- d: decimal(38,18) (nullable = true)

scala> pgTable.show
+++
|   v|   d|
+++
|[.222...|222.45550...|
+++
{code}
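
The two schemas above suggest one possible direction, sketched below: do not 
propagate the zero precision/scale that the PostgreSQL driver reports for an 
unconstrained numeric[] element, and fall back to the default decimal instead. 
This is only an illustration of the idea, not the actual patch:

{code:scala}
import org.apache.spark.sql.types.{DataType, DecimalType}

// PostgreSQL reports an unconstrained NUMERIC inside an array as precision 0,
// which currently ends up as the invalid decimal(0,0) element type. Mapping
// that case to decimal(38,18) mirrors what the scalar column `d` already gets.
def toCatalystDecimal(precision: Int, scale: Int): DataType =
  if (precision <= 0) DecimalType(38, 18)          // unconstrained numeric
  else DecimalType(math.min(precision, 38), scale) // bounded numeric
{code}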

> Requirement failed when reading numeric arrays from PostgreSQL
> --
>
> Key: SPARK-26540
> URL: https://issues.apache.org/jira/browse/SPARK-26540
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This bug was reported in spark-user: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this;
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d  numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>   v  |d 
> -+--
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
> Table "public.t"
>  Column |   Type| Modifiers 
> +---+---
>  v  | numeric[] | 
>  d  | numeric   | 
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  ||-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
> exceeds max precision 0
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
> ...
> {code}
> I looked over the related code and then I think we need more logics to handle 
> numeric arrays;
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-04 Thread chenliang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenliang updated SPARK-26543:
--
Description: 
For Spark SQL, when we enable adaptive execution with 'set 
spark.sql.adaptive.enabled=true', the ExchangeCoordinator is introduced to 
determine the number of post-shuffle partitions. But under certain conditions 
the coordinator does not perform very well: there are always some tasks 
retained that work with a Shuffle Read Size / Records of 0.0B/0. We could 
increase spark.sql.adaptive.shuffle.targetPostShuffleInputSize to work around 
this, but that is unreasonable, as targetPostShuffleInputSize should not be set 
too large. As shown below:

!image-2019-01-05-13-18-30-487.png!

We can have the ExchangeCoordinator filter out the useless (0B) partitions 
automatically.

  was:
For Spark SQL, when we enable adaptive execution with 'set 
spark.sql.adaptive.enabled=true', the ExchangeCoordinator is introduced to 
determine the number of post-shuffle partitions. But under certain conditions 
the coordinator does not perform very well: there are always some tasks 
retained that work with a Shuffle Read Size / Records of 0.0B/0. We could 
increase spark.sql.adaptive.shuffle.targetPostShuffleInputSize to work around 
this, but that is unreasonable, as targetPostShuffleInputSize should not be set 
too large. As shown below:

 !screenshot-1.png! 

We can have the ExchangeCoordinator filter out the useless (0B) partitions 
automatically.


> Support the coordinator to demerminte post-shuffle partitions more reasonably
> -
>
> Key: SPARK-26543
> URL: https://issues.apache.org/jira/browse/SPARK-26543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: chenliang
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: image-2019-01-05-13-18-30-487.png
>
>
> For SparkSQL ,when we open AE by 'set spark.sql.adapative.enable=true',the 
> ExchangeCoordinator will introduced to determine the number of post-shuffle 
> partitions. But in some certain conditions,the coordinator performed not very 
> well, there are always some tasks retained and they worked with Shuffle Read 
> Size / Records 0.0B/0 ,We could increase the 
> spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this,but this 
> action is unreasonable as targetPostShuffleInputSize Should not be set too 
> large. As follow:
> !image-2019-01-05-13-18-30-487.png!
> We can filter the useless partition(0B) with ExchangeCoorditinator 
> automatically



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL

2019-01-04 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-26540:
-
Description: 
This bug was reported in spark-user: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html

To reproduce this;
{code}
// Creates a table in a PostgreSQL shell
postgres=# CREATE TABLE t (v numeric[], d  numeric);
CREATE TABLE
postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
INSERT 0 1
postgres=# SELECT * FROM t;
  v  |d 
-+--
 {.222,.332} | 222.4555
(1 row)

postgres=# \d t
Table "public.t"
 Column |   Type| Modifiers 
+---+---
 v  | numeric[] | 
 d  | numeric   | 

// Then, reads it in Spark
./bin/spark-shell --jars=postgresql-42.2.4.jar -v

scala> import java.util.Properties
scala> val options = new Properties();
scala> options.setProperty("driver", "org.postgresql.Driver")
scala> options.setProperty("user", "maropu")
scala> options.setProperty("password", "")
scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
scala> pgTable.printSchema
root
 |-- v: array (nullable = true)
 ||-- element: decimal(0,0) (containsNull = true)
 |-- d: decimal(38,18) (nullable = true)

scala> pgTable.show
9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
exceeds max precision 0
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
...
{code}

I looked over the related code, and I think we need more logic to handle 
numeric arrays:
https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
 

  was:
This bug was reported in spark-user: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html

To reproduce this;
{code}
// Creates a table in a PostgreSQL shell
postgres=# CREATE TABLE t (v numeric[], d  numeric);
CREATE TABLE
postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
INSERT 0 1
postgres=# SELECT * FROM t;
  v  |d 
-+--
 {.222,.332} | 222.4555
(1 row)

postgres=# \d t
Table "public.t"
 Column |   Type| Modifiers 
+---+---
 v  | numeric[] | 
 d  | numeric   | 

// Then, reads it in Spark
./bin/spark-shell --jars=postgresql-42.2.4.jar -v

scala> import java.util.Properties
scala> val options = new Properties();
scala> options.setProperty("driver", "org.postgresql.Driver")
scala> options.setProperty("user", "maropu")
scala> options.setProperty("password", "")
scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
scala> pgTable.printSchema
root
 |-- v: array (nullable = true)
 ||-- element: decimal(0,0) (containsNull = true)
 |-- d: decimal(38,18) (nullable = true)

scala> pgTable.show
9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
exceeds max precision 0
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
...
{code}


> Requirement failed when reading numeric arrays from PostgreSQL
> --
>
> Key: SPARK-26540
> URL: https://issues.apache.org/jira/browse/SPARK-26540
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This bug was reported in spark-user: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this;
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d  numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>   v  |d 
> -+--
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
> Table "public.t"
>  Column |   Type| Modifiers 
> +---+---
>  v  | numeric[] | 
>  d  | numeric   | 
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> 

[jira] [Updated] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL

2019-01-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26540:
--
Affects Version/s: 2.0.2
   2.1.3

> Requirement failed when reading numeric arrays from PostgreSQL
> --
>
> Key: SPARK-26540
> URL: https://issues.apache.org/jira/browse/SPARK-26540
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.2, 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This bug was reported in spark-user: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this;
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d  numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>   v  |d 
> -+--
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
> Table "public.t"
>  Column |   Type| Modifiers 
> +---+---
>  v  | numeric[] | 
>  d  | numeric   | 
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  ||-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
> exceeds max precision 0
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
> ...
> {code}
> I looked over the related code and then I think we need more logics to handle 
> numeric arrays;
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-04 Thread chenliang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenliang updated SPARK-26543:
--
Description: 
For Spark SQL, when we enable adaptive execution with 'set 
spark.sql.adaptive.enabled=true', the ExchangeCoordinator is introduced to 
determine the number of post-shuffle partitions. But under certain conditions 
the coordinator does not perform very well: there are always some tasks 
retained that work with a Shuffle Read Size / Records of 0.0B/0. We could 
increase spark.sql.adaptive.shuffle.targetPostShuffleInputSize to work around 
this, but that is unreasonable, as targetPostShuffleInputSize should not be set 
too large. As shown below:

 !screenshot-1.png! 

We can have the ExchangeCoordinator filter out the useless (0B) partitions 
automatically.

  was:
For Spark SQL, when we enable adaptive execution with 'set 
spark.sql.adaptive.enabled=true', the ExchangeCoordinator is introduced to 
determine the number of post-shuffle partitions. But under certain conditions 
the coordinator does not perform very well: there are always some tasks 
retained that work with a Shuffle Read Size / Records of 0.0B/0. We could 
increase spark.sql.adaptive.shuffle.targetPostShuffleInputSize to work around 
this, but that is unreasonable, as targetPostShuffleInputSize should not be set 
too large. As shown below:

{noformat}
 !screenshot-1.png! 
{noformat}

We can have the ExchangeCoordinator filter out the useless (0B) partitions 
automatically.


> Support the coordinator to demerminte post-shuffle partitions more reasonably
> -
>
> Key: SPARK-26543
> URL: https://issues.apache.org/jira/browse/SPARK-26543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: chenliang
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: screenshot-1.png
>
>
> For SparkSQL ,when we open AE by 'set spark.sql.adapative.enable=true',the 
> ExchangeCoordinator will introduced to determine the number of post-shuffle 
> partitions. But in some certain conditions,the coordinator performed not very 
> well, there are always some tasks retained and they worked with Shuffle Read 
> Size / Records 0.0B/0 ,We could increase the 
> spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this,but this 
> action is unreasonable as targetPostShuffleInputSize Should not be set too 
> large. As follow:
>  !screenshot-1.png! 
> We can filter the useless partition(0B) with ExchangeCoorditinator 
> automatically



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-04 Thread chenliang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenliang updated SPARK-26543:
--
Attachment: image-2019-01-05-13-18-30-487.png

> Support the coordinator to demerminte post-shuffle partitions more reasonably
> -
>
> Key: SPARK-26543
> URL: https://issues.apache.org/jira/browse/SPARK-26543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: chenliang
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: image-2019-01-05-13-18-30-487.png
>
>
> For SparkSQL ,when we open AE by 'set spark.sql.adapative.enable=true',the 
> ExchangeCoordinator will introduced to determine the number of post-shuffle 
> partitions. But in some certain conditions,the coordinator performed not very 
> well, there are always some tasks retained and they worked with Shuffle Read 
> Size / Records 0.0B/0 ,We could increase the 
> spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this,but this 
> action is unreasonable as targetPostShuffleInputSize Should not be set too 
> large. As follow:
>  !screenshot-1.png! 
> We can filter the useless partition(0B) with ExchangeCoorditinator 
> automatically



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-04 Thread chenliang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenliang updated SPARK-26543:
--
Attachment: screenshot-1.png

> Support the coordinator to demerminte post-shuffle partitions more reasonably
> -
>
> Key: SPARK-26543
> URL: https://issues.apache.org/jira/browse/SPARK-26543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: chenliang
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: screenshot-1.png
>
>
> For SparkSQL ,when we open AE by 'set spark.sql.adapative.enable=true',the 
> ExchangeCoordinator will introduced to determine the number of post-shuffle 
> partitions. But in some certain conditions,the coordinator performed not very 
> well, there are always some tasks retained and they worked with Shuffle Read 
> Size / Records 0.0B/0 ,We could increase the 
> spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this,but this 
> action is unreasonable as targetPostShuffleInputSize Should not be set too 
> large. As follow:
> !15_24_38__12_27_2018.jpg!
> We can filter the useless partition(0B) with ExchangeCoorditinator 
> automatically



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-04 Thread chenliang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenliang updated SPARK-26543:
--
Attachment: (was: screenshot-1.png)

> Support the coordinator to demerminte post-shuffle partitions more reasonably
> -
>
> Key: SPARK-26543
> URL: https://issues.apache.org/jira/browse/SPARK-26543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: chenliang
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: image-2019-01-05-13-18-30-487.png
>
>
> For SparkSQL ,when we open AE by 'set spark.sql.adapative.enable=true',the 
> ExchangeCoordinator will introduced to determine the number of post-shuffle 
> partitions. But in some certain conditions,the coordinator performed not very 
> well, there are always some tasks retained and they worked with Shuffle Read 
> Size / Records 0.0B/0 ,We could increase the 
> spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this,but this 
> action is unreasonable as targetPostShuffleInputSize Should not be set too 
> large. As follow:
>  !screenshot-1.png! 
> We can filter the useless partition(0B) with ExchangeCoorditinator 
> automatically



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL

2019-01-04 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734719#comment-16734719
 ] 

Dongjoon Hyun commented on SPARK-26540:
---

Given that Apache Spark handles the decimal type with precision and scale, the 
reported case looks like a corner case. And I believe this will be a minor 
improvement issue to cover that corner case.

> Requirement failed when reading numeric arrays from PostgreSQL
> --
>
> Key: SPARK-26540
> URL: https://issues.apache.org/jira/browse/SPARK-26540
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This bug was reported in spark-user: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this;
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d  numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>   v  |d 
> -+--
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
> Table "public.t"
>  Column |   Type| Modifiers 
> +---+---
>  v  | numeric[] | 
>  d  | numeric   | 
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  ||-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
> exceeds max precision 0
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
> ...
> {code}
> I looked over the related code and then I think we need more logics to handle 
> numeric arrays;
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-04 Thread chenliang (JIRA)
chenliang created SPARK-26543:
-

 Summary: Support the coordinator to determine post-shuffle 
partitions more reasonably
 Key: SPARK-26543
 URL: https://issues.apache.org/jira/browse/SPARK-26543
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.2, 2.3.1, 2.3.0, 2.2.2, 2.2.1, 2.2.0
Reporter: chenliang
 Fix For: 2.3.0


For Spark SQL, when we enable adaptive execution with 'set 
spark.sql.adaptive.enabled=true', the ExchangeCoordinator is introduced to 
determine the number of post-shuffle partitions. But under certain conditions 
the coordinator does not perform very well: there are always some tasks 
retained that work with a Shuffle Read Size / Records of 0.0B/0. We could 
increase spark.sql.adaptive.shuffle.targetPostShuffleInputSize to work around 
this, but that is unreasonable, as targetPostShuffleInputSize should not be set 
too large.


We can have the ExchangeCoordinator filter out the useless (0B) partitions 
automatically.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-04 Thread chenliang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenliang updated SPARK-26543:
--
Attachment: 15_24_38__12_27_2018.jpg

> Support the coordinator to demerminte post-shuffle partitions more reasonably
> -
>
> Key: SPARK-26543
> URL: https://issues.apache.org/jira/browse/SPARK-26543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: chenliang
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: 15_24_38__12_27_2018.jpg
>
>
> For SparkSQL ,when we  open AE by 'set spark.sql.adapative.enable=true',the 
> ExchangeCoordinator will  introduced to determine the number of post-shuffle 
> partitions. But in some certain conditions,the coordinator performed not very 
> well, there are always some tasks retained and they worked with Shuffle Read 
> Size / Records  0.0B/0 ,We could increase the 
> spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this,but  this 
> action is  unreasonable as targetPostShuffleInputSize Should not be set too 
> large. 
> We can filter the useless partition(0B) with ExchangeCoorditinator 
> automatically 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26078) WHERE .. IN fails to filter rows when used in combination with UNION

2019-01-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26078:
--
Fix Version/s: 2.4.1

> WHERE .. IN fails to filter rows when used in combination with UNION
> 
>
> Key: SPARK-26078
> URL: https://issues.apache.org/jira/browse/SPARK-26078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Arttu Voutilainen
>Assignee: Marco Gaido
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.1, 3.0.0
>
>
> Hey,
> We encountered a case where Spark SQL does not seem to handle WHERE .. IN 
> correctly when used in combination with UNION, and instead also returns rows 
> that do not fulfill the condition. Swapping the order of the datasets in the 
> UNION makes the problem go away. Repro below:
>  
> {code}
> sql = SQLContext(sc)
> a = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
> b = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
> a.registerTempTable('a')
> b.registerTempTable('b')
> bug = sql.sql("""
> SELECT id,num,source FROM
> (
> SELECT id, num, 'a' as source FROM a
> UNION ALL
> SELECT id, num, 'b' as source FROM b
> ) AS c
> WHERE c.id IN (SELECT id FROM b WHERE num = 2)
> """)
> no_bug = sql.sql("""
> SELECT id,num,source FROM
> (
> SELECT id, num, 'b' as source FROM b
> UNION ALL
> SELECT id, num, 'a' as source FROM a
> ) AS c
> WHERE c.id IN (SELECT id FROM b WHERE num = 2)
> """)
> bug.show()
> no_bug.show()
> bug.explain(True)
> no_bug.explain(True)
> {code}
> This results in one extra row in the "bug" DF coming from DF "b", which should 
> not be there, as its id does not satisfy the WHERE .. IN condition:
> {code:java}
> >>> bug.show()
> +---+---+--+
> | id|num|source|
> +---+---+--+
> |  a|  2| a|
> |  a|  2| b|
> |  b|  1| b|
> +---+---+--+
> >>> no_bug.show()
> +---+---+--+
> | id|num|source|
> +---+---+--+
> |  a|  2| b|
> |  a|  2| a|
> +---+---+--+
> {code}
>  The reason can be seen in the query plans:
> {code:java}
> >>> bug.explain(True)
> ...
> == Optimized Logical Plan ==
> Union
> :- Project [id#0, num#1L, a AS source#136]
> :  +- Join LeftSemi, (id#0 = id#4)
> : :- LogicalRDD [id#0, num#1L], false
> : +- Project [id#4]
> :+- Filter (isnotnull(num#5L) && (num#5L = 2))
> :   +- LogicalRDD [id#4, num#5L], false
> +- Join LeftSemi, (id#4#172 = id#4#172)
>:- Project [id#4, num#5L, b AS source#137]
>:  +- LogicalRDD [id#4, num#5L], false
>+- Project [id#4 AS id#4#172]
>   +- Filter (isnotnull(num#5L) && (num#5L = 2))
>  +- LogicalRDD [id#4, num#5L], false
> {code}
> Note the line *+- Join LeftSemi, (id#4#172 = id#4#172)* - this condition 
> seems wrong, and I believe it causes the LeftSemi to return true for all rows 
> in the left-hand-side table, thus failing to filter as the WHERE .. IN 
> should. Compare with the non-buggy version, where both LeftSemi joins have 
> distinct #-things on both sides:
> {code:java}
> >>> no_bug.explain()
> ...
> == Optimized Logical Plan ==
> Union
> :- Project [id#4, num#5L, b AS source#142]
> :  +- Join LeftSemi, (id#4 = id#4#173)
> : :- LogicalRDD [id#4, num#5L], false
> : +- Project [id#4 AS id#4#173]
> :+- Filter (isnotnull(num#5L) && (num#5L = 2))
> :   +- LogicalRDD [id#4, num#5L], false
> +- Project [id#0, num#1L, a AS source#143]
>+- Join LeftSemi, (id#0 = id#4#173)
>   :- LogicalRDD [id#0, num#1L], false
>   +- Project [id#4 AS id#4#173]
>  +- Filter (isnotnull(num#5L) && (num#5L = 2))
> +- LogicalRDD [id#4, num#5L], false
> {code}
>  
> Best,
> -Arttu 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26511) java.lang.ClassCastException error when loading Spark MLlib model from parquet file

2019-01-04 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734009#comment-16734009
 ] 

Hyukjin Kwon commented on SPARK-26511:
--

BTW, please make sure such facts are described in the JIRA up front, to avoid 
this kind of overhead.

> java.lang.ClassCastException error when loading Spark MLlib model from 
> parquet file
> ---
>
> Key: SPARK-26511
> URL: https://issues.apache.org/jira/browse/SPARK-26511
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.0
>Reporter: Amy Koh
>Priority: Major
> Attachments: repro.zip
>
>
> When I tried to load a decision tree model from a parquet file, the following 
> error is thrown. 
> {code:bash}
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.mllib.tree.model.DecisionTreeModel.load. : 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 
> (TID 2, localhost, executor driver): java.lang.ClassCastException: class 
> java.lang.Double cannot be cast to class java.lang.Integer (java.lang.Double 
> and java.lang.Integer are in module java.base of loader 'bootstrap') at 
> scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101) at 
> org.apache.spark.sql.Row$class.getInt(Row.scala:223) at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(rows.scala:165) 
> at 
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$SplitData$.apply(DecisionTreeModel.scala:171)
>  at 
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$NodeData$.apply(DecisionTreeModel.scala:195)
>  at 
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$9.apply(DecisionTreeModel.scala:247)
>  at 
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$9.apply(DecisionTreeModel.scala:247)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
>  at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  at java.base/java.lang.Thread.run(Thread.java:834) Driver stacktrace: at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) 
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>  at scala.Option.foreach(Option.scala:257) at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2087) at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at 
> org.apache.spark.rdd.RDD.collect(RDD.scala:935) at 
> 

[jira] [Updated] (SPARK-25919) Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25919:
-
Priority: Major  (was: Blocker)

> Date value corrupts when tables are "ParquetHiveSerDe" formatted and target 
> table is Partitioned
> 
>
> Key: SPARK-25919
> URL: https://issues.apache.org/jira/browse/SPARK-25919
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 2.1.0, 2.2.1
>Reporter: Pawan
>Priority: Major
>
> Hi
> I found a really strange issue. Below are the steps to reproduce it. This 
> issue occurs only when the table row format is ParquetHiveSerDe and the 
> target table is Partitioned
> *Hive:*
> Login in to hive terminal on cluster and create below tables.
> {code:java}
> create table t_src(
> name varchar(10),
> dob timestamp
> )
> ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> STORED AS INPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
> OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> create table t_tgt(
> name varchar(10),
> dob timestamp
> )
> PARTITIONED BY (city varchar(10))
> ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> STORED AS INPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
> OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
> {code}
> Insert data into the source table (t_src)
> {code:java}
> INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 
> 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 
> 00:00:00.0');{code}
> *Spark-shell:*
> Get on to spark-shell. 
> Execute below commands on spark shell:
> {code:java}
> import org.apache.spark.sql.hive.HiveContext
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> val q0 = "TRUNCATE table t_tgt"
> val q1 = "SELECT CAST(alias.name AS STRING) as a0, alias.dob as a1 FROM 
> DEFAULT.t_src alias"
> val q2 = "INSERT INTO TABLE DEFAULT.t_tgt PARTITION (city) SELECT tbl0.a0 as 
> c0, tbl0.a1 as c1, NULL as c2 FROM tbl0"
> sqlContext.sql(q0)
> sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0")
> sqlContext.sql(q2)
> {code}
>  After this check the contents of target table t_tgt. You will see the date 
> "0001-01-01 00:00:00" changed to "0002-01-01 00:00:00". Below snippets shows 
> the contents of both the tables:
> {code:java}
> select * from t_src;
> +------------+------------------------+
> | t_src.name | t_src.dob              |
> +------------+------------------------+
> | p1         | 0001-01-01 00:00:00.0  |
> | p2         | 0002-01-01 00:00:00.0  |
> | p3         | 0003-01-01 00:00:00.0  |
> | p4         | 0004-01-01 00:00:00.0  |
> +------------+------------------------+
>
> select * from t_tgt;
> +------------+------------------------+-------------+
> | t_src.name | t_src.dob              | t_tgt.city  |
> +------------+------------------------+-------------+
> | p1         | 0002-01-01 00:00:00.0  | __HIVE_DEF  |
> | p2         | 0002-01-01 00:00:00.0  | __HIVE_DEF  |
> | p3         | 0003-01-01 00:00:00.0  | __HIVE_DEF  |
> | p4         | 0004-01-01 00:00:00.0  | __HIVE_DEF  |
> +------------+------------------------+-------------+
> {code}
>  
> Is this a known issue? Is it fixed in any subsequent releases?
> Thanks & regards,
> Pawan Lawale



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25873) Date corruption when Spark and Hive both are on different timezones

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25873.
--
Resolution: Fixed

> Date corruption when Spark and Hive both are on different timezones
> ---
>
> Key: SPARK-25873
> URL: https://issues.apache.org/jira/browse/SPARK-25873
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 2.2.1
>Reporter: Pawan
>Priority: Major
>
> Dates are altered when loading data from one table to another in Hive 
> through Spark. This happens when Hive is on a remote machine whose timezone 
> differs from the one Spark is running in, and only when the source table 
> format is 
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'.
> Below are the steps to produce the issue:
> 1. Create two tables as below in hive which has a timezone, say in, EST
> {code}
>  CREATE TABLE t_src(
>  name varchar(10),
>  dob timestamp
>  )
>  ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
>  STORED AS INPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
>  OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> {code}
> {code}
> INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 
> 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 00:00:00.0');
> {code}
>  
> {code}
>  CREATE TABLE t_tgt(
>  name varchar(10),
>  dob timestamp
>  );
> {code}
> 2. Copy {{hive-site.xml}} to {{spark-2.2.1-bin-hadoop2.7/conf}} folder, so 
> that when you create {{sqlContext}} for hive it connects to your remote hive 
> server.
> 3. Start your spark-shell on some other machine whose timezone is different 
> than that of Hive, say, PDT
> 4. Execute below code:
> {code}
> import org.apache.spark.sql.hive.HiveContext
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> val q0 = "TRUNCATE table t_tgt"
> val q1 = "SELECT CAST(alias.name AS String) as a0, alias.dob as a1 FROM t_src 
> alias"
> val q2 = "INSERT OVERWRITE TABLE t_tgt SELECT tbl0.a0 as c0, tbl0.a1 as c1 
> FROM tbl0"
> sqlContext.sql(q0)
> sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0")
> sqlContext.sql(q2)
> {code}
> 5. Now navigate to hive and check the contents of the {{TARGET table 
> (t_tgt)}}. The dob field will have incorrect values.
>  
> Is this a known issue? Is there any work around on this? Can it be fixed?
>  
> Thanks & regards,
> Pawan Lawale



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19217) Offer easy cast from vector to array

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19217.
--
Resolution: Later

Let's track this in another JIRA. It's been inactive for a long time. At least 
the workaround is easy, and natively supporting this in cast looks like quite a 
big challenge.
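
For the record, a minimal sketch of the UDF workaround mentioned above (the 
DataFrame `df` and its vector column `features` are assumed, not part of the 
original report):

{code:scala}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Unpack the ML Vector into an Array[Double] so the column becomes a plain
// array<double> that storage formats such as ORC can handle.
val vectorToArray = udf { v: Vector => v.toArray }
// `df` is an assumed DataFrame with an ML vector column named `features`.
val withArray = df.withColumn("features", vectorToArray(col("features")))
{code}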

> Offer easy cast from vector to array
> 
>
> Key: SPARK-19217
> URL: https://issues.apache.org/jira/browse/SPARK-19217
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Working with ML often means working with DataFrames with vector columns. You 
> can't save these DataFrames to storage (edit: at least as ORC) without 
> converting the vector columns to array columns, and there doesn't appear to 
> be an easy way to make that conversion.
> This is a common enough problem that it is [documented on Stack 
> Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions 
> to making the conversion from a vector column to an array column are:
> # Convert the DataFrame to an RDD and back
> # Use a UDF
> Both approaches work fine, but it really seems like you should be able to do 
> something like this instead:
> {code}
> (le_data
> .select(
> col('features').cast('array').alias('features')
> ))
> {code}
> We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears 
> that {{cast()}} doesn't support this conversion.
> Would this be an appropriate thing to add?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26126) Put scala-library deps into root pom instead of spark-tags module

2019-01-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26126:


Assignee: Apache Spark

> Put scala-library deps into root pom instead of spark-tags module
> -
>
> Key: SPARK-26126
> URL: https://issues.apache.org/jira/browse/SPARK-26126
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.3.0, 2.4.0
>Reporter: liupengcheng
>Assignee: Apache Spark
>Priority: Minor
>
> While doing some backporting in our custom Spark, I noticed some strange code 
> in the spark-tags module:
> {code:java}
> <dependencies>
>   <dependency>
>     <groupId>org.scala-lang</groupId>
>     <artifactId>scala-library</artifactId>
>     <version>${scala.version}</version>
>   </dependency>
> </dependencies>
> {code}
> As far as I know, spark-tags should only contain some annotation-related 
> classes and deps, shouldn't it?
> Should we put the scala-library dependency in the root pom instead?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26334) NullPointerException in CallMethodViaReflection when we apply reflect function for empty field

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26334:
-
Target Version/s:   (was: 2.2.0, 2.3.0)

> NullPointerException  in CallMethodViaReflection when we apply reflect 
> function for empty field
> ---
>
> Key: SPARK-26334
> URL: https://issues.apache.org/jira/browse/SPARK-26334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: chenliang
>Priority: Major
> Attachments: SPARK-26334.patch, screenshot-1.png
>
>
> In the table shown below:
> {code:sql}
> CREATE EXTERNAL TABLE `test_db4`(`a` string, `b` string, `url` string) 
> PARTITIONED BY (`dt` string);
> {code}
>  !screenshot-1.png! 
>   For field `url`, some values are initialized to NULL.
>   When we apply the reflect function to `url`, it leads to a
> NullPointerException as follows:
> {code:scala}
> select reflect('java.net.URLDecoder', 'decode', url ,'utf-8') from 
> mydemo.test_db4 where dt=20180920;
> {code}
> {panel:title=NPE}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 
> (TID 17, bigdata-nmg-hdfstest12.nmg01.diditaxi.com, executor 1): 
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection.eval(CallMethodViaReflection.scala:95)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at java.net.URLDecoder.decode(URLDecoder.java:136)
> ... 21 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
> at 

[jira] [Commented] (SPARK-26334) NullPointerException in CallMethodViaReflection when we apply reflect function for empty field

2019-01-04 Thread chenliang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734142#comment-16734142
 ] 

chenliang commented on SPARK-26334:
---

[~hyukjin.kwon] Thanks for your attention. I do know that the method itself 
throws an NPE, but Hive SQL handles the exception gracefully; maybe we could make 
Spark SQL compatible for this case.

!21_47_01__01_04_2019.jpg!

 

 

> NullPointerException  in CallMethodViaReflection when we apply reflect 
> function for empty field
> ---
>
> Key: SPARK-26334
> URL: https://issues.apache.org/jira/browse/SPARK-26334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: chenliang
>Priority: Major
> Attachments: 21_47_01__01_04_2019.jpg, SPARK-26334.patch, 
> screenshot-1.png
>
>
> In the table shown below:
> {code:sql}
> CREATE EXTERNAL TABLE `test_db4`(`a` string, `b` string, `url` string) 
> PARTITIONED BY (`dt` string);
> {code}
>  !screenshot-1.png! 
>   For field `url`, some values are initialized to NULL.
>   When we apply the reflect function to `url`, it leads to a
> NullPointerException as follows:
> {code:scala}
> select reflect('java.net.URLDecoder', 'decode', url ,'utf-8') from 
> mydemo.test_db4 where dt=20180920;
> {code}
> {panel:title=NPE}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 
> (TID 17, bigdata-nmg-hdfstest12.nmg01.diditaxi.com, executor 1): 
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection.eval(CallMethodViaReflection.scala:95)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at java.net.URLDecoder.decode(URLDecoder.java:136)
> ... 21 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
> at 
> 

[jira] [Updated] (SPARK-26512) Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26512:
-
Summary: Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10  (was: 
Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10?)

> Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10
> --
>
> Key: SPARK-26512
> URL: https://issues.apache.org/jira/browse/SPARK-26512
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, YARN
>Affects Versions: 2.4.0
> Environment: operating system : Windows 10
> Spark Version : 2.4.0
> Hadoop Version : 2.8.3
>Reporter: Anubhav Jain
>Priority: Minor
>  Labels: windows
> Attachments: log.png
>
>
> I have installed Hadoop version 2.8.3 in my Windows 10 environment and it is 
> working fine. Now when I try to install Apache Spark (version 2.4.0) with YARN 
> as the cluster manager, it is not working. When I try to submit a Spark job 
> using spark-submit for testing, it shows up under the ACCEPTED tab in the YARN UI 
> and after that it fails.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26254) Move delegation token providers into a separate project

2019-01-04 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734004#comment-16734004
 ] 

Hyukjin Kwon commented on SPARK-26254:
--

[~gsomogyi], can you point out which comment is about the discussion exactly?

> Move delegation token providers into a separate project
> ---
>
> Key: SPARK-26254
> URL: https://issues.apache.org/jira/browse/SPARK-26254
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> There was a discussion in 
> [PR#22598|https://github.com/apache/spark/pull/22598] that there are several 
> provided dependencies inside core project which shouldn't be there (for ex. 
> hive and kafka). This jira is to solve this problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26126) Put scala-library deps into root pom instead of spark-tags module

2019-01-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26126:


Assignee: (was: Apache Spark)

> Put scala-library deps into root pom instead of spark-tags module
> -
>
> Key: SPARK-26126
> URL: https://issues.apache.org/jira/browse/SPARK-26126
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.3.0, 2.4.0
>Reporter: liupengcheng
>Priority: Minor
>
> While doing some backports in our custom Spark, I noticed some strange code in 
> the spark-tags module:
> {code:xml}
> <dependencies>
>   <dependency>
>     <groupId>org.scala-lang</groupId>
>     <artifactId>scala-library</artifactId>
>     <version>${scala.version}</version>
>   </dependency>
> </dependencies>
> {code}
> As far as I know, spark-tags should only contain some annotation-related classes 
> or deps, shouldn't it?
> Should we put the scala-library deps into the root pom?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24661) Window API - using multiple fields for partitioning with WindowSpec API and dataset that is cached causes org.apache.spark.sql.catalyst.errors.package$TreeNodeException

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24661:
-
Component/s: (was: PySpark)

> Window API - using multiple fields for partitioning with WindowSpec API and 
> dataset that is cached causes 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException
> 
>
> Key: SPARK-24661
> URL: https://issues.apache.org/jira/browse/SPARK-24661
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Java API
>Affects Versions: 2.3.0
>Reporter: David Mavashev
>Priority: Major
>
> Steps to reproduce:
> Creating a data set:
>  
> {code:java}
> List<String> simpleWindowColumns = new ArrayList<>();
> simpleWindowColumns.add("column1");
> simpleWindowColumns.add("column2");
> Map<String, String> expressionsWithAliasesEntrySet = new HashMap<String, String>();
> expressionsWithAliasesEntrySet.put("count(id)", "count_column");
> DataFrameReader reader = sparkSession.read().format("csv");
> Dataset<Row> sparkDataSet = reader.option("header", 
> "true").load("/path/to/data/data.csv");
> // Invoking cache:
> sparkDataSet = sparkDataSet.cache();
> //Creating window spec with 2 columns:
> WindowSpec window = 
> Window.partitionBy(JavaConverters.asScalaIteratorConverter(simpleWindowColumns.stream().map(item->sparkDataSet.col(item)).iterator()).asScala().toSeq());
> sparkDataSet = 
> sparkDataSet.withColumns(JavaConverters.asScalaIteratorConverter(expressionsWithAliasesEntrySet.stream().map(item->item.getKey()).collect(Collectors.toList()).iterator()).asScala().toSeq(),
>   
> JavaConverters.asScalaIteratorConverter(expressionsWithAliasesEntrySet.stream().map(item->new
>  
> Column(item.getValue()).over(finalWindow)).collect(Collectors.toList()).iterator()).asScala().toSeq());
> sparkDataSet.show();{code}
> Expected:
>  
> Results are shown
>  
>  
> Actual: the following exception is thrown
> {code:java}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree: windowspecdefinition(O003#3, O006#6, specifiedwindowframe(RowFrame, 
> unboundedpreceding$(), unboundedfollowing$())) at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at 
> org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385) at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:244)
>  at 
> org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:190)
>  at 
> org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:188)
>  at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:189)
>  at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:189)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at 
> scala.collection.immutable.List.map(List.scala:285) at 
> org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:189)
>  at 
> org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:188)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$.normalizeExprId(QueryPlan.scala:288)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$doCanonicalize$1.apply(QueryPlan.scala:232)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$doCanonicalize$1.apply(QueryPlan.scala:226)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:120)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at 
> scala.collection.AbstractTraversable.map(Traversable.scala:104) at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:120)
>  at 
> 

[jira] [Commented] (SPARK-26512) Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10

2019-01-04 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734072#comment-16734072
 ] 

Saisai Shao commented on SPARK-26512:
-

This seems like a Netty version problem; netty-3.9.9.Final.jar is unrelated. I 
was thinking that if we could put the Spark classpath in front of the Hadoop 
classpath, this might work. There is such a configuration for the driver/executor, 
but I'm not sure if there is a similar one for the AM only.
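
For reference, a hedged sketch of one possible pair of driver/executor-side 
settings (both are documented but experimental Spark options; whether they are 
the ones meant here, and whether they help in this case, is an assumption, and 
neither touches the AM classpath):

{code:scala}
import org.apache.spark.sql.SparkSession

// Prefer user-supplied classes over the cluster-provided ones on the driver
// and executors; shown only as an illustration, not a confirmed fix.
val spark = SparkSession.builder()
  .config("spark.driver.userClassPathFirst", "true")
  .config("spark.executor.userClassPathFirst", "true")
  .getOrCreate()
{code}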

> Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10
> --
>
> Key: SPARK-26512
> URL: https://issues.apache.org/jira/browse/SPARK-26512
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, YARN
>Affects Versions: 2.4.0
> Environment: operating system : Windows 10
> Spark Version : 2.4.0
> Hadoop Version : 2.8.3
>Reporter: Anubhav Jain
>Priority: Minor
>  Labels: windows
> Attachments: log.png
>
>
> I have installed Hadoop version 2.8.3 in my Windows 10 environment and it is 
> working fine. Now when I try to install Apache Spark (version 2.4.0) with YARN 
> as the cluster manager, it is not working. When I try to submit a Spark job 
> using spark-submit for testing, it shows up under the ACCEPTED tab in the YARN UI 
> and after that it fails.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26334) NullPointerException in CallMethodViaReflection when we apply reflect function for empty field

2019-01-04 Thread chenliang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenliang updated SPARK-26334:
--
Attachment: 1537537286575.jpg

> NullPointerException  in CallMethodViaReflection when we apply reflect 
> function for empty field
> ---
>
> Key: SPARK-26334
> URL: https://issues.apache.org/jira/browse/SPARK-26334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: chenliang
>Priority: Major
> Attachments: 1537537286575.jpg, SPARK-26334.patch, screenshot-1.png
>
>
> In the table shown below:
> {code:sql}
> CREATE EXTERNAL TABLE `test_db4`(`a` string, `b` string, `url` string) 
> PARTITIONED BY (`dt` string);
> {code}
>  !screenshot-1.png! 
>   For field `url`, some values are initialized to NULL.
>   When we apply the reflect function to `url`, it leads to a
> NullPointerException as follows:
> {code:scala}
> select reflect('java.net.URLDecoder', 'decode', url ,'utf-8') from 
> mydemo.test_db4 where dt=20180920;
> {code}
> {panel:title=NPE}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 
> (TID 17, bigdata-nmg-hdfstest12.nmg01.diditaxi.com, executor 1): 
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection.eval(CallMethodViaReflection.scala:95)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at java.net.URLDecoder.decode(URLDecoder.java:136)
> ... 21 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
> at 

[jira] [Commented] (SPARK-26334) NullPointerException in CallMethodViaReflection when we apply reflect function for empty field

2019-01-04 Thread chenliang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734138#comment-16734138
 ] 

chenliang commented on SPARK-26334:
---

[~hyukjin.kwon] Thank you. I do know that the method itself throws an NPE, but 
Hive SQL handles the exception gracefully; maybe we could make Spark SQL 
compatible for this case.

!1537537286575.jpg!

> NullPointerException  in CallMethodViaReflection when we apply reflect 
> function for empty field
> ---
>
> Key: SPARK-26334
> URL: https://issues.apache.org/jira/browse/SPARK-26334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: chenliang
>Priority: Major
> Attachments: 1537537286575.jpg, SPARK-26334.patch, screenshot-1.png
>
>
> In the table shown below:
> {code:sql}
> CREATE EXTERNAL TABLE `test_db4`(`a` string, `b` string, `url` string) 
> PARTITIONED BY (`dt` string);
> {code}
>  !screenshot-1.png! 
>   For field `url`, some values are initialized to NULL.
>   When we apply the reflect function to `url`, it leads to a
> NullPointerException as follows:
> {code:scala}
> select reflect('java.net.URLDecoder', 'decode', url ,'utf-8') from 
> mydemo.test_db4 where dt=20180920;
> {code}
> {panel:title=NPE}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 
> (TID 17, bigdata-nmg-hdfstest12.nmg01.diditaxi.com, executor 1): 
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection.eval(CallMethodViaReflection.scala:95)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at java.net.URLDecoder.decode(URLDecoder.java:136)
> ... 21 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
> at 
> 

[jira] [Updated] (SPARK-26529) Add debug logs for confArchive when preparing local resource

2019-01-04 Thread liupengcheng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liupengcheng updated SPARK-26529:
-
Summary: Add debug logs for confArchive when preparing local resource   
(was: Add logs for IOException when preparing local resource )

> Add debug logs for confArchive when preparing local resource 
> -
>
> Key: SPARK-26529
> URL: https://issues.apache.org/jira/browse/SPARK-26529
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: liupengcheng
>Priority: Trivial
>
> Currently, `Client#createConfArchive` does not handle IOException, and some 
> detailed info is not provided in the logs. Sometimes this may delay locating 
> the root cause of an IO error.
> A case that happened in our production environment is that a local disk was full, 
> and the following exception was thrown with no detailed path info provided; we had 
> to investigate all the local disks of the machine to find the root cause.
> {code:java}
> Exception in thread "main" java.io.IOException: No space left on device
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:345)
> at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
> at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:238)
> at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:343)
> at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:238)
> at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:360)
> at org.apache.spark.deploy.yarn.Client.createConfArchive(Client.scala:769)
> at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:657)
> at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:895)
> at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:177)
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1202)
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1261)
> at org.apache.spark.deploy.yarn.Client.main(Client.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:767)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:189)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:214)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:128)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> It makes sense for us to catch the IOException and print some useful 
> information.
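
A minimal sketch (not the actual patch) of the kind of error handling the ticket 
asks for; the method name and zipping closure below are illustrative stand-ins 
for the archive-creation step:

{code:scala}
import java.io.{File, IOException}

// Wrap the zipping step, report which local path failed, then rethrow so the
// submission still fails fast but with an actionable message.
def createConfArchiveWithLogging(confArchive: File)(doZip: File => Unit): File = {
  try {
    doZip(confArchive)
    confArchive
  } catch {
    case e: IOException =>
      System.err.println(
        s"Failed to create config archive at ${confArchive.getAbsolutePath}: ${e.getMessage}")
      throw e
  }
}
{code}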



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26334) NullPointerException in CallMethodViaReflection when we apply reflect function for empty field

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26334.
--
Resolution: Invalid

The method itself already throws an NPE:

{code}
scala> java.net.URLDecoder.decode(null, "UTF-8")
java.lang.NullPointerException
  at java.net.URLDecoder.decode(URLDecoder.java:136)
  ... 29 elided
{code}
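
A possible user-side workaround (a sketch, not part of this resolution) is to 
guard against NULL so that decode is never invoked with a null argument; the 
table and column names below are taken from the report:

{code:scala}
// NULL rows produce NULL instead of triggering the NPE inside URLDecoder.decode.
spark.sql("""
  SELECT CASE WHEN url IS NOT NULL
              THEN reflect('java.net.URLDecoder', 'decode', url, 'utf-8')
         END AS decoded_url
  FROM mydemo.test_db4
  WHERE dt = 20180920
""").show()
{code}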

> NullPointerException  in CallMethodViaReflection when we apply reflect 
> function for empty field
> ---
>
> Key: SPARK-26334
> URL: https://issues.apache.org/jira/browse/SPARK-26334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: chenliang
>Priority: Major
> Attachments: SPARK-26334.patch, screenshot-1.png
>
>
> In the table shown below:
> {code:sql}
> CREATE EXTERNAL TABLE `test_db4`(`a` string, `b` string, `url` string) 
> PARTITIONED BY (`dt` string);
> {code}
>  !screenshot-1.png! 
>   For field `url`, some values are initialized to NULL.
>   When we apply the reflect function to `url`, it leads to a
> NullPointerException as follows:
> {code:scala}
> select reflect('java.net.URLDecoder', 'decode', url ,'utf-8') from 
> mydemo.test_db4 where dt=20180920;
> {code}
> {panel:title=NPE}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 
> (TID 17, bigdata-nmg-hdfstest12.nmg01.diditaxi.com, executor 1): 
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection.eval(CallMethodViaReflection.scala:95)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at java.net.URLDecoder.decode(URLDecoder.java:136)
> ... 21 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
> at 
> 

[jira] [Issue Comment Deleted] (SPARK-26334) NullPointerException in CallMethodViaReflection when we apply reflect function for empty field

2019-01-04 Thread chenliang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenliang updated SPARK-26334:
--
Comment: was deleted

(was: [~hyukjin.kwon] Thank you. I do know that the method itself throws an NPE, 
but Hive SQL handles the exception gracefully; maybe we could make Spark SQL 
compatible for this case.

!1537537286575.jpg!)

> NullPointerException  in CallMethodViaReflection when we apply reflect 
> function for empty field
> ---
>
> Key: SPARK-26334
> URL: https://issues.apache.org/jira/browse/SPARK-26334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: chenliang
>Priority: Major
> Attachments: 1537537286575.jpg, SPARK-26334.patch, screenshot-1.png
>
>
> In the table shown below:
> {code:sql}
> CREATE EXTERNAL TABLE `test_db4`(`a` string, `b` string, `url` string) 
> PARTITIONED BY (`dt` string);
> {code}
>  !screenshot-1.png! 
>   For field `url`, some values are initialized to NULL.
>   When we apply the reflect function to `url`, it leads to a
> NullPointerException as follows:
> {code:scala}
> select reflect('java.net.URLDecoder', 'decode', url ,'utf-8') from 
> mydemo.test_db4 where dt=20180920;
> {code}
> {panel:title=NPE}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 
> (TID 17, bigdata-nmg-hdfstest12.nmg01.diditaxi.com, executor 1): 
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection.eval(CallMethodViaReflection.scala:95)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at java.net.URLDecoder.decode(URLDecoder.java:136)
> ... 21 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
> at 
> 

[jira] [Updated] (SPARK-26334) NullPointerException in CallMethodViaReflection when we apply reflect function for empty field

2019-01-04 Thread chenliang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenliang updated SPARK-26334:
--
Attachment: (was: 1537537286575.jpg)

> NullPointerException  in CallMethodViaReflection when we apply reflect 
> function for empty field
> ---
>
> Key: SPARK-26334
> URL: https://issues.apache.org/jira/browse/SPARK-26334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: chenliang
>Priority: Major
> Attachments: 21_47_01__01_04_2019.jpg, SPARK-26334.patch, 
> screenshot-1.png
>
>
> In the table shown below:
> {code:sql}
> CREATE EXTERNAL TABLE `test_db4`(`a` string, `b` string, `url` string) 
> PARTITIONED BY (`dt` string);
> {code}
>  !screenshot-1.png! 
>   For field `url`, some values are initialized to NULL.
>   When we apply the reflect function to `url`, it leads to a
> NullPointerException as follows:
> {code:scala}
> select reflect('java.net.URLDecoder', 'decode', url ,'utf-8') from 
> mydemo.test_db4 where dt=20180920;
> {code}
> {panel:title=NPE}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 
> (TID 17, bigdata-nmg-hdfstest12.nmg01.diditaxi.com, executor 1): 
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection.eval(CallMethodViaReflection.scala:95)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at java.net.URLDecoder.decode(URLDecoder.java:136)
> ... 21 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
> at 

[jira] [Updated] (SPARK-26334) NullPointerException in CallMethodViaReflection when we apply reflect function for empty field

2019-01-04 Thread chenliang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenliang updated SPARK-26334:
--
Attachment: 21_47_01__01_04_2019.jpg

> NullPointerException  in CallMethodViaReflection when we apply reflect 
> function for empty field
> ---
>
> Key: SPARK-26334
> URL: https://issues.apache.org/jira/browse/SPARK-26334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: chenliang
>Priority: Major
> Attachments: 21_47_01__01_04_2019.jpg, SPARK-26334.patch, 
> screenshot-1.png
>
>
> In the table shown below:
> {code:sql}
> CREATE EXTERNAL TABLE `test_db4`(`a` string, `b` string, `url` string) 
> PARTITIONED BY (`dt` string);
> {code}
>  !screenshot-1.png! 
>   For field `url`, some values are initialized to NULL.
>   When we apply the reflect function to `url`, it leads to a
> NullPointerException as follows:
> {code:scala}
> select reflect('java.net.URLDecoder', 'decode', url ,'utf-8') from 
> mydemo.test_db4 where dt=20180920;
> {code}
> {panel:title=NPE}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 
> (TID 17, bigdata-nmg-hdfstest12.nmg01.diditaxi.com, executor 1): 
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection.eval(CallMethodViaReflection.scala:95)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at java.net.URLDecoder.decode(URLDecoder.java:136)
> ... 21 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
> at 

[jira] [Resolved] (SPARK-26454) While creating new UDF with JAR though UDF is created successfully, it throws IllegegalArgument Exception

2019-01-04 Thread Udbhav Agrawal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udbhav Agrawal resolved SPARK-26454.

Resolution: Invalid

Apparently I placed the wrong jar from a different version, so this issue seems invalid.

> While creating new UDF with JAR though UDF is created successfully, it throws 
> IllegegalArgument Exception
> -
>
> Key: SPARK-26454
> URL: https://issues.apache.org/jira/browse/SPARK-26454
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.2
>Reporter: Udbhav Agrawal
>Priority: Trivial
> Attachments: create_exception.txt
>
>
> 【Test step】:
>  1.launch spark-shell
>  2. set role admin;
>  3. create new function
>    CREATE FUNCTION Func AS 
> 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR 
> 'hdfs:///tmp/super_udf/two_udfs.jar'
>  4. Do select on the function
>  sql("select Func('2018-03-09')").show()
>  5.Create new UDF with same JAR
>     sql("CREATE FUNCTION newFunc AS 
> 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR 
> 'hdfs:///tmp/super_udf/two_udfs.jar'")
> 6. Do select on the new function created.
> sql("select newFunc ('2018-03-09')").show()
> 【Output】:
> The function gets created, but an IllegalArgumentException is thrown; the select 
> returns a result, but also with an IllegalArgumentException.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26359) Spark checkpoint restore fails after query restart

2019-01-04 Thread Kaspar Tint (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16733935#comment-16733935
 ] 

Kaspar Tint commented on SPARK-26359:
-

[~gsomogyi]: Yes, this ticket can be closed. The solution mentioned here should 
be enough to work around the issue.

> Spark checkpoint restore fails after query restart
> --
>
> Key: SPARK-26359
> URL: https://issues.apache.org/jira/browse/SPARK-26359
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 deployed in standalone-client mode
> Checkpointing is done to S3
> The Spark application in question is responsible for running 4 different 
> queries
> Queries are written using Structured Streaming
> We are using the following algorithm for hopes of better performance:
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2" # When the 
> default is 1
>Reporter: Kaspar Tint
>Priority: Major
> Attachments: driver-redacted, metadata, redacted-offsets, 
> state-redacted, worker-redacted
>
>
> We had an incident where one of our structured streaming queries could not be 
> restarted after a usual S3 checkpointing failure. To clarify up front: we are 
> aware of the issues with S3 and are working towards moving to HDFS, but this 
> will take time. S3 will cause queries to fail quite often during peak hours, 
> and we have separate logic to handle this that will attempt to restart the 
> failed queries if possible.
> In this particular case we could not restart one of the failed queries. It seems 
> that between detecting a failure in the query and starting it up again, 
> something went really wrong with Spark, and the state in the checkpoint folder 
> got corrupted for some reason.
> The issue starts with the usual *FileNotFoundException* that happens with S3
> {code:java}
> 2018-12-10 21:09:25.785 ERROR MicroBatchExecution: Query feedback [id = 
> c074233a-2563-40fc-8036-b5e38e2e2c42, runId = 
> e607eb6e-8431-4269-acab-cc2c1f9f09dd]
> terminated with error
> java.io.FileNotFoundException: No such file or directory: 
> s3a://some.domain/spark/checkpoints/49/feedback/offsets/.37
> 348.8227943f-a848-4af5-b5bf-1fea81775b24.tmp
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2255)
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149)
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088)
> at 
> org.apache.hadoop.fs.FileSystem.getFileLinkStatus(FileSystem.java:2715)
> at 
> org.apache.hadoop.fs.DelegateToFileSystem.getFileLinkStatus(DelegateToFileSystem.java:131)
> at 
> org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:726)
> at 
> org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:699)
> at org.apache.hadoop.fs.FileContext.rename(FileContext.java:965)
> at 
> org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.renameTempFile(CheckpointFileManager.scala:331)
> at 
> org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:147)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataL
> og.scala:126)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply$mcV$sp(MicroBatchExecution.scala:382)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply(MicroBatchExecution.scala:381)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply(MicroBatchExecution.scala:381)
> at 
> 

[jira] [Updated] (SPARK-26529) Add logs for IOException when preparing local resource

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26529:
-
Priority: Trivial  (was: Major)

> Add logs for IOException when preparing local resource 
> ---
>
> Key: SPARK-26529
> URL: https://issues.apache.org/jira/browse/SPARK-26529
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: liupengcheng
>Priority: Trivial
>
> Currently, `Client#createConfArchive` does not handle IOException, and no 
> detailed information is provided in the logs. This can delay locating the 
> root cause of an I/O error.
> A case that happened in our production environment: a local disk was full and 
> the following exception was thrown, but no path information was provided. We 
> had to investigate every local disk on the machine to find the root cause.
> {code:java}
> Exception in thread "main" java.io.IOException: No space left on device
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:345)
> at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
> at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:238)
> at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:343)
> at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:238)
> at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:360)
> at org.apache.spark.deploy.yarn.Client.createConfArchive(Client.scala:769)
> at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:657)
> at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:895)
> at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:177)
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1202)
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1261)
> at org.apache.spark.deploy.yarn.Client.main(Client.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:767)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:189)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:214)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:128)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> It makes sense for us to catch the IOException and print some useful 
> information.
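
For illustration, here is a minimal sketch of the kind of logging this ticket 
asks for; the method and variable names below are made up and do not reflect 
the actual Client.scala code:
{code:scala}
import java.io.{File, FileOutputStream, IOException}
import java.util.zip.ZipOutputStream

// Sketch only: wrap the archive creation so an IOException is re-thrown with
// the path that failed, making a full local disk obvious from the log.
def writeConfArchive(confFiles: Seq[File], archive: File): Unit = {
  try {
    val out = new ZipOutputStream(new FileOutputStream(archive))
    try {
      // ... write one zip entry per conf file ...
    } finally {
      out.close() // the "No space left on device" in the trace above surfaces here
    }
  } catch {
    case e: IOException =>
      throw new IOException(
        s"Failed to create conf archive at ${archive.getAbsolutePath}", e)
  }
}
{code}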



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26512) Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10?

2019-01-04 Thread Anubhav Jain (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16733917#comment-16733917
 ] 

Anubhav Jain commented on SPARK-26512:
--

Thanks [~jerryshao] for the reply. I have attached a picture, 'log.png', of the 
container log.

> Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10?
> ---
>
> Key: SPARK-26512
> URL: https://issues.apache.org/jira/browse/SPARK-26512
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, YARN
>Affects Versions: 2.4.0
> Environment: operating system : Windows 10
> Spark Version : 2.4.0
> Hadoop Version : 2.8.3
>Reporter: Anubhav Jain
>Priority: Minor
>  Labels: windows
> Attachments: log.png
>
>
> I have installed Hadoop version 2.8.3 in my Windows 10 environment and it is 
> working fine. Now, when I try to install Apache Spark (version 2.4.0) with 
> YARN as the cluster manager, it is not working. When I try to submit a Spark 
> job using spark-submit for testing, it shows up under the ACCEPTED tab in the 
> YARN UI and then fails.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-26359) Spark checkpoint restore fails after query restart

2019-01-04 Thread Gabor Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi closed SPARK-26359.
-

> Spark checkpoint restore fails after query restart
> --
>
> Key: SPARK-26359
> URL: https://issues.apache.org/jira/browse/SPARK-26359
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 deployed in standalone-client mode
> Checkpointing is done to S3
> The Spark application in question is responsible for running 4 different 
> queries
> Queries are written using Structured Streaming
> We are using the following algorithm in hopes of better performance:
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2" # When the 
> default is 1
>Reporter: Kaspar Tint
>Priority: Major
> Attachments: driver-redacted, metadata, redacted-offsets, 
> state-redacted, worker-redacted
>
>
> We had an incident where one of our structured streaming queries could not be 
> restarted after a usual S3 checkpointing failure. To clarify before everything 
> else: we are aware of the issues with S3 and are working towards moving to 
> HDFS, but this will take time. S3 causes queries to fail quite often during 
> peak hours, and we have separate logic to handle this that will attempt to 
> restart the failed queries if possible.
> In this particular case we could not restart one of the failed queries. It 
> seems that between detecting a failure in the query and starting it up again, 
> something went really wrong with Spark and the state in the checkpoint folder 
> got corrupted for some reason.
> The issue starts with the usual *FileNotFoundException* that happens with S3
> {code:java}
> 2018-12-10 21:09:25.785 ERROR MicroBatchExecution: Query feedback [id = 
> c074233a-2563-40fc-8036-b5e38e2e2c42, runId = 
> e607eb6e-8431-4269-acab-cc2c1f9f09dd]
> terminated with error
> java.io.FileNotFoundException: No such file or directory: 
> s3a://some.domain/spark/checkpoints/49/feedback/offsets/.37
> 348.8227943f-a848-4af5-b5bf-1fea81775b24.tmp
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2255)
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149)
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088)
> at 
> org.apache.hadoop.fs.FileSystem.getFileLinkStatus(FileSystem.java:2715)
> at 
> org.apache.hadoop.fs.DelegateToFileSystem.getFileLinkStatus(DelegateToFileSystem.java:131)
> at 
> org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:726)
> at 
> org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:699)
> at org.apache.hadoop.fs.FileContext.rename(FileContext.java:965)
> at 
> org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.renameTempFile(CheckpointFileManager.scala:331)
> at 
> org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:147)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataL
> og.scala:126)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply$mcV$sp(MicroBatchExecution.scala:382)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply(MicroBatchExecution.scala:381)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply(MicroBatchExecution.scala:381)
> at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
> at 
> 
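
For context on the path in the trace: the offsets/ directory lives under the 
query's checkpoint location. A minimal, hypothetical sketch of how that location 
is set per query (the paths are made up; the reporter's plan of moving from 
s3a:// to HDFS amounts to changing this option):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-demo").getOrCreate()

// Hypothetical paths: the point is only that "checkpointLocation" controls
// where the offsets/ directory from the stack trace lives.
val query = spark.readStream
  .format("rate")
  .load()
  .writeStream
  .format("parquet")
  .option("path", "hdfs:///tmp/checkpoint-demo/output")
  .option("checkpointLocation", "hdfs:///tmp/checkpoint-demo/checkpoints")
  .start()
{code}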

[jira] [Resolved] (SPARK-26359) Spark checkpoint restore fails after query restart

2019-01-04 Thread Gabor Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi resolved SPARK-26359.
---
Resolution: Information Provided

> Spark checkpoint restore fails after query restart
> --
>
> Key: SPARK-26359
> URL: https://issues.apache.org/jira/browse/SPARK-26359
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 deployed in standalone-client mode
> Checkpointing is done to S3
> The Spark application in question is responsible for running 4 different 
> queries
> Queries are written using Structured Streaming
> We are using the following algorithm in hopes of better performance:
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2" # When the 
> default is 1
>Reporter: Kaspar Tint
>Priority: Major
> Attachments: driver-redacted, metadata, redacted-offsets, 
> state-redacted, worker-redacted
>
>
> We had an incident where one of our structured streaming queries could not be 
> restarted after a usual S3 checkpointing failure. To clarify before everything 
> else: we are aware of the issues with S3 and are working towards moving to 
> HDFS, but this will take time. S3 causes queries to fail quite often during 
> peak hours, and we have separate logic to handle this that will attempt to 
> restart the failed queries if possible.
> In this particular case we could not restart one of the failed queries. It 
> seems that between detecting a failure in the query and starting it up again, 
> something went really wrong with Spark and the state in the checkpoint folder 
> got corrupted for some reason.
> The issue starts with the usual *FileNotFoundException* that happens with S3
> {code:java}
> 2018-12-10 21:09:25.785 ERROR MicroBatchExecution: Query feedback [id = 
> c074233a-2563-40fc-8036-b5e38e2e2c42, runId = 
> e607eb6e-8431-4269-acab-cc2c1f9f09dd]
> terminated with error
> java.io.FileNotFoundException: No such file or directory: 
> s3a://some.domain/spark/checkpoints/49/feedback/offsets/.37
> 348.8227943f-a848-4af5-b5bf-1fea81775b24.tmp
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2255)
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149)
> at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088)
> at 
> org.apache.hadoop.fs.FileSystem.getFileLinkStatus(FileSystem.java:2715)
> at 
> org.apache.hadoop.fs.DelegateToFileSystem.getFileLinkStatus(DelegateToFileSystem.java:131)
> at 
> org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:726)
> at 
> org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:699)
> at org.apache.hadoop.fs.FileContext.rename(FileContext.java:965)
> at 
> org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.renameTempFile(CheckpointFileManager.scala:331)
> at 
> org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:147)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataL
> og.scala:126)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply$mcV$sp(MicroBatchExecution.scala:382)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply(MicroBatchExecution.scala:381)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply(MicroBatchExecution.scala:381)
> at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
> at 
> 

[jira] [Commented] (SPARK-26254) Move delegation token providers into a separate project

2019-01-04 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16733956#comment-16733956
 ] 

Gabor Somogyi commented on SPARK-26254:
---

ping [~vanzin]

> Move delegation token providers into a separate project
> ---
>
> Key: SPARK-26254
> URL: https://issues.apache.org/jira/browse/SPARK-26254
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> There was a discussion in 
> [PR#22598|https://github.com/apache/spark/pull/22598] that there are several 
> provided dependencies inside the core project which shouldn't be there (for 
> example, hive and kafka). This JIRA is to solve that problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26462) Use ConfigEntry for hardcoded configs for execution categories.

2019-01-04 Thread pralabhkumar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16733988#comment-16733988
 ] 

pralabhkumar commented on SPARK-26462:
--

[~ueshin] I have created a pull request covering

spark.memory and spark.storage. It is currently WIP; please see whether it is 
headed in the right direction, and I will add the rest as well.

> Use ConfigEntry for hardcoded configs for execution categories.
> ---
>
> Key: SPARK-26462
> URL: https://issues.apache.org/jira/browse/SPARK-26462
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Make the following hardcoded configs use ConfigEntry.
> {code}
> spark.memory
> spark.storage
> spark.io
> spark.buffer
> spark.rdd
> spark.locality
> spark.broadcast
> spark.reducer
> {code}
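
For context, the ConfigEntry pattern this ticket asks to apply looks roughly 
like the sketch below. ConfigBuilder is Spark-internal (private[spark]), so the 
real entries live under org.apache.spark.internal.config; the package and 
object names here are made up for illustration only:
{code:scala}
package org.apache.spark.internal.config.demo  // hypothetical, needed for access

import org.apache.spark.internal.config.ConfigBuilder

object DemoConfig {
  // Call sites then reference the entry instead of the raw string key.
  val MEMORY_FRACTION = ConfigBuilder("spark.memory.fraction")
    .doc("Fraction of (heap space - 300MB) used for execution and storage.")
    .doubleConf
    .createWithDefault(0.6)
}
{code}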



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26126) Put scala-library deps into root pom instead of spark-tags module

2019-01-04 Thread liupengcheng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liupengcheng updated SPARK-26126:
-
Summary: Put scala-library deps into root pom instead of spark-tags module  
(was: Should put scala-library deps into root pom instead of spark-tags module)

> Put scala-library deps into root pom instead of spark-tags module
> -
>
> Key: SPARK-26126
> URL: https://issues.apache.org/jira/browse/SPARK-26126
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.3.0, 2.4.0
>Reporter: liupengcheng
>Priority: Minor
>
> While doing some backports in our custom Spark, I noticed some strange code in 
> the spark-tags module:
> {code:java}
> <dependencies>
>   <dependency>
>     <groupId>org.scala-lang</groupId>
>     <artifactId>scala-library</artifactId>
>     <version>${scala.version}</version>
>   </dependency>
> </dependencies>
> {code}
> As I understand it, shouldn't spark-tags only contain annotation-related 
> classes and dependencies?
> Should we put the scala-library dependency into the root pom instead?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26462) Use ConfigEntry for hardcoded configs for execution categories.

2019-01-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26462:


Assignee: (was: Apache Spark)

> Use ConfigEntry for hardcoded configs for execution categories.
> ---
>
> Key: SPARK-26462
> URL: https://issues.apache.org/jira/browse/SPARK-26462
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Make the following hardcoded configs use ConfigEntry.
> {code}
> spark.memory
> spark.storage
> spark.io
> spark.buffer
> spark.rdd
> spark.locality
> spark.broadcast
> spark.reducer
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26462) Use ConfigEntry for hardcoded configs for execution categories.

2019-01-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26462:


Assignee: Apache Spark

> Use ConfigEntry for hardcoded configs for execution categories.
> ---
>
> Key: SPARK-26462
> URL: https://issues.apache.org/jira/browse/SPARK-26462
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> Make the following hardcoded configs use ConfigEntry.
> {code}
> spark.memory
> spark.storage
> spark.io
> spark.buffer
> spark.rdd
> spark.locality
> spark.broadcast
> spark.reducer
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26445) Use ConfigEntry for hardcoded configs for driver/executor categories.

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26445.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23415
[https://github.com/apache/spark/pull/23415]

> Use ConfigEntry for hardcoded configs for driver/executor categories.
> -
>
> Key: SPARK-26445
> URL: https://issues.apache.org/jira/browse/SPARK-26445
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.0.0
>
>
> Make the hardcoded "spark.driver" and "spark.executor" configs use 
> {{ConfigEntry}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26334) NullPointerException in CallMethodViaReflection when we apply reflect function for empty field

2019-01-04 Thread chenliang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734182#comment-16734182
 ] 

chenliang commented on SPARK-26334:
---

Anyway, thanks.

> NullPointerException  in CallMethodViaReflection when we apply reflect 
> function for empty field
> ---
>
> Key: SPARK-26334
> URL: https://issues.apache.org/jira/browse/SPARK-26334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: chenliang
>Priority: Major
> Attachments: 21_47_01__01_04_2019.jpg, SPARK-26334.patch, 
> screenshot-1.png
>
>
> In the table shown below:
> {code:sql}
> CREATE EXTERNAL TABLE `test_db4`(`a` string, `b` string, `url` string) 
> PARTITIONED BY (`dt` string);
> {code}
>  !screenshot-1.png! 
>   For the field `url`, some values are initialized to NULL.
>   When we apply the reflect function to `url`, it leads to a 
> NullPointerException, as follows:
> {code:sql}
> select reflect('java.net.URLDecoder', 'decode', url ,'utf-8') from 
> mydemo.test_db4 where dt=20180920;
> {code}
> {panel:title=NPE}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 
> (TID 17, bigdata-nmg-hdfstest12.nmg01.diditaxi.com, executor 1): 
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection.eval(CallMethodViaReflection.scala:95)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at java.net.URLDecoder.decode(URLDecoder.java:136)
> ... 21 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
> at 
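
As a side note, a possible workaround (not a fix for the NPE itself) is to keep 
NULL urls away from reflect, for example with an IS NOT NULL guard; a sketch, 
assuming a spark-shell session:
{code:scala}
// Illustrative workaround only: filter NULLs out so URLDecoder.decode never
// receives null.
spark.sql("""
  SELECT reflect('java.net.URLDecoder', 'decode', url, 'utf-8')
  FROM mydemo.test_db4
  WHERE dt = 20180920 AND url IS NOT NULL
""").show()
{code}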

[jira] [Resolved] (SPARK-7428) DataFrame.join() could create a new df with duplicate column name

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-7428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-7428.
-
Resolution: Won't Fix

> DataFrame.join() could create a new df with duplicate column name
> -
>
> Key: SPARK-7428
> URL: https://issues.apache.org/jira/browse/SPARK-7428
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: spark-1.3.0-bin-hadoop2.4
>Reporter: yan tianxing
>Priority: Major
>
> {code}
> >val df = sc.parallelize(Array(1,2,3)).toDF("x")
> >val df2 = sc.parallelize(Array(1,4,5)).toDF("x")
> >val df3 = df.join(df2,df("x")===df2("x"),"inner")
> >df3.show
> x x
> 1 1
> {code}
> > df3.select("x")
> org.apache.spark.sql.AnalysisException: Ambiguous references to x: 
> (x#1,List()),(x#3,List());
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:211)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:109)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:267)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:260)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:116)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:121)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7.applyOrElse(Analyzer.scala:260)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7.applyOrElse(Analyzer.scala:197)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:197)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:196)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
> at 
> 
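
For reference, a sketch of the using-columns join available in more recent 
Spark versions, assuming an equi-join on x is what is intended; it keeps a 
single x column, so select("x") is unambiguous (written for spark-shell):
{code:scala}
val df = sc.parallelize(Array(1, 2, 3)).toDF("x")
val df2 = sc.parallelize(Array(1, 4, 5)).toDF("x")
// The Seq-based overload produces one shared "x" column instead of two.
val df3 = df.join(df2, Seq("x"))
df3.select("x").show()
{code}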

[jira] [Resolved] (SPARK-4235) Add union data type support

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-4235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-4235.
-
Resolution: Later

I'm going to leave this resolved as {{Later}}. It's been inactive for a few 
years already.

> Add union data type support
> ---
>
> Key: SPARK-4235
> URL: https://issues.apache.org/jira/browse/SPARK-4235
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7789) sql on security hbase:Token generation only allowed for Kerberos authenticated clients

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-7789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-7789.
-
Resolution: Incomplete

Leaving this resolved due to inactivity from the reporter.

> sql  on security hbase:Token generation only allowed for Kerberos 
> authenticated clients
> ---
>
> Key: SPARK-7789
> URL: https://issues.apache.org/jira/browse/SPARK-7789
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: meiyoula
>Priority: Major
>
> After creating an HBase table in Beeline and then executing a SELECT SQL 
> statement, the executor throws the following exception:
> {quote}
> java.lang.IllegalStateException: Error while configuring input job properties
> at 
> org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureTableJobProperties(HBaseStorageHandler.java:343)
> at 
> org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureInputJobProperties(HBaseStorageHandler.java:279)
> at 
> org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:804)
> at 
> org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler(PlanUtils.java:774)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:300)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276)
> at 
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
> at 
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
> at scala.Option.map(Option.scala:145)
> at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:220)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hbase.security.AccessDeniedException: 
> org.apache.hadoop.hbase.security.AccessDeniedException: Token generation only 
> allowed for Kerberos authenticated clients
> at 
> org.apache.hadoop.hbase.security.token.TokenProvider.getAuthenticationToken(TokenProvider.java:124)
> at 
> org.apache.hadoop.hbase.protobuf.generated.AuthenticationProtos$AuthenticationService$1.getAuthenticationToken(AuthenticationProtos.java:4267)
> at 
> org.apache.hadoop.hbase.protobuf.generated.AuthenticationProtos$AuthenticationService.callMethod(AuthenticationProtos.java:4387)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.execService(HRegion.java:7696)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.execServiceOnRegion(RSRpcServices.java:1877)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:1859)
> at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32209)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2131)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:102)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> at java.lang.Thread.run(Thread.java:745)
> 

[jira] [Resolved] (SPARK-8403) Pruner partition won't effective when partition field and fieldSchema exist in sql predicate

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-8403.
-
Resolution: Incomplete

Leaving this resolved due to reporter's inactivity.

> Pruner partition won't effective when partition field and fieldSchema exist 
> in sql predicate
> 
>
> Key: SPARK-8403
> URL: https://issues.apache.org/jira/browse/SPARK-8403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hong Shen
>Priority: Major
>
> When a partition field and a regular (non-partition) field both appear in the 
> SQL predicates, partition pruning is not effective.
> Here is the SQL:
> {code}
> select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r 
> where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)>16)
> {code}
> Table t_dw_qqlive_209026 is partitioned by imp_date, and itimestamp is a 
> regular column of t_dw_qqlive_209026.
> When run on Hive, it only scans the data in partition 20150615, but when run 
> on Spark SQL, it scans the whole table t_dw_qqlive_209026.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9218) Falls back to getAllPartitions when getPartitionsByFilter fails

2019-01-04 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-9218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734218#comment-16734218
 ] 

Hyukjin Kwon commented on SPARK-9218:
-

ping [~lian cheng]

> Falls back to getAllPartitions when getPartitionsByFilter fails
> ---
>
> Key: SPARK-9218
> URL: https://issues.apache.org/jira/browse/SPARK-9218
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Major
>
> [PR #7492|https://github.com/apache/spark/pull/7492] enables Hive partition 
> predicate push-down by leveraging {{Hive.getPartitionsByFilter}}. Although 
> this optimization is pretty effective, we did observe some failures like this:
> {noformat}
> java.sql.SQLDataException: Invalid character string format for type DECIMAL.
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeQuery(Unknown Source)
>   at 
> com.jolbox.bonecp.PreparedStatementHandle.executeQuery(PreparedStatementHandle.java:174)
>   at 
> org.datanucleus.store.rdbms.ParamLoggingPreparedStatement.executeQuery(ParamLoggingPreparedStatement.java:381)
>   at 
> org.datanucleus.store.rdbms.SQLController.executeStatementQuery(SQLController.java:504)
>   at 
> org.datanucleus.store.rdbms.query.SQLQuery.performExecute(SQLQuery.java:280)
>   at org.datanucleus.store.query.Query.executeQuery(Query.java:1786)
>   at 
> org.datanucleus.store.query.AbstractSQLQuery.executeWithArray(AbstractSQLQuery.java:339)
>   at org.datanucleus.api.jdo.JDOQuery.executeWithArray(JDOQuery.java:312)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.getPartitionsViaSqlFilterInternal(MetaStoreDirectSql.java:300)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.getPartitionsViaSqlFilter(MetaStoreDirectSql.java:211)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore$4.getSqlResult(ObjectStore.java:2320)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore$4.getSqlResult(ObjectStore.java:2317)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore$GetHelper.run(ObjectStore.java:2208)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilterInternal(ObjectStore.java:2317)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilter(ObjectStore.java:2165)
>   at sun.reflect.GeneratedMethodAccessor126.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:108)
>   at com.sun.proxy.$Proxy21.getPartitionsByFilter(Unknown Source)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_partitions_by_filter(HiveMetaStore.java:3760)
>   at sun.reflect.GeneratedMethodAccessor125.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
>   at com.sun.proxy.$Proxy23.get_partitions_by_filter(Unknown Source)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsByFilter(HiveMetaStoreClient.java:903)
>   at sun.reflect.GeneratedMethodAccessor124.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
>   at com.sun.proxy.$Proxy24.listPartitionsByFilter(Unknown Source)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getPartitionsByFilter(Hive.java:1944)
>   at sun.reflect.GeneratedMethodAccessor123.invoke(Unknown Source)
>   at 
> 
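
A rough sketch of the fallback behaviour this ticket proposes; fetchByFilter 
and fetchAll below are placeholders, not actual Hive or Spark APIs:
{code:scala}
def partitionsWithFallback[T](filter: String)(
    fetchByFilter: String => Seq[T],
    fetchAll: () => Seq[T]): Seq[T] = {
  try {
    fetchByFilter(filter)
  } catch {
    case _: Exception =>
      // Metastore-side filter evaluation failed (e.g. the Derby DECIMAL error
      // above), so fall back to fetching every partition and filtering later.
      fetchAll()
  }
}
{code}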

[jira] [Resolved] (SPARK-8370) Add API for data sources to register databases

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-8370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-8370.
-
Resolution: Not A Problem

> Add API for data sources to register databases
> --
>
> Key: SPARK-8370
> URL: https://issues.apache.org/jira/browse/SPARK-8370
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Santiago M. Mola
>Priority: Major
>
> This API would allow registering a database with a data source instead of 
> just a table. Registering a data source database would register all of its 
> tables and keep the catalog updated. The catalog could delegate table lookups 
> to the data source for a database registered with this API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26534) Closure Cleaner Bug

2019-01-04 Thread sam (JIRA)
sam created SPARK-26534:
---

 Summary: Closure Cleaner Bug
 Key: SPARK-26534
 URL: https://issues.apache.org/jira/browse/SPARK-26534
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: sam


I've found a strange combination of closures where the closure cleaner doesn't 
seem to be smart enough to figure out how to remove a reference that is not 
used. That is, we get an `org.apache.spark.SparkException: Task not serializable` 
for a task that is perfectly serializable.

In the example below, the only `val` that is actually needed for the closure of 
the `map` is `foo`, but it tries to serialise `thingy`. What is odd is that 
changing this code in a number of subtle ways eliminates the error, which I've 
tried to highlight using comments inline.

 
{code:java}
import org.apache.spark.sql._

object Test {
  val sparkSession: SparkSession =
SparkSession.builder.master("local").appName("app").getOrCreate()

  def apply(): Unit = {
import sparkSession.implicits._

val landedData: Dataset[String] = 
sparkSession.sparkContext.makeRDD(Seq("foo", "bar")).toDS()

    // thingy has to be in this outer scope to reproduce; if it is inside
    // someFunc, the issue cannot be reproduced
val thingy: Thingy = new Thingy

// If not wrapped in someFunc cannot reproduce
val someFunc = () => {
      // If foo is not referenced inside the closure (e.g. just use the
      // identity function), the issue cannot be reproduced
  val foo: String = "foo"

  thingy.run(block = () => {
landedData.map(r => {
  r + foo
})
.count()
  })
}

someFunc()

  }
}

class Thingy {
  def run[R](block: () => R): R = {
block()
  }
}
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26445) Use ConfigEntry for hardcoded configs for driver/executor categories.

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-26445:


Assignee: Takuya Ueshin

> Use ConfigEntry for hardcoded configs for driver/executor categories.
> -
>
> Key: SPARK-26445
> URL: https://issues.apache.org/jira/browse/SPARK-26445
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>
> Make the hardcoded "spark.driver" and "spark.executor" configs use 
> {{ConfigEntry}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7754) [SQL] Use PartialFunction literals instead of objects in Catalyst

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-7754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-7754.
-
Resolution: Won't Fix

> [SQL] Use PartialFunction literals instead of objects in Catalyst
> -
>
> Key: SPARK-7754
> URL: https://issues.apache.org/jira/browse/SPARK-7754
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Edoardo Vacchi
>Priority: Minor
>
> Catalyst rules extend two distinct "rule" types: {{Rule[LogicalPlan]}} and 
> {{Strategy}} (which is an alias for {{GenericStrategy[SparkPlan]}}).
> The distinction is fairly subtle: in the end, both rule types are supposed to 
> define a method {{apply(plan: LogicalPlan)}}
> (where LogicalPlan is either Logical- or Spark-) which returns a transformed 
> plan (or a sequence thereof, in the case
> of Strategy).
> Ceremonies aside, the body of such a method is always of the kind:
> {code:java}
>  def apply(plan: PlanType) = plan match pf
> {code}
> where `pf` would be some `PartialFunction` of the PlanType:
> {code:java}
>   val pf = {
> case ... => ...
>   }
> {code}
> This is JIRA is a proposal to introduce utility methods to
>   a) reduce the boilerplate to define rewrite rules
>   b) turning them back into what they essentially represent: function types.
> These changes would be backwards compatible, and would greatly help in 
> understanding what the code does. Current use of objects is redundant and 
> possibly confusing.
> *{{Rule[LogicalPlan]}}*
> a) Introduce the utility object
> {code:java}
>   object rule 
> def rule(pf: PartialFunction[LogicalPlan, LogicalPlan]): 
> Rule[LogicalPlan] =
>   new Rule[LogicalPlan] {
> def apply (plan: LogicalPlan): LogicalPlan = plan transform pf
>   }
> def named(name: String)(pf: PartialFunction[LogicalPlan, LogicalPlan]): 
> Rule[LogicalPlan] =
>   new Rule[LogicalPlan] {
> override val ruleName = name
> def apply (plan: LogicalPlan): LogicalPlan = plan transform pf
>   }
> {code}
> b) progressively replace the boilerplate-y object definitions; e.g.
> {code:java}
> object MyRewriteRule extends Rule[LogicalPlan] {
>   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
> case ... => ...
> }
> {code}
> with
> {code:java}
> // define a Rule[LogicalPlan]
> val MyRewriteRule = rule {
>   case ... => ...
> }
> {code}
> and/or :
> {code:java}
> // define a named Rule[LogicalPlan]
> val MyRewriteRule = rule.named("My rewrite rule") {
>   case ... => ...
> }
> {code}
> *Strategies*
> A similar solution could be applied to shorten the code for
> Strategies, which are total functions
> only because they are all supposed to manage the default case,
> possibly returning `Nil`. In this case
> we might introduce the following utility:
> {code:java}
> object strategy {
>   /**
>* Generate a Strategy from a PartialFunction[LogicalPlan, SparkPlan].
>* The partial function must therefore return *one single* SparkPlan for 
> each case.
>* The method will automatically wrap them in a [[Seq]].
>* Unhandled cases will automatically return Seq.empty
>*/
>   def apply(pf: PartialFunction[LogicalPlan, SparkPlan]): Strategy =
> new Strategy {
>   def apply(plan: LogicalPlan): Seq[SparkPlan] =
> if (pf.isDefinedAt(plan)) Seq(pf.apply(plan)) else Seq.empty
> }
>   /**
>* Generate a Strategy from a PartialFunction[ LogicalPlan, Seq[SparkPlan] 
> ].
>* The partial function must therefore return a Seq[SparkPlan] for each 
> case.
>* Unhandled cases will automatically return Seq.empty
>*/
>  def seq(pf: PartialFunction[LogicalPlan, Seq[SparkPlan]]): Strategy =
> new Strategy {
>   def apply(plan: LogicalPlan): Seq[SparkPlan] =
> if (pf.isDefinedAt(plan)) pf.apply(plan) else Seq.empty[SparkPlan]
> }
> }
> {code}
> Usage:
> {code:java}
> val mystrategy = strategy { case ... => ... }
> val seqstrategy = strategy.seq { case ... => ... }
> {code}
> *Further possible improvements:*
> Making the utility methods `implicit`, thereby
> further reducing the rewrite rules to:
> {code:java}
> // define a PartialFunction[LogicalPlan, LogicalPlan]
> // the implicit would convert it into a Rule[LogicalPlan] at the use sites
> val MyRewriteRule = {
>   case ... => ...
> }
> {code}
> *Caveats*
> Because of the way objects are initialized vs. vals, it might be necessary to
> reorder instructions so that vals are actually initialized before they are 
> used.
> E.g.:
> {code:java}
> class MyOptimizer extends Optimizer {
>   override val batches: Seq[Batch] =
>   ...
>   Batch("Other rules", FixedPoint(100),
> MyRewriteRule // <--- might throw NPE
>   val 

[jira] [Resolved] (SPARK-8602) Shared cached DataFrames

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-8602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-8602.
-
Resolution: Not A Problem

> Shared cached DataFrames
> 
>
> Key: SPARK-8602
> URL: https://issues.apache.org/jira/browse/SPARK-8602
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: John Muller
>Priority: Major
>
> Currently, the only way I can think of to share HiveContexts, SparkContexts, 
> or cached DataFrames is to use spark-jobserver and spark-jobserver-extras:
> https://gist.github.com/anonymous/578385766261d6fa7196#file-exampleshareddf-scala
> But HiveServer2 users over plain JDBC cannot access the shared dataframe. 
> The request is to add this directly to Spark SQL and treat it like a shared 
> temp table, e.g.:
> SELECT a, b, c
> FROM TableA
> CACHE DATAFRAME
> This would be very useful for Rollups and Cubes, though I'm not sure what 
> this may mean for HiveMetaStore. 
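
For reference, Spark 2.1+ global temporary views cover part of this request: a 
global temp view is visible to every SparkSession that shares the same 
SparkContext (though not to external JDBC clients, which is the other half of 
the ask). A minimal sketch, assuming a spark-shell session and the TableA from 
the example:
{code:scala}
val shared = spark.sql("SELECT a, b, c FROM TableA")
shared.cache()
shared.createGlobalTempView("shared_result")

// From another session on the same context:
spark.newSession().sql("SELECT * FROM global_temp.shared_result").show()
{code}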



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26436) Dataframe resulting from a GroupByKey and flatMapGroups operation throws java.lang.UnsupportedException when groupByKey is applied on it.

2019-01-04 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734282#comment-16734282
 ] 

Liang-Chi Hsieh commented on SPARK-26436:
-

Sorry, I don't know what you mean by "should groupByKey on rows with GenericSchema 
not work, when a repartition on the same dataframe and same column succeeds." 
Can you elaborate?

> Dataframe resulting from a GroupByKey and flatMapGroups operation throws 
> java.lang.UnsupportedException when groupByKey is applied on it.
> -
>
> Key: SPARK-26436
> URL: https://issues.apache.org/jira/browse/SPARK-26436
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Manish
>Priority: Major
>
> There seems to be a bug in the groupByKey API when it is applied to a Dataset 
> resulting from a previous groupByKey and flatMapGroups invocation.
> In such cases groupByKey throws the following exception:
> java.lang.UnsupportedException: fieldIndex on a Row without schema is 
> undefined.
>  
> This happens although the dataframe has a valid schema, and groupBy("key") or 
> repartition($"key") calls on the same dataframe and key succeed.
>  
> Following is the code that reproduces the scenario:
>  
> {code:scala}
>  
>import org.apache.spark.sql.catalyst.encoders.RowEncoder
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types.{ IntegerType, StructField, StructType}
> import scala.collection.mutable.ListBuffer
>   object Test {
> def main(args: Array[String]): Unit = {
>   val values = List(List("1", "One") ,List("1", "Two") ,List("2", 
> "Three"),List("2","4")).map(x =>(x(0), x(1)))
>   val session = SparkSession.builder.config("spark.master", 
> "local").getOrCreate
>   import session.implicits._
>   val dataFrame = values.toDF
>   dataFrame.show()
>   dataFrame.printSchema()
>   val newSchema = StructType(dataFrame.schema.fields
> ++ Array(
> StructField("Count", IntegerType, false)
>   )
>   )
>   val expr = RowEncoder.apply(newSchema)
>   val tranform =  dataFrame.groupByKey(row => 
> row.getAs[String]("_1")).flatMapGroups((key, inputItr) => {
> val inputSeq = inputItr.toSeq
> val length = inputSeq.size
> var listBuff = new ListBuffer[Row]()
> var counter : Int= 0
> for(i <- 0 until(length))
> {
>   counter+=1
> }
> for(i <- 0 until length ) {
>   var x = inputSeq(i)
>   listBuff += Row.fromSeq(x.toSeq ++ Array[Int](counter))
> }
> listBuff.iterator
>   })(expr)
>   tranform.show
>   val newSchema1 = StructType(tranform.schema.fields
> ++ Array(
> StructField("Count1", IntegerType, false)
>   )
>   )
>   val expr1 = RowEncoder.apply(newSchema1)
>   val tranform2 =  tranform.groupByKey(row => 
> row.getAs[String]("_1")).flatMapGroups((key, inputItr) => {
> val inputSeq = inputItr.toSeq
> val length = inputSeq.size
> var listBuff = new ListBuffer[Row]()
> var counter : Int= 0
> for(i <- 0 until(length))
> {
>   counter+=1
> }
> for(i <- 0 until length ) {
>   var x = inputSeq(i)
>   listBuff += Row.fromSeq(x.toSeq ++ Array[Int](counter))
> }
> listBuff.iterator
>   })(expr1)
>   tranform2.show
> }
> }
> Test.main(null)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26535) Parsing literals as DOUBLE instead of DECIMAL

2019-01-04 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-26535:
---

 Summary: Parsing literals as DOUBLE instead of DECIMAL
 Key: SPARK-26535
 URL: https://issues.apache.org/jira/browse/SPARK-26535
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Marco Gaido


As pointed out in [~dkbiswal]'s comment 
https://github.com/apache/spark/pull/22450#issuecomment-423082389, most other 
RDBMSs (DB2, Presto, Hive, MSSQL) treat literals as DOUBLE by default.

Spark currently treats them as DECIMAL. This is quite problematic, especially 
in relation to operations on decimals, for which we base our implementation on 
Hive/MSSQL.

So this ticket is about resolving literals as DOUBLE by default, with a config 
that allows reverting to the previous behavior.
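
A small illustration of the difference, assuming a spark-shell session; the 
schemas in the comments are what current versions print, and the exact output 
may vary by version:
{code:scala}
// Current behavior: a fractional literal is parsed as DECIMAL.
spark.sql("SELECT 1.5 AS x").printSchema()                  // x: decimal(2,1)

// What the proposal would make the default; the cast shows the target type.
spark.sql("SELECT CAST(1.5 AS DOUBLE) AS x").printSchema()  // x: double
{code}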



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26536) Upgrade Mockito to 2.23.4

2019-01-04 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-26536:
-

 Summary: Upgrade Mockito to 2.23.4
 Key: SPARK-26536
 URL: https://issues.apache.org/jira/browse/SPARK-26536
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26436) Dataframe resulting from a GroupByKey and flatMapGroups operation throws java.lang.UnsupportedException when groupByKey is applied on it.

2019-01-04 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734178#comment-16734178
 ] 

Liang-Chi Hsieh commented on SPARK-26436:
-

This issue is caused by how you create the row:
{code}
listBuff += Row.fromSeq(x.toSeq ++ Array[Int](counter))
{code}

{{Row.fromSeq}} creates a {{GenericRow}}, and {{GenericRow}}'s {{fieldIndex}} is 
not implemented because {{GenericRow}} doesn't have a schema.

Changing the line to create {{GenericRowWithSchema}} can solve it:
{code}
 listBuff += new GenericRowWithSchema((x.toSeq ++ Array[Int](counter)).toArray, 
newSchema)
{code}

{code}
scala> tranform2.show
+---+-+-+--+
| _1|   _2|Count|Count1|
+---+-+-+--+
|  1|  One|2| 2|
|  1|  Two|2| 2|
|  2|Three|2| 2|
|  2|4|2| 2|
+---+-+-+--+
{code}
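
For completeness, the snippet above also assumes the import below; note that 
GenericRowWithSchema is an internal catalyst class, not a stable API:
{code:scala}
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
{code}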

> Dataframe resulting from a GroupByKey and flatMapGroups operation throws 
> java.lang.UnsupportedException when groupByKey is applied on it.
> -
>
> Key: SPARK-26436
> URL: https://issues.apache.org/jira/browse/SPARK-26436
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Manish
>Priority: Major
>
> There seems to be a bug on groupByKey api for cases when it (groupByKey) is 
> applied on a DataSet resulting from a former groupByKey and flatMapGroups 
> invocation.
> In such cases groupByKey throws the following exception:
> java.lang.UnsupportedException: fieldIndex on a Row without schema is 
> undefined.
>  
> Although the dataframe has a valid schema, groupBy("key") or 
> repartition($"key") api calls on the same Dataframe and key succeed.
>  
> Following is the code that reproduces the scenario:
>  
> {code:scala}
>  
>import org.apache.spark.sql.catalyst.encoders.RowEncoder
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types.{ IntegerType, StructField, StructType}
> import scala.collection.mutable.ListBuffer
>   object Test {
> def main(args: Array[String]): Unit = {
>   val values = List(List("1", "One") ,List("1", "Two") ,List("2", 
> "Three"),List("2","4")).map(x =>(x(0), x(1)))
>   val session = SparkSession.builder.config("spark.master", 
> "local").getOrCreate
>   import session.implicits._
>   val dataFrame = values.toDF
>   dataFrame.show()
>   dataFrame.printSchema()
>   val newSchema = StructType(dataFrame.schema.fields
> ++ Array(
> StructField("Count", IntegerType, false)
>   )
>   )
>   val expr = RowEncoder.apply(newSchema)
>   val tranform =  dataFrame.groupByKey(row => 
> row.getAs[String]("_1")).flatMapGroups((key, inputItr) => {
> val inputSeq = inputItr.toSeq
> val length = inputSeq.size
> var listBuff = new ListBuffer[Row]()
> var counter : Int= 0
> for(i <- 0 until(length))
> {
>   counter+=1
> }
> for(i <- 0 until length ) {
>   var x = inputSeq(i)
>   listBuff += Row.fromSeq(x.toSeq ++ Array[Int](counter))
> }
> listBuff.iterator
>   })(expr)
>   tranform.show
>   val newSchema1 = StructType(tranform.schema.fields
> ++ Array(
> StructField("Count1", IntegerType, false)
>   )
>   )
>   val expr1 = RowEncoder.apply(newSchema1)
>   val tranform2 =  tranform.groupByKey(row => 
> row.getAs[String]("_1")).flatMapGroups((key, inputItr) => {
> val inputSeq = inputItr.toSeq
> val length = inputSeq.size
> var listBuff = new ListBuffer[Row]()
> var counter : Int= 0
> for(i <- 0 until(length))
> {
>   counter+=1
> }
> for(i <- 0 until length ) {
>   var x = inputSeq(i)
>   listBuff += Row.fromSeq(x.toSeq ++ Array[Int](counter))
> }
> listBuff.iterator
>   })(expr1)
>   tranform2.show
> }
> }
> Test.main(null)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4476) Use MapType for dict in json which has unique keys in each row.

2019-01-04 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-4476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734184#comment-16734184
 ] 

Hyukjin Kwon commented on SPARK-4476:
-

I think we can resolve this now in favour of {{from_json}} and {{to_json}}, 
which support map types. I'll leave this resolved, but please reopen it if 
I'm mistaken.
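
For reference, a minimal spark-shell sketch of reading such data as a map with 
{{from_json}} (the column name and value type here are just an example):

{code:scala}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, MapType, StringType}
import spark.implicits._

val jsons = Seq("""{"a": 1}""", """{"b": 2}""", """{"c": 3}""").toDF("json")

// Each row becomes a single-entry map instead of contributing a new field
// to one shared StructType.
val parsed = jsons.select(from_json($"json", MapType(StringType, IntegerType)).as("m"))
parsed.printSchema()
// root
//  |-- m: map (nullable = true)
//  |    |-- key: string
//  |    |-- value: integer (valueContainsNull = true)
{code}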

> Use MapType for dict in json which has unique keys in each row.
> ---
>
> Key: SPARK-4476
> URL: https://issues.apache.org/jira/browse/SPARK-4476
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Priority: Major
>
> For the jsonRDD like this: 
> {code}
> """ {a: 1} """
> """ {b: 2} """
> """ {c: 3} """
> """ {d: 4} """
> """ {e: 5} """
> {code}
> It will create a StructType with 5 fields in it, each field coming from a 
> different row. It will be a problem if the RDD is large. A StructType with 
> thousands or millions of fields is hard to play with (it will cause a stack 
> overflow during serialization).
> It should be MapType for this case. We need a clear rule to decide whether 
> StructType or MapType will be used for a dict in json data. 
> cc [~yhuai] [~marmbrus]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4476) Use MapType for dict in json which has unique keys in each row.

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-4476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-4476.
-
Resolution: Done

> Use MapType for dict in json which has unique keys in each row.
> ---
>
> Key: SPARK-4476
> URL: https://issues.apache.org/jira/browse/SPARK-4476
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Priority: Major
>
> For the jsonRDD like this: 
> {code}
> """ {a: 1} """
> """ {b: 2} """
> """ {c: 3} """
> """ {d: 4} """
> """ {e: 5} """
> {code}
> It will create a StructType with 5 fields in it, each field coming from a 
> different row. It will be a problem if the RDD is large. A StructType with 
> thousands or millions of fields is hard to play with (it will cause a stack 
> overflow during serialization).
> It should be MapType for this case. We need a clear rule to decide whether 
> StructType or MapType will be used for a dict in json data. 
> cc [~yhuai] [~marmbrus]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7148) Configure Parquet block size (row group size) for ML model import/export

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-7148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-7148.
-
Resolution: Not A Problem

Then, I'm leaving this resolved. Please reopen it if I am mistaken.

> Configure Parquet block size (row group size) for ML model import/export
> 
>
> Key: SPARK-7148
> URL: https://issues.apache.org/jira/browse/SPARK-7148
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, SQL
>Affects Versions: 1.3.0, 1.3.1, 1.4.0
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Minor
>
> It would be nice if we could configure the Parquet buffer size when using 
> Parquet format for ML model import/export.  Currently, for some models (trees 
> and ensembles), the schema has 13+ columns.  With a default buffer size of 
> 128MB (I think), that puts the allocated buffer way over the default memory 
> made available by run-example.  Because of this problem, users have to use 
> spark-submit and explicitly use a larger amount of memory in order to run 
> some ML examples.
> Is there a simple way to specify {{parquet.block.size}}?  I'm not familiar 
> with this part of SparkSQL.
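
For anyone looking for a workaround, a minimal sketch of one way to pass 
{{parquet.block.size}} through the Hadoop configuration before writing (the path, 
size, and DataFrame name below are placeholders, and this has not been verified 
against the ML export code path):

{code:scala}
// The Parquet writer picks the row-group size up from the Hadoop configuration.
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 16 * 1024 * 1024) // 16 MB

// Hypothetical DataFrame and path; any Parquet write afterwards uses the setting.
modelDf.write.parquet("/tmp/model-export")
{code}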



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7755) MetadataCache.refresh does not take into account _SUCCESS

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-7755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-7755.
-
Resolution: Not A Problem

> MetadataCache.refresh does not take into account _SUCCESS
> -
>
> Key: SPARK-7755
> URL: https://issues.apache.org/jira/browse/SPARK-7755
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Rowan Chattaway
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When you make a call to sqlc.parquetFile(path) where that path contains 
> partially written files, then refresh will fail in strange ways when it 
> attempts to read footer files.
> I would like to adjust the file discovery to take into account the presence 
> of _SUCCESS and therefore only attempt to read if we have the success marker.
> I have made the changes locally and it doesn't appear to have any side 
> effects.
> What are people's thoughts about this?
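
For anyone wanting this behavior without a Spark change, a minimal sketch of 
guarding the read on the marker file using the current reader API (the path is a 
placeholder):

{code:scala}
import org.apache.hadoop.fs.Path

val dir = new Path("/data/output")  // placeholder path
val fs = dir.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Only read the directory once the job that produced it has committed successfully.
if (fs.exists(new Path(dir, "_SUCCESS"))) {
  spark.read.parquet(dir.toString).show()
} else {
  println(s"$dir has no _SUCCESS marker; skipping")
}
{code}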



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8324) Register Query as view through JDBC interface

2019-01-04 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734203#comment-16734203
 ] 

Hyukjin Kwon commented on SPARK-8324:
-

Fixed in SPARK-24423
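
Independently of the exact fix, the view-registration part of this request can be 
expressed today as plain SQL, and the same statements can be issued from beeline 
against the Thrift server; a minimal sketch (paths and columns are placeholders):

{code:scala}
// Register a file-backed data source table as a temporary view ...
spark.sql("""
  CREATE TEMPORARY VIEW sales
  USING csv
  OPTIONS (path '/data/sales.csv', header 'true')
""")

// ... and a derived query on top of it as another view.
spark.sql("""
  CREATE OR REPLACE TEMPORARY VIEW sales_by_region AS
  SELECT region, count(*) AS cnt FROM sales GROUP BY region
""")

spark.sql("SELECT * FROM sales_by_region").show()
{code}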

> Register Query as view through JDBC interface
> -
>
> Key: SPARK-8324
> URL: https://issues.apache.org/jira/browse/SPARK-8324
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Mayoor Rao
>Priority: Major
>  Labels: features
> Attachments: Jira_request_register_query_as_view.docx
>
>
> We currently have the capability of adding csv, json, parquet, etc. files as 
> tables through beeline using the Datasource API. We need a mechanism to register 
> complex queries as tables through the jdbc interface. The query definition could 
> be composed using the table names which are in turn registered as spark tables 
> using the datasource API. 
> The query definition should be persisted and should have an option to 
> re-register when the thriftserver is restarted.
> The sql command should be able to either take a filename which contains the 
> json content or it should take the json content directly.
> There should be an option to save the output of the queries and register the 
> output as table.
> Advantage
> • Create adhoc join statements across different data-sources using Spark 
> from an external BI interface, so no persistence of pre-aggregated data is needed.
> • No dependency on creating programs to generate adhoc analytics
> • Enable business users to model the data across diverse data sources in 
> real time without any programming
> • Enable persistence of the query output through the jdbc interface. No extra 
> programming required.
> SQL Syntax for registering a set of queries or files as table - 
> REGISTERSQLJOB USING FILE/JSON 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8324) Register Query as view through JDBC interface

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-8324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-8324.
-
Resolution: Duplicate

> Register Query as view through JDBC interface
> -
>
> Key: SPARK-8324
> URL: https://issues.apache.org/jira/browse/SPARK-8324
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Mayoor Rao
>Priority: Major
>  Labels: features
> Attachments: Jira_request_register_query_as_view.docx
>
>
> We currently have the capability of adding csv, json, parquet, etc. files as 
> tables through beeline using the Datasource API. We need a mechanism to register 
> complex queries as tables through the jdbc interface. The query definition could 
> be composed using the table names which are in turn registered as spark tables 
> using the datasource API. 
> The query definition should be persisted and should have an option to 
> re-register when the thriftserver is restarted.
> The sql command should be able to either take a filename which contains the 
> json content or it should take the json content directly.
> There should be an option to save the output of the queries and register the 
> output as table.
> Advantage
> • Create adhoc join statements across different data-sources using Spark 
> from an external BI interface, so no persistence of pre-aggregated data is needed.
> • No dependency on creating programs to generate adhoc analytics
> • Enable business users to model the data across diverse data sources in 
> real time without any programming
> • Enable persistence of the query output through the jdbc interface. No extra 
> programming required.
> SQL Syntax for registering a set of queries or files as table - 
> REGISTERSQLJOB USING FILE/JSON 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7754) [SQL] Use PartialFunction literals instead of objects in Catalyst

2019-01-04 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-7754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734197#comment-16734197
 ] 

Hyukjin Kwon commented on SPARK-7754:
-

I don't think we're going to change this if it mainly targets readability. It's 
been a few years already and the current code looks okay. I'm resolving this since 
there doesn't seem to be much interest in it. 
Please reopen this if I am mistaken.

> [SQL] Use PartialFunction literals instead of objects in Catalyst
> -
>
> Key: SPARK-7754
> URL: https://issues.apache.org/jira/browse/SPARK-7754
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Edoardo Vacchi
>Priority: Minor
>
> Catalyst rules extend two distinct "rule" types: {{Rule[LogicalPlan]}} and 
> {{Strategy}} (which is an alias for {{GenericStrategy[SparkPlan]}}).
> The distinction is fairly subtle: in the end, both rule types are supposed to 
> define a method {{apply(plan: LogicalPlan)}}
> (where LogicalPlan is either Logical- or Spark-) which returns a transformed 
> plan (or a sequence thereof, in the case
> of Strategy).
> Ceremonies asides, the body of such method is always of the kind:
> {code:java}
>  def apply(plan: PlanType) = plan match pf
> {code}
> where `pf` would be some `PartialFunction` of the PlanType:
> {code:java}
>   val pf = {
> case ... => ...
>   }
> {code}
> This is JIRA is a proposal to introduce utility methods to
>   a) reduce the boilerplate to define rewrite rules
>   b) turning them back into what they essentially represent: function types.
> These changes would be backwards compatible, and would greatly help in 
> understanding what the code does. Current use of objects is redundant and 
> possibly confusing.
> *{{Rule[LogicalPlan]}}*
> a) Introduce the utility object
> {code:java}
>   object rule 
> def rule(pf: PartialFunction[LogicalPlan, LogicalPlan]): 
> Rule[LogicalPlan] =
>   new Rule[LogicalPlan] {
> def apply (plan: LogicalPlan): LogicalPlan = plan transform pf
>   }
> def named(name: String)(pf: PartialFunction[LogicalPlan, LogicalPlan]): 
> Rule[LogicalPlan] =
>   new Rule[LogicalPlan] {
> override val ruleName = name
> def apply (plan: LogicalPlan): LogicalPlan = plan transform pf
>   }
> {code}
> b) progressively replace the boilerplate-y object definitions; e.g.
> {code:java}
> object MyRewriteRule extends Rule[LogicalPlan] {
>   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
> case ... => ...
> }
> {code}
> with
> {code:java}
> // define a Rule[LogicalPlan]
> val MyRewriteRule = rule {
>   case ... => ...
> }
> {code}
> and/or :
> {code:java}
> // define a named Rule[LogicalPlan]
> val MyRewriteRule = rule.named("My rewrite rule") {
>   case ... => ...
> }
> {code}
> *Strategies*
> A similar solution could be applied to shorten the code for
> Strategies, which are total functions
> only because they are all supposed to manage the default case,
> possibly returning `Nil`. In this case
> we might introduce the following utility:
> {code:java}
> object strategy {
>   /**
>* Generate a Strategy from a PartialFunction[LogicalPlan, SparkPlan].
>* The partial function must therefore return *one single* SparkPlan for 
> each case.
>* The method will automatically wrap them in a [[Seq]].
>* Unhandled cases will automatically return Seq.empty
>*/
>   def apply(pf: PartialFunction[LogicalPlan, SparkPlan]): Strategy =
> new Strategy {
>   def apply(plan: LogicalPlan): Seq[SparkPlan] =
> if (pf.isDefinedAt(plan)) Seq(pf.apply(plan)) else Seq.empty
> }
>   /**
>* Generate a Strategy from a PartialFunction[ LogicalPlan, Seq[SparkPlan] 
> ].
>* The partial function must therefore return a Seq[SparkPlan] for each 
> case.
>* Unhandled cases will automatically return Seq.empty
>*/
>  def seq(pf: PartialFunction[LogicalPlan, Seq[SparkPlan]]): Strategy =
> new Strategy {
>   def apply(plan: LogicalPlan): Seq[SparkPlan] =
> if (pf.isDefinedAt(plan)) pf.apply(plan) else Seq.empty[SparkPlan]
> }
> }
> {code}
> Usage:
> {code:java}
> val mystrategy = strategy { case ... => ... }
> val seqstrategy = strategy.seq { case ... => ... }
> {code}
> *Further possible improvements:*
> Making the utility methods `implicit`, thereby
> further reducing the rewrite rules to:
> {code:java}
> // define a PartialFunction[LogicalPlan, LogicalPlan]
> // the implicit would convert it into a Rule[LogicalPlan] at the use sites
> val MyRewriteRule = {
>   case ... => ...
> }
> {code}
> *Caveats*
> Because of the way objects are initialized vs. vals, it might be necessary
> reorder instructions so that vals are actually initialized before they 

[jira] [Comment Edited] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2019-01-04 Thread Timothy Pharo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734160#comment-16734160
 ] 

Timothy Pharo edited comment on SPARK-22231 at 1/4/19 2:15 PM:
---

[~dbtsai] and [~jeremyrsmith], this all looks great and is just what we have 
been looking for.  As this isn't yet available in Spark upstream and is not 
likely to be available any time soon, is there any plan to expose this 
separately in the interim?
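
Partly related: since 2.4 the built-in higher-order SQL functions already cover the 
array-mapping part of this, although not the proposed withColumn-on-nested-struct 
API; a minimal sketch (not the API described in this ticket):

{code:scala}
import spark.implicits._

case class Data(foo: Int, bar: Double, items: Seq[Double])
val df = Seq(
  Data(10, 10.0, Seq(10.1, 10.2)),
  Data(20, 20.0, Seq(20.1, 20.2))
).toDF()

// transform() maps a lambda over each element of the nested array in SQL,
// without a round-trip through RDDs.
df.selectExpr("foo", "bar", "transform(items, x -> x * 2.0) AS items").show(false)
{code}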


was (Author: timothy pharo):
[~dbtsai] and [~jeremyrsmith], this all looks great and is just what we have 
been looking.  As this isn't yet available in Spark upstream and is not likely 
to be available any time soon, is there any plan to expose this separately in 
the interim?

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Jeremy Smith
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find the great 
> content to fulfill the unique tastes of our members. Before building a 
> recommendation algorithm, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the contexts describing the setting for where a model is to be used (e.g. 
> profiles, country, time, device, etc.)  Here is a blog post describing the 
> details, [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranks of videos for a given profile_id at a 
> given country, our data schema can be looked like this,
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- title_id: integer (nullable = true)
>  |||-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying some 
> functions on them. Sometimes, we're dropping or adding new columns in the 
> nested list of structs. Currently, there is no easy solution in open source 
> Apache Spark to perform those operations using SQL primitives; many people 
> just convert the data into an RDD to work on the nested level of data, and then 
> reconstruct the new dataframe as a workaround. This is extremely inefficient 
> because all the optimizations like predicate pushdown in SQL cannot be 
> performed, we cannot leverage the columnar format, and the serialization 
> and deserialization cost becomes really huge even if we just want to add a new 
> column at the nested level.
> We built a solution internally at Netflix which we're very happy with. We 
> plan to make it open source in Spark upstream. We would like to socialize the 
> API design to see if we miss any use-case.  
> The first API we added is *mapItems* on dataframe, which takes a function from 
> *Column* to *Column* and applies it to the nested dataframe. Here 
> is an example,
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: double (containsNull = true)
> result.show()
> // +---+++
> // |foo| bar|   items|
> // +---+++
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+++
> {code}
> Now, with the ability to apply a function in the nested dataframe, we can 
> add a new function, *withColumn* on *Column*, to add or replace the existing 
> column that has the same name in the nested list of structs. Here are two 
> examples demonstrating the API together with *mapItems*; the first one 
> replaces the existing column,
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, items: Seq[Item])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))),
>   Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0)))
> ))
> val result = df.mapItems("items") {
>   item 

[jira] [Resolved] (SPARK-6540) Spark SQL thrift server fails to pass settings to following query in the same session

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-6540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-6540.
-
Resolution: Incomplete

Leaving this resolved due to inactivity from the reporter

> Spark SQL thrift server fails to pass settings to following query in the same 
> session
> -
>
> Key: SPARK-6540
> URL: https://issues.apache.org/jira/browse/SPARK-6540
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Alex Liu
>Priority: Major
>
> e.g.
> By using beeline
> > set cassandra.username=user;
> > set cassandra.password = pass;
> > show databases;
> The "show databases" query fails due to missing cassandra.username and 
> cassandra.password configuration in hiveconf when create the metastore client.
> Same test passes in HiveServer2.
> This document 
> https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Thrift+API 
> mentions how HiveServer2 maps the connection to session(explicit support for 
> sessions in the client API, e.g every RPC call references a session ID which 
> the server then maps to persistent session state.), but Spark Hive thrift 
> server fails to deliver the same functionality even though it wraps HiveServer2 
> internally.
> Spark Hive thrift server contains a single hive context  which has a single 
> hiveConf shared by all sessions. Ideally we should isolate it per session.
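
For reference, later Spark versions keep per-session state in the Thrift server by 
default; a minimal sketch of starting the server programmatically with multi-session 
state (the builder settings here are illustrative, not a verified fix for 1.2.1):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Keep SET commands scoped to the issuing JDBC session rather than sharing
// a single HiveConf across all sessions.
val spark = SparkSession.builder()
  .appName("thrift-server")                                      // illustrative
  .enableHiveSupport()
  .config("spark.sql.hive.thriftServer.singleSession", "false")  // the default in 2.x
  .getOrCreate()

HiveThriftServer2.startWithContext(spark.sqlContext)
{code}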



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8102) Big performance difference when joining 3 tables in different order

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-8102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-8102.
-
Resolution: Not A Problem

The code path has changed drastically, and we're now heading for Spark 3.0. 
Let's not track this here - the information here is obsolete.
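
For anyone still seeing this on current versions, a minimal sketch of taking join 
order out of the planner's hands with broadcast hints (the DataFrame names stand in 
for the three CSV tables described below, and this assumes the two small tables fit 
in memory):

{code:scala}
import org.apache.spark.sql.functions.broadcast
import spark.implicits._

// Explicitly broadcast the two small dimension tables so the large fact table
// is scanned once, regardless of the order the tables appear in the query.
val joined = clickMeterSiteGrouped
  .join(broadcast(tCategory), $"refCategoryID" === $"category")
  .join(broadcast(tZipcode),  $"regionCode" === $"region")
{code}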

> Big performance difference when joining 3 tables in different order
> ---
>
> Key: SPARK-8102
> URL: https://issues.apache.org/jira/browse/SPARK-8102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.1
> Environment: spark in local mode
>Reporter: Hao Ren
>Priority: Major
> Attachments: query2job.png, query3job.png
>
>
> Given 3 tables loaded from CSV files: 
> ( tables name => size)
> *click_meter_site_grouped* =>10 687 455 bytes
> *t_zipcode* => 2 738 954 bytes
> *t_category* => 2 182 bytes
> When joining the 3 tables, I notice a large performance difference if they 
> are joined in a different order.
> Here are the SQL queries to compare:
> {code}
> -- snippet 1
> SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt
> FROM t_category c, t_zipcode z, click_meter_site_grouped g
> WHERE c.refCategoryID = g.category AND z.regionCode = g.region
> {code}
> {code}
> -- snippet 2
> SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt
> FROM t_category c, click_meter_site_grouped g, t_zipcode z
> WHERE c.refCategoryID = g.category AND z.regionCode = g.region
> {code}
> As you can see, the largest table *click_meter_site_grouped* is the last table in 
> the FROM clause in the first snippet, and in the middle of the table list in the 
> second one.
> Snippet 2 runs three times faster than Snippet 1.
> (8 seconds VS 24 seconds)
> As the data is just sampled from a large data set, if we test it on the 
> original data set, it will normally result in a performance issue.
> After checking the logs, we found something strange in snippet 1's log:
> 15/06/04 15:32:03 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:04 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:04 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:05 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:05 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:05 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:05 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:06 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:06 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:06 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:07 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:07 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:07 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:07 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:08 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:08 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:08 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:09 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:09 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:09 INFO HadoopRDD: Input split: 
> 

[jira] [Commented] (SPARK-8370) Add API for data sources to register databases

2019-01-04 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734205#comment-16734205
 ] 

Hyukjin Kwon commented on SPARK-8370:
-

I think we're working on multiple catalog support, so I am resolving this. Please 
reopen this if I am mistaken. Otherwise, let's track this in DataSourceV2.
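
For context, a sketch of the direction the multi-catalog work is taking; the config 
key and plugin class below are illustrative assumptions based on the in-progress 
DataSourceV2 catalog API and may change before it is released:

{code:scala}
// Hypothetical: register a catalog plugin under a name at session start ...
val session = org.apache.spark.sql.SparkSession.builder()
  .config("spark.sql.catalog.my_source", "com.example.MySourceCatalog") // assumed plugin class
  .getOrCreate()

// ... after which every table the source exposes is addressable by a
// catalog-qualified identifier, covering the "register a whole database" use case.
session.sql("SELECT * FROM my_source.sales.orders").show()
{code}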

> Add API for data sources to register databases
> --
>
> Key: SPARK-8370
> URL: https://issues.apache.org/jira/browse/SPARK-8370
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Santiago M. Mola
>Priority: Major
>
> This API would allow to register a database with a data source instead of 
> just a table. Registering a data source database would register all its table 
> and maintain the catalog updated. The catalog could delegate to the data 
> source lookups of tables for a database registered with this API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8577) ScalaReflectionLock.synchronized can cause deadlock

2019-01-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-8577.
-
Resolution: Not A Problem

I'm leaving this resolved since we removed Scala 2.10.

> ScalaReflectionLock.synchronized can cause deadlock
> ---
>
> Key: SPARK-8577
> URL: https://issues.apache.org/jira/browse/SPARK-8577
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: koert kuipers
>Priority: Minor
>
> Just a heads up, I was doing some basic coding using DataFrame, Row, 
> StructType, etc. in my own project and I ended up with deadlocks in my sbt 
> tests due to the usage of ScalaReflectionLock.synchronized in the Spark SQL 
> code.
> The issue went away when I changed my build to have:
>   parallelExecution in Test := false
> so that the tests run consecutively...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


