[jira] [Commented] (SPARK-24062) SASL encryption cannot be worked in ThriftServer

2018-04-25 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453521#comment-16453521
 ] 

Saisai Shao commented on SPARK-24062:
-

Issue resolved by pull request 21138
https://github.com/apache/spark/pull/21138

> SASL encryption cannot be worked in ThriftServer
> 
>
> Key: SPARK-24062
> URL: https://issues.apache.org/jira/browse/SPARK-24062
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> Spark thrift server will throw an exception when SASL encryption is used.
>  
> {noformat}
> 18/04/16 14:36:46 ERROR TransportRequestHandler: Error while invoking 
> RpcHandler#receive() on RPC id 8384069538832556183
> java.lang.IllegalArgumentException: A secret key must be specified via the 
> spark.authenticate.secret config
> at 
> org.apache.spark.SecurityManager$$anonfun$getSecretKey$4.apply(SecurityManager.scala:510)
> at 
> org.apache.spark.SecurityManager$$anonfun$getSecretKey$4.apply(SecurityManager.scala:510)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.SecurityManager.getSecretKey(SecurityManager.scala:509)
> at org.apache.spark.SecurityManager.getSecretKey(SecurityManager.scala:551)
> at 
> org.apache.spark.network.sasl.SparkSaslServer$DigestCallbackHandler.handle(SparkSaslServer.java:166)
> at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
> at 
> com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
> at 
> org.apache.spark.network.sasl.SparkSaslServer.response(SparkSaslServer.java:119)
> at 
> org.apache.spark.network.sasl.SaslRpcHandler.receive(SaslRpcHandler.java:103)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:187)
> at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111){noformat}
> After investigating, the issue is:
> Spark on YARN stores the SASL secret in the current UGI's credentials. These 
> credentials are distributed to the AM and the executors, so that the executors 
> and the driver share the same secret to communicate. But the STS/Hive library 
> code refreshes the current UGI via UGI's loginFromKeytab(), which creates a 
> new UGI in the current context with empty tokens and secret keys. The secret 
> key is therefore lost from the current context's UGI, which is why the Spark 
> driver throws the "secret key not found" exception.
> In Spark 2.2, Spark also stored this secret key in a class variable of 
> {{SecurityManager}}, so even if the UGI was refreshed, the secret still 
> existed in the object and STS with SASL kept working. In Spark 2.3 the key is 
> always looked up from the current UGI, which is why it no longer works.
> To fix this issue, there are two possible solutions:
> 1. Fix it in the STS/Hive library: when the UGI is refreshed, copy the secret 
> key from the original UGI to the new one. The difficulty is that some of the 
> code that refreshes the UGI lives in the Hive library, which makes it hard 
> for us to change.
> 2. Roll back the logic in SecurityManager to match Spark 2.2, so that the 
> issue is fixed there.
> The 2nd solution seems the simpler one, so I will propose a PR for it.
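
A minimal, hypothetical Scala sketch of the idea behind the 2nd solution: read
the SASL secret from the UGI once and cache it in the SecurityManager-like
object, so a later UGI re-login (e.g. Hive's loginUserFromKeytab) cannot lose
it. The names below (CachedSecretHolder, loadSecretFromUgi) are illustrative
and are not actual Spark APIs.
{code:scala}
// Illustrative only: caches the secret in the object instead of re-reading the
// current UGI on every lookup (the Spark 2.2-style behavior described above).
class CachedSecretHolder(loadSecretFromUgi: () => Option[String]) {
  // Read the secret from the UGI credentials once, at construction time.
  private val cachedSecret: Option[String] = loadSecretFromUgi()

  def getSecretKey(): String =
    cachedSecret.getOrElse(
      throw new IllegalArgumentException(
        "A secret key must be specified via the spark.authenticate.secret config"))
}
{code}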



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24062) SASL encryption cannot be worked in ThriftServer

2018-04-25 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24062.
-
   Resolution: Fixed
 Assignee: Saisai Shao
Fix Version/s: 2.4.0
   2.3.1

> SASL encryption cannot be worked in ThriftServer
> 
>
> Key: SPARK-24062
> URL: https://issues.apache.org/jira/browse/SPARK-24062
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> Spark thrift server will throw an exception when SASL encryption is used.
>  
> {noformat}
> 18/04/16 14:36:46 ERROR TransportRequestHandler: Error while invoking 
> RpcHandler#receive() on RPC id 8384069538832556183
> java.lang.IllegalArgumentException: A secret key must be specified via the 
> spark.authenticate.secret config
> at 
> org.apache.spark.SecurityManager$$anonfun$getSecretKey$4.apply(SecurityManager.scala:510)
> at 
> org.apache.spark.SecurityManager$$anonfun$getSecretKey$4.apply(SecurityManager.scala:510)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.SecurityManager.getSecretKey(SecurityManager.scala:509)
> at org.apache.spark.SecurityManager.getSecretKey(SecurityManager.scala:551)
> at 
> org.apache.spark.network.sasl.SparkSaslServer$DigestCallbackHandler.handle(SparkSaslServer.java:166)
> at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
> at 
> com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
> at 
> org.apache.spark.network.sasl.SparkSaslServer.response(SparkSaslServer.java:119)
> at 
> org.apache.spark.network.sasl.SaslRpcHandler.receive(SaslRpcHandler.java:103)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:187)
> at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111){noformat}
> After investigating, the issue is:
> Spark on YARN stores the SASL secret in the current UGI's credentials. These 
> credentials are distributed to the AM and the executors, so that the executors 
> and the driver share the same secret to communicate. But the STS/Hive library 
> code refreshes the current UGI via UGI's loginFromKeytab(), which creates a 
> new UGI in the current context with empty tokens and secret keys. The secret 
> key is therefore lost from the current context's UGI, which is why the Spark 
> driver throws the "secret key not found" exception.
> In Spark 2.2, Spark also stored this secret key in a class variable of 
> {{SecurityManager}}, so even if the UGI was refreshed, the secret still 
> existed in the object and STS with SASL kept working. In Spark 2.3 the key is 
> always looked up from the current UGI, which is why it no longer works.
> To fix this issue, there are two possible solutions:
> 1. Fix it in the STS/Hive library: when the UGI is refreshed, copy the secret 
> key from the original UGI to the new one. The difficulty is that some of the 
> code that refreshes the UGI lives in the Hive library, which makes it hard 
> for us to change.
> 2. Roll back the logic in SecurityManager to match Spark 2.2, so that the 
> issue is fixed there.
> The 2nd solution seems the simpler one, so I will propose a PR for it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23916) High-order function: array_join(x, delimiter, null_replacement) → varchar

2018-04-25 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-23916.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21011
[https://github.com/apache/spark/pull/21011]

> High-order function: array_join(x, delimiter, null_replacement) → varchar
> -
>
> Key: SPARK-23916
> URL: https://issues.apache.org/jira/browse/SPARK-23916
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Concatenates the elements of the given array using the delimiter and an 
> optional string to replace nulls.
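
A hedged usage sketch in Scala, assuming the Spark 2.4 SQL function follows the
Presto semantics referenced above (null elements are skipped unless a
replacement string is given):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("array_join-demo").master("local[*]").getOrCreate()

// Without a null replacement, null elements are skipped.
spark.sql("SELECT array_join(array('a', NULL, 'b'), ',')").show()      // a,b
// With a null replacement, nulls are substituted before joining.
spark.sql("SELECT array_join(array('a', NULL, 'b'), ',', '-')").show() // a,-,b
{code}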



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23916) High-order function: array_join(x, delimiter, null_replacement) → varchar

2018-04-25 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-23916:
-

Assignee: Marco Gaido

> High-order function: array_join(x, delimiter, null_replacement) → varchar
> -
>
> Key: SPARK-23916
> URL: https://issues.apache.org/jira/browse/SPARK-23916
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Concatenates the elements of the given array using the delimiter and an 
> optional string to replace nulls.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23902) Provide an option in months_between UDF to disable rounding-off

2018-04-25 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-23902.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21008
[https://github.com/apache/spark/pull/21008]

> Provide an option in months_between UDF to disable rounding-off
> ---
>
> Key: SPARK-23902
> URL: https://issues.apache.org/jira/browse/SPARK-23902
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> https://issues.apache.org/jira/browse/HIVE-15511
> {noformat}
> Rounding off was added in {{GenericUDFMonthsBetween}} so that it can be 
> compatible with systems like oracle. However, there are places where rounding 
> off is not needed.
> E.g "CAST(MONTHS_BETWEEN(l_shipdate, l_commitdate) / 12 AS INT)" may not need 
> rounding off via BigDecimal which is compute intensive.
> {noformat}
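
A hedged Scala sketch of the option this ticket adds, assuming the Spark 2.4
form where an optional third argument to months_between disables the rounding
of the result:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("months_between-demo").master("local[*]").getOrCreate()

// Default behavior: the result is rounded off.
spark.sql("SELECT months_between('1997-02-28 10:30:00', '1996-10-30')").show()
// With rounding disabled, the raw (unrounded) value is returned.
spark.sql("SELECT months_between('1997-02-28 10:30:00', '1996-10-30', false)").show()
{code}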



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23902) Provide an option in months_between UDF to disable rounding-off

2018-04-25 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-23902:
-

Assignee: Marco Gaido

> Provide an option in months_between UDF to disable rounding-off
> ---
>
> Key: SPARK-23902
> URL: https://issues.apache.org/jira/browse/SPARK-23902
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-15511
> {noformat}
> Rounding off was added in {{GenericUDFMonthsBetween}} so that it can be 
> compatible with systems like oracle. However, there are places where rounding 
> off is not needed.
> E.g "CAST(MONTHS_BETWEEN(l_shipdate, l_commitdate) / 12 AS INT)" may not need 
> rounding off via BigDecimal which is compute intensive.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21645) SparkSQL Left outer join get the error result when use phoenix spark plugin

2018-04-25 Thread shining (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453410#comment-16453410
 ] 

shining commented on SPARK-21645:
-

When a left outer join occurs between table a and table b, a filter on table a 
should also be applied to table b if the filtered column is part of the join 
condition.
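
For illustration, a hedged Scala/SQL sketch of the manual equivalent of that
inference, using the table and column names from the report below. Because the
filter is on a join key of the left outer join, the same predicate can be
applied to the right side before the join without changing the result:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-filter-demo").master("local[*]").getOrCreate()

// Illustrative only: a.ANCHEID = '...' constrains a join key, so the same
// predicate can also filter AN_SUP_BASEINFO before the left outer join.
spark.sql("""
  SELECT a.anchedate, b.womempnumdis, b.holdingsmsgdis
  FROM AN_BASEINFO a
  LEFT OUTER JOIN (
    SELECT * FROM AN_SUP_BASEINFO
    WHERE ANCHEID = '2c9e87ea5bd35458015c2df4003a1025'
  ) b
  ON a.S_EXT_NODENUM = b.S_EXT_NODENUM AND a.ANCHEID = b.ANCHEID
  WHERE a.ANCHEID = '2c9e87ea5bd35458015c2df4003a1025'
""").show()
{code}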

 

> SparkSQL Left outer join get the error result when use phoenix spark plugin
> ---
>
> Key: SPARK-21645
> URL: https://issues.apache.org/jira/browse/SPARK-21645
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
> Environment: spark2.1.0
> hbase 1.1.2
> phoenix4.10
>Reporter: shining
>Priority: Major
>
> I have two tables in Phoenix: AN_BASEINFO and AN_SUP_BASEINFO.
> Then I create the external datasource tables in SparkSQL through the Phoenix 
> Spark plugin, like:
> create table AN_BASEINFO 
> using org.apache.phoenix.spark
> OPTIONS(table "AN_BASEINFO ", zkUrl "172.16.12.82:2181")
> and 
> create table AN_SUP_BASEINFO 
> using org.apache.phoenix.spark
> OPTIONS(table "AN_SUP_BASEINFO ", zkUrl "172.16.12.82:2181")
> In SparkSQL I execute a query that uses a left outer join; the SQL is:
> select
> a.anchedate, b.womempnumdis, b.holdingsmsgdis
> from
> AN_BASEINFO a
> left outer join AN_SUP_BASEINFO b
> on
> a.S_EXT_NODENUM = b.S_EXT_NODENUM and a.ANCHEID = b.ANCHEID
> where
> a.ANCHEID = '2c9e87ea5bd35458015c2df4003a1025';
> The result is: 2017-05-22 00:00:00.0   NULL   NULL
> But table AN_SUP_BASEINFO actually contains a record where a.S_EXT_NODENUM = 
> b.S_EXT_NODENUM and a.ANCHEID = b.ANCHEID.
> If I add the filter condition "b.holdingsmsgdis is not null" to the SQL, the 
> result is right:
> 2017-05-22 00:00:00.0   2   1
> The SQL:
> select
> a.anchedate, b.womempnumdis, b.holdingsmsgdis
> from
> AN_BASEINFO a
> left outer join AN_SUP_BASEINFO b
> on
> a.S_EXT_NODENUM = b.S_EXT_NODENUM and a.ANCHEID = b.ANCHEID
> where
> a.ANCHEID = '2c9e87ea5bd35458015c2df4003a1025' and b.holdingsmsgdis is not null;
> The result is right: 2017-05-22 00:00:00.0   2   1
> Is there anyone who knows about this? Please help!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21645) SparkSQL Left outer join get the error result when use phoenix spark plugin

2018-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21645:


Assignee: (was: Apache Spark)

> SparkSQL Left outer join get the error result when use phoenix spark plugin
> ---
>
> Key: SPARK-21645
> URL: https://issues.apache.org/jira/browse/SPARK-21645
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
> Environment: spark2.1.0
> hbase 1.1.2
> phoenix4.10
>Reporter: shining
>Priority: Major
>
> I have two tables in Phoenix: AN_BASEINFO and AN_SUP_BASEINFO.
> Then I create the external datasource tables in SparkSQL through the Phoenix 
> Spark plugin, like:
> create table AN_BASEINFO 
> using org.apache.phoenix.spark
> OPTIONS(table "AN_BASEINFO ", zkUrl "172.16.12.82:2181")
> and 
> create table AN_SUP_BASEINFO 
> using org.apache.phoenix.spark
> OPTIONS(table "AN_SUP_BASEINFO ", zkUrl "172.16.12.82:2181")
> In SparkSQL I execute a query that uses a left outer join; the SQL is:
> select
> a.anchedate, b.womempnumdis, b.holdingsmsgdis
> from
> AN_BASEINFO a
> left outer join AN_SUP_BASEINFO b
> on
> a.S_EXT_NODENUM = b.S_EXT_NODENUM and a.ANCHEID = b.ANCHEID
> where
> a.ANCHEID = '2c9e87ea5bd35458015c2df4003a1025';
> The result is: 2017-05-22 00:00:00.0   NULL   NULL
> But table AN_SUP_BASEINFO actually contains a record where a.S_EXT_NODENUM = 
> b.S_EXT_NODENUM and a.ANCHEID = b.ANCHEID.
> If I add the filter condition "b.holdingsmsgdis is not null" to the SQL, the 
> result is right:
> 2017-05-22 00:00:00.0   2   1
> The SQL:
> select
> a.anchedate, b.womempnumdis, b.holdingsmsgdis
> from
> AN_BASEINFO a
> left outer join AN_SUP_BASEINFO b
> on
> a.S_EXT_NODENUM = b.S_EXT_NODENUM and a.ANCHEID = b.ANCHEID
> where
> a.ANCHEID = '2c9e87ea5bd35458015c2df4003a1025' and b.holdingsmsgdis is not null;
> The result is right: 2017-05-22 00:00:00.0   2   1
> Is there anyone who knows about this? Please help!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21645) SparkSQL Left outer join get the error result when use phoenix spark plugin

2018-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21645:


Assignee: Apache Spark

> SparkSQL Left outer join get the error result when use phoenix spark plugin
> ---
>
> Key: SPARK-21645
> URL: https://issues.apache.org/jira/browse/SPARK-21645
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
> Environment: spark2.1.0
> hbase 1.1.2
> phoenix4.10
>Reporter: shining
>Assignee: Apache Spark
>Priority: Major
>
> I have two tables in Phoenix: AN_BASEINFO and AN_SUP_BASEINFO.
> Then I create the external datasource tables in SparkSQL through the Phoenix 
> Spark plugin, like:
> create table AN_BASEINFO 
> using org.apache.phoenix.spark
> OPTIONS(table "AN_BASEINFO ", zkUrl "172.16.12.82:2181")
> and 
> create table AN_SUP_BASEINFO 
> using org.apache.phoenix.spark
> OPTIONS(table "AN_SUP_BASEINFO ", zkUrl "172.16.12.82:2181")
> In SparkSQL I execute a query that uses a left outer join; the SQL is:
> select
> a.anchedate, b.womempnumdis, b.holdingsmsgdis
> from
> AN_BASEINFO a
> left outer join AN_SUP_BASEINFO b
> on
> a.S_EXT_NODENUM = b.S_EXT_NODENUM and a.ANCHEID = b.ANCHEID
> where
> a.ANCHEID = '2c9e87ea5bd35458015c2df4003a1025';
> The result is: 2017-05-22 00:00:00.0   NULL   NULL
> But table AN_SUP_BASEINFO actually contains a record where a.S_EXT_NODENUM = 
> b.S_EXT_NODENUM and a.ANCHEID = b.ANCHEID.
> If I add the filter condition "b.holdingsmsgdis is not null" to the SQL, the 
> result is right:
> 2017-05-22 00:00:00.0   2   1
> The SQL:
> select
> a.anchedate, b.womempnumdis, b.holdingsmsgdis
> from
> AN_BASEINFO a
> left outer join AN_SUP_BASEINFO b
> on
> a.S_EXT_NODENUM = b.S_EXT_NODENUM and a.ANCHEID = b.ANCHEID
> where
> a.ANCHEID = '2c9e87ea5bd35458015c2df4003a1025' and b.holdingsmsgdis is not null;
> The result is right: 2017-05-22 00:00:00.0   2   1
> Is there anyone who knows about this? Please help!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21645) SparkSQL Left outer join get the error result when use phoenix spark plugin

2018-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453405#comment-16453405
 ] 

Apache Spark commented on SPARK-21645:
--

User 'shining1989' has created a pull request for this issue:
https://github.com/apache/spark/pull/21161

> SparkSQL Left outer join get the error result when use phoenix spark plugin
> ---
>
> Key: SPARK-21645
> URL: https://issues.apache.org/jira/browse/SPARK-21645
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
> Environment: spark2.1.0
> hbase 1.1.2
> phoenix4.10
>Reporter: shining
>Priority: Major
>
> I have two tables in Phoenix: AN_BASEINFO and AN_SUP_BASEINFO.
> Then I create the external datasource tables in SparkSQL through the Phoenix 
> Spark plugin, like:
> create table AN_BASEINFO 
> using org.apache.phoenix.spark
> OPTIONS(table "AN_BASEINFO ", zkUrl "172.16.12.82:2181")
> and 
> create table AN_SUP_BASEINFO 
> using org.apache.phoenix.spark
> OPTIONS(table "AN_SUP_BASEINFO ", zkUrl "172.16.12.82:2181")
> In SparkSQL I execute a query that uses a left outer join; the SQL is:
> select
> a.anchedate, b.womempnumdis, b.holdingsmsgdis
> from
> AN_BASEINFO a
> left outer join AN_SUP_BASEINFO b
> on
> a.S_EXT_NODENUM = b.S_EXT_NODENUM and a.ANCHEID = b.ANCHEID
> where
> a.ANCHEID = '2c9e87ea5bd35458015c2df4003a1025';
> The result is: 2017-05-22 00:00:00.0   NULL   NULL
> But table AN_SUP_BASEINFO actually contains a record where a.S_EXT_NODENUM = 
> b.S_EXT_NODENUM and a.ANCHEID = b.ANCHEID.
> If I add the filter condition "b.holdingsmsgdis is not null" to the SQL, the 
> result is right:
> 2017-05-22 00:00:00.0   2   1
> The SQL:
> select
> a.anchedate, b.womempnumdis, b.holdingsmsgdis
> from
> AN_BASEINFO a
> left outer join AN_SUP_BASEINFO b
> on
> a.S_EXT_NODENUM = b.S_EXT_NODENUM and a.ANCHEID = b.ANCHEID
> where
> a.ANCHEID = '2c9e87ea5bd35458015c2df4003a1025' and b.holdingsmsgdis is not null;
> The result is right: 2017-05-22 00:00:00.0   2   1
> Is there anyone who knows about this? Please help!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21661) SparkSQL can't merge load table from Hadoop

2018-04-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453394#comment-16453394
 ] 

Hyukjin Kwon commented on SPARK-21661:
--

Another note: we now have {{spark.hadoopRDD.ignoreEmptySplits}} configuration 
too for HadoopRDD related operations.
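
A hedged Scala sketch of using that configuration (it defaults to false; the
application name and path below are illustrative, the path being taken from the
report that follows):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ignore-empty-splits-demo")
  .master("local[*]")
  .config("spark.hadoopRDD.ignoreEmptySplits", "true") // skip 0-byte input splits
  .getOrCreate()

// HadoopRDD-based reads (e.g. textFile over a directory containing many 0 B files)
// then no longer produce a partition, and hence an output file, per empty split.
val lines = spark.sparkContext.textFile("hdfs://xxx:9000/data/t1")
println(lines.getNumPartitions)
{code}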

> SparkSQL can't merge load table from Hadoop
> ---
>
> Key: SPARK-21661
> URL: https://issues.apache.org/jira/browse/SPARK-21661
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dapeng Sun
>Assignee: Li Yuanjian
>Priority: Major
>
> Here is the original text of external table on HDFS:
> {noformat}
> PermissionOwner   Group   SizeLast Modified   Replication Block 
> Size  Name
> -rw-r--r--rootsupergroup  0 B 8/6/2017, 11:43:03 PM   3   
> 256 MB  income_band_001.dat
> -rw-r--r--rootsupergroup  0 B 8/6/2017, 11:39:31 PM   3   
> 256 MB  income_band_002.dat
> ...
> -rw-r--r--rootsupergroup  327 B   8/6/2017, 11:44:47 PM   3   
> 256 MB  income_band_530.dat
> {noformat}
> After the SparkSQL load, every input file produces an output file, even the 
> files that are 0 B. For a load with Hive, the data files would be merged 
> according to the size of the original files.
> Reproduce:
> {noformat}
> CREATE EXTERNAL TABLE t1 (a int,b string)  STORED AS TEXTFILE LOCATION 
> "hdfs://xxx:9000/data/t1"
> CREATE TABLE t2 STORED AS PARQUET AS SELECT * FROM t1;
> {noformat}
> The table t2 has many small files without data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23151) Provide a distribution of Spark with Hadoop 3.0

2018-04-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453368#comment-16453368
 ] 

Hyukjin Kwon commented on SPARK-23151:
--

[~ste...@apache.org], is it an exact duplicate of SPARK-23534? If so, I would 
like to leave this resolved since SPARK-23534 is active now.

> Provide a distribution of Spark with Hadoop 3.0
> ---
>
> Key: SPARK-23151
> URL: https://issues.apache.org/jira/browse/SPARK-23151
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Louis Burke
>Priority: Major
>
> Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark 
> package
> only supports Hadoop 2.7 i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication 
> is
> that using up to date Kinesis libraries alongside s3 causes a clash w.r.t
> aws-java-sdk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24094) Change description strings of v2 streaming sources to reflect the change

2018-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453348#comment-16453348
 ] 

Apache Spark commented on SPARK-24094:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/21160

> Change description strings of v2 streaming sources to reflect the change
> 
>
> Key: SPARK-24094
> URL: https://issues.apache.org/jira/browse/SPARK-24094
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24094) Change description strings of v2 streaming sources to reflect the change

2018-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24094:


Assignee: Tathagata Das  (was: Apache Spark)

> Change description strings of v2 streaming sources to reflect the change
> 
>
> Key: SPARK-24094
> URL: https://issues.apache.org/jira/browse/SPARK-24094
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24089) DataFrame.write().mode(SaveMode.Append).insertInto(TABLE)

2018-04-25 Thread kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453345#comment-16453345
 ] 

kumar edited comment on SPARK-24089 at 4/26/18 1:50 AM:


UNION just combines the DataFrames; I am not looking for that. I want to 
insert (append) data into an already existing table. 
{code:java}
public enum SaveMode {
  /**
   * Append mode means that when saving a DataFrame to a data source, if 
data/table already exists, contents of the DataFrame are expected to be 
appended to existing data.
   *
   * @since 1.3.0
   */
  Append
{code}
_These comments are taken from the SaveMode enum for the Append key_; this is 
exactly what I am looking for, and I believe it is what I did in my code. It is 
not working: the data is not appended and an exception is thrown. That is the 
issue I am facing. Please check it.
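
A hedged Scala sketch of the Append semantics quoted above, shown against a
catalog table created with saveAsTable rather than a temp view (insertInto
needs an existing table whose schema matches the DataFrame). The file names
here are illustrative, and this names a different approach than the temp-view
code in the report below:
{code:scala}
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("append-demo").master("local[*]").getOrCreate()

// First file: create the table in the catalog.
spark.read.text("log1.txt").write.saveAsTable("mylogs")

// Second file: append into the existing table (schemas must match).
spark.read.text("log2.txt").write.mode(SaveMode.Append).insertInto("mylogs")

spark.sql("SELECT count(*) FROM mylogs").show()
{code}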


was (Author: rkrgarlapati):
UNION just joins the DataFrames, i am not looking for that. I want 
insert(append) data to already existing table. 
{code:java}
public enum SaveMode {
  /**
   * Append mode means that when saving a DataFrame to a data source, if 
data/table already exists, contents of the DataFrame are expected to be 
appended to existing data.
   *
   * @since 1.3.0
   */
  Append
{code}
_These comments taken from SaveMode enum for Append key_, this is what exactly 
i am looking for. I believe that's what i did in my code. It's not working, 
data is not appending, throwing exception, that's the issue i am facing. Please 
check it.

> DataFrame.write().mode(SaveMode.Append).insertInto(TABLE) 
> --
>
> Key: SPARK-24089
> URL: https://issues.apache.org/jira/browse/SPARK-24089
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: kumar
>Priority: Major
>  Labels: bug
>
> I am completely stuck with this issue, unable to progress further. For more 
> info pls refer this post : 
> [https://stackoverflow.com/questions/49994085/spark-sql-2-3-dataframe-savemode-append-issue]
> I want to load multiple files one by one, don't want to load all files at a 
> time. To achieve this i used SaveMode.Append, so that 2nd file data will be 
> added to 1st file data in database, but it's throwing exception.
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved 
> operator 'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, 
> false;;
> 'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, false
> +- LogicalRDD [a1#22, b1#23, c1#24, d1#25], false
> {code}
> Code:
> {code:java}
> package com.log;
> import com.log.common.RegexMatch;
> import com.log.spark.SparkProcessor;
> import org.apache.spark.SparkContext;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import org.apache.spark.storage.StorageLevel;
> import java.util.ArrayList;
> import java.util.List;
> public class TestApp {
> private SparkSession spark;
> private SparkContext sparkContext;
> private SQLContext sqlContext;
> public TestApp() {
> SparkSession spark = SparkSession.builder().appName("Simple 
> Application")
> .config("spark.master", "local").getOrCreate();
> SparkContext sc = spark.sparkContext();
> this.spark = spark;
> this.sparkContext = sc;
> }
> public static void main(String[] args) {
> TestApp app = new TestApp();
> String[] afiles = {"C:\\Users\\test\\Desktop\\logs\\log1.txt",
> "C:\\Users\\test\\Desktop\\logs\\log2.txt"};
> for (String file : afiles) {
> app.writeFileToSchema(file);
> }
> }
> public void writeFileToSchema(String filePath) {
> StructType schema = getSchema();
> JavaRDD rowRDD = getRowRDD(filePath);
> if (spark.catalog().tableExists("mylogs")) {
> logDataFrame = spark.createDataFrame(rowRDD, schema);
> logDataFrame.createOrReplaceTempView("temptable");
> 
> logDataFrame.write().mode(SaveMode.Append).insertInto("mylogs");//exception
> } else {
> logDataFrame = spark.createDataFrame(rowRDD, schema);
> logDataFrame.createOrReplaceTempView("mylogs");
> }
> Dataset results = spark.sql("SELECT count(b1) FROM mylogs");
> List allrows = results.collectAsList();
> System.out.println("Count:"+allrows);
> sqlContext = logDataFrame.sqlContext();
> }
> Dataset logDataFrame;
> public List getTagList() {
> Dataset results = sqlContext.sql("SELECT distinct(b1) FROM 

[jira] [Assigned] (SPARK-24094) Change description strings of v2 streaming sources to reflect the change

2018-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24094:


Assignee: Apache Spark  (was: Tathagata Das)

> Change description strings of v2 streaming sources to reflect the change
> 
>
> Key: SPARK-24094
> URL: https://issues.apache.org/jira/browse/SPARK-24094
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24089) DataFrame.write().mode(SaveMode.Append).insertInto(TABLE)

2018-04-25 Thread kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453345#comment-16453345
 ] 

kumar commented on SPARK-24089:
---

UNION just joins the DataFrames, i am not looking for that. I want 
insert(append) data to already existing table. 
{code:java}
public enum SaveMode {
  /**
   * Append mode means that when saving a DataFrame to a data source, if 
data/table already exists, contents of the DataFrame are expected to be 
appended to existing data.
   *
   * @since 1.3.0
   */
  Append
{code}
_These comments taken from SaveMode enum for Append key_, this is what exactly 
i am looking for. I believe that's what i did in my code. It's not working, 
data is not appending, throwing exception, that's the issue i am facing. Please 
check it.

> DataFrame.write().mode(SaveMode.Append).insertInto(TABLE) 
> --
>
> Key: SPARK-24089
> URL: https://issues.apache.org/jira/browse/SPARK-24089
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: kumar
>Priority: Major
>  Labels: bug
>
> I am completely stuck with this issue, unable to progress further. For more 
> info pls refer this post : 
> [https://stackoverflow.com/questions/49994085/spark-sql-2-3-dataframe-savemode-append-issue]
> I want to load multiple files one by one, don't want to load all files at a 
> time. To achieve this i used SaveMode.Append, so that 2nd file data will be 
> added to 1st file data in database, but it's throwing exception.
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved 
> operator 'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, 
> false;;
> 'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, false
> +- LogicalRDD [a1#22, b1#23, c1#24, d1#25], false
> {code}
> Code:
> {code:java}
> package com.log;
> import com.log.common.RegexMatch;
> import com.log.spark.SparkProcessor;
> import org.apache.spark.SparkContext;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import org.apache.spark.storage.StorageLevel;
> import java.util.ArrayList;
> import java.util.List;
> public class TestApp {
> private SparkSession spark;
> private SparkContext sparkContext;
> private SQLContext sqlContext;
> public TestApp() {
> SparkSession spark = SparkSession.builder().appName("Simple 
> Application")
> .config("spark.master", "local").getOrCreate();
> SparkContext sc = spark.sparkContext();
> this.spark = spark;
> this.sparkContext = sc;
> }
> public static void main(String[] args) {
> TestApp app = new TestApp();
> String[] afiles = {"C:\\Users\\test\\Desktop\\logs\\log1.txt",
> "C:\\Users\\test\\Desktop\\logs\\log2.txt"};
> for (String file : afiles) {
> app.writeFileToSchema(file);
> }
> }
> public void writeFileToSchema(String filePath) {
> StructType schema = getSchema();
> JavaRDD rowRDD = getRowRDD(filePath);
> if (spark.catalog().tableExists("mylogs")) {
> logDataFrame = spark.createDataFrame(rowRDD, schema);
> logDataFrame.createOrReplaceTempView("temptable");
> 
> logDataFrame.write().mode(SaveMode.Append).insertInto("mylogs");//exception
> } else {
> logDataFrame = spark.createDataFrame(rowRDD, schema);
> logDataFrame.createOrReplaceTempView("mylogs");
> }
> Dataset results = spark.sql("SELECT count(b1) FROM mylogs");
> List allrows = results.collectAsList();
> System.out.println("Count:"+allrows);
> sqlContext = logDataFrame.sqlContext();
> }
> Dataset logDataFrame;
> public List getTagList() {
> Dataset results = sqlContext.sql("SELECT distinct(b1) FROM 
> mylogs");
> List allrows = results.collectAsList();
> return allrows;
> }
> public StructType getSchema() {
> String schemaString = "a1 b1 c1 d1";
> List fields = new ArrayList<>();
> for (String fieldName : schemaString.split(" ")) {
> StructField field = DataTypes.createStructField(fieldName, 
> DataTypes.StringType, true);
> fields.add(field);
> }
> StructType schema = DataTypes.createStructType(fields);
> return schema;
> }
> public JavaRDD getRowRDD(String filePath) {
> JavaRDD logRDD = sparkContext.textFile(filePath, 
> 1).toJavaRDD();
> RegexMatch reg = new 

[jira] [Commented] (SPARK-24089) DataFrame.write().mode(SaveMode.Append).insertInto(TABLE)

2018-04-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453336#comment-16453336
 ] 

Hyukjin Kwon commented on SPARK-24089:
--

(Critical+ is also usually reserved for committers, actually)

> DataFrame.write().mode(SaveMode.Append).insertInto(TABLE) 
> --
>
> Key: SPARK-24089
> URL: https://issues.apache.org/jira/browse/SPARK-24089
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: kumar
>Priority: Major
>  Labels: bug
>
> I am completely stuck with this issue, unable to progress further. For more 
> info pls refer this post : 
> [https://stackoverflow.com/questions/49994085/spark-sql-2-3-dataframe-savemode-append-issue]
> I want to load multiple files one by one, don't want to load all files at a 
> time. To achieve this i used SaveMode.Append, so that 2nd file data will be 
> added to 1st file data in database, but it's throwing exception.
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved 
> operator 'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, 
> false;;
> 'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, false
> +- LogicalRDD [a1#22, b1#23, c1#24, d1#25], false
> {code}
> Code:
> {code:java}
> package com.log;
> import com.log.common.RegexMatch;
> import com.log.spark.SparkProcessor;
> import org.apache.spark.SparkContext;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import org.apache.spark.storage.StorageLevel;
> import java.util.ArrayList;
> import java.util.List;
> public class TestApp {
> private SparkSession spark;
> private SparkContext sparkContext;
> private SQLContext sqlContext;
> public TestApp() {
> SparkSession spark = SparkSession.builder().appName("Simple 
> Application")
> .config("spark.master", "local").getOrCreate();
> SparkContext sc = spark.sparkContext();
> this.spark = spark;
> this.sparkContext = sc;
> }
> public static void main(String[] args) {
> TestApp app = new TestApp();
> String[] afiles = {"C:\\Users\\test\\Desktop\\logs\\log1.txt",
> "C:\\Users\\test\\Desktop\\logs\\log2.txt"};
> for (String file : afiles) {
> app.writeFileToSchema(file);
> }
> }
> public void writeFileToSchema(String filePath) {
> StructType schema = getSchema();
> JavaRDD<Row> rowRDD = getRowRDD(filePath);
> if (spark.catalog().tableExists("mylogs")) {
> logDataFrame = spark.createDataFrame(rowRDD, schema);
> logDataFrame.createOrReplaceTempView("temptable");
> 
> logDataFrame.write().mode(SaveMode.Append).insertInto("mylogs");//exception
> } else {
> logDataFrame = spark.createDataFrame(rowRDD, schema);
> logDataFrame.createOrReplaceTempView("mylogs");
> }
> Dataset<Row> results = spark.sql("SELECT count(b1) FROM mylogs");
> List<Row> allrows = results.collectAsList();
> System.out.println("Count:"+allrows);
> sqlContext = logDataFrame.sqlContext();
> }
> Dataset<Row> logDataFrame;
> public List<Row> getTagList() {
> Dataset<Row> results = sqlContext.sql("SELECT distinct(b1) FROM 
> mylogs");
> List<Row> allrows = results.collectAsList();
> return allrows;
> }
> public StructType getSchema() {
> String schemaString = "a1 b1 c1 d1";
> List<StructField> fields = new ArrayList<>();
> for (String fieldName : schemaString.split(" ")) {
> StructField field = DataTypes.createStructField(fieldName, 
> DataTypes.StringType, true);
> fields.add(field);
> }
> StructType schema = DataTypes.createStructType(fields);
> return schema;
> }
> public JavaRDD<Row> getRowRDD(String filePath) {
> JavaRDD<String> logRDD = sparkContext.textFile(filePath, 
> 1).toJavaRDD();
> RegexMatch reg = new RegexMatch();
> JavaRDD<Row> rowRDD = logRDD
> .map((Function<String, Row>) line -> {
> String[] st = line.split(" ");
> return RowFactory.create(st[0], st[1], st[2], st[3]);
> });
> rowRDD.persist(StorageLevel.MEMORY_ONLY());
> return rowRDD;
> }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For 

[jira] [Updated] (SPARK-24089) DataFrame.write().mode(SaveMode.Append).insertInto(TABLE)

2018-04-25 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24089:
-
Priority: Major  (was: Critical)

> DataFrame.write().mode(SaveMode.Append).insertInto(TABLE) 
> --
>
> Key: SPARK-24089
> URL: https://issues.apache.org/jira/browse/SPARK-24089
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: kumar
>Priority: Major
>  Labels: bug
>
> I am completely stuck with this issue, unable to progress further. For more 
> info pls refer this post : 
> [https://stackoverflow.com/questions/49994085/spark-sql-2-3-dataframe-savemode-append-issue]
> I want to load multiple files one by one, don't want to load all files at a 
> time. To achieve this i used SaveMode.Append, so that 2nd file data will be 
> added to 1st file data in database, but it's throwing exception.
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved 
> operator 'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, 
> false;;
> 'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, false
> +- LogicalRDD [a1#22, b1#23, c1#24, d1#25], false
> {code}
> Code:
> {code:java}
> package com.log;
> import com.log.common.RegexMatch;
> import com.log.spark.SparkProcessor;
> import org.apache.spark.SparkContext;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import org.apache.spark.storage.StorageLevel;
> import java.util.ArrayList;
> import java.util.List;
> public class TestApp {
> private SparkSession spark;
> private SparkContext sparkContext;
> private SQLContext sqlContext;
> public TestApp() {
> SparkSession spark = SparkSession.builder().appName("Simple 
> Application")
> .config("spark.master", "local").getOrCreate();
> SparkContext sc = spark.sparkContext();
> this.spark = spark;
> this.sparkContext = sc;
> }
> public static void main(String[] args) {
> TestApp app = new TestApp();
> String[] afiles = {"C:\\Users\\test\\Desktop\\logs\\log1.txt",
> "C:\\Users\\test\\Desktop\\logs\\log2.txt"};
> for (String file : afiles) {
> app.writeFileToSchema(file);
> }
> }
> public void writeFileToSchema(String filePath) {
> StructType schema = getSchema();
> JavaRDD<Row> rowRDD = getRowRDD(filePath);
> if (spark.catalog().tableExists("mylogs")) {
> logDataFrame = spark.createDataFrame(rowRDD, schema);
> logDataFrame.createOrReplaceTempView("temptable");
> 
> logDataFrame.write().mode(SaveMode.Append).insertInto("mylogs");//exception
> } else {
> logDataFrame = spark.createDataFrame(rowRDD, schema);
> logDataFrame.createOrReplaceTempView("mylogs");
> }
> Dataset<Row> results = spark.sql("SELECT count(b1) FROM mylogs");
> List<Row> allrows = results.collectAsList();
> System.out.println("Count:"+allrows);
> sqlContext = logDataFrame.sqlContext();
> }
> Dataset<Row> logDataFrame;
> public List<Row> getTagList() {
> Dataset<Row> results = sqlContext.sql("SELECT distinct(b1) FROM 
> mylogs");
> List<Row> allrows = results.collectAsList();
> return allrows;
> }
> public StructType getSchema() {
> String schemaString = "a1 b1 c1 d1";
> List<StructField> fields = new ArrayList<>();
> for (String fieldName : schemaString.split(" ")) {
> StructField field = DataTypes.createStructField(fieldName, 
> DataTypes.StringType, true);
> fields.add(field);
> }
> StructType schema = DataTypes.createStructType(fields);
> return schema;
> }
> public JavaRDD<Row> getRowRDD(String filePath) {
> JavaRDD<String> logRDD = sparkContext.textFile(filePath, 
> 1).toJavaRDD();
> RegexMatch reg = new RegexMatch();
> JavaRDD<Row> rowRDD = logRDD
> .map((Function<String, Row>) line -> {
> String[] st = line.split(" ");
> return RowFactory.create(st[0], st[1], st[2], st[3]);
> });
> rowRDD.persist(StorageLevel.MEMORY_ONLY());
> return rowRDD;
> }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24086) Exception while executing spark streaming examples

2018-04-25 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24086.
--
Resolution: Invalid

From a quick look, that sounds like it's because you didn't provide a profile 
for Kafka. It sounds more like a question, which should usually go to the 
mailing lists; you could get a quicker and better answer there.

> Exception while executing spark streaming examples
> --
>
> Key: SPARK-24086
> URL: https://issues.apache.org/jira/browse/SPARK-24086
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 2.3.0
>Reporter: Chandra Hasan
>Priority: Major
>
> After running mvn clean package, I tried to execute one of the Spark example 
> programs, JavaDirectKafkaWordCount.java, but it throws the following exception.
> {code:java}
> [cloud-user@server-2 examples]$ run-example 
> streaming.JavaDirectKafkaWordCount 192.168.0.4:9092 msu
> 2018-04-25 09:39:22 WARN NativeCodeLoader:62 - Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 2018-04-25 09:39:22 INFO SparkContext:54 - Running Spark version 2.3.0
> 2018-04-25 09:39:22 INFO SparkContext:54 - Submitted application: 
> JavaDirectKafkaWordCount
> 2018-04-25 09:39:22 INFO SecurityManager:54 - Changing view acls to: 
> cloud-user
> 2018-04-25 09:39:22 INFO SecurityManager:54 - Changing modify acls to: 
> cloud-user
> 2018-04-25 09:39:22 INFO SecurityManager:54 - Changing view acls groups to:
> 2018-04-25 09:39:22 INFO SecurityManager:54 - Changing modify acls groups to:
> 2018-04-25 09:39:22 INFO SecurityManager:54 - SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(cloud-user); 
> groups with view permissions: Set(); users with modify permissions: 
> Set(cloud-user); groups with modify permissions: Set()
> 2018-04-25 09:39:23 INFO Utils:54 - Successfully started service 
> 'sparkDriver' on port 59333.
> 2018-04-25 09:39:23 INFO SparkEnv:54 - Registering MapOutputTracker
> 2018-04-25 09:39:23 INFO SparkEnv:54 - Registering BlockManagerMaster
> 2018-04-25 09:39:23 INFO BlockManagerMasterEndpoint:54 - Using 
> org.apache.spark.storage.DefaultTopologyMapper for getting topology 
> information
> 2018-04-25 09:39:23 INFO BlockManagerMasterEndpoint:54 - 
> BlockManagerMasterEndpoint up
> 2018-04-25 09:39:23 INFO DiskBlockManager:54 - Created local directory at 
> /tmp/blockmgr-6fc11fc1-f638-42ea-a9df-dc01fb81b7b6
> 2018-04-25 09:39:23 INFO MemoryStore:54 - MemoryStore started with capacity 
> 366.3 MB
> 2018-04-25 09:39:23 INFO SparkEnv:54 - Registering OutputCommitCoordinator
> 2018-04-25 09:39:23 INFO log:192 - Logging initialized @1825ms
> 2018-04-25 09:39:23 INFO Server:346 - jetty-9.3.z-SNAPSHOT
> 2018-04-25 09:39:23 INFO Server:414 - Started @1900ms
> 2018-04-25 09:39:23 INFO AbstractConnector:278 - Started 
> ServerConnector@6813a331{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
> 2018-04-25 09:39:23 INFO Utils:54 - Successfully started service 'SparkUI' on 
> port 4040.
> 2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@4f7c0be3{/jobs,null,AVAILABLE,@Spark}
> 2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@4cfbaf4{/jobs/json,null,AVAILABLE,@Spark}
> 2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@58faa93b{/jobs/job,null,AVAILABLE,@Spark}
> 2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@127d7908{/jobs/job/json,null,AVAILABLE,@Spark}
> 2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@6b9c69a9{/stages,null,AVAILABLE,@Spark}
> 2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@6622a690{/stages/json,null,AVAILABLE,@Spark}
> 2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@30b9eadd{/stages/stage,null,AVAILABLE,@Spark}
> 2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@3249a1ce{/stages/stage/json,null,AVAILABLE,@Spark}
> 2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@4dd94a58{/stages/pool,null,AVAILABLE,@Spark}
> 2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@2f4919b0{/stages/pool/json,null,AVAILABLE,@Spark}
> 2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@a8a8b75{/storage,null,AVAILABLE,@Spark}
> 2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@75b21c3b{/storage/json,null,AVAILABLE,@Spark}
> 2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@72be135f{/storage/rdd,null,AVAILABLE,@Spark}
> 2018-04-25 09:39:23 INFO 

[jira] [Resolved] (SPARK-24092) spark.python.worker.reuse does not work?

2018-04-25 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24092.
--
Resolution: Invalid

Questions should go to the mailing list; you could get a better and quicker 
answer there. Let's file an issue once we are sure about it.

> spark.python.worker.reuse does not work?
> 
>
> Key: SPARK-24092
> URL: https://issues.apache.org/jira/browse/SPARK-24092
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: David Figueroa
>Priority: Minor
>
> {{spark.python.worker.reuse}} is true by default, but even after explicitly 
> setting it to true, the code below does not print the same Python worker 
> process ids.
> {code:python|title=procid.py|borderStyle=solid}
> import os
> from pyspark.sql import SparkSession
> def return_pid(_): yield os.getpid()
> spark = SparkSession.builder.getOrCreate()
> pids = set(spark.sparkContext.range(32).mapPartitions(return_pid).collect())
> print(pids)
> pids = set(spark.sparkContext.range(32).mapPartitions(return_pid).collect())
> print(pids){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24069) Add array_max / array_min functions

2018-04-25 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24069:


Assignee: Hyukjin Kwon

> Add array_max / array_min functions
> ---
>
> Key: SPARK-24069
> URL: https://issues.apache.org/jira/browse/SPARK-24069
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> Add R versions of SPARK-23918 and SPARK-23917



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24069) Add array_max / array_min functions

2018-04-25 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24069.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21142
[https://github.com/apache/spark/pull/21142]

> Add array_max / array_min functions
> ---
>
> Key: SPARK-24069
> URL: https://issues.apache.org/jira/browse/SPARK-24069
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> Add R versions of SPARK-23918 and SPARK-23917



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24036) Stateful operators in continuous processing

2018-04-25 Thread Jungtaek Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453290#comment-16453290
 ] 

Jungtaek Lim commented on SPARK-24036:
--

Btw, I would like to say that the idea of the iterator hack and the epoch RPC 
coordinator is awesome given the current goal: only the source offsets are 
stateful in a query.

> Stateful operators in continuous processing
> ---
>
> Key: SPARK-24036
> URL: https://issues.apache.org/jira/browse/SPARK-24036
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> The first iteration of continuous processing in Spark 2.3 does not work with 
> stateful operators.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24094) Change description strings of v2 streaming sources to reflect the change

2018-04-25 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-24094:
-

 Summary: Change description strings of v2 streaming sources to 
reflect the change
 Key: SPARK-24094
 URL: https://issues.apache.org/jira/browse/SPARK-24094
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Tathagata Das
Assignee: Tathagata Das






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24070) TPC-DS Performance Tests for Parquet 1.10.0 Upgrade

2018-04-25 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453258#comment-16453258
 ] 

Takeshi Yamamuro commented on SPARK-24070:
--

Sure, I checked the numbers and see: 
[https://docs.google.com/spreadsheets/d/18EvWZYDqlC_93DI_JaSKSO115OwZ9BfX7Zs5SnU21fA/edit?usp=sharing]
Then, I found no regression at least in TPC-DS.

> TPC-DS Performance Tests for Parquet 1.10.0 Upgrade
> ---
>
> Key: SPARK-24070
> URL: https://issues.apache.org/jira/browse/SPARK-24070
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Takeshi Yamamuro
>Priority: Major
>
> TPC-DS performance evaluation of Apache Spark Parquet 1.8.2 and 1.10.0. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22210) Online LDA variationalTopicInference should use random seed to have stable behavior

2018-04-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453228#comment-16453228
 ] 

Joseph K. Bradley commented on SPARK-22210:
---

[~lu.DB] Would you like to do this?  It should be a matter of taking the "seed" 
Param passed to LDA and making sure it (or a seed generated from it) is passed 
down to this method.  Thanks!
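
A hedged Scala sketch of the general pattern, assuming Breeze's Gamma
distribution with an explicitly seeded RandBasis; the helper name and shape
parameters below are illustrative, not the actual MLlib change.
{code:scala}
import breeze.linalg.DenseVector
import breeze.stats.distributions.{Gamma, RandBasis, ThreadLocalRandomGenerator}
import org.apache.commons.math3.random.MersenneTwister

// Draw a gamma-distributed vector with an explicit seed, so repeated runs of
// the inference produce the same initialization. Illustrative only.
def seededGammaVector(k: Int, seed: Long): DenseVector[Double] = {
  implicit val basis: RandBasis =
    new RandBasis(new ThreadLocalRandomGenerator(new MersenneTwister(seed)))
  val gamma = Gamma(100.0, 1.0 / 100.0)
  DenseVector.fill(k)(gamma.draw())
}
{code}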

> Online LDA variationalTopicInference  should use random seed to have stable 
> behavior
> 
>
> Key: SPARK-22210
> URL: https://issues.apache.org/jira/browse/SPARK-22210
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> https://github.com/apache/spark/blob/16fab6b0ef3dcb33f92df30e17680922ad5fb672/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L582
> Gamma distribution should use random seed to have consistent behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24036) Stateful operators in continuous processing

2018-04-25 Thread Jungtaek Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453209#comment-16453209
 ] 

Jungtaek Lim edited comment on SPARK-24036 at 4/25/18 10:54 PM:


Maybe it's better to share what I've observed about continuous mode so far.
 * It leverages an iterator hack to make a logical batch (epoch) in the stream.
 ** While the iterator works differently from a normal one, it doesn't touch 
existing operators, based on the assumption that all operators are chained and 
fit into a single stage.
 ** With this assumption, only WriteToContinuousDataSourceExec needs to know 
how to deal with the iterator hack.
 ** The assumption also requires that there be no repartition, which most 
stateful operators need.
 * Because of the hack, it doesn't actually make epoch markers flow through 
downstream operators.
 ** That is mandatory for applying distributed snapshots, but it might require 
a non-trivial change to the existing model, since checkpoints should be handled 
by each stateful operator and stored in a distributed manner, and the 
coordinator should be able to verify that snapshots from all tasks were taken 
correctly.
 ** This would be an unnecessary change for batch, and would make the existing 
model much more complicated.
 ** It would also bring latency concerns, since each operator should stop 
processing while taking a snapshot. (I guess sending or storing the snapshot 
could still be done asynchronously.)
 ** If there is more than one upstream, the operator should align the upstreams 
so that a snapshot captures only the proper data within the epoch.

So there is a huge challenge in extending continuous mode on the existing model 
to support stateful exactly-once (not end-to-end exactly-once, since that also 
depends on the sink), and I'd like to see a follow-up idea/design doc around 
continuous mode to understand its direction: whether to keep relying on this 
assumption and explore further (which may need more hacks/workarounds), or to 
discard the assumption and redesign.

Most features are already supported in micro-batch mode, so I'd also like to 
check the goal of continuous mode. Is it to cover all or most of the features 
supported by micro-batch? Or is the goal of continuous mode only to cover 
low-latency use cases?

 


was (Author: kabhwan):
Maybe better to share what I've observed from continuous mode so far.
 * It leverages iterator hack to make logical batch (epoch) in stream.
 ** While iterator works different from normal, it doesn't touch existing 
operators by putting assumption that all operators are chained and fit to 
single stage.
 ** With this assumption, only WriteToContinuousDataSourceExec needs to know 
how to deal with iterator hack.
 ** Above assumption requires no repartition, which most of stateful operators 
need to deal with.
 * Based on the hack, actually it doesn't put epoch marker flow through 
downstreams.
 ** To apply distributed snapshot it is mandatory, but it might require 
non-trivial change of existing model, since checkpoint should be handled from 
each stateful operator and stored in distributed manner, and coordinator should 
be able to check snapshots from all tasks are taken correctly.
 ** This would be unnecessary change for batch, and making existing model being 
much complicated.
 ** This would bring latency concerns, since each operator should stop 
processing while taking a snapshot. (I guess sending or storing snapshot still 
could be done asynchronously.)
 ** If there're more than one upstreams, it should arrange sequences between 
upstreams to take a snapshot with only proper data within epoch.

So there is a huge challenge with existing model to extend continuous mode to 
support stateful exactly-once (not about end-to-end exactly once, since it also 
depends on sink), and I'd like to see the follow-up idea/design doc around 
continuous mode to see the direction of continuous mode: whether relying on 
such assumption and try to explore (may need to have more hacks/workarounds), 
or willing to discard assumption and redesign.

Most of features are supported with micro-batch manner, so also would like to 
see the goal of continuous mode. Is it to cover all or most of features being 
supported with micro-batch? Or is the goal of continuous mode only to cover low 
latency use cases?

 

> Stateful operators in continuous processing
> ---
>
> Key: SPARK-24036
> URL: https://issues.apache.org/jira/browse/SPARK-24036
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> The first iteration of continuous processing in Spark 2.3 does not work with 
> stateful operators.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-24036) Stateful operators in continuous processing

2018-04-25 Thread Jungtaek Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453209#comment-16453209
 ] 

Jungtaek Lim commented on SPARK-24036:
--

Maybe it's better to share what I've observed about continuous mode so far.
 * It leverages an iterator hack to make a logical batch (epoch) in the stream.
 ** While the iterator works differently from a normal one, it doesn't touch 
existing operators, based on the assumption that all operators are chained and 
fit into a single stage.
 ** With this assumption, only WriteToContinuousDataSourceExec needs to know 
how to deal with the iterator hack.
 ** The assumption also requires that there be no repartition, which most 
stateful operators need.
 * Because of the hack, it doesn't actually make epoch markers flow through 
downstream operators.
 ** That is mandatory for applying distributed snapshots, but it might require 
a non-trivial change to the existing model, since checkpoints should be handled 
by each stateful operator and stored in a distributed manner, and the 
coordinator should be able to verify that snapshots from all tasks were taken 
correctly.
 ** This would be an unnecessary change for batch, and would make the existing 
model much more complicated.
 ** It would also bring latency concerns, since each operator should stop 
processing while taking a snapshot. (I guess sending or storing the snapshot 
could still be done asynchronously.)
 ** If there is more than one upstream, the operator should align the upstreams 
so that a snapshot captures only the proper data within the epoch.

So there is a huge challenge in extending continuous mode on the existing model 
to support stateful exactly-once (not end-to-end exactly-once, since that also 
depends on the sink), and I'd like to see a follow-up idea/design doc around 
continuous mode to understand its direction: whether to keep relying on this 
assumption and explore further (which may need more hacks/workarounds), or to 
discard the assumption and redesign.

Most features are already supported in micro-batch mode, so I'd also like to 
see the goal of continuous mode. Is it to cover all or most of the features 
supported by micro-batch? Or is the goal of continuous mode only to cover 
low-latency use cases?
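
To make the marker-alignment point above concrete, here is an illustrative Scala toy 
(not Spark code; all names are made up) of a stateful task that checkpoints its state 
for an epoch only once the epoch marker has arrived from every upstream:

{code:scala}
// Illustrative toy, not Spark code: a stateful task fed by several upstream partitions.
// A real implementation would additionally have to buffer rows that arrive for the
// next epoch from upstreams that have already sent their marker.
sealed trait Event
case class Data(partition: Int, key: String) extends Event
case class EpochMarker(partition: Int, epoch: Long) extends Event

class AlignedStatefulTask(upstreams: Int, saveSnapshot: (Long, Map[String, Long]) => Unit) {
  private var state = Map.empty[String, Long]                    // e.g. a running count per key
  private val markers = scala.collection.mutable.Map.empty[Long, Set[Int]]

  def onEvent(event: Event): Unit = event match {
    case Data(_, key) =>
      state = state.updated(key, state.getOrElse(key, 0L) + 1)
    case EpochMarker(partition, epoch) =>
      val arrived = markers.getOrElse(epoch, Set.empty[Int]) + partition
      markers(epoch) = arrived
      if (arrived.size == upstreams) saveSnapshot(epoch, state)  // all upstreams aligned
  }
}
{code}

This is exactly the coordination that the current iterator-based design sidesteps by 
assuming a single stage with no shuffle.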

 

> Stateful operators in continuous processing
> ---
>
> Key: SPARK-24036
> URL: https://issues.apache.org/jira/browse/SPARK-24036
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> The first iteration of continuous processing in Spark 2.3 does not work with 
> stateful operators.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values

2018-04-25 Thread Bruce Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452790#comment-16452790
 ] 

Bruce Robbins edited comment on SPARK-23715 at 4/25/18 10:00 PM:
-

[~cloud_fan]

I'll give separate answers for String input and long input.

1) String Input (e.g., "2018-04-24 15:24:00")

Across Impala, Hive, and Spark, the input to from_utc_timestamp is a UTC 
datetime value. They each work as expected when the user passes a UTC datetime 
value (with one exception for Spark, described below).

Both Hive and Impala treat the string input as a UTC datetime value. That is, 
if you enter '2018-04-24 15:24:00', from_utc_timestamp treats it as 
'2018-04-24T15:24:00+00:00' (or unix time 1524583440).

Spark's from_utc_timestamp function also treats it as a UTC datetime value. 
Internally, it treats it as a local time (for the above example, it initially 
treats it as "2018-04-24T15:24:00-07:00"), but then corrects itself later in 
the code path. But from the user's perspective, Spark's from_utc_timestamp 
treats the input string as a UTC datetime value as long as the user sticks with 
the formats "yyyy-MM-dd HH:mm" and "yyyy-MM-dd'T'HH:mm".

Once the user ventures away from those formats, bad things happen. If the user 
specifies a UTC datetime value with a timezone (e.g., 
"2018-04-24T15:24:00+00:00"), Spark's from_utc_timezone returns the wrong 
result. Also, the user can specify a datetime value with a non-UTC timezone, 
which makes no sense. These two issues are the subject of this Jira.

In Hive and Impala, the user can only enter a datetime value in the format 
"yyy-MM-dd HH:mm". With Hive and Impala, the user cannot enter a timezone 
specification.

Besides the Spark bug described above, from_utc_timestamp works the same between 
Impala, Hive, and Spark.

Impala has a -use_local_tz_for_unix_timestamp_conversions setting which may (or 
may not) change the behavior of its from_utc_timestamp function. I have not 
tested it.

IBM's DB2 also has a from_utc_timestamp function. I have not tested it, but the 
documentation says the input datetime value should be 'An expression that 
specifies the timestamp that is in the Coordinated Universal Time time zone', 
which is consistent with Hive, Impala, and Spark.

2) Integer Input (e.g., 1524582000)

Here Impala diverges from Spark and Hive.

Impala treats the input as the number of seconds since the epoch 1970-01-01 
00:00:00 UTC.
{noformat}
> select from_utc_timestamp(cast(0 as timestamp), "UTC") as result;
+---------------------+
| result              |
+---------------------+
| 1970-01-01 00:00:00 |
+---------------------+
{noformat}
Both Hive and Spark first shift the specified time by the local time zone. 
Since I am in timezone America/Los_Angeles, the input value is shifted by -8 
hours, to 1969-12-31 16:00:00 UTC:
{noformat}
hive> select from_utc_timestamp(0, 'UTC');
OK
1969-12-31 16:00:00

spark-sql> select from_utc_timestamp(cast(0 as timestamp), 'UTC');
1969-12-31 16:00:00


hive> select from_utc_timestamp(0, 'America/Los_Angeles');
OK
1969-12-31 08:00:00

spark-sql> select from_utc_timestamp(cast(0 as timestamp), 
'America/Los_Angeles');
1969-12-31 08:00:00
{noformat}
I also classified this behavior with integer input as a bug in this Jira. 
However, it is consistent with Hive, so I am not so sure.
 

 


was (Author: bersprockets):
[~cloud_fan]

I'll give separate answers for String input and long input.

1) String Input (e.g., "2018-04-24 15:24:00")

Across Impala, Hive, and Spark, the input to from_utc_timestamp is a UTC 
datetime value. They each work as expected when the user passes a UTC datetime 
value (with one exception for Spark, described below).

Both Hive and Impala treat the string input as a UTC datetime value. That is, 
if you enter '2018-04-24 15:24:00', from_utc_timestamp treats it as 
'2018-04-24T15:00:00+00:00' (or unix time 1524582000).

Spark's from_utc_timezone function also treats it as a UTC datetime value. 
Internally, it treats it as a local time (for the above example, it initially 
treats it as "2018-04-24T15:00:00-07:00"), but then corrects itself later in 
the code path. But from the user's perspective, Spark's from_utc_timestamp 
treats the input string as a UTC datetime value as long as the user sticks with 
the formats "-mm-dd HH:mm" and "-mm-dd'T'HH:mm".

Once the user ventures away from those formats, bad things happen. If the user 
specifies a UTC datetime value with a timezone (e.g., 
"2018-04-24T15:24:00+00:00"), Spark's from_utc_timezone returns the wrong 
result. Also, the user can specify a datetime value with a non-UTC timezone, 
which makes no sense. These two issues are the subject of this Jira.

In Hive and Impala, the user can only enter a datetime value in the format 
"yyy-MM-dd HH:mm". With Hive and Impala, the user cannot enter a timezone 
specification.

Besides the Spark bug 

[jira] [Resolved] (SPARK-23824) Make inpurityStats publicly accessible in ml.tree.Node

2018-04-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-23824.
---
Resolution: Duplicate

> Make inpurityStats publicly accessible in ml.tree.Node
> --
>
> Key: SPARK-23824
> URL: https://issues.apache.org/jira/browse/SPARK-23824
> Project: Spark
>  Issue Type: Wish
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Barry Becker
>Priority: Minor
>
> This is minor, but it is also a very easy fix.
> I would like to visualize the structure of a decision tree model, but 
> currently the only means of obtaining the label distribution data at each 
> node of the tree is hidden within each ml.tree.Node inside the impurityStats.
> I'm pretty sure that the fix for this is as easy as removing the private[ml] 
> qualifier from occurrences of
> private[ml] def impurityStats: ImpurityCalculator
> and
> override private[ml] val impurityStats: ImpurityCalculator
>  
> As a workaround, I've put my class that needs access into a  
> org.apache.spark.ml.tree package in my own repository, but I would really 
> like to not have to do that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24057) put the real data type in the AssertionError message

2018-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453092#comment-16453092
 ] 

Apache Spark commented on SPARK-24057:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/21159

> put the real data type in the AssertionError message
> 
>
> Key: SPARK-24057
> URL: https://issues.apache.org/jira/browse/SPARK-24057
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> I had a wrong data type in one of my tests and got the following message: 
> /spark/python/pyspark/sql/types.py", line 405, in __init__
> assert isinstance(dataType, DataType), "dataType should be DataType"
> AssertionError: dataType should be DataType
> I checked types.py,  line 405, in __init__, it has
> {{assert isinstance(dataType, DataType), "dataType should be DataType"}}
> I think it meant to be 
> {{assert isinstance(dataType, DataType), "%s should be %s"  % (dataType, 
> DataType)}}
> so the error message will be something like 
> {{AssertionError: <the actual dataType> should be <class 'pyspark.sql.types.DataType'>}}
> There are a couple of other places that have the same problem. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24057) put the real data type in the AssertionError message

2018-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24057:


Assignee: Apache Spark

> put the real data type in the AssertionError message
> 
>
> Key: SPARK-24057
> URL: https://issues.apache.org/jira/browse/SPARK-24057
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>
> I had a wrong data type in one of my tests and got the following message: 
> /spark/python/pyspark/sql/types.py", line 405, in __init__
> assert isinstance(dataType, DataType), "dataType should be DataType"
> AssertionError: dataType should be DataType
> I checked types.py,  line 405, in __init__, it has
> {{assert isinstance(dataType, DataType), "dataType should be DataType"}}
> I think it meant to be 
> {{assert isinstance(dataType, DataType), "%s should be %s"  % (dataType, 
> DataType)}}
> so the error message will be something like 
> {{AssertionError: <the actual dataType> should be <class 'pyspark.sql.types.DataType'>}}
> There are a couple of other places that have the same problem. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24057) put the real data type in the AssertionError message

2018-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24057:


Assignee: (was: Apache Spark)

> put the real data type in the AssertionError message
> 
>
> Key: SPARK-24057
> URL: https://issues.apache.org/jira/browse/SPARK-24057
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> I had a wrong data type in one of my tests and got the following message: 
> /spark/python/pyspark/sql/types.py", line 405, in __init__
> assert isinstance(dataType, DataType), "dataType should be DataType"
> AssertionError: dataType should be DataType
> I checked types.py,  line 405, in __init__, it has
> {{assert isinstance(dataType, DataType), "dataType should be DataType"}}
> I think it meant to be 
> {{assert isinstance(dataType, DataType), "%s should be %s"  % (dataType, 
> DataType)}}
> so the error message will be something like 
> {{AssertionError: <the actual dataType> should be <class 'pyspark.sql.types.DataType'>}}
> There are a couple of other places that have the same problem. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24057) put the real data type in the AssertionError message

2018-04-25 Thread Huaxin Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-24057:
---
Issue Type: Improvement  (was: Bug)

> put the real data type in the AssertionError message
> 
>
> Key: SPARK-24057
> URL: https://issues.apache.org/jira/browse/SPARK-24057
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> I had a wrong data type in one of my tests and got the following message: 
> /spark/python/pyspark/sql/types.py", line 405, in __init__
> assert isinstance(dataType, DataType), "dataType should be DataType"
> AssertionError: dataType should be DataType
> I checked types.py,  line 405, in __init__, it has
> {{assert isinstance(dataType, DataType), "dataType should be DataType"}}
> I think it meant to be 
> {{assert isinstance(dataType, DataType), "%s should be %s"  % (dataType, 
> DataType)}}
> so the error message will be something like 
> {{AssertionError: <the actual dataType> should be <class 'pyspark.sql.types.DataType'>}}
> There are a couple of other places that have the same problem. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24093) Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to outside of the classes

2018-04-25 Thread Weiqing Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiqing Yang updated SPARK-24093:
-
Description: To make third parties able to get the information of streaming 
writer, for example, the information of "writer" and "topic" which streaming 
data are written into, this jira is created to make relevant fields of 
KafkaStreamWriter and InternalRowMicroBatchWriter visible to outside of the 
classes.  (was: We are working on Spark Atlas 
Connector([https://github.com/hortonworks-spark/spark-atlas-connector)|https://github.com/hortonworks-spark/spark-atlas-connector).],
 and adding supports for Spark Streaming. As SAC needs the information of 
"writer" and "topic" which streaming data are written into, this jira is 
created to make relevant fields of KafkaStreamWriter and 
InternalRowMicroBatchWriter visible to outside of the classes.)

> Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to 
> outside of the classes
> ---
>
> Key: SPARK-24093
> URL: https://issues.apache.org/jira/browse/SPARK-24093
> Project: Spark
>  Issue Type: Wish
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Weiqing Yang
>Priority: Minor
>
> To make third parties able to get the information of streaming writer, for 
> example, the information of "writer" and "topic" which streaming data are 
> written into, this jira is created to make relevant fields of 
> KafkaStreamWriter and InternalRowMicroBatchWriter visible to outside of the 
> classes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24036) Stateful operators in continuous processing

2018-04-25 Thread Arun Mahadevan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453058#comment-16453058
 ] 

Arun Mahadevan commented on SPARK-24036:


Hi [~joseph.torres], I am also interested in contributing to this effort if you 
are open to it.

> Supporting single partition aggregates. I have a substantially complete 
> prototype of this in [https://github.com/jose-torres/spark/pull/13] - it 
> doesn't really involve design as much as removing a very silly hack I put in 
> earlier.

Does it require saving the aggregate state by injecting epoch markers into the 
stream, or does it just work with the iterator approach since it involves only a 
single partition?

To extend this to support multiple partitions and shuffles, shouldn't the epoch 
markers be injected into the stream, with the state save happening on receiving 
the markers from all the parent tasks?

 > Just write RPC endpoints on both ends tossing rows around, optimizing for 
throughput later if needed. (I'm leaning towards this one.)

So buffering of the rows between the stages and handling back-pressure need to 
be considered here? Would the existing shuffle infrastructure make it easier 
to handle this?

 

> Stateful operators in continuous processing
> ---
>
> Key: SPARK-24036
> URL: https://issues.apache.org/jira/browse/SPARK-24036
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> The first iteration of continuous processing in Spark 2.3 does not work with 
> stateful operators.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore

2018-04-25 Thread Tavis Barr (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453054#comment-16453054
 ] 

Tavis Barr commented on SPARK-13446:


I do not believe the issue causing the above stack trace has actually been 
fixed. I still got it with Spark 2.3.0, and the code causing it appears to 
still be in master as of today. The offending code is in 
/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala lines 205-206. 
Please see my comments regarding 
[SPARK-18112|https://issues.apache.org/jira/browse/SPARK-18112].

> Spark need to support reading data from Hive 2.0.0 metastore
> 
>
> Key: SPARK-13446
> URL: https://issues.apache.org/jira/browse/SPARK-13446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Lifeng Wang
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.2.0
>
>
> Spark provides the HiveContext class to read data from the Hive metastore 
> directly, but it only supports Hive 1.2.1 and older. Since Hive 2.0.0 has been 
> released, it's better to upgrade to support Hive 2.0.0.
> {noformat}
> 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI 
> thrift://hsw-node13:9083
> 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current 
> connections: 1
> 16/02/23 02:35:02 INFO metastore: Connected to metastore.
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$1.<init>(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24093) Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to outside of the classes

2018-04-25 Thread Weiqing Yang (JIRA)
Weiqing Yang created SPARK-24093:


 Summary: Make some fields of 
KafkaStreamWriter/InternalRowMicroBatchWriter visible to outside of the classes
 Key: SPARK-24093
 URL: https://issues.apache.org/jira/browse/SPARK-24093
 Project: Spark
  Issue Type: Wish
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Weiqing Yang


We are working on the Spark Atlas Connector 
([https://github.com/hortonworks-spark/spark-atlas-connector]) and adding 
support for Spark Streaming. As SAC needs the information about the "writer" 
and the "topic" which streaming data are written into, this jira is created to 
make the relevant fields of KafkaStreamWriter and InternalRowMicroBatchWriter 
visible outside of the classes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24092) spark.python.worker.reuse does not work?

2018-04-25 Thread David Figueroa (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Figueroa updated SPARK-24092:
---
Description: 
{{spark.python.worker.reuse is true by default but even after explicitly 
setting to true the code below does not print the same python worker process 
ids.}}
{code:java|title=procid.py|borderStyle=solid}
def return_pid(_): yield os.getpid()
spark = SparkSession.builder.getOrCreate()
 pids = set(spark.sparkContext.range(32).mapPartitions(return_pid).collect())
print(pids)
pids = set(spark.sparkContext.range(32).mapPartitions(return_pid).collect())
print(pids){code}

  was:
{{spark.python.worker.reuse is true by default but even after explicitly 
setting to true the code below does not print the same python worker process 
ids.}}

{code:title=procid.py|borderStyle=solid} 

def return_pid(_): yield os.getpid()
spark = SparkSession.builder.getOrCreate()
 pids = set(spark.sparkContext.range(32).mapPartitions(return_pid).collect())
print(pids)
pids = set(spark.sparkContext.range(32).mapPartitions(return_pid).collect())
print(pids)

{code}


> spark.python.worker.reuse does not work?
> 
>
> Key: SPARK-24092
> URL: https://issues.apache.org/jira/browse/SPARK-24092
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: David Figueroa
>Priority: Minor
>
> {{spark.python.worker.reuse is true by default but even after explicitly 
> setting to true the code below does not print the same python worker process 
> ids.}}
> {code:java|title=procid.py|borderStyle=solid}
> def return_pid(_): yield os.getpid()
> spark = SparkSession.builder.getOrCreate()
>  pids = set(spark.sparkContext.range(32).mapPartitions(return_pid).collect())
> print(pids)
> pids = set(spark.sparkContext.range(32).mapPartitions(return_pid).collect())
> print(pids){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24092) spark.python.worker.reuse does not work?

2018-04-25 Thread David Figueroa (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Figueroa updated SPARK-24092:
---
Description: 
{{spark.python.worker.reuse is true by default but even after explicitly 
setting to true the code below does not print the same python worker process 
ids.}}
{code:java|title=procid.py|borderStyle=solid}
def return_pid(_): yield os.getpid()
spark = SparkSession.builder.getOrCreate()
pids = set(spark.sparkContext.range(32).mapPartitions(return_pid).collect())
print(pids)
pids = set(spark.sparkContext.range(32).mapPartitions(return_pid).collect())
print(pids){code}

  was:
{{spark.python.worker.reuse is true by default but even after explicitly 
setting to true the code below does not print the same python worker process 
ids.}}
{code:java|title=procid.py|borderStyle=solid}
def return_pid(_): yield os.getpid()
spark = SparkSession.builder.getOrCreate()
 pids = set(spark.sparkContext.range(32).mapPartitions(return_pid).collect())
print(pids)
pids = set(spark.sparkContext.range(32).mapPartitions(return_pid).collect())
print(pids){code}


> spark.python.worker.reuse does not work?
> 
>
> Key: SPARK-24092
> URL: https://issues.apache.org/jira/browse/SPARK-24092
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: David Figueroa
>Priority: Minor
>
> {{spark.python.worker.reuse is true by default but even after explicitly 
> setting to true the code below does not print the same python worker process 
> ids.}}
> {code:java|title=procid.py|borderStyle=solid}
> def return_pid(_): yield os.getpid()
> spark = SparkSession.builder.getOrCreate()
> pids = set(spark.sparkContext.range(32).mapPartitions(return_pid).collect())
> print(pids)
> pids = set(spark.sparkContext.range(32).mapPartitions(return_pid).collect())
> print(pids){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24092) spark.python.worker.reuse does not work?

2018-04-25 Thread David Figueroa (JIRA)
David Figueroa created SPARK-24092:
--

 Summary: spark.python.worker.reuse does not work?
 Key: SPARK-24092
 URL: https://issues.apache.org/jira/browse/SPARK-24092
 Project: Spark
  Issue Type: Question
  Components: PySpark
Affects Versions: 2.3.0
Reporter: David Figueroa


{{spark.python.worker.reuse is true by default but even after explicitly 
setting to true the code below does not print the same python worker process 
ids.}}

{code:title=procid.py|borderStyle=solid} 

import os
from pyspark.sql import SparkSession

def return_pid(_): yield os.getpid()

spark = SparkSession.builder.getOrCreate()
pids = set(spark.sparkContext.range(32).mapPartitions(return_pid).collect())
print(pids)
pids = set(spark.sparkContext.range(32).mapPartitions(return_pid).collect())
print(pids)

{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22239) User-defined window functions with pandas udf

2018-04-25 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452977#comment-16452977
 ] 

Li Jin commented on SPARK-22239:


[~hvanhovell], I have done a bit more research on UDFs over rolling windows 
and posted my results here:

[https://docs.google.com/document/d/14EjeY5z4-NC27-SmIP9CsMPCANeTcvxN44a7SIJtZPc/edit?usp=sharing]

TL;DR I think we can implement this efficiently by computing window indices in 
the JVM, passing the indices along with the window data to Python, and doing 
the rolling over the indices in Python. I have not addressed the issue of 
splitting the window partition into smaller batches, but I think that's doable 
as well. Would you be interested in taking a look and letting me know what you 
think?
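
The "compute window indices in the JVM" part can be illustrated with a small sketch 
(assumed semantics: an ordered partition and rangeBetween(lower, upper) bounds; this 
is not the proposed Spark implementation): a two-pointer pass yields a half-open 
[start, end) row range per row, which is all that would need to be shipped to the 
Python worker.

{code:scala}
// Sketch only: per-row window boundaries for an ordered, range-based window such as
// rangeBetween(-200, 0) over "time". `times` must be sorted in ascending order.
def rollingIndices(times: Array[Long], lower: Long, upper: Long): Array[(Int, Int)] = {
  val n = times.length
  val out = new Array[(Int, Int)](n)
  var start = 0
  var end = 0
  for (i <- 0 until n) {
    while (start < n && times(start) < times(i) + lower) start += 1  // first row inside the range
    while (end < n && times(end) <= times(i) + upper) end += 1       // one past the last row inside
    out(i) = (start, end)                                            // half-open [start, end)
  }
  out
}
{code}

The Python side would then slice the pandas series with these indices (e.g. 
v.iloc[start:end]) and apply the UDF per slice.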

> User-defined window functions with pandas udf
> -
>
> Key: SPARK-22239
> URL: https://issues.apache.org/jira/browse/SPARK-22239
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: 
>Reporter: Li Jin
>Priority: Major
>
> Window functions are another place we can benefit from vectorized udfs and add 
> another useful function to the pandas_udf suite.
> Example usage (preliminary):
> {code:java}
> w = Window.partitionBy('id').orderBy('time').rangeBetween(-200, 0)
> @pandas_udf(DoubleType())
> def ema(v1):
>     return v1.ewm(alpha=0.5).mean().iloc[-1]
> df.withColumn('v1_ema', ema(df.v1).over(w))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24089) DataFrame.write().mode(SaveMode.Append).insertInto(TABLE)

2018-04-25 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452942#comment-16452942
 ] 

Marco Gaido commented on SPARK-24089:
-

Anyway, from what I can see in your post on Stack Overflow, your use case is 
not valid: you cannot insert into a temp view. insertInto writes data to a 
table. For your use case, you can use unions.
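
To make that concrete, here is a minimal Scala sketch (the reporter's code is Java, 
but the idea carries over; file names and the parsing step are placeholders): create 
a real table from the first file with saveAsTable, after which insertInto works for 
the later files.

{code:scala}
// Minimal sketch, not the reporter's application: insertInto targets a real (catalog)
// table, so create one with saveAsTable first instead of a temp view.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("append-demo").master("local").getOrCreate()

Seq("log1.txt", "log2.txt").foreach { file =>
  val df = spark.read.textFile(file).toDF("line")          // placeholder for the real parsing
  if (spark.catalog.tableExists("mylogs")) {
    df.write.mode(SaveMode.Append).insertInto("mylogs")    // appends to the existing table
  } else {
    df.write.saveAsTable("mylogs")                         // creates a managed table, not a temp view
  }
}
{code}

Alternatively, the per-file DataFrames can simply be unioned before a single write.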

> DataFrame.write().mode(SaveMode.Append).insertInto(TABLE) 
> --
>
> Key: SPARK-24089
> URL: https://issues.apache.org/jira/browse/SPARK-24089
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: kumar
>Priority: Critical
>  Labels: bug
>
> I am completely stuck with this issue, unable to progress further. For more 
> info pls refer this post : 
> [https://stackoverflow.com/questions/49994085/spark-sql-2-3-dataframe-savemode-append-issue]
> I want to load multiple files one by one, don't want to load all files at a 
> time. To achieve this i used SaveMode.Append, so that 2nd file data will be 
> added to 1st file data in database, but it's throwing exception.
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved 
> operator 'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, 
> false;;
> 'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, false
> +- LogicalRDD [a1#22, b1#23, c1#24, d1#25], false
> {code}
> Code:
> {code:java}
> package com.log;
> import com.log.common.RegexMatch;
> import com.log.spark.SparkProcessor;
> import org.apache.spark.SparkContext;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import org.apache.spark.storage.StorageLevel;
> import java.util.ArrayList;
> import java.util.List;
> public class TestApp {
> private SparkSession spark;
> private SparkContext sparkContext;
> private SQLContext sqlContext;
> public TestApp() {
> SparkSession spark = SparkSession.builder().appName("Simple 
> Application")
> .config("spark.master", "local").getOrCreate();
> SparkContext sc = spark.sparkContext();
> this.spark = spark;
> this.sparkContext = sc;
> }
> public static void main(String[] args) {
> TestApp app = new TestApp();
> String[] afiles = {"C:\\Users\\test\\Desktop\\logs\\log1.txt",
> "C:\\Users\\test\\Desktop\\logs\\log2.txt"};
> for (String file : afiles) {
> app.writeFileToSchema(file);
> }
> }
> public void writeFileToSchema(String filePath) {
> StructType schema = getSchema();
> JavaRDD<Row> rowRDD = getRowRDD(filePath);
> if (spark.catalog().tableExists("mylogs")) {
> logDataFrame = spark.createDataFrame(rowRDD, schema);
> logDataFrame.createOrReplaceTempView("temptable");
> 
> logDataFrame.write().mode(SaveMode.Append).insertInto("mylogs");//exception
> } else {
> logDataFrame = spark.createDataFrame(rowRDD, schema);
> logDataFrame.createOrReplaceTempView("mylogs");
> }
> Dataset<Row> results = spark.sql("SELECT count(b1) FROM mylogs");
> List<Row> allrows = results.collectAsList();
> System.out.println("Count:"+allrows);
> sqlContext = logDataFrame.sqlContext();
> }
> Dataset<Row> logDataFrame;
> public List<Row> getTagList() {
> Dataset<Row> results = sqlContext.sql("SELECT distinct(b1) FROM 
> mylogs");
> List<Row> allrows = results.collectAsList();
> return allrows;
> }
> public StructType getSchema() {
> String schemaString = "a1 b1 c1 d1";
> List<StructField> fields = new ArrayList<>();
> for (String fieldName : schemaString.split(" ")) {
> StructField field = DataTypes.createStructField(fieldName, 
> DataTypes.StringType, true);
> fields.add(field);
> }
> StructType schema = DataTypes.createStructType(fields);
> return schema;
> }
> public JavaRDD<Row> getRowRDD(String filePath) {
> JavaRDD<String> logRDD = sparkContext.textFile(filePath, 
> 1).toJavaRDD();
> RegexMatch reg = new RegexMatch();
> JavaRDD<Row> rowRDD = logRDD
> .map((Function<String, Row>) line -> {
> String[] st = line.split(" ");
> return RowFactory.create(st[0], st[1], st[2], st[3]);
> });
> rowRDD.persist(StorageLevel.MEMORY_ONLY());
> return rowRDD;
> }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-24089) DataFrame.write().mode(SaveMode.Append).insertInto(TABLE)

2018-04-25 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452937#comment-16452937
 ] 

Marco Gaido commented on SPARK-24089:
-

Blocker can be set only by commiters, I moved to Critical.

> DataFrame.write().mode(SaveMode.Append).insertInto(TABLE) 
> --
>
> Key: SPARK-24089
> URL: https://issues.apache.org/jira/browse/SPARK-24089
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: kumar
>Priority: Critical
>  Labels: bug
>
> I am completely stuck with this issue, unable to progress further. For more 
> info pls refer this post : 
> [https://stackoverflow.com/questions/49994085/spark-sql-2-3-dataframe-savemode-append-issue]
> I want to load multiple files one by one, don't want to load all files at a 
> time. To achieve this i used SaveMode.Append, so that 2nd file data will be 
> added to 1st file data in database, but it's throwing exception.
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved 
> operator 'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, 
> false;;
> 'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, false
> +- LogicalRDD [a1#22, b1#23, c1#24, d1#25], false
> {code}
> Code:
> {code:java}
> package com.log;
> import com.log.common.RegexMatch;
> import com.log.spark.SparkProcessor;
> import org.apache.spark.SparkContext;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import org.apache.spark.storage.StorageLevel;
> import java.util.ArrayList;
> import java.util.List;
> public class TestApp {
> private SparkSession spark;
> private SparkContext sparkContext;
> private SQLContext sqlContext;
> public TestApp() {
> SparkSession spark = SparkSession.builder().appName("Simple 
> Application")
> .config("spark.master", "local").getOrCreate();
> SparkContext sc = spark.sparkContext();
> this.spark = spark;
> this.sparkContext = sc;
> }
> public static void main(String[] args) {
> TestApp app = new TestApp();
> String[] afiles = {"C:\\Users\\test\\Desktop\\logs\\log1.txt",
> "C:\\Users\\test\\Desktop\\logs\\log2.txt"};
> for (String file : afiles) {
> app.writeFileToSchema(file);
> }
> }
> public void writeFileToSchema(String filePath) {
> StructType schema = getSchema();
> JavaRDD rowRDD = getRowRDD(filePath);
> if (spark.catalog().tableExists("mylogs")) {
> logDataFrame = spark.createDataFrame(rowRDD, schema);
> logDataFrame.createOrReplaceTempView("temptable");
> 
> logDataFrame.write().mode(SaveMode.Append).insertInto("mylogs");//exception
> } else {
> logDataFrame = spark.createDataFrame(rowRDD, schema);
> logDataFrame.createOrReplaceTempView("mylogs");
> }
> Dataset results = spark.sql("SELECT count(b1) FROM mylogs");
> List allrows = results.collectAsList();
> System.out.println("Count:"+allrows);
> sqlContext = logDataFrame.sqlContext();
> }
> Dataset logDataFrame;
> public List getTagList() {
> Dataset results = sqlContext.sql("SELECT distinct(b1) FROM 
> mylogs");
> List allrows = results.collectAsList();
> return allrows;
> }
> public StructType getSchema() {
> String schemaString = "a1 b1 c1 d1";
> List fields = new ArrayList<>();
> for (String fieldName : schemaString.split(" ")) {
> StructField field = DataTypes.createStructField(fieldName, 
> DataTypes.StringType, true);
> fields.add(field);
> }
> StructType schema = DataTypes.createStructType(fields);
> return schema;
> }
> public JavaRDD getRowRDD(String filePath) {
> JavaRDD logRDD = sparkContext.textFile(filePath, 
> 1).toJavaRDD();
> RegexMatch reg = new RegexMatch();
> JavaRDD rowRDD = logRDD
> .map((Function) line -> {
> String[] st = line.split(" ");
> return RowFactory.create(st[0], st[1], st[2], st[3]);
> });
> rowRDD.persist(StorageLevel.MEMORY_ONLY());
> return rowRDD;
> }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-24089) DataFrame.write().mode(SaveMode.Append).insertInto(TABLE)

2018-04-25 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-24089:

Priority: Critical  (was: Blocker)

> DataFrame.write().mode(SaveMode.Append).insertInto(TABLE) 
> --
>
> Key: SPARK-24089
> URL: https://issues.apache.org/jira/browse/SPARK-24089
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: kumar
>Priority: Critical
>  Labels: bug
>
> I am completely stuck with this issue, unable to progress further. For more 
> info pls refer this post : 
> [https://stackoverflow.com/questions/49994085/spark-sql-2-3-dataframe-savemode-append-issue]
> I want to load multiple files one by one, don't want to load all files at a 
> time. To achieve this i used SaveMode.Append, so that 2nd file data will be 
> added to 1st file data in database, but it's throwing exception.
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved 
> operator 'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, 
> false;;
> 'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, false
> +- LogicalRDD [a1#22, b1#23, c1#24, d1#25], false
> {code}
> Code:
> {code:java}
> package com.log;
> import com.log.common.RegexMatch;
> import com.log.spark.SparkProcessor;
> import org.apache.spark.SparkContext;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import org.apache.spark.storage.StorageLevel;
> import java.util.ArrayList;
> import java.util.List;
> public class TestApp {
> private SparkSession spark;
> private SparkContext sparkContext;
> private SQLContext sqlContext;
> public TestApp() {
> SparkSession spark = SparkSession.builder().appName("Simple 
> Application")
> .config("spark.master", "local").getOrCreate();
> SparkContext sc = spark.sparkContext();
> this.spark = spark;
> this.sparkContext = sc;
> }
> public static void main(String[] args) {
> TestApp app = new TestApp();
> String[] afiles = {"C:\\Users\\test\\Desktop\\logs\\log1.txt",
> "C:\\Users\\test\\Desktop\\logs\\log2.txt"};
> for (String file : afiles) {
> app.writeFileToSchema(file);
> }
> }
> public void writeFileToSchema(String filePath) {
> StructType schema = getSchema();
> JavaRDD rowRDD = getRowRDD(filePath);
> if (spark.catalog().tableExists("mylogs")) {
> logDataFrame = spark.createDataFrame(rowRDD, schema);
> logDataFrame.createOrReplaceTempView("temptable");
> 
> logDataFrame.write().mode(SaveMode.Append).insertInto("mylogs");//exception
> } else {
> logDataFrame = spark.createDataFrame(rowRDD, schema);
> logDataFrame.createOrReplaceTempView("mylogs");
> }
> Dataset results = spark.sql("SELECT count(b1) FROM mylogs");
> List allrows = results.collectAsList();
> System.out.println("Count:"+allrows);
> sqlContext = logDataFrame.sqlContext();
> }
> Dataset logDataFrame;
> public List getTagList() {
> Dataset results = sqlContext.sql("SELECT distinct(b1) FROM 
> mylogs");
> List allrows = results.collectAsList();
> return allrows;
> }
> public StructType getSchema() {
> String schemaString = "a1 b1 c1 d1";
> List fields = new ArrayList<>();
> for (String fieldName : schemaString.split(" ")) {
> StructField field = DataTypes.createStructField(fieldName, 
> DataTypes.StringType, true);
> fields.add(field);
> }
> StructType schema = DataTypes.createStructType(fields);
> return schema;
> }
> public JavaRDD getRowRDD(String filePath) {
> JavaRDD logRDD = sparkContext.textFile(filePath, 
> 1).toJavaRDD();
> RegexMatch reg = new RegexMatch();
> JavaRDD rowRDD = logRDD
> .map((Function) line -> {
> String[] st = line.split(" ");
> return RowFactory.create(st[0], st[1], st[2], st[3]);
> });
> rowRDD.persist(StorageLevel.MEMORY_ONLY());
> return rowRDD;
> }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24050) StreamingQuery does not calculate input / processing rates in some cases

2018-04-25 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-24050.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 21126
[https://github.com/apache/spark/pull/21126]

> StreamingQuery does not calculate input / processing rates in some cases
> 
>
> Key: SPARK-24050
> URL: https://issues.apache.org/jira/browse/SPARK-24050
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1, 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 3.0.0
>
>
> In some streaming queries, the input and processing rates are not calculated 
> at all (shows up as zero) because MicroBatchExecution fails to associated 
> metrics from the executed plan of a trigger with the sources in the logical 
> plan of the trigger. The way this executed-plan-leaf-to-logical-source 
> attribution works is as follows. With V1 sources, there was no way to 
> identify which execution plan leaves were generated by a streaming source. So 
> we did a best-effort attempt to match logical and execution plan leaves when the 
> number of leaves were same. In cases where the number of leaves is different, 
> we just give up and report zero rates. An example where this may happen is as 
> follows.
> {code}
> val cachedStaticDF = someStaticDF.union(anotherStaticDF).cache()
> val streamingInputDF = ...
> val query = streamingInputDF.join(cachedStaticDF).writeStream
> {code}
> In this case, the {{cachedStaticDF}} has multiple logical leaves, but in the 
> trigger's execution plan it only has one leaf because a cached subplan is 
> represented as a single InMemoryTableScanExec leaf. This leads to a mismatch 
> in the number of leaves causing the input rates to be computed as zero. 
> With DataSourceV2, all inputs are represented in the executed plan using 
> {{DataSourceV2ScanExec}}, each of which has a reference to the associated 
> logical {{DataSource}} and {{DataSourceReader}}. So it's easy to associate the 
> metrics with the original streaming sources. So the solution is to take 
> advantage of the presence of DataSourceV2 whenever possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23933) High-order function: map(array, array) → map<K,V>

2018-04-25 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452404#comment-16452404
 ] 

Kazuaki Ishizaki edited comment on SPARK-23933 at 4/25/18 6:48 PM:
---

Thank you for your comment.
The current map can take an even number of arguments (e.g. 2, 4, 6, 8, ...) since 
each argument pair is a key and a value.
We can decide that {{map(1.0, '2', 3.0, '4')}} or {{map(1.0, '2')}} should behave 
as it currently does.

How about {{map(ARRAY [1, 2], ARRAY ["a", "b"])}}? Or how about 
{{CreateMap(Seq(CreateArray(sSeq.map(Literal(\_))), 
CreateArray(iSeq.map(Literal(\_)}}?




was (Author: kiszk):
Thank you for your comment.
The current map can take the even number of arguments (e.g. 2, 4, 6, 8 ...) due 
to a pair of key and map.
We can determine {{map(1.0, '2', 3.0, '4') or map(1.0, '2')}} should be behave 
as currently.

How about {{map(ARRAY [1, 2], ARRAY ["a", "b"])}}?



> High-order function: map(array, array) → map<K,V>
> ---
>
> Key: SPARK-23933
> URL: https://issues.apache.org/jira/browse/SPARK-23933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map created using the given key/value arrays.
> {noformat}
> SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4}
> {noformat}
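
Until such a built-in exists, the arrays-to-map semantics discussed in the comment 
above can be approximated in Spark with a UDF; a minimal sketch (the function and 
column names are illustrative):

{code:scala}
// Sketch only: build a MapType column from two array columns by zipping them,
// mirroring Presto's map(ARRAY[...], ARRAY[...]) semantics for equal-length arrays.
import org.apache.spark.sql.functions.udf

val zipToMap = udf((keys: Seq[Int], values: Seq[String]) => keys.zip(values).toMap)

// df.select(zipToMap($"ks", $"vs").as("m"))   // ks=[1,2], vs=["a","b"] => {1 -> "a", 2 -> "b"}
{code}

Note that zip silently truncates to the shorter array, so a built-in would still need 
to define the behavior for arrays of different lengths.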



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values

2018-04-25 Thread Bruce Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452790#comment-16452790
 ] 

Bruce Robbins commented on SPARK-23715:
---

[~cloud_fan]

I'll give separate answers for String input and long input.

1) String Input (e.g., "2018-04-24 15:24:00")

Across Impala, Hive, and Spark, the input to from_utc_timestamp is a UTC 
datetime value. They each work as expected when the user passes a UTC datetime 
value (with one exception for Spark, described below).

Both Hive and Impala treat the string input as a UTC datetime value. That is, 
if you enter '2018-04-24 15:24:00', from_utc_timestamp treats it as 
'2018-04-24T15:24:00+00:00' (or unix time 1524583440).

Spark's from_utc_timestamp function also treats it as a UTC datetime value. 
Internally, it treats it as a local time (for the above example, it initially 
treats it as "2018-04-24T15:24:00-07:00"), but then corrects itself later in 
the code path. But from the user's perspective, Spark's from_utc_timestamp 
treats the input string as a UTC datetime value as long as the user sticks with 
the formats "yyyy-MM-dd HH:mm" and "yyyy-MM-dd'T'HH:mm".

Once the user ventures away from those formats, bad things happen. If the user 
specifies a UTC datetime value with a timezone (e.g., 
"2018-04-24T15:24:00+00:00"), Spark's from_utc_timezone returns the wrong 
result. Also, the user can specify a datetime value with a non-UTC timezone, 
which makes no sense. These two issues are the subject of this Jira.

In Hive and Impala, the user can only enter a datetime value in the format 
"yyy-MM-dd HH:mm". With Hive and Impala, the user cannot enter a timezone 
specification.

Besides the Spark bug described above, from_utc_timestamp works the same between 
Impala, Hive, and Spark.

Impala has a -use_local_tz_for_unix_timestamp_conversions setting which may (or 
may not) change the behavior of its from_utc_timestamp function. I have not 
tested it.

IBM's DB2 also has a from_utc_timestamp function. I have not tested it, but the 
documentation says the input datetime value should be 'An expression that 
specifies the timestamp that is in the Coordinated Universal Time time zone', 
which is consistent with Hive, Impala, and Spark.

2) Integer Input (e.g., 1524582000)

Here Impala diverges from Spark and Hive.

Impala treats the input as the number of seconds since the epoch 1970-01-01 
00:00:00 UTC.
{noformat}
> select from_utc_timestamp(cast(0 as timestamp), "UTC") as result;
+---------------------+
| result              |
+---------------------+
| 1970-01-01 00:00:00 |
+---------------------+
{noformat}
Both Hive and Spark treat the input as the number of seconds since 1970-01-01 
00:00:00 in the local time zone. Since I am in timezone America/Los_Angeles, 
my epoch is 1970-01-01 00:00:00 -8 hours, or 1969-12-31 16:00:00:
{noformat}
hive> select from_utc_timestamp(0, 'UTC');
OK
1969-12-31 16:00:00

spark-sql> select from_utc_timestamp(cast(0 as timestamp), 'UTC');
1969-12-31 16:00:00


hive> select from_utc_timestamp(0, 'America/Los_Angeles');
OK
1969-12-31 08:00:00

spark-sql> select from_utc_timestamp(cast(0 as timestamp), 
'America/Los_Angeles');
1969-12-31 08:00:00
{noformat}
I also classified this behavior with integer input as a bug in this Jira. 
However, it is consistent with Hive, so I am not so sure. 
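
For anyone wanting to reproduce the two cases above from spark-shell, a minimal 
sketch (the results depend on the session time zone; the comments assume an 
America/Los_Angeles session, as in this comment):

{code:scala}
import org.apache.spark.sql.functions._
val df = spark.range(1)

// String input in the supported "yyyy-MM-dd HH:mm:ss" shape: treated as a UTC wall-clock time.
df.select(from_utc_timestamp(lit("2018-04-24 15:24:00"), "America/Los_Angeles").as("s")).show()

// Integer input: cast(0 as timestamp) is interpreted relative to the session time zone first,
// which is why Hive/Spark print 1969-12-31 16:00:00 for a Los Angeles session while Impala
// prints 1970-01-01 00:00:00.
df.select(from_utc_timestamp(lit(0).cast("timestamp"), "UTC").as("i")).show()
{code}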
 

 

 

> from_utc_timestamp returns incorrect results for some UTC date/time values
> --
>
> Key: SPARK-23715
> URL: https://issues.apache.org/jira/browse/SPARK-23715
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bruce Robbins
>Priority: Major
>
> This produces the expected answer:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" 
> ).as("dt")).show
> +-------------------+
> |                 dt|
> +-------------------+
> |2018-03-13 07:18:23|
> +-------------------+
> {noformat}
> However, the equivalent UTC input (but with an explicit timezone) produces a 
> wrong answer:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" 
> ).as("dt")).show
> +-------------------+
> |                 dt|
> +-------------------+
> |2018-03-13 00:18:23|
> +-------------------+
> {noformat}
> Additionally, the equivalent Unix time (1520921903, which is also 
> "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer:
> {noformat}
> df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" 
> ).as("dt")).show
> +-------------------+
> |                 dt|
> +-------------------+
> |2018-03-13 00:18:23|
> +-------------------+
> {noformat}
> These issues stem from the fact that the FromUTCTimestamp expression, despite 
> its name, expects the input to be in the user's local timezone. There is some 
> 

[jira] [Updated] (SPARK-24091) Internally used ConfigMap prevents use of user-specified ConfigMaps carrying Spark configs files

2018-04-25 Thread Yinan Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yinan Li updated SPARK-24091:
-
Affects Version/s: (was: 2.3.0)
   2.4.0

> Internally used ConfigMap prevents use of user-specified ConfigMaps carrying 
> Spark configs files
> 
>
> Key: SPARK-24091
> URL: https://issues.apache.org/jira/browse/SPARK-24091
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> The recent PR [https://github.com/apache/spark/pull/20669] for removing the 
> init-container introduced an internally used ConfigMap carrying Spark 
> configuration properties in a file for the driver. This ConfigMap gets 
> mounted under {{$SPARK_HOME/conf}}, and the environment variable 
> {{SPARK_CONF_DIR}} is set to point to the mount path. This pretty much 
> prevents users from mounting their own ConfigMaps that carry custom Spark 
> configuration files, e.g., {{log4j.properties}} and {{spark-env.sh}}, and 
> leaves users with only the option of building custom images. IMO, it is very 
> useful to support mounting user-specified ConfigMaps for custom Spark 
> configuration files. This warrants further discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24091) Internally used ConfigMap prevents use of user-specified ConfigMaps carrying Spark configs files

2018-04-25 Thread Yinan Li (JIRA)
Yinan Li created SPARK-24091:


 Summary: Internally used ConfigMap prevents use of user-specified 
ConfigMaps carrying Spark configs files
 Key: SPARK-24091
 URL: https://issues.apache.org/jira/browse/SPARK-24091
 Project: Spark
  Issue Type: Brainstorming
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Yinan Li


The recent PR [https://github.com/apache/spark/pull/20669] for removing the 
init-container introduced an internally used ConfigMap carrying Spark 
configuration properties in a file for the driver. This ConfigMap gets mounted 
under {{$SPARK_HOME/conf}}, and the environment variable {{SPARK_CONF_DIR}} is 
set to point to the mount path. This pretty much prevents users from mounting 
their own ConfigMaps that carry custom Spark configuration files, e.g., 
{{log4j.properties}} and {{spark-env.sh}}, and leaves users with only the option 
of building custom images. IMO, it is very useful to support mounting 
user-specified ConfigMaps for custom Spark configuration files. This warrants 
further discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23850) We should not redact username|user|url from UI by default

2018-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452729#comment-16452729
 ] 

Apache Spark commented on SPARK-23850:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21158

> We should not redact username|user|url from UI by default
> -
>
> Key: SPARK-23850
> URL: https://issues.apache.org/jira/browse/SPARK-23850
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.1
>Reporter: Thomas Graves
>Priority: Major
>
> SPARK-22479 was filed to keep jdbc credentials out of the logs, but it also 
> added the username and url to the list of items to be redacted. I'm not sure 
> why these were added; by default they do not pose security concerns, and it 
> is more usable by default to be able to see these things. Users with high 
> security concerns can simply add them in their configs.
> Also, on yarn, just redacting the url doesn't secure anything: the 
> environment UI page still shows all sorts of paths, and it's just confusing 
> that some of it is redacted while other parts aren't. If this was meant 
> specifically for jdbc, I think it needs to be applied just there and not 
> broadly.
> If we remove these, we need to test what the jdbc driver is going to log from 
> SPARK-22479.
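
For readers following along, here is a minimal sketch of the opt-in override 
the comment alludes to, assuming the standard {{spark.redaction.regex}} 
property is the relevant knob (shown as a spark-defaults.conf entry; the 
pattern is illustrative, not a statement of any project default):
{noformat}
# Re-add user/username/url redaction for high-security deployments
spark.redaction.regex  (?i)secret|password|url|user|username
{noformat}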



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23850) We should not redact username|user|url from UI by default

2018-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23850:


Assignee: Apache Spark

> We should not redact username|user|url from UI by default
> -
>
> Key: SPARK-23850
> URL: https://issues.apache.org/jira/browse/SPARK-23850
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.1
>Reporter: Thomas Graves
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-22479 was filed to keep jdbc credentials out of the logs, but it also 
> added the username and url to the list of items to be redacted. I'm not sure 
> why these were added; by default they do not pose security concerns, and it 
> is more usable by default to be able to see these things. Users with high 
> security concerns can simply add them in their configs.
> Also, on yarn, just redacting the url doesn't secure anything: the 
> environment UI page still shows all sorts of paths, and it's just confusing 
> that some of it is redacted while other parts aren't. If this was meant 
> specifically for jdbc, I think it needs to be applied just there and not 
> broadly.
> If we remove these, we need to test what the jdbc driver is going to log from 
> SPARK-22479.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23850) We should not redact username|user|url from UI by default

2018-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23850:


Assignee: (was: Apache Spark)

> We should not redact username|user|url from UI by default
> -
>
> Key: SPARK-23850
> URL: https://issues.apache.org/jira/browse/SPARK-23850
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.1
>Reporter: Thomas Graves
>Priority: Major
>
> SPARK-22479 was filed to keep jdbc credentials out of the logs, but it also 
> added the username and url to the list of items to be redacted. I'm not sure 
> why these were added; by default they do not pose security concerns, and it 
> is more usable by default to be able to see these things. Users with high 
> security concerns can simply add them in their configs.
> Also, on yarn, just redacting the url doesn't secure anything: the 
> environment UI page still shows all sorts of paths, and it's just confusing 
> that some of it is redacted while other parts aren't. If this was meant 
> specifically for jdbc, I think it needs to be applied just there and not 
> broadly.
> If we remove these, we need to test what the jdbc driver is going to log from 
> SPARK-22479.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24090) Kubernetes Backend Hotlist for Spark 2.4

2018-04-25 Thread Anirudh Ramanathan (JIRA)
Anirudh Ramanathan created SPARK-24090:
--

 Summary: Kubernetes Backend Hotlist for Spark 2.4
 Key: SPARK-24090
 URL: https://issues.apache.org/jira/browse/SPARK-24090
 Project: Spark
  Issue Type: Umbrella
  Components: Kubernetes, Scheduler
Affects Versions: 2.4.0
Reporter: Anirudh Ramanathan
Assignee: Anirudh Ramanathan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24089) DataFrame.write().mode(SaveMode.Append).insertInto(TABLE)

2018-04-25 Thread kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kumar updated SPARK-24089:
--
Component/s: Java API

> DataFrame.write().mode(SaveMode.Append).insertInto(TABLE) 
> --
>
> Key: SPARK-24089
> URL: https://issues.apache.org/jira/browse/SPARK-24089
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: kumar
>Priority: Blocker
>  Labels: bug
>
> I am completely stuck with this issue and unable to progress further. For more 
> info, please refer to this post: 
> [https://stackoverflow.com/questions/49994085/spark-sql-2-3-dataframe-savemode-append-issue]
> I want to load multiple files one by one rather than loading all files at 
> once. To achieve this I used SaveMode.Append, so that the 2nd file's data is 
> appended to the 1st file's data in the database, but it throws an exception.
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved 
> operator 'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, 
> false;;
> 'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, false
> +- LogicalRDD [a1#22, b1#23, c1#24, d1#25], false
> {code}
> Code:
> {code:java}
> package com.log;
> import com.log.common.RegexMatch;
> import com.log.spark.SparkProcessor;
> import org.apache.spark.SparkContext;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import org.apache.spark.storage.StorageLevel;
> import java.util.ArrayList;
> import java.util.List;
> public class TestApp {
> private SparkSession spark;
> private SparkContext sparkContext;
> private SQLContext sqlContext;
> public TestApp() {
> SparkSession spark = SparkSession.builder().appName("Simple 
> Application")
> .config("spark.master", "local").getOrCreate();
> SparkContext sc = spark.sparkContext();
> this.spark = spark;
> this.sparkContext = sc;
> }
> public static void main(String[] args) {
> TestApp app = new TestApp();
> String[] afiles = {"C:\\Users\\test\\Desktop\\logs\\log1.txt",
> "C:\\Users\\test\\Desktop\\logs\\log2.txt"};
> for (String file : afiles) {
> app.writeFileToSchema(file);
> }
> }
> public void writeFileToSchema(String filePath) {
> StructType schema = getSchema();
> JavaRDD rowRDD = getRowRDD(filePath);
> if (spark.catalog().tableExists("mylogs")) {
> logDataFrame = spark.createDataFrame(rowRDD, schema);
> logDataFrame.createOrReplaceTempView("temptable");
> 
> logDataFrame.write().mode(SaveMode.Append).insertInto("mylogs");//exception
> } else {
> logDataFrame = spark.createDataFrame(rowRDD, schema);
> logDataFrame.createOrReplaceTempView("mylogs");
> }
> Dataset results = spark.sql("SELECT count(b1) FROM mylogs");
> List allrows = results.collectAsList();
> System.out.println("Count:"+allrows);
> sqlContext = logDataFrame.sqlContext();
> }
> Dataset logDataFrame;
> public List getTagList() {
> Dataset results = sqlContext.sql("SELECT distinct(b1) FROM 
> mylogs");
> List allrows = results.collectAsList();
> return allrows;
> }
> public StructType getSchema() {
> String schemaString = "a1 b1 c1 d1";
> List fields = new ArrayList<>();
> for (String fieldName : schemaString.split(" ")) {
> StructField field = DataTypes.createStructField(fieldName, 
> DataTypes.StringType, true);
> fields.add(field);
> }
> StructType schema = DataTypes.createStructType(fields);
> return schema;
> }
> public JavaRDD getRowRDD(String filePath) {
> JavaRDD logRDD = sparkContext.textFile(filePath, 
> 1).toJavaRDD();
> RegexMatch reg = new RegexMatch();
> JavaRDD rowRDD = logRDD
> .map((Function) line -> {
> String[] st = line.split(" ");
> return RowFactory.create(st[0], st[1], st[2], st[3]);
> });
> rowRDD.persist(StorageLevel.MEMORY_ONLY());
> return rowRDD;
> }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-04-25 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452666#comment-16452666
 ] 

Bryan Cutler commented on SPARK-23874:
--

[~smilegator] the Arrow community decided to put their efforts towards a 0.10.0 
release instead of 0.9.1.  Yes, it's possible there could be a regression, but 
it's also possible that a bug was fixed that just hasn't surfaced yet.  I'll 
pick this up again after the release, do thorough testing, and we can decide if 
we will do the upgrade then.

> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Version 0.9.0 of apache arrow comes with a bug fix related to array 
> serialization. 
> https://issues.apache.org/jira/browse/ARROW-1973



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24089) DataFrame.write().mode(SaveMode.Append).insertInto(TABLE)

2018-04-25 Thread kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kumar updated SPARK-24089:
--
Description: 
I am completely stuck with this issue and unable to progress further. For more 
info, please refer to this post: 
[https://stackoverflow.com/questions/49994085/spark-sql-2-3-dataframe-savemode-append-issue]

I want to load multiple files one by one rather than loading all files at 
once. To achieve this I used SaveMode.Append, so that the 2nd file's data is 
appended to the 1st file's data in the database, but it throws an exception.
{code:java}
Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved 
operator 'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, 
false;;
'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, false
+- LogicalRDD [a1#22, b1#23, c1#24, d1#25], false
{code}
Code:
{code:java}
package com.log;

import com.log.common.RegexMatch;
import com.log.spark.SparkProcessor;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.storage.StorageLevel;

import java.util.ArrayList;
import java.util.List;

public class TestApp {

private SparkSession spark;
private SparkContext sparkContext;
private SQLContext sqlContext;

public TestApp() {
SparkSession spark = SparkSession.builder().appName("Simple 
Application")
.config("spark.master", "local").getOrCreate();

SparkContext sc = spark.sparkContext();

this.spark = spark;
this.sparkContext = sc;
}

public static void main(String[] args) {
TestApp app = new TestApp();

String[] afiles = {"C:\\Users\\test\\Desktop\\logs\\log1.txt",
"C:\\Users\\test\\Desktop\\logs\\log2.txt"};

for (String file : afiles) {
app.writeFileToSchema(file);
}
}

public void writeFileToSchema(String filePath) {

StructType schema = getSchema();
JavaRDD<Row> rowRDD = getRowRDD(filePath);

if (spark.catalog().tableExists("mylogs")) {

logDataFrame = spark.createDataFrame(rowRDD, schema);
logDataFrame.createOrReplaceTempView("temptable");

logDataFrame.write().mode(SaveMode.Append).insertInto("mylogs");//exception
} else {
logDataFrame = spark.createDataFrame(rowRDD, schema);
logDataFrame.createOrReplaceTempView("mylogs");
}

Dataset<Row> results = spark.sql("SELECT count(b1) FROM mylogs");

List<Row> allrows = results.collectAsList();

System.out.println("Count:"+allrows);

sqlContext = logDataFrame.sqlContext();
}

Dataset<Row> logDataFrame;

public List<Row> getTagList() {

Dataset<Row> results = sqlContext.sql("SELECT distinct(b1) FROM 
mylogs");
List<Row> allrows = results.collectAsList();

return allrows;
}

public StructType getSchema() {
String schemaString = "a1 b1 c1 d1";

List<StructField> fields = new ArrayList<>();
for (String fieldName : schemaString.split(" ")) {
StructField field = DataTypes.createStructField(fieldName, 
DataTypes.StringType, true);
fields.add(field);
}

StructType schema = DataTypes.createStructType(fields);

return schema;
}


public JavaRDD<Row> getRowRDD(String filePath) {

JavaRDD<String> logRDD = sparkContext.textFile(filePath, 1).toJavaRDD();

RegexMatch reg = new RegexMatch();
JavaRDD<Row> rowRDD = logRDD

.map((Function<String, Row>) line -> {

String[] st = line.split(" ");

return RowFactory.create(st[0], st[1], st[2], st[3]);
});

rowRDD.persist(StorageLevel.MEMORY_ONLY());

return rowRDD;
}
}
{code}

  was:
I am completely stuck with this issue, unable to progress further. For more 
info pls refer this post : 
[https://stackoverflow.com/questions/49994085/spark-sql-2-3-dataframe-savemode-append-issue]

I want to load multiple files one by one, don't want to load all files at a 
time. To achieve this i used SaveMode.Append, so that 2nd file data will be 
added to 1st file data in database, but it's throwing exception.

Code:
{code:java}
package com.log;

import com.log.common.RegexMatch;
import com.log.spark.SparkProcessor;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.storage.StorageLevel;

import java.util.ArrayList;

[jira] [Updated] (SPARK-24089) DataFrame.write().mode(SaveMode.Append).insertInto(TABLE)

2018-04-25 Thread kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kumar updated SPARK-24089:
--
Summary: DataFrame.write().mode(SaveMode.Append).insertInto(TABLE)   (was: 
DataFrame.write.mode(SaveMode.Append).insertInto(TABLE) )

> DataFrame.write().mode(SaveMode.Append).insertInto(TABLE) 
> --
>
> Key: SPARK-24089
> URL: https://issues.apache.org/jira/browse/SPARK-24089
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: kumar
>Priority: Blocker
>  Labels: bug
>
> I am completely stuck with this issue and unable to progress further. For more 
> info, please refer to this post: 
> [https://stackoverflow.com/questions/49994085/spark-sql-2-3-dataframe-savemode-append-issue]
> I want to load multiple files one by one rather than loading all files at 
> once. To achieve this I used SaveMode.Append, so that the 2nd file's data is 
> appended to the 1st file's data in the database, but it throws an exception.
> Code:
> {code:java}
> package com.log;
> import com.log.common.RegexMatch;
> import com.log.spark.SparkProcessor;
> import org.apache.spark.SparkContext;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import org.apache.spark.storage.StorageLevel;
> import java.util.ArrayList;
> import java.util.List;
> public class TestApp {
> private SparkSession spark;
> private SparkContext sparkContext;
> private SQLContext sqlContext;
> public TestApp() {
> SparkSession spark = SparkSession.builder().appName("Simple 
> Application")
> .config("spark.master", "local").getOrCreate();
> SparkContext sc = spark.sparkContext();
> this.spark = spark;
> this.sparkContext = sc;
> }
> public static void main(String[] args) {
> TestApp app = new TestApp();
> String[] afiles = {"C:\\Users\\test\\Desktop\\logs\\log1.txt",
> "C:\\Users\\test\\Desktop\\logs\\log2.txt"};
> for (String file : afiles) {
> app.writeFileToSchema(file);
> }
> }
> public void writeFileToSchema(String filePath) {
> StructType schema = getSchema();
> JavaRDD rowRDD = getRowRDD(filePath);
> if (spark.catalog().tableExists("mylogs")) {
> logDataFrame = spark.createDataFrame(rowRDD, schema);
> logDataFrame.createOrReplaceTempView("temptable");
> 
> logDataFrame.write().mode(SaveMode.Append).insertInto("mylogs");//exception
> } else {
> logDataFrame = spark.createDataFrame(rowRDD, schema);
> logDataFrame.createOrReplaceTempView("mylogs");
> }
> Dataset results = spark.sql("SELECT count(b1) FROM mylogs");
> List allrows = results.collectAsList();
> System.out.println("Count:"+allrows);
> sqlContext = logDataFrame.sqlContext();
> }
> Dataset logDataFrame;
> public List getTagList() {
> Dataset results = sqlContext.sql("SELECT distinct(b1) FROM 
> mylogs");
> List allrows = results.collectAsList();
> return allrows;
> }
> public StructType getSchema() {
> String schemaString = "a1 b1 c1 d1";
> List fields = new ArrayList<>();
> for (String fieldName : schemaString.split(" ")) {
> StructField field = DataTypes.createStructField(fieldName, 
> DataTypes.StringType, true);
> fields.add(field);
> }
> StructType schema = DataTypes.createStructType(fields);
> return schema;
> }
> public JavaRDD getRowRDD(String filePath) {
> JavaRDD logRDD = sparkContext.textFile(filePath, 
> 1).toJavaRDD();
> RegexMatch reg = new RegexMatch();
> JavaRDD rowRDD = logRDD
> .map((Function) line -> {
> String[] st = line.split(" ");
> return RowFactory.create(st[0], st[1], st[2], st[3]);
> });
> rowRDD.persist(StorageLevel.MEMORY_ONLY());
> return rowRDD;
> }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24089) DataFrame.write.mode(SaveMode.Append).insertInto(TABLE)

2018-04-25 Thread kumar (JIRA)
kumar created SPARK-24089:
-

 Summary: DataFrame.write.mode(SaveMode.Append).insertInto(TABLE) 
 Key: SPARK-24089
 URL: https://issues.apache.org/jira/browse/SPARK-24089
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.3.0
Reporter: kumar


I am completely stuck with this issue and unable to progress further. For more 
info, please refer to this post: 
[https://stackoverflow.com/questions/49994085/spark-sql-2-3-dataframe-savemode-append-issue]

I want to load multiple files one by one rather than loading all files at 
once. To achieve this I used SaveMode.Append, so that the 2nd file's data is 
appended to the 1st file's data in the database, but it throws an exception.

Code:
{code:java}
package com.log;

import com.log.common.RegexMatch;
import com.log.spark.SparkProcessor;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.storage.StorageLevel;

import java.util.ArrayList;
import java.util.List;

public class TestApp {

private SparkSession spark;
private SparkContext sparkContext;
private SQLContext sqlContext;

public TestApp() {
SparkSession spark = SparkSession.builder().appName("Simple 
Application")
.config("spark.master", "local").getOrCreate();

SparkContext sc = spark.sparkContext();

this.spark = spark;
this.sparkContext = sc;
}

public static void main(String[] args) {
TestApp app = new TestApp();

String[] afiles = {"C:\\Users\\test\\Desktop\\logs\\log1.txt",
"C:\\Users\\test\\Desktop\\logs\\log2.txt"};

for (String file : afiles) {
app.writeFileToSchema(file);
}
}

public void writeFileToSchema(String filePath) {

StructType schema = getSchema();
JavaRDD<Row> rowRDD = getRowRDD(filePath);

if (spark.catalog().tableExists("mylogs")) {

logDataFrame = spark.createDataFrame(rowRDD, schema);
logDataFrame.createOrReplaceTempView("temptable");

logDataFrame.write().mode(SaveMode.Append).insertInto("mylogs");//exception
} else {
logDataFrame = spark.createDataFrame(rowRDD, schema);
logDataFrame.createOrReplaceTempView("mylogs");
}

Dataset<Row> results = spark.sql("SELECT count(b1) FROM mylogs");

List<Row> allrows = results.collectAsList();

System.out.println("Count:"+allrows);

sqlContext = logDataFrame.sqlContext();
}

Dataset<Row> logDataFrame;

public List<Row> getTagList() {

Dataset<Row> results = sqlContext.sql("SELECT distinct(b1) FROM 
mylogs");
List<Row> allrows = results.collectAsList();

return allrows;
}

public StructType getSchema() {
String schemaString = "a1 b1 c1 d1";

List<StructField> fields = new ArrayList<>();
for (String fieldName : schemaString.split(" ")) {
StructField field = DataTypes.createStructField(fieldName, 
DataTypes.StringType, true);
fields.add(field);
}

StructType schema = DataTypes.createStructType(fields);

return schema;
}


public JavaRDD<Row> getRowRDD(String filePath) {

JavaRDD<String> logRDD = sparkContext.textFile(filePath, 1).toJavaRDD();

RegexMatch reg = new RegexMatch();
JavaRDD<Row> rowRDD = logRDD

.map((Function<String, Row>) line -> {

String[] st = line.split(" ");

return RowFactory.create(st[0], st[1], st[2], st[3]);
});

rowRDD.persist(StorageLevel.MEMORY_ONLY());

return rowRDD;
}
}
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24067) Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction))

2018-04-25 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452603#comment-16452603
 ] 

Cody Koeninger commented on SPARK-24067:


The original PR [https://github.com/apache/spark/pull/20572] merges cleanly 
against 2.3, are you ok with merging it?

> Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle 
> Non-consecutive Offsets (i.e. Log Compaction))
> 
>
> Key: SPARK-24067
> URL: https://issues.apache.org/jira/browse/SPARK-24067
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.3.0
>Reporter: Joachim Hereth
>Assignee: Cody Koeninger
>Priority: Major
>
> SPARK-17147 fixes a problem with non-consecutive Kafka Offsets. The 
> [PR|https://github.com/apache/spark/pull/20572] was merged to {{master}}. This 
> should be backported to 2.3.
>  
> Original Description from SPARK-17147 :
>  
> When Kafka does log compaction offsets often end up with gaps, meaning the 
> next requested offset will be frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset 
> will always be just an increment of 1 above the previous offset.
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
>  {{nextOffset = offset + 1}}
>  to:
>  {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
>  {{requestOffset += 1}}
>  to:
>  {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24036) Stateful operators in continuous processing

2018-04-25 Thread Jose Torres (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452576#comment-16452576
 ] 

Jose Torres commented on SPARK-24036:
-

The broader Spark community is of course always welcome to help.

The work here is generally split into three components:
 * Supporting single partition aggregates. I have a substantially complete 
prototype of this in [https://github.com/jose-torres/spark/pull/13] - it 
doesn't really involve design as much as removing a very silly hack I put in 
earlier.
 * Extending support to make continuous queries with multiple partitions run. 
My experimentation suggests that this only requires making ShuffleExchangeExec 
not cache its RDD in continuous mode, but I haven't strongly verified this.
 * Making the multiple partition aggregates truly continuous. 
ShuffleExchangeExec will of course insert a stage boundary, which means that 
latency will end up being bound by the checkpoint interval. What we need to do 
is create a new kind of shuffle for continuous processing which is non-blocking 
(cc [~liweisheng]). There are two possibilities here which I haven't evaluated 
in detail:
 ** Reuse the existing shuffle infrastructure, optimizing for latency later if 
needed.
 ** Just write RPC endpoints on both ends tossing rows around, optimizing for 
throughput later if needed. (I'm leaning towards this one.)

If you're interested in working on some of this, I can prioritize a design for 
that third part.

> Stateful operators in continuous processing
> ---
>
> Key: SPARK-24036
> URL: https://issues.apache.org/jira/browse/SPARK-24036
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> The first iteration of continuous processing in Spark 2.3 does not work with 
> stateful operators.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24070) TPC-DS Performance Tests for Parquet 1.10.0 Upgrade

2018-04-25 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452527#comment-16452527
 ] 

Xiao Li commented on SPARK-24070:
-

[~mswit] Thank you for your suggestions! This is very helpful!



> TPC-DS Performance Tests for Parquet 1.10.0 Upgrade
> ---
>
> Key: SPARK-24070
> URL: https://issues.apache.org/jira/browse/SPARK-24070
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Takeshi Yamamuro
>Priority: Major
>
> TPC-DS performance evaluation of Apache Spark Parquet 1.8.2 and 1.10.0. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24088) only HadoopRDD leverage HDFS Cache as preferred location

2018-04-25 Thread Xiaoju Wu (JIRA)
Xiaoju Wu created SPARK-24088:
-

 Summary: only HadoopRDD leverage HDFS Cache as preferred location
 Key: SPARK-24088
 URL: https://issues.apache.org/jira/browse/SPARK-24088
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 2.3.0
Reporter: Xiaoju Wu


Only HadoopRDD implements convertSplitLocationInfo, which converts a split 
location to an HDFSCacheTaskLocation when the block is cached in DataNode 
memory, while FileScanRDD does not. In FileScanRDD, all split location 
information is dropped.

{code}
private[spark] def convertSplitLocationInfo(
    infos: Array[SplitLocationInfo]): Option[Seq[String]] = {
  Option(infos).map(_.flatMap { loc =>
    val locationStr = loc.getLocation
    if (locationStr != "localhost") {
      if (loc.isInMemory) {
        logDebug(s"Partition $locationStr is cached by Hadoop.")
        Some(HDFSCacheTaskLocation(locationStr).toString)
      } else {
        Some(HostTaskLocation(locationStr).toString)
      }
    } else {
      None
    }
  })
}
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22674) PySpark breaks serialization of namedtuple subclasses

2018-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22674:


Assignee: Apache Spark

> PySpark breaks serialization of namedtuple subclasses
> -
>
> Key: SPARK-22674
> URL: https://issues.apache.org/jira/browse/SPARK-22674
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Jonas Amrich
>Assignee: Apache Spark
>Priority: Major
>
> Pyspark monkey patches the namedtuple class to make it serializable, however 
> this breaks serialization of its subclasses. With current implementation, any 
> subclass will be serialized (and deserialized) as its parent namedtuple. 
> Consider this code, which will fail with {{AttributeError: 'Point' object has 
> no attribute 'sum'}}:
> {code}
> from collections import namedtuple
> Point = namedtuple("Point", "x y")
> class PointSubclass(Point):
> def sum(self):
> return self.x + self.y
> rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]])
> rdd.collect()[0][0].sum()
> {code}
> Moreover, as PySpark hijacks all namedtuples in the main module, importing 
> pyspark breaks serialization of namedtuple subclasses even in code which is 
> not related to spark / distributed execution. I don't see any clean solution 
> to this; a possible workaround may be to limit serialization hack only to 
> direct namedtuple subclasses like in 
> https://github.com/JonasAmrich/spark/commit/f3efecee28243380ecf6657fe54e1a165c1b7204



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22674) PySpark breaks serialization of namedtuple subclasses

2018-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452462#comment-16452462
 ] 

Apache Spark commented on SPARK-22674:
--

User 'superbobry' has created a pull request for this issue:
https://github.com/apache/spark/pull/21157

> PySpark breaks serialization of namedtuple subclasses
> -
>
> Key: SPARK-22674
> URL: https://issues.apache.org/jira/browse/SPARK-22674
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Jonas Amrich
>Priority: Major
>
> Pyspark monkey patches the namedtuple class to make it serializable, however 
> this breaks serialization of its subclasses. With current implementation, any 
> subclass will be serialized (and deserialized) as its parent namedtuple. 
> Consider this code, which will fail with {{AttributeError: 'Point' object has 
> no attribute 'sum'}}:
> {code}
> from collections import namedtuple
> Point = namedtuple("Point", "x y")
> class PointSubclass(Point):
> def sum(self):
> return self.x + self.y
> rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]])
> rdd.collect()[0][0].sum()
> {code}
> Moreover, as PySpark hijacks all namedtuples in the main module, importing 
> pyspark breaks serialization of namedtuple subclasses even in code which is 
> not related to spark / distributed execution. I don't see any clean solution 
> to this; a possible workaround may be to limit serialization hack only to 
> direct namedtuple subclasses like in 
> https://github.com/JonasAmrich/spark/commit/f3efecee28243380ecf6657fe54e1a165c1b7204



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22674) PySpark breaks serialization of namedtuple subclasses

2018-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22674:


Assignee: (was: Apache Spark)

> PySpark breaks serialization of namedtuple subclasses
> -
>
> Key: SPARK-22674
> URL: https://issues.apache.org/jira/browse/SPARK-22674
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Jonas Amrich
>Priority: Major
>
> Pyspark monkey patches the namedtuple class to make it serializable, however 
> this breaks serialization of its subclasses. With current implementation, any 
> subclass will be serialized (and deserialized) as its parent namedtuple. 
> Consider this code, which will fail with {{AttributeError: 'Point' object has 
> no attribute 'sum'}}:
> {code}
> from collections import namedtuple
> Point = namedtuple("Point", "x y")
> class PointSubclass(Point):
> def sum(self):
> return self.x + self.y
> rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]])
> rdd.collect()[0][0].sum()
> {code}
> Moreover, as PySpark hijacks all namedtuples in the main module, importing 
> pyspark breaks serialization of namedtuple subclasses even in code which is 
> not related to spark / distributed execution. I don't see any clean solution 
> to this; a possible workaround may be to limit serialization hack only to 
> direct namedtuple subclasses like in 
> https://github.com/JonasAmrich/spark/commit/f3efecee28243380ecf6657fe54e1a165c1b7204



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23929) pandas_udf schema mapped by position and not by name

2018-04-25 Thread Tr3wory (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452459#comment-16452459
 ] 

Tr3wory commented on SPARK-23929:
-

Yes, but that's not simpler than using "columns=[...]":
{code:python}
from collections import OrderedDict

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)  
def constants(grp):
    return pd.DataFrame(OrderedDict([("id", grp.iloc[0]['id']), ("zeros", 0),
                                     ("ones", 1)]), index=[0])

df.groupby("id").apply(constants).show()
{code}

 

> pandas_udf schema mapped by position and not by name
> 
>
> Key: SPARK-23929
> URL: https://issues.apache.org/jira/browse/SPARK-23929
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: PySpark
> Spark 2.3.0
>  
>Reporter: Omri
>Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by 
> name. Currently it's not the case.
> Consider these two examples, where the only change is the order of the fields 
> in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+---+
> | id|  v|
> +---+---+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23933) High-order function: map(array, array) → map<K,V>

2018-04-25 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452404#comment-16452404
 ] 

Kazuaki Ishizaki commented on SPARK-23933:
--

Thank you for your comment.
The current map can take an even number of arguments (e.g. 2, 4, 6, 8, ...) 
because the arguments form key/value pairs.
We can decide that {{map(1.0, '2', 3.0, '4')}} or {{map(1.0, '2')}} should 
behave as it currently does.

How about {{map(ARRAY [1, 2], ARRAY ["a", "b"])}}?



> High-order function: map(array, array) → map
> ---
>
> Key: SPARK-23933
> URL: https://issues.apache.org/jira/browse/SPARK-23933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map created using the given key/value arrays.
> {noformat}
> SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24087) Avoid shuffle when join keys are a super-set of bucket keys

2018-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452399#comment-16452399
 ] 

Apache Spark commented on SPARK-24087:
--

User 'yucai' has created a pull request for this issue:
https://github.com/apache/spark/pull/21156

> Avoid shuffle when join keys are a super-set of bucket keys
> ---
>
> Key: SPARK-24087
> URL: https://issues.apache.org/jira/browse/SPARK-24087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24087) Avoid shuffle when join keys are a super-set of bucket keys

2018-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24087:


Assignee: (was: Apache Spark)

> Avoid shuffle when join keys are a super-set of bucket keys
> ---
>
> Key: SPARK-24087
> URL: https://issues.apache.org/jira/browse/SPARK-24087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24087) Avoid shuffle when join keys are a super-set of bucket keys

2018-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24087:


Assignee: Apache Spark

> Avoid shuffle when join keys are a super-set of bucket keys
> ---
>
> Key: SPARK-24087
> URL: https://issues.apache.org/jira/browse/SPARK-24087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24067) Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction))

2018-04-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452396#comment-16452396
 ] 

Sean Owen commented on SPARK-24067:
---

It seems like a clear bug fix. Granted it's not trivial, but it is the kind of 
thing that certainly _can_ be backported. I think it's OK, as I don't see that 
it changes API or behavior that was already correct. The fact that there's a 
flag controlling it is a safety valve. I wouldn't think of that as a new 
feature.

> Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle 
> Non-consecutive Offsets (i.e. Log Compaction))
> 
>
> Key: SPARK-24067
> URL: https://issues.apache.org/jira/browse/SPARK-24067
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.3.0
>Reporter: Joachim Hereth
>Assignee: Cody Koeninger
>Priority: Major
>
> SPARK-17147 fixes a problem with non-consecutive Kafka Offsets. The 
> [PR|https://github.com/apache/spark/pull/20572] was merged to {{master}}. This 
> should be backported to 2.3.
>  
> Original Description from SPARK-17147 :
>  
> When Kafka does log compaction offsets often end up with gaps, meaning the 
> next requested offset will be frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset 
> will always be just an increment of 1 above the previous offset.
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
>  {{nextOffset = offset + 1}}
>  to:
>  {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
>  {{requestOffset += 1}}
>  to:
>  {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24087) Avoid shuffle when join keys are a super-set of bucket keys

2018-04-25 Thread yucai (JIRA)
yucai created SPARK-24087:
-

 Summary: Avoid shuffle when join keys are a super-set of bucket 
keys
 Key: SPARK-24087
 URL: https://issues.apache.org/jira/browse/SPARK-24087
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: yucai






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20087) Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd listeners

2018-04-25 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452369#comment-16452369
 ] 

Imran Rashid commented on SPARK-20087:
--

Sounds good to me, I'm in favor of the change

> Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd 
> listeners
> -
>
> Key: SPARK-20087
> URL: https://issues.apache.org/jira/browse/SPARK-20087
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Charles Lewis
>Priority: Major
>
> When tasks end due to an ExceptionFailure, subscribers to onTaskEnd receive 
> accumulators / task metrics for that task, if they were still available. 
> These metrics are not currently sent when tasks are killed intentionally, 
> such as when a speculative retry finishes, and the original is killed (or 
> vice versa). Since we're killing these tasks ourselves, these metrics should 
> almost always exist, and we should treat them the same way as we treat 
> ExceptionFailures.
> Sending these metrics with the TaskKilled end reason makes aggregation across 
> all tasks in an app more accurate. This data can inform decisions about how 
> to tune the speculation parameters in order to minimize duplicated work, and 
> in general, the total cost of an app should include both successful and 
> failed tasks, if that information exists.
> PR: https://github.com/apache/spark/pull/17422



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23929) pandas_udf schema mapped by position and not by name

2018-04-25 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452355#comment-16452355
 ] 

Li Jin commented on SPARK-23929:


[~tr3w] does using OrderedDict help in your case?

> pandas_udf schema mapped by position and not by name
> 
>
> Key: SPARK-23929
> URL: https://issues.apache.org/jira/browse/SPARK-23929
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: PySpark
> Spark 2.3.0
>  
>Reporter: Omri
>Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by 
> name. Currently it's not the case.
> Consider these two examples, where the only change is the order of the fields 
> in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+-------------------+
> | id|                  v|
> +---+-------------------+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+-------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23933) High-order function: map(array, array) → map<K,V>

2018-04-25 Thread Alex Wajda (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452350#comment-16452350
 ] 

Alex Wajda commented on SPARK-23933:


Why can't {{map}} be overloaded? If you pass one array as an argument, it would 
behave as it currently does, treating the array as interleaved keys and values; 
but if you pass two arrays, they would be treated as separate keys and values.
E.g., similar to the [{{sequence}}|SPARK-23927] function, which also accepts a 
varying number of arguments.

> High-order function: map(array, array) → map
> ---
>
> Key: SPARK-23933
> URL: https://issues.apache.org/jira/browse/SPARK-23933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map created using the given key/value arrays.
> {noformat}
> SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24067) Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction))

2018-04-25 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452337#comment-16452337
 ] 

Cody Koeninger commented on SPARK-24067:


Given the response on the dev list about criteria for backporting, I think this 
is in a grey area.

In one sense, it's a new option.  But there have been reports of 
non-consecutive offsets even in normal non-compacted topics, which totally 
break jobs, so this could be seen as a critical bug.

I'd lean towards merging it to the 2.3 branch, but because I'm the one who 
wrote the code, I'm not comfortable making that call on my own. 

[~srowen] you merged the original PR, do you want to weigh in?

> Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle 
> Non-consecutive Offsets (i.e. Log Compaction))
> 
>
> Key: SPARK-24067
> URL: https://issues.apache.org/jira/browse/SPARK-24067
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.3.0
>Reporter: Joachim Hereth
>Assignee: Cody Koeninger
>Priority: Major
>
> SPARK-17147 fixes a problem with non-consecutive Kafka Offsets. The 
> [PR|https://github.com/apache/spark/pull/20572] was merged to {{master}}. This 
> should be backported to 2.3.
>  
> Original Description from SPARK-17147 :
>  
> When Kafka does log compaction offsets often end up with gaps, meaning the 
> next requested offset will be frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset 
> will always be just an increment of 1 above the previous offset.
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
>  {{nextOffset = offset + 1}}
>  to:
>  {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
>  {{requestOffset += 1}}
>  to:
>  {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2018-04-25 Thread fengchaoge (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengchaoge updated SPARK-21337:
---
Attachment: (was: 1tes.zip)

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
>Priority: Major
> Fix For: 2.1.1
>
> Attachments: test.JPG, test1.JPG, test2.JPG
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24086) Exception while executing spark streaming examples

2018-04-25 Thread Chandra Hasan (JIRA)
Chandra Hasan created SPARK-24086:
-

 Summary: Exception while executing spark streaming examples
 Key: SPARK-24086
 URL: https://issues.apache.org/jira/browse/SPARK-24086
 Project: Spark
  Issue Type: Bug
  Components: Examples
Affects Versions: 2.3.0
Reporter: Chandra Hasan


After running mvn clean package, I tried to execute one of the Spark example 
programs, JavaDirectKafkaWordCount.java, but it throws the following exception.
{code:java}
[cloud-user@server-2 examples]$ run-example streaming.JavaDirectKafkaWordCount 
192.168.0.4:9092 msu
2018-04-25 09:39:22 WARN NativeCodeLoader:62 - Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
2018-04-25 09:39:22 INFO SparkContext:54 - Running Spark version 2.3.0
2018-04-25 09:39:22 INFO SparkContext:54 - Submitted application: 
JavaDirectKafkaWordCount
2018-04-25 09:39:22 INFO SecurityManager:54 - Changing view acls to: cloud-user
2018-04-25 09:39:22 INFO SecurityManager:54 - Changing modify acls to: 
cloud-user
2018-04-25 09:39:22 INFO SecurityManager:54 - Changing view acls groups to:
2018-04-25 09:39:22 INFO SecurityManager:54 - Changing modify acls groups to:
2018-04-25 09:39:22 INFO SecurityManager:54 - SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(cloud-user); 
groups with view permissions: Set(); users with modify permissions: 
Set(cloud-user); groups with modify permissions: Set()
2018-04-25 09:39:23 INFO Utils:54 - Successfully started service 'sparkDriver' 
on port 59333.
2018-04-25 09:39:23 INFO SparkEnv:54 - Registering MapOutputTracker
2018-04-25 09:39:23 INFO SparkEnv:54 - Registering BlockManagerMaster
2018-04-25 09:39:23 INFO BlockManagerMasterEndpoint:54 - Using 
org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2018-04-25 09:39:23 INFO BlockManagerMasterEndpoint:54 - 
BlockManagerMasterEndpoint up
2018-04-25 09:39:23 INFO DiskBlockManager:54 - Created local directory at 
/tmp/blockmgr-6fc11fc1-f638-42ea-a9df-dc01fb81b7b6
2018-04-25 09:39:23 INFO MemoryStore:54 - MemoryStore started with capacity 
366.3 MB
2018-04-25 09:39:23 INFO SparkEnv:54 - Registering OutputCommitCoordinator
2018-04-25 09:39:23 INFO log:192 - Logging initialized @1825ms
2018-04-25 09:39:23 INFO Server:346 - jetty-9.3.z-SNAPSHOT
2018-04-25 09:39:23 INFO Server:414 - Started @1900ms
2018-04-25 09:39:23 INFO AbstractConnector:278 - Started 
ServerConnector@6813a331{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-04-25 09:39:23 INFO Utils:54 - Successfully started service 'SparkUI' on 
port 4040.
2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
o.s.j.s.ServletContextHandler@4f7c0be3{/jobs,null,AVAILABLE,@Spark}
2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
o.s.j.s.ServletContextHandler@4cfbaf4{/jobs/json,null,AVAILABLE,@Spark}
2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
o.s.j.s.ServletContextHandler@58faa93b{/jobs/job,null,AVAILABLE,@Spark}
2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
o.s.j.s.ServletContextHandler@127d7908{/jobs/job/json,null,AVAILABLE,@Spark}
2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
o.s.j.s.ServletContextHandler@6b9c69a9{/stages,null,AVAILABLE,@Spark}
2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
o.s.j.s.ServletContextHandler@6622a690{/stages/json,null,AVAILABLE,@Spark}
2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
o.s.j.s.ServletContextHandler@30b9eadd{/stages/stage,null,AVAILABLE,@Spark}
2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
o.s.j.s.ServletContextHandler@3249a1ce{/stages/stage/json,null,AVAILABLE,@Spark}
2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
o.s.j.s.ServletContextHandler@4dd94a58{/stages/pool,null,AVAILABLE,@Spark}
2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
o.s.j.s.ServletContextHandler@2f4919b0{/stages/pool/json,null,AVAILABLE,@Spark}
2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
o.s.j.s.ServletContextHandler@a8a8b75{/storage,null,AVAILABLE,@Spark}
2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
o.s.j.s.ServletContextHandler@75b21c3b{/storage/json,null,AVAILABLE,@Spark}
2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
o.s.j.s.ServletContextHandler@72be135f{/storage/rdd,null,AVAILABLE,@Spark}
2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
o.s.j.s.ServletContextHandler@155d1021{/storage/rdd/json,null,AVAILABLE,@Spark}
2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
o.s.j.s.ServletContextHandler@4bd2f0dc{/environment,null,AVAILABLE,@Spark}
2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
o.s.j.s.ServletContextHandler@2e647e59{/environment/json,null,AVAILABLE,@Spark}
2018-04-25 09:39:23 INFO ContextHandler:781 - Started 
o.s.j.s.ServletContextHandler@2c42b421{/executors,null,AVAILABLE,@Spark}
2018-04-25 09:39:23 INFO ContextHandler:781 - Started 

[jira] [Commented] (SPARK-23929) pandas_udf schema mapped by position and not by name

2018-04-25 Thread Tr3wory (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452282#comment-16452282
 ] 

Tr3wory commented on SPARK-23929:
-

I think the problem is even more nuanced: in Python the order of the values in 
a dict is not defined, so generally, if you create a new DataFrame in your 
function from a dict (like in the previous example), you need to specify the 
order manually.

So
{code}
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def constants(grp):
    return pd.DataFrame({"id": grp.iloc[0]['id'], "ones": 1, "zeros": 0}, index=[0])

df.groupby("id").apply(constants).show()
{code}
and
{code}
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def constants(grp):
    return pd.DataFrame({"id": grp.iloc[0]['id'], "zeros": 0, "ones": 1}, index=[0])

df.groupby("id").apply(constants).show()
{code}
gives you exactly the same wrong results.
The fact that this happens without any error or warning is even more worrying 
(but only if the types are compatible; if any of them is a string, you get 
strange errors).

You need to specify the order manually:
{code}
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def constants(grp):
    return pd.DataFrame({"id": grp.iloc[0]['id'], "zeros": 0, "ones": 1},
                        columns=['id', 'zeros', 'ones'], index=[0])

df.groupby("id").apply(constants).show()
{code}
I think the current implementation is hard to use correctly and very easy to 
use incorrectly, which greatly outweighs the small flexibility of not 
specifying the names...

> pandas_udf schema mapped by position and not by name
> 
>
> Key: SPARK-23929
> URL: https://issues.apache.org/jira/browse/SPARK-23929
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: PySpark
> Spark 2.3.0
>  
>Reporter: Omri
>Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by 
> name. Currently it's not the case.
> Consider these two examples, where the only change is the order of the fields 
> in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+---+
> | id|  v|
> +---+---+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23927) High-order function: sequence

2018-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452246#comment-16452246
 ] 

Apache Spark commented on SPARK-23927:
--

User 'wajda' has created a pull request for this issue:
https://github.com/apache/spark/pull/21155

> High-order function: sequence
> -
>
> Key: SPARK-23927
> URL: https://issues.apache.org/jira/browse/SPARK-23927
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> * sequence(start, stop) → array
> Generate a sequence of integers from start to stop, incrementing by 1 if 
> start is less than or equal to stop, otherwise -1.
> * sequence(start, stop, step) → array
> Generate a sequence of integers from start to stop, incrementing by step.
> * sequence(start, stop) → array
> Generate a sequence of dates from start date to stop date, incrementing by 1 
> day if start date is less than or equal to stop date, otherwise -1 day.
> * sequence(start, stop, step) → array
> Generate a sequence of dates from start to stop, incrementing by step. The 
> type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO MONTH.
> * sequence(start, stop, step) → array
> Generate a sequence of timestamps from start to stop, incrementing by step. 
> The type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO 
> MONTH.
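
A hypothetical usage sketch, assuming the eventual Spark SQL function follows the 
Presto semantics listed above (the function is not in Spark yet, so the calls and 
expected results are illustrative only):

{code:scala}
// Expected behaviour per the semantics above; illustrative only.
spark.sql("SELECT sequence(1, 5)").show()    // [1, 2, 3, 4, 5]
spark.sql("SELECT sequence(5, 1)").show()    // [5, 4, 3, 2, 1]  (start > stop, so step is -1)
spark.sql("SELECT sequence(1, 9, 4)").show() // [1, 5, 9]
{code}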



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23927) High-order function: sequence

2018-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23927:


Assignee: (was: Apache Spark)

> High-order function: sequence
> -
>
> Key: SPARK-23927
> URL: https://issues.apache.org/jira/browse/SPARK-23927
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> * sequence(start, stop) → array
> Generate a sequence of integers from start to stop, incrementing by 1 if 
> start is less than or equal to stop, otherwise -1.
> * sequence(start, stop, step) → array
> Generate a sequence of integers from start to stop, incrementing by step.
> * sequence(start, stop) → array
> Generate a sequence of dates from start date to stop date, incrementing by 1 
> day if start date is less than or equal to stop date, otherwise -1 day.
> * sequence(start, stop, step) → array
> Generate a sequence of dates from start to stop, incrementing by step. The 
> type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO MONTH.
> * sequence(start, stop, step) → array
> Generate a sequence of timestamps from start to stop, incrementing by step. 
> The type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO 
> MONTH.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23927) High-order function: sequence

2018-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23927:


Assignee: Apache Spark

> High-order function: sequence
> -
>
> Key: SPARK-23927
> URL: https://issues.apache.org/jira/browse/SPARK-23927
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> * sequence(start, stop) → array
> Generate a sequence of integers from start to stop, incrementing by 1 if 
> start is less than or equal to stop, otherwise -1.
> * sequence(start, stop, step) → array
> Generate a sequence of integers from start to stop, incrementing by step.
> * sequence(start, stop) → array
> Generate a sequence of dates from start date to stop date, incrementing by 1 
> day if start date is less than or equal to stop date, otherwise -1 day.
> * sequence(start, stop, step) → array
> Generate a sequence of dates from start to stop, incrementing by step. The 
> type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO MONTH.
> * sequence(start, stop, step) → array
> Generate a sequence of timestamps from start to stop, incrementing by step. 
> The type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO 
> MONTH.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24085) Scalar subquery error

2018-04-25 Thread Alexey Baturin (JIRA)
Alexey Baturin created SPARK-24085:
--

 Summary: Scalar subquery error
 Key: SPARK-24085
 URL: https://issues.apache.org/jira/browse/SPARK-24085
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0, 2.3.1
Reporter: Alexey Baturin


The error
{noformat}
SQL Error: java.lang.UnsupportedOperationException: Cannot evaluate expression: 
scalar-subquery{noformat}
occurs when querying a partitioned table, stored as parquet, and filtering on 
the partition column with a scalar subquery.

Query to reproduce:
{code:sql}
CREATE TABLE test_prc_bug (
id_value string
)
partitioned by (id_type string)
location '/tmp/test_prc_bug'
stored as parquet;

insert into test_prc_bug values ('1','a');
insert into test_prc_bug values ('2','a');
insert into test_prc_bug values ('3','b');
insert into test_prc_bug values ('4','b');


select * from test_prc_bug
where id_type = (select 'b');
{code}
If the table is in ORC format, it works fine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2018-04-25 Thread fengchaoge (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengchaoge updated SPARK-21337:
---
Attachment: 1tes.zip

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
>Priority: Major
> Fix For: 2.1.1
>
> Attachments: 1tes.zip, test.JPG, test1.JPG, test2.JPG
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2018-04-25 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452215#comment-16452215
 ] 

Steve Loughran commented on SPARK-18673:


Looking at our local commit logs, the HDP version includes:

* SPARK-13471: update the Groovy version dependency. This is just a safety check 
now that it's explicitly included in the Spark imports
* HIVE-11720: header length; needed for some AD/Kerberos auth

> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>Priority: Major
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2018-04-25 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452196#comment-16452196
 ] 

Steve Loughran commented on SPARK-18673:


HIVE-16081 commit 93db527f47 contains the one-line change to the switch 
statement

> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>Priority: Major
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24084) Add job group id for query through spark-sql

2018-04-25 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-24084:
-
Description: For spark-sql we can add job group id for the same statement.  
(was: For thrift server we can add job group id for the same statement.Like 
below:)

> Add job group id for query through spark-sql
> 
>
> Key: SPARK-24084
> URL: https://issues.apache.org/jira/browse/SPARK-24084
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: zhoukang
>Priority: Major
>
> For spark-sql we can add a job group id for the same statement.
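
A minimal sketch of what this could look like, assuming the statement is tagged 
via SparkContext.setJobGroup (the statement text and group id are illustrative):

{code:scala}
// Tag all jobs launched for one SQL statement with a common job group id so
// they can be identified (and cancelled) together in the UI.
val statement = "SELECT count(*) FROM some_table"          // illustrative
val groupId = s"spark-sql-${java.util.UUID.randomUUID()}"

spark.sparkContext.setJobGroup(groupId, statement, interruptOnCancel = true)
try {
  spark.sql(statement).collect()
} finally {
  spark.sparkContext.clearJobGroup()
}
{code}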



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24084) Add job group id for query through spark-sql

2018-04-25 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-24084:
-
Summary: Add job group id for query through spark-sql  (was: Add job group 
id for query through Thrift Server)

> Add job group id for query through spark-sql
> 
>
> Key: SPARK-24084
> URL: https://issues.apache.org/jira/browse/SPARK-24084
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: zhoukang
>Priority: Major
>
> For the Thrift Server we can add a job group id for the same statement, like below:



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24084) Add job group id for query through Thrift Server

2018-04-25 Thread zhoukang (JIRA)
zhoukang created SPARK-24084:


 Summary: Add job group id for query through Thrift Server
 Key: SPARK-24084
 URL: https://issues.apache.org/jira/browse/SPARK-24084
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: zhoukang


For the Thrift Server we can add a job group id for the same statement, like below:



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24070) TPC-DS Performance Tests for Parquet 1.10.0 Upgrade

2018-04-25 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-24070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452096#comment-16452096
 ] 

Michał Świtakowski edited comment on SPARK-24070 at 4/25/18 11:43 AM:
--

[~maropu] I think you can just use the existing benchmark: 
[https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadBenchmark.scala]

Just make sure that you have a few GB of free memory. If the files are read 
from OS buffer cache, we will be able to see any performance differences better.


was (Author: mswit):
[~maropu] I think you can just use the existing benchmark: 
[https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadBenchmark.scala]
 

Just make sure that you have a few GB of free memory. If the files are read 
from OS buffer cache, we will be able to see any performance differences better.

> TPC-DS Performance Tests for Parquet 1.10.0 Upgrade
> ---
>
> Key: SPARK-24070
> URL: https://issues.apache.org/jira/browse/SPARK-24070
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Takeshi Yamamuro
>Priority: Major
>
> TPC-DS performance evaluation of Apache Spark Parquet 1.8.2 and 1.10.0. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24070) TPC-DS Performance Tests for Parquet 1.10.0 Upgrade

2018-04-25 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-24070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452096#comment-16452096
 ] 

Michał Świtakowski commented on SPARK-24070:


[~maropu] I think you can just use the existing benchmark: 
[https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadBenchmark.scala]
 

Just make sure that you have a few GB of free memory. If the files are read 
from OS buffer cache, we will be able to see any performance differences better.

> TPC-DS Performance Tests for Parquet 1.10.0 Upgrade
> ---
>
> Key: SPARK-24070
> URL: https://issues.apache.org/jira/browse/SPARK-24070
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Takeshi Yamamuro
>Priority: Major
>
> TPC-DS performance evaluation of Apache Spark Parquet 1.8.2 and 1.10.0. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23880) table cache should be lazy and don't trigger any jobs.

2018-04-25 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23880.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21018
[https://github.com/apache/spark/pull/21018]

> table cache should be lazy and don't trigger any jobs.
> --
>
> Key: SPARK-23880
> URL: https://issues.apache.org/jira/browse/SPARK-23880
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 2.4.0
>
>
> {code}
> val df = spark.range(100L)
>   .filter('id > 1000)
>   .orderBy('id.desc)
>   .cache()
> {code}
> This triggers a job while the cache should be lazy. The problem is that, when 
> creating `InMemoryRelation`, we build the RDD, which calls 
> `SparkPlan.execute` and may trigger jobs, such as a sampling job for the range 
> partitioner or a broadcast job.
> We should create the RDD at the physical phase.
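
A quick way to observe the behaviour described above, as a spark-shell sketch 
(the filter value is illustrative):

{code:scala}
import spark.implicits._   // for the 'id column syntax used above

// Building the cached DataFrame should not launch any Spark job; the sampling
// for the range partitioner should only run once an action is executed.
val df = spark.range(100L)
  .filter('id > 10)
  .orderBy('id.desc)
  .cache()       // expectation: no job yet

df.count()       // first action: this is where jobs should appear
{code}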



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23880) table cache should be lazy and don't trigger any jobs.

2018-04-25 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23880:
---

Assignee: Takeshi Yamamuro

> table cache should be lazy and don't trigger any jobs.
> --
>
> Key: SPARK-23880
> URL: https://issues.apache.org/jira/browse/SPARK-23880
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 2.4.0
>
>
> {code}
> val df = spark.range(100L)
>   .filter('id > 1000)
>   .orderBy('id.desc)
>   .cache()
> {code}
> This triggers a job while the cache should be lazy. The problem is that, when 
> creating `InMemoryRelation`, we build the RDD, which calls 
> `SparkPlan.execute` and may trigger jobs, such as a sampling job for the range 
> partitioner or a broadcast job.
> We should create the RDD at the physical phase.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24012) Union of map and other compatible column

2018-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452059#comment-16452059
 ] 

Apache Spark commented on SPARK-24012:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21154

> Union of map and other compatible column
> 
>
> Key: SPARK-24012
> URL: https://issues.apache.org/jira/browse/SPARK-24012
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.1
>Reporter: Lijia Liu
>Assignee: Lijia Liu
>Priority: Major
> Fix For: 2.4.0
>
>
> Union of a map and another compatible column results in an unresolved operator 
> 'Union; exception
> Reproduction
> spark-sql>select map(1,2), 'str' union all select map(1,2,3,null), 1
> Output:
> Error in query: unresolved operator 'Union;;
> 'Union
> :- Project [map(1, 2) AS map(1, 2)#106, str AS str#107]
> : +- OneRowRelation$
> +- Project [map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS 
> INT), 3, CAST(NULL AS INT))#109, 1 AS 1#108]
>  +- OneRowRelation$
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20894) Error while checkpointing to HDFS

2018-04-25 Thread Aydin Kocas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452041#comment-16452041
 ] 

Aydin Kocas commented on SPARK-20894:
-

Removing the checkpoint location along with the _spark_metadata folder in the 
affected writeStream output folder helped to get rid of the issue, but it should 
be noted that the situation persists in Spark 2.3.

It seems that there is some bad state in _spark_metadata - it happened 
unexpectedly without any code change - so to me it looks like a bug somewhere. I 
am not using HDFS; I am developing locally without a cluster.
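
A minimal sketch of the cleanup described above, using the Hadoop FileSystem API 
(the checkpoint path comes from the issue description; the output directory is 
illustrative):

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// Delete the stale streaming checkpoint and the sink's _spark_metadata folder
// before restarting the query. On a local setup this resolves to the local FS.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.delete(new Path("/usr/local/hadoop/checkpoint"), true)            // recursive
fs.delete(new Path("/path/to/output/_spark_metadata"), true)         // illustrative path
{code}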

> Error while checkpointing to HDFS
> -
>
> Key: SPARK-20894
> URL: https://issues.apache.org/jira/browse/SPARK-20894
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.1
> Environment: Ubuntu, Spark 2.1.1, hadoop 2.7
>Reporter: kant kodali
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: driver_info_log, executor1_log, executor2_log
>
>
> Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 
> hours", "24 hours"), df1.col("AppName")).count();
> StreamingQuery query = df2.writeStream().foreach(new 
> KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start();
> query.awaitTermination();
> This for some reason fails with the Error 
> ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalStateException: Error reading delta file 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = 
> (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist
> I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/  and all 
> consumer offsets in Kafka from all brokers prior to running and yet this 
> error still persists. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20894) Error while checkpointing to HDFS

2018-04-25 Thread Aydin Kocas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452030#comment-16452030
 ] 

Aydin Kocas commented on SPARK-20894:
-

I am having the same issue on 2.3 - what's the solution? I am developing locally 
on a standard ext4 filesystem, not in an HDFS cluster, but I see the same 
HDFS-related entries in the log along with the error message.

> Error while checkpointing to HDFS
> -
>
> Key: SPARK-20894
> URL: https://issues.apache.org/jira/browse/SPARK-20894
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.1
> Environment: Ubuntu, Spark 2.1.1, hadoop 2.7
>Reporter: kant kodali
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: driver_info_log, executor1_log, executor2_log
>
>
> Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 
> hours", "24 hours"), df1.col("AppName")).count();
> StreamingQuery query = df2.writeStream().foreach(new 
> KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start();
> query.awaitTermination();
> This for some reason fails with the Error 
> ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalStateException: Error reading delta file 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = 
> (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist
> I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/  and all 
> consumer offsets in Kafka from all brokers prior to running and yet this 
> error still persists. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24012) Union of map and other compatible column

2018-04-25 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24012.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21100
[https://github.com/apache/spark/pull/21100]

> Union of map and other compatible column
> 
>
> Key: SPARK-24012
> URL: https://issues.apache.org/jira/browse/SPARK-24012
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.1
>Reporter: Lijia Liu
>Assignee: Lijia Liu
>Priority: Major
> Fix For: 2.4.0
>
>
> Union of a map and another compatible column results in an unresolved operator 
> 'Union; exception
> Reproduction
> spark-sql>select map(1,2), 'str' union all select map(1,2,3,null), 1
> Output:
> Error in query: unresolved operator 'Union;;
> 'Union
> :- Project [map(1, 2) AS map(1, 2)#106, str AS str#107]
> : +- OneRowRelation$
> +- Project [map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS 
> INT), 3, CAST(NULL AS INT))#109, 1 AS 1#108]
>  +- OneRowRelation$
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24058) Default Params in ML should be saved separately: Python API

2018-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24058:


Assignee: Apache Spark

> Default Params in ML should be saved separately: Python API
> ---
>
> Key: SPARK-24058
> URL: https://issues.apache.org/jira/browse/SPARK-24058
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Major
>
> See [SPARK-23455] for reference.  Since DefaultParamsReader has been changed 
> in Scala, we must change it for Python for Spark 2.4.0 as well in order to 
> keep the 2 in sync.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


