[jira] [Updated] (SPARK-25804) JDOPersistenceManager leak when query via JDBC
[ https://issues.apache.org/jira/browse/SPARK-25804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pin_zhang updated SPARK-25804:
--
Description:
1. start-thriftserver.sh under Spark 2.3.1
2. Create a table and insert values:
{code:sql}
create table test_leak (id string, index int);
insert into test_leak values ('id1', 1);
{code}
3. Create a JDBC client that queries the table:
{code:java}
import java.sql.*;

public class HiveClient {
    public static void main(String[] args) throws Exception {
        String driverName = "org.apache.hive.jdbc.HiveDriver";
        Class.forName(driverName);
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:1/default", "test", "test");
        Statement stmt = con.createStatement();
        String sql = "select * from test_leak";
        int loop = 100;
        while (loop-- > 0) {
            ResultSet rs = stmt.executeQuery(sql);
            rs.next();
            System.out.println(new java.sql.Timestamp(System.currentTimeMillis())
                    + " : " + rs.getString(1));
            rs.close();
            if (loop % 100 == 0) {
                Thread.sleep(1);
            }
        }
        con.close();
    }
}
{code}
4. Dump the HiveServer2 heap: org.datanucleus.api.jdo.JDOPersistenceManager instances keep increasing.

was: the same repro steps, except that the query loop had no Thread.sleep throttling.
> JDOPersistenceManager leak when query via JDBC
> --
>
> Key: SPARK-25804
> URL: https://issues.apache.org/jira/browse/SPARK-25804
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: pin_zhang
> Priority: Major
>
> (repro steps are in the updated description above)

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25817) Dataset encoder should support combination of map and product type
[ https://issues.apache.org/jira/browse/SPARK-25817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25817: Assignee: Wenchen Fan (was: Apache Spark) > Dataset encoder should support combination of map and product type > -- > > Key: SPARK-25817 > URL: https://issues.apache.org/jira/browse/SPARK-25817 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major >
[jira] [Commented] (SPARK-25817) Dataset encoder should support combination of map and product type
[ https://issues.apache.org/jira/browse/SPARK-25817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661713#comment-16661713 ] Apache Spark commented on SPARK-25817: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/22812
[jira] [Assigned] (SPARK-25817) Dataset encoder should support combination of map and product type
[ https://issues.apache.org/jira/browse/SPARK-25817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25817: Assignee: Apache Spark (was: Wenchen Fan)
[jira] [Commented] (SPARK-25817) Dataset encoder should support combination of map and product type
[ https://issues.apache.org/jira/browse/SPARK-25817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661714#comment-16661714 ] Apache Spark commented on SPARK-25817: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/22812
[jira] [Created] (SPARK-25817) Dataset encoder should support combination of map and product type
Wenchen Fan created SPARK-25817: --- Summary: Dataset encoder should support combination of map and product type Key: SPARK-25817 URL: https://issues.apache.org/jira/browse/SPARK-25817 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Wenchen Fan Assignee: Wenchen Fan
[jira] [Commented] (SPARK-25810) Spark structured streaming logs auto.offset.reset=earliest even though startingOffsets is set to latest
[ https://issues.apache.org/jira/browse/SPARK-25810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661703#comment-16661703 ] sandeep katta commented on SPARK-25810: --- [~abanthiy] Thanks for reporting this. Can you please share a screenshot of the logs so I can check exactly which part of the flow is misleading? Is it coming as part of the ConsumerConfig values? > Spark structured streaming logs auto.offset.reset=earliest even though > startingOffsets is set to latest > --- > > Key: SPARK-25810 > URL: https://issues.apache.org/jira/browse/SPARK-25810 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: ANUJA BANTHIYA >Priority: Trivial > > I have an issue when I'm trying to read data from Kafka using Spark > Structured Streaming. > Versions: spark-core_2.11 : 2.3.1, spark-sql_2.11 : 2.3.1, > spark-sql-kafka-0-10_2.11 : 2.3.1, kafka-client : 0.11.0.0 > The issue I am facing is that the Spark job always logs auto.offset.reset = > earliest even though the latest option is specified in the code during startup > of the application. > Code to reproduce: > {code:java} > package com.informatica.exec > import org.apache.spark.sql.SparkSession > object kafkaLatestOffset { > def main(s: Array[String]) { > val spark = SparkSession > .builder() > .appName("Spark Offset basic example") > .master("local[*]") > .getOrCreate() > val df = spark > .readStream > .format("kafka") > .option("kafka.bootstrap.servers", "localhost:9092") > .option("subscribe", "topic1") > .option("startingOffsets", "latest") > .load() > val query = df.writeStream > .outputMode("complete") > .format("console") > .start() > query.awaitTermination() > } > } > {code} > > As mentioned in the Structured Streaming doc, {{startingOffsets}} needs to be set > instead of auto.offset.reset. 
> [https://spark.apache.org/docs/2.3.1/structured-streaming-kafka-integration.html] > * *auto.offset.reset*: Set the source option {{startingOffsets}} to specify > where to start instead. Structured Streaming manages which offsets are > consumed internally, rather than rely on the kafka Consumer to do it. This > will ensure that no data is missed when new topics/partitions are dynamically > subscribed. Note that {{startingOffsets}} only applies when a new streaming > query is started, and that resuming will always pick up from where the query > left off. > During runtime, Kafka messages are picked up from the latest offset, so > function-wise it is working as expected. Only the log is misleading, as it logs > auto.offset.reset = *earliest*.
[jira] [Updated] (SPARK-25797) Views created via 2.1 cannot be read via 2.2+
[ https://issues.apache.org/jira/browse/SPARK-25797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenxiao Mao updated SPARK-25797: - Description: We ran into this issue when we updated our Spark from 2.1 to 2.3. Below is a simple example to reproduce the issue. Create views via Spark 2.1 {code:sql} create view v1 as select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1; {code} Query views via Spark 2.3 {code:sql} select * from v1; Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: decimal(19,0) as it may truncate {code} After investigation, we found that this is because when a view is created via Spark 2.1, the expanded text is saved instead of the original text. Unfortunately, the expanded text below is buggy. {code:sql} spark-sql> desc extended v1; c1 decimal(19,0) NULL Detailed Table Information Database default Table v1 Type VIEW View Text SELECT `gen_attr_0` AS `c1` FROM (SELECT (CAST(CAST(1 AS DECIMAL(18,0)) AS DECIMAL(19,0)) + CAST(CAST(1 AS DECIMAL(18,0)) AS DECIMAL(19,0))) AS `gen_attr_0`) AS gen_subquery_0 {code} We can see that c1 is decimal(19,0); however, the expanded text contains decimal(19,0) + decimal(19,0), which results in decimal(20,0). Since Spark 2.2, a decimal(20,0) in a query is not allowed to be cast to the view definition column decimal(19,0). ([https://github.com/apache/spark/pull/16561]) I further tested other decimal calculations. Only add/subtract has this issue. 
Create views via 2.1: {code:sql} create view v1 as select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1; create view v2 as select (cast(1 as decimal(18,0)) - cast(1 as decimal(18,0))) c1; create view v3 as select (cast(1 as decimal(18,0)) * cast(1 as decimal(18,0))) c1; create view v4 as select (cast(1 as decimal(18,0)) / cast(1 as decimal(18,0))) c1; create view v5 as select (cast(1 as decimal(18,0)) % cast(1 as decimal(18,0))) c1; create view v6 as select cast(1 as decimal(18,0)) c1 union select cast(1 as decimal(19,0)) c1; {code} Query views via Spark 2.3 {code:sql} select * from v1; Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: decimal(19,0) as it may truncate select * from v2; Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3909: decimal(19,0) as it may truncate select * from v3; 1 select * from v4; 1 select * from v5; 0 select * from v6; 1 {code} Views created via Spark 2.2+ don't have this issue because Spark 2.2+ does not generate expanded text for view (https://issues.apache.org/jira/browse/SPARK-18209). was: We ran into this issue when we update our Spark from 2.1 to 2.3. Below's a simple example to reproduce the issue. Create views via Spark 2.1 |create view v1 as select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;| Query views via Spark 2.3 |{{select * from v1;}} {{Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: decimal(19,0) as it may truncate}}| After investigation, we found that this is because when a view is created via Spark 2.1, the expanded text is saved instead of the original text. Unfortunately, the blow expanded text is buggy. 
|spark-sql> desc extended v1; c1 decimal(19,0) NULL Detailed Table Information Database default Table v1 Type VIEW View Text SELECT `gen_attr_0` AS `c1` FROM (SELECT (CAST(CAST(1 AS DECIMAL(18,0)) AS DECIMAL(19,0)) + CAST(CAST(1 AS DECIMAL(18,0)) AS DECIMAL(19,0))) AS `gen_attr_0`) AS gen_subquery_0| We can see that c1 is decimal(19,0), however in the expanded text there is decimal(19,0) + decimal(19,0) which results in decimal(20,0). Since Spark 2.2, decimal(20,0) in query is not allowed to cast to view definition column decimal(19,0). ([https://github.com/apache/spark/pull/16561]) I further tested other decimal calculations. Only add/subtract has this issue. Create views via 2.1: |create view v1 as select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1; create view v2 as select (cast(1 as decimal(18,0)) - cast(1 as decimal(18,0))) c1; create view v3 as select (cast(1 as decimal(18,0)) * cast(1 as decimal(18,0))) c1; create view v4 as select (cast(1 as decimal(18,0)) / cast(1 as decimal(18,0))) c1; create view v5 as select (cast(1 as decimal(18,0)) % cast(1 as decimal(18,0))) c1; create view v6 as select cast(1 as decimal(18,0)) c1 union select cast(1 as decimal(19,0)) c1;| Query views via Spark 2.3 |select * from v1; Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: decimal(19,0) as it may truncate select * from v2; Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3909: decimal(19,0) as it may truncate select * from v3; 1 select * from v4; 1 select * from v5; 0 select * from v6; 1| Views created via Spark 2.2+ don't have this issue because Spark 2.2+ does not generate expanded text for view (https://issues.apache.org/jira/browse/SPARK-18209). > Views created via 2.1 cannot be read via 2.2+ >
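The decimal(20,0) in these errors can be reproduced from the standard SQL decimal typing rule for add/subtract that Spark follows. A minimal sketch (an illustrative helper, not Spark's actual implementation):

```python
# Sketch of the decimal result-type rule for add/subtract:
#   result scale     = max(s1, s2)
#   result precision = max(p1 - s1, p2 - s2) + max(s1, s2) + 1
def add_result_type(p1, s1, p2, s2):
    scale = max(s1, s2)
    precision = max(p1 - s1, p2 - s2) + scale + 1
    return precision, scale

# The expanded view text casts both operands to decimal(19,0), so the sum
# becomes decimal(20,0): wider than the view column decimal(19,0), which
# Spark 2.2+ refuses to down-cast.
print(add_result_type(19, 0, 19, 0))  # (20, 0)
```

This also shows why the original query was fine: decimal(18,0) + decimal(18,0) yields decimal(19,0), exactly the view column type; only the double cast in the buggy expanded text widens it one step further.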
[jira] [Commented] (SPARK-25411) Implement range partition in Spark
[ https://issues.apache.org/jira/browse/SPARK-25411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661610#comment-16661610 ] Wang, Gang commented on SPARK-25411: [~cloud_fan] What do you think of this feature? In our internal benchmark, it does improve performance a lot for huge table joins with predicates. > Implement range partition in Spark > -- > > Key: SPARK-25411 > URL: https://issues.apache.org/jira/browse/SPARK-25411 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wang, Gang >Priority: Major > Attachments: range partition design doc.pdf > > > In our production environment, there are some partitioned fact tables, which are > all quite huge. To accelerate join execution, we need to make them bucketed as > well. Then comes the problem: if the bucket number is large, there > may be too many files (file count = bucket number * partition count), which > may put pressure on HDFS. And if the bucket number is small, Spark will > launch an equal number of tasks to read/write it. > > So, can we implement a new partition type supporting range values, just like range > partitioning in Oracle/MySQL > ([https://docs.oracle.com/cd/E17952_01/mysql-5.7-en/partitioning-range.html])? > Say, we can partition by a date column and make every two months a > partition, or partition by an integer column and make each interval of 1 a > partition. > > Ideally, a feature like range partitioning should be implemented in Hive. However, > it has always been hard to update the Hive version in a prod environment, and it is > much more lightweight and flexible if we implement it in Spark.
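As a hypothetical illustration of the proposal (not an existing Spark API), range partitioning assigns each row to a partition determined by a fixed-width interval of the partition column, so file count depends on the value range rather than on bucket count times partition count:

```python
from datetime import date

# Hypothetical sketch of the proposed range partitioning; the helper names
# are illustrative only.
def int_partition(value: int, interval: int) -> int:
    """Partition id for an integer column with the given interval width."""
    return value // interval

def month_partition(d: date, months_per_partition: int = 2) -> int:
    """Partition id for a date column, grouping every N months together."""
    return (d.year * 12 + (d.month - 1)) // months_per_partition

# "make every two months as a partition": Jan and Feb 2018 share a partition.
print(month_partition(date(2018, 1, 15)), month_partition(date(2018, 2, 15)))
```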
[jira] [Updated] (SPARK-25797) Views created via 2.1 cannot be read via 2.2+
[ https://issues.apache.org/jira/browse/SPARK-25797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenxiao Mao updated SPARK-25797: - Description: We ran into this issue when we updated our Spark from 2.1 to 2.3. Below is a simple example to reproduce the issue. Create views via Spark 2.1 |create view v1 as select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;| Query views via Spark 2.3 |{{select * from v1;}} {{Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: decimal(19,0) as it may truncate}}| After investigation, we found that this is because when a view is created via Spark 2.1, the expanded text is saved instead of the original text. Unfortunately, the expanded text below is buggy. |spark-sql> desc extended v1; c1 decimal(19,0) NULL Detailed Table Information Database default Table v1 Type VIEW View Text SELECT `gen_attr_0` AS `c1` FROM (SELECT (CAST(CAST(1 AS DECIMAL(18,0)) AS DECIMAL(19,0)) + CAST(CAST(1 AS DECIMAL(18,0)) AS DECIMAL(19,0))) AS `gen_attr_0`) AS gen_subquery_0| We can see that c1 is decimal(19,0); however, the expanded text contains decimal(19,0) + decimal(19,0), which results in decimal(20,0). Since Spark 2.2, a decimal(20,0) in a query is not allowed to be cast to the view definition column decimal(19,0). ([https://github.com/apache/spark/pull/16561]) I further tested other decimal calculations. Only add/subtract has this issue. 
Create views via 2.1: |create view v1 as select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1; create view v2 as select (cast(1 as decimal(18,0)) - cast(1 as decimal(18,0))) c1; create view v3 as select (cast(1 as decimal(18,0)) * cast(1 as decimal(18,0))) c1; create view v4 as select (cast(1 as decimal(18,0)) / cast(1 as decimal(18,0))) c1; create view v5 as select (cast(1 as decimal(18,0)) % cast(1 as decimal(18,0))) c1; create view v6 as select cast(1 as decimal(18,0)) c1 union select cast(1 as decimal(19,0)) c1;| Query views via Spark 2.3 |select * from v1; Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: decimal(19,0) as it may truncate select * from v2; Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3909: decimal(19,0) as it may truncate select * from v3; 1 select * from v4; 1 select * from v5; 0 select * from v6; 1| Views created via Spark 2.2+ don't have this issue because Spark 2.2+ does not generate expanded text for view (https://issues.apache.org/jira/browse/SPARK-18209). was: We ran into this issue when we update our Spark from 2.1 to 2.3. Below's a simple example to reproduce the issue. Create views via Spark 2.1 |create view v1 as select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;| Query views via Spark 2.3 |{{select * from v1;}} {{Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: decimal(19,0) as it may truncate}}| After investigation, we found that this is because when a view is created via Spark 2.1, the expanded text is saved instead of the original text. Unfortunately, the blow expanded text is buggy. 
|spark-sql> desc extended v1; c1 decimal(19,0) NULL Detailed Table Information Database default Table v1 Type VIEW View Text SELECT `gen_attr_0` AS `c1` FROM (SELECT (CAST(CAST(1 AS DECIMAL(18,0)) AS DECIMAL(19,0)) + CAST(CAST(1 AS DECIMAL(18,0)) AS DECIMAL(19,0))) AS `gen_attr_0`) AS gen_subquery_0| We can see that c1 is decimal(19,0), however in the expanded text there is decimal(19,0) + decimal(19,0) which results in decimal(20,0). Since Spark 2.2, decimal(20,0) in query is not allowed to cast to view definition column decimal(19,0). ([https://github.com/apache/spark/pull/16561]) I further tested other decimal calculations. Only add/subtract has this issue. Create views via 2.1: |create view v1 as select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1; create view v2 as select (cast(1 as decimal(18,0)) - cast(1 as decimal(18,0))) c1; create view v3 as select (cast(1 as decimal(18,0)) * cast(1 as decimal(18,0))) c1; create view v4 as select (cast(1 as decimal(18,0)) / cast(1 as decimal(18,0))) c1; create view v5 as select (cast(1 as decimal(18,0)) % cast(1 as decimal(18,0))) c1; create view v6 as select cast(1 as decimal(18,0)) c1 union select cast(1 as decimal(19,0)) c1;| Query views via Spark 2.3 |select * from v1; Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: decimal(19,0) as it may truncate select * from v2; Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3909: decimal(19,0) as it may truncate select * from v3; 1 select * from v4; 1 select * from v5; 0 select * from v6; 1| > Views created via 2.1 cannot be read via 2.2+ > - > > Key: SPARK-25797 > URL: https://issues.apache.org/jira/browse/SPARK-25797 > Project: Spark > Issue Type:
[jira] [Assigned] (SPARK-25772) Java encoders - switch fields on collectAsList
[ https://issues.apache.org/jira/browse/SPARK-25772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-25772: --- Assignee: Vladimir Kuriatkov > Java encoders - switch fields on collectAsList > -- > > Key: SPARK-25772 > URL: https://issues.apache.org/jira/browse/SPARK-25772 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: mac os > spark 2.1.1 > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121 >Reporter: Tom >Assignee: Vladimir Kuriatkov >Priority: Major > Fix For: 3.0.0 > > > I have the following schema in a dataset - > root > |-- userId: string (nullable = true) > |-- data: map (nullable = true) > ||-- key: string > ||-- value: struct (valueContainsNull = true) > |||-- startTime: long (nullable = true) > |||-- endTime: long (nullable = true) > |-- offset: long (nullable = true) > And I have the following classes (+ setter and getters which I omitted for > simplicity) - > > {code:java} > public class MyClass { > private String userId; > private Map data; > private Long offset; > } > public class MyDTO { > private long startTime; > private long endTime; > } > {code} > I collect the result the following way - > {code:java} > Encoder myClassEncoder = Encoders.bean(MyClass.class); > Dataset results = raw_df.as(myClassEncoder); > List lst = results.collectAsList(); > {code} > > I do several calculations to get the result I want and the result is correct > all through the way before I collect it. 
> This is the result for - > {code:java} > results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false); > {code} > |data[2017-07-01].startTime|data[2017-07-01].endTime| > +-+--+ > |1498854000|1498870800 | > This is the result after collecting the results for - > {code:java} > MyClass userData = results.collectAsList().get(0); > MyDTO userDTO = userData.getData().get("2017-07-01"); > System.out.println("userDTO startTime: " + userDTO.getStartTime()); > System.out.println("userDTO endTime: " + userDTO.getEndTime()); > {code} > -- > data startTime: 1498870800 > data endTime: 1498854000 > I tend to believe it is a spark issue. Would love any suggestions on how to > bypass it.
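One plausible mechanism for the swap, sketched hypothetically (this is not Spark's actual code): the row carries the struct values in schema order (startTime, endTime), while a bean encoder enumerates properties alphabetically (endTime, startTime); pairing the two by position swaps the values, reproducing the reported output:

```python
# Hypothetical sketch of the suspected bug: struct values in schema order
# paired positionally with bean properties in alphabetical order.
schema_order = ["startTime", "endTime"]
row_values = [1498854000, 1498870800]        # values in schema order
bean_props = sorted(schema_order)            # ['endTime', 'startTime']

bean = dict(zip(bean_props, row_values))     # positional pairing (the suspect)
print(bean["startTime"], bean["endTime"])    # 1498870800 1498854000 -- swapped
```

Resolving struct fields by name instead of by position would avoid the swap, which is consistent with the issue title "switch fields on collectAsList".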
[jira] [Resolved] (SPARK-25772) Java encoders - switch fields on collectAsList
[ https://issues.apache.org/jira/browse/SPARK-25772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25772. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22745 [https://github.com/apache/spark/pull/22745] > Java encoders - switch fields on collectAsList > -- > > Key: SPARK-25772 > URL: https://issues.apache.org/jira/browse/SPARK-25772 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: mac os > spark 2.1.1 > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121 >Reporter: Tom >Priority: Major > Fix For: 3.0.0 >
[jira] [Updated] (SPARK-22809) pyspark is sensitive to imports with dots
[ https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-22809: - Fix Version/s: 2.3.2 > pyspark is sensitive to imports with dots > - > > Key: SPARK-22809 > URL: https://issues.apache.org/jira/browse/SPARK-22809 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0, 2.2.1 >Reporter: Cricket Temple >Assignee: holdenk >Priority: Major > Fix For: 2.3.2, 2.4.0 > > > User code can fail with dotted imports. Here's a repro script. > {noformat} > import numpy as np > import pandas as pd > import pyspark > import scipy.interpolate > import scipy.interpolate as scipy_interpolate > import py4j > scipy_interpolate2 = scipy.interpolate > sc = pyspark.SparkContext() > spark_session = pyspark.SQLContext(sc) > ### > # The details of this dataset are irrelevant # > # Sorry if you'd have preferred something more boring # > ### > x__ = np.linspace(0,10,1000) > freq__ = np.arange(1,5) > x_, freq_ = np.ix_(x__, freq__) > y = np.sin(x_ * freq_).ravel() > x = (x_ * np.ones(freq_.shape)).ravel() > freq = (np.ones(x_.shape) * freq_).ravel() > df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq']) > df_sk = spark_session.createDataFrame(df_pd) > assert(df_sk.toPandas() == df_pd).all().all() > try: > import matplotlib.pyplot as plt > for f, data in df_pd.groupby("freq"): > plt.plot(*data[['x','y']].values.T) > plt.show() > except: > print("I guess we can't plot anything") > def mymap(x, interp_fn): > df = pd.DataFrame.from_records([row.asDict() for row in list(x)]) > return interp_fn(df.x.values, df.y.values)(np.pi) > df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey() > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy_interpolate.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > try: > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy.interpolate.interp1d)).collect() > raise 
Exception("Not going to reach this line") > except py4j.protocol.Py4JJavaError, e: > print("See?") > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy_interpolate2.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > # But now it works! > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy.interpolate.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > {noformat}
[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots
[ https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661502#comment-16661502 ] Bryan Cutler commented on SPARK-22809: -- Sure, I probably shouldn't have tested out of the branches. Running tests again from IPython with Python 3.6.6: *v2.2.2* - Error is raised *v2.3.2* - Working *v2.4.0-rc4* - Working From those results, it seems like SPARK-21070 most likely fixed it > pyspark is sensitive to imports with dots > - > > Key: SPARK-22809 > URL: https://issues.apache.org/jira/browse/SPARK-22809 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0, 2.2.1 >Reporter: Cricket Temple >Assignee: holdenk >Priority: Major > Fix For: 2.4.0 > > > User code can fail with dotted imports. Here's a repro script. > {noformat} > import numpy as np > import pandas as pd > import pyspark > import scipy.interpolate > import scipy.interpolate as scipy_interpolate > import py4j > scipy_interpolate2 = scipy.interpolate > sc = pyspark.SparkContext() > spark_session = pyspark.SQLContext(sc) > ### > # The details of this dataset are irrelevant # > # Sorry if you'd have preferred something more boring # > ### > x__ = np.linspace(0,10,1000) > freq__ = np.arange(1,5) > x_, freq_ = np.ix_(x__, freq__) > y = np.sin(x_ * freq_).ravel() > x = (x_ * np.ones(freq_.shape)).ravel() > freq = (np.ones(x_.shape) * freq_).ravel() > df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq']) > df_sk = spark_session.createDataFrame(df_pd) > assert(df_sk.toPandas() == df_pd).all().all() > try: > import matplotlib.pyplot as plt > for f, data in df_pd.groupby("freq"): > plt.plot(*data[['x','y']].values.T) > plt.show() > except: > print("I guess we can't plot anything") > def mymap(x, interp_fn): > df = pd.DataFrame.from_records([row.asDict() for row in list(x)]) > return interp_fn(df.x.values, df.y.values)(np.pi) > df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey() > result = 
df_by_freq.mapValues(lambda x: mymap(x, > scipy_interpolate.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > try: > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy.interpolate.interp1d)).collect() > raise Exception("Not going to reach this line") > except py4j.protocol.Py4JJavaError, e: > print("See?") > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy_interpolate2.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > # But now it works! > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy.interpolate.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > {noformat}
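Incidentally, the repro script is Python 2 (the `except py4j.protocol.Py4JJavaError, e:` handler and subscripting the result of `zip`); under Python 3 the handler is written with `as`. A minimal sketch, with a stdlib exception standing in for Py4JJavaError so it runs without py4j:

```python
# Python 3 form of the handler pattern used in the repro script;
# ValueError stands in for py4j.protocol.Py4JJavaError here.
try:
    raise ValueError("Not going to reach this line")
except ValueError as e:
    caught = str(e)
    print("See?", caught)
```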
[jira] [Updated] (SPARK-25816) Functions does not resolve Columns correctly
[ https://issues.apache.org/jira/browse/SPARK-25816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Zhang updated SPARK-25816: Attachment: source.snappy.parquet > Functions does not resolve Columns correctly > > > Key: SPARK-25816 > URL: https://issues.apache.org/jira/browse/SPARK-25816 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Brian Zhang >Priority: Critical > Attachments: source.snappy.parquet > > > When there is a duplicate column name in the current Dataframe and the original > Dataframe that the current df is selected from, Spark 2.3.0 and 2.3.1 do > not resolve the column correctly when it is used in an expression, causing a > casting issue. The same code works in Spark 2.2.1. > Please see the code below to reproduce the issue: > import org.apache.spark._ > import org.apache.spark.rdd._ > import org.apache.spark.storage.StorageLevel._ > import org.apache.spark.sql._ > import org.apache.spark.sql.DataFrame > import org.apache.spark.sql.types._ > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.catalyst.expressions._ > import org.apache.spark.sql.Column > val v0 = spark.read.parquet("/data/home/bzinfa/bz/source.snappy.parquet") > val v00 = v0.toDF(v0.schema.fields.indices.view.map("" + _):_*) > val v5 = v00.select($"13".as("0"),$"14".as("1"),$"15".as("2")) > val v5_2 = $"2" > v5.where(lit(500).<(v5_2(new Column(new MapKeys(v5_2.expr))(lit(0))))) > // v00's 3rd column is binary and the 16th is a map > Error: > org.apache.spark.sql.AnalysisException: cannot resolve 'map_keys(`2`)' due to > data type mismatch: argument 1 requires map type, however, '`2`' is of binary > type.; > > 'Project [0#1591, 1#1592, 2#1593] +- 'Filter (500 < > {color:#FF}2#1593{color}[map_keys({color:#FF}2#1561{color})[0]]) +- > Project [13#1572 AS 0#1591, 14#1573 AS 1#1592, 15#1574 AS 2#1593, 2#1561] +- > Project [c_bytes#1527 AS 0#1559, c_union#1528 AS 1#1560, c_fixed#1529 AS > 2#1561, c_boolean#1530 AS
3#1562, c_float#1531 AS 4#1563, c_double#1532 AS > 5#1564, c_int#1533 AS 6#1565, c_long#1534L AS 7#1566L, c_string#1535 AS > 8#1567, c_decimal_18_2#1536 AS 9#1568, c_decimal_28_2#1537 AS 10#1569, > c_decimal_38_2#1538 AS 11#1570, c_date#1539 AS 12#1571, simple_struct#1540 AS > 13#1572, simple_array#1541 AS 14#1573, simple_map#1542 AS 15#1574] +- > Relation[c_bytes#1527,c_union#1528,c_fixed#1529,c_boolean#1530,c_float#1531,c_double#1532,c_int#1533,c_long#1534L,c_string#1535,c_decimal_18_2#1536,c_decimal_28_2#1537,c_decimal_38_2#1538,c_date#1539,simple_struct#1540,simple_array#1541,simple_map#1542] > parquet -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25816) Functions does not resolve Columns correctly
Brian Zhang created SPARK-25816: --- Summary: Functions does not resolve Columns correctly Key: SPARK-25816 URL: https://issues.apache.org/jira/browse/SPARK-25816 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1, 2.3.0 Reporter: Brian Zhang When there is a duplicate column name in the current Dataframe and the original Dataframe that the current df is selected from, Spark 2.3.0 and 2.3.1 do not resolve the column correctly when it is used in an expression, causing a casting issue. The same code works in Spark 2.2.1. Please see the code below to reproduce the issue: import org.apache.spark._ import org.apache.spark.rdd._ import org.apache.spark.storage.StorageLevel._ import org.apache.spark.sql._ import org.apache.spark.sql.DataFrame import org.apache.spark.sql.types._ import org.apache.spark.sql.functions._ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.Column val v0 = spark.read.parquet("/data/home/bzinfa/bz/source.snappy.parquet") val v00 = v0.toDF(v0.schema.fields.indices.view.map("" + _):_*) val v5 = v00.select($"13".as("0"),$"14".as("1"),$"15".as("2")) val v5_2 = $"2" v5.where(lit(500).<(v5_2(new Column(new MapKeys(v5_2.expr))(lit(0))))) // v00's 3rd column is binary and the 16th is a map Error: org.apache.spark.sql.AnalysisException: cannot resolve 'map_keys(`2`)' due to data type mismatch: argument 1 requires map type, however, '`2`' is of binary type.; 'Project [0#1591, 1#1592, 2#1593] +- 'Filter (500 < {color:#FF}2#1593{color}[map_keys({color:#FF}2#1561{color})[0]]) +- Project [13#1572 AS 0#1591, 14#1573 AS 1#1592, 15#1574 AS 2#1593, 2#1561] +- Project [c_bytes#1527 AS 0#1559, c_union#1528 AS 1#1560, c_fixed#1529 AS 2#1561, c_boolean#1530 AS 3#1562, c_float#1531 AS 4#1563, c_double#1532 AS 5#1564, c_int#1533 AS 6#1565, c_long#1534L AS 7#1566L, c_string#1535 AS 8#1567, c_decimal_18_2#1536 AS 9#1568, c_decimal_28_2#1537 AS 10#1569, c_decimal_38_2#1538 AS 11#1570, c_date#1539 AS 12#1571, simple_struct#1540 AS 13#1572,
simple_array#1541 AS 14#1573, simple_map#1542 AS 15#1574] +- Relation[c_bytes#1527,c_union#1528,c_fixed#1529,c_boolean#1530,c_float#1531,c_double#1532,c_int#1533,c_long#1534L,c_string#1535,c_decimal_18_2#1536,c_decimal_28_2#1537,c_decimal_38_2#1538,c_date#1539,simple_struct#1540,simple_array#1541,simple_map#1542] parquet -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24516) PySpark Bindings for K8S - make Python 3 the default
[ https://issues.apache.org/jira/browse/SPARK-24516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661466#comment-16661466 ] Apache Spark commented on SPARK-24516: -- User 'ifilonenko' has created a pull request for this issue: https://github.com/apache/spark/pull/22810 > PySpark Bindings for K8S - make Python 3 the default > > > Key: SPARK-24516 > URL: https://issues.apache.org/jira/browse/SPARK-24516 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, PySpark >Affects Versions: 2.4.0 >Reporter: Ondrej Kokes >Priority: Minor > > Initial PySpark-k8s bindings have just been resolved (SPARK-23984), but the > default Python version there is 2. While you can override this by setting it > to 3, I think we should have sensible defaults. > Python 3 has been around for ten years and is the clear successor, Python 2 > has only 18 months left in terms of support. There isn't a good reason to > suggest Python 2 should be used, not in 2018 and not when both versions are > supported. > The relevant commit [is > here|https://github.com/apache/spark/commit/1a644afbac35c204f9ad55f86999319a9ab458c6#diff-6e882d5561424e7e6651eb46f10104b8R194], > the version is also [in the > documentation|https://github.com/apache/spark/commit/1a644afbac35c204f9ad55f86999319a9ab458c6#diff-b5527f236b253e0d9f5db5164bdb43e9R643]. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24516) PySpark Bindings for K8S - make Python 3 the default
[ https://issues.apache.org/jira/browse/SPARK-24516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661467#comment-16661467 ] Apache Spark commented on SPARK-24516: -- User 'ifilonenko' has created a pull request for this issue: https://github.com/apache/spark/pull/22810 > PySpark Bindings for K8S - make Python 3 the default > > > Key: SPARK-24516 > URL: https://issues.apache.org/jira/browse/SPARK-24516 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, PySpark >Affects Versions: 2.4.0 >Reporter: Ondrej Kokes >Priority: Minor > > Initial PySpark-k8s bindings have just been resolved (SPARK-23984), but the > default Python version there is 2. While you can override this by setting it > to 3, I think we should have sensible defaults. > Python 3 has been around for ten years and is the clear successor, Python 2 > has only 18 months left in terms of support. There isn't a good reason to > suggest Python 2 should be used, not in 2018 and not when both versions are > supported. > The relevant commit [is > here|https://github.com/apache/spark/commit/1a644afbac35c204f9ad55f86999319a9ab458c6#diff-6e882d5561424e7e6651eb46f10104b8R194], > the version is also [in the > documentation|https://github.com/apache/spark/commit/1a644afbac35c204f9ad55f86999319a9ab458c6#diff-b5527f236b253e0d9f5db5164bdb43e9R643]. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24516) PySpark Bindings for K8S - make Python 3 the default
[ https://issues.apache.org/jira/browse/SPARK-24516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24516: Assignee: (was: Apache Spark) > PySpark Bindings for K8S - make Python 3 the default > > > Key: SPARK-24516 > URL: https://issues.apache.org/jira/browse/SPARK-24516 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, PySpark >Affects Versions: 2.4.0 >Reporter: Ondrej Kokes >Priority: Minor > > Initial PySpark-k8s bindings have just been resolved (SPARK-23984), but the > default Python version there is 2. While you can override this by setting it > to 3, I think we should have sensible defaults. > Python 3 has been around for ten years and is the clear successor, Python 2 > has only 18 months left in terms of support. There isn't a good reason to > suggest Python 2 should be used, not in 2018 and not when both versions are > supported. > The relevant commit [is > here|https://github.com/apache/spark/commit/1a644afbac35c204f9ad55f86999319a9ab458c6#diff-6e882d5561424e7e6651eb46f10104b8R194], > the version is also [in the > documentation|https://github.com/apache/spark/commit/1a644afbac35c204f9ad55f86999319a9ab458c6#diff-b5527f236b253e0d9f5db5164bdb43e9R643]. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24516) PySpark Bindings for K8S - make Python 3 the default
[ https://issues.apache.org/jira/browse/SPARK-24516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24516: Assignee: Apache Spark > PySpark Bindings for K8S - make Python 3 the default > > > Key: SPARK-24516 > URL: https://issues.apache.org/jira/browse/SPARK-24516 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, PySpark >Affects Versions: 2.4.0 >Reporter: Ondrej Kokes >Assignee: Apache Spark >Priority: Minor > > Initial PySpark-k8s bindings have just been resolved (SPARK-23984), but the > default Python version there is 2. While you can override this by setting it > to 3, I think we should have sensible defaults. > Python 3 has been around for ten years and is the clear successor, Python 2 > has only 18 months left in terms of support. There isn't a good reason to > suggest Python 2 should be used, not in 2018 and not when both versions are > supported. > The relevant commit [is > here|https://github.com/apache/spark/commit/1a644afbac35c204f9ad55f86999319a9ab458c6#diff-6e882d5561424e7e6651eb46f10104b8R194], > the version is also [in the > documentation|https://github.com/apache/spark/commit/1a644afbac35c204f9ad55f86999319a9ab458c6#diff-b5527f236b253e0d9f5db5164bdb43e9R643]. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
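For readers who land on this thread: until the default changes, the worker Python version for PySpark on Kubernetes is controlled by a single property (introduced in the commit linked above). A sketch of overriding it to Python 3 — property name assumes Spark 2.4's Kubernetes support:

```properties
# spark-defaults.conf, or --conf on spark-submit (Spark 2.4 on K8S assumed)
spark.kubernetes.pyspark.pythonVersion  3
```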
[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots
[ https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661444#comment-16661444 ] Dongjoon Hyun commented on SPARK-22809: --- Hi, [~bryanc]. It seems that the test occurs in `branch-2.2`. Could you confirm 2.3.2, too? > pyspark is sensitive to imports with dots > - > > Key: SPARK-22809 > URL: https://issues.apache.org/jira/browse/SPARK-22809 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0, 2.2.1 >Reporter: Cricket Temple >Assignee: holdenk >Priority: Major > Fix For: 2.4.0 > > > User code can fail with dotted imports. Here's a repro script. > {noformat} > import numpy as np > import pandas as pd > import pyspark > import scipy.interpolate > import scipy.interpolate as scipy_interpolate > import py4j > scipy_interpolate2 = scipy.interpolate > sc = pyspark.SparkContext() > spark_session = pyspark.SQLContext(sc) > ### > # The details of this dataset are irrelevant # > # Sorry if you'd have preferred something more boring # > ### > x__ = np.linspace(0,10,1000) > freq__ = np.arange(1,5) > x_, freq_ = np.ix_(x__, freq__) > y = np.sin(x_ * freq_).ravel() > x = (x_ * np.ones(freq_.shape)).ravel() > freq = (np.ones(x_.shape) * freq_).ravel() > df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq']) > df_sk = spark_session.createDataFrame(df_pd) > assert(df_sk.toPandas() == df_pd).all().all() > try: > import matplotlib.pyplot as plt > for f, data in df_pd.groupby("freq"): > plt.plot(*data[['x','y']].values.T) > plt.show() > except: > print("I guess we can't plot anything") > def mymap(x, interp_fn): > df = pd.DataFrame.from_records([row.asDict() for row in list(x)]) > return interp_fn(df.x.values, df.y.values)(np.pi) > df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey() > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy_interpolate.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > try: > 
result = df_by_freq.mapValues(lambda x: mymap(x, > scipy.interpolate.interp1d)).collect() > raise Exception("Not going to reach this line") > except py4j.protocol.Py4JJavaError, e: > print("See?") > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy_interpolate2.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > # But now it works! > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy.interpolate.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22809) pyspark is sensitive to imports with dots
[ https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-22809. -- Resolution: Fixed Fix Version/s: 2.4.0 > pyspark is sensitive to imports with dots > - > > Key: SPARK-22809 > URL: https://issues.apache.org/jira/browse/SPARK-22809 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0, 2.2.1 >Reporter: Cricket Temple >Assignee: holdenk >Priority: Major > Fix For: 2.4.0 > > > User code can fail with dotted imports. Here's a repro script. > {noformat} > import numpy as np > import pandas as pd > import pyspark > import scipy.interpolate > import scipy.interpolate as scipy_interpolate > import py4j > scipy_interpolate2 = scipy.interpolate > sc = pyspark.SparkContext() > spark_session = pyspark.SQLContext(sc) > ### > # The details of this dataset are irrelevant # > # Sorry if you'd have preferred something more boring # > ### > x__ = np.linspace(0,10,1000) > freq__ = np.arange(1,5) > x_, freq_ = np.ix_(x__, freq__) > y = np.sin(x_ * freq_).ravel() > x = (x_ * np.ones(freq_.shape)).ravel() > freq = (np.ones(x_.shape) * freq_).ravel() > df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq']) > df_sk = spark_session.createDataFrame(df_pd) > assert(df_sk.toPandas() == df_pd).all().all() > try: > import matplotlib.pyplot as plt > for f, data in df_pd.groupby("freq"): > plt.plot(*data[['x','y']].values.T) > plt.show() > except: > print("I guess we can't plot anything") > def mymap(x, interp_fn): > df = pd.DataFrame.from_records([row.asDict() for row in list(x)]) > return interp_fn(df.x.values, df.y.values)(np.pi) > df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey() > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy_interpolate.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > try: > result = df_by_freq.mapValues(lambda x: mymap(x, > 
scipy.interpolate.interp1d)).collect() > raise Exception("Not going to reach this line") > except py4j.protocol.Py4JJavaError, e: > print("See?") > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy_interpolate2.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > # But now it works! > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy.interpolate.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots
[ https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661418#comment-16661418 ] Bryan Cutler commented on SPARK-22809: -- I confirmed that I could reproduce in IPython with Spark branch-2.3 and did not have the issue with branch-2.4. I think we can close this issue {noformat} [Spark shell ASCII-art banner] version 2.4.1-SNAPSHOT Using Python version 3.6.6 (default, Oct 12 2018 14:08:43) SparkSession available as 'spark'. In [1]: import pyspark.cloudpickle ...: import pyspark ...: import py4j ...: rdd = sc.parallelize([(1,2)]) ...: import scipy.interpolate In [2]: import scipy.interpolate ...: def foo(*ards, **kwd): ...: scipy.interpolate.interp1d ...: try: ...: rdd.mapValues(foo).collect() ...: except py4j.protocol.Py4JJavaError as err: ...: print("it errored") ...: import scipy.interpolate as scipy_interpolate ...: def bar(*ards, **kwd): ...: scipy_interpolate.interp1d ...: rdd.mapValues(bar).collect() ...: print("worked") ...: rdd.mapValues(foo).collect() ...: print("worked") worked worked{noformat} {noformat} [Spark shell ASCII-art banner] version 2.2.3-SNAPSHOT Using Python version 3.6.6 (default, Oct 12 2018 14:08:43) SparkSession available as 'spark'.
In [1]: import pyspark.cloudpickle ...: import pyspark ...: import py4j ...: rdd = sc.parallelize([(1,2)]) ...: import scipy.interpolate In [2]: import scipy.interpolate ...: def foo(*ards, **kwd): ...: scipy.interpolate.interp1d ...: try: ...: rdd.mapValues(foo).collect() ...: except py4j.protocol.Py4JJavaError as err: ...: print("it errored") ...: import scipy.interpolate as scipy_interpolate ...: def bar(*ards, **kwd): ...: scipy_interpolate.interp1d ...: rdd.mapValues(bar).collect() ...: print("worked") ...: rdd.mapValues(foo).collect() ...: print("worked") 18/10/23 15:39:54 ERROR Executor: Exception in task 7.0 in stage 0.0 (TID 7) org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 196, in main process() File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 191, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream vs = list(itertools.islice(iterator, batch)) File "/home/bryan/git/spark/python/pyspark/rdd.py", line 1951, in map_values_fn = lambda kv: (kv[0], f(kv[1])) File "", line 3, in foo AttributeError: module 'scipy' has no attribute 'interpolate' at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:197) at org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:238) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:156) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:67) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:344) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) [Stage 0:> (0 + 8) / 8]18/10/23 15:39:54 WARN TaskSetManager: Lost task 7.0 in stage 0.0 (TID 7, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 196, in main process() File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 191, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 268,
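The `AttributeError: module 'scipy' has no attribute 'interpolate'` in the trace above is generic Python submodule-binding behavior, not anything scipy-specific: `import pkg.sub` binds only the name `pkg`, and `pkg.sub` is an attribute of the package only once the submodule has actually been imported in that interpreter. A stdlib-only sketch of the same failure mode, using `xml`/`xml.dom` as stand-ins for `scipy`/`scipy.interpolate`:

```python
import sys

# Simulate a fresh worker interpreter: forget xml and all of its submodules.
for name in [m for m in sys.modules if m == "xml" or m.startswith("xml.")]:
    del sys.modules[name]

import xml  # binds only the top-level package, like a bare `import scipy`

try:
    xml.dom  # fails: the submodule was never imported in this interpreter
    submodule_visible = True
except AttributeError:
    submodule_visible = False

import xml.dom  # after an explicit submodule import, the attribute exists
assert xml.dom is not None
```

This is why aliasing the submodule (`import scipy.interpolate as scipy_interpolate`) works on workers: the pickled closure then references the submodule object directly instead of an attribute lookup on the top-level package.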
[jira] [Created] (SPARK-25815) Kerberos Support in Kubernetes resource manager (Client Mode)
Ilan Filonenko created SPARK-25815: -- Summary: Kerberos Support in Kubernetes resource manager (Client Mode) Key: SPARK-25815 URL: https://issues.apache.org/jira/browse/SPARK-25815 Project: Spark Issue Type: New Feature Components: Kubernetes Affects Versions: 3.0.0 Reporter: Ilan Filonenko Include Kerberos support for Spark on K8S jobs running in client-mode -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23257) Kerberos Support in Kubernetes resource manager (Cluster Mode)
[ https://issues.apache.org/jira/browse/SPARK-23257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilan Filonenko updated SPARK-23257: --- Summary: Kerberos Support in Kubernetes resource manager (Cluster Mode) (was: Implement Kerberos Support in Kubernetes resource manager) > Kerberos Support in Kubernetes resource manager (Cluster Mode) > -- > > Key: SPARK-23257 > URL: https://issues.apache.org/jira/browse/SPARK-23257 > Project: Spark > Issue Type: Wish > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Rob Keevil >Assignee: Ilan Filonenko >Priority: Major > Fix For: 3.0.0 > > > On the forked k8s branch of Spark at > [https://github.com/apache-spark-on-k8s/spark/pull/540] , Kerberos support > has been added to the Kubernetes resource manager. The Kubernetes code > between these two repositories appears to have diverged, so this commit > cannot be merged in easily. Are there any plans to re-implement this work on > the main Spark repository? > > [ifilonenko|https://github.com/ifilonenko] [~liyinan926] I am happy to help > with the development and testing of this, but i wanted to confirm that this > isn't already in progress - I could not find any discussion about this > specific topic online. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion
[ https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661324#comment-16661324 ] kevin yu commented on SPARK-25807: -- I am looking into option 1; option 3 would change existing behavior and probably requires more discussion. Kevin > Mitigate 1-based substr() confusion > --- > > Key: SPARK-25807 > URL: https://issues.apache.org/jira/browse/SPARK-25807 > Project: Spark > Issue Type: Improvement > Components: Java API, PySpark >Affects Versions: 1.3.0, 2.3.2, 2.4.0, 2.5.0, 3.0.0 >Reporter: Oron Navon >Priority: Minor > > The method {{Column.substr()}} is 1-based, conforming with SQL and Hive's > {{SUBSTRING}}, and contradicting both Python's {{substr}} and Java's > {{substr}}, which are zero-based. Both PySpark users and Java API users > often naturally expect a 0-based {{substr()}}. Adding to the confusion, > {{substr()}} currently allows a {{startPos}} value of 0, which returns the > same result as {{startPos==1}}. > Since changing {{substr()}} to 0-based is probably NOT a reasonable option > here, I suggest making one or more of the following changes: > # Adding a method {{substr0}}, which would be zero-based > # Renaming {{substr}} to {{substr1}} > # Making the existing {{substr()}} throw an exception on {{startPos==0}}, > which should catch and alert most users who expect zero-based behavior. > This is my first discussion on this project, apologies for any faux pas. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
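To make the off-by-one concrete: `Column.substr(startPos, length)` follows the SQL/Hive convention (the first character is position 1), while Python slicing is 0-based, and `startPos=0` currently behaves like `startPos=1`. A small pure-Python model of the described behavior (`spark_substr` is a hypothetical helper for illustration, not Spark API):

```python
def spark_substr(s: str, start_pos: int, length: int) -> str:
    """Model Column.substr semantics on a plain string: 1-based start,
    with startPos=0 quietly treated the same as startPos=1."""
    if start_pos == 0:
        start_pos = 1  # the surprising edge case described in the issue
    return s[start_pos - 1 : start_pos - 1 + length]

# 1-based: position 1 is the first character
assert spark_substr("Spark", 1, 3) == "Spa"
# 0 behaves like 1 -- the source of the confusion
assert spark_substr("Spark", 0, 3) == "Spa"
# position 2 starts at the second character, not the third
assert spark_substr("Spark", 2, 3) == "par"
```

Option 3 in the issue would replace the silent `start_pos == 0` branch with a raised error, which is why it is the behavior-changing choice.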
[jira] [Resolved] (SPARK-25801) pandas_udf grouped_map fails with input dataframe with more than 255 columns
[ https://issues.apache.org/jira/browse/SPARK-25801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-25801. -- Resolution: Fixed Fix Version/s: 2.4.0 > pandas_udf grouped_map fails with input dataframe with more than 255 columns > > > Key: SPARK-25801 > URL: https://issues.apache.org/jira/browse/SPARK-25801 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: python 2.7 > pyspark 2.3.0 >Reporter: Frederik >Priority: Major > Fix For: 2.4.0 > > > Hi, > I'm using a pandas_udf to deploy a model to predict all samples in a spark > dataframe, > for this I use a udf as follows: > @pandas_udf("scores double", PandasUDFType.GROUPED_MAP) def > predict_scores(pdf): score_values = model.predict_proba(pdf)[:,1] return > pd.DataFrame({'scores': score_values}) > So it takes a dataframe and predicts the probability of being positive > according to an sklearn model for each row and returns this as a single column. > This works great on a random groupBy, e.g.: > sdf_to_score.groupBy(sf.col('age')).apply(predict_scores) > as long as the dataframe has <255 columns. When the input dataframe has more > than 255 columns (thus features in my model), I get: > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line > 219, in main > func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, > eval_type) > File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line > 148, in read_udfs > mapper = eval(mapper_str, udfs) > File "", line 1 > SyntaxError: more than 255 arguments > Which seems to be related to Python's general limitation of not allowing > more than 255 arguments for a function? > > Is this a bug or is there a straightforward way around this problem?
> > Regards, > Frederik -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25801) pandas_udf grouped_map fails with input dataframe with more than 255 columns
[ https://issues.apache.org/jira/browse/SPARK-25801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661290#comment-16661290 ] Bryan Cutler commented on SPARK-25801: -- [~Toekan] you might try turning your features into an array of doubles, so that there is only one column. Then you could unpack them in your udf if needed. I'll mark this as fixed in Spark 2.4 and close. You can reopen if you are unable to find a workaround and want to request a fix to be backported for the next 2.3 release. > pandas_udf grouped_map fails with input dataframe with more than 255 columns > > > Key: SPARK-25801 > URL: https://issues.apache.org/jira/browse/SPARK-25801 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: python 2.7 > pyspark 2.3.0 >Reporter: Frederik >Priority: Major > > Hi, > I'm using a pandas_udf to deploy a model to predict all samples in a spark > dataframe, > for this I use a udf as follows: > @pandas_udf("scores double", PandasUDFType.GROUPED_MAP) def > predict_scores(pdf): score_values = model.predict_proba(pdf)[:,1] return > pd.DataFrame({'scores': score_values}) > So it takes a dataframe and predicts the probability of being positive > according to an sklearn model for each row and returns this as single column. > This works great on a random groupBy, e.g.: > sdf_to_score.groupBy(sf.col('age')).apply(predict_scores) > as long as the dataframe has <255 columns. 
When the input dataframe has more > than 255 columns (thus features in my model), I get: > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line > 219, in main > func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, > eval_type) > File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line > 148, in read_udfs > mapper = eval(mapper_str, udfs) > File "", line 1 > SyntaxError: more than 255 arguments > Which seems to be related to Python's general limitation of not allowing > more than 255 arguments for a function? > > Is this a bug or is there a straightforward way around this problem? > > Regards, > Frederik -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
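The `SyntaxError: more than 255 arguments` comes from a CPython compile-time limit on explicit call arguments, which the generated udf mapper expression trips over. That limit was removed in CPython 3.7, which can be checked directly with a stdlib-only sketch (nothing Spark-specific here):

```python
import sys

# Build a call expression with 300 explicit arguments, analogous to the
# generated mapper string eval'd in the traceback above.
call_src = "f(" + ", ".join("a%d" % i for i in range(300)) + ")"

try:
    compile(call_src, "<generated>", "eval")
    hit_limit = False
except SyntaxError:  # "more than 255 arguments" on CPython < 3.7
    hit_limit = True

# CPython removed the 255-argument limit in 3.7.
assert hit_limit == (sys.version_info < (3, 7))
```

As the comment above suggests, packing the features into a single array column keeps the generated call well under the limit on older interpreters.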
[jira] [Commented] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
[ https://issues.apache.org/jira/browse/SPARK-25814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661282#comment-16661282 ] Ruslan Dautkhanov commented on SPARK-25814: --- thank you [~vanzin] ! I will try to tune those down and see if this helps. > spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore > -- > > Key: SPARK-25814 > URL: https://issues.apache.org/jira/browse/SPARK-25814 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.2 >Reporter: Ruslan Dautkhanov >Priority: Major > Labels: driver, memory-analysis, memory-leak, statestore > Attachments: image-2018-10-23-14-06-53-722.png > > > We're looking into an issue where even a huge spark driver heap eventually > gets exhausted and GC makes the driver stop responding. > Used the [JXRay.com|http://jxray.com/] tool and found that most of the driver heap is > used by > > {noformat} > org.apache.spark.status.AppStatusStore > -> org.apache.spark.status.ElementTrackingStore > -> org.apache.spark.util.kvstore.InMemoryStore > > {noformat} > > Is there a way to tune this particular spark driver's memory region down? > > > !image-2018-10-23-14-06-53-722.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
[ https://issues.apache.org/jira/browse/SPARK-25814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Dautkhanov updated SPARK-25814: -- Priority: Major (was: Critical) > spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore > -- > > Key: SPARK-25814 > URL: https://issues.apache.org/jira/browse/SPARK-25814 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.2 >Reporter: Ruslan Dautkhanov >Priority: Major > Labels: driver, memory-analysis, memory-leak, statestore > Attachments: image-2018-10-23-14-06-53-722.png > > > We're looking into an issue where even a huge spark driver heap eventually > gets exhausted and GC makes the driver stop responding. > Used the [JXRay.com|http://jxray.com/] tool and found that most of the driver heap is > used by > > {noformat} > org.apache.spark.status.AppStatusStore > -> org.apache.spark.status.ElementTrackingStore > -> org.apache.spark.util.kvstore.InMemoryStore > > {noformat} > > Is there a way to tune this particular spark driver's memory region down? > > > !image-2018-10-23-14-06-53-722.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
[ https://issues.apache.org/jira/browse/SPARK-25814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661269#comment-16661269 ] Marcelo Vanzin commented on SPARK-25814: That's UI data. You can control how much UI data is retained with configs that have been there for a long time: {noformat} spark.ui.retainedTasks spark.ui.retainedStages spark.ui.retainedJobs {noformat} > spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore > -- > > Key: SPARK-25814 > URL: https://issues.apache.org/jira/browse/SPARK-25814 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.2 >Reporter: Ruslan Dautkhanov >Priority: Critical > Labels: driver, memory-analysis, memory-leak, statestore > Attachments: image-2018-10-23-14-06-53-722.png > > > We're looking into issue when even huge spark driver memory gets eventually > exhausted and GC makes driver stop responding. > Used [JXRay.com|http://jxray.com/] tool and found that most of driver heap is > used by > > {noformat} > org.apache.spark.status.AppStatusStore > -> org.apache.spark.status.ElementTrackingStore > -> org.apache.spark.util.kvstore.InMemoryStore > > {noformat} > > Is there is a way to tune this particular spark driver's memory region down? > > > !image-2018-10-23-14-06-53-722.png!
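The three retention settings named above could be lowered in spark-defaults.conf (or via --conf on spark-submit). A minimal sketch; the values below are illustrative only, not recommendations, and the right numbers depend on the workload:

```properties
# Cap how many entries the driver's AppStatusStore/InMemoryStore retains
# for the UI. Illustrative values; Spark's defaults are much higher
# (e.g. 100000 retained tasks).
spark.ui.retainedTasks   10000
spark.ui.retainedStages  200
spark.ui.retainedJobs    200
```

Lower limits trade UI history for driver heap: old tasks, stages, and jobs are evicted sooner and disappear from the web UI.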
[jira] [Updated] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
[ https://issues.apache.org/jira/browse/SPARK-25814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Dautkhanov updated SPARK-25814: -- Description: We're looking into issue when even huge spark driver memory gets eventually exhausted and GC makes driver stop responding. Used [JXRay.com|http://jxray.com/] tool and found that most of driver heap is used by {noformat} org.apache.spark.status.AppStatusStore -> org.apache.spark.status.ElementTrackingStore -> org.apache.spark.util.kvstore.InMemoryStore {noformat} Is there is a way to tune this particular spark driver's memory region down? !image-2018-10-23-14-06-53-722.png! was: We're looking into issue when even huge spark driver memory gets eventually exhausted and GC makes driver stop responding. Used jxray.com tool and found that most of driver heap is used by {noformat} org.apache.spark.status.AppStatusStore -> org.apache.spark.status.ElementTrackingStore -> org.apache.spark.util.kvstore.InMemoryStore {noformat} Is there is a way to tune this particular spark driver's memory region down? !image-2018-10-23-14-06-53-722.png! > spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore > -- > > Key: SPARK-25814 > URL: https://issues.apache.org/jira/browse/SPARK-25814 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.2 >Reporter: Ruslan Dautkhanov >Priority: Critical > Labels: driver, memory-analysis, memory-leak, statestore > Attachments: image-2018-10-23-14-06-53-722.png > > > We're looking into issue when even huge spark driver memory gets eventually > exhausted and GC makes driver stop responding. 
> Used [JXRay.com|http://jxray.com/] tool and found that most of driver heap is > used by > > {noformat} > org.apache.spark.status.AppStatusStore > -> org.apache.spark.status.ElementTrackingStore > -> org.apache.spark.util.kvstore.InMemoryStore > > {noformat} > > Is there is a way to tune this particular spark driver's memory region down? > > > !image-2018-10-23-14-06-53-722.png!
[jira] [Updated] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
[ https://issues.apache.org/jira/browse/SPARK-25814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Dautkhanov updated SPARK-25814: -- Description: We're looking into issue when even huge spark driver memory gets eventually exhausted and GC makes driver stop responding. Used jxray.com tool and found that most of driver heap is used by {noformat} org.apache.spark.status.AppStatusStore -> org.apache.spark.status.ElementTrackingStore -> org.apache.spark.util.kvstore.InMemoryStore {noformat} Is there is a way to tune this particular spark driver's memory region down? !image-2018-10-23-14-06-53-722.png! was: We're looking into issue when even huge spark driver memory gets eventually exhausted and GC makes driver stop responding. Used jxray.com tool and found that most of driver heap is used by {noformat} org.apache.spark.status.AppStatusStore -> org.apache.spark.status.ElementTrackingStore -> org.apache.spark.util.kvstore.InMemoryStore {noformat} Is there is a way to tune this particular spark driver's memory region down? !image-2018-10-23-14-03-12-258.png! !image-2018-10-23-14-06-53-722.png! > spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore > -- > > Key: SPARK-25814 > URL: https://issues.apache.org/jira/browse/SPARK-25814 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.2 >Reporter: Ruslan Dautkhanov >Priority: Critical > Labels: driver, memory-analysis, memory-leak, statestore > Attachments: image-2018-10-23-14-06-53-722.png > > > We're looking into issue when even huge spark driver memory gets eventually > exhausted and GC makes driver stop responding. > Used jxray.com tool and found that most of driver heap is used by > > {noformat} > org.apache.spark.status.AppStatusStore > -> org.apache.spark.status.ElementTrackingStore > -> org.apache.spark.util.kvstore.InMemoryStore > > {noformat} > > Is there is a way to tune this particular spark driver's memory region down? 
> > > !image-2018-10-23-14-06-53-722.png!
[jira] [Updated] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
[ https://issues.apache.org/jira/browse/SPARK-25814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Dautkhanov updated SPARK-25814: -- Description: We're looking into issue when even huge spark driver memory gets eventually exhausted and GC makes driver stop responding. Used jxray.com tool and found that most of driver heap is used by {noformat} org.apache.spark.status.AppStatusStore -> org.apache.spark.status.ElementTrackingStore -> org.apache.spark.util.kvstore.InMemoryStore {noformat} Is there is a way to tune this particular spark driver's memory region down? !image-2018-10-23-14-03-12-258.png! !image-2018-10-23-14-06-53-722.png! was: We're looking into issue when even huge spark driver memory gets eventually exhausted and GC makes driver stop responding. Used jxray.com tool and found that most of driver heap is used by {noformat} org.apache.spark.status.AppStatusStore -> org.apache.spark.status.ElementTrackingStore -> org.apache.spark.util.kvstore.InMemoryStore {noformat} Is there is a way to tune this particular spark driver's memory region down? !image-2018-10-23-14-03-12-258.png! > spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore > -- > > Key: SPARK-25814 > URL: https://issues.apache.org/jira/browse/SPARK-25814 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.2 >Reporter: Ruslan Dautkhanov >Priority: Critical > Labels: driver, memory-analysis, memory-leak, statestore > Attachments: image-2018-10-23-14-06-53-722.png > > > We're looking into issue when even huge spark driver memory gets eventually > exhausted and GC makes driver stop responding. > Used jxray.com tool and found that most of driver heap is used by > > {noformat} > org.apache.spark.status.AppStatusStore > -> org.apache.spark.status.ElementTrackingStore > -> org.apache.spark.util.kvstore.InMemoryStore > > {noformat} > > > Is there is a way to tune this particular spark driver's memory region down? 
> > !image-2018-10-23-14-03-12-258.png! > > !image-2018-10-23-14-06-53-722.png!
[jira] [Updated] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
[ https://issues.apache.org/jira/browse/SPARK-25814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Dautkhanov updated SPARK-25814: -- Attachment: image-2018-10-23-14-06-53-722.png > spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore > -- > > Key: SPARK-25814 > URL: https://issues.apache.org/jira/browse/SPARK-25814 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.2 >Reporter: Ruslan Dautkhanov >Priority: Critical > Labels: driver, memory-analysis, memory-leak, statestore > Attachments: image-2018-10-23-14-06-53-722.png > > > We're looking into issue when even huge spark driver memory gets eventually > exhausted and GC makes driver stop responding. > Used jxray.com tool and found that most of driver heap is used by > > {noformat} > org.apache.spark.status.AppStatusStore > -> org.apache.spark.status.ElementTrackingStore > -> org.apache.spark.util.kvstore.InMemoryStore > > {noformat} > > > Is there is a way to tune this particular spark driver's memory region down? > > !image-2018-10-23-14-03-12-258.png!
[jira] [Created] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
Ruslan Dautkhanov created SPARK-25814: - Summary: spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore Key: SPARK-25814 URL: https://issues.apache.org/jira/browse/SPARK-25814 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.2, 2.2.2 Reporter: Ruslan Dautkhanov Attachments: image-2018-10-23-14-06-53-722.png We're looking into issue when even huge spark driver memory gets eventually exhausted and GC makes driver stop responding. Used jxray.com tool and found that most of driver heap is used by {noformat} org.apache.spark.status.AppStatusStore -> org.apache.spark.status.ElementTrackingStore -> org.apache.spark.util.kvstore.InMemoryStore {noformat} Is there is a way to tune this particular spark driver's memory region down? !image-2018-10-23-14-03-12-258.png!
[jira] [Commented] (SPARK-25813) Unit Test "pageNavigation" for test suite PagedTableSuite.scala is failing for Apache Spark master branch
[ https://issues.apache.org/jira/browse/SPARK-25813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661172#comment-16661172 ] Parth Gandhi commented on SPARK-25813: -- Duplicate JIRA, refer https://issues.apache.org/jira/browse/SPARK-25812. Closing this JIRA. > Unit Test "pageNavigation" for test suite PagedTableSuite.scala is failing > for Apache Spark master branch > - > > Key: SPARK-25813 > URL: https://issues.apache.org/jira/browse/SPARK-25813 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Parth Gandhi >Priority: Major > > The PR [https://github.com/apache/spark/pull/22668] which was merged a few > days back is breaking one unit test for Apache Spark master branch. This > needs to be fixed.
[jira] [Resolved] (SPARK-25813) Unit Test "pageNavigation" for test suite PagedTableSuite.scala is failing for Apache Spark master branch
[ https://issues.apache.org/jira/browse/SPARK-25813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Parth Gandhi resolved SPARK-25813. -- Resolution: Duplicate > Unit Test "pageNavigation" for test suite PagedTableSuite.scala is failing > for Apache Spark master branch > - > > Key: SPARK-25813 > URL: https://issues.apache.org/jira/browse/SPARK-25813 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Parth Gandhi >Priority: Major > > The PR [https://github.com/apache/spark/pull/22668] which was merged a few > days back is breaking one unit test for Apache Spark master branch. This > needs to be fixed.
[jira] [Resolved] (SPARK-25656) Add an example section about how to use Parquet/ORC library options
[ https://issues.apache.org/jira/browse/SPARK-25656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25656. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22801 [https://github.com/apache/spark/pull/22801] > Add an example section about how to use Parquet/ORC library options > --- > > Key: SPARK-25656 > URL: https://issues.apache.org/jira/browse/SPARK-25656 > Project: Spark > Issue Type: Documentation > Components: Documentation, Examples, SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > Our current doc does not explain we are passing the data source specific > options to the underlying data source: > - > https://spark.apache.org/docs/latest/sql-programming-guide.html#manually-specifying-options > We can add some introduction section for both Parquet/ORC examples there. We > had better give both read/write side configuration examples, too. One example > candidate is `dictionary encoding`: `parquet.enable.dictionary` and > `orc.dictionary.key.threshold` et al.
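A sketch of the kind of example the issue asks for, using the two option keys it names. The class name and output paths are hypothetical, and this assumes a local Spark build on the classpath; option keys that Spark's datasource does not itself consume are passed through to the underlying Parquet/ORC library:

```java
import org.apache.spark.sql.SparkSession;

public class DatasourceOptionsExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[1]")
            .appName("datasource-options")
            .getOrCreate();

        // Unconsumed option keys are forwarded to the Parquet writer.
        spark.range(100).write()
            .option("parquet.enable.dictionary", "false")   // Parquet library option
            .parquet("/tmp/parquet_no_dictionary");

        // Likewise for the ORC writer.
        spark.range(100).write()
            .option("orc.dictionary.key.threshold", "0.5")  // ORC library option
            .orc("/tmp/orc_dictionary_threshold");

        spark.stop();
    }
}
```

The same pass-through applies on the read side via `DataFrameReader.option`, which is why the issue suggests documenting both directions.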
[jira] [Assigned] (SPARK-25656) Add an example section about how to use Parquet/ORC library options
[ https://issues.apache.org/jira/browse/SPARK-25656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25656: - Assignee: Dongjoon Hyun > Add an example section about how to use Parquet/ORC library options > --- > > Key: SPARK-25656 > URL: https://issues.apache.org/jira/browse/SPARK-25656 > Project: Spark > Issue Type: Documentation > Components: Documentation, Examples, SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > Our current doc does not explain we are passing the data source specific > options to the underlying data source: > - > https://spark.apache.org/docs/latest/sql-programming-guide.html#manually-specifying-options > We can add some introduction section for both Parquet/ORC examples there. We > had better give both read/write side configuration examples, too. One example > candidate is `dictionary encoding`: `parquet.enable.dictionary` and > `orc.dictionary.key.threshold` et al.
[jira] [Created] (SPARK-25813) Unit Test "pageNavigation" for test suite PagedTableSuite.scala is failing for Apache Spark master branch
Parth Gandhi created SPARK-25813: Summary: Unit Test "pageNavigation" for test suite PagedTableSuite.scala is failing for Apache Spark master branch Key: SPARK-25813 URL: https://issues.apache.org/jira/browse/SPARK-25813 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Parth Gandhi The PR [https://github.com/apache/spark/pull/22668] which was merged a few days back is breaking one unit test for Apache Spark master branch. This needs to be fixed.
[jira] [Resolved] (SPARK-25812) Flaky test: PagedTableSuite.pageNavigation
[ https://issues.apache.org/jira/browse/SPARK-25812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25812. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22808 [https://github.com/apache/spark/pull/22808] > Flaky test: PagedTableSuite.pageNavigation > -- > > Key: SPARK-25812 > URL: https://issues.apache.org/jira/browse/SPARK-25812 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5074/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5073/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5072/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5070/testReport/ > - > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97878/testReport/org.apache.spark.ui/PagedTableSuite/pageNavigation/] > {code:java} > [info] PagedTableSuite: > [info] - pageNavigation *** FAILED *** (2 milliseconds) > [info] > [info] > [info]class="form-inline pull-right" style="margin-bottom: 0px;"> > [info] > [info] > [info] 1 Pages. Jump to > [info] value="1" class="span1"/> > [info] > [info] . Show > [info] value="10" class="span1"/> > [info] items in a page. 
> [info] > [info] Go > [info] > [info] > [info] > [info] Page: > [info] > [info] > [info] > [info] 1 > [info] > [info] > [info] > [info] > [info]did not equal List() (PagedTableSuite.scala:76) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > [info] at > org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:76) > [info] at > org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:52) > {code}
[jira] [Assigned] (SPARK-25812) Flaky test: PagedTableSuite.pageNavigation
[ https://issues.apache.org/jira/browse/SPARK-25812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25812: - Assignee: Gengliang Wang > Flaky test: PagedTableSuite.pageNavigation > -- > > Key: SPARK-25812 > URL: https://issues.apache.org/jira/browse/SPARK-25812 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Gengliang Wang >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5074/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5073/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5072/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5070/testReport/ > - > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97878/testReport/org.apache.spark.ui/PagedTableSuite/pageNavigation/] > {code:java} > [info] PagedTableSuite: > [info] - pageNavigation *** FAILED *** (2 milliseconds) > [info] > [info] > [info]class="form-inline pull-right" style="margin-bottom: 0px;"> > [info] > [info] > [info] 1 Pages. Jump to > [info] value="1" class="span1"/> > [info] > [info] . Show > [info] value="10" class="span1"/> > [info] items in a page. 
> [info] > [info] Go > [info] > [info] > [info] > [info] Page: > [info] > [info] > [info] > [info] 1 > [info] > [info] > [info] > [info] > [info]did not equal List() (PagedTableSuite.scala:76) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > [info] at > org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:76) > [info] at > org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:52) > {code}
[jira] [Updated] (SPARK-25793) Loading model bug in BisectingKMeans
[ https://issues.apache.org/jira/browse/SPARK-25793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25793: -- Target Version/s: 2.4.1, 3.0.0 (was: 2.4.1, 2.5.0) > Loading model bug in BisectingKMeans > > > Key: SPARK-25793 > URL: https://issues.apache.org/jira/browse/SPARK-25793 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.4.0 >Reporter: Weichen Xu >Priority: Major > > See this line: > [https://github.com/apache/spark/blob/fc64e83f9538d6b7e13359a4933a454ba7ed89ec/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L129] > > This also affects `ml.clustering.BisectingKMeansModel`
[jira] [Assigned] (SPARK-25812) Flaky test: PagedTableSuite.pageNavigation
[ https://issues.apache.org/jira/browse/SPARK-25812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25812: Assignee: Apache Spark > Flaky test: PagedTableSuite.pageNavigation > -- > > Key: SPARK-25812 > URL: https://issues.apache.org/jira/browse/SPARK-25812 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5074/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5073/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5072/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5070/testReport/ > - > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97878/testReport/org.apache.spark.ui/PagedTableSuite/pageNavigation/] > {code:java} > [info] PagedTableSuite: > [info] - pageNavigation *** FAILED *** (2 milliseconds) > [info] > [info] > [info]class="form-inline pull-right" style="margin-bottom: 0px;"> > [info] > [info] > [info] 1 Pages. Jump to > [info] value="1" class="span1"/> > [info] > [info] . Show > [info] value="10" class="span1"/> > [info] items in a page. 
> [info] > [info] Go > [info] > [info] > [info] > [info] Page: > [info] > [info] > [info] > [info] 1 > [info] > [info] > [info] > [info] > [info]did not equal List() (PagedTableSuite.scala:76) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > [info] at > org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:76) > [info] at > org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:52) > {code}
[jira] [Commented] (SPARK-25812) Flaky test: PagedTableSuite.pageNavigation
[ https://issues.apache.org/jira/browse/SPARK-25812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660959#comment-16660959 ] Apache Spark commented on SPARK-25812: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/22808 > Flaky test: PagedTableSuite.pageNavigation > -- > > Key: SPARK-25812 > URL: https://issues.apache.org/jira/browse/SPARK-25812 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5074/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5073/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5072/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5070/testReport/ > - > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97878/testReport/org.apache.spark.ui/PagedTableSuite/pageNavigation/] > {code:java} > [info] PagedTableSuite: > [info] - pageNavigation *** FAILED *** (2 milliseconds) > [info] > [info] > [info]class="form-inline pull-right" style="margin-bottom: 0px;"> > [info] > [info] > [info] 1 Pages. Jump to > [info] value="1" class="span1"/> > [info] > [info] . Show > [info] value="10" class="span1"/> > [info] items in a page. 
> [info] > [info] Go > [info] > [info] > [info] > [info] Page: > [info] > [info] > [info] > [info] 1 > [info] > [info] > [info] > [info] > [info]did not equal List() (PagedTableSuite.scala:76) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > [info] at > org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:76) > [info] at > org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:52) > {code}
[jira] [Assigned] (SPARK-25812) Flaky test: PagedTableSuite.pageNavigation
[ https://issues.apache.org/jira/browse/SPARK-25812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25812: Assignee: (was: Apache Spark) > Flaky test: PagedTableSuite.pageNavigation > -- > > Key: SPARK-25812 > URL: https://issues.apache.org/jira/browse/SPARK-25812 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5074/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5073/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5072/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5070/testReport/ > - > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97878/testReport/org.apache.spark.ui/PagedTableSuite/pageNavigation/] > {code:java} > [info] PagedTableSuite: > [info] - pageNavigation *** FAILED *** (2 milliseconds) > [info] > [info] > [info]class="form-inline pull-right" style="margin-bottom: 0px;"> > [info] > [info] > [info] 1 Pages. Jump to > [info] value="1" class="span1"/> > [info] > [info] . Show > [info] value="10" class="span1"/> > [info] items in a page. 
> [info] > [info] Go > [info] > [info] > [info] > [info] Page: > [info] > [info] > [info] > [info] 1 > [info] > [info] > [info] > [info] > [info]did not equal List() (PagedTableSuite.scala:76) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > [info] at > org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:76) > [info] at > org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:52) > {code}
[jira] [Commented] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates
[ https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660955#comment-16660955 ] Apache Spark commented on SPARK-19851: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/22809 > Add support for EVERY and ANY (SOME) aggregates > --- > > Key: SPARK-19851 > URL: https://issues.apache.org/jira/browse/SPARK-19851 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Michael Styles >Priority: Major > > Add support for EVERY and ANY (SOME) aggregates. > - EVERY returns true if all input values are true. > - ANY returns true if at least one input value is true. > - SOME is equivalent to ANY. > Both aggregates are part of the SQL standard. 
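The semantics described in the issue can be sketched with plain Java streams: EVERY corresponds to `allMatch` and ANY/SOME to `anyMatch` (the class and method names below are illustrative, not Spark API):

```java
import java.util.List;

public class EveryAnyDemo {
    // EVERY: true iff all input values are true.
    static boolean every(List<Boolean> values) {
        return values.stream().allMatch(Boolean::booleanValue);
    }

    // ANY (alias SOME): true iff at least one input value is true.
    static boolean any(List<Boolean> values) {
        return values.stream().anyMatch(Boolean::booleanValue);
    }

    public static void main(String[] args) {
        List<Boolean> flags = List.of(true, true, false);
        System.out.println(every(flags)); // false: one input is false
        System.out.println(any(flags));   // true: at least one input is true
    }
}
```

Note this sketch ignores SQL's NULL handling, which a real aggregate implementation would have to address.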
[jira] [Updated] (SPARK-25793) Loading model bug in BisectingKMeans
[ https://issues.apache.org/jira/browse/SPARK-25793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-25793: -- Target Version/s: 2.4.1, 2.5.0 > Loading model bug in BisectingKMeans > > > Key: SPARK-25793 > URL: https://issues.apache.org/jira/browse/SPARK-25793 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.4.0 >Reporter: Weichen Xu >Priority: Major > > See this line: > [https://github.com/apache/spark/blob/fc64e83f9538d6b7e13359a4933a454ba7ed89ec/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L129] > > This also affects `ml.clustering.BisectingKMeansModel`
[jira] [Updated] (SPARK-25812) Flaky test: PagedTableSuite.pageNavigation
[ https://issues.apache.org/jira/browse/SPARK-25812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25812: -- Description:
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5074/testReport/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5073/testReport/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5072/testReport/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5070/testReport/
- [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97878/testReport/org.apache.spark.ui/PagedTableSuite/pageNavigation/]
{code:java}
[info] PagedTableSuite:
[info] - pageNavigation *** FAILED *** (2 milliseconds)
[info] (rendered pagination panel: "1 Pages. Jump to ... Show ... items in a page. Go ... Page: 1" -- the HTML markup was stripped by the mail archive)
[info] did not equal List() (PagedTableSuite.scala:76)
[info] org.scalatest.exceptions.TestFailedException:
[info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
[info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
[info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
[info] at org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:76)
[info] at org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:52)
{code}
was: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97878/testReport/org.apache.spark.ui/PagedTableSuite/pageNavigation/
{code}
(same failure output as above; the HTML markup was stripped by the mail archive)
{code}
> Flaky test: PagedTableSuite.pageNavigation
> --
>
> Key: SPARK-25812
> URL: https://issues.apache.org/jira/browse/SPARK-25812
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
>
> (quoted description: the Jenkins links and failure output above; the quoted copy was truncated by the mail archive)
[jira] [Created] (SPARK-25812) Flaky test: PagedTableSuite.pageNavigation
Dongjoon Hyun created SPARK-25812: - Summary: Flaky test: PagedTableSuite.pageNavigation Key: SPARK-25812 URL: https://issues.apache.org/jira/browse/SPARK-25812 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Dongjoon Hyun https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97878/testReport/org.apache.spark.ui/PagedTableSuite/pageNavigation/
{code}
[info] PagedTableSuite:
[info] - pageNavigation *** FAILED *** (2 milliseconds)
[info] (rendered pagination panel: "1 Pages. Jump to ... Show ... items in a page. Go ... Page: 1" -- the HTML markup was stripped by the mail archive)
[info] did not equal List() (PagedTableSuite.scala:76)
[info] org.scalatest.exceptions.TestFailedException:
[info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
[info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
[info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
[info] at org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:76)
[info] at org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:52)
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25675) [Spark Job History] Job UI page does not show pagination with one page
[ https://issues.apache.org/jira/browse/SPARK-25675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660918#comment-16660918 ] Apache Spark commented on SPARK-25675: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/22808 > [Spark Job History] Job UI page does not show pagination with one page > -- > > Key: SPARK-25675 > URL: https://issues.apache.org/jira/browse/SPARK-25675 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: ABHISHEK KUMAR GUPTA >Assignee: Shivu Sondur >Priority: Major > Fix For: 3.0.0 > > > 1. Set spark.ui.retainedJobs= 1 in the spark-default conf of the Spark Job History server > 2. Restart the Job History server > 3. Submit Beeline jobs for 1 > 4. Launch the Job History UI page > 5. Select the running JDBC application ID from the Incomplete Applications page > 6. Launch the Job page > 7. The pagination panel displays based on the page size, as below > > > Completed Jobs XXX > Page: 1 2 3 ... XX Page: Jump to 1 show 100 items in a > page > > - > 8. Change the value in "Jump to 1 show *XXX* items in a page" so that all > completed jobs are displayed in a single page > *Actual Result:* > All completed jobs are displayed in a page, but there is no pagination panel > through which the user can modify the number of jobs shown per page. > *Expected Result:* > It should display the pagination panel, as below > >>> > Page: 1 1 Page: > Jump to 1 show *XXX* items in a page > > Pagination with page size *1* because it displays the total number of > completed jobs in a single page. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
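A minimal sketch of the page-count arithmetic at issue (a hypothetical helper, not Spark's actual `PagedTable` code): with a page size large enough to hold every job there is exactly one page, and per this report the panel should still render in that case.

```java
// Hypothetical page-count helper, not Spark's PagedTable code: even when all
// items fit on a single page the count is 1, so a pagination panel can (and,
// per this issue, should) still be rendered.
public class PageMath {
    static int pageCount(int totalItems, int pageSize) {
        if (pageSize <= 0) throw new IllegalArgumentException("pageSize must be positive");
        return Math.max(1, (totalItems + pageSize - 1) / pageSize); // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(pageCount(250, 100)); // 3 pages
        System.out.println(pageCount(250, 250)); // 1 page: panel should still show
    }
}
```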
[jira] [Commented] (SPARK-25675) [Spark Job History] Job UI page does not show pagination with one page
[ https://issues.apache.org/jira/browse/SPARK-25675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660914#comment-16660914 ] Apache Spark commented on SPARK-25675: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/22808 > [Spark Job History] Job UI page does not show pagination with one page > -- > > Key: SPARK-25675 > URL: https://issues.apache.org/jira/browse/SPARK-25675 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: ABHISHEK KUMAR GUPTA >Assignee: Shivu Sondur >Priority: Major > Fix For: 3.0.0 > > > 1. Set spark.ui.retainedJobs= 1 in the spark-default conf of the Spark Job History server > 2. Restart the Job History server > 3. Submit Beeline jobs for 1 > 4. Launch the Job History UI page > 5. Select the running JDBC application ID from the Incomplete Applications page > 6. Launch the Job page > 7. The pagination panel displays based on the page size, as below > > > Completed Jobs XXX > Page: 1 2 3 ... XX Page: Jump to 1 show 100 items in a > page > > - > 8. Change the value in "Jump to 1 show *XXX* items in a page" so that all > completed jobs are displayed in a single page > *Actual Result:* > All completed jobs are displayed in a page, but there is no pagination panel > through which the user can modify the number of jobs shown per page. > *Expected Result:* > It should display the pagination panel, as below > >>> > Page: 1 1 Page: > Jump to 1 show *XXX* items in a page > > Pagination with page size *1* because it displays the total number of > completed jobs in a single page. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25811) Support PyArrow's feature to raise an error for unsafe cast
[ https://issues.apache.org/jira/browse/SPARK-25811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660807#comment-16660807 ] Apache Spark commented on SPARK-25811: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/22807 > Support PyArrow's feature to raise an error for unsafe cast > --- > > Key: SPARK-25811 > URL: https://issues.apache.org/jira/browse/SPARK-25811 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > Since 0.11.0, PyArrow supports raising an error for unsafe casts. We should > use it to raise a proper error for pandas UDF users when such a cast is > detected. > We can also add a config to control this behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25811) Support PyArrow's feature to raise an error for unsafe cast
[ https://issues.apache.org/jira/browse/SPARK-25811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660803#comment-16660803 ] Apache Spark commented on SPARK-25811: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/22807 > Support PyArrow's feature to raise an error for unsafe cast > --- > > Key: SPARK-25811 > URL: https://issues.apache.org/jira/browse/SPARK-25811 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > Since 0.11.0, PyArrow supports raising an error for unsafe casts. We should > use it to raise a proper error for pandas UDF users when such a cast is > detected. > We can also add a config to control this behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25811) Support PyArrow's feature to raise an error for unsafe cast
[ https://issues.apache.org/jira/browse/SPARK-25811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25811: Assignee: (was: Apache Spark) > Support PyArrow's feature to raise an error for unsafe cast > --- > > Key: SPARK-25811 > URL: https://issues.apache.org/jira/browse/SPARK-25811 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > Since 0.11.0, PyArrow supports raising an error for unsafe casts. We should > use it to raise a proper error for pandas UDF users when such a cast is > detected. > We can also add a config to control this behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25811) Support PyArrow's feature to raise an error for unsafe cast
[ https://issues.apache.org/jira/browse/SPARK-25811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25811: Assignee: Apache Spark > Support PyArrow's feature to raise an error for unsafe cast > --- > > Key: SPARK-25811 > URL: https://issues.apache.org/jira/browse/SPARK-25811 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark >Priority: Major > > Since 0.11.0, PyArrow supports raising an error for unsafe casts. We should > use it to raise a proper error for pandas UDF users when such a cast is > detected. > We can also add a config to control this behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25811) Support PyArrow's feature to raise an error for unsafe cast
Liang-Chi Hsieh created SPARK-25811: --- Summary: Support PyArrow's feature to raise an error for unsafe cast Key: SPARK-25811 URL: https://issues.apache.org/jira/browse/SPARK-25811 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh Since 0.11.0, PyArrow supports raising an error for unsafe casts. We should use it to raise a proper error for pandas UDF users when such a cast is detected. We can also add a config to control this behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
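The safe-cast idea can be sketched in plain Java (an analogy only, not PyArrow's actual API or implementation): a lossy conversion is detected by round-tripping the truncated value.

```java
// Analogy for a "safe cast" check, in plain Java (not PyArrow's actual API):
// a double-to-long cast is "unsafe" when it would lose data, which we detect
// by round-tripping the truncated value back to double.
public class SafeCast {
    static long castDoubleToLong(double v, boolean safe) {
        long truncated = (long) v;
        if (safe && (double) truncated != v) {
            throw new ArithmeticException("unsafe cast: " + v + " would lose data");
        }
        return truncated;
    }

    public static void main(String[] args) {
        System.out.println(castDoubleToLong(2.0, true));  // 2: lossless, allowed
        System.out.println(castDoubleToLong(2.5, false)); // 2: silently truncated
        try {
            castDoubleToLong(2.5, true);                  // raises instead of truncating
        } catch (ArithmeticException e) {
            System.out.println(e.getMessage());
        }
    }
}
```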
[jira] [Commented] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple ti
[ https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660746#comment-16660746 ] Apache Spark commented on SPARK-25250: -- User 'pgandhi999' has created a pull request for this issue: https://github.com/apache/spark/pull/22806 > Race condition with tasks running when new attempt for same stage is created > leads to other task in the next attempt running on the same partition id > retry multiple times > -- > > Key: SPARK-25250 > URL: https://issues.apache.org/jira/browse/SPARK-25250 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.1 >Reporter: Parth Gandhi >Priority: Major > > We recently hit a race condition where a task from a previous stage attempt finished just before a new attempt for the same stage was created due to a fetch failure, so the new task created in the second attempt for the same partition id retried multiple times with a TaskCommitDenied exception, without realizing that the task in the earlier attempt had already succeeded. > For example, consider a task with partition id 9000 and index 9000 running in stage 4.0. We see a fetch failure and thus spawn a new stage attempt 4.1. Within this timespan, the above task completes successfully, marking partition id 9000 as complete for 4.0. However, as stage 4.1 has not yet been created, the task set info for that stage is not available to the TaskScheduler, so partition id 9000 has not been marked completed for 4.1. Stage 4.1 now spawns a task with index 2000 on the same partition id 9000. This task fails with CommitDeniedException and, since it does not see the corresponding partition id marked successful, it keeps retrying until the job finally succeeds. It doesn't cause any job failures because the DAG scheduler tracks the partitions separately from the task set managers. > > Steps to Reproduce: > # Run any large job involving a shuffle operation. > # When the ShuffleMap stage finishes and the ResultStage begins running, cause this stage to throw a fetch failure exception (try deleting certain shuffle files on any host). > # Observe the task attempt numbers for the next stage attempt. Note that this issue is intermittent, so it might not happen every time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
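The race described above can be reduced to a toy model (hypothetical classes, not Spark's actual DAGScheduler/TaskSetManager code): success is tracked per stage attempt, so an attempt created after the straggler finished doesn't know the partition is already done.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the race, with hypothetical classes (not Spark's actual
// DAGScheduler / TaskSetManager code). The scheduler tracks finished
// partitions globally, but each stage attempt tracks success on its own,
// so attempt 1 re-runs a partition that attempt 0 already finished.
public class StageAttemptRace {
    static Set<Integer> finishedPartitions = new HashSet<>(); // "DAG scheduler" view

    static class Attempt {
        Set<Integer> successfulPartitions = new HashSet<>();  // per-attempt view
        boolean needsToRun(int partition) {
            return !successfulPartitions.contains(partition);
        }
    }

    public static void main(String[] args) {
        Attempt attempt0 = new Attempt();
        // The straggler task for partition 9000 finishes in attempt 0...
        attempt0.successfulPartitions.add(9000);
        finishedPartitions.add(9000);

        // ...but attempt 1 is created afterwards without that knowledge.
        Attempt attempt1 = new Attempt();
        System.out.println(attempt1.needsToRun(9000));         // true: needless retries
        System.out.println(finishedPartitions.contains(9000)); // true: job still succeeds
    }
}
```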
[jira] [Assigned] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple tim
[ https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25250: Assignee: Apache Spark > Race condition with tasks running when new attempt for same stage is created > leads to other task in the next attempt running on the same partition id > retry multiple times > -- > > Key: SPARK-25250 > URL: https://issues.apache.org/jira/browse/SPARK-25250 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.1 >Reporter: Parth Gandhi >Assignee: Apache Spark >Priority: Major > > We recently hit a race condition where a task from a previous stage attempt finished just before a new attempt for the same stage was created due to a fetch failure, so the new task created in the second attempt for the same partition id retried multiple times with a TaskCommitDenied exception, without realizing that the task in the earlier attempt had already succeeded. > For example, consider a task with partition id 9000 and index 9000 running in stage 4.0. We see a fetch failure and thus spawn a new stage attempt 4.1. Within this timespan, the above task completes successfully, marking partition id 9000 as complete for 4.0. However, as stage 4.1 has not yet been created, the task set info for that stage is not available to the TaskScheduler, so partition id 9000 has not been marked completed for 4.1. Stage 4.1 now spawns a task with index 2000 on the same partition id 9000. This task fails with CommitDeniedException and, since it does not see the corresponding partition id marked successful, it keeps retrying until the job finally succeeds. It doesn't cause any job failures because the DAG scheduler tracks the partitions separately from the task set managers. > > Steps to Reproduce: > # Run any large job involving a shuffle operation. > # When the ShuffleMap stage finishes and the ResultStage begins running, cause this stage to throw a fetch failure exception (try deleting certain shuffle files on any host). > # Observe the task attempt numbers for the next stage attempt. Note that this issue is intermittent, so it might not happen every time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple ti
[ https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660744#comment-16660744 ] Apache Spark commented on SPARK-25250: -- User 'pgandhi999' has created a pull request for this issue: https://github.com/apache/spark/pull/22806 > Race condition with tasks running when new attempt for same stage is created > leads to other task in the next attempt running on the same partition id > retry multiple times > -- > > Key: SPARK-25250 > URL: https://issues.apache.org/jira/browse/SPARK-25250 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.1 >Reporter: Parth Gandhi >Priority: Major > > We recently hit a race condition where a task from a previous stage attempt finished just before a new attempt for the same stage was created due to a fetch failure, so the new task created in the second attempt for the same partition id retried multiple times with a TaskCommitDenied exception, without realizing that the task in the earlier attempt had already succeeded. > For example, consider a task with partition id 9000 and index 9000 running in stage 4.0. We see a fetch failure and thus spawn a new stage attempt 4.1. Within this timespan, the above task completes successfully, marking partition id 9000 as complete for 4.0. However, as stage 4.1 has not yet been created, the task set info for that stage is not available to the TaskScheduler, so partition id 9000 has not been marked completed for 4.1. Stage 4.1 now spawns a task with index 2000 on the same partition id 9000. This task fails with CommitDeniedException and, since it does not see the corresponding partition id marked successful, it keeps retrying until the job finally succeeds. It doesn't cause any job failures because the DAG scheduler tracks the partitions separately from the task set managers. > > Steps to Reproduce: > # Run any large job involving a shuffle operation. > # When the ShuffleMap stage finishes and the ResultStage begins running, cause this stage to throw a fetch failure exception (try deleting certain shuffle files on any host). > # Observe the task attempt numbers for the next stage attempt. Note that this issue is intermittent, so it might not happen every time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple tim
[ https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25250: Assignee: (was: Apache Spark) > Race condition with tasks running when new attempt for same stage is created > leads to other task in the next attempt running on the same partition id > retry multiple times > -- > > Key: SPARK-25250 > URL: https://issues.apache.org/jira/browse/SPARK-25250 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.1 >Reporter: Parth Gandhi >Priority: Major > > We recently hit a race condition where a task from a previous stage attempt finished just before a new attempt for the same stage was created due to a fetch failure, so the new task created in the second attempt for the same partition id retried multiple times with a TaskCommitDenied exception, without realizing that the task in the earlier attempt had already succeeded. > For example, consider a task with partition id 9000 and index 9000 running in stage 4.0. We see a fetch failure and thus spawn a new stage attempt 4.1. Within this timespan, the above task completes successfully, marking partition id 9000 as complete for 4.0. However, as stage 4.1 has not yet been created, the task set info for that stage is not available to the TaskScheduler, so partition id 9000 has not been marked completed for 4.1. Stage 4.1 now spawns a task with index 2000 on the same partition id 9000. This task fails with CommitDeniedException and, since it does not see the corresponding partition id marked successful, it keeps retrying until the job finally succeeds. It doesn't cause any job failures because the DAG scheduler tracks the partitions separately from the task set managers. > > Steps to Reproduce: > # Run any large job involving a shuffle operation. > # When the ShuffleMap stage finishes and the ResultStage begins running, cause this stage to throw a fetch failure exception (try deleting certain shuffle files on any host). > # Observe the task attempt numbers for the next stage attempt. Note that this issue is intermittent, so it might not happen every time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25810) Spark structured streaming logs auto.offset.reset=earliest even though startingOffsets is set to latest
ANUJA BANTHIYA created SPARK-25810: -- Summary: Spark structured streaming logs auto.offset.reset=earliest even though startingOffsets is set to latest Key: SPARK-25810 URL: https://issues.apache.org/jira/browse/SPARK-25810 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.3.1 Reporter: ANUJA BANTHIYA I have an issue when I'm trying to read data from Kafka using Spark Structured Streaming. Versions: spark-core_2.11 : 2.3.1, spark-sql_2.11 : 2.3.1, spark-sql-kafka-0-10_2.11 : 2.3.1, kafka-client : 0.11.0.0. The issue I am facing is that the Spark job always logs auto.offset.reset = earliest during application startup, even though the latest option is specified in the code. Code to reproduce:
{code:java}
package com.informatica.exec

import org.apache.spark.sql.SparkSession

object kafkaLatestOffset {
  def main(s: Array[String]) {
    val spark = SparkSession
      .builder()
      .appName("Spark Offset basic example")
      .master("local[*]")
      .getOrCreate()
    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topic1")
      .option("startingOffsets", "latest")
      .load()
    val query = df.writeStream
      .outputMode("complete")
      .format("console")
      .start()
    query.awaitTermination()
  }
}
{code}
As mentioned in the Structured Streaming doc, {{startingOffsets}} needs to be set instead of auto.offset.reset. [https://spark.apache.org/docs/2.3.1/structured-streaming-kafka-integration.html]
* *auto.offset.reset*: Set the source option {{startingOffsets}} to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than rely on the kafka Consumer to do it. This will ensure that no data is missed when new topics/partitions are dynamically subscribed. Note that {{startingOffsets}} only applies when a new streaming query is started, and that resuming will always pick up from where the query left off.
At runtime, Kafka messages are picked up from the latest offset, so functionally it works as expected. Only the log is misleading, as it reports auto.offset.reset = *earliest*. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
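A toy illustration of how such a log line can be misleading (hypothetical maps and values, not the actual Kafka source code path): a component may log an internally pinned consumer setting even though the effective start position is resolved from the user-facing option.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration (hypothetical, not the actual Kafka source code path):
// a component can log an internally pinned consumer setting even though the
// effective start position is resolved from the user-facing option.
public class OffsetConfigSketch {
    public static void main(String[] args) {
        Map<String, String> userOptions = new HashMap<>();
        userOptions.put("startingOffsets", "latest");        // what the user set

        Map<String, String> consumerParams = new HashMap<>();
        consumerParams.put("auto.offset.reset", "earliest"); // pinned internally (hypothetical)

        // The consumer logs its own config, which is what the reporter saw...
        System.out.println("auto.offset.reset = " + consumerParams.get("auto.offset.reset"));
        // ...while the engine resolves the actual start position separately:
        System.out.println("effective start = " + userOptions.get("startingOffsets"));
    }
}
```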
[jira] [Resolved] (SPARK-25791) Datatype of serializers in RowEncoder should be accessible
[ https://issues.apache.org/jira/browse/SPARK-25791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25791. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22785 [https://github.com/apache/spark/pull/22785] > Datatype of serializers in RowEncoder should be accessible > -- > > Key: SPARK-25791 > URL: https://issues.apache.org/jira/browse/SPARK-25791 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 3.0.0 > > > The serializers of {{RowEncoder}} use a few {{If}} Catalyst expressions, which > inherit {{ComplexTypeMergingExpression}} and therefore check input data types. > It is possible to generate serializers that fail this check, making the data > type of the serializers inaccessible. When producing an {{If}} expression, we > should use the same data type for its input expressions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25791) Datatype of serializers in RowEncoder should be accessible
[ https://issues.apache.org/jira/browse/SPARK-25791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-25791: --- Assignee: Liang-Chi Hsieh > Datatype of serializers in RowEncoder should be accessible > -- > > Key: SPARK-25791 > URL: https://issues.apache.org/jira/browse/SPARK-25791 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > > > The serializers of {{RowEncoder}} use a few {{If}} Catalyst expressions, which > inherit {{ComplexTypeMergingExpression}} and therefore check input data types. > It is possible to generate serializers that fail this check, making the data > type of the serializers inaccessible. When producing an {{If}} expression, we > should use the same data type for its input expressions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25809) Support additional K8S cluster types for integration tests
[ https://issues.apache.org/jira/browse/SPARK-25809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25809: Assignee: (was: Apache Spark) > Support additional K8S cluster types for integration tests > -- > > Key: SPARK-25809 > URL: https://issues.apache.org/jira/browse/SPARK-25809 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.2, 2.4.0 >Reporter: Rob Vesse >Priority: Major > > Currently the Spark on K8S integration tests are hardcoded to use a > {{minikube}} based backend. It would be nice if developers had more > flexibility in the choice of K8S cluster they wish to use for integration > testing. More specifically it would be useful to be able to use the built-in > Kubernetes support in recent Docker releases and to just use a generic K8S > cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25809) Support additional K8S cluster types for integration tests
[ https://issues.apache.org/jira/browse/SPARK-25809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25809: Assignee: Apache Spark > Support additional K8S cluster types for integration tests > -- > > Key: SPARK-25809 > URL: https://issues.apache.org/jira/browse/SPARK-25809 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.2, 2.4.0 >Reporter: Rob Vesse >Assignee: Apache Spark >Priority: Major > > Currently the Spark on K8S integration tests are hardcoded to use a > {{minikube}} based backend. It would be nice if developers had more > flexibility in the choice of K8S cluster they wish to use for integration > testing. More specifically it would be useful to be able to use the built-in > Kubernetes support in recent Docker releases and to just use a generic K8S > cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25809) Support additional K8S cluster types for integration tests
[ https://issues.apache.org/jira/browse/SPARK-25809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660641#comment-16660641 ] Apache Spark commented on SPARK-25809: -- User 'rvesse' has created a pull request for this issue: https://github.com/apache/spark/pull/22805 > Support additional K8S cluster types for integration tests > -- > > Key: SPARK-25809 > URL: https://issues.apache.org/jira/browse/SPARK-25809 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.2, 2.4.0 >Reporter: Rob Vesse >Priority: Major > > Currently the Spark on K8S integration tests are hardcoded to use a > {{minikube}} based backend. It would be nice if developers had more > flexibility in the choice of K8S cluster they wish to use for integration > testing. More specifically it would be useful to be able to use the built-in > Kubernetes support in recent Docker releases and to just use a generic K8S > cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25809) Support additional K8S cluster types for integration tests
Rob Vesse created SPARK-25809: - Summary: Support additional K8S cluster types for integration tests Key: SPARK-25809 URL: https://issues.apache.org/jira/browse/SPARK-25809 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.3.2, 2.4.0 Reporter: Rob Vesse Currently the Spark on K8S integration tests are hardcoded to use a {{minikube}} based backend. It would be nice if developers had more flexibility in the choice of K8S cluster they wish to use for integration testing. More specifically it would be useful to be able to use the built-in Kubernetes support in recent Docker releases and to just use a generic K8S cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25805) Flaky test: DataFrameSuite.SPARK-25159 unittest failure
[ https://issues.apache.org/jira/browse/SPARK-25805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25805. - Resolution: Fixed Fix Version/s: 3.0.0 2.4.1 Issue resolved by pull request 22799 [https://github.com/apache/spark/pull/22799] > Flaky test: DataFrameSuite.SPARK-25159 unittest failure > --- > > Key: SPARK-25805 > URL: https://issues.apache.org/jira/browse/SPARK-25805 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.4.1, 3.0.0 > > > I've seen this test fail on internal builds: > {noformat} > Error Message0 did not equal 1Stacktrace > org.scalatest.exceptions.TestFailedException: 0 did not equal 1 > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > at > org.apache.spark.sql.DataFrameSuite$$anonfun$99$$anonfun$apply$mcV$sp$219.apply(DataFrameSuite.scala:2552) > at > org.apache.spark.sql.DataFrameSuite$$anonfun$99$$anonfun$apply$mcV$sp$219.apply(DataFrameSuite.scala:2534) > at > org.apache.spark.sql.test.SQLTestUtilsBase$class.withTempPath(SQLTestUtils.scala:179) > at > org.apache.spark.sql.DataFrameSuite.withTempPath(DataFrameSuite.scala:46) > at > org.apache.spark.sql.DataFrameSuite$$anonfun$99.apply$mcV$sp(DataFrameSuite.scala:2534) > at > org.apache.spark.sql.DataFrameSuite$$anonfun$99.apply(DataFrameSuite.scala:2534) > at > org.apache.spark.sql.DataFrameSuite$$anonfun$99.apply(DataFrameSuite.scala:2534) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at 
org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) > at > org.apache.spark.sql.DataFrameSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(DataFrameSuite.scala:46) > at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:221) > at org.apache.spark.sql.DataFrameSuite.runTest(DataFrameSuite.scala:46) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:392) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > at org.scalatest.Suite$class.run(Suite.scala:1147) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at org.scalatest.SuperEngine.runImpl(Engine.scala:521) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233) > at > 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) > at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1210) > at > org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1257) > at >
[jira] [Assigned] (SPARK-25805) Flaky test: DataFrameSuite.SPARK-25159 unittest failure
[ https://issues.apache.org/jira/browse/SPARK-25805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-25805: --- Assignee: Imran Rashid > Flaky test: DataFrameSuite.SPARK-25159 unittest failure > --- > > Key: SPARK-25805 > URL: https://issues.apache.org/jira/browse/SPARK-25805 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.4.1, 3.0.0 > > > I've seen this test fail on internal builds: > {noformat} > Error Message0 did not equal 1Stacktrace > org.scalatest.exceptions.TestFailedException: 0 did not equal 1 > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > at > org.apache.spark.sql.DataFrameSuite$$anonfun$99$$anonfun$apply$mcV$sp$219.apply(DataFrameSuite.scala:2552) > at > org.apache.spark.sql.DataFrameSuite$$anonfun$99$$anonfun$apply$mcV$sp$219.apply(DataFrameSuite.scala:2534) > at > org.apache.spark.sql.test.SQLTestUtilsBase$class.withTempPath(SQLTestUtils.scala:179) > at > org.apache.spark.sql.DataFrameSuite.withTempPath(DataFrameSuite.scala:46) > at > org.apache.spark.sql.DataFrameSuite$$anonfun$99.apply$mcV$sp(DataFrameSuite.scala:2534) > at > org.apache.spark.sql.DataFrameSuite$$anonfun$99.apply(DataFrameSuite.scala:2534) > at > org.apache.spark.sql.DataFrameSuite$$anonfun$99.apply(DataFrameSuite.scala:2534) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103) > at > 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) > at > org.apache.spark.sql.DataFrameSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(DataFrameSuite.scala:46) > at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:221) > at org.apache.spark.sql.DataFrameSuite.runTest(DataFrameSuite.scala:46) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:392) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > at org.scalatest.Suite$class.run(Suite.scala:1147) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at org.scalatest.SuperEngine.runImpl(Engine.scala:521) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52) > at > 
org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) > at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1210) > at > org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1257) > at > org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1255) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) >
[jira] [Commented] (SPARK-25808) upgrade jsr305 version from 1.3.9 to 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-25808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660381#comment-16660381 ] Apache Spark commented on SPARK-25808: -- User 'daviddingly' has created a pull request for this issue: https://github.com/apache/spark/pull/22803 > upgrade jsr305 version from 1.3.9 to 3.0.0 > -- > > Key: SPARK-25808 > URL: https://issues.apache.org/jira/browse/SPARK-25808 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: ding xiaoyuan >Priority: Minor > > > we find below warnings when build spark project: > {noformat} > [warn] * com.google.code.findbugs:jsr305:3.0.0 is selected over 1.3.9 > [warn] +- org.apache.hadoop:hadoop-common:2.7.3 (depends on 3.0.0) > [warn] +- org.apache.spark:spark-core_2.11:3.0.0-SNAPSHOT (depends on 1.3.9) > [warn] +- org.apache.spark:spark-network-common_2.11:3.0.0-SNAPSHOT (depends > on 1.3.9) > [warn] +- org.apache.spark:spark-unsafe_2.11:3.0.0-SNAPSHOT (depends on > 1.3.9){noformat} > so ideally we need to upgrade jsr305 from 1.3.9 to 3.0.0 to fix this warning -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25808) upgrade jsr305 version from 1.3.9 to 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-25808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25808: Assignee: (was: Apache Spark) > upgrade jsr305 version from 1.3.9 to 3.0.0 > -- > > Key: SPARK-25808 > URL: https://issues.apache.org/jira/browse/SPARK-25808 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: ding xiaoyuan >Priority: Minor > > > we find below warnings when build spark project: > {noformat} > [warn] * com.google.code.findbugs:jsr305:3.0.0 is selected over 1.3.9 > [warn] +- org.apache.hadoop:hadoop-common:2.7.3 (depends on 3.0.0) > [warn] +- org.apache.spark:spark-core_2.11:3.0.0-SNAPSHOT (depends on 1.3.9) > [warn] +- org.apache.spark:spark-network-common_2.11:3.0.0-SNAPSHOT (depends > on 1.3.9) > [warn] +- org.apache.spark:spark-unsafe_2.11:3.0.0-SNAPSHOT (depends on > 1.3.9){noformat} > so ideally we need to upgrade jsr305 from 1.3.9 to 3.0.0 to fix this warning -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25808) upgrade jsr305 version from 1.3.9 to 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-25808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660380#comment-16660380 ] Apache Spark commented on SPARK-25808: -- User 'daviddingly' has created a pull request for this issue: https://github.com/apache/spark/pull/22803 > upgrade jsr305 version from 1.3.9 to 3.0.0 > -- > > Key: SPARK-25808 > URL: https://issues.apache.org/jira/browse/SPARK-25808 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: ding xiaoyuan >Priority: Minor > > > we find below warnings when build spark project: > {noformat} > [warn] * com.google.code.findbugs:jsr305:3.0.0 is selected over 1.3.9 > [warn] +- org.apache.hadoop:hadoop-common:2.7.3 (depends on 3.0.0) > [warn] +- org.apache.spark:spark-core_2.11:3.0.0-SNAPSHOT (depends on 1.3.9) > [warn] +- org.apache.spark:spark-network-common_2.11:3.0.0-SNAPSHOT (depends > on 1.3.9) > [warn] +- org.apache.spark:spark-unsafe_2.11:3.0.0-SNAPSHOT (depends on > 1.3.9){noformat} > so ideally we need to upgrade jsr305 from 1.3.9 to 3.0.0 to fix this warning -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25808) upgrade jsr305 version from 1.3.9 to 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-25808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25808: Assignee: Apache Spark > upgrade jsr305 version from 1.3.9 to 3.0.0 > -- > > Key: SPARK-25808 > URL: https://issues.apache.org/jira/browse/SPARK-25808 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: ding xiaoyuan >Assignee: Apache Spark >Priority: Minor > > > we find below warnings when build spark project: > {noformat} > [warn] * com.google.code.findbugs:jsr305:3.0.0 is selected over 1.3.9 > [warn] +- org.apache.hadoop:hadoop-common:2.7.3 (depends on 3.0.0) > [warn] +- org.apache.spark:spark-core_2.11:3.0.0-SNAPSHOT (depends on 1.3.9) > [warn] +- org.apache.spark:spark-network-common_2.11:3.0.0-SNAPSHOT (depends > on 1.3.9) > [warn] +- org.apache.spark:spark-unsafe_2.11:3.0.0-SNAPSHOT (depends on > 1.3.9){noformat} > so ideally we need to upgrade jsr305 from 1.3.9 to 3.0.0 to fix this warning -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25808) upgrade jsr305 version from 1.3.9 to 3.0.0
ding xiaoyuan created SPARK-25808: - Summary: upgrade jsr305 version from 1.3.9 to 3.0.0 Key: SPARK-25808 URL: https://issues.apache.org/jira/browse/SPARK-25808 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.0 Reporter: ding xiaoyuan we find below warnings when build spark project: {noformat} [warn] * com.google.code.findbugs:jsr305:3.0.0 is selected over 1.3.9 [warn] +- org.apache.hadoop:hadoop-common:2.7.3 (depends on 3.0.0) [warn] +- org.apache.spark:spark-core_2.11:3.0.0-SNAPSHOT (depends on 1.3.9) [warn] +- org.apache.spark:spark-network-common_2.11:3.0.0-SNAPSHOT (depends on 1.3.9) [warn] +- org.apache.spark:spark-unsafe_2.11:3.0.0-SNAPSHOT (depends on 1.3.9){noformat} so ideally we need to upgrade jsr305 from 1.3.9 to 3.0.0 to fix this warning -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25801) pandas_udf grouped_map fails with input dataframe with more than 255 columns
[ https://issues.apache.org/jira/browse/SPARK-25801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660360#comment-16660360 ] Frederik commented on SPARK-25801: -- Hi Bryan, Thanks for the quick answer! I wasn't aware Python 3.7 doesn't have the 255 arguments limitation. Unfortunately I can't use python 3.7 (I'm on a platform where I can't change PYSPARK_DRIVER_PYTHON from 3.6 and PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON need the same minor versions) nor upgrade Spark. Think I'll use an approach with standard udf's as for example outlined here: [https://florianwilhelm.info/2017/10/efficient_udfs_with_pyspark/] Unless there's other options? > pandas_udf grouped_map fails with input dataframe with more than 255 columns > > > Key: SPARK-25801 > URL: https://issues.apache.org/jira/browse/SPARK-25801 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: python 2.7 > pyspark 2.3.0 >Reporter: Frederik >Priority: Major > > Hi, > I'm using a pandas_udf to deploy a model to predict all samples in a spark > dataframe, > for this I use a udf as follows: > @pandas_udf("scores double", PandasUDFType.GROUPED_MAP) def > predict_scores(pdf): score_values = model.predict_proba(pdf)[:,1] return > pd.DataFrame({'scores': score_values}) > So it takes a dataframe and predicts the probability of being positive > according to an sklearn model for each row and returns this as single column. > This works great on a random groupBy, e.g.: > sdf_to_score.groupBy(sf.col('age')).apply(predict_scores) > as long as the dataframe has <255 columns. 
When the input dataframe has more > than 255 columns (thus features in my model), I get: > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line > 219, in main > func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, > eval_type) > File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line > 148, in read_udfs > mapper = eval(mapper_str, udfs) > File "", line 1 > SyntaxError: more than 255 arguments > Which seems to be related with Python's general limitation of having not > allowing more than 255 arguments for a function? > > Is this a bug or is there a straightforward way around this problem? > > Regards, > Frederik -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
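The "SyntaxError: more than 255 arguments" in the traceback above comes from CPython itself: before Python 3.7, a call expression could not name more than 255 explicit arguments, which is what the generated UDF mapper runs into when every column becomes its own argument. The standard way around an argument-count ceiling is to pass the row as one container instead of unpacking each column. A language-level sketch of that idea (no Spark required; `score_row` is a toy stand-in for the model call, not Spark's API):

```python
# Instead of f(col_0, col_1, ..., col_299) -- one argument per column,
# which CPython < 3.7 rejects with "SyntaxError: more than 255 arguments" --
# pass a single mapping that carries all columns.

def score_row(row):
    """Toy stand-in for model.predict_proba: fraction of non-zero features."""
    return sum(1 for v in row.values() if v) / len(row)

# 300 "columns", but only ONE argument in the call expression.
row = {f"col_{i}": i % 2 for i in range(300)}
assert score_row(row) == 0.5  # 150 of 300 features are non-zero
```

The same packing idea underlies the workarounds discussed in the thread: either upgrade to Python 3.7 (where the limit was removed) or restructure the UDF so the per-column fan-out never reaches a single Python call site.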
[jira] [Commented] (SPARK-25665) Refactor ObjectHashAggregateExecBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660291#comment-16660291 ] Apache Spark commented on SPARK-25665: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/22804 > Refactor ObjectHashAggregateExecBenchmark to use main method > > > Key: SPARK-25665 > URL: https://issues.apache.org/jira/browse/SPARK-25665 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25665) Refactor ObjectHashAggregateExecBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25665: Assignee: Apache Spark > Refactor ObjectHashAggregateExecBenchmark to use main method > > > Key: SPARK-25665 > URL: https://issues.apache.org/jira/browse/SPARK-25665 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25665) Refactor ObjectHashAggregateExecBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25665: Assignee: (was: Apache Spark) > Refactor ObjectHashAggregateExecBenchmark to use main method > > > Key: SPARK-25665 > URL: https://issues.apache.org/jira/browse/SPARK-25665 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25665) Refactor ObjectHashAggregateExecBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660290#comment-16660290 ] Apache Spark commented on SPARK-25665: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/22804 > Refactor ObjectHashAggregateExecBenchmark to use main method > > > Key: SPARK-25665 > URL: https://issues.apache.org/jira/browse/SPARK-25665 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25807) Mitigate 1-based substr() confusion
Oron Navon created SPARK-25807: -- Summary: Mitigate 1-based substr() confusion Key: SPARK-25807 URL: https://issues.apache.org/jira/browse/SPARK-25807 Project: Spark Issue Type: Improvement Components: Java API, PySpark Affects Versions: 2.3.2, 1.3.0, 2.4.0, 2.5.0, 3.0.0 Reporter: Oron Navon The method {{Column.substr()}} is 1-based, conforming with SQL and Hive's {{SUBSTRING}}, and contradicting Python's string slicing and Java's {{String.substring()}}, which are zero-based. Both PySpark users and Java API users often naturally expect a 0-based {{substr()}}. Adding to the confusion, {{substr()}} currently allows a {{startPos}} value of 0, which returns the same result as {{startPos==1}}. Since changing {{substr()}} to 0-based is probably NOT a reasonable option here, I suggest making one or more of the following changes: # Adding a method {{substr0}}, which would be zero-based # Renaming {{substr}} to {{substr1}} # Making the existing {{substr()}} throw an exception on {{startPos==0}}, which should catch and alert most users who expect zero-based behavior. This is my first discussion on this project, apologies for any faux pas. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
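The off-by-one trap described above can be shown without Spark at all. A minimal sketch, where `sql_substr` is a hypothetical helper written to mimic {{Column.substr()}} semantics (1-based start, with startPos 0 silently treated like 1), not Spark's actual implementation:

```python
def sql_substr(s: str, start_pos: int, length: int) -> str:
    """Mimic SQL/Hive SUBSTRING semantics: start_pos is 1-based.
    Spark's Column.substr() currently treats start_pos == 0 the same
    as start_pos == 1, which is the lenient behavior modeled here."""
    if start_pos == 0:
        start_pos = 1  # mirrors the reported lenient behavior
    return s[start_pos - 1 : start_pos - 1 + length]

s = "Spark"
assert sql_substr(s, 1, 3) == "Spa"   # SQL-style: position 1 is the first char
assert s[0:3] == "Spa"                # Python-native slicing is 0-based
# The trap: a user expecting 0-based indexing writes start_pos=0 and
# silently gets the same result instead of an error.
assert sql_substr(s, 0, 3) == "Spa"
```

Proposal 3 in the issue amounts to replacing the `start_pos == 0` branch with a raised exception, so 0-based callers fail loudly instead of silently succeeding.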
[jira] [Assigned] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25806: Assignee: Apache Spark > The instanceof FileSplit is redundant for ParquetFileFormat > > > Key: SPARK-25806 > URL: https://issues.apache.org/jira/browse/SPARK-25806 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: liuxian >Assignee: Apache Spark >Priority: Trivial > > The instance of FileSplit is redundant for > {color:#ffc66d}buildReaderWithPartitionValues {color} > {color:#33}in the {color}{color:#f79232}ParquetFileFormat{color} class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660185#comment-16660185 ] Apache Spark commented on SPARK-25806: -- User '10110346' has created a pull request for this issue: https://github.com/apache/spark/pull/22802 > The instanceof FileSplit is redundant for ParquetFileFormat > > > Key: SPARK-25806 > URL: https://issues.apache.org/jira/browse/SPARK-25806 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Trivial > > The instance of FileSplit is redundant for > {color:#ffc66d}buildReaderWithPartitionValues {color} > {color:#33}in the {color}{color:#f79232}ParquetFileFormat{color} class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25806: Assignee: (was: Apache Spark) > The instanceof FileSplit is redundant for ParquetFileFormat > > > Key: SPARK-25806 > URL: https://issues.apache.org/jira/browse/SPARK-25806 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Trivial > > The instance of FileSplit is redundant for > {color:#ffc66d}buildReaderWithPartitionValues {color} > {color:#33}in the {color}{color:#f79232}ParquetFileFormat{color} class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25806: Description: The instance of FileSplit is redundant for {color:#ffc66d}buildReaderWithPartitionValues {color} {color:#33}in the {color}{color:#f79232}ParquetFileFormat{color} class. was: The instanceof FileSplit is redundant for {color:#ffc66d}buildReaderWithPartitionValues {color} {color:#33}in the {color}{color:#f79232}ParquetFileFormat{color} class. > The instanceof FileSplit is redundant for ParquetFileFormat > > > Key: SPARK-25806 > URL: https://issues.apache.org/jira/browse/SPARK-25806 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Trivial > > The instance of FileSplit is redundant for > {color:#ffc66d}buildReaderWithPartitionValues {color} > {color:#33}in the {color}{color:#f79232}ParquetFileFormat{color} class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25806: Description: The instanceof FileSplit is redundant for {color:#ffc66d}buildReaderWithPartitionValues {color} {color:#33}in the {color}{color:#f79232}ParquetFileFormat{color} class. was: The instanceof FileSplit is redundant for {color:#ffc66d}buildReaderWithPartitionValues {color} {color:#33}{color:#ffc66d}in the {color}{color}ParquetFileFormat class. > The instanceof FileSplit is redundant for ParquetFileFormat > > > Key: SPARK-25806 > URL: https://issues.apache.org/jira/browse/SPARK-25806 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Trivial > > The instanceof FileSplit is redundant for > {color:#ffc66d}buildReaderWithPartitionValues {color} > {color:#33}in the {color}{color:#f79232}ParquetFileFormat{color} class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25806: Description: The instanceof FileSplit is redundant for {color:#ffc66d}buildReaderWithPartitionValues {color} {color:#33}{color:#ffc66d}in the {color}{color}ParquetFileFormat class. was:The instanceof FileSplit is redundant for {color:#ffc66d}buildReaderWithPartitionValues {color:#33}in the {color}{color}{color:#f79232}ParquetFileFormat{color} class. > The instanceof FileSplit is redundant for ParquetFileFormat > > > Key: SPARK-25806 > URL: https://issues.apache.org/jira/browse/SPARK-25806 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Trivial > > The instanceof FileSplit is redundant for > {color:#ffc66d}buildReaderWithPartitionValues {color} > {color:#33}{color:#ffc66d}in the {color}{color}ParquetFileFormat class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25806: Description: The instanceof FileSplit is redundant for {color:#ffc66d}buildReaderWithPartitionValues {color:#33}in the {color}{color}{color:#f79232}ParquetFileFormat{color} class. (was: The instanceof FileSplit is redundant for {color:#ffc66d}buildReaderWithPartitionValues {color:#33}in the {color}{color}{color:#f79232}ParquetFileFormat {color:#33}class.{color} {color}) > The instanceof FileSplit is redundant for ParquetFileFormat > > > Key: SPARK-25806 > URL: https://issues.apache.org/jira/browse/SPARK-25806 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Trivial > > The instanceof FileSplit is redundant for > {color:#ffc66d}buildReaderWithPartitionValues {color:#33}in the > {color}{color}{color:#f79232}ParquetFileFormat{color} class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat
liuxian created SPARK-25806:
----------------------------

             Summary: The instanceof FileSplit is redundant for ParquetFileFormat
                 Key: SPARK-25806
                 URL: https://issues.apache.org/jira/browse/SPARK-25806
             Project: Spark
          Issue Type: Improvement
          Components: Input/Output
    Affects Versions: 3.0.0
            Reporter: liuxian

The instanceof FileSplit is redundant for buildReaderWithPartitionValues in the ParquetFileFormat class.
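The redundancy the issue describes can be illustrated with a small sketch. This is not Spark's actual code: the FileSplit class, the factory, and both reader methods below are hypothetical stand-ins for the pattern in buildReaderWithPartitionValues, where a value whose static type already guarantees the class is still tested with instanceof and cast.

```java
// Illustrative only: when a value is statically typed as FileSplit, an
// `instanceof FileSplit` check on it can never be false, so both the
// check and the subsequent cast are dead weight.
public class RedundantInstanceofSketch {
    static class FileSplit {
        String getPath() { return "part-00000.parquet"; }
    }

    // Hypothetical factory that, by construction, only produces FileSplits.
    static FileSplit makeSplit() { return new FileSplit(); }

    static String readPathRedundant(FileSplit split) {
        // Redundant pattern: the static type already guarantees FileSplit.
        if (split instanceof FileSplit) {
            return ((FileSplit) split).getPath();
        }
        throw new IllegalStateException("unreachable");
    }

    static String readPathSimplified(FileSplit split) {
        // Same behavior, no redundant check or cast.
        return split.getPath();
    }

    public static void main(String[] args) {
        FileSplit split = makeSplit();
        System.out.println(readPathRedundant(split));
        System.out.println(readPathSimplified(split));
    }
}
```

Removing the check changes nothing observable; it only drops an always-true branch, which is why the issue is filed as Trivial.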
[jira] [Updated] (SPARK-25040) Empty string should be disallowed for data types other than string and binary in JSON
[ https://issues.apache.org/jira/browse/SPARK-25040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang-Chi Hsieh updated SPARK-25040:
------------------------------------
Summary: Empty string should be disallowed for data types other than string and binary in JSON  (was: Empty string for double and float types should be nulls in JSON)

> Empty string should be disallowed for data types other than string and binary in JSON
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-25040
>                 URL: https://issues.apache.org/jira/browse/SPARK-25040
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0, 2.4.0
>            Reporter: Hyukjin Kwon
>            Assignee: Liang-Chi Hsieh
>            Priority: Minor
>             Fix For: 3.0.0
>
> The issue itself seems to be a behaviour change between 1.6 and 2.x in whether an empty string is treated as null for double and float columns.
> {code}
> {"a":"a1","int":1,"other":4.4}
> {"a":"a2","int":"","other":""}
> {code}
> code:
> {code}
> val config = new SparkConf().setMaster("local[5]").setAppName("test")
> val sc = SparkContext.getOrCreate(config)
> val sql = new SQLContext(sc)
> val file_path = this.getClass.getClassLoader.getResource("Sanity4.json").getFile
> val df = sql.read.schema(null).json(file_path)
> df.show(30)
> {code}
> then in Spark 1.6, the result is
> {code}
> +---+----+-----+
> |  a| int|other|
> +---+----+-----+
> | a1|   1|  4.4|
> | a2|null| null|
> +---+----+-----+
> {code}
> {code}
> root
>  |-- a: string (nullable = true)
>  |-- int: long (nullable = true)
>  |-- other: double (nullable = true)
> {code}
> but in Spark 2.2, the result is
> {code}
> +----+----+-----+
> |   a| int|other|
> +----+----+-----+
> |  a1|   1|  4.4|
> |null|null| null|
> +----+----+-----+
> {code}
> {code}
> root
>  |-- a: string (nullable = true)
>  |-- int: long (nullable = true)
>  |-- other: double (nullable = true)
> {code}
> Another easy reproducer:
> {code}
> spark.read.schema("a DOUBLE, b FLOAT")
>   .option("mode", "FAILFAST").json(Seq("""{"a":"", "b": ""}""", """{"a": 1.1, "b": 1.1}""").toDS)
> {code}
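The rule the new summary settles on can be sketched outside of Spark. This is an illustrative model, not Spark's API: the two field parsers below are hypothetical, and stand in for the behavior that an empty JSON string is a legal value only for string/binary fields, while for a numeric field it raises an error (as FAILFAST mode would) instead of silently becoming null or nulling the row.

```java
// Illustrative sketch of "empty string disallowed for non-string types":
// a STRING field accepts "", while a DOUBLE field rejects it outright.
public class EmptyStringRule {
    static double parseDoubleField(String raw) {
        if (raw.isEmpty()) {
            // Under the fixed behavior, this is an error, not a null.
            throw new NumberFormatException("empty string is not a valid DOUBLE");
        }
        return Double.parseDouble(raw);
    }

    static String parseStringField(String raw) {
        return raw; // "" is a perfectly valid STRING value
    }

    public static void main(String[] args) {
        System.out.println(parseDoubleField("1.1"));
        System.out.println("'" + parseStringField("") + "'");
        try {
            parseDoubleField("");
        } catch (NumberFormatException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

This makes the 1.6-vs-2.x discrepancy in the tables above moot for non-string columns: rather than choosing between `null` for one column or for the whole row, the empty string is simply not accepted.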
[jira] [Commented] (SPARK-25040) Empty string should be disallowed for data types except for string and binary types in JSON
[ https://issues.apache.org/jira/browse/SPARK-25040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660143#comment-16660143 ]

Liang-Chi Hsieh commented on SPARK-25040:
-----------------------------------------
The JIRA title was not correct, so I changed it.

> Empty string should be disallowed for data types except for string and binary types in JSON
>
>                 Key: SPARK-25040
>                 URL: https://issues.apache.org/jira/browse/SPARK-25040
[jira] [Updated] (SPARK-25040) Empty string should be disallowed for data types except for string and binary types in JSON
[ https://issues.apache.org/jira/browse/SPARK-25040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang-Chi Hsieh updated SPARK-25040:
------------------------------------
Summary: Empty string should be disallowed for data types except for string and binary types in JSON  (was: Empty string should be disallowed for data types other than string and binary in JSON)

> Empty string should be disallowed for data types except for string and binary types in JSON
>
>                 Key: SPARK-25040
>                 URL: https://issues.apache.org/jira/browse/SPARK-25040
[jira] [Resolved] (SPARK-25796) Enable external shuffle service for kubernetes mode.
[ https://issues.apache.org/jira/browse/SPARK-25796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Sharma resolved SPARK-25796.
-------------------------------------
Resolution: Duplicate

> Enable external shuffle service for kubernetes mode.
> ----------------------------------------------------
>
>                 Key: SPARK-25796
>                 URL: https://issues.apache.org/jira/browse/SPARK-25796
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 3.0.0
>            Reporter: Prashant Sharma
>            Priority: Major
>
> This is required to support dynamic scaling for Spark jobs.