[jira] [Updated] (SPARK-25804) JDOPersistenceManager leak when query via JDBC
[ https://issues.apache.org/jira/browse/SPARK-25804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pin_zhang updated SPARK-25804:
--
Description:
1. start-thriftserver.sh under Spark 2.3.1
2. Create a table and insert values:
{code:sql}
create table test_leak (id string, index int);
insert into test_leak values ('id1', 1);
{code}
3. Create a JDBC client that queries the table:
{code:java}
import java.sql.*;

public class HiveClient {
    public static void main(String[] args) throws Exception {
        String driverName = "org.apache.hive.jdbc.HiveDriver";
        Class.forName(driverName);
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:1/default", "test", "test");
        Statement stmt = con.createStatement();
        String sql = "select * from test_leak";
        int loop = 100;
        while (loop-- > 0) {
            ResultSet rs = stmt.executeQuery(sql);
            rs.next();
            System.out.println(new java.sql.Timestamp(System.currentTimeMillis())
                    + " : " + rs.getString(1));
            rs.close();
            if (loop % 100 == 0) {
                Thread.sleep(1);
            }
        }
        con.close();
    }
}
{code}
4. Dump the HiveServer2 heap: org.datanucleus.api.jdo.JDOPersistenceManager instances keep increasing.

was: the same repro steps, except that the query loop had no Thread.sleep throttling.
> JDOPersistenceManager leak when query via JDBC
> --
>
> Key: SPARK-25804
> URL: https://issues.apache.org/jira/browse/SPARK-25804
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: pin_zhang
> Priority: Major
>
> (repro steps are in the updated description above)

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25817) Dataset encoder should support combination of map and product type
[ https://issues.apache.org/jira/browse/SPARK-25817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25817: Assignee: Wenchen Fan (was: Apache Spark) > Dataset encoder should support combination of map and product type > -- > > Key: SPARK-25817 > URL: https://issues.apache.org/jira/browse/SPARK-25817 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major >
[jira] [Commented] (SPARK-25817) Dataset encoder should support combination of map and product type
[ https://issues.apache.org/jira/browse/SPARK-25817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661713#comment-16661713 ] Apache Spark commented on SPARK-25817: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/22812
[jira] [Assigned] (SPARK-25817) Dataset encoder should support combination of map and product type
[ https://issues.apache.org/jira/browse/SPARK-25817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25817: Assignee: Apache Spark (was: Wenchen Fan)
[jira] [Commented] (SPARK-25817) Dataset encoder should support combination of map and product type
[ https://issues.apache.org/jira/browse/SPARK-25817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661714#comment-16661714 ] Apache Spark commented on SPARK-25817: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/22812
[jira] [Created] (SPARK-25817) Dataset encoder should support combination of map and product type
Wenchen Fan created SPARK-25817: --- Summary: Dataset encoder should support combination of map and product type Key: SPARK-25817 URL: https://issues.apache.org/jira/browse/SPARK-25817 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Wenchen Fan Assignee: Wenchen Fan
[jira] [Commented] (SPARK-25810) Spark structured streaming logs auto.offset.reset=earliest even though startingOffsets is set to latest
[ https://issues.apache.org/jira/browse/SPARK-25810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661703#comment-16661703 ] sandeep katta commented on SPARK-25810: --- [~abanthiy] Thanks for reporting this. Can you please share a screenshot of the logs so I can check exactly which part of the flow is misleading? Is it coming as part of the ConsumerConfig values? > Spark structured streaming logs auto.offset.reset=earliest even though > startingOffsets is set to latest > --- > > Key: SPARK-25810 > URL: https://issues.apache.org/jira/browse/SPARK-25810 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: ANUJA BANTHIYA >Priority: Trivial > > I have an issue when I'm trying to read data from Kafka using Spark > Structured Streaming. > Versions: spark-core_2.11 : 2.3.1, spark-sql_2.11 : 2.3.1, > spark-sql-kafka-0-10_2.11 : 2.3.1, kafka-client : 0.11.0.0 > The issue I am facing is that the Spark job always logs auto.offset.reset = > earliest even though the latest option is specified in the code during startup > of the application. > Code to reproduce: > {code:java} > package com.informatica.exec > import org.apache.spark.sql.SparkSession > object kafkaLatestOffset { > def main(s: Array[String]) { > val spark = SparkSession > .builder() > .appName("Spark Offset basic example") > .master("local[*]") > .getOrCreate() > val df = spark > .readStream > .format("kafka") > .option("kafka.bootstrap.servers", "localhost:9092") > .option("subscribe", "topic1") > .option("startingOffsets", "latest") > .load() > val query = df.writeStream > .outputMode("complete") > .format("console") > .start() > query.awaitTermination() > } > } > {code} > > As mentioned in the Structured Streaming doc, {{startingOffsets}} needs to be set > instead of auto.offset.reset. 
> [https://spark.apache.org/docs/2.3.1/structured-streaming-kafka-integration.html] > * *auto.offset.reset*: Set the source option {{startingOffsets}} to specify > where to start instead. Structured Streaming manages which offsets are > consumed internally, rather than rely on the kafka Consumer to do it. This > will ensure that no data is missed when new topics/partitions are dynamically > subscribed. Note that {{startingOffsets}} only applies when a new streaming > query is started, and that resuming will always pick up from where the query > left off. > During runtime, Kafka messages are picked up from the latest offset, so > function-wise it is working as expected. Only the log is misleading, as it logs > auto.offset.reset = *earliest*.
[jira] [Updated] (SPARK-25797) Views created via 2.1 cannot be read via 2.2+
[ https://issues.apache.org/jira/browse/SPARK-25797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenxiao Mao updated SPARK-25797: - Description: We ran into this issue when we updated our Spark from 2.1 to 2.3. Below is a simple example to reproduce the issue. Create views via Spark 2.1 {code:sql} create view v1 as select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1; {code} Query views via Spark 2.3 {code:sql} select * from v1; Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: decimal(19,0) as it may truncate {code} After investigation, we found that this is because when a view is created via Spark 2.1, the expanded text is saved instead of the original text. Unfortunately, the expanded text below is buggy. {code:sql} spark-sql> desc extended v1; c1 decimal(19,0) NULL Detailed Table Information Database default Table v1 Type VIEW View Text SELECT `gen_attr_0` AS `c1` FROM (SELECT (CAST(CAST(1 AS DECIMAL(18,0)) AS DECIMAL(19,0)) + CAST(CAST(1 AS DECIMAL(18,0)) AS DECIMAL(19,0))) AS `gen_attr_0`) AS gen_subquery_0 {code} We can see that c1 is decimal(19,0); however, the expanded text contains decimal(19,0) + decimal(19,0), which results in decimal(20,0). Since Spark 2.2, a decimal(20,0) in a query is not allowed to be cast to the view definition column decimal(19,0). ([https://github.com/apache/spark/pull/16561]) I further tested other decimal calculations. Only add/subtract has this issue. 
Create views via 2.1: {code:sql} create view v1 as select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1; create view v2 as select (cast(1 as decimal(18,0)) - cast(1 as decimal(18,0))) c1; create view v3 as select (cast(1 as decimal(18,0)) * cast(1 as decimal(18,0))) c1; create view v4 as select (cast(1 as decimal(18,0)) / cast(1 as decimal(18,0))) c1; create view v5 as select (cast(1 as decimal(18,0)) % cast(1 as decimal(18,0))) c1; create view v6 as select cast(1 as decimal(18,0)) c1 union select cast(1 as decimal(19,0)) c1; {code} Query views via Spark 2.3 {code:sql} select * from v1; Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: decimal(19,0) as it may truncate select * from v2; Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3909: decimal(19,0) as it may truncate select * from v3; 1 select * from v4; 1 select * from v5; 0 select * from v6; 1 {code} Views created via Spark 2.2+ don't have this issue because Spark 2.2+ does not generate expanded text for view (https://issues.apache.org/jira/browse/SPARK-18209). was: We ran into this issue when we update our Spark from 2.1 to 2.3. Below's a simple example to reproduce the issue. Create views via Spark 2.1 |create view v1 as select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;| Query views via Spark 2.3 |{{select * from v1;}} {{Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: decimal(19,0) as it may truncate}}| After investigation, we found that this is because when a view is created via Spark 2.1, the expanded text is saved instead of the original text. Unfortunately, the blow expanded text is buggy. 
|spark-sql> desc extended v1; c1 decimal(19,0) NULL Detailed Table Information Database default Table v1 Type VIEW View Text SELECT `gen_attr_0` AS `c1` FROM (SELECT (CAST(CAST(1 AS DECIMAL(18,0)) AS DECIMAL(19,0)) + CAST(CAST(1 AS DECIMAL(18,0)) AS DECIMAL(19,0))) AS `gen_attr_0`) AS gen_subquery_0| We can see that c1 is decimal(19,0), however in the expanded text there is decimal(19,0) + decimal(19,0) which results in decimal(20,0). Since Spark 2.2, decimal(20,0) in query is not allowed to cast to view definition column decimal(19,0). ([https://github.com/apache/spark/pull/16561]) I further tested other decimal calculations. Only add/subtract has this issue. Create views via 2.1: |create view v1 as select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1; create view v2 as select (cast(1 as decimal(18,0)) - cast(1 as decimal(18,0))) c1; create view v3 as select (cast(1 as decimal(18,0)) * cast(1 as decimal(18,0))) c1; create view v4 as select (cast(1 as decimal(18,0)) / cast(1 as decimal(18,0))) c1; create view v5 as select (cast(1 as decimal(18,0)) % cast(1 as decimal(18,0))) c1; create view v6 as select cast(1 as decimal(18,0)) c1 union select cast(1 as decimal(19,0)) c1;| Query views via Spark 2.3 |select * from v1; Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: decimal(19,0) as it may truncate select * from v2; Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3909: decimal(19,0) as it may truncate select * from v3; 1 select * from v4; 1 select * from v5; 0 select * from v6; 1| Views created via Spark 2.2+ don't have this issue because Spark 2.2+ does not generate expanded text for view (https://issues.apache.org/jira/browse/SPARK-18209). > Views created via 2.1 cannot be read via 2.2+ >
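The decimal(20,0) in these errors can be reproduced from the standard SQL decimal typing rule for add/subtract that Spark follows. A minimal sketch (an illustrative helper, not Spark's actual implementation):

```python
# Sketch of the decimal result-type rule for add/subtract:
#   result scale     = max(s1, s2)
#   result precision = max(p1 - s1, p2 - s2) + max(s1, s2) + 1
def add_result_type(p1, s1, p2, s2):
    scale = max(s1, s2)
    precision = max(p1 - s1, p2 - s2) + scale + 1
    return precision, scale

# The expanded view text casts both operands to decimal(19,0), so the sum
# becomes decimal(20,0): wider than the view column decimal(19,0), which
# Spark 2.2+ refuses to down-cast.
print(add_result_type(19, 0, 19, 0))  # (20, 0)
```

This also shows why the original query was fine: decimal(18,0) + decimal(18,0) yields decimal(19,0), exactly the view column type; only the double cast in the buggy expanded text widens it one step further.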
[jira] [Commented] (SPARK-25411) Implement range partition in Spark
[ https://issues.apache.org/jira/browse/SPARK-25411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661610#comment-16661610 ] Wang, Gang commented on SPARK-25411: [~cloud_fan] What do you think of this feature? In our internal benchmark, it does improve performance a lot for huge table joins with predicates. > Implement range partition in Spark > -- > > Key: SPARK-25411 > URL: https://issues.apache.org/jira/browse/SPARK-25411 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wang, Gang >Priority: Major > Attachments: range partition design doc.pdf > > > In our production environment, there are some partitioned fact tables, which are > all quite huge. To accelerate join execution, we need to make them bucketed as > well. Then comes the problem: if the bucket number is large, there > may be too many files (file count = bucket number * partition count), which > may put pressure on HDFS. And if the bucket number is small, Spark will > launch an equal number of tasks to read/write it. > > So, can we implement a new partition type supporting range values, just like range > partitioning in Oracle/MySQL > ([https://docs.oracle.com/cd/E17952_01/mysql-5.7-en/partitioning-range.html])? > Say, we can partition by a date column and make every two months a > partition, or partition by an integer column and make each interval of 1 a > partition. > > Ideally, a feature like range partitioning should be implemented in Hive. However, > it has always been hard to update the Hive version in a prod environment, and it is > much more lightweight and flexible if we implement it in Spark.
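As a hypothetical illustration of the proposal (not an existing Spark API), range partitioning assigns each row to a partition determined by a fixed-width interval of the partition column, so file count depends on the value range rather than on bucket count times partition count:

```python
from datetime import date

# Hypothetical sketch of the proposed range partitioning; the helper names
# are illustrative only.
def int_partition(value: int, interval: int) -> int:
    """Partition id for an integer column with the given interval width."""
    return value // interval

def month_partition(d: date, months_per_partition: int = 2) -> int:
    """Partition id for a date column, grouping every N months together."""
    return (d.year * 12 + (d.month - 1)) // months_per_partition

# "make every two months as a partition": Jan and Feb 2018 share a partition.
print(month_partition(date(2018, 1, 15)), month_partition(date(2018, 2, 15)))
```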
[jira] [Updated] (SPARK-25797) Views created via 2.1 cannot be read via 2.2+
[ https://issues.apache.org/jira/browse/SPARK-25797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenxiao Mao updated SPARK-25797: - Description: We ran into this issue when we updated our Spark from 2.1 to 2.3. Below is a simple example to reproduce the issue. Create views via Spark 2.1 |create view v1 as select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;| Query views via Spark 2.3 |{{select * from v1;}} {{Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: decimal(19,0) as it may truncate}}| After investigation, we found that this is because when a view is created via Spark 2.1, the expanded text is saved instead of the original text. Unfortunately, the expanded text below is buggy. |spark-sql> desc extended v1; c1 decimal(19,0) NULL Detailed Table Information Database default Table v1 Type VIEW View Text SELECT `gen_attr_0` AS `c1` FROM (SELECT (CAST(CAST(1 AS DECIMAL(18,0)) AS DECIMAL(19,0)) + CAST(CAST(1 AS DECIMAL(18,0)) AS DECIMAL(19,0))) AS `gen_attr_0`) AS gen_subquery_0| We can see that c1 is decimal(19,0); however, the expanded text contains decimal(19,0) + decimal(19,0), which results in decimal(20,0). Since Spark 2.2, a decimal(20,0) in a query is not allowed to be cast to the view definition column decimal(19,0). ([https://github.com/apache/spark/pull/16561]) I further tested other decimal calculations. Only add/subtract has this issue. 
Create views via 2.1: |create view v1 as select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1; create view v2 as select (cast(1 as decimal(18,0)) - cast(1 as decimal(18,0))) c1; create view v3 as select (cast(1 as decimal(18,0)) * cast(1 as decimal(18,0))) c1; create view v4 as select (cast(1 as decimal(18,0)) / cast(1 as decimal(18,0))) c1; create view v5 as select (cast(1 as decimal(18,0)) % cast(1 as decimal(18,0))) c1; create view v6 as select cast(1 as decimal(18,0)) c1 union select cast(1 as decimal(19,0)) c1;| Query views via Spark 2.3 |select * from v1; Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: decimal(19,0) as it may truncate select * from v2; Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3909: decimal(19,0) as it may truncate select * from v3; 1 select * from v4; 1 select * from v5; 0 select * from v6; 1| Views created via Spark 2.2+ don't have this issue because Spark 2.2+ does not generate expanded text for view (https://issues.apache.org/jira/browse/SPARK-18209). was: We ran into this issue when we update our Spark from 2.1 to 2.3. Below's a simple example to reproduce the issue. Create views via Spark 2.1 |create view v1 as select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;| Query views via Spark 2.3 |{{select * from v1;}} {{Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: decimal(19,0) as it may truncate}}| After investigation, we found that this is because when a view is created via Spark 2.1, the expanded text is saved instead of the original text. Unfortunately, the blow expanded text is buggy. 
|spark-sql> desc extended v1; c1 decimal(19,0) NULL Detailed Table Information Database default Table v1 Type VIEW View Text SELECT `gen_attr_0` AS `c1` FROM (SELECT (CAST(CAST(1 AS DECIMAL(18,0)) AS DECIMAL(19,0)) + CAST(CAST(1 AS DECIMAL(18,0)) AS DECIMAL(19,0))) AS `gen_attr_0`) AS gen_subquery_0| We can see that c1 is decimal(19,0), however in the expanded text there is decimal(19,0) + decimal(19,0) which results in decimal(20,0). Since Spark 2.2, decimal(20,0) in query is not allowed to cast to view definition column decimal(19,0). ([https://github.com/apache/spark/pull/16561]) I further tested other decimal calculations. Only add/subtract has this issue. Create views via 2.1: |create view v1 as select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1; create view v2 as select (cast(1 as decimal(18,0)) - cast(1 as decimal(18,0))) c1; create view v3 as select (cast(1 as decimal(18,0)) * cast(1 as decimal(18,0))) c1; create view v4 as select (cast(1 as decimal(18,0)) / cast(1 as decimal(18,0))) c1; create view v5 as select (cast(1 as decimal(18,0)) % cast(1 as decimal(18,0))) c1; create view v6 as select cast(1 as decimal(18,0)) c1 union select cast(1 as decimal(19,0)) c1;| Query views via Spark 2.3 |select * from v1; Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: decimal(19,0) as it may truncate select * from v2; Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3909: decimal(19,0) as it may truncate select * from v3; 1 select * from v4; 1 select * from v5; 0 select * from v6; 1| > Views created via 2.1 cannot be read via 2.2+ > - > > Key: SPARK-25797 > URL: https://issues.apache.org/jira/browse/SPARK-25797 > Project: Spark > Issue Type:
[jira] [Assigned] (SPARK-25772) Java encoders - switch fields on collectAsList
[ https://issues.apache.org/jira/browse/SPARK-25772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-25772: --- Assignee: Vladimir Kuriatkov > Java encoders - switch fields on collectAsList > -- > > Key: SPARK-25772 > URL: https://issues.apache.org/jira/browse/SPARK-25772 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: mac os > spark 2.1.1 > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121 >Reporter: Tom >Assignee: Vladimir Kuriatkov >Priority: Major > Fix For: 3.0.0 > > > I have the following schema in a dataset - > root > |-- userId: string (nullable = true) > |-- data: map (nullable = true) > ||-- key: string > ||-- value: struct (valueContainsNull = true) > |||-- startTime: long (nullable = true) > |||-- endTime: long (nullable = true) > |-- offset: long (nullable = true) > And I have the following classes (+ setter and getters which I omitted for > simplicity) - > > {code:java} > public class MyClass { > private String userId; > private Map data; > private Long offset; > } > public class MyDTO { > private long startTime; > private long endTime; > } > {code} > I collect the result the following way - > {code:java} > Encoder myClassEncoder = Encoders.bean(MyClass.class); > Dataset results = raw_df.as(myClassEncoder); > List lst = results.collectAsList(); > {code} > > I do several calculations to get the result I want and the result is correct > all through the way before I collect it. 
> This is the result for - > {code:java} > results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false); > {code} > |data[2017-07-01].startTime|data[2017-07-01].endTime| > +-+--+ > |1498854000|1498870800 | > This is the result after collecting the results for - > {code:java} > MyClass userData = results.collectAsList().get(0); > MyDTO userDTO = userData.getData().get("2017-07-01"); > System.out.println("userDTO startTime: " + userDTO.getStartTime()); > System.out.println("userDTO endTime: " + userDTO.getEndTime()); > {code} > -- > data startTime: 1498870800 > data endTime: 1498854000 > I tend to believe it is a spark issue. Would love any suggestions on how to > bypass it.
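One plausible mechanism for the swap, sketched hypothetically (this is not Spark's actual code): the row carries the struct values in schema order (startTime, endTime), while a bean encoder enumerates properties alphabetically (endTime, startTime); pairing the two by position swaps the values, reproducing the reported output:

```python
# Hypothetical sketch of the suspected bug: struct values in schema order
# paired positionally with bean properties in alphabetical order.
schema_order = ["startTime", "endTime"]
row_values = [1498854000, 1498870800]        # values in schema order
bean_props = sorted(schema_order)            # ['endTime', 'startTime']

bean = dict(zip(bean_props, row_values))     # positional pairing (the suspect)
print(bean["startTime"], bean["endTime"])    # 1498870800 1498854000 -- swapped
```

Resolving struct fields by name instead of by position would avoid the swap, which is consistent with the issue title "switch fields on collectAsList".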
[jira] [Resolved] (SPARK-25772) Java encoders - switch fields on collectAsList
[ https://issues.apache.org/jira/browse/SPARK-25772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25772. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22745 [https://github.com/apache/spark/pull/22745] > Java encoders - switch fields on collectAsList > -- > > Key: SPARK-25772 > URL: https://issues.apache.org/jira/browse/SPARK-25772 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: mac os > spark 2.1.1 > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121 >Reporter: Tom >Priority: Major > Fix For: 3.0.0 >
[jira] [Updated] (SPARK-22809) pyspark is sensitive to imports with dots
[ https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-22809: - Fix Version/s: 2.3.2 > pyspark is sensitive to imports with dots > - > > Key: SPARK-22809 > URL: https://issues.apache.org/jira/browse/SPARK-22809 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0, 2.2.1 >Reporter: Cricket Temple >Assignee: holdenk >Priority: Major > Fix For: 2.3.2, 2.4.0 > > > User code can fail with dotted imports. Here's a repro script. > {noformat} > import numpy as np > import pandas as pd > import pyspark > import scipy.interpolate > import scipy.interpolate as scipy_interpolate > import py4j > scipy_interpolate2 = scipy.interpolate > sc = pyspark.SparkContext() > spark_session = pyspark.SQLContext(sc) > ### > # The details of this dataset are irrelevant # > # Sorry if you'd have preferred something more boring # > ### > x__ = np.linspace(0,10,1000) > freq__ = np.arange(1,5) > x_, freq_ = np.ix_(x__, freq__) > y = np.sin(x_ * freq_).ravel() > x = (x_ * np.ones(freq_.shape)).ravel() > freq = (np.ones(x_.shape) * freq_).ravel() > df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq']) > df_sk = spark_session.createDataFrame(df_pd) > assert(df_sk.toPandas() == df_pd).all().all() > try: > import matplotlib.pyplot as plt > for f, data in df_pd.groupby("freq"): > plt.plot(*data[['x','y']].values.T) > plt.show() > except: > print("I guess we can't plot anything") > def mymap(x, interp_fn): > df = pd.DataFrame.from_records([row.asDict() for row in list(x)]) > return interp_fn(df.x.values, df.y.values)(np.pi) > df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey() > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy_interpolate.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > try: > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy.interpolate.interp1d)).collect() > raise 
Exception("Not going to reach this line") > except py4j.protocol.Py4JJavaError, e: > print("See?") > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy_interpolate2.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > # But now it works! > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy.interpolate.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > {noformat}
[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots
[ https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661502#comment-16661502 ] Bryan Cutler commented on SPARK-22809: -- Sure, I probably shouldn't have tested out of the branches. Running tests again from IPython with Python 3.6.6: *v2.2.2* - Error is raised *v2.3.2* - Working *v2.4.0-rc4* - Working From those results, it seems like SPARK-21070 most likely fixed it > pyspark is sensitive to imports with dots > - > > Key: SPARK-22809 > URL: https://issues.apache.org/jira/browse/SPARK-22809 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0, 2.2.1 >Reporter: Cricket Temple >Assignee: holdenk >Priority: Major > Fix For: 2.4.0 > > > User code can fail with dotted imports. Here's a repro script. > {noformat} > import numpy as np > import pandas as pd > import pyspark > import scipy.interpolate > import scipy.interpolate as scipy_interpolate > import py4j > scipy_interpolate2 = scipy.interpolate > sc = pyspark.SparkContext() > spark_session = pyspark.SQLContext(sc) > ### > # The details of this dataset are irrelevant # > # Sorry if you'd have preferred something more boring # > ### > x__ = np.linspace(0,10,1000) > freq__ = np.arange(1,5) > x_, freq_ = np.ix_(x__, freq__) > y = np.sin(x_ * freq_).ravel() > x = (x_ * np.ones(freq_.shape)).ravel() > freq = (np.ones(x_.shape) * freq_).ravel() > df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq']) > df_sk = spark_session.createDataFrame(df_pd) > assert(df_sk.toPandas() == df_pd).all().all() > try: > import matplotlib.pyplot as plt > for f, data in df_pd.groupby("freq"): > plt.plot(*data[['x','y']].values.T) > plt.show() > except: > print("I guess we can't plot anything") > def mymap(x, interp_fn): > df = pd.DataFrame.from_records([row.asDict() for row in list(x)]) > return interp_fn(df.x.values, df.y.values)(np.pi) > df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey() > result = 
df_by_freq.mapValues(lambda x: mymap(x, > scipy_interpolate.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > try: > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy.interpolate.interp1d)).collect() > raise Exception("Not going to reach this line") > except py4j.protocol.Py4JJavaError, e: > print("See?") > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy_interpolate2.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > # But now it works! > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy.interpolate.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > {noformat}
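Incidentally, the repro script is Python 2 (the `except py4j.protocol.Py4JJavaError, e:` handler and subscripting the result of `zip`); under Python 3 the handler is written with `as`. A minimal sketch, with a stdlib exception standing in for Py4JJavaError so it runs without py4j:

```python
# Python 3 form of the handler pattern used in the repro script;
# ValueError stands in for py4j.protocol.Py4JJavaError here.
try:
    raise ValueError("Not going to reach this line")
except ValueError as e:
    caught = str(e)
    print("See?", caught)
```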
[jira] [Updated] (SPARK-25816) Functions does not resolve Columns correctly
[ https://issues.apache.org/jira/browse/SPARK-25816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Zhang updated SPARK-25816: Attachment: source.snappy.parquet > Functions does not resolve Columns correctly > > > Key: SPARK-25816 > URL: https://issues.apache.org/jira/browse/SPARK-25816 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Brian Zhang >Priority: Critical > Attachments: source.snappy.parquet > > > When there is a duplicate column name in the current Dataframe and the original > Dataframe that the current df is selected from, Spark 2.3.0 and 2.3.1 do > not resolve the column correctly when it is used in an expression, causing a > casting issue. The same code works in Spark 2.2.1. > Please see the code below to reproduce the issue: > import org.apache.spark._ > import org.apache.spark.rdd._ > import org.apache.spark.storage.StorageLevel._ > import org.apache.spark.sql._ > import org.apache.spark.sql.DataFrame > import org.apache.spark.sql.types._ > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.catalyst.expressions._ > import org.apache.spark.sql.Column > val v0 = spark.read.parquet("/data/home/bzinfa/bz/source.snappy.parquet") > val v00 = v0.toDF(v0.schema.fields.indices.view.map("" + _):_*) > val v5 = v00.select($"13".as("0"),$"14".as("1"),$"15".as("2")) > val v5_2 = $"2" > v5.where(lit(500).<(v5_2(new Column(new MapKeys(v5_2.expr))(lit(0))))) > // v00's 3rd column is binary and the 16th is a map > Error: > org.apache.spark.sql.AnalysisException: cannot resolve 'map_keys(`2`)' due to > data type mismatch: argument 1 requires map type, however, '`2`' is of binary > type.; > > 'Project [0#1591, 1#1592, 2#1593] +- 'Filter (500 < > {color:#FF}2#1593{color}[map_keys({color:#FF}2#1561{color})[0]]) +- > Project [13#1572 AS 0#1591, 14#1573 AS 1#1592, 15#1574 AS 2#1593, 2#1561] +- > Project [c_bytes#1527 AS 0#1559, c_union#1528 AS 1#1560, c_fixed#1529 AS > 2#1561, c_boolean#1530 AS
3#1562, c_float#1531 AS 4#1563, c_double#1532 AS > 5#1564, c_int#1533 AS 6#1565, c_long#1534L AS 7#1566L, c_string#1535 AS > 8#1567, c_decimal_18_2#1536 AS 9#1568, c_decimal_28_2#1537 AS 10#1569, > c_decimal_38_2#1538 AS 11#1570, c_date#1539 AS 12#1571, simple_struct#1540 AS > 13#1572, simple_array#1541 AS 14#1573, simple_map#1542 AS 15#1574] +- > Relation[c_bytes#1527,c_union#1528,c_fixed#1529,c_boolean#1530,c_float#1531,c_double#1532,c_int#1533,c_long#1534L,c_string#1535,c_decimal_18_2#1536,c_decimal_28_2#1537,c_decimal_38_2#1538,c_date#1539,simple_struct#1540,simple_array#1541,simple_map#1542] > parquet -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25816) Functions does not resolve Columns correctly
Brian Zhang created SPARK-25816: --- Summary: Functions does not resolve Columns correctly Key: SPARK-25816 URL: https://issues.apache.org/jira/browse/SPARK-25816 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1, 2.3.0 Reporter: Brian Zhang When there is a duplicate column name in the current Dataframe and the original Dataframe that the current df is selected from, Spark 2.3.0 and 2.3.1 do not resolve the column correctly when it is used in an expression, causing a casting issue. The same code works in Spark 2.2.1. Please see the code below to reproduce the issue: import org.apache.spark._ import org.apache.spark.rdd._ import org.apache.spark.storage.StorageLevel._ import org.apache.spark.sql._ import org.apache.spark.sql.DataFrame import org.apache.spark.sql.types._ import org.apache.spark.sql.functions._ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.Column val v0 = spark.read.parquet("/data/home/bzinfa/bz/source.snappy.parquet") val v00 = v0.toDF(v0.schema.fields.indices.view.map("" + _):_*) val v5 = v00.select($"13".as("0"),$"14".as("1"),$"15".as("2")) val v5_2 = $"2" v5.where(lit(500).<(v5_2(new Column(new MapKeys(v5_2.expr))(lit(0))))) // v00's 3rd column is binary and the 16th is a map Error: org.apache.spark.sql.AnalysisException: cannot resolve 'map_keys(`2`)' due to data type mismatch: argument 1 requires map type, however, '`2`' is of binary type.; 'Project [0#1591, 1#1592, 2#1593] +- 'Filter (500 < {color:#FF}2#1593{color}[map_keys({color:#FF}2#1561{color})[0]]) +- Project [13#1572 AS 0#1591, 14#1573 AS 1#1592, 15#1574 AS 2#1593, 2#1561] +- Project [c_bytes#1527 AS 0#1559, c_union#1528 AS 1#1560, c_fixed#1529 AS 2#1561, c_boolean#1530 AS 3#1562, c_float#1531 AS 4#1563, c_double#1532 AS 5#1564, c_int#1533 AS 6#1565, c_long#1534L AS 7#1566L, c_string#1535 AS 8#1567, c_decimal_18_2#1536 AS 9#1568, c_decimal_28_2#1537 AS 10#1569, c_decimal_38_2#1538 AS 11#1570, c_date#1539 AS 12#1571, simple_struct#1540 AS 13#1572,
simple_array#1541 AS 14#1573, simple_map#1542 AS 15#1574] +- Relation[c_bytes#1527,c_union#1528,c_fixed#1529,c_boolean#1530,c_float#1531,c_double#1532,c_int#1533,c_long#1534L,c_string#1535,c_decimal_18_2#1536,c_decimal_28_2#1537,c_decimal_38_2#1538,c_date#1539,simple_struct#1540,simple_array#1541,simple_map#1542] parquet -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24516) PySpark Bindings for K8S - make Python 3 the default
[ https://issues.apache.org/jira/browse/SPARK-24516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661466#comment-16661466 ] Apache Spark commented on SPARK-24516: -- User 'ifilonenko' has created a pull request for this issue: https://github.com/apache/spark/pull/22810 > PySpark Bindings for K8S - make Python 3 the default > > > Key: SPARK-24516 > URL: https://issues.apache.org/jira/browse/SPARK-24516 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, PySpark >Affects Versions: 2.4.0 >Reporter: Ondrej Kokes >Priority: Minor > > Initial PySpark-k8s bindings have just been resolved (SPARK-23984), but the > default Python version there is 2. While you can override this by setting it > to 3, I think we should have sensible defaults. > Python 3 has been around for ten years and is the clear successor, Python 2 > has only 18 months left in terms of support. There isn't a good reason to > suggest Python 2 should be used, not in 2018 and not when both versions are > supported. > The relevant commit [is > here|https://github.com/apache/spark/commit/1a644afbac35c204f9ad55f86999319a9ab458c6#diff-6e882d5561424e7e6651eb46f10104b8R194], > the version is also [in the > documentation|https://github.com/apache/spark/commit/1a644afbac35c204f9ad55f86999319a9ab458c6#diff-b5527f236b253e0d9f5db5164bdb43e9R643]. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24516) PySpark Bindings for K8S - make Python 3 the default
[ https://issues.apache.org/jira/browse/SPARK-24516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661467#comment-16661467 ] Apache Spark commented on SPARK-24516: -- User 'ifilonenko' has created a pull request for this issue: https://github.com/apache/spark/pull/22810 > PySpark Bindings for K8S - make Python 3 the default > > > Key: SPARK-24516 > URL: https://issues.apache.org/jira/browse/SPARK-24516 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, PySpark >Affects Versions: 2.4.0 >Reporter: Ondrej Kokes >Priority: Minor > > Initial PySpark-k8s bindings have just been resolved (SPARK-23984), but the > default Python version there is 2. While you can override this by setting it > to 3, I think we should have sensible defaults. > Python 3 has been around for ten years and is the clear successor, Python 2 > has only 18 months left in terms of support. There isn't a good reason to > suggest Python 2 should be used, not in 2018 and not when both versions are > supported. > The relevant commit [is > here|https://github.com/apache/spark/commit/1a644afbac35c204f9ad55f86999319a9ab458c6#diff-6e882d5561424e7e6651eb46f10104b8R194], > the version is also [in the > documentation|https://github.com/apache/spark/commit/1a644afbac35c204f9ad55f86999319a9ab458c6#diff-b5527f236b253e0d9f5db5164bdb43e9R643]. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24516) PySpark Bindings for K8S - make Python 3 the default
[ https://issues.apache.org/jira/browse/SPARK-24516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24516: Assignee: (was: Apache Spark) > PySpark Bindings for K8S - make Python 3 the default > > > Key: SPARK-24516 > URL: https://issues.apache.org/jira/browse/SPARK-24516 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, PySpark >Affects Versions: 2.4.0 >Reporter: Ondrej Kokes >Priority: Minor > > Initial PySpark-k8s bindings have just been resolved (SPARK-23984), but the > default Python version there is 2. While you can override this by setting it > to 3, I think we should have sensible defaults. > Python 3 has been around for ten years and is the clear successor, Python 2 > has only 18 months left in terms of support. There isn't a good reason to > suggest Python 2 should be used, not in 2018 and not when both versions are > supported. > The relevant commit [is > here|https://github.com/apache/spark/commit/1a644afbac35c204f9ad55f86999319a9ab458c6#diff-6e882d5561424e7e6651eb46f10104b8R194], > the version is also [in the > documentation|https://github.com/apache/spark/commit/1a644afbac35c204f9ad55f86999319a9ab458c6#diff-b5527f236b253e0d9f5db5164bdb43e9R643]. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24516) PySpark Bindings for K8S - make Python 3 the default
[ https://issues.apache.org/jira/browse/SPARK-24516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24516: Assignee: Apache Spark > PySpark Bindings for K8S - make Python 3 the default > > > Key: SPARK-24516 > URL: https://issues.apache.org/jira/browse/SPARK-24516 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, PySpark >Affects Versions: 2.4.0 >Reporter: Ondrej Kokes >Assignee: Apache Spark >Priority: Minor > > Initial PySpark-k8s bindings have just been resolved (SPARK-23984), but the > default Python version there is 2. While you can override this by setting it > to 3, I think we should have sensible defaults. > Python 3 has been around for ten years and is the clear successor, Python 2 > has only 18 months left in terms of support. There isn't a good reason to > suggest Python 2 should be used, not in 2018 and not when both versions are > supported. > The relevant commit [is > here|https://github.com/apache/spark/commit/1a644afbac35c204f9ad55f86999319a9ab458c6#diff-6e882d5561424e7e6651eb46f10104b8R194], > the version is also [in the > documentation|https://github.com/apache/spark/commit/1a644afbac35c204f9ad55f86999319a9ab458c6#diff-b5527f236b253e0d9f5db5164bdb43e9R643]. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
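For readers who land on this thread: until the default changes, the worker Python version for PySpark on Kubernetes is controlled by a single property (introduced in the commit linked above). A sketch of overriding it to Python 3 — property name assumes Spark 2.4's Kubernetes support:

```properties
# spark-defaults.conf, or --conf on spark-submit (Spark 2.4 on K8S assumed)
spark.kubernetes.pyspark.pythonVersion  3
```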
[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots
[ https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661444#comment-16661444 ] Dongjoon Hyun commented on SPARK-22809: --- Hi, [~bryanc]. It seems that the test occurs in `branch-2.2`. Could you confirm 2.3.2, too? > pyspark is sensitive to imports with dots > - > > Key: SPARK-22809 > URL: https://issues.apache.org/jira/browse/SPARK-22809 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0, 2.2.1 >Reporter: Cricket Temple >Assignee: holdenk >Priority: Major > Fix For: 2.4.0 > > > User code can fail with dotted imports. Here's a repro script. > {noformat} > import numpy as np > import pandas as pd > import pyspark > import scipy.interpolate > import scipy.interpolate as scipy_interpolate > import py4j > scipy_interpolate2 = scipy.interpolate > sc = pyspark.SparkContext() > spark_session = pyspark.SQLContext(sc) > ### > # The details of this dataset are irrelevant # > # Sorry if you'd have preferred something more boring # > ### > x__ = np.linspace(0,10,1000) > freq__ = np.arange(1,5) > x_, freq_ = np.ix_(x__, freq__) > y = np.sin(x_ * freq_).ravel() > x = (x_ * np.ones(freq_.shape)).ravel() > freq = (np.ones(x_.shape) * freq_).ravel() > df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq']) > df_sk = spark_session.createDataFrame(df_pd) > assert(df_sk.toPandas() == df_pd).all().all() > try: > import matplotlib.pyplot as plt > for f, data in df_pd.groupby("freq"): > plt.plot(*data[['x','y']].values.T) > plt.show() > except: > print("I guess we can't plot anything") > def mymap(x, interp_fn): > df = pd.DataFrame.from_records([row.asDict() for row in list(x)]) > return interp_fn(df.x.values, df.y.values)(np.pi) > df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey() > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy_interpolate.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > try: > 
result = df_by_freq.mapValues(lambda x: mymap(x, > scipy.interpolate.interp1d)).collect() > raise Exception("Not going to reach this line") > except py4j.protocol.Py4JJavaError, e: > print("See?") > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy_interpolate2.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > # But now it works! > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy.interpolate.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22809) pyspark is sensitive to imports with dots
[ https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-22809. -- Resolution: Fixed Fix Version/s: 2.4.0 > pyspark is sensitive to imports with dots > - > > Key: SPARK-22809 > URL: https://issues.apache.org/jira/browse/SPARK-22809 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0, 2.2.1 >Reporter: Cricket Temple >Assignee: holdenk >Priority: Major > Fix For: 2.4.0 > > > User code can fail with dotted imports. Here's a repro script. > {noformat} > import numpy as np > import pandas as pd > import pyspark > import scipy.interpolate > import scipy.interpolate as scipy_interpolate > import py4j > scipy_interpolate2 = scipy.interpolate > sc = pyspark.SparkContext() > spark_session = pyspark.SQLContext(sc) > ### > # The details of this dataset are irrelevant # > # Sorry if you'd have preferred something more boring # > ### > x__ = np.linspace(0,10,1000) > freq__ = np.arange(1,5) > x_, freq_ = np.ix_(x__, freq__) > y = np.sin(x_ * freq_).ravel() > x = (x_ * np.ones(freq_.shape)).ravel() > freq = (np.ones(x_.shape) * freq_).ravel() > df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq']) > df_sk = spark_session.createDataFrame(df_pd) > assert(df_sk.toPandas() == df_pd).all().all() > try: > import matplotlib.pyplot as plt > for f, data in df_pd.groupby("freq"): > plt.plot(*data[['x','y']].values.T) > plt.show() > except: > print("I guess we can't plot anything") > def mymap(x, interp_fn): > df = pd.DataFrame.from_records([row.asDict() for row in list(x)]) > return interp_fn(df.x.values, df.y.values)(np.pi) > df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey() > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy_interpolate.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > try: > result = df_by_freq.mapValues(lambda x: mymap(x, > 
scipy.interpolate.interp1d)).collect() > raise Exception("Not going to reach this line") > except py4j.protocol.Py4JJavaError, e: > print("See?") > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy_interpolate2.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > # But now it works! > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy.interpolate.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots
[ https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661418#comment-16661418 ] Bryan Cutler commented on SPARK-22809: -- I confirmed that I could reproduce in IPython with Spark branch-2.3 and did not have the issue with branch-2.4. I think we can close this issue {noformat} [Spark shell ASCII-art banner] version 2.4.1-SNAPSHOT Using Python version 3.6.6 (default, Oct 12 2018 14:08:43) SparkSession available as 'spark'. In [1]: import pyspark.cloudpickle ...: import pyspark ...: import py4j ...: rdd = sc.parallelize([(1,2)]) ...: import scipy.interpolate In [2]: import scipy.interpolate ...: def foo(*ards, **kwd): ...: scipy.interpolate.interp1d ...: try: ...: rdd.mapValues(foo).collect() ...: except py4j.protocol.Py4JJavaError as err: ...: print("it errored") ...: import scipy.interpolate as scipy_interpolate ...: def bar(*ards, **kwd): ...: scipy_interpolate.interp1d ...: rdd.mapValues(bar).collect() ...: print("worked") ...: rdd.mapValues(foo).collect() ...: print("worked") worked worked{noformat} {noformat} [Spark shell ASCII-art banner] version 2.2.3-SNAPSHOT Using Python version 3.6.6 (default, Oct 12 2018 14:08:43) SparkSession available as 'spark'.
In [1]: import pyspark.cloudpickle ...: import pyspark ...: import py4j ...: rdd = sc.parallelize([(1,2)]) ...: import scipy.interpolate In [2]: import scipy.interpolate ...: def foo(*ards, **kwd): ...: scipy.interpolate.interp1d ...: try: ...: rdd.mapValues(foo).collect() ...: except py4j.protocol.Py4JJavaError as err: ...: print("it errored") ...: import scipy.interpolate as scipy_interpolate ...: def bar(*ards, **kwd): ...: scipy_interpolate.interp1d ...: rdd.mapValues(bar).collect() ...: print("worked") ...: rdd.mapValues(foo).collect() ...: print("worked") 18/10/23 15:39:54 ERROR Executor: Exception in task 7.0 in stage 0.0 (TID 7) org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 196, in main process() File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 191, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream vs = list(itertools.islice(iterator, batch)) File "/home/bryan/git/spark/python/pyspark/rdd.py", line 1951, in map_values_fn = lambda kv: (kv[0], f(kv[1])) File "", line 3, in foo AttributeError: module 'scipy' has no attribute 'interpolate' at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:197) at org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:238) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:156) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:67) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:344) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) [Stage 0:> (0 + 8) / 8]18/10/23 15:39:54 WARN TaskSetManager: Lost task 7.0 in stage 0.0 (TID 7, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 196, in main process() File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 191, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 268,
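The `AttributeError: module 'scipy' has no attribute 'interpolate'` in the trace above is generic Python submodule-binding behavior, not anything scipy-specific: `import pkg.sub` binds only the name `pkg`, and `pkg.sub` is an attribute of the package only once the submodule has actually been imported in that interpreter. A stdlib-only sketch of the same failure mode, using `xml`/`xml.dom` as stand-ins for `scipy`/`scipy.interpolate`:

```python
import sys

# Simulate a fresh worker interpreter: forget xml and all of its submodules.
for name in [m for m in sys.modules if m == "xml" or m.startswith("xml.")]:
    del sys.modules[name]

import xml  # binds only the top-level package, like a bare `import scipy`

try:
    xml.dom  # fails: the submodule was never imported in this interpreter
    submodule_visible = True
except AttributeError:
    submodule_visible = False

import xml.dom  # after an explicit submodule import, the attribute exists
assert xml.dom is not None
```

This is why aliasing the submodule (`import scipy.interpolate as scipy_interpolate`) works on workers: the pickled closure then references the submodule object directly instead of an attribute lookup on the top-level package.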
[jira] [Created] (SPARK-25815) Kerberos Support in Kubernetes resource manager (Client Mode)
Ilan Filonenko created SPARK-25815: -- Summary: Kerberos Support in Kubernetes resource manager (Client Mode) Key: SPARK-25815 URL: https://issues.apache.org/jira/browse/SPARK-25815 Project: Spark Issue Type: New Feature Components: Kubernetes Affects Versions: 3.0.0 Reporter: Ilan Filonenko Include Kerberos support for Spark on K8S jobs running in client-mode -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23257) Kerberos Support in Kubernetes resource manager (Cluster Mode)
[ https://issues.apache.org/jira/browse/SPARK-23257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilan Filonenko updated SPARK-23257: --- Summary: Kerberos Support in Kubernetes resource manager (Cluster Mode) (was: Implement Kerberos Support in Kubernetes resource manager) > Kerberos Support in Kubernetes resource manager (Cluster Mode) > -- > > Key: SPARK-23257 > URL: https://issues.apache.org/jira/browse/SPARK-23257 > Project: Spark > Issue Type: Wish > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Rob Keevil >Assignee: Ilan Filonenko >Priority: Major > Fix For: 3.0.0 > > > On the forked k8s branch of Spark at > [https://github.com/apache-spark-on-k8s/spark/pull/540] , Kerberos support > has been added to the Kubernetes resource manager. The Kubernetes code > between these two repositories appears to have diverged, so this commit > cannot be merged in easily. Are there any plans to re-implement this work on > the main Spark repository? > > [ifilonenko|https://github.com/ifilonenko] [~liyinan926] I am happy to help > with the development and testing of this, but i wanted to confirm that this > isn't already in progress - I could not find any discussion about this > specific topic online. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion
[ https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661324#comment-16661324 ] kevin yu commented on SPARK-25807: -- I am looking into option 1; option 3 would change existing behavior and probably requires more discussion. Kevin > Mitigate 1-based substr() confusion > --- > > Key: SPARK-25807 > URL: https://issues.apache.org/jira/browse/SPARK-25807 > Project: Spark > Issue Type: Improvement > Components: Java API, PySpark >Affects Versions: 1.3.0, 2.3.2, 2.4.0, 2.5.0, 3.0.0 >Reporter: Oron Navon >Priority: Minor > > The method {{Column.substr()}} is 1-based, conforming with SQL and Hive's > {{SUBSTRING}}, and contradicting both Python's {{substr}} and Java's > {{substr}}, which are zero-based. Both PySpark users and Java API users > often naturally expect a 0-based {{substr()}}. Adding to the confusion, > {{substr()}} currently allows a {{startPos}} value of 0, which returns the > same result as {{startPos==1}}. > Since changing {{substr()}} to 0-based is probably NOT a reasonable option > here, I suggest making one or more of the following changes: > # Adding a method {{substr0}}, which would be zero-based > # Renaming {{substr}} to {{substr1}} > # Making the existing {{substr()}} throw an exception on {{startPos==0}}, > which should catch and alert most users who expect zero-based behavior. > This is my first discussion on this project, apologies for any faux pas. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
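To make the off-by-one concrete: `Column.substr(startPos, length)` follows the SQL/Hive convention (the first character is position 1), while Python slicing is 0-based, and `startPos=0` currently behaves like `startPos=1`. A small pure-Python model of the described behavior (`spark_substr` is a hypothetical helper for illustration, not Spark API):

```python
def spark_substr(s: str, start_pos: int, length: int) -> str:
    """Model Column.substr semantics on a plain string: 1-based start,
    with startPos=0 quietly treated the same as startPos=1."""
    if start_pos == 0:
        start_pos = 1  # the surprising edge case described in the issue
    return s[start_pos - 1 : start_pos - 1 + length]

# 1-based: position 1 is the first character
assert spark_substr("Spark", 1, 3) == "Spa"
# 0 behaves like 1 -- the source of the confusion
assert spark_substr("Spark", 0, 3) == "Spa"
# position 2 starts at the second character, not the third
assert spark_substr("Spark", 2, 3) == "par"
```

Option 3 in the issue would replace the silent `start_pos == 0` branch with a raised error, which is why it is the behavior-changing choice.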
[jira] [Resolved] (SPARK-25801) pandas_udf grouped_map fails with input dataframe with more than 255 columns
[ https://issues.apache.org/jira/browse/SPARK-25801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-25801. -- Resolution: Fixed Fix Version/s: 2.4.0 > pandas_udf grouped_map fails with input dataframe with more than 255 columns > > > Key: SPARK-25801 > URL: https://issues.apache.org/jira/browse/SPARK-25801 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: python 2.7 > pyspark 2.3.0 >Reporter: Frederik >Priority: Major > Fix For: 2.4.0 > > > Hi, > I'm using a pandas_udf to deploy a model to predict all samples in a spark > dataframe, > for this I use a udf as follows: > @pandas_udf("scores double", PandasUDFType.GROUPED_MAP) def > predict_scores(pdf): score_values = model.predict_proba(pdf)[:,1] return > pd.DataFrame({'scores': score_values}) > So it takes a dataframe and predicts the probability of being positive > according to an sklearn model for each row and returns this as a single column. > This works great on a random groupBy, e.g.: > sdf_to_score.groupBy(sf.col('age')).apply(predict_scores) > as long as the dataframe has <255 columns. When the input dataframe has more > than 255 columns (thus features in my model), I get: > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line > 219, in main > func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, > eval_type) > File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line > 148, in read_udfs > mapper = eval(mapper_str, udfs) > File "", line 1 > SyntaxError: more than 255 arguments > Which seems to be related to Python's general limitation of not allowing > more than 255 arguments for a function? > > Is this a bug or is there a straightforward way around this problem?
> > Regards, > Frederik -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25801) pandas_udf grouped_map fails with input dataframe with more than 255 columns
[ https://issues.apache.org/jira/browse/SPARK-25801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661290#comment-16661290 ] Bryan Cutler commented on SPARK-25801: -- [~Toekan] you might try turning your features into an array of doubles, so that there is only one column. Then you could unpack them in your udf if needed. I'll mark this as fixed in Spark 2.4 and close. You can reopen if you are unable to find a workaround and want to request a fix to be backported for the next 2.3 release. > pandas_udf grouped_map fails with input dataframe with more than 255 columns > > > Key: SPARK-25801 > URL: https://issues.apache.org/jira/browse/SPARK-25801 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: python 2.7 > pyspark 2.3.0 >Reporter: Frederik >Priority: Major > > Hi, > I'm using a pandas_udf to deploy a model to predict all samples in a spark > dataframe, > for this I use a udf as follows: > @pandas_udf("scores double", PandasUDFType.GROUPED_MAP) def > predict_scores(pdf): score_values = model.predict_proba(pdf)[:,1] return > pd.DataFrame({'scores': score_values}) > So it takes a dataframe and predicts the probability of being positive > according to an sklearn model for each row and returns this as single column. > This works great on a random groupBy, e.g.: > sdf_to_score.groupBy(sf.col('age')).apply(predict_scores) > as long as the dataframe has <255 columns. 
When the input dataframe has more > than 255 columns (thus features in my model), I get: > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line > 219, in main > func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, > eval_type) > File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line > 148, in read_udfs > mapper = eval(mapper_str, udfs) > File "", line 1 > SyntaxError: more than 255 arguments > Which seems to be related to Python's general limitation of not allowing > more than 255 arguments for a function? > > Is this a bug or is there a straightforward way around this problem? > > Regards, > Frederik -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
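The `SyntaxError: more than 255 arguments` comes from a CPython compile-time limit on explicit call arguments, which the generated udf mapper expression trips over. That limit was removed in CPython 3.7, which can be checked directly with a stdlib-only sketch (nothing Spark-specific here):

```python
import sys

# Build a call expression with 300 explicit arguments, analogous to the
# generated mapper string eval'd in the traceback above.
call_src = "f(" + ", ".join("a%d" % i for i in range(300)) + ")"

try:
    compile(call_src, "<generated>", "eval")
    hit_limit = False
except SyntaxError:  # "more than 255 arguments" on CPython < 3.7
    hit_limit = True

# CPython removed the 255-argument limit in 3.7.
assert hit_limit == (sys.version_info < (3, 7))
```

As the comment above suggests, packing the features into a single array column keeps the generated call well under the limit on older interpreters.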
[jira] [Commented] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
[ https://issues.apache.org/jira/browse/SPARK-25814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661282#comment-16661282 ] Ruslan Dautkhanov commented on SPARK-25814: --- thank you [~vanzin] ! I will try to tune those down and see if this helps. > spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore > -- > > Key: SPARK-25814 > URL: https://issues.apache.org/jira/browse/SPARK-25814 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.2 >Reporter: Ruslan Dautkhanov >Priority: Major > Labels: driver, memory-analysis, memory-leak, statestore > Attachments: image-2018-10-23-14-06-53-722.png > > > We're looking into an issue where even a huge spark driver heap eventually > gets exhausted and GC makes the driver stop responding. > Used the [JXRay.com|http://jxray.com/] tool and found that most of the driver heap is > used by > > {noformat} > org.apache.spark.status.AppStatusStore > -> org.apache.spark.status.ElementTrackingStore > -> org.apache.spark.util.kvstore.InMemoryStore > > {noformat} > > Is there a way to tune this particular spark driver's memory region down? > > > !image-2018-10-23-14-06-53-722.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
[ https://issues.apache.org/jira/browse/SPARK-25814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Dautkhanov updated SPARK-25814: -- Priority: Major (was: Critical) > spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore > -- > > Key: SPARK-25814 > URL: https://issues.apache.org/jira/browse/SPARK-25814 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.2 >Reporter: Ruslan Dautkhanov >Priority: Major > Labels: driver, memory-analysis, memory-leak, statestore > Attachments: image-2018-10-23-14-06-53-722.png > > > We're looking into an issue where even a huge spark driver heap eventually > gets exhausted and GC makes the driver stop responding. > Used the [JXRay.com|http://jxray.com/] tool and found that most of the driver heap is > used by > > {noformat} > org.apache.spark.status.AppStatusStore > -> org.apache.spark.status.ElementTrackingStore > -> org.apache.spark.util.kvstore.InMemoryStore > > {noformat} > > Is there a way to tune this particular spark driver's memory region down? > > > !image-2018-10-23-14-06-53-722.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
[ https://issues.apache.org/jira/browse/SPARK-25814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661269#comment-16661269 ] Marcelo Vanzin commented on SPARK-25814: That's UI data. You can control how much UI data is retained with configs that have been there for a long time: {noformat} spark.ui.retainedTasks spark.ui.retainedStages spark.ui.retainedJobs {noformat} > spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore > -- > > Key: SPARK-25814 > URL: https://issues.apache.org/jira/browse/SPARK-25814 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.2 >Reporter: Ruslan Dautkhanov >Priority: Critical > Labels: driver, memory-analysis, memory-leak, statestore > Attachments: image-2018-10-23-14-06-53-722.png > > > We're looking into issue when even huge spark driver memory gets eventually > exhausted and GC makes driver stop responding. > Used [JXRay.com|http://jxray.com/] tool and found that most of driver heap is > used by > > {noformat} > org.apache.spark.status.AppStatusStore > -> org.apache.spark.status.ElementTrackingStore > -> org.apache.spark.util.kvstore.InMemoryStore > > {noformat} > > Is there is a way to tune this particular spark driver's memory region down? > > > !image-2018-10-23-14-06-53-722.png!
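The three retention settings named above could be lowered in spark-defaults.conf (or via --conf on spark-submit). A minimal sketch; the values below are illustrative only, not recommendations, and the right numbers depend on the workload:

```properties
# Cap how many entries the driver's AppStatusStore/InMemoryStore retains
# for the UI. Illustrative values; Spark's defaults are much higher
# (e.g. 100000 retained tasks).
spark.ui.retainedTasks   10000
spark.ui.retainedStages  200
spark.ui.retainedJobs    200
```

Lower limits trade UI history for driver heap: old tasks, stages, and jobs are evicted sooner and disappear from the web UI.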
[jira] [Updated] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
[ https://issues.apache.org/jira/browse/SPARK-25814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Dautkhanov updated SPARK-25814: -- Description: We're looking into issue when even huge spark driver memory gets eventually exhausted and GC makes driver stop responding. Used [JXRay.com|http://jxray.com/] tool and found that most of driver heap is used by {noformat} org.apache.spark.status.AppStatusStore -> org.apache.spark.status.ElementTrackingStore -> org.apache.spark.util.kvstore.InMemoryStore {noformat} Is there is a way to tune this particular spark driver's memory region down? !image-2018-10-23-14-06-53-722.png! was: We're looking into issue when even huge spark driver memory gets eventually exhausted and GC makes driver stop responding. Used jxray.com tool and found that most of driver heap is used by {noformat} org.apache.spark.status.AppStatusStore -> org.apache.spark.status.ElementTrackingStore -> org.apache.spark.util.kvstore.InMemoryStore {noformat} Is there is a way to tune this particular spark driver's memory region down? !image-2018-10-23-14-06-53-722.png! > spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore > -- > > Key: SPARK-25814 > URL: https://issues.apache.org/jira/browse/SPARK-25814 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.2 >Reporter: Ruslan Dautkhanov >Priority: Critical > Labels: driver, memory-analysis, memory-leak, statestore > Attachments: image-2018-10-23-14-06-53-722.png > > > We're looking into issue when even huge spark driver memory gets eventually > exhausted and GC makes driver stop responding. 
> Used [JXRay.com|http://jxray.com/] tool and found that most of driver heap is > used by > > {noformat} > org.apache.spark.status.AppStatusStore > -> org.apache.spark.status.ElementTrackingStore > -> org.apache.spark.util.kvstore.InMemoryStore > > {noformat} > > Is there is a way to tune this particular spark driver's memory region down? > > > !image-2018-10-23-14-06-53-722.png!
[jira] [Updated] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
[ https://issues.apache.org/jira/browse/SPARK-25814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Dautkhanov updated SPARK-25814: -- Description: We're looking into issue when even huge spark driver memory gets eventually exhausted and GC makes driver stop responding. Used jxray.com tool and found that most of driver heap is used by {noformat} org.apache.spark.status.AppStatusStore -> org.apache.spark.status.ElementTrackingStore -> org.apache.spark.util.kvstore.InMemoryStore {noformat} Is there is a way to tune this particular spark driver's memory region down? !image-2018-10-23-14-06-53-722.png! was: We're looking into issue when even huge spark driver memory gets eventually exhausted and GC makes driver stop responding. Used jxray.com tool and found that most of driver heap is used by {noformat} org.apache.spark.status.AppStatusStore -> org.apache.spark.status.ElementTrackingStore -> org.apache.spark.util.kvstore.InMemoryStore {noformat} Is there is a way to tune this particular spark driver's memory region down? !image-2018-10-23-14-03-12-258.png! !image-2018-10-23-14-06-53-722.png! > spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore > -- > > Key: SPARK-25814 > URL: https://issues.apache.org/jira/browse/SPARK-25814 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.2 >Reporter: Ruslan Dautkhanov >Priority: Critical > Labels: driver, memory-analysis, memory-leak, statestore > Attachments: image-2018-10-23-14-06-53-722.png > > > We're looking into issue when even huge spark driver memory gets eventually > exhausted and GC makes driver stop responding. > Used jxray.com tool and found that most of driver heap is used by > > {noformat} > org.apache.spark.status.AppStatusStore > -> org.apache.spark.status.ElementTrackingStore > -> org.apache.spark.util.kvstore.InMemoryStore > > {noformat} > > Is there is a way to tune this particular spark driver's memory region down? 
> > > !image-2018-10-23-14-06-53-722.png!
[jira] [Updated] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
[ https://issues.apache.org/jira/browse/SPARK-25814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Dautkhanov updated SPARK-25814: -- Description: We're looking into issue when even huge spark driver memory gets eventually exhausted and GC makes driver stop responding. Used jxray.com tool and found that most of driver heap is used by {noformat} org.apache.spark.status.AppStatusStore -> org.apache.spark.status.ElementTrackingStore -> org.apache.spark.util.kvstore.InMemoryStore {noformat} Is there is a way to tune this particular spark driver's memory region down? !image-2018-10-23-14-03-12-258.png! !image-2018-10-23-14-06-53-722.png! was: We're looking into issue when even huge spark driver memory gets eventually exhausted and GC makes driver stop responding. Used jxray.com tool and found that most of driver heap is used by {noformat} org.apache.spark.status.AppStatusStore -> org.apache.spark.status.ElementTrackingStore -> org.apache.spark.util.kvstore.InMemoryStore {noformat} Is there is a way to tune this particular spark driver's memory region down? !image-2018-10-23-14-03-12-258.png! > spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore > -- > > Key: SPARK-25814 > URL: https://issues.apache.org/jira/browse/SPARK-25814 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.2 >Reporter: Ruslan Dautkhanov >Priority: Critical > Labels: driver, memory-analysis, memory-leak, statestore > Attachments: image-2018-10-23-14-06-53-722.png > > > We're looking into issue when even huge spark driver memory gets eventually > exhausted and GC makes driver stop responding. > Used jxray.com tool and found that most of driver heap is used by > > {noformat} > org.apache.spark.status.AppStatusStore > -> org.apache.spark.status.ElementTrackingStore > -> org.apache.spark.util.kvstore.InMemoryStore > > {noformat} > > > Is there is a way to tune this particular spark driver's memory region down? 
> > !image-2018-10-23-14-03-12-258.png! > > !image-2018-10-23-14-06-53-722.png!
[jira] [Updated] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
[ https://issues.apache.org/jira/browse/SPARK-25814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Dautkhanov updated SPARK-25814: -- Attachment: image-2018-10-23-14-06-53-722.png > spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore > -- > > Key: SPARK-25814 > URL: https://issues.apache.org/jira/browse/SPARK-25814 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.2 >Reporter: Ruslan Dautkhanov >Priority: Critical > Labels: driver, memory-analysis, memory-leak, statestore > Attachments: image-2018-10-23-14-06-53-722.png > > > We're looking into issue when even huge spark driver memory gets eventually > exhausted and GC makes driver stop responding. > Used jxray.com tool and found that most of driver heap is used by > > {noformat} > org.apache.spark.status.AppStatusStore > -> org.apache.spark.status.ElementTrackingStore > -> org.apache.spark.util.kvstore.InMemoryStore > > {noformat} > > > Is there is a way to tune this particular spark driver's memory region down? > > !image-2018-10-23-14-03-12-258.png!
[jira] [Created] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
Ruslan Dautkhanov created SPARK-25814: - Summary: spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore Key: SPARK-25814 URL: https://issues.apache.org/jira/browse/SPARK-25814 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.2, 2.2.2 Reporter: Ruslan Dautkhanov Attachments: image-2018-10-23-14-06-53-722.png We're looking into issue when even huge spark driver memory gets eventually exhausted and GC makes driver stop responding. Used jxray.com tool and found that most of driver heap is used by {noformat} org.apache.spark.status.AppStatusStore -> org.apache.spark.status.ElementTrackingStore -> org.apache.spark.util.kvstore.InMemoryStore {noformat} Is there is a way to tune this particular spark driver's memory region down? !image-2018-10-23-14-03-12-258.png!
[jira] [Commented] (SPARK-25813) Unit Test "pageNavigation" for test suite PagedTableSuite.scala is failing for Apache Spark master branch
[ https://issues.apache.org/jira/browse/SPARK-25813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661172#comment-16661172 ] Parth Gandhi commented on SPARK-25813: -- Duplicate JIRA, refer https://issues.apache.org/jira/browse/SPARK-25812. Closing this JIRA. > Unit Test "pageNavigation" for test suite PagedTableSuite.scala is failing > for Apache Spark master branch > - > > Key: SPARK-25813 > URL: https://issues.apache.org/jira/browse/SPARK-25813 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Parth Gandhi >Priority: Major > > The PR [https://github.com/apache/spark/pull/22668] which was merged a few > days back is breaking one unit test for Apache Spark master branch. This > needs to be fixed.
[jira] [Resolved] (SPARK-25813) Unit Test "pageNavigation" for test suite PagedTableSuite.scala is failing for Apache Spark master branch
[ https://issues.apache.org/jira/browse/SPARK-25813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Parth Gandhi resolved SPARK-25813. -- Resolution: Duplicate > Unit Test "pageNavigation" for test suite PagedTableSuite.scala is failing > for Apache Spark master branch > - > > Key: SPARK-25813 > URL: https://issues.apache.org/jira/browse/SPARK-25813 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Parth Gandhi >Priority: Major > > The PR [https://github.com/apache/spark/pull/22668] which was merged a few > days back is breaking one unit test for Apache Spark master branch. This > needs to be fixed.
[jira] [Resolved] (SPARK-25656) Add an example section about how to use Parquet/ORC library options
[ https://issues.apache.org/jira/browse/SPARK-25656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25656. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22801 [https://github.com/apache/spark/pull/22801] > Add an example section about how to use Parquet/ORC library options > --- > > Key: SPARK-25656 > URL: https://issues.apache.org/jira/browse/SPARK-25656 > Project: Spark > Issue Type: Documentation > Components: Documentation, Examples, SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > Our current doc does not explain we are passing the data source specific > options to the underlying data source: > - > https://spark.apache.org/docs/latest/sql-programming-guide.html#manually-specifying-options > We can add some introduction section for both Parquet/ORC examples there. We > had better give both read/write side configuration examples, too. One example > candidate is `dictionary encoding`: `parquet.enable.dictionary` and > `orc.dictionary.key.threshold` et al.
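A sketch of the kind of example the issue asks for, using the two option keys it names. The class name and output paths are hypothetical, and this assumes a local Spark build on the classpath; option keys that Spark's datasource does not itself consume are passed through to the underlying Parquet/ORC library:

```java
import org.apache.spark.sql.SparkSession;

public class DatasourceOptionsExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[1]")
            .appName("datasource-options")
            .getOrCreate();

        // Unconsumed option keys are forwarded to the Parquet writer.
        spark.range(100).write()
            .option("parquet.enable.dictionary", "false")   // Parquet library option
            .parquet("/tmp/parquet_no_dictionary");

        // Likewise for the ORC writer.
        spark.range(100).write()
            .option("orc.dictionary.key.threshold", "0.5")  // ORC library option
            .orc("/tmp/orc_dictionary_threshold");

        spark.stop();
    }
}
```

The same pass-through applies on the read side via `DataFrameReader.option`, which is why the issue suggests documenting both directions.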
[jira] [Assigned] (SPARK-25656) Add an example section about how to use Parquet/ORC library options
[ https://issues.apache.org/jira/browse/SPARK-25656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25656: - Assignee: Dongjoon Hyun > Add an example section about how to use Parquet/ORC library options > --- > > Key: SPARK-25656 > URL: https://issues.apache.org/jira/browse/SPARK-25656 > Project: Spark > Issue Type: Documentation > Components: Documentation, Examples, SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > Our current doc does not explain we are passing the data source specific > options to the underlying data source: > - > https://spark.apache.org/docs/latest/sql-programming-guide.html#manually-specifying-options > We can add some introduction section for both Parquet/ORC examples there. We > had better give both read/write side configuration examples, too. One example > candidate is `dictionary encoding`: `parquet.enable.dictionary` and > `orc.dictionary.key.threshold` et al.
[jira] [Created] (SPARK-25813) Unit Test "pageNavigation" for test suite PagedTableSuite.scala is failing for Apache Spark master branch
Parth Gandhi created SPARK-25813: Summary: Unit Test "pageNavigation" for test suite PagedTableSuite.scala is failing for Apache Spark master branch Key: SPARK-25813 URL: https://issues.apache.org/jira/browse/SPARK-25813 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Parth Gandhi The PR [https://github.com/apache/spark/pull/22668] which was merged a few days back is breaking one unit test for Apache Spark master branch. This needs to be fixed.
[jira] [Resolved] (SPARK-25812) Flaky test: PagedTableSuite.pageNavigation
[ https://issues.apache.org/jira/browse/SPARK-25812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25812. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22808 [https://github.com/apache/spark/pull/22808] > Flaky test: PagedTableSuite.pageNavigation > -- > > Key: SPARK-25812 > URL: https://issues.apache.org/jira/browse/SPARK-25812 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5074/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5073/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5072/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5070/testReport/ > - > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97878/testReport/org.apache.spark.ui/PagedTableSuite/pageNavigation/] > {code:java} > [info] PagedTableSuite: > [info] - pageNavigation *** FAILED *** (2 milliseconds) > [info] > [info] > [info]class="form-inline pull-right" style="margin-bottom: 0px;"> > [info] > [info] > [info] 1 Pages. Jump to > [info] value="1" class="span1"/> > [info] > [info] . Show > [info] value="10" class="span1"/> > [info] items in a page. 
> [info] > [info] Go > [info] > [info] > [info] > [info] Page: > [info] > [info] > [info] > [info] 1 > [info] > [info] > [info] > [info] > [info]did not equal List() (PagedTableSuite.scala:76) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > [info] at > org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:76) > [info] at > org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:52) > {code}
[jira] [Assigned] (SPARK-25812) Flaky test: PagedTableSuite.pageNavigation
[ https://issues.apache.org/jira/browse/SPARK-25812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25812: - Assignee: Gengliang Wang > Flaky test: PagedTableSuite.pageNavigation > -- > > Key: SPARK-25812 > URL: https://issues.apache.org/jira/browse/SPARK-25812 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Gengliang Wang >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5074/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5073/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5072/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5070/testReport/ > - > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97878/testReport/org.apache.spark.ui/PagedTableSuite/pageNavigation/] > {code:java} > [info] PagedTableSuite: > [info] - pageNavigation *** FAILED *** (2 milliseconds) > [info] > [info] > [info]class="form-inline pull-right" style="margin-bottom: 0px;"> > [info] > [info] > [info] 1 Pages. Jump to > [info] value="1" class="span1"/> > [info] > [info] . Show > [info] value="10" class="span1"/> > [info] items in a page. 
> [info] > [info] Go > [info] > [info] > [info] > [info] Page: > [info] > [info] > [info] > [info] 1 > [info] > [info] > [info] > [info] > [info]did not equal List() (PagedTableSuite.scala:76) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > [info] at > org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:76) > [info] at > org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:52) > {code}
[jira] [Updated] (SPARK-25793) Loading model bug in BisectingKMeans
[ https://issues.apache.org/jira/browse/SPARK-25793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25793: -- Target Version/s: 2.4.1, 3.0.0 (was: 2.4.1, 2.5.0) > Loading model bug in BisectingKMeans > > > Key: SPARK-25793 > URL: https://issues.apache.org/jira/browse/SPARK-25793 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.4.0 >Reporter: Weichen Xu >Priority: Major > > See this line: > [https://github.com/apache/spark/blob/fc64e83f9538d6b7e13359a4933a454ba7ed89ec/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L129] > > This also affects `ml.clustering.BisectingKMeansModel`
[jira] [Assigned] (SPARK-25812) Flaky test: PagedTableSuite.pageNavigation
[ https://issues.apache.org/jira/browse/SPARK-25812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25812: Assignee: Apache Spark > Flaky test: PagedTableSuite.pageNavigation > -- > > Key: SPARK-25812 > URL: https://issues.apache.org/jira/browse/SPARK-25812 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5074/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5073/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5072/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5070/testReport/ > - > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97878/testReport/org.apache.spark.ui/PagedTableSuite/pageNavigation/] > {code:java} > [info] PagedTableSuite: > [info] - pageNavigation *** FAILED *** (2 milliseconds) > [info] > [info] > [info]class="form-inline pull-right" style="margin-bottom: 0px;"> > [info] > [info] > [info] 1 Pages. Jump to > [info] value="1" class="span1"/> > [info] > [info] . Show > [info] value="10" class="span1"/> > [info] items in a page. 
> [info] > [info] Go > [info] > [info] > [info] > [info] Page: > [info] > [info] > [info] > [info] 1 > [info] > [info] > [info] > [info] > [info]did not equal List() (PagedTableSuite.scala:76) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > [info] at > org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:76) > [info] at > org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:52) > {code}
[jira] [Commented] (SPARK-25812) Flaky test: PagedTableSuite.pageNavigation
[ https://issues.apache.org/jira/browse/SPARK-25812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660959#comment-16660959 ] Apache Spark commented on SPARK-25812: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/22808 > Flaky test: PagedTableSuite.pageNavigation > -- > > Key: SPARK-25812 > URL: https://issues.apache.org/jira/browse/SPARK-25812 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5074/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5073/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5072/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5070/testReport/ > - > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97878/testReport/org.apache.spark.ui/PagedTableSuite/pageNavigation/] > {code:java} > [info] PagedTableSuite: > [info] - pageNavigation *** FAILED *** (2 milliseconds) > [info] > [info] > [info]class="form-inline pull-right" style="margin-bottom: 0px;"> > [info] > [info] > [info] 1 Pages. Jump to > [info] value="1" class="span1"/> > [info] > [info] . Show > [info] value="10" class="span1"/> > [info] items in a page. 
> [info] > [info] Go > [info] > [info] > [info] > [info] Page: > [info] > [info] > [info] > [info] 1 > [info] > [info] > [info] > [info] > [info]did not equal List() (PagedTableSuite.scala:76) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > [info] at > org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:76) > [info] at > org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:52) > {code}
[jira] [Assigned] (SPARK-25812) Flaky test: PagedTableSuite.pageNavigation
[ https://issues.apache.org/jira/browse/SPARK-25812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25812: Assignee: (was: Apache Spark) > Flaky test: PagedTableSuite.pageNavigation > -- > > Key: SPARK-25812 > URL: https://issues.apache.org/jira/browse/SPARK-25812 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5074/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5073/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5072/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5070/testReport/ > - > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97878/testReport/org.apache.spark.ui/PagedTableSuite/pageNavigation/] > {code:java} > [info] PagedTableSuite: > [info] - pageNavigation *** FAILED *** (2 milliseconds) > [info] > [info] > [info]class="form-inline pull-right" style="margin-bottom: 0px;"> > [info] > [info] > [info] 1 Pages. Jump to > [info] value="1" class="span1"/> > [info] > [info] . Show > [info] value="10" class="span1"/> > [info] items in a page. 
> [info] > [info] Go > [info] > [info] > [info] > [info] Page: > [info] > [info] > [info] > [info] 1 > [info] > [info] > [info] > [info] > [info]did not equal List() (PagedTableSuite.scala:76) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > [info] at > org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:76) > [info] at > org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:52) > {code}
[jira] [Commented] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates
[ https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660955#comment-16660955 ] Apache Spark commented on SPARK-19851: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/22809 > Add support for EVERY and ANY (SOME) aggregates > --- > > Key: SPARK-19851 > URL: https://issues.apache.org/jira/browse/SPARK-19851 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Michael Styles >Priority: Major > > Add support for EVERY and ANY (SOME) aggregates. > - EVERY returns true if all input values are true. > - ANY returns true if at least one input value is true. > - SOME is equivalent to ANY. > Both aggregates are part of the SQL standard. 
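The semantics described in the issue can be sketched with plain Java streams: EVERY corresponds to `allMatch` and ANY/SOME to `anyMatch` (the class and method names below are illustrative, not Spark API):

```java
import java.util.List;

public class EveryAnyDemo {
    // EVERY: true iff all input values are true.
    static boolean every(List<Boolean> values) {
        return values.stream().allMatch(Boolean::booleanValue);
    }

    // ANY (alias SOME): true iff at least one input value is true.
    static boolean any(List<Boolean> values) {
        return values.stream().anyMatch(Boolean::booleanValue);
    }

    public static void main(String[] args) {
        List<Boolean> flags = List.of(true, true, false);
        System.out.println(every(flags)); // false: one input is false
        System.out.println(any(flags));   // true: at least one input is true
    }
}
```

Note this sketch ignores SQL's NULL handling, which a real aggregate implementation would have to address.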
[jira] [Updated] (SPARK-25793) Loading model bug in BisectingKMeans
[ https://issues.apache.org/jira/browse/SPARK-25793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-25793: -- Target Version/s: 2.4.1, 2.5.0 > Loading model bug in BisectingKMeans > > > Key: SPARK-25793 > URL: https://issues.apache.org/jira/browse/SPARK-25793 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.4.0 >Reporter: Weichen Xu >Priority: Major > > See this line: > [https://github.com/apache/spark/blob/fc64e83f9538d6b7e13359a4933a454ba7ed89ec/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L129] > > This also affects `ml.clustering.BisectingKMeansModel`
[jira] [Updated] (SPARK-25812) Flaky test: PagedTableSuite.pageNavigation
[ https://issues.apache.org/jira/browse/SPARK-25812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25812: -- Description:
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5074/testReport/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5073/testReport/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5072/testReport/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5070/testReport/
- [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97878/testReport/org.apache.spark.ui/PagedTableSuite/pageNavigation/]
{code:java}
[info] PagedTableSuite:
[info] - pageNavigation *** FAILED *** (2 milliseconds)
[info] (rendered pagination panel: "1 Pages. Jump to ... Show ... items in a page. Go ... Page: 1" -- the HTML markup was stripped by the mail archive)
[info] did not equal List() (PagedTableSuite.scala:76)
[info] org.scalatest.exceptions.TestFailedException:
[info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
[info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
[info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
[info] at org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:76)
[info] at org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:52)
{code}
was: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97878/testReport/org.apache.spark.ui/PagedTableSuite/pageNavigation/
{code}
(same failure output as above; the HTML markup was stripped by the mail archive)
{code}
> Flaky test: PagedTableSuite.pageNavigation
> --
>
> Key: SPARK-25812
> URL: https://issues.apache.org/jira/browse/SPARK-25812
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
>
> (quoted description: the Jenkins links and failure output above; the quoted copy was truncated by the mail archive)
[jira] [Created] (SPARK-25812) Flaky test: PagedTableSuite.pageNavigation
Dongjoon Hyun created SPARK-25812: - Summary: Flaky test: PagedTableSuite.pageNavigation Key: SPARK-25812 URL: https://issues.apache.org/jira/browse/SPARK-25812 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Dongjoon Hyun https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97878/testReport/org.apache.spark.ui/PagedTableSuite/pageNavigation/
{code}
[info] PagedTableSuite:
[info] - pageNavigation *** FAILED *** (2 milliseconds)
[info] (rendered pagination panel: "1 Pages. Jump to ... Show ... items in a page. Go ... Page: 1" -- the HTML markup was stripped by the mail archive)
[info] did not equal List() (PagedTableSuite.scala:76)
[info] org.scalatest.exceptions.TestFailedException:
[info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
[info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
[info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
[info] at org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:76)
[info] at org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:52)
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25675) [Spark Job History] Job UI page does not show pagination with one page
[ https://issues.apache.org/jira/browse/SPARK-25675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660918#comment-16660918 ] Apache Spark commented on SPARK-25675: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/22808 > [Spark Job History] Job UI page does not show pagination with one page > -- > > Key: SPARK-25675 > URL: https://issues.apache.org/jira/browse/SPARK-25675 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: ABHISHEK KUMAR GUPTA >Assignee: Shivu Sondur >Priority: Major > Fix For: 3.0.0 > > > 1. Set spark.ui.retainedJobs= 1 in the spark-default conf of the Spark Job History server > 2. Restart the Job History server > 3. Submit Beeline jobs for 1 > 4. Launch the Job History UI page > 5. Select the running JDBC application ID from the Incomplete Applications page > 6. Launch the Job page > 7. The pagination panel displays based on the page size, as below > > > Completed Jobs XXX > Page: 1 2 3 ... XX Page: Jump to 1 show 100 items in a > page > > - > 8. Change the value in "Jump to 1 show *XXX* items in a page" so that all > completed jobs are displayed in a single page > *Actual Result:* > All completed jobs are displayed in a page, but there is no pagination panel > through which the user can modify the number of jobs shown per page. > *Expected Result:* > It should display the pagination panel, as below > >>> > Page: 1 1 Page: > Jump to 1 show *XXX* items in a page > > Pagination with page size *1* because it displays the total number of > completed jobs in a single page. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
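A minimal sketch of the page-count arithmetic at issue (a hypothetical helper, not Spark's actual `PagedTable` code): with a page size large enough to hold every job there is exactly one page, and per this report the panel should still render in that case.

```java
// Hypothetical page-count helper, not Spark's PagedTable code: even when all
// items fit on a single page the count is 1, so a pagination panel can (and,
// per this issue, should) still be rendered.
public class PageMath {
    static int pageCount(int totalItems, int pageSize) {
        if (pageSize <= 0) throw new IllegalArgumentException("pageSize must be positive");
        return Math.max(1, (totalItems + pageSize - 1) / pageSize); // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(pageCount(250, 100)); // 3 pages
        System.out.println(pageCount(250, 250)); // 1 page: panel should still show
    }
}
```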
[jira] [Commented] (SPARK-25675) [Spark Job History] Job UI page does not show pagination with one page
[ https://issues.apache.org/jira/browse/SPARK-25675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660914#comment-16660914 ] Apache Spark commented on SPARK-25675: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/22808 > [Spark Job History] Job UI page does not show pagination with one page > -- > > Key: SPARK-25675 > URL: https://issues.apache.org/jira/browse/SPARK-25675 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: ABHISHEK KUMAR GUPTA >Assignee: Shivu Sondur >Priority: Major > Fix For: 3.0.0 > > > 1. Set spark.ui.retainedJobs= 1 in the spark-default conf of the Spark Job History server > 2. Restart the Job History server > 3. Submit Beeline jobs for 1 > 4. Launch the Job History UI page > 5. Select the running JDBC application ID from the Incomplete Applications page > 6. Launch the Job page > 7. The pagination panel displays based on the page size, as below > > > Completed Jobs XXX > Page: 1 2 3 ... XX Page: Jump to 1 show 100 items in a > page > > - > 8. Change the value in "Jump to 1 show *XXX* items in a page" so that all > completed jobs are displayed in a single page > *Actual Result:* > All completed jobs are displayed in a page, but there is no pagination panel > through which the user can modify the number of jobs shown per page. > *Expected Result:* > It should display the pagination panel, as below > >>> > Page: 1 1 Page: > Jump to 1 show *XXX* items in a page > > Pagination with page size *1* because it displays the total number of > completed jobs in a single page. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25811) Support PyArrow's feature to raise an error for unsafe cast
[ https://issues.apache.org/jira/browse/SPARK-25811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660807#comment-16660807 ] Apache Spark commented on SPARK-25811: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/22807 > Support PyArrow's feature to raise an error for unsafe cast > --- > > Key: SPARK-25811 > URL: https://issues.apache.org/jira/browse/SPARK-25811 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > Since 0.11.0, PyArrow supports raising an error for unsafe casts. We should > use it to raise a proper error for pandas UDF users when such a cast is > detected. > We can also add a config to control this behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25811) Support PyArrow's feature to raise an error for unsafe cast
[ https://issues.apache.org/jira/browse/SPARK-25811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660803#comment-16660803 ] Apache Spark commented on SPARK-25811: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/22807 > Support PyArrow's feature to raise an error for unsafe cast > --- > > Key: SPARK-25811 > URL: https://issues.apache.org/jira/browse/SPARK-25811 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > Since 0.11.0, PyArrow supports raising an error for unsafe casts. We should > use it to raise a proper error for pandas UDF users when such a cast is > detected. > We can also add a config to control this behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25811) Support PyArrow's feature to raise an error for unsafe cast
[ https://issues.apache.org/jira/browse/SPARK-25811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25811: Assignee: (was: Apache Spark) > Support PyArrow's feature to raise an error for unsafe cast > --- > > Key: SPARK-25811 > URL: https://issues.apache.org/jira/browse/SPARK-25811 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > Since 0.11.0, PyArrow supports raising an error for unsafe casts. We should > use it to raise a proper error for pandas UDF users when such a cast is > detected. > We can also add a config to control this behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25811) Support PyArrow's feature to raise an error for unsafe cast
[ https://issues.apache.org/jira/browse/SPARK-25811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25811: Assignee: Apache Spark > Support PyArrow's feature to raise an error for unsafe cast > --- > > Key: SPARK-25811 > URL: https://issues.apache.org/jira/browse/SPARK-25811 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark >Priority: Major > > Since 0.11.0, PyArrow supports raising an error for unsafe casts. We should > use it to raise a proper error for pandas UDF users when such a cast is > detected. > We can also add a config to control this behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25811) Support PyArrow's feature to raise an error for unsafe cast
Liang-Chi Hsieh created SPARK-25811: --- Summary: Support PyArrow's feature to raise an error for unsafe cast Key: SPARK-25811 URL: https://issues.apache.org/jira/browse/SPARK-25811 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh Since 0.11.0, PyArrow supports raising an error for unsafe casts. We should use it to raise a proper error for pandas UDF users when such a cast is detected. We can also add a config to control this behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
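The safe-cast idea can be sketched in plain Java (an analogy only, not PyArrow's actual API or implementation): a lossy conversion is detected by round-tripping the truncated value.

```java
// Analogy for a "safe cast" check, in plain Java (not PyArrow's actual API):
// a double-to-long cast is "unsafe" when it would lose data, which we detect
// by round-tripping the truncated value back to double.
public class SafeCast {
    static long castDoubleToLong(double v, boolean safe) {
        long truncated = (long) v;
        if (safe && (double) truncated != v) {
            throw new ArithmeticException("unsafe cast: " + v + " would lose data");
        }
        return truncated;
    }

    public static void main(String[] args) {
        System.out.println(castDoubleToLong(2.0, true));  // 2: lossless, allowed
        System.out.println(castDoubleToLong(2.5, false)); // 2: silently truncated
        try {
            castDoubleToLong(2.5, true);                  // raises instead of truncating
        } catch (ArithmeticException e) {
            System.out.println(e.getMessage());
        }
    }
}
```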
[jira] [Commented] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple ti
[ https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660746#comment-16660746 ] Apache Spark commented on SPARK-25250: -- User 'pgandhi999' has created a pull request for this issue: https://github.com/apache/spark/pull/22806 > Race condition with tasks running when new attempt for same stage is created > leads to other task in the next attempt running on the same partition id > retry multiple times > -- > > Key: SPARK-25250 > URL: https://issues.apache.org/jira/browse/SPARK-25250 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.1 >Reporter: Parth Gandhi >Priority: Major > > We recently hit a race condition where a task from a previous stage attempt finished just before a new attempt for the same stage was created due to a fetch failure, so the new task created in the second attempt for the same partition id retried multiple times with a TaskCommitDenied exception, without realizing that the task in the earlier attempt had already succeeded. > For example, consider a task with partition id 9000 and index 9000 running in stage 4.0. We see a fetch failure and thus spawn a new stage attempt 4.1. Within this timespan, the above task completes successfully, marking partition id 9000 as complete for 4.0. However, as stage 4.1 has not yet been created, the task set info for that stage is not available to the TaskScheduler, so partition id 9000 has not been marked completed for 4.1. Stage 4.1 now spawns a task with index 2000 on the same partition id 9000. This task fails with CommitDeniedException and, since it does not see the corresponding partition id marked successful, it keeps retrying until the job finally succeeds. It doesn't cause any job failures because the DAG scheduler tracks the partitions separately from the task set managers. > > Steps to Reproduce: > # Run any large job involving a shuffle operation. > # When the ShuffleMap stage finishes and the ResultStage begins running, cause this stage to throw a fetch failure exception (try deleting certain shuffle files on any host). > # Observe the task attempt numbers for the next stage attempt. Note that this issue is intermittent, so it might not happen every time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
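The race described above can be reduced to a toy model (hypothetical classes, not Spark's actual DAGScheduler/TaskSetManager code): success is tracked per stage attempt, so an attempt created after the straggler finished doesn't know the partition is already done.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the race, with hypothetical classes (not Spark's actual
// DAGScheduler / TaskSetManager code). The scheduler tracks finished
// partitions globally, but each stage attempt tracks success on its own,
// so attempt 1 re-runs a partition that attempt 0 already finished.
public class StageAttemptRace {
    static Set<Integer> finishedPartitions = new HashSet<>(); // "DAG scheduler" view

    static class Attempt {
        Set<Integer> successfulPartitions = new HashSet<>();  // per-attempt view
        boolean needsToRun(int partition) {
            return !successfulPartitions.contains(partition);
        }
    }

    public static void main(String[] args) {
        Attempt attempt0 = new Attempt();
        // The straggler task for partition 9000 finishes in attempt 0...
        attempt0.successfulPartitions.add(9000);
        finishedPartitions.add(9000);

        // ...but attempt 1 is created afterwards without that knowledge.
        Attempt attempt1 = new Attempt();
        System.out.println(attempt1.needsToRun(9000));         // true: needless retries
        System.out.println(finishedPartitions.contains(9000)); // true: job still succeeds
    }
}
```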
[jira] [Assigned] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple tim
[ https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25250: Assignee: Apache Spark > Race condition with tasks running when new attempt for same stage is created > leads to other task in the next attempt running on the same partition id > retry multiple times > -- > > Key: SPARK-25250 > URL: https://issues.apache.org/jira/browse/SPARK-25250 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.1 >Reporter: Parth Gandhi >Assignee: Apache Spark >Priority: Major > > We recently hit a race condition where a task from a previous stage attempt finished just before a new attempt for the same stage was created due to a fetch failure, so the new task created in the second attempt for the same partition id retried multiple times with a TaskCommitDenied exception, without realizing that the task in the earlier attempt had already succeeded. > For example, consider a task with partition id 9000 and index 9000 running in stage 4.0. We see a fetch failure and thus spawn a new stage attempt 4.1. Within this timespan, the above task completes successfully, marking partition id 9000 as complete for 4.0. However, as stage 4.1 has not yet been created, the task set info for that stage is not available to the TaskScheduler, so partition id 9000 has not been marked completed for 4.1. Stage 4.1 now spawns a task with index 2000 on the same partition id 9000. This task fails with CommitDeniedException and, since it does not see the corresponding partition id marked successful, it keeps retrying until the job finally succeeds. It doesn't cause any job failures because the DAG scheduler tracks the partitions separately from the task set managers. > > Steps to Reproduce: > # Run any large job involving a shuffle operation. > # When the ShuffleMap stage finishes and the ResultStage begins running, cause this stage to throw a fetch failure exception (try deleting certain shuffle files on any host). > # Observe the task attempt numbers for the next stage attempt. Note that this issue is intermittent, so it might not happen every time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple ti
[ https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660744#comment-16660744 ] Apache Spark commented on SPARK-25250: -- User 'pgandhi999' has created a pull request for this issue: https://github.com/apache/spark/pull/22806 > Race condition with tasks running when new attempt for same stage is created > leads to other task in the next attempt running on the same partition id > retry multiple times > -- > > Key: SPARK-25250 > URL: https://issues.apache.org/jira/browse/SPARK-25250 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.1 >Reporter: Parth Gandhi >Priority: Major > > We recently hit a race condition where a task from a previous stage attempt finished just before a new attempt for the same stage was created due to a fetch failure, so the new task created in the second attempt for the same partition id retried multiple times with a TaskCommitDenied exception, without realizing that the task in the earlier attempt had already succeeded. > For example, consider a task with partition id 9000 and index 9000 running in stage 4.0. We see a fetch failure and thus spawn a new stage attempt 4.1. Within this timespan, the above task completes successfully, marking partition id 9000 as complete for 4.0. However, as stage 4.1 has not yet been created, the task set info for that stage is not available to the TaskScheduler, so partition id 9000 has not been marked completed for 4.1. Stage 4.1 now spawns a task with index 2000 on the same partition id 9000. This task fails with CommitDeniedException and, since it does not see the corresponding partition id marked successful, it keeps retrying until the job finally succeeds. It doesn't cause any job failures because the DAG scheduler tracks the partitions separately from the task set managers. > > Steps to Reproduce: > # Run any large job involving a shuffle operation. > # When the ShuffleMap stage finishes and the ResultStage begins running, cause this stage to throw a fetch failure exception (try deleting certain shuffle files on any host). > # Observe the task attempt numbers for the next stage attempt. Note that this issue is intermittent, so it might not happen every time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple tim
[ https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25250: Assignee: (was: Apache Spark) > Race condition with tasks running when new attempt for same stage is created > leads to other task in the next attempt running on the same partition id > retry multiple times > -- > > Key: SPARK-25250 > URL: https://issues.apache.org/jira/browse/SPARK-25250 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.1 >Reporter: Parth Gandhi >Priority: Major > > We recently hit a race condition where a task from a previous stage attempt finished just before a new attempt for the same stage was created due to a fetch failure, so the new task created in the second attempt for the same partition id retried multiple times with a TaskCommitDenied exception, without realizing that the task in the earlier attempt had already succeeded. > For example, consider a task with partition id 9000 and index 9000 running in stage 4.0. We see a fetch failure and thus spawn a new stage attempt 4.1. Within this timespan, the above task completes successfully, marking partition id 9000 as complete for 4.0. However, as stage 4.1 has not yet been created, the task set info for that stage is not available to the TaskScheduler, so partition id 9000 has not been marked completed for 4.1. Stage 4.1 now spawns a task with index 2000 on the same partition id 9000. This task fails with CommitDeniedException and, since it does not see the corresponding partition id marked successful, it keeps retrying until the job finally succeeds. It doesn't cause any job failures because the DAG scheduler tracks the partitions separately from the task set managers. > > Steps to Reproduce: > # Run any large job involving a shuffle operation. > # When the ShuffleMap stage finishes and the ResultStage begins running, cause this stage to throw a fetch failure exception (try deleting certain shuffle files on any host). > # Observe the task attempt numbers for the next stage attempt. Note that this issue is intermittent, so it might not happen every time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25810) Spark structured streaming logs auto.offset.reset=earliest even though startingOffsets is set to latest
ANUJA BANTHIYA created SPARK-25810: -- Summary: Spark structured streaming logs auto.offset.reset=earliest even though startingOffsets is set to latest Key: SPARK-25810 URL: https://issues.apache.org/jira/browse/SPARK-25810 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.3.1 Reporter: ANUJA BANTHIYA I have an issue when I'm trying to read data from Kafka using Spark Structured Streaming. Versions: spark-core_2.11 : 2.3.1, spark-sql_2.11 : 2.3.1, spark-sql-kafka-0-10_2.11 : 2.3.1, kafka-client : 0.11.0.0. The issue I am facing is that the Spark job always logs auto.offset.reset = earliest during application startup, even though the latest option is specified in the code. Code to reproduce:
{code:java}
package com.informatica.exec

import org.apache.spark.sql.SparkSession

object kafkaLatestOffset {
  def main(s: Array[String]) {
    val spark = SparkSession
      .builder()
      .appName("Spark Offset basic example")
      .master("local[*]")
      .getOrCreate()
    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topic1")
      .option("startingOffsets", "latest")
      .load()
    val query = df.writeStream
      .outputMode("complete")
      .format("console")
      .start()
    query.awaitTermination()
  }
}
{code}
As mentioned in the Structured Streaming doc, {{startingOffsets}} needs to be set instead of auto.offset.reset. [https://spark.apache.org/docs/2.3.1/structured-streaming-kafka-integration.html]
* *auto.offset.reset*: Set the source option {{startingOffsets}} to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than rely on the kafka Consumer to do it. This will ensure that no data is missed when new topics/partitions are dynamically subscribed. Note that {{startingOffsets}} only applies when a new streaming query is started, and that resuming will always pick up from where the query left off.
At runtime, Kafka messages are picked up from the latest offset, so functionally it works as expected. Only the log is misleading, as it reports auto.offset.reset = *earliest*. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
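A toy illustration of how such a log line can be misleading (hypothetical maps and values, not the actual Kafka source code path): a component may log an internally pinned consumer setting even though the effective start position is resolved from the user-facing option.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration (hypothetical, not the actual Kafka source code path):
// a component can log an internally pinned consumer setting even though the
// effective start position is resolved from the user-facing option.
public class OffsetConfigSketch {
    public static void main(String[] args) {
        Map<String, String> userOptions = new HashMap<>();
        userOptions.put("startingOffsets", "latest");        // what the user set

        Map<String, String> consumerParams = new HashMap<>();
        consumerParams.put("auto.offset.reset", "earliest"); // pinned internally (hypothetical)

        // The consumer logs its own config, which is what the reporter saw...
        System.out.println("auto.offset.reset = " + consumerParams.get("auto.offset.reset"));
        // ...while the engine resolves the actual start position separately:
        System.out.println("effective start = " + userOptions.get("startingOffsets"));
    }
}
```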
[jira] [Resolved] (SPARK-25791) Datatype of serializers in RowEncoder should be accessible
[ https://issues.apache.org/jira/browse/SPARK-25791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25791. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22785 [https://github.com/apache/spark/pull/22785] > Datatype of serializers in RowEncoder should be accessible > -- > > Key: SPARK-25791 > URL: https://issues.apache.org/jira/browse/SPARK-25791 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 3.0.0 > > > The serializers of {{RowEncoder}} use a few {{If}} Catalyst expressions, which > inherit {{ComplexTypeMergingExpression}} and therefore check input data types. > It is possible to generate serializers that fail this check, making the data > type of the serializers inaccessible. When producing an {{If}} expression, we > should use the same data type for its input expressions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25791) Datatype of serializers in RowEncoder should be accessible
[ https://issues.apache.org/jira/browse/SPARK-25791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-25791: --- Assignee: Liang-Chi Hsieh > Datatype of serializers in RowEncoder should be accessible > -- > > Key: SPARK-25791 > URL: https://issues.apache.org/jira/browse/SPARK-25791 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > > > The serializers of {{RowEncoder}} use a few {{If}} Catalyst expressions, which > inherit {{ComplexTypeMergingExpression}} and therefore check input data types. > It is possible to generate serializers that fail this check, making the data > type of the serializers inaccessible. When producing an {{If}} expression, we > should use the same data type for its input expressions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25809) Support additional K8S cluster types for integration tests
[ https://issues.apache.org/jira/browse/SPARK-25809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25809: Assignee: (was: Apache Spark) > Support additional K8S cluster types for integration tests > -- > > Key: SPARK-25809 > URL: https://issues.apache.org/jira/browse/SPARK-25809 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.2, 2.4.0 >Reporter: Rob Vesse >Priority: Major > > Currently the Spark on K8S integration tests are hardcoded to use a > {{minikube}} based backend. It would be nice if developers had more > flexibility in the choice of K8S cluster they wish to use for integration > testing. More specifically it would be useful to be able to use the built-in > Kubernetes support in recent Docker releases and to just use a generic K8S > cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25809) Support additional K8S cluster types for integration tests
[ https://issues.apache.org/jira/browse/SPARK-25809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25809: Assignee: Apache Spark > Support additional K8S cluster types for integration tests > -- > > Key: SPARK-25809 > URL: https://issues.apache.org/jira/browse/SPARK-25809 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.2, 2.4.0 >Reporter: Rob Vesse >Assignee: Apache Spark >Priority: Major > > Currently the Spark on K8S integration tests are hardcoded to use a > {{minikube}} based backend. It would be nice if developers had more > flexibility in the choice of K8S cluster they wish to use for integration > testing. More specifically it would be useful to be able to use the built-in > Kubernetes support in recent Docker releases and to just use a generic K8S > cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25809) Support additional K8S cluster types for integration tests
[ https://issues.apache.org/jira/browse/SPARK-25809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660641#comment-16660641 ] Apache Spark commented on SPARK-25809: -- User 'rvesse' has created a pull request for this issue: https://github.com/apache/spark/pull/22805 > Support additional K8S cluster types for integration tests > -- > > Key: SPARK-25809 > URL: https://issues.apache.org/jira/browse/SPARK-25809 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.2, 2.4.0 >Reporter: Rob Vesse >Priority: Major > > Currently the Spark on K8S integration tests are hardcoded to use a > {{minikube}} based backend. It would be nice if developers had more > flexibility in the choice of K8S cluster they wish to use for integration > testing. More specifically it would be useful to be able to use the built-in > Kubernetes support in recent Docker releases and to just use a generic K8S > cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25809) Support additional K8S cluster types for integration tests
Rob Vesse created SPARK-25809: - Summary: Support additional K8S cluster types for integration tests Key: SPARK-25809 URL: https://issues.apache.org/jira/browse/SPARK-25809 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.3.2, 2.4.0 Reporter: Rob Vesse Currently the Spark on K8S integration tests are hardcoded to use a {{minikube}} based backend. It would be nice if developers had more flexibility in the choice of K8S cluster they wish to use for integration testing. More specifically it would be useful to be able to use the built-in Kubernetes support in recent Docker releases and to just use a generic K8S cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25805) Flaky test: DataFrameSuite.SPARK-25159 unittest failure
[ https://issues.apache.org/jira/browse/SPARK-25805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25805. - Resolution: Fixed Fix Version/s: 3.0.0 2.4.1 Issue resolved by pull request 22799 [https://github.com/apache/spark/pull/22799] > Flaky test: DataFrameSuite.SPARK-25159 unittest failure > --- > > Key: SPARK-25805 > URL: https://issues.apache.org/jira/browse/SPARK-25805 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.4.1, 3.0.0 > > > I've seen this test fail on internal builds: > {noformat} > Error Message0 did not equal 1Stacktrace > org.scalatest.exceptions.TestFailedException: 0 did not equal 1 > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > at > org.apache.spark.sql.DataFrameSuite$$anonfun$99$$anonfun$apply$mcV$sp$219.apply(DataFrameSuite.scala:2552) > at > org.apache.spark.sql.DataFrameSuite$$anonfun$99$$anonfun$apply$mcV$sp$219.apply(DataFrameSuite.scala:2534) > at > org.apache.spark.sql.test.SQLTestUtilsBase$class.withTempPath(SQLTestUtils.scala:179) > at > org.apache.spark.sql.DataFrameSuite.withTempPath(DataFrameSuite.scala:46) > at > org.apache.spark.sql.DataFrameSuite$$anonfun$99.apply$mcV$sp(DataFrameSuite.scala:2534) > at > org.apache.spark.sql.DataFrameSuite$$anonfun$99.apply(DataFrameSuite.scala:2534) > at > org.apache.spark.sql.DataFrameSuite$$anonfun$99.apply(DataFrameSuite.scala:2534) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at 
org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) > at > org.apache.spark.sql.DataFrameSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(DataFrameSuite.scala:46) > at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:221) > at org.apache.spark.sql.DataFrameSuite.runTest(DataFrameSuite.scala:46) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:392) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > at org.scalatest.Suite$class.run(Suite.scala:1147) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at org.scalatest.SuperEngine.runImpl(Engine.scala:521) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233) > at > 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) > at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1210) > at > org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1257) > at >
[jira] [Assigned] (SPARK-25805) Flaky test: DataFrameSuite.SPARK-25159 unittest failure
[ https://issues.apache.org/jira/browse/SPARK-25805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-25805: --- Assignee: Imran Rashid > Flaky test: DataFrameSuite.SPARK-25159 unittest failure > --- > > Key: SPARK-25805 > URL: https://issues.apache.org/jira/browse/SPARK-25805 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.4.1, 3.0.0 > > > I've seen this test fail on internal builds: > {noformat} > Error Message0 did not equal 1Stacktrace > org.scalatest.exceptions.TestFailedException: 0 did not equal 1 > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > at > org.apache.spark.sql.DataFrameSuite$$anonfun$99$$anonfun$apply$mcV$sp$219.apply(DataFrameSuite.scala:2552) > at > org.apache.spark.sql.DataFrameSuite$$anonfun$99$$anonfun$apply$mcV$sp$219.apply(DataFrameSuite.scala:2534) > at > org.apache.spark.sql.test.SQLTestUtilsBase$class.withTempPath(SQLTestUtils.scala:179) > at > org.apache.spark.sql.DataFrameSuite.withTempPath(DataFrameSuite.scala:46) > at > org.apache.spark.sql.DataFrameSuite$$anonfun$99.apply$mcV$sp(DataFrameSuite.scala:2534) > at > org.apache.spark.sql.DataFrameSuite$$anonfun$99.apply(DataFrameSuite.scala:2534) > at > org.apache.spark.sql.DataFrameSuite$$anonfun$99.apply(DataFrameSuite.scala:2534) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103) > at > 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) > at > org.apache.spark.sql.DataFrameSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(DataFrameSuite.scala:46) > at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:221) > at org.apache.spark.sql.DataFrameSuite.runTest(DataFrameSuite.scala:46) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:392) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > at org.scalatest.Suite$class.run(Suite.scala:1147) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at org.scalatest.SuperEngine.runImpl(Engine.scala:521) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52) > at > 
org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) > at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1210) > at > org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1257) > at > org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1255) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) >
[jira] [Commented] (SPARK-25808) upgrade jsr305 version from 1.3.9 to 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-25808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660381#comment-16660381 ] Apache Spark commented on SPARK-25808: -- User 'daviddingly' has created a pull request for this issue: https://github.com/apache/spark/pull/22803 > upgrade jsr305 version from 1.3.9 to 3.0.0 > -- > > Key: SPARK-25808 > URL: https://issues.apache.org/jira/browse/SPARK-25808 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: ding xiaoyuan >Priority: Minor > > > we find below warnings when build spark project: > {noformat} > [warn] * com.google.code.findbugs:jsr305:3.0.0 is selected over 1.3.9 > [warn] +- org.apache.hadoop:hadoop-common:2.7.3 (depends on 3.0.0) > [warn] +- org.apache.spark:spark-core_2.11:3.0.0-SNAPSHOT (depends on 1.3.9) > [warn] +- org.apache.spark:spark-network-common_2.11:3.0.0-SNAPSHOT (depends > on 1.3.9) > [warn] +- org.apache.spark:spark-unsafe_2.11:3.0.0-SNAPSHOT (depends on > 1.3.9){noformat} > so ideally we need to upgrade jsr305 from 1.3.9 to 3.0.0 to fix this warning -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25808) upgrade jsr305 version from 1.3.9 to 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-25808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25808: Assignee: (was: Apache Spark) > upgrade jsr305 version from 1.3.9 to 3.0.0 > -- > > Key: SPARK-25808 > URL: https://issues.apache.org/jira/browse/SPARK-25808 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: ding xiaoyuan >Priority: Minor > > > we find below warnings when build spark project: > {noformat} > [warn] * com.google.code.findbugs:jsr305:3.0.0 is selected over 1.3.9 > [warn] +- org.apache.hadoop:hadoop-common:2.7.3 (depends on 3.0.0) > [warn] +- org.apache.spark:spark-core_2.11:3.0.0-SNAPSHOT (depends on 1.3.9) > [warn] +- org.apache.spark:spark-network-common_2.11:3.0.0-SNAPSHOT (depends > on 1.3.9) > [warn] +- org.apache.spark:spark-unsafe_2.11:3.0.0-SNAPSHOT (depends on > 1.3.9){noformat} > so ideally we need to upgrade jsr305 from 1.3.9 to 3.0.0 to fix this warning -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25808) upgrade jsr305 version from 1.3.9 to 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-25808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660380#comment-16660380 ] Apache Spark commented on SPARK-25808: -- User 'daviddingly' has created a pull request for this issue: https://github.com/apache/spark/pull/22803 > upgrade jsr305 version from 1.3.9 to 3.0.0 > -- > > Key: SPARK-25808 > URL: https://issues.apache.org/jira/browse/SPARK-25808 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: ding xiaoyuan >Priority: Minor > > > we find below warnings when build spark project: > {noformat} > [warn] * com.google.code.findbugs:jsr305:3.0.0 is selected over 1.3.9 > [warn] +- org.apache.hadoop:hadoop-common:2.7.3 (depends on 3.0.0) > [warn] +- org.apache.spark:spark-core_2.11:3.0.0-SNAPSHOT (depends on 1.3.9) > [warn] +- org.apache.spark:spark-network-common_2.11:3.0.0-SNAPSHOT (depends > on 1.3.9) > [warn] +- org.apache.spark:spark-unsafe_2.11:3.0.0-SNAPSHOT (depends on > 1.3.9){noformat} > so ideally we need to upgrade jsr305 from 1.3.9 to 3.0.0 to fix this warning -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25808) upgrade jsr305 version from 1.3.9 to 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-25808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25808: Assignee: Apache Spark > upgrade jsr305 version from 1.3.9 to 3.0.0 > -- > > Key: SPARK-25808 > URL: https://issues.apache.org/jira/browse/SPARK-25808 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: ding xiaoyuan >Assignee: Apache Spark >Priority: Minor > > > we find below warnings when build spark project: > {noformat} > [warn] * com.google.code.findbugs:jsr305:3.0.0 is selected over 1.3.9 > [warn] +- org.apache.hadoop:hadoop-common:2.7.3 (depends on 3.0.0) > [warn] +- org.apache.spark:spark-core_2.11:3.0.0-SNAPSHOT (depends on 1.3.9) > [warn] +- org.apache.spark:spark-network-common_2.11:3.0.0-SNAPSHOT (depends > on 1.3.9) > [warn] +- org.apache.spark:spark-unsafe_2.11:3.0.0-SNAPSHOT (depends on > 1.3.9){noformat} > so ideally we need to upgrade jsr305 from 1.3.9 to 3.0.0 to fix this warning -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25808) upgrade jsr305 version from 1.3.9 to 3.0.0
ding xiaoyuan created SPARK-25808: - Summary: upgrade jsr305 version from 1.3.9 to 3.0.0 Key: SPARK-25808 URL: https://issues.apache.org/jira/browse/SPARK-25808 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.0 Reporter: ding xiaoyuan we find below warnings when build spark project: {noformat} [warn] * com.google.code.findbugs:jsr305:3.0.0 is selected over 1.3.9 [warn] +- org.apache.hadoop:hadoop-common:2.7.3 (depends on 3.0.0) [warn] +- org.apache.spark:spark-core_2.11:3.0.0-SNAPSHOT (depends on 1.3.9) [warn] +- org.apache.spark:spark-network-common_2.11:3.0.0-SNAPSHOT (depends on 1.3.9) [warn] +- org.apache.spark:spark-unsafe_2.11:3.0.0-SNAPSHOT (depends on 1.3.9){noformat} so ideally we need to upgrade jsr305 from 1.3.9 to 3.0.0 to fix this warning -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25801) pandas_udf grouped_map fails with input dataframe with more than 255 columns
[ https://issues.apache.org/jira/browse/SPARK-25801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660360#comment-16660360 ] Frederik commented on SPARK-25801: -- Hi Bryan, Thanks for the quick answer! I wasn't aware Python 3.7 doesn't have the 255 arguments limitation. Unfortunately I can't use python 3.7 (I'm on a platform where I can't change PYSPARK_DRIVER_PYTHON from 3.6 and PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON need the same minor versions) nor upgrade Spark. Think I'll use an approach with standard udf's as for example outlined here: [https://florianwilhelm.info/2017/10/efficient_udfs_with_pyspark/] Unless there's other options? > pandas_udf grouped_map fails with input dataframe with more than 255 columns > > > Key: SPARK-25801 > URL: https://issues.apache.org/jira/browse/SPARK-25801 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: python 2.7 > pyspark 2.3.0 >Reporter: Frederik >Priority: Major > > Hi, > I'm using a pandas_udf to deploy a model to predict all samples in a spark > dataframe, > for this I use a udf as follows: > @pandas_udf("scores double", PandasUDFType.GROUPED_MAP) def > predict_scores(pdf): score_values = model.predict_proba(pdf)[:,1] return > pd.DataFrame({'scores': score_values}) > So it takes a dataframe and predicts the probability of being positive > according to an sklearn model for each row and returns this as single column. > This works great on a random groupBy, e.g.: > sdf_to_score.groupBy(sf.col('age')).apply(predict_scores) > as long as the dataframe has <255 columns. 
When the input dataframe has more > than 255 columns (thus features in my model), I get: > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line > 219, in main > func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, > eval_type) > File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line > 148, in read_udfs > mapper = eval(mapper_str, udfs) > File "", line 1 > SyntaxError: more than 255 arguments > Which seems to be related with Python's general limitation of having not > allowing more than 255 arguments for a function? > > Is this a bug or is there a straightforward way around this problem? > > Regards, > Frederik -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
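The "SyntaxError: more than 255 arguments" in the traceback above comes from CPython itself: before Python 3.7, a call expression could not name more than 255 explicit arguments, which is what the generated UDF mapper runs into when every column becomes its own argument. The standard way around an argument-count ceiling is to pass the row as one container instead of unpacking each column. A language-level sketch of that idea (no Spark required; `score_row` is a toy stand-in for the model call, not Spark's API):

```python
# Instead of f(col_0, col_1, ..., col_299) -- one argument per column,
# which CPython < 3.7 rejects with "SyntaxError: more than 255 arguments" --
# pass a single mapping that carries all columns.

def score_row(row):
    """Toy stand-in for model.predict_proba: fraction of non-zero features."""
    return sum(1 for v in row.values() if v) / len(row)

# 300 "columns", but only ONE argument in the call expression.
row = {f"col_{i}": i % 2 for i in range(300)}
assert score_row(row) == 0.5  # 150 of 300 features are non-zero
```

The same packing idea underlies the workarounds discussed in the thread: either upgrade to Python 3.7 (where the limit was removed) or restructure the UDF so the per-column fan-out never reaches a single Python call site.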
[jira] [Commented] (SPARK-25665) Refactor ObjectHashAggregateExecBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660291#comment-16660291 ] Apache Spark commented on SPARK-25665: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/22804 > Refactor ObjectHashAggregateExecBenchmark to use main method > > > Key: SPARK-25665 > URL: https://issues.apache.org/jira/browse/SPARK-25665 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25665) Refactor ObjectHashAggregateExecBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25665: Assignee: Apache Spark > Refactor ObjectHashAggregateExecBenchmark to use main method > > > Key: SPARK-25665 > URL: https://issues.apache.org/jira/browse/SPARK-25665 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25665) Refactor ObjectHashAggregateExecBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25665: Assignee: (was: Apache Spark) > Refactor ObjectHashAggregateExecBenchmark to use main method > > > Key: SPARK-25665 > URL: https://issues.apache.org/jira/browse/SPARK-25665 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25665) Refactor ObjectHashAggregateExecBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660290#comment-16660290 ] Apache Spark commented on SPARK-25665: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/22804 > Refactor ObjectHashAggregateExecBenchmark to use main method > > > Key: SPARK-25665 > URL: https://issues.apache.org/jira/browse/SPARK-25665 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25807) Mitigate 1-based substr() confusion
Oron Navon created SPARK-25807: -- Summary: Mitigate 1-based substr() confusion Key: SPARK-25807 URL: https://issues.apache.org/jira/browse/SPARK-25807 Project: Spark Issue Type: Improvement Components: Java API, PySpark Affects Versions: 2.3.2, 1.3.0, 2.4.0, 2.5.0, 3.0.0 Reporter: Oron Navon The method {{Column.substr()}} is 1-based, conforming with SQL and Hive's {{SUBSTRING}}, and contradicting Python's string slicing and Java's {{String.substring()}}, which are zero-based. Both PySpark users and Java API users often naturally expect a 0-based {{substr()}}. Adding to the confusion, {{substr()}} currently allows a {{startPos}} value of 0, which returns the same result as {{startPos==1}}. Since changing {{substr()}} to 0-based is probably NOT a reasonable option here, I suggest making one or more of the following changes: # Adding a method {{substr0}}, which would be zero-based # Renaming {{substr}} to {{substr1}} # Making the existing {{substr()}} throw an exception on {{startPos==0}}, which should catch and alert most users who expect zero-based behavior. This is my first discussion on this project, apologies for any faux pas. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
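The off-by-one trap described above can be shown without Spark at all. A minimal sketch, where `sql_substr` is a hypothetical helper written to mimic {{Column.substr()}} semantics (1-based start, with startPos 0 silently treated like 1), not Spark's actual implementation:

```python
def sql_substr(s: str, start_pos: int, length: int) -> str:
    """Mimic SQL/Hive SUBSTRING semantics: start_pos is 1-based.
    Spark's Column.substr() currently treats start_pos == 0 the same
    as start_pos == 1, which is the lenient behavior modeled here."""
    if start_pos == 0:
        start_pos = 1  # mirrors the reported lenient behavior
    return s[start_pos - 1 : start_pos - 1 + length]

s = "Spark"
assert sql_substr(s, 1, 3) == "Spa"   # SQL-style: position 1 is the first char
assert s[0:3] == "Spa"                # Python-native slicing is 0-based
# The trap: a user expecting 0-based indexing writes start_pos=0 and
# silently gets the same result instead of an error.
assert sql_substr(s, 0, 3) == "Spa"
```

Proposal 3 in the issue amounts to replacing the `start_pos == 0` branch with a raised exception, so 0-based callers fail loudly instead of silently succeeding.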
[jira] [Assigned] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25806: Assignee: Apache Spark > The instanceof FileSplit is redundant for ParquetFileFormat > > > Key: SPARK-25806 > URL: https://issues.apache.org/jira/browse/SPARK-25806 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: liuxian >Assignee: Apache Spark >Priority: Trivial > > The instance of FileSplit is redundant for > {color:#ffc66d}buildReaderWithPartitionValues {color} > {color:#33}in the {color}{color:#f79232}ParquetFileFormat{color} class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660185#comment-16660185 ] Apache Spark commented on SPARK-25806: -- User '10110346' has created a pull request for this issue: https://github.com/apache/spark/pull/22802 > The instanceof FileSplit is redundant for ParquetFileFormat > > > Key: SPARK-25806 > URL: https://issues.apache.org/jira/browse/SPARK-25806 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Trivial > > The instance of FileSplit is redundant for > {color:#ffc66d}buildReaderWithPartitionValues {color} > {color:#33}in the {color}{color:#f79232}ParquetFileFormat{color} class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25806: Assignee: (was: Apache Spark) > The instanceof FileSplit is redundant for ParquetFileFormat > > > Key: SPARK-25806 > URL: https://issues.apache.org/jira/browse/SPARK-25806 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Trivial > > The instance of FileSplit is redundant for > {color:#ffc66d}buildReaderWithPartitionValues {color} > {color:#33}in the {color}{color:#f79232}ParquetFileFormat{color} class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25806: Description: The instance of FileSplit is redundant for {color:#ffc66d}buildReaderWithPartitionValues {color} {color:#33}in the {color}{color:#f79232}ParquetFileFormat{color} class. was: The instanceof FileSplit is redundant for {color:#ffc66d}buildReaderWithPartitionValues {color} {color:#33}in the {color}{color:#f79232}ParquetFileFormat{color} class. > The instanceof FileSplit is redundant for ParquetFileFormat > > > Key: SPARK-25806 > URL: https://issues.apache.org/jira/browse/SPARK-25806 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Trivial > > The instance of FileSplit is redundant for > {color:#ffc66d}buildReaderWithPartitionValues {color} > {color:#33}in the {color}{color:#f79232}ParquetFileFormat{color} class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25806: Description: The instanceof FileSplit is redundant for {color:#ffc66d}buildReaderWithPartitionValues {color} {color:#33}in the {color}{color:#f79232}ParquetFileFormat{color} class. was: The instanceof FileSplit is redundant for {color:#ffc66d}buildReaderWithPartitionValues {color} {color:#33}{color:#ffc66d}in the {color}{color}ParquetFileFormat class. > The instanceof FileSplit is redundant for ParquetFileFormat > > > Key: SPARK-25806 > URL: https://issues.apache.org/jira/browse/SPARK-25806 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Trivial > > The instanceof FileSplit is redundant for > {color:#ffc66d}buildReaderWithPartitionValues {color} > {color:#33}in the {color}{color:#f79232}ParquetFileFormat{color} class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25806: Description: The instanceof FileSplit is redundant for {color:#ffc66d}buildReaderWithPartitionValues {color} {color:#33}{color:#ffc66d}in the {color}{color}ParquetFileFormat class. was:The instanceof FileSplit is redundant for {color:#ffc66d}buildReaderWithPartitionValues {color:#33}in the {color}{color}{color:#f79232}ParquetFileFormat{color} class. > The instanceof FileSplit is redundant for ParquetFileFormat > > > Key: SPARK-25806 > URL: https://issues.apache.org/jira/browse/SPARK-25806 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Trivial > > The instanceof FileSplit is redundant for > {color:#ffc66d}buildReaderWithPartitionValues {color} > {color:#33}{color:#ffc66d}in the {color}{color}ParquetFileFormat class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25806: Description: The instanceof FileSplit is redundant for {color:#ffc66d}buildReaderWithPartitionValues {color:#33}in the {color}{color}{color:#f79232}ParquetFileFormat{color} class. (was: The instanceof FileSplit is redundant for {color:#ffc66d}buildReaderWithPartitionValues {color:#33}in the {color}{color}{color:#f79232}ParquetFileFormat {color:#33}class.{color} {color}) > The instanceof FileSplit is redundant for ParquetFileFormat > > > Key: SPARK-25806 > URL: https://issues.apache.org/jira/browse/SPARK-25806 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Trivial > > The instanceof FileSplit is redundant for > {color:#ffc66d}buildReaderWithPartitionValues {color:#33}in the > {color}{color}{color:#f79232}ParquetFileFormat{color} class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat
liuxian created SPARK-25806:
----------------------------

             Summary: The instanceof FileSplit is redundant for ParquetFileFormat
                 Key: SPARK-25806
                 URL: https://issues.apache.org/jira/browse/SPARK-25806
             Project: Spark
          Issue Type: Improvement
          Components: Input/Output
    Affects Versions: 3.0.0
            Reporter: liuxian

The instanceof FileSplit is redundant for buildReaderWithPartitionValues in the ParquetFileFormat class.
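The redundancy the issue describes can be illustrated with a small sketch. This is not Spark's actual code: the FileSplit class, the factory, and both reader methods below are hypothetical stand-ins for the pattern in buildReaderWithPartitionValues, where a value whose static type already guarantees the class is still tested with instanceof and cast.

```java
// Illustrative only: when a value is statically typed as FileSplit, an
// `instanceof FileSplit` check on it can never be false, so both the
// check and the subsequent cast are dead weight.
public class RedundantInstanceofSketch {
    static class FileSplit {
        String getPath() { return "part-00000.parquet"; }
    }

    // Hypothetical factory that, by construction, only produces FileSplits.
    static FileSplit makeSplit() { return new FileSplit(); }

    static String readPathRedundant(FileSplit split) {
        // Redundant pattern: the static type already guarantees FileSplit.
        if (split instanceof FileSplit) {
            return ((FileSplit) split).getPath();
        }
        throw new IllegalStateException("unreachable");
    }

    static String readPathSimplified(FileSplit split) {
        // Same behavior, no redundant check or cast.
        return split.getPath();
    }

    public static void main(String[] args) {
        FileSplit split = makeSplit();
        System.out.println(readPathRedundant(split));
        System.out.println(readPathSimplified(split));
    }
}
```

Removing the check changes nothing observable; it only drops an always-true branch, which is why the issue is filed as Trivial.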
[jira] [Updated] (SPARK-25040) Empty string should be disallowed for data types other than string and binary in JSON
[ https://issues.apache.org/jira/browse/SPARK-25040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang-Chi Hsieh updated SPARK-25040:
------------------------------------
Summary: Empty string should be disallowed for data types other than string and binary in JSON  (was: Empty string for double and float types should be nulls in JSON)

> Empty string should be disallowed for data types other than string and binary in JSON
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-25040
>                 URL: https://issues.apache.org/jira/browse/SPARK-25040
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0, 2.4.0
>            Reporter: Hyukjin Kwon
>            Assignee: Liang-Chi Hsieh
>            Priority: Minor
>             Fix For: 3.0.0
>
> The issue itself seems to be a behaviour change between 1.6 and 2.x in whether an empty string is treated as null for double and float columns.
> {code}
> {"a":"a1","int":1,"other":4.4}
> {"a":"a2","int":"","other":""}
> {code}
> code:
> {code}
> val config = new SparkConf().setMaster("local[5]").setAppName("test")
> val sc = SparkContext.getOrCreate(config)
> val sql = new SQLContext(sc)
> val file_path = this.getClass.getClassLoader.getResource("Sanity4.json").getFile
> val df = sql.read.schema(null).json(file_path)
> df.show(30)
> {code}
> then in Spark 1.6, the result is
> {code}
> +---+----+-----+
> |  a| int|other|
> +---+----+-----+
> | a1|   1|  4.4|
> | a2|null| null|
> +---+----+-----+
> {code}
> {code}
> root
>  |-- a: string (nullable = true)
>  |-- int: long (nullable = true)
>  |-- other: double (nullable = true)
> {code}
> but in Spark 2.2, the result is
> {code}
> +----+----+-----+
> |   a| int|other|
> +----+----+-----+
> |  a1|   1|  4.4|
> |null|null| null|
> +----+----+-----+
> {code}
> {code}
> root
>  |-- a: string (nullable = true)
>  |-- int: long (nullable = true)
>  |-- other: double (nullable = true)
> {code}
> Another easy reproducer:
> {code}
> spark.read.schema("a DOUBLE, b FLOAT")
>   .option("mode", "FAILFAST").json(Seq("""{"a":"", "b": ""}""", """{"a": 1.1, "b": 1.1}""").toDS)
> {code}
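The rule the new summary settles on can be sketched outside of Spark. This is an illustrative model, not Spark's API: the two field parsers below are hypothetical, and stand in for the behavior that an empty JSON string is a legal value only for string/binary fields, while for a numeric field it raises an error (as FAILFAST mode would) instead of silently becoming null or nulling the row.

```java
// Illustrative sketch of "empty string disallowed for non-string types":
// a STRING field accepts "", while a DOUBLE field rejects it outright.
public class EmptyStringRule {
    static double parseDoubleField(String raw) {
        if (raw.isEmpty()) {
            // Under the fixed behavior, this is an error, not a null.
            throw new NumberFormatException("empty string is not a valid DOUBLE");
        }
        return Double.parseDouble(raw);
    }

    static String parseStringField(String raw) {
        return raw; // "" is a perfectly valid STRING value
    }

    public static void main(String[] args) {
        System.out.println(parseDoubleField("1.1"));
        System.out.println("'" + parseStringField("") + "'");
        try {
            parseDoubleField("");
        } catch (NumberFormatException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

This makes the 1.6-vs-2.x discrepancy in the tables above moot for non-string columns: rather than choosing between `null` for one column or for the whole row, the empty string is simply not accepted.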
[jira] [Commented] (SPARK-25040) Empty string should be disallowed for data types except for string and binary types in JSON
[ https://issues.apache.org/jira/browse/SPARK-25040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660143#comment-16660143 ]

Liang-Chi Hsieh commented on SPARK-25040:
-----------------------------------------
The JIRA title was not correct, so I changed it.

> Empty string should be disallowed for data types except for string and binary types in JSON
>
>                 Key: SPARK-25040
>                 URL: https://issues.apache.org/jira/browse/SPARK-25040
[jira] [Updated] (SPARK-25040) Empty string should be disallowed for data types except for string and binary types in JSON
[ https://issues.apache.org/jira/browse/SPARK-25040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang-Chi Hsieh updated SPARK-25040:
------------------------------------
Summary: Empty string should be disallowed for data types except for string and binary types in JSON  (was: Empty string should be disallowed for data types other than string and binary in JSON)

> Empty string should be disallowed for data types except for string and binary types in JSON
>
>                 Key: SPARK-25040
>                 URL: https://issues.apache.org/jira/browse/SPARK-25040
[jira] [Resolved] (SPARK-25796) Enable external shuffle service for kubernetes mode.
[ https://issues.apache.org/jira/browse/SPARK-25796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Sharma resolved SPARK-25796.
-------------------------------------
Resolution: Duplicate

> Enable external shuffle service for kubernetes mode.
> ----------------------------------------------------
>
>                 Key: SPARK-25796
>                 URL: https://issues.apache.org/jira/browse/SPARK-25796
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 3.0.0
>            Reporter: Prashant Sharma
>            Priority: Major
>
> This is required to support dynamic scaling for Spark jobs.