[jira] [Commented] (SPARK-23402) Dataset write method not working as expected for postgresql database
[ https://issues.apache.org/jira/browse/SPARK-23402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363584#comment-16363584 ] kevin yu commented on SPARK-23402: -- Thanks, I will install 9.5.8 and try again. > Dataset write method not working as expected for postgresql database > > > Key: SPARK-23402 > URL: https://issues.apache.org/jira/browse/SPARK-23402 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL > Affects Versions: 2.2.1 > Environment: PostgreSQL: 9.5.8 (same issue on 10+) > OS: CentOS 7 & Windows 7, 8 > JDBC driver: 9.4-1201-jdbc41 > > Spark: reproduced on both 2.1.0 and 2.2.1 > Mode: Standalone > Reporter: Pallapothu Jyothi Swaroop > Priority: Major > Attachments: Emsku[1].jpg > > > I am using Spark's Dataset write to insert data into an existing PostgreSQL table, using the write method with SaveMode.Append. While doing so I get an exception that the table already exists, even though I specified Append mode. > It's strange: when I point the same options at SQL Server or Oracle, Append mode works as expected.
> > *Database Properties:* > {{destinationProps.put("driver", "org.postgresql.Driver"); > destinationProps.put("url", "jdbc:postgresql://127.0.0.1:30001/dbmig"); > destinationProps.put("user", "dbmig");}} > {{destinationProps.put("password", "dbmig");}} > > *Dataset Write Code:* > {{valueAnalysisDataset.write().mode(SaveMode.Append).jdbc(destinationDbMap.get("url"), > "dqvalue", destinationdbProperties);}} > > > {{Exception in thread "main" org.postgresql.util.PSQLException: ERROR: > relation "dqvalue" already exists at > org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2412) > at > org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2125) > at > org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:297) > at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428) at > org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354) at > org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:301) at > org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:287) at > org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:264) at > org.postgresql.jdbc.PgStatement.executeUpdate(PgStatement.java:244) at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:806) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95) > at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469) > at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) > at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233) at > org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:460) at > com.ads.dqam.action.impl.PostgresValueAnalysis.persistValueAnalysis(PostgresValueAnalysis.java:25) > at > com.ads.dqam.action.AbstractValueAnalysis.persistAnalysis(AbstractValueAnalysis.java:81) > at com.ads.dqam.Analysis.doAnalysis(Analysis.java:32) at > com.ads.dqam.Client.main(Client.java:71)}} > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
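A minimal sketch of the SaveMode.Append path involved in this report, written in plain Python over sqlite3 rather than Spark's actual Scala code: the writer probes for the table with a throwaway query and only issues CREATE TABLE when the probe fails. If the existence probe misreports, Append degenerates into CREATE TABLE against an existing relation, which is exactly the "relation \"dqvalue\" already exists" error in the stack trace above. The probe query shown is an assumption modelled on typical JDBC dialect behavior.

```python
import sqlite3

def table_exists(conn, table):
    # Spark's JdbcUtils.tableExists runs a dialect-specific probe query
    # and treats any exception as "table does not exist". This mirrors
    # that pattern with sqlite3 (probe query is an assumption here).
    try:
        conn.execute(f"SELECT 1 FROM {table} LIMIT 1")
        return True
    except sqlite3.Error:
        return False

def save_append(conn, table, rows):
    # Sketch of the Append branch: create the table only when the
    # existence probe says it is missing. If the probe wrongly returns
    # False, we issue CREATE TABLE against an existing relation --
    # producing the "relation already exists" failure reported above.
    if not table_exists(conn, table):
        conn.execute(f"CREATE TABLE {table} (id INTEGER)")
    conn.executemany(f"INSERT INTO {table} VALUES (?)", rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dqvalue (id INTEGER)")  # table pre-exists
save_append(conn, "dqvalue", [(1,), (2,)])
print(conn.execute("SELECT COUNT(*) FROM dqvalue").fetchone()[0])  # 2
```

With a correct probe, appending to the pre-existing table inserts rows without attempting a second CREATE TABLE.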
[jira] [Comment Edited] (SPARK-23402) Dataset write method not working as expected for postgresql database
[ https://issues.apache.org/jira/browse/SPARK-23402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363555#comment-16363555 ] Pallapothu Jyothi Swaroop edited comment on SPARK-23402 at 2/14/18 6:48 AM: [~kevinyu98] Thanks for checking again. I tested with 9.5.4, and Append mode works there without the exception. I analyzed something that may be useful for you. Can you check this Scala file: https://github.com/apache/spark/blob/v2.2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala Below is the statement that checks whether the table exists; this is where it fails: val tableExists = JdbcUtils.tableExists(conn, options) But I am not sure why it fails. I executed the table-exists SQL taken from the Postgres dialect, and it ran successfully in the database.
[jira] [Commented] (SPARK-23402) Dataset write method not working as expected for postgresql database
[ https://issues.apache.org/jira/browse/SPARK-23402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363539#comment-16363539 ] kevin yu commented on SPARK-23402: -- Yes, I created an empty table (emptytable) in database (mydb) in PostgreSQL, then ran the above statement from spark-shell, and it works fine. The only difference I see is that my Postgres is at 9.5.6, while yours is at 9.5.8+.
[jira] [Updated] (SPARK-23368) Avoid unnecessary Exchange or Sort after projection
[ https://issues.apache.org/jira/browse/SPARK-23368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated SPARK-23368: Summary: Avoid unnecessary Exchange or Sort after projection (was: OutputOrdering and OutputPartitioning in ProjectExec should reflect the projected columns) > Avoid unnecessary Exchange or Sort after projection > --- > > Key: SPARK-23368 > URL: https://issues.apache.org/jira/browse/SPARK-23368 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.3.0 > Reporter: Maryann Xue > Priority: Minor > > After a column-rename projection, the ProjectExec's outputOrdering and > outputPartitioning should reflect the projected columns as well. For example, > {code:java} > SELECT b1 > FROM ( > SELECT a a1, b b1 > FROM testData2 > ORDER BY a > ) > ORDER BY a1{code} > The inner query is ordered on a1 as well. If we had a rule to eliminate a Sort > over an already-sorted result, then together with this fix, the ORDER BY in the outer query > could have been optimized out. > > Similarly, the query below > {code:java} > SELECT * > FROM ( > SELECT t1.a a1, t2.a a2, t1.b b1, t2.b b2 > FROM testData2 t1 > LEFT JOIN testData2 t2 > ON t1.a = t2.a > ) > JOIN testData2 t3 > ON a1 = t3.a{code} > is equivalent to > {code:java} > SELECT * > FROM testData2 t1 > LEFT JOIN testData2 t2 > ON t1.a = t2.a > JOIN testData2 t3 > ON t1.a = t3.a{code} > so the unnecessary sorting and hash-partitioning that were optimized > out for the second query should be eliminated in the first query as well.
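The alias-tracking idea behind this improvement can be sketched in a few lines of Python. This is a simplified model, not Spark's implementation: column names stand in for Spark's expression IDs, and the "projection" is a plain rename map. An ordering on the input survives a projection only if every ordering column is projected, and it is then expressed in the output names.

```python
def project_ordering(ordering, projection):
    """Rewrite an input ordering in terms of projected output names.

    `projection` maps input column -> output column (a rename
    projection such as SELECT a AS a1, b AS b1).  Returns the
    ordering in output names, or [] if the ordering is lost and a
    downstream Sort would be required.  Simplified model of what
    ProjectExec's outputOrdering could propagate.
    """
    if all(col in projection for col in ordering):
        return [projection[col] for col in ordering]
    return []

# Inner query: SELECT a AS a1, b AS b1 FROM testData2 ORDER BY a
# The output is therefore already ordered on a1, so the outer
# ORDER BY a1 could be eliminated.
print(project_ordering(["a"], {"a": "a1", "b": "b1"}))  # ['a1']
```

The same bookkeeping applies to outputPartitioning: a hash partitioning on `t1.a` survives a rename to `a1`, so the join on `a1 = t3.a` needs no extra Exchange.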
[jira] [Commented] (SPARK-23402) Dataset write method not working as expected for postgresql database
[ https://issues.apache.org/jira/browse/SPARK-23402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363512#comment-16363512 ] Pallapothu Jyothi Swaroop commented on SPARK-23402: --- [~kevinyu98] Did you create the table before executing the above instructions? It throws the exception only when the table already exists in the database. Please run the above statements again; you will get the exception. Let me know whether the issue is replicated.
[jira] [Created] (SPARK-23420) Datasource loading not handling paths with regex chars.
Mitchell created SPARK-23420: Summary: Datasource loading not handling paths with regex chars. Key: SPARK-23420 URL: https://issues.apache.org/jira/browse/SPARK-23420 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.2.1 Reporter: Mitchell Greetings, during some recent testing I ran across an issue when attempting to load files with regex chars like []()* etc. in their names. The files are valid in the various storages, and the normal Hadoop APIs all access them properly. When my code is executed, I get the following stack trace. 18/02/14 04:52:46 ERROR yarn.ApplicationMaster: User class threw exception: java.io.IOException: Illegal file pattern: Unmatched closing ')' near index 130 A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_?? ^ java.io.IOException: Illegal file pattern: Unmatched closing ')' near index 130 A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
^ at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:71) at org.apache.hadoop.fs.GlobFilter.&lt;init&gt;(GlobFilter.java:50) at org.apache.hadoop.fs.Globber.doGlob(Globber.java:210) at org.apache.hadoop.fs.Globber.glob(Globber.java:149) at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1955) at org.apache.hadoop.fs.s3a.S3AFileSystem.globStatus(S3AFileSystem.java:2477) at org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:234) at org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:244) at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:618) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at scala.collection.immutable.List.flatMap(List.scala:344) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:349) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412) at com.sap.profile.SparkProfileTask.main(SparkProfileTask.java:95) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635) Caused by: java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 130 A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_?? ^ at java.util.regex.Pattern.error(Pattern.java:1955) at java.util.regex.Pattern.compile(Pattern.java:1700) at java.util.regex.Pattern.&lt;init&gt;(Pattern.java:1351) at java.util.regex.Pattern.compile(Pattern.java:1054) at org.apache.hadoop.fs.GlobPattern.set(GlobPattern.java:156) at org.apache.hadoop.fs.GlobPattern.&lt;init&gt;(GlobPattern.java:42) at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:67) ... 25 more 18/02/14 04:52:46 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.io.IOException: Illegal file pattern: Unmatched closing ')' near index 130 A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_?? ^) 18/02/14 04:52:46 INFO spark.SparkContext: Invoking stop() from shutdown hook Code is as follows ... Dataset input = sqlContext.read().option("header", "true").option("sep", ",").option("quote", "\"").option("charset", "utf8").option("escape",
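The failure above comes from Hadoop's Globber compiling the literal path as a glob pattern, so unescaped metacharacters like `[`, `(`, `*` either fail to compile or match the wrong files. Python's glob module has the same pitfall and a built-in cure, which illustrates the general fix (escape the literal part of the path before globbing); the filename used is a made-up stand-in for the report's path.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    # A file whose name contains glob metacharacters, like the paths
    # in the report above.
    name = "data_(raw)_[v1].csv"
    open(os.path.join(d, name), "w").close()

    # Unescaped: "[v1]" is parsed as a character class, so the literal
    # filename is never matched.
    assert glob.glob(os.path.join(d, name)) == []

    # Escaped: glob.escape() neutralises the metacharacters, so the
    # path is treated literally and the file is found.
    escaped = os.path.join(d, glob.escape(name))
    assert len(glob.glob(escaped)) == 1
```

A corresponding fix on the Hadoop/Spark side would escape glob special characters in literal paths before they reach `FileSystem.globStatus`.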
[jira] [Updated] (SPARK-23230) When hive.default.fileformat is other kinds of file types, create textfile table cause a serde error
[ https://issues.apache.org/jira/browse/SPARK-23230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23230: Fix Version/s: 2.2.2 > When hive.default.fileformat is other kinds of file types, create textfile > table cause a serde error > > > Key: SPARK-23230 > URL: https://issues.apache.org/jira/browse/SPARK-23230 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1 > Reporter: dzcxzl > Assignee: dzcxzl > Priority: Minor > Fix For: 2.2.2, 2.3.0 > > > When hive.default.fileformat is set to another file type, creating a textfile > table causes a serde error. > Both textfile and sequencefile should default to > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe. > {code:java} > set hive.default.fileformat=orc; > create table tbl( i string ) stored as textfile; > desc formatted tbl; > Serde Library org.apache.hadoop.hive.ql.io.orc.OrcSerde > InputFormat org.apache.hadoop.mapred.TextInputFormat > OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat{code} > > {code:java} > set hive.default.fileformat=orc; > create table tbl stored as textfile > as > select 1 > {code} > {{It failed because it used the wrong serde:}} > {code:java} > Caused by: java.lang.ClassCastException: > org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow cannot be cast to > org.apache.hadoop.io.BytesWritable > at > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat$1.write(HiveIgnoreKeyTextOutputFormat.java:91) > at > org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:327) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258) > at >
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261) > ... 16 more > {code} >
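The fix described in this issue can be modelled as keying the serde off the *declared* storage format rather than `hive.default.fileformat`. The class names below are Hive's real serde classes; the lookup function itself is an illustrative simplification, not Spark's code.

```python
# Hive serde class names (real); the mapping logic is a sketch.
LAZY_SIMPLE = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
ORC_SERDE = "org.apache.hadoop.hive.ql.io.orc.OrcSerde"

def serde_for(declared_format, default_format="textfile"):
    # A table declared STORED AS TEXTFILE (or SEQUENCEFILE) must get
    # LazySimpleSerDe regardless of hive.default.fileformat; the
    # default format only applies when no format was declared.
    fmt = (declared_format or default_format).lower()
    serdes = {
        "textfile": LAZY_SIMPLE,
        "sequencefile": LAZY_SIMPLE,
        "orc": ORC_SERDE,
    }
    return serdes[fmt]

# Buggy behaviour: with hive.default.fileformat=orc, a table declared
# STORED AS TEXTFILE wrongly picked up OrcSerde, causing the
# ClassCastException above.  Keyed off the declared format instead:
print(serde_for("textfile", default_format="orc"))  # LazySimpleSerDe class name
```

Only a table with no declared format should fall back to the configured default.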
[jira] [Resolved] (SPARK-23399) Register a task completion listener first for OrcColumnarBatchReader
[ https://issues.apache.org/jira/browse/SPARK-23399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23399. - Resolution: Fixed Fix Version/s: 2.3.1 Issue resolved by pull request 20590 [https://github.com/apache/spark/pull/20590] > Register a task completion listener first for OrcColumnarBatchReader > > > Key: SPARK-23399 > URL: https://issues.apache.org/jira/browse/SPARK-23399 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.1 > > > This is related to SPARK-23390. > Currently, there is an open file leak in OrcColumnarBatchReader. > {code} > [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds) > 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in > stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled) > 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem > connection created at: > java.lang.Throwable > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173) > at > org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254) > at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
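The pattern named in the SPARK-23399 title above — register the completion listener before opening the resource, so a task killed mid-initialization still releases its file handle — can be sketched with toy stand-ins. `TaskContext` and `Reader` below are simplified models, not Spark's actual classes:

```java
import java.util.ArrayList;
import java.util.List;

public class ListenerOrder {
    // Minimal stand-in for Spark's TaskContext completion-listener mechanism.
    static class TaskContext {
        private final List<Runnable> onComplete = new ArrayList<>();
        void addTaskCompletionListener(Runnable r) { onComplete.add(r); }
        void markTaskCompleted() { for (Runnable r : onComplete) r.run(); }
    }

    // Minimal stand-in for a columnar reader holding an open file.
    static class Reader {
        boolean open = false;
        void initialize() { open = true; } // opens the underlying file
        void close() { open = false; }
    }

    // The fix: register close() *before* initialize(), so that a task
    // cancelled before or during initialization still closes the reader
    // instead of leaking the filesystem connection.
    static Reader createReader(TaskContext ctx, boolean taskKilledBeforeInit) {
        Reader reader = new Reader();
        ctx.addTaskCompletionListener(reader::close); // registered first
        if (!taskKilledBeforeInit) {
            reader.initialize();
        }
        return reader;
    }
}
```

Registering the listener after `initialize()` is exactly what produced the "Leaked filesystem connection" warning in the log above when a stage was cancelled.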
[jira] [Assigned] (SPARK-23399) Register a task completion listener first for OrcColumnarBatchReader
[ https://issues.apache.org/jira/browse/SPARK-23399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23399: --- Assignee: Dongjoon Hyun -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23419) data source v2 write path should re-throw interruption exceptions directly
[ https://issues.apache.org/jira/browse/SPARK-23419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363385#comment-16363385 ] Apache Spark commented on SPARK-23419: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/20605 > data source v2 write path should re-throw interruption exceptions directly > -- > > Key: SPARK-23419 > URL: https://issues.apache.org/jira/browse/SPARK-23419 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
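The behavior the SPARK-23419 title describes — re-throw interruption exceptions directly instead of wrapping them — can be sketched as a small error-classification helper. This is a hypothetical illustration of the idea, not Spark's actual code; only the "Writing job aborted." message comes from the ticket's context:

```java
public class WriteErrorHandling {
    // Stand-in for Spark's wrapper exception.
    static class SparkException extends RuntimeException {
        SparkException(String msg, Throwable cause) { super(msg, cause); }
    }

    // Sketch of the fix: an interruption must surface as-is so the scheduler
    // can recognize task cancellation; any other failure is wrapped in the
    // generic "Writing job aborted." error.
    static Throwable toJobFailure(Throwable cause) {
        if (cause instanceof InterruptedException) {
            return cause; // re-throw interruption directly, never wrap it
        }
        return new SparkException("Writing job aborted.", cause);
    }
}
```

Wrapping an interruption would make a deliberately killed task look like a write failure, which is the confusion the ticket aims to remove.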
[jira] [Assigned] (SPARK-23419) data source v2 write path should re-throw interruption exceptions directly
[ https://issues.apache.org/jira/browse/SPARK-23419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23419: Assignee: Apache Spark (was: Wenchen Fan) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23419) data source v2 write path should re-throw interruption exceptions directly
[ https://issues.apache.org/jira/browse/SPARK-23419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23419: Assignee: Wenchen Fan (was: Apache Spark) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23416) flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false
[ https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363386#comment-16363386 ] Apache Spark commented on SPARK-23416: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/20605 > flaky test: > org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress > test for failOnDataLoss=false > -- > > Key: SPARK-23416 > URL: https://issues.apache.org/jira/browse/SPARK-23416 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Minor > > I suspect this is a race condition latent in the DataSourceV2 write path, or > at least the interaction of that write path with StreamTest. > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/] > h3. Error Message > org.apache.spark.sql.streaming.StreamingQueryException: Query [id = > 16b2a2b1-acdd-44ec-902f-531169193169, runId = > 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job > aborted. > h3. Stacktrace > sbt.ForkMain$ForkError: > org.apache.spark.sql.streaming.StreamingQueryException: Query [id = > 16b2a2b1-acdd-44ec-902f-531169193169, runId = > 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job > aborted. at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) > Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing > job aborted. 
at > org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) > at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at > org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:483) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:482) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at >
[jira] [Updated] (SPARK-23419) data source v2 write path should re-throw interruption exceptions directly
[ https://issues.apache.org/jira/browse/SPARK-23419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-23419: Summary: data source v2 write path should re-throw interruption exceptions directly (was: data source v2 write path should re-throw fatal errors directly) > data source v2 write path should re-throw interruption exceptions directly > -- > > Key: SPARK-23419 > URL: https://issues.apache.org/jira/browse/SPARK-23419 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23377) Bucketizer with multiple columns persistence bug
[ https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363367#comment-16363367 ] Liang-Chi Hsieh commented on SPARK-23377: - I agree with what [~mlnick] said. > Bucketizer with multiple columns persistence bug > > > Key: SPARK-23377 > URL: https://issues.apache.org/jira/browse/SPARK-23377 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Priority: Critical > > A Bucketizer with multiple input/output columns gets "inputCol" set to the > default value on write -> read, which causes it to throw an error on > transform. Here's an example. > {code:java} > import org.apache.spark.ml.feature._ > val splits = Array(Double.NegativeInfinity, 0, 10, 100, > Double.PositiveInfinity) > val bucketizer = new Bucketizer() > .setSplitsArray(Array(splits, splits)) > .setInputCols(Array("foo1", "foo2")) > .setOutputCols(Array("bar1", "bar2")) > val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2") > bucketizer.transform(data) > val path = "/temp/bucketrizer-persist-test" > bucketizer.write.overwrite.save(path) > val bucketizerAfterRead = Bucketizer.read.load(path) > println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol)) > // This line throws an error because "outputCol" is set > bucketizerAfterRead.transform(data) > {code} > And the trace: > {code:java} > java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has > the inputCols Param set for multi-column transform. The following Params are > not applicable and should not be set: outputCol. 
> at > org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300) > at > org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314) > at > org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189) > at > org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141) > at > line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
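The exclusive-params check that throws in the trace above can be modeled in a few lines. This is a toy sketch of the rule (a transformer configured with multi-column params must not also carry single-column params), not the real `org.apache.spark.ml.param.Params` implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class ExclusiveParams {
    // Toy model of an ML Params map keyed by param name.
    final Map<String, Object> params = new HashMap<>();

    // Mirrors the idea behind checkExclusiveParams: when inputCols is set
    // (multi-column mode), the single-column params inputCol/outputCol must
    // not be set. The persistence bug effectively re-set outputCol to its
    // default on write -> read, tripping this check at transform time.
    boolean isValidMultiColumn() {
        if (!params.containsKey("inputCols")) return true;
        return !params.containsKey("inputCol") && !params.containsKey("outputCol");
    }
}
```

The fix direction discussed in the comments is to make the read path leave those single-column params unset, so the loaded Bucketizer passes this check again.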
[jira] [Updated] (SPARK-23419) data source v2 write path should re-throw fatal errors directly
[ https://issues.apache.org/jira/browse/SPARK-23419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-23419: Summary: data source v2 write path should re-throw fatal errors directly (was: data source v2 write path should re-throw FetchFailedException directly) > data source v2 write path should re-throw fatal errors directly > --- > > Key: SPARK-23419 > URL: https://issues.apache.org/jira/browse/SPARK-23419 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23419) data source v2 write path should re-throw FetchFailedException directly
Wenchen Fan created SPARK-23419: --- Summary: data source v2 write path should re-throw FetchFailedException directly Key: SPARK-23419 URL: https://issues.apache.org/jira/browse/SPARK-23419 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12140) Support Streaming UI in HistoryServer
[ https://issues.apache.org/jira/browse/SPARK-12140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363300#comment-16363300 ] Saisai Shao commented on SPARK-12140: - [~gschiavon] the community has concerns about supporting the Streaming UI in the history server, mainly due to scalability issues, so there has been no progress on it. > Support Streaming UI in HistoryServer > - > > Key: SPARK-12140 > URL: https://issues.apache.org/jira/browse/SPARK-12140 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Priority: Major > > SPARK-11206 added infrastructure that would allow the streaming UI to be > shown in the History Server. We should add the necessary code to make that > happen, although it requires some changes to how events and listeners are > used. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException
[ https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363217#comment-16363217 ] Márcio Furlani Carmona commented on SPARK-23308: {quote}if your input stream is doing abort/reopen on seek & positioned read, then you get many more S3 requests when reading columnar data {quote} Good point! I'll see if I can get the actual S3 TPS we're reaching. Regarding the SSE-KMS, we're not using it for encryption, so that shouldn't be a problem. > ignoreCorruptFiles should not ignore retryable IOException > -- > > Key: SPARK-23308 > URL: https://issues.apache.org/jira/browse/SPARK-23308 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Márcio Furlani Carmona >Priority: Minor > > When `spark.sql.files.ignoreCorruptFiles` is set, it ignores any kind > of RuntimeException or IOException, but some IOExceptions can happen > even if the file is not corrupted. > One example is SocketTimeoutException, which can be retried to fetch the > data; it does not mean the data is corrupted. > > See: > https://github.com/apache/spark/blob/e30e2698a2193f0bbdcd4edb884710819ab6397c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L163 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
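The distinction the SPARK-23308 report argues for — transient IOExceptions should be retried, not swallowed as corruption — can be sketched as a small classifier. The helper and its classification are illustrative assumptions, not Spark's actual `FileScanRDD` logic:

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

public class CorruptFileHeuristic {
    // A SocketTimeoutException is a network hiccup: retrying the read may
    // succeed, so it should not be treated as evidence of a corrupt file.
    static boolean isLikelyTransient(IOException e) {
        return e instanceof SocketTimeoutException;
    }

    // Sketch of the proposed behavior: with ignoreCorruptFiles enabled,
    // skip the file only for non-transient IO errors; let transient ones
    // propagate (or be retried) instead of silently dropping data.
    static boolean shouldSkipAsCorrupt(IOException e, boolean ignoreCorruptFiles) {
        return ignoreCorruptFiles && !isLikelyTransient(e);
    }
}
```

Without such a distinction, a momentary S3 timeout silently drops a whole file's rows, which is the data-loss risk the ticket highlights.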
[jira] [Resolved] (SPARK-23235) Add executor Threaddump to api
[ https://issues.apache.org/jira/browse/SPARK-23235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-23235. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20474 [https://github.com/apache/spark/pull/20474] > Add executor Threaddump to api > -- > > Key: SPARK-23235 > URL: https://issues.apache.org/jira/browse/SPARK-23235 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Assignee: Attila Zsolt Piros >Priority: Minor > Labels: newbie > Fix For: 2.4.0 > > > It looks like the thread dump {{/executors/threadDump/?executorId=[id]}} > is only available in the UI, not in the REST API at all. This is especially > a pain because that page in the UI has extra formatting which makes it a pain > to send the output to somebody else (most likely you click "expand all" and > then copy-paste that, which is OK, but is formatted weirdly). We might also > just want a "format=raw" option even on the UI. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23235) Add executor Threaddump to api
[ https://issues.apache.org/jira/browse/SPARK-23235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-23235: Assignee: Attila Zsolt Piros -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23351) checkpoint corruption in long running application
[ https://issues.apache.org/jira/browse/SPARK-23351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363133#comment-16363133 ] Sean Owen commented on SPARK-23351: --- [~davidahern] for the record, it's usually the opposite. Vendor branches like Cloudera say "2.2.0" but should be read like "2.2.x". As a rule, they vary from the upstream branch only in which commits go into a maintenance branch. This fix could be in a vendor 2.2.x release that isn't in an upstream one (or vice versa), or appear in a maintenance branch earlier from a vendor. That's in theory the value proposition; in this particular case I have no idea. > checkpoint corruption in long running application > - > > Key: SPARK-23351 > URL: https://issues.apache.org/jira/browse/SPARK-23351 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: David Ahern >Priority: Major > > Hi, after leaving my (somewhat high volume) Structured Streaming application > running for some time, I get the following exception. The same exception > also repeats when I try to restart the application. The only way to get the > application back running is to clear the checkpoint directory, which is far > from ideal. > Maybe a stream is not being flushed/closed properly internally by Spark when > checkpointing? 
> > User class threw exception: > org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to > stage failure: Task 55 in stage 1.0 failed 4 times, most recent failure: Lost > task 55.3 in stage 1.0 (TID 240, gbslixaacspa04u.metis.prd, executor 2): > java.io.EOFException > at java.io.DataInputStream.readInt(DataInputStream.java:392) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:481) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:265) > at > org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:200) > at > org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23365) DynamicAllocation with failure in straggler task can lead to a hung spark job
[ https://issues.apache.org/jira/browse/SPARK-23365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23365: Assignee: (was: Apache Spark) > DynamicAllocation with failure in straggler task can lead to a hung spark job > - > > Key: SPARK-23365 > URL: https://issues.apache.org/jira/browse/SPARK-23365 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.2, 2.2.1, 2.3.0 >Reporter: Imran Rashid >Priority: Major > > Dynamic Allocation can lead to a Spark app getting stuck with 0 executors > requested when the executors in the last tasks of a taskset fail (e.g. with an > OOM). > This happens when {{ExecutorAllocationManager}}'s internal target number of > executors gets out of sync with {{CoarseGrainedSchedulerBackend}}'s target > number. {{EAM}} updates the {{CGSB}} in two ways: (1) it tracks how many > tasks are active or pending in submitted stages, and computes how many > executors would be needed for them. And as tasks finish, it will actively > decrease that count, informing the {{CGSB}} along the way. 
(2) When it > decides executors are inactive for long enough, then it requests that > {{CGSB}} kill the executors -- this also tells the {{CGSB}} to update its > target number of executors: > https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L622 > So when there is just one task left, you could have the following sequence of > events: > (1) the {{EAM}} sets the desired number of executors to 1, and updates the > {{CGSB}} too > (2) while that final task is still running, the other executors cross the > idle timeout, and the {{EAM}} requests the {{CGSB}} kill them > (3) now the {{EAM}} has a target of 1 executor, and the {{CGSB}} has a target > of 0 executors > If the final task completed normally now, everything would be OK; the next > taskset would get submitted, the {{EAM}} would increase the target number of > executors and it would update the {{CGSB}}. > But if the executor for that final task failed (eg. an OOM), then the {{EAM}} > thinks it [doesn't need to update > anything|https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L384-L386], > because its target is already 1, which is all it needs for that final task; > and the {{CGSB}} doesn't update anything either since its target is 0. > I think you can determine if this is the cause of a stuck app by looking for > {noformat} > yarn.YarnAllocator: Driver requested a total number of 0 executor(s). > {noformat} > in the logs of the ApplicationMaster (at least on yarn). 
> You can reproduce this with this test app, run with {{--conf > "spark.dynamicAllocation.minExecutors=1" --conf > "spark.dynamicAllocation.maxExecutors=5" --conf > "spark.dynamicAllocation.executorIdleTimeout=5s"}} > {code} > import org.apache.spark.SparkEnv > sc.setLogLevel("INFO") > sc.parallelize(1 to 1, 1).count() > val execs = sc.parallelize(1 to 1000, 1000).map { _ => > SparkEnv.get.executorId}.collect().toSet > val badExec = execs.head > println("will kill exec " + badExec) > new Thread() { > override def run(): Unit = { > Thread.sleep(1) > println("about to kill exec " + badExec) > sc.killExecutor(badExec) > } > }.start() > sc.parallelize(1 to 5, 5).mapPartitions { itr => > val exec = SparkEnv.get.executorId > if (exec == badExec) { > Thread.sleep(2) // long enough that all the other tasks finish, and > the executors cross the idle timeout > // meanwhile, something else should kill this executor > itr > } else { > itr > } > }.collect() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
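The three-step sequence in the description above can be modeled with two counters. This is a toy simulation of the desync between the ExecutorAllocationManager (EAM) and CoarseGrainedSchedulerBackend (CGSB) targets, under the ticket's assumptions, not the real scheduler code:

```java
public class AllocationDesync {
    // eamTarget  = ExecutorAllocationManager's internal target
    // cgsbTarget = CoarseGrainedSchedulerBackend's target (what YARN is asked for)
    int eamTarget;
    int cgsbTarget;

    // (1) EAM computes a target from pending/active tasks and syncs the backend.
    void updateTarget(int needed) { eamTarget = needed; cgsbTarget = needed; }

    // (2) Killing idle executors lowers only the backend's target.
    void killIdleExecutors(int count) { cgsbTarget -= count; }

    // (3) After the straggler's executor fails, EAM skips the sync because its
    // own target already covers the retry -- the backend is never raised again.
    void onTaskFailureRetry(int needed) {
        if (needed > eamTarget) { eamTarget = needed; cgsbTarget = needed; }
    }
}
```

Running the ticket's scenario through this model leaves the backend requesting zero executors while one retry still needs a slot, matching the "Driver requested a total number of 0 executor(s)" log line quoted above.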
[jira] [Assigned] (SPARK-23365) DynamicAllocation with failure in straggler task can lead to a hung spark job
[ https://issues.apache.org/jira/browse/SPARK-23365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23365: Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23365) DynamicAllocation with failure in straggler task can lead to a hung spark job
[ https://issues.apache.org/jira/browse/SPARK-23365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363130#comment-16363130 ] Apache Spark commented on SPARK-23365: -- User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/20604 > DynamicAllocation with failure in straggler task can lead to a hung spark job > - > > Key: SPARK-23365 > URL: https://issues.apache.org/jira/browse/SPARK-23365 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.2, 2.2.1, 2.3.0 >Reporter: Imran Rashid >Priority: Major > > Dynamic Allocation can lead to a Spark app getting stuck with 0 executors > requested when the executors running the last tasks of a taskset fail (e.g. with an > OOM). > This happens when {{ExecutorAllocationManager}}'s internal target number of > executors gets out of sync with {{CoarseGrainedSchedulerBackend}}'s target > number. {{EAM}} updates the {{CGSB}} in two ways: (1) it tracks how many > tasks are active or pending in submitted stages, and computes how many > executors would be needed for them. And as tasks finish, it will actively > decrease that count, informing the {{CGSB}} along the way. 
(2) When it > decides executors are inactive for long enough, then it requests that > {{CGSB}} kill the executors -- this also tells the {{CGSB}} to update its > target number of executors: > https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L622 > So when there is just one task left, you could have the following sequence of > events: > (1) the {{EAM}} sets the desired number of executors to 1, and updates the > {{CGSB}} too > (2) while that final task is still running, the other executors cross the > idle timeout, and the {{EAM}} requests the {{CGSB}} kill them > (3) now the {{EAM}} has a target of 1 executor, and the {{CGSB}} has a target > of 0 executors > If the final task completed normally now, everything would be OK; the next > taskset would get submitted, the {{EAM}} would increase the target number of > executors and it would update the {{CGSB}}. > But if the executor for that final task failed (eg. an OOM), then the {{EAM}} > thinks it [doesn't need to update > anything|https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L384-L386], > because its target is already 1, which is all it needs for that final task; > and the {{CGSB}} doesn't update anything either since its target is 0. > I think you can determine if this is the cause of a stuck app by looking for > {noformat} > yarn.YarnAllocator: Driver requested a total number of 0 executor(s). > {noformat} > in the logs of the ApplicationMaster (at least on yarn). 
> You can reproduce this with this test app, run with {{--conf > "spark.dynamicAllocation.minExecutors=1" --conf > "spark.dynamicAllocation.maxExecutors=5" --conf > "spark.dynamicAllocation.executorIdleTimeout=5s"}} > {code} > import org.apache.spark.SparkEnv > sc.setLogLevel("INFO") > sc.parallelize(1 to 1, 1).count() > val execs = sc.parallelize(1 to 1000, 1000).map { _ => > SparkEnv.get.executorId}.collect().toSet > val badExec = execs.head > println("will kill exec " + badExec) > new Thread() { > override def run(): Unit = { > Thread.sleep(1) > println("about to kill exec " + badExec) > sc.killExecutor(badExec) > } > }.start() > sc.parallelize(1 to 5, 5).mapPartitions { itr => > val exec = SparkEnv.get.executorId > if (exec == badExec) { > Thread.sleep(2) // long enough that all the other tasks finish, and > the executors cross the idle timeout > // meanwhile, something else should kill this executor > itr > } else { > itr > } > }.collect() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
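The sequence of events in the comment above reduces to two target counters drifting apart. The toy model below (plain Python, not Spark's actual classes; the class names only echo the abbreviations used in the ticket) reproduces the stuck state:

```python
# Toy model of the desync between ExecutorAllocationManager (EAM) and
# CoarseGrainedSchedulerBackend (CGSB) target executor counts described
# in the ticket. Illustrative sketch only, not Spark's real implementation.

class CGSB:
    def __init__(self, target):
        self.target = target

    def request_total(self, n):
        self.target = n

    def kill_executors(self, n_killed):
        # Killing idle executors also lowers the backend's own target.
        self.target -= n_killed

class EAM:
    def __init__(self, backend, target):
        self.backend = backend
        self.target = target

    def update_target(self, needed):
        # Bug condition: if the EAM's internal target already matches,
        # the backend is never told -- even if its target has drifted.
        if needed != self.target:
            self.target = needed
            self.backend.request_total(needed)

backend = CGSB(target=5)
eam = EAM(backend, target=5)

eam.update_target(1)       # (1) one task left: both targets become 1
backend.kill_executors(1)  # (2) idle executors killed: CGSB target drops to 0
eam.update_target(1)       # (3) the last executor dies; EAM still needs 1,
                           #     so it skips the backend call -- stuck at 0

print(eam.target, backend.target)  # prints: 1 0
```

The final state matches the ticket: the EAM believes one executor is requested while the backend has asked the cluster manager for zero, so nothing ever schedules the retried task.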
[jira] [Assigned] (SPARK-23418) DataSourceV2 should not allow userSpecifiedSchema without ReadSupportWithSchema
[ https://issues.apache.org/jira/browse/SPARK-23418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23418: Assignee: Apache Spark > DataSourceV2 should not allow userSpecifiedSchema without > ReadSupportWithSchema > --- > > Key: SPARK-23418 > URL: https://issues.apache.org/jira/browse/SPARK-23418 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Assignee: Apache Spark >Priority: Major > > DataSourceV2 currently does not reject user-specified schemas when a source > does not implement ReadSupportWithSchema. This is confusing behavior. Here's > a quote from a discussion on SPARK-23203: > {quote}I think this will cause confusion when source schemas change. Also, I > can't think of a situation where it is a good idea to pass a schema that is > ignored. > Here's an example of how this will be confusing: think of a job that supplies > a schema identical to the table's schema and runs fine, so it goes into > production. What happens when the table's schema changes? If someone adds a > column to the table, then the job will start failing and report that the > source doesn't support user-supplied schemas, even though it had previously > worked just fine with a user-supplied schema. In addition, the change to the > table is actually compatible with the old job because the new column will be > removed by a projection. > To fix this situation, it may be tempting to use the user-supplied schema as > an initial projection. But that doesn't make sense because we don't need two > projection mechanisms. If we used this as a second way to project, it would > be confusing that you can't actually leave out columns (at least for CSV) and > it would be odd that using this path you can coerce types, which should > usually be done by Spark. > I think it is best not to allow a user-supplied schema when it isn't > supported by a source. 
> {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23418) DataSourceV2 should not allow userSpecifiedSchema without ReadSupportWithSchema
[ https://issues.apache.org/jira/browse/SPARK-23418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23418: Assignee: (was: Apache Spark) > DataSourceV2 should not allow userSpecifiedSchema without > ReadSupportWithSchema > --- > > Key: SPARK-23418 > URL: https://issues.apache.org/jira/browse/SPARK-23418 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > DataSourceV2 currently does not reject user-specified schemas when a source > does not implement ReadSupportWithSchema. This is confusing behavior. Here's > a quote from a discussion on SPARK-23203: > {quote}I think this will cause confusion when source schemas change. Also, I > can't think of a situation where it is a good idea to pass a schema that is > ignored. > Here's an example of how this will be confusing: think of a job that supplies > a schema identical to the table's schema and runs fine, so it goes into > production. What happens when the table's schema changes? If someone adds a > column to the table, then the job will start failing and report that the > source doesn't support user-supplied schemas, even though it had previously > worked just fine with a user-supplied schema. In addition, the change to the > table is actually compatible with the old job because the new column will be > removed by a projection. > To fix this situation, it may be tempting to use the user-supplied schema as > an initial projection. But that doesn't make sense because we don't need two > projection mechanisms. If we used this as a second way to project, it would > be confusing that you can't actually leave out columns (at least for CSV) and > it would be odd that using this path you can coerce types, which should > usually be done by Spark. > I think it is best not to allow a user-supplied schema when it isn't > supported by a source. 
> {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23418) DataSourceV2 should not allow userSpecifiedSchema without ReadSupportWithSchema
[ https://issues.apache.org/jira/browse/SPARK-23418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363127#comment-16363127 ] Apache Spark commented on SPARK-23418: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/20603 > DataSourceV2 should not allow userSpecifiedSchema without > ReadSupportWithSchema > --- > > Key: SPARK-23418 > URL: https://issues.apache.org/jira/browse/SPARK-23418 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > DataSourceV2 currently does not reject user-specified schemas when a source > does not implement ReadSupportWithSchema. This is confusing behavior. Here's > a quote from a discussion on SPARK-23203: > {quote}I think this will cause confusion when source schemas change. Also, I > can't think of a situation where it is a good idea to pass a schema that is > ignored. > Here's an example of how this will be confusing: think of a job that supplies > a schema identical to the table's schema and runs fine, so it goes into > production. What happens when the table's schema changes? If someone adds a > column to the table, then the job will start failing and report that the > source doesn't support user-supplied schemas, even though it had previously > worked just fine with a user-supplied schema. In addition, the change to the > table is actually compatible with the old job because the new column will be > removed by a projection. > To fix this situation, it may be tempting to use the user-supplied schema as > an initial projection. But that doesn't make sense because we don't need two > projection mechanisms. If we used this as a second way to project, it would > be confusing that you can't actually leave out columns (at least for CSV) and > it would be odd that using this path you can coerce types, which should > usually be done by Spark. 
> I think it is best not to allow a user-supplied schema when it isn't > supported by a source. > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23418) DataSourceV2 should not allow userSpecifiedSchema without ReadSupportWithSchema
Ryan Blue created SPARK-23418: - Summary: DataSourceV2 should not allow userSpecifiedSchema without ReadSupportWithSchema Key: SPARK-23418 URL: https://issues.apache.org/jira/browse/SPARK-23418 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Ryan Blue DataSourceV2 currently does not reject user-specified schemas when a source does not implement ReadSupportWithSchema. This is confusing behavior. Here's a quote from a discussion on SPARK-23203: {quote}I think this will cause confusion when source schemas change. Also, I can't think of a situation where it is a good idea to pass a schema that is ignored. Here's an example of how this will be confusing: think of a job that supplies a schema identical to the table's schema and runs fine, so it goes into production. What happens when the table's schema changes? If someone adds a column to the table, then the job will start failing and report that the source doesn't support user-supplied schemas, even though it had previously worked just fine with a user-supplied schema. In addition, the change to the table is actually compatible with the old job because the new column will be removed by a projection. To fix this situation, it may be tempting to use the user-supplied schema as an initial projection. But that doesn't make sense because we don't need two projection mechanisms. If we used this as a second way to project, it would be confusing that you can't actually leave out columns (at least for CSV) and it would be odd that using this path you can coerce types, which should usually be done by Spark. I think it is best not to allow a user-supplied schema when it isn't supported by a source. {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
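The scenario quoted above can be made concrete with a toy model: a source that silently accepts a matching user schema "works" until the table drifts, whereas rejecting up front fails immediately and predictably. (Illustrative Python; the class and function names are invented for this sketch and are not the DataSourceV2 API.)

```python
# Sketch of the failure mode described above: ignoring a user-supplied
# schema makes jobs break only when the table schema later changes.

class Source:
    """A source that does NOT support user-supplied schemas."""
    def __init__(self, table_schema):
        self.table_schema = table_schema

def read(source, user_schema=None, strict=True):
    if user_schema is None:
        return source.table_schema
    if strict:
        # Proposed behavior: reject up front, so the failure is
        # immediate and consistent, not dependent on schema drift.
        raise ValueError("source does not support user-supplied schemas")
    # Old behavior: silently "works" iff the schemas happen to match.
    if user_schema != source.table_schema:
        raise ValueError("source does not support user-supplied schemas")
    return source.table_schema

src = Source(table_schema=("id", "name"))
# Old behavior: the job succeeds today and goes into production...
assert read(src, ("id", "name"), strict=False) == ("id", "name")
# ...then a column is added, and the unchanged job suddenly fails.
src.table_schema = ("id", "name", "email")
try:
    read(src, ("id", "name"), strict=False)
except ValueError:
    print("job broke only after the table changed")
```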
[jira] [Commented] (SPARK-23351) checkpoint corruption in long running application
[ https://issues.apache.org/jira/browse/SPARK-23351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363115#comment-16363115 ] Shixiong Zhu commented on SPARK-23351: -- [~davidahern] It's better to ask the vendor for support. They may have a different release cycle. > checkpoint corruption in long running application > - > > Key: SPARK-23351 > URL: https://issues.apache.org/jira/browse/SPARK-23351 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: David Ahern >Priority: Major > > hi, after leaving my (somewhat high volume) Structured Streaming application > running for some time, i get the following exception. The same exception > also repeats when i try to restart the application. The only way to get the > application back running is to clear the checkpoint directory which is far > from ideal. > Maybe a stream is not being flushed/closed properly internally by Spark when > checkpointing? > > User class threw exception: > org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to > stage failure: Task 55 in stage 1.0 failed 4 times, most recent failure: Lost > task 55.3 in stage 1.0 (TID 240, gbslixaacspa04u.metis.prd, executor 2): > java.io.EOFException > at java.io.DataInputStream.readInt(DataInputStream.java:392) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:481) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358) > 
at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:265) > at > org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:200) > at > org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
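The EOFException above is the signature of reading a truncated or partially written snapshot file. Whether or not that is the root cause in this report, the standard defense against readers ever observing a half-written checkpoint file is write-to-temp-then-rename. A minimal sketch of that pattern (plain Python; the on-disk layout here is invented for illustration and is not the HDFSBackedStateStoreProvider format):

```python
import os
import struct
import tempfile

def write_snapshot_atomic(path, records):
    # Write to a temp file in the same directory, then rename into place.
    # A crash mid-write leaves only the temp file behind; readers never
    # see a truncated snapshot (rename within a directory is atomic on POSIX).
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(struct.pack(">i", len(records)))  # record-count header
            for r in records:
                f.write(struct.pack(">i", len(r)))
                f.write(r)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic publish
    except BaseException:
        os.unlink(tmp)
        raise

def read_snapshot(path):
    with open(path, "rb") as f:
        (n,) = struct.unpack(">i", f.read(4))  # raises if file is truncated
        out = []
        for _ in range(n):
            (size,) = struct.unpack(">i", f.read(4))
            out.append(f.read(size))
        return out

path = os.path.join(tempfile.mkdtemp(), "1.snapshot")
write_snapshot_atomic(path, [b"state-a", b"state-b"])
print(read_snapshot(path))  # prints: [b'state-a', b'state-b']
```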
[jira] [Commented] (SPARK-23402) Dataset write method not working as expected for postgresql database
[ https://issues.apache.org/jira/browse/SPARK-23402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363110#comment-16363110 ] kevin yu commented on SPARK-23402: -- I just tried with PostgreSQL 9.5.6 on x86_64-pc-linux-gnu with JDBC driver postgresql-9.4.1210; I can't reproduce the error. scala> val df1 = Seq((1)).toDF("c1") df1: org.apache.spark.sql.DataFrame = [c1: int] scala> df1.write.mode(SaveMode.Append).jdbc("jdbc:postgresql://9.30.167.220:5432/mydb", "emptytable", destProperties) scala> val df3 = spark.read.jdbc("jdbc:postgresql://9.30.167.220:5432/mydb", "emptytable", destProperties).show +---+ | c1| +---+ | 1| +---+ df3: Unit = () > Dataset write method not working as expected for postgresql database > > > Key: SPARK-23402 > URL: https://issues.apache.org/jira/browse/SPARK-23402 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 > Environment: PostgreSQL: 9.5.8 (10 also has the same issue) > OS: CentOS 7 & Windows 7, 8 > JDBC: 9.4-1201-jdbc41 > > Spark: I executed in both 2.1.0 and 2.2.1 > Mode: Standalone > OS: Windows 7 >Reporter: Pallapothu Jyothi Swaroop >Priority: Major > Attachments: Emsku[1].jpg > > > I am using Spark's Dataset write to insert data into an existing PostgreSQL > table, using the write method with append mode. While doing so I get an > exception saying the table already exists, even though I specified append > mode. > It's strange: when I change the options to SQL Server/Oracle, append mode > works as expected. 
> > *Database Properties:* > {{destinationProps.put("driver", "org.postgresql.Driver"); > destinationProps.put("url", "jdbc:postgresql://127.0.0.1:30001/dbmig"); > destinationProps.put("user", "dbmig");}} > {{destinationProps.put("password", "dbmig");}} > > *Dataset Write Code:* > {{valueAnalysisDataset.write().mode(SaveMode.Append).jdbc(destinationDbMap.get("url"), > "dqvalue", destinationdbProperties);}} > > > {{Exception in thread "main" org.postgresql.util.PSQLException: ERROR: > relation "dqvalue" already exists at > org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2412) > at > org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2125) > at > org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:297) > at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428) at > org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354) at > org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:301) at > org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:287) at > org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:264) at > org.postgresql.jdbc.PgStatement.executeUpdate(PgStatement.java:244) at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:806) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95) > at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469) > at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) > at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233) at > org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:460) at > com.ads.dqam.action.impl.PostgresValueAnalysis.persistValueAnalysis(PostgresValueAnalysis.java:25) >
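The stack trace is informative: {{JdbcUtils$.createTable}} is reached even though the mode is Append, meaning Spark's table-existence probe concluded {{dqvalue}} was absent while PostgreSQL's CREATE then found it present. A toy model of that save path (illustrative Python, not Spark's actual code; a schema/search_path mismatch is one plausible cause of the probe disagreeing with CREATE, not something confirmed in this ticket):

```python
# Toy model of the JDBC save path: with SaveMode.Append, CREATE TABLE is
# only reached when the existence probe says the table is missing.

def table_exists(conn, name):
    # Spark probes with something like: SELECT 1 FROM <name> WHERE 1=0.
    # In this model the probe only sees tables visible on the
    # connection's search path.
    return name in conn["search_path_tables"]

def create_table(conn, name):
    # The database checks CREATE against every existing relation the
    # name resolves to, so it can fail even when the probe said "missing".
    if name in conn["all_tables"]:
        raise RuntimeError(f'ERROR: relation "{name}" already exists')
    conn["search_path_tables"].add(name)
    conn["all_tables"].add(name)

def save_jdbc(conn, name, mode):
    if table_exists(conn, name):
        pass  # append -> INSERT INTO the existing table
    else:
        create_table(conn, name)  # CREATE TABLE, then INSERT

# Healthy case: probe and CREATE agree; append never tries to CREATE.
conn = {"search_path_tables": {"dqvalue"}, "all_tables": {"dqvalue"}}
save_jdbc(conn, "dqvalue", "append")  # no error

# Reported symptom: the probe misses the table, so the append path falls
# through to CREATE TABLE and PostgreSQL rejects it.
conn = {"search_path_tables": set(), "all_tables": {"dqvalue"}}
try:
    save_jdbc(conn, "dqvalue", "append")
except RuntimeError as e:
    print(e)  # prints: ERROR: relation "dqvalue" already exists
```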
[jira] [Commented] (SPARK-23411) Deprecate SparkContext.getRDDStorageInfo
[ https://issues.apache.org/jira/browse/SPARK-23411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363037#comment-16363037 ] Marcelo Vanzin commented on SPARK-23411: Argh, copied the wrong method name. Fixed. > Deprecate SparkContext.getRDDStorageInfo > > > Key: SPARK-23411 > URL: https://issues.apache.org/jira/browse/SPARK-23411 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Minor > > This is another developer API in SparkContext that is better moved to a more > monitoring-minded place such as {{SparkStatusTracker}}. The same information > is also already available through the REST API. > Moreover, it exposes {{RDDInfo}}, which is a mutable internal Spark type, and > exposing that to applications is kinda sketchy. (It can't be made private, > though, since it's exposed indirectly in events posted to the listener bus.) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
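The design concern above -- handing callers a mutable internal object -- is exactly what a status-tracker style API avoids by returning immutable snapshots. A sketch of the idea (Python; the class names are illustrative stand-ins, not Spark's actual types):

```python
# Sketch: expose frozen snapshots instead of mutable internal bookkeeping,
# so callers cannot corrupt monitoring state. Names are illustrative.

from dataclasses import dataclass

class RDDInfoMutable:
    """Stand-in for a mutable internal type like RDDInfo."""
    def __init__(self, rdd_id, mem_used):
        self.rdd_id, self.mem_used = rdd_id, mem_used

@dataclass(frozen=True)
class RDDStorageSnapshot:
    rdd_id: int
    mem_used: int

class StatusTracker:
    def __init__(self, internal):
        self._internal = internal  # mutable bookkeeping stays private

    def get_rdd_storage_info(self):
        # Copy into frozen values; callers can't mutate internal state.
        return tuple(RDDStorageSnapshot(i.rdd_id, i.mem_used)
                     for i in self._internal)

tracker = StatusTracker([RDDInfoMutable(1, 1024)])
snap = tracker.get_rdd_storage_info()[0]
try:
    snap.mem_used = 0          # frozen dataclass rejects mutation
except AttributeError:         # FrozenInstanceError subclasses AttributeError
    print("snapshot is immutable")
```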
[jira] [Updated] (SPARK-23411) Deprecate SparkContext.getRDDStorageInfo
[ https://issues.apache.org/jira/browse/SPARK-23411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-23411: --- Summary: Deprecate SparkContext.getRDDStorageInfo (was: Deprecate SparkContext.getExecutorStorageStatus) > Deprecate SparkContext.getRDDStorageInfo > > > Key: SPARK-23411 > URL: https://issues.apache.org/jira/browse/SPARK-23411 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Minor > > This is another developer API in SparkContext that is better moved to a more > monitoring-minded place such as {{SparkStatusTracker}}. The same information > is also already available through the REST API. > Moreover, it exposes {{RDDInfo}}, which is a mutable internal Spark type, and > exposing that to applications is kinda sketchy. (It can't be made private, > though, since it's exposed indirectly in events posted to the listener bus.) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23414) Plotting using matplotlib in MLlib pyspark
[ https://issues.apache.org/jira/browse/SPARK-23414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363029#comment-16363029 ] Sean Owen commented on SPARK-23414: --- matplotlib doesn't interact with Spark, so issues with using it are unlikely to be relevant to Spark itself anyway. > Plotting using matplotlib in MLlib pyspark > --- > > Key: SPARK-23414 > URL: https://issues.apache.org/jira/browse/SPARK-23414 > Project: Spark > Issue Type: Question > Components: MLlib >Affects Versions: 2.2.1 >Reporter: Waleed Esmail >Priority: Major > > Dear MLlib experts, > I just want to plot a fancy confusion matrix (true values vs predicted > values) like the one produced by seaborn module in python, so I did the > following: > {code:java} > labelIndexer = StringIndexer(inputCol="label", > outputCol="indexedLabel").fit(output) > # Automatically identify categorical features, and index them. > # We specify maxCategories so features with > 4 distinct values are treated > as continuous. > featureIndexer = VectorIndexer(inputCol="features", > outputCol="indexedFeatures").fit(output) > # Split the data into training and test sets (30% held out for testing) > (trainingData, testData) = output.randomSplit([0.7, 0.3]) > dt = DecisionTreeClassifier(labelCol="indexedLabel", > featuresCol="indexedFeatures", maxDepth=15) > # Chain indexers and tree in a Pipeline > pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt]) > # Train model. This also runs the indexers. > model = pipeline.fit(trainingData) > # Make predictions. 
> predictions = model.transform(testData) > predictionAndLabels = predictions.select("prediction", "indexedLabel") > y_predicted = np.array(predictions.select("prediction").collect()) > y_test = np.array(predictions.select("indexedLabel").collect()) > from sklearn.metrics import confusion_matrix > import matplotlib.ticker as ticker > figcm, ax = plt.subplots() > cm = confusion_matrix(y_test, y_predicted) > # for normalization > cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] > sns.heatmap(cm, square=True, annot=True, cbar=False) > plt.xlabel('predication') > plt.ylabel('true value') > {code} > is this the right way to do it?!. please note that I am new to Spark and MLlib > > thank you in advance, -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
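Collecting every prediction to the driver works for small test sets, but the confusion counts can be aggregated in Spark first (e.g. {{predictions.groupBy("indexedLabel", "prediction").count()}}) and only the tiny count table plotted. A sketch of the row-normalization step on such a count table (plain Python so the arithmetic is easy to check):

```python
# Row-normalized confusion matrix from (true, predicted) counts, matching
# the cm / cm.sum(axis=1) normalization in the snippet above. In PySpark,
# `pairs` would come from groupBy("indexedLabel", "prediction").count()
# rather than collecting every row.

from collections import Counter

pairs = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 1)]  # (true, predicted)
counts = Counter(pairs)                 # the groupBy(...).count() equivalent
labels = sorted({t for t, _ in pairs} | {p for _, p in pairs})

# Each row (true label) sums to 1.0 after normalization.
cm = []
for t in labels:
    row_total = sum(counts[(t, p)] for p in labels)
    cm.append([counts[(t, p)] / row_total for p in labels])

print(cm)  # prints: [[0.6666666666666666, 0.3333333333333333], [0.0, 1.0]]
```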
[jira] [Commented] (SPARK-23414) Plotting using matplotlib in MLlib pyspark
[ https://issues.apache.org/jira/browse/SPARK-23414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363025#comment-16363025 ] Waleed Esmail commented on SPARK-23414: --- I am sorry, I didn't get it, what do you mean by "orthogonal"?!. > Plotting using matplotlib in MLlib pyspark > --- > > Key: SPARK-23414 > URL: https://issues.apache.org/jira/browse/SPARK-23414 > Project: Spark > Issue Type: Question > Components: MLlib >Affects Versions: 2.2.1 >Reporter: Waleed Esmail >Priority: Major > > Dear MLlib experts, > I just want to plot a fancy confusion matrix (true values vs predicted > values) like the one produced by seaborn module in python, so I did the > following: > {code:java} > labelIndexer = StringIndexer(inputCol="label", > outputCol="indexedLabel").fit(output) > # Automatically identify categorical features, and index them. > # We specify maxCategories so features with > 4 distinct values are treated > as continuous. > featureIndexer = VectorIndexer(inputCol="features", > outputCol="indexedFeatures").fit(output) > # Split the data into training and test sets (30% held out for testing) > (trainingData, testData) = output.randomSplit([0.7, 0.3]) > dt = DecisionTreeClassifier(labelCol="indexedLabel", > featuresCol="indexedFeatures", maxDepth=15) > # Chain indexers and tree in a Pipeline > pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt]) > # Train model. This also runs the indexers. > model = pipeline.fit(trainingData) > # Make predictions. 
> predictions = model.transform(testData) > predictionAndLabels = predictions.select("prediction", "indexedLabel") > y_predicted = np.array(predictions.select("prediction").collect()) > y_test = np.array(predictions.select("indexedLabel").collect()) > from sklearn.metrics import confusion_matrix > import matplotlib.ticker as ticker > figcm, ax = plt.subplots() > cm = confusion_matrix(y_test, y_predicted) > # for normalization > cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] > sns.heatmap(cm, square=True, annot=True, cbar=False) > plt.xlabel('predication') > plt.ylabel('true value') > {code} > is this the right way to do it?!. please note that I am new to Spark and MLlib > > thank you in advance, -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23417) pyspark tests give wrong sbt instructions
Jose Torres created SPARK-23417: --- Summary: pyspark tests give wrong sbt instructions Key: SPARK-23417 URL: https://issues.apache.org/jira/browse/SPARK-23417 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.0 Reporter: Jose Torres When running python/run-tests, the script indicates that I must run "'build/sbt assembly/package streaming-kafka-0-8-assembly/assembly' or 'build/mvn -Pkafka-0-8 package'". The sbt command fails: [error] Expected ID character [error] Not a valid command: streaming-kafka-0-8-assembly [error] Expected project ID [error] Expected configuration [error] Expected ':' (if selecting a configuration) [error] Expected key [error] Not a valid key: streaming-kafka-0-8-assembly [error] streaming-kafka-0-8-assembly/assembly [error] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23400) Add the extra constructors for ScalaUDF
[ https://issues.apache.org/jira/browse/SPARK-23400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-23400: - Fix Version/s: 2.4.0 > Add the extra constructors for ScalaUDF > --- > > Key: SPARK-23400 > URL: https://issues.apache.org/jira/browse/SPARK-23400 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > Over the last few releases, we changed the interface of ScalaUDF. Unfortunately, > some Spark packages (spark-deep-learning) are using our internal class > `ScalaUDF`. In release 2.3, we added new parameters to this class. > Users hit binary compatibility issues and got the exception: > > java.lang.NoSuchMethodError: > > org.apache.spark.sql.catalyst.expressions.ScalaUDF.init(Ljava/lang/Object;Lorg/apache/spark/sql/types/DataType;Lscala/collection/Seq;Lscala/collection/Seq;Lscala/Option;)V -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23400) Add the extra constructors for ScalaUDF
[ https://issues.apache.org/jira/browse/SPARK-23400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-23400. -- Resolution: Fixed Fix Version/s: 2.3.1 Issue resolved by pull request 20591 [https://github.com/apache/spark/pull/20591] > Add the extra constructors for ScalaUDF > --- > > Key: SPARK-23400 > URL: https://issues.apache.org/jira/browse/SPARK-23400 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Fix For: 2.3.1 > > > Over the last few releases, we changed the interface of ScalaUDF. Unfortunately, > some Spark packages (spark-deep-learning) are using our internal class > `ScalaUDF`. In release 2.3, we added new parameters to this class. > Users hit binary compatibility issues and got the exception: > > java.lang.NoSuchMethodError: > > org.apache.spark.sql.catalyst.expressions.ScalaUDF.init(Ljava/lang/Object;Lorg/apache/spark/sql/types/DataType;Lscala/collection/Seq;Lscala/collection/Seq;Lscala/Option;)V -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
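The {{NoSuchMethodError}} above is the JVM symptom of call sites compiled against an old constructor signature; the fix in this ticket restores the old signatures as extra delegating constructors. A sketch of the idea (Python; the parameter names are illustrative, and Python default arguments stand in here for Scala auxiliary constructors):

```python
# Callers bound to an old constructor shape break when required
# parameters are added (TypeError here; NoSuchMethodError on the JVM).
# Restoring the old shape keeps old call sites working.

class ScalaUDFOld:
    # Release N: callers bind to (function, dataType, children).
    def __init__(self, function, dataType, children):
        self.args = (function, dataType, children)

class ScalaUDFNewBroken:
    # Release N+1 added required parameters: 3-argument call sites fail.
    # (Parameter names below are invented for illustration.)
    def __init__(self, function, dataType, children, inputsNullSafe, udfName):
        self.args = (function, dataType, children, inputsNullSafe, udfName)

class ScalaUDFFixed:
    # Compatible fix: keep the new parameters but give the old call shape
    # a valid meaning again (extra constructors in Scala; defaults play
    # that role in this sketch).
    def __init__(self, function, dataType, children,
                 inputsNullSafe=None, udfName=None):
        self.args = (function, dataType, children, inputsNullSafe, udfName)

old_call = lambda cls: cls(str.upper, "string", [])
old_call(ScalaUDFOld)                # fine in release N
try:
    old_call(ScalaUDFNewBroken)      # breaks in release N+1
except TypeError:
    print("old call site broken")
old_call(ScalaUDFFixed)              # works again after the fix
```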
[jira] [Commented] (SPARK-23416) flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false
[ https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362950#comment-16362950 ] Apache Spark commented on SPARK-23416: -- User 'jose-torres' has created a pull request for this issue: https://github.com/apache/spark/pull/20602 > flaky test: > org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress > test for failOnDataLoss=false > -- > > Key: SPARK-23416 > URL: https://issues.apache.org/jira/browse/SPARK-23416 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Minor > > I suspect this is a race condition latent in the DataSourceV2 write path, or > at least the interaction of that write path with StreamTest. > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/] > h3. Error Message > org.apache.spark.sql.streaming.StreamingQueryException: Query [id = > 16b2a2b1-acdd-44ec-902f-531169193169, runId = > 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job > aborted. > h3. Stacktrace > sbt.ForkMain$ForkError: > org.apache.spark.sql.streaming.StreamingQueryException: Query [id = > 16b2a2b1-acdd-44ec-902f-531169193169, runId = > 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job > aborted. at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) > Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing > job aborted. 
at > org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) > at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at > org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:483) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:482) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at >
[jira] [Assigned] (SPARK-23416) flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false
[ https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23416: Assignee: Apache Spark > flaky test: > org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress > test for failOnDataLoss=false
[jira] [Assigned] (SPARK-23416) flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false
[ https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23416: Assignee: (was: Apache Spark) > flaky test: > org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress > test for failOnDataLoss=false
[jira] [Comment Edited] (SPARK-23416) flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false
[ https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362941#comment-16362941 ] Jose Torres edited comment on SPARK-23416 at 2/13/18 7:48 PM: -- I think I see the problem. * StreamExecution.stop() works by interrupting the stream execution thread. This is not safe in general, and can throw any variety of exceptions. * StreamExecution.isInterruptedByStop() solves this problem by implementing a whitelist of exceptions which indicate the stop() happened. * The v2 write path adds calls to ThreadUtils.awaitResult(), which weren't in the V1 write path and (if the interrupt happens to fall in them) throw a new exception which isn't accounted for. I'm going to write a PR to add another whitelist entry. This whole edifice is a bit fragile, but I don't have a good solution for that. was (Author: joseph.torres): I think I see the problem. * StreamExecution.stop() works by interrupting the stream execution thread. This is not safe in general, and can throw any variety of exceptions. * StreamExecution.isInterruptedByStop() solves this problem by implementing a whitelist of exceptions which indicate the stop() happened. * The v2 write path adds calls to ThreadUtils.awaitResult(), which weren't in the V1 write path and (if the interrupt happens to fall in them) throw a new exception which isn't accounted for. I'm going to write a PR to add another whitelist entry, but this is quite fragile. > flaky test: > org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress > test for failOnDataLoss=false
[jira] [Commented] (SPARK-23416) flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false
[ https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362941#comment-16362941 ] Jose Torres commented on SPARK-23416: - I think I see the problem. * StreamExecution.stop() works by interrupting the stream execution thread. This is not safe in general, and can throw any variety of exceptions. * StreamExecution.isInterruptedByStop() solves this problem by implementing a whitelist of exceptions which indicate the stop() happened. * The v2 write path adds calls to ThreadUtils.awaitResult(), which weren't in the V1 write path and (if the interrupt happens to fall in them) throw a new exception which isn't accounted for. I'm going to write a PR to add another whitelist entry, but this is quite fragile. > flaky test: > org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress > test for failOnDataLoss=false
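The whitelist approach described in the comment above can be sketched as follows. This is an illustrative approximation, not Spark's actual `StreamExecution.isInterruptedByStop` implementation: the class name, the exact exception set, and the `stopRequested` flag are assumptions made for the sketch.

```java
// Minimal sketch of the "whitelist of interrupt exceptions" idea:
// after stop() interrupts the stream thread, the terminal exception is
// compared against the set of exception types a JVM interrupt is known
// to surface as, walking the cause chain because the interrupt may be
// wrapped (e.g. in a SparkException) before reaching the error handler.
import java.util.List;
import java.util.concurrent.TimeoutException;

class InterruptWhitelist {
    // Exception types an interrupt during stop() may surface as.
    // TimeoutException stands in for the extra entry the
    // ThreadUtils.awaitResult() path required.
    private static final List<Class<? extends Throwable>> WHITELIST = List.of(
        InterruptedException.class,
        java.nio.channels.ClosedByInterruptException.class,
        TimeoutException.class
    );

    static boolean isInterruptedByStop(Throwable t, boolean stopRequested) {
        if (!stopRequested) return false; // the failure cannot be our own stop()
        for (Throwable c = t; c != null; c = c.getCause()) {
            for (Class<? extends Throwable> cls : WHITELIST) {
                if (cls.isInstance(c)) return true;
            }
        }
        return false;
    }
}
```

The fragility the comment mentions is visible in the design: every new blocking call on the execution thread can introduce a new exception type that must be added to the whitelist by hand.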
[jira] [Commented] (SPARK-23416) flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false
[ https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362920#comment-16362920 ] Marco Gaido commented on SPARK-23416: - I see this failing also with this stacktrace: {code:java} sbt.ForkMain$ForkError: org.apache.spark.sql.streaming.StreamingQueryException: Query memory [id = cca87cf7-0532-41af-b757-0948ec294c0c, runId = c1830af6-1715-4947-bd76-a1a63482280b] terminated with exception: null at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) Caused by: sbt.ForkMain$ForkError: java.lang.InterruptedException: null at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) at org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runContinuous(ContinuousExecution.scala:271) at org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runActivatedStream(ContinuousExecution.scala:89) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279) ... 
1 more {code} https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87401/testReport/org.apache.spark.sql.kafka010/KafkaContinuousSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/ > flaky test: > org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress > test for failOnDataLoss=false
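The `java.lang.InterruptedException: null` in the trace above is exactly what a thread blocked in an await sees when another thread interrupts it. A self-contained demonstration of that mechanism (`AwaitInterruptDemo` is a hypothetical name, and `CompletableFuture.get()` stands in here for Scala's `Promise`/`ThreadUtils.awaitResult` blocking path):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

class AwaitInterruptDemo {
    // Returns the exception class observed by a thread that is interrupted
    // while blocked waiting on a never-completed future.
    static Class<?> observe() {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        Thread t = new Thread(() -> {
            try {
                new CompletableFuture<Void>().get(); // blocks indefinitely
            } catch (Throwable e) {
                seen.set(e); // record what the interrupt surfaced as
            }
        });
        t.start();
        try { Thread.sleep(200); } catch (InterruptedException e) { }
        t.interrupt(); // analogous to StreamExecution.stop() interrupting the stream thread
        try { t.join(); } catch (InterruptedException e) { }
        return seen.get() == null ? null : seen.get().getClass();
    }
}
```

This is why the awaits added by the v2/continuous paths surface a raw `InterruptedException` during stop(), which the query error handler then reports as a test failure unless that exception type is recognized.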
[jira] [Commented] (SPARK-23416) flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false
[ https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362913#comment-16362913 ] Dongjoon Hyun commented on SPARK-23416: --- Thank you for filing this! > flaky test: > org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress > test for failOnDataLoss=false
[jira] [Resolved] (SPARK-23154) Document backwards compatibility guarantees for ML persistence
[ https://issues.apache.org/jira/browse/SPARK-23154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-23154. --- Resolution: Fixed Fix Version/s: 2.4.0 2.3.1 Resolved via https://github.com/apache/spark/pull/20592 > Document backwards compatibility guarantees for ML persistence > -- > > Key: SPARK-23154 > URL: https://issues.apache.org/jira/browse/SPARK-23154 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > We have (as far as I know) maintained backwards compatibility for ML > persistence, but this is not documented anywhere. I'd like us to document it > (for spark.ml, not for spark.mllib). > I'd recommend something like: > {quote} > In general, MLlib maintains backwards compatibility for ML persistence. > I.e., if you save an ML model or Pipeline in one version of Spark, then you > should be able to load it back and use it in a future version of Spark. > However, there are rare exceptions, described below. > Model persistence: Is a model or Pipeline saved using Apache Spark ML > persistence in Spark version X loadable by Spark version Y? > * Major versions: No guarantees, but best-effort. > * Minor and patch versions: Yes; these are backwards compatible. > * Note about the format: There are no guarantees for a stable persistence > format, but model loading itself is designed to be backwards compatible. > Model behavior: Does a model or Pipeline in Spark version X behave > identically in Spark version Y? > * Major versions: No guarantees, but best-effort. > * Minor and patch versions: Identical behavior, except for bug fixes. > For both model persistence and model behavior, any breaking changes across a > minor version or patch version are reported in the Spark version release > notes. 
If a breakage is not reported in release notes, then it should be > treated as a bug to be fixed. > {quote} > How does this sound? > Note: We unfortunately don't have tests for backwards compatibility (which > has technical hurdles and can be discussed in [SPARK-15573]). However, we > have made efforts to maintain it during PR review and Spark release QA, and > most users expect it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23344) Add KMeans distanceMeasure param to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362902#comment-16362902 ] Sean Owen commented on SPARK-23344: --- Nah, I think the regulars have different views on this, which nevertheless work OK for them. Often where it becomes a problem is with people trying to inflate a JIRA count, and they won't generally have read or paid attention to threads like that anyway. > Add KMeans distanceMeasure param to PySpark > --- > > Key: SPARK-23344 > URL: https://issues.apache.org/jira/browse/SPARK-23344 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Minor > Fix For: 2.4.0 > > > SPARK-22119 introduced a new parameter for KMeans, ie. {{distanceMeasure}}. > We should add it also to the Python interface. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
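For readers following along, the two measures the proposed {{distanceMeasure}} parameter chooses between can be sketched in plain Python. This is an illustrative sketch only; Spark's actual implementations are in its Scala clustering code and differ in detail (e.g. norm caching):

```python
import math

# The two distance measures KMeans' `distanceMeasure` selects between:
# "euclidean" (the default) and "cosine" (added by SPARK-22119).

def euclidean_distance(a, b):
    # Straight-line distance between the two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between the vectors; magnitude
    # is ignored, only direction matters.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Vectors pointing in the same direction are "close" under cosine
# distance even when their magnitudes differ greatly.
print(euclidean_distance([1.0, 0.0], [10.0, 0.0]))  # 9.0
print(cosine_distance([1.0, 0.0], [10.0, 0.0]))     # 0.0
```

Assuming the Python API mirrors the Scala one from SPARK-22119, usage would presumably look like {{KMeans(k=3, distanceMeasure="cosine")}}.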
[jira] [Updated] (SPARK-23159) Update Cloudpickle to match version 0.4.3
[ https://issues.apache.org/jira/browse/SPARK-23159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-23159: - Description: Update PySpark's version of Cloudpickle to match version 0.4.3. The reasons for doing this are: * Pick up bug fixes, improvements with newer version * Match a specific version as close as possible (Spark has additional changes that might be necessary) to make future upgrades easier There are newer versions of Cloudpickle that can fix bugs with NamedTuple pickling (that Spark currently has workarounds for), but these include other changes that need some verification before bringing into Spark. Upgrading first to 0.4.3 will help make this verification easier. Discussion on the mailing list: [http://apache-spark-developers-list.1001551.n3.nabble.com/Thoughts-on-Cloudpickle-Update-td23188.html] Upgrading to the recent release of v0.4.3 will include the following: * Fix pickling of named tuples [https://github.com/cloudpipe/cloudpickle/pull/113] * Built in type constructors for PyPy compatibility [here]([https://github.com/cloudpipe/cloudpickle/commit/d84980ccaafc7982a50d4e04064011f401f17d1b]) * Fix memoryview support [https://github.com/cloudpipe/cloudpickle/pull/122] * Improved compatibility with other cloudpickle versions [https://github.com/cloudpipe/cloudpickle/pull/128] * Several cleanups [https://github.com/cloudpipe/cloudpickle/pull/121] and [here]([https://github.com/cloudpipe/cloudpickle/commit/c91aaf110441991307f5097f950764079d0f9652]) * [MRG] Regression on pickling classes from the __main__ module [https://github.com/cloudpipe/cloudpickle/pull/149] * BUG: Handle instance methods of builtin types [https://github.com/cloudpipe/cloudpickle/pull/154] * Fix #129 : do not silence RuntimeError in dump() [https://github.com/cloudpipe/cloudpickle/pull/153] was: Update PySpark's version of Cloudpickle to match version 0.4.2. 
The reasons for doing this are: * Pick up bug fixes, improvements with newer version * Match a specific version as close as possible (Spark has additional changes that might be necessary) to make future upgrades easier There are newer versions of Cloudpickle that can fix bugs with NamedTuple pickling (that Spark currently has workarounds for), but these include other changes that need some verification before bringing into Spark. Upgrading first to 0.4.2 will help make this verification easier. Discussion on the mailing list: http://apache-spark-developers-list.1001551.n3.nabble.com/Thoughts-on-Cloudpickle-Update-td23188.html > Update Cloudpickle to match version 0.4.3 > - > > Key: SPARK-23159 > URL: https://issues.apache.org/jira/browse/SPARK-23159 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Priority: Major > > Update PySpark's version of Cloudpickle to match version 0.4.3. The reasons > for doing this are: > * Pick up bug fixes, improvements with newer version > * Match a specific version as close as possible (Spark has additional > changes that might be necessary) to make future upgrades easier > There are newer versions of Cloudpickle that can fix bugs with NamedTuple > pickling (that Spark currently has workarounds for), but these include other > changes that need some verification before bringing into Spark. Upgrading > first to 0.4.3 will help make this verification easier. 
> Discussion on the mailing list: > [http://apache-spark-developers-list.1001551.n3.nabble.com/Thoughts-on-Cloudpickle-Update-td23188.html] > Upgrading to the recent release of v0.4.3 will include the following: > * Fix pickling of named tuples > [https://github.com/cloudpipe/cloudpickle/pull/113] > * Built in type constructors for PyPy compatibility > [here]([https://github.com/cloudpipe/cloudpickle/commit/d84980ccaafc7982a50d4e04064011f401f17d1b]) > * Fix memoryview support [https://github.com/cloudpipe/cloudpickle/pull/122] > * Improved compatibility with other cloudpickle versions > [https://github.com/cloudpipe/cloudpickle/pull/128] > * Several cleanups [https://github.com/cloudpipe/cloudpickle/pull/121] and > [here]([https://github.com/cloudpipe/cloudpickle/commit/c91aaf110441991307f5097f950764079d0f9652]) > * [MRG] Regression on pickling classes from the __main__ module > [https://github.com/cloudpipe/cloudpickle/pull/149] > * BUG: Handle instance methods of builtin types > [https://github.com/cloudpipe/cloudpickle/pull/154] > * Fix #129 : do not silence RuntimeError in dump() > [https://github.com/cloudpipe/cloudpickle/pull/153] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For
[jira] [Updated] (SPARK-23159) Update Cloudpickle to match version 0.4.3
[ https://issues.apache.org/jira/browse/SPARK-23159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-23159: - Summary: Update Cloudpickle to match version 0.4.3 (was: Update Cloudpickle to match version 0.4.2) > Update Cloudpickle to match version 0.4.3 > - > > Key: SPARK-23159 > URL: https://issues.apache.org/jira/browse/SPARK-23159 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Priority: Major > > Update PySpark's version of Cloudpickle to match version 0.4.2. The reasons > for doing this are: > * Pick up bug fixes, improvements with newer version > * Match a specific version as close as possible (Spark has additional > changes that might be necessary) to make future upgrades easier > There are newer versions of Cloudpickle that can fix bugs with NamedTuple > pickling (that Spark currently has workarounds for), but these include other > changes that need some verification before bringing into Spark. Upgrading > first to 0.4.2 will help make this verification easier. > Discussion on the mailing list: > http://apache-spark-developers-list.1001551.n3.nabble.com/Thoughts-on-Cloudpickle-Update-td23188.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
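As background on the NamedTuple fix mentioned above (cloudpickle PR #113): plain {{pickle}} serializes a namedtuple class by reference, which only works when the class is importable; cloudpickle must additionally serialize classes defined on the fly (e.g. in a REPL or on a PySpark driver), which is what Spark's workarounds covered. A minimal stdlib round-trip for the importable case, for illustration only:

```python
import pickle
from collections import namedtuple

# Stdlib pickle stores `Point` by reference (module + name), so this
# round-trip works as long as the class is reachable by import. The harder
# case cloudpickle handles is a namedtuple defined interactively, where no
# importable definition exists on the worker side.
Point = namedtuple("Point", ["x", "y"])

p = Point(1, 2)
restored = pickle.loads(pickle.dumps(p))
print(restored == p)             # True
print(restored.x + restored.y)   # 3
```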
[jira] [Created] (SPARK-23416) flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false
Jose Torres created SPARK-23416: --- Summary: flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false Key: SPARK-23416 URL: https://issues.apache.org/jira/browse/SPARK-23416 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Jose Torres I suspect this is a race condition latent in the DataSourceV2 write path, or at least the interaction of that write path with StreamTest. [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/] h3. Error Message org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 16b2a2b1-acdd-44ec-902f-531169193169, runId = 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job aborted. h3. Stacktrace sbt.ForkMain$ForkError: org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 16b2a2b1-acdd-44ec-902f-531169193169, runId = 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job aborted. at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing job aborted. 
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272) at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:483) at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) at 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:482) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121) at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117) at
[jira] [Created] (SPARK-23415) BufferHolderSparkSubmitSuite is flaky
Dongjoon Hyun created SPARK-23415: - Summary: BufferHolderSparkSubmitSuite is flaky Key: SPARK-23415 URL: https://issues.apache.org/jira/browse/SPARK-23415 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Dongjoon Hyun The test suite sometimes fails due to a 60-second timeout. {code} Error Message org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to failAfter did not complete within 60 seconds. Stacktrace sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to failAfter did not complete within 60 seconds. {code} - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87380/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4206/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23410) Unable to read jsons in charset different from UTF-8
[ https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-23410: --- Shepherd: Herman van Hovell > Unable to read jsons in charset different from UTF-8 > > > Key: SPARK-23410 > URL: https://issues.apache.org/jira/browse/SPARK-23410 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > > Currently the JSON parser is forced to read JSON files in UTF-8. This > behavior breaks backward compatibility with Spark 2.2.1 and earlier versions, > which could read JSON files in UTF-16, UTF-32 and other encodings thanks to > the auto-detection mechanism of the Jackson library. We need to give users > back the ability to read JSON files in a specified charset and/or detect the > charset automatically, as before. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
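The backward-compatibility problem can be reproduced with a small stdlib sketch (hypothetical payload; Spark itself reads through Jackson rather than Python's {{json}}, so this only illustrates the charset issue):

```python
import json

# A JSON document encoded in UTF-16, the kind older Spark versions could
# ingest via Jackson's charset auto-detection.
raw = '{"id": 1, "name": "Maxim"}'.encode("utf-16")

# Forcing the wrong charset (UTF-8) makes the payload unparseable...
try:
    json.loads(raw.decode("utf-8", errors="replace"))
except json.JSONDecodeError:
    print("not valid JSON when decoded as UTF-8")

# ...while an explicit (or auto-detected) charset recovers the document.
doc = json.loads(raw.decode("utf-16"))
print(doc["name"])  # Maxim
```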
[jira] [Assigned] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work
[ https://issues.apache.org/jira/browse/SPARK-23413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23413: Assignee: Apache Spark > Sorting tasks by Host / Executor ID on the Stage page does not work > --- > > Key: SPARK-23413 > URL: https://issues.apache.org/jira/browse/SPARK-23413 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0, 2.4.0 >Reporter: Attila Zsolt Piros >Assignee: Apache Spark >Priority: Major > > Sorting tasks by Host / Executor ID throws exceptions: > {code} > java.lang.IllegalArgumentException: Invalid sort column: Executor ID at > org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at > org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at > org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at > org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at > org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at > org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at > org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) > {code} > {code} > java.lang.IllegalArgumentException: Invalid sort column: Host at > org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at > org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at > org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at > org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at > org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at > org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at > org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > 
org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > {code} > !image-2018-02-13-16-50-32-600.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work
[ https://issues.apache.org/jira/browse/SPARK-23413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362830#comment-16362830 ] Apache Spark commented on SPARK-23413: -- User 'attilapiros' has created a pull request for this issue: https://github.com/apache/spark/pull/20601 > Sorting tasks by Host / Executor ID on the Stage page does not work > --- > > Key: SPARK-23413 > URL: https://issues.apache.org/jira/browse/SPARK-23413 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0, 2.4.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Sorting tasks by Host / Executor ID throws exceptions: > {code} > java.lang.IllegalArgumentException: Invalid sort column: Executor ID at > org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at > org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at > org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at > org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at > org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at > org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at > org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) > {code} > {code} > java.lang.IllegalArgumentException: Invalid sort column: Host at > org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at > org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at > org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at > org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at > org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at > org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at > 
org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > {code} > !image-2018-02-13-16-50-32-600.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work
[ https://issues.apache.org/jira/browse/SPARK-23413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23413: Assignee: (was: Apache Spark) > Sorting tasks by Host / Executor ID on the Stage page does not work > --- > > Key: SPARK-23413 > URL: https://issues.apache.org/jira/browse/SPARK-23413 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0, 2.4.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Sorting tasks by Host / Executor ID throws exceptions: > {code} > java.lang.IllegalArgumentException: Invalid sort column: Executor ID at > org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at > org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at > org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at > org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at > org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at > org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at > org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) > {code} > {code} > java.lang.IllegalArgumentException: Invalid sort column: Host at > org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at > org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at > org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at > org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at > org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at > org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at > org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > 
org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > {code} > !image-2018-02-13-16-50-32-600.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23344) Add KMeans distanceMeasure param to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362774#comment-16362774 ] Marco Gaido commented on SPARK-23344: - I see. It would be good indeed to decide in the community a "standard" approach for this, I think. Do you think it is worth opening a thread in the dev mailing list about this topic? > Add KMeans distanceMeasure param to PySpark > --- > > Key: SPARK-23344 > URL: https://issues.apache.org/jira/browse/SPARK-23344 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Minor > Fix For: 2.4.0 > > > SPARK-22119 introduced a new parameter for KMeans, ie. {{distanceMeasure}}. > We should add it also to the Python interface. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23344) Add KMeans distanceMeasure param to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362760#comment-16362760 ] Sean Owen commented on SPARK-23344: --- My general rule of thumb is: how likely is it you would resolve one issue independently of the other, or cherry-pick the fix for one vs the other? if they'd pretty much always go together then they seem like a logical issue. A lot of other concerns modify that. For example very large changes should probably be broken down even if they go together. This one is ambiguous. > Add KMeans distanceMeasure param to PySpark > --- > > Key: SPARK-23344 > URL: https://issues.apache.org/jira/browse/SPARK-23344 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Minor > Fix For: 2.4.0 > > > SPARK-22119 introduced a new parameter for KMeans, ie. {{distanceMeasure}}. > We should add it also to the Python interface. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23344) Add KMeans distanceMeasure param to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362758#comment-16362758 ] Marco Gaido commented on SPARK-23344: - [~srowen] I did it this way because I have always seen it done this way. Not sure about the reason, maybe in order to make review easier. What would you suggest? > Add KMeans distanceMeasure param to PySpark > --- > > Key: SPARK-23344 > URL: https://issues.apache.org/jira/browse/SPARK-23344 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Minor > Fix For: 2.4.0 > > > SPARK-22119 introduced a new parameter for KMeans, ie. {{distanceMeasure}}. > We should add it also to the Python interface. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23292) python tests related to pandas are skipped with python 2
[ https://issues.apache.org/jira/browse/SPARK-23292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sameer Agarwal updated SPARK-23292: --- Summary: python tests related to pandas are skipped with python 2 (was: python tests related to pandas are skipped) > python tests related to pandas are skipped with python 2 > > > Key: SPARK-23292 > URL: https://issues.apache.org/jira/browse/SPARK-23292 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 >Reporter: Yin Huai >Priority: Critical > > I was running python tests and found that > [pyspark.sql.tests.GroupbyAggPandasUDFTests.test_unsupported_types|https://github.com/apache/spark/blob/52e00f70663a87b5837235bdf72a3e6f84e11411/python/pyspark/sql/tests.py#L4528-L4548] > does not run with Python 2 because the test uses "assertRaisesRegex" > (supported by Python 3) instead of "assertRaisesRegexp" (supported by Python > 2). However, spark jenkins does not fail because of this issue (see run > history at > [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/]). > After looking into this issue, [seems test script will skip tests related to > pandas if pandas is not > installed|https://github.com/apache/spark/blob/2ac895be909de7e58e1051dc2a1bba98a25bf4be/python/pyspark/sql/tests.py#L51-L63], > which means that jenkins does not have pandas installed. > > Since pyarrow related tests have the same skipping logic, we will need to > check if jenkins has pyarrow installed correctly as well. > > Since features using pandas and pyarrow are in 2.3, we should fix the test > issue and make sure all tests pass before we make the release. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
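A minimal sketch of the portability issue, using a hypothetical stand-in for the skipped pandas UDF test: Python 2.7 only provides {{assertRaisesRegexp}}, while {{assertRaisesRegex}} is the Python 3 spelling, so a test written with the latter cannot even be loaded on Python 2 and gets silently skipped along with the rest of the module. An alias keeps one spelling working on both interpreters:

```python
import sys
import unittest

class CompatTest(unittest.TestCase):
    # Python 2.7 has only `assertRaisesRegexp`; `assertRaisesRegex` is the
    # Python 3 name. Aliasing the Py2 method under the Py3 name lets the
    # same test body run on both interpreters.
    if sys.version_info < (3,):
        assertRaisesRegex = unittest.TestCase.assertRaisesRegexp

    def test_unsupported_types(self):
        # Hypothetical stand-in for the real pandas UDF assertion.
        with self.assertRaisesRegex(TypeError, "[Uu]nsupported"):
            raise TypeError("Unsupported data type")

# Run the case programmatically instead of via unittest.main().
result = unittest.TestResult()
unittest.defaultTestLoader.loadTestsFromTestCase(CompatTest).run(result)
print(result.wasSuccessful())  # True
```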
[jira] [Resolved] (SPARK-23217) Add cosine distance measure to ClusteringEvaluator
[ https://issues.apache.org/jira/browse/SPARK-23217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23217. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20396 [https://github.com/apache/spark/pull/20396] > Add cosine distance measure to ClusteringEvaluator > -- > > Key: SPARK-23217 > URL: https://issues.apache.org/jira/browse/SPARK-23217 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > Attachments: SPARK-23217.pdf > > > SPARK-22119 introduced the cosine distance measure for KMeans. Therefore it > would be useful to provide also an implementation of ClusteringEvaluator > using the cosine distance measure. > > Attached you can find a design document for it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23217) Add cosine distance measure to ClusteringEvaluator
[ https://issues.apache.org/jira/browse/SPARK-23217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-23217: - Assignee: Marco Gaido > Add cosine distance measure to ClusteringEvaluator > -- > > Key: SPARK-23217 > URL: https://issues.apache.org/jira/browse/SPARK-23217 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > Attachments: SPARK-23217.pdf > > > SPARK-22119 introduced the cosine distance measure for KMeans. Therefore it > would be useful to provide also an implementation of ClusteringEvaluator > using the cosine distance measure. > > Attached you can find a design document for it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23392) Add some test case for images feature
[ https://issues.apache.org/jira/browse/SPARK-23392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23392. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20583 [https://github.com/apache/spark/pull/20583] > Add some test case for images feature > - > > Key: SPARK-23392 > URL: https://issues.apache.org/jira/browse/SPARK-23392 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: xubo245 >Assignee: xubo245 >Priority: Trivial > Fix For: 2.4.0 > > > Add some test case for images feature: SPARK-21866 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23392) Add some test case for images feature
[ https://issues.apache.org/jira/browse/SPARK-23392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-23392: - Assignee: xubo245 > Add some test case for images feature > - > > Key: SPARK-23392 > URL: https://issues.apache.org/jira/browse/SPARK-23392 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: xubo245 >Assignee: xubo245 >Priority: Trivial > Fix For: 2.4.0 > > > Add some test case for images feature: SPARK-21866 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23388) Support for Parquet Binary DecimalType in VectorizedColumnReader
[ https://issues.apache.org/jira/browse/SPARK-23388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362730#comment-16362730 ] Sameer Agarwal commented on SPARK-23388: yes, I agree > Support for Parquet Binary DecimalType in VectorizedColumnReader > > > Key: SPARK-23388 > URL: https://issues.apache.org/jira/browse/SPARK-23388 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: James Thompson >Assignee: James Thompson >Priority: Major > Fix For: 2.3.1 > > > The following commit to spark removed support for decimal binary types: > [https://github.com/apache/spark/commit/9c29c557635caf739fde942f53255273aac0d7b1#diff-7bdf5fd0ce0b1ccbf4ecf083611976e6R428] > As per the parquet spec, decimal can be used to annotate binary types, so > support should be re-added: > [https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23382) Spark Streaming UI tables should support hiding and showing contents when they contain many records
[ https://issues.apache.org/jira/browse/SPARK-23382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-23382: - Assignee: guoxiaolongzte > Spark Streaming UI tables should support hiding and showing contents when > they contain many records > -- > > Key: SPARK-23382 > URL: https://issues.apache.org/jira/browse/SPARK-23382 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0 >Reporter: guoxiaolongzte >Assignee: guoxiaolongzte >Priority: Minor > Fix For: 2.4.0 > > > The tables on the Spark Streaming UI should support hiding and showing their > contents when they contain many records. > For the specific reasons, please refer to > https://issues.apache.org/jira/browse/SPARK-23024 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23382) Spark Streaming UI tables should support hiding and showing contents when they contain many records
[ https://issues.apache.org/jira/browse/SPARK-23382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23382. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20570 [https://github.com/apache/spark/pull/20570] > Spark Streaming UI tables should support hiding and showing contents when > they contain many records > -- > > Key: SPARK-23382 > URL: https://issues.apache.org/jira/browse/SPARK-23382 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0 >Reporter: guoxiaolongzte >Assignee: guoxiaolongzte >Priority: Minor > Fix For: 2.4.0 > > > The tables on the Spark Streaming UI should support hiding and showing their > contents when they contain many records. > For the specific reasons, please refer to > https://issues.apache.org/jira/browse/SPARK-23024 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23340) Upgrade Apache ORC to 1.4.3
[ https://issues.apache.org/jira/browse/SPARK-23340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23340: -- Summary: Upgrade Apache ORC to 1.4.3 (was: Empty float/double array columns in ORC file should not raise EOFException) > Upgrade Apache ORC to 1.4.3 > --- > > Key: SPARK-23340 > URL: https://issues.apache.org/jira/browse/SPARK-23340 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Critical > > This issue updates Apache ORC dependencies to 1.4.3 released on February 9th. > Apache ORC 1.4.2 release removes unnecessary dependencies and 1.4.3 has 5 > more patches including bug fixes (https://s.apache.org/Fll8). > Especially, the following ORC-285 is fixed at 1.4.3. > {code} > scala> val df = Seq(Array.empty[Float]).toDF() > scala> df.write.format("orc").save("/tmp/floatarray") > scala> spark.read.orc("/tmp/floatarray") > res1: org.apache.spark.sql.DataFrame = [value: array] > scala> spark.read.orc("/tmp/floatarray").show() > 18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.io.IOException: Error reading file: > file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191) > at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78) > ... > Caused by: java.io.EOFException: Read past EOF for compressed stream Stream > for column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23414) Plotting using matplotlib in MLlib pyspark
[ https://issues.apache.org/jira/browse/SPARK-23414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23414. --- Resolution: Invalid Questions should go to the mailing list. Using matplotlib would be out of scope and pretty orthogonal to using Spark. > Plotting using matplotlib in MLlib pyspark > --- > > Key: SPARK-23414 > URL: https://issues.apache.org/jira/browse/SPARK-23414 > Project: Spark > Issue Type: Question > Components: MLlib >Affects Versions: 2.2.1 >Reporter: Waleed Esmail >Priority: Major > > Dear MLlib experts, > I just want to plot a fancy confusion matrix (true values vs predicted > values) like the one produced by seaborn module in python, so I did the > following: > {code:java} > labelIndexer = StringIndexer(inputCol="label", > outputCol="indexedLabel").fit(output) > # Automatically identify categorical features, and index them. > # We specify maxCategories so features with > 4 distinct values are treated > as continuous. > featureIndexer = VectorIndexer(inputCol="features", > outputCol="indexedFeatures").fit(output) > # Split the data into training and test sets (30% held out for testing) > (trainingData, testData) = output.randomSplit([0.7, 0.3]) > dt = DecisionTreeClassifier(labelCol="indexedLabel", > featuresCol="indexedFeatures", maxDepth=15) > # Chain indexers and tree in a Pipeline > pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt]) > # Train model. This also runs the indexers. > model = pipeline.fit(trainingData) > # Make predictions. 
> predictions = model.transform(testData) > predictionAndLabels = predictions.select("prediction", "indexedLabel") > y_predicted = np.array(predictions.select("prediction").collect()) > y_test = np.array(predictions.select("indexedLabel").collect()) > from sklearn.metrics import confusion_matrix > import matplotlib.ticker as ticker > figcm, ax = plt.subplots() > cm = confusion_matrix(y_test, y_predicted) > # for normalization > cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] > sns.heatmap(cm, square=True, annot=True, cbar=False) > plt.xlabel('prediction') > plt.ylabel('true value') > {code} > Is this the right way to do it? Please note that I am new to Spark and MLlib. > > Thank you in advance. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23414) Plotting using matplotlib in MLlib pyspark
Waleed Esmail created SPARK-23414: - Summary: Plotting using matplotlib in MLlib pyspark Key: SPARK-23414 URL: https://issues.apache.org/jira/browse/SPARK-23414 Project: Spark Issue Type: Question Components: MLlib Affects Versions: 2.2.1 Reporter: Waleed Esmail Dear MLlib experts, I just want to plot a fancy confusion matrix (true values vs predicted values) like the one produced by the seaborn module in Python, so I did the following: {code:java} labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(output) # Automatically identify categorical features, and index them. # We specify maxCategories so features with > 4 distinct values are treated as continuous. featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures").fit(output) # Split the data into training and test sets (30% held out for testing) (trainingData, testData) = output.randomSplit([0.7, 0.3]) dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxDepth=15) # Chain indexers and tree in a Pipeline pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt]) # Train model. This also runs the indexers. model = pipeline.fit(trainingData) # Make predictions. predictions = model.transform(testData) predictionAndLabels = predictions.select("prediction", "indexedLabel") y_predicted = np.array(predictions.select("prediction").collect()) y_test = np.array(predictions.select("indexedLabel").collect()) from sklearn.metrics import confusion_matrix import matplotlib.ticker as ticker figcm, ax = plt.subplots() cm = confusion_matrix(y_test, y_predicted) # for normalization cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] sns.heatmap(cm, square=True, annot=True, cbar=False) plt.xlabel('prediction') plt.ylabel('true value') {code} Is this the right way to do it? 
Please note that I am new to Spark and MLlib. Thank you in advance. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
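The row-normalization step in the snippet above (`cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]`) can be reproduced without numpy or sklearn. A minimal plain-Python sketch, for illustration only (the helpers `confusion_matrix` and `row_normalize` are hypothetical, not Spark or sklearn API):

```python
def confusion_matrix(pairs, labels):
    """Count (true_label, predicted_label) pairs into a labels x labels matrix."""
    counts = {}
    for true, pred in pairs:
        counts[(true, pred)] = counts.get((true, pred), 0) + 1
    return [[counts.get((t, p), 0) for p in labels] for t in labels]

def row_normalize(matrix):
    """Divide each row by its sum, so each row shows per-class proportions."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized

# collected (true, predicted) pairs, e.g. from predictions.select(...).collect()
pairs = [(0, 0), (0, 1), (1, 1), (1, 1)]
cm = confusion_matrix(pairs, labels=[0, 1])
normalized = row_normalize(cm)
```

For large datasets it may be preferable to avoid collecting all predictions to the driver; MLlib's `MulticlassMetrics` can compute a confusion matrix distributedly from a prediction-and-label RDD.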
[jira] [Commented] (SPARK-23411) Deprecate SparkContext.getExecutorStorageStatus
[ https://issues.apache.org/jira/browse/SPARK-23411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362597#comment-16362597 ] Marco Gaido commented on SPARK-23411: - I think this method was removed in SPARK-20659. So I think this is a duplicate. But maybe I am missing something. > Deprecate SparkContext.getExecutorStorageStatus > --- > > Key: SPARK-23411 > URL: https://issues.apache.org/jira/browse/SPARK-23411 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Minor > > This is another developer API in SparkContext that is better moved to a more > monitoring-minded place such as {{SparkStatusTracker}}. The same information > is also already available through the REST API. > Moreover, it exposes {{RDDInfo}}, which is a mutable internal Spark type, and > exposing that to applications is kinda sketchy. (It can't be made private, > though, since it's exposed indirectly in events posted to the listener bus.) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23053) taskBinarySerialization and task partition calculation in DAGScheduler.submitMissingTasks should use the same RDD checkpoint status
[ https://issues.apache.org/jira/browse/SPARK-23053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-23053. -- Resolution: Fixed > taskBinarySerialization and task partition calculation in > DAGScheduler.submitMissingTasks should use the same RDD checkpoint status > --- > > Key: SPARK-23053 > URL: https://issues.apache.org/jira/browse/SPARK-23053 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.1.0 >Reporter: huangtengfei >Assignee: huangtengfei >Priority: Major > Fix For: 2.2.2, 2.3.1, 2.4.0 > > > When concurrent jobs use the same RDD that is marked for checkpointing, one > job may finish and start RDD.doCheckpoint while another job is submitted, in > which case submitStage and submitMissingTasks are called. In > [submitMissingTasks|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L961], > taskBinaryBytes is serialized and the task partitions are calculated; both > depend on the checkpoint status. If the former is calculated before > doCheckpoint finishes while the latter is calculated after it, then when the > task runs, rdd.compute is called, and an RDD with a particular partition type, > such as > [MapWithStateRDD|https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/rdd/MapWithStateRDD.scala], > which casts the partition type, throws a ClassCastException because the > partition parameter is actually a CheckpointRDDPartition. > This error occurs because rdd.doCheckpoint runs in the same thread that > called sc.runJob, while the task serialization occurs in the DAGScheduler's > event loop. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23053) taskBinarySerialization and task partition calculation in DAGScheduler.submitMissingTasks should use the same RDD checkpoint status
[ https://issues.apache.org/jira/browse/SPARK-23053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-23053: Assignee: huangtengfei > taskBinarySerialization and task partition calculation in > DAGScheduler.submitMissingTasks should use the same RDD checkpoint status > --- > > Key: SPARK-23053 > URL: https://issues.apache.org/jira/browse/SPARK-23053 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.1.0 >Reporter: huangtengfei >Assignee: huangtengfei >Priority: Major > Fix For: 2.2.2, 2.3.1, 2.4.0 > > > When concurrent jobs use the same RDD that is marked for checkpointing, one > job may finish and start RDD.doCheckpoint while another job is submitted, in > which case submitStage and submitMissingTasks are called. In > [submitMissingTasks|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L961], > taskBinaryBytes is serialized and the task partitions are calculated; both > depend on the checkpoint status. If the former is calculated before > doCheckpoint finishes while the latter is calculated after it, then when the > task runs, rdd.compute is called, and an RDD with a particular partition type, > such as > [MapWithStateRDD|https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/rdd/MapWithStateRDD.scala], > which casts the partition type, throws a ClassCastException because the > partition parameter is actually a CheckpointRDDPartition. > This error occurs because rdd.doCheckpoint runs in the same thread that > called sc.runJob, while the task serialization occurs in the DAGScheduler's > event loop. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23053) taskBinarySerialization and task partition calculation in DAGScheduler.submitMissingTasks should use the same RDD checkpoint status
[ https://issues.apache.org/jira/browse/SPARK-23053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362556#comment-16362556 ] Imran Rashid commented on SPARK-23053: -- Fixed by https://github.com/apache/spark/pull/20244 I set the fix version to 2.3.1, because we're in the middle of voting for RC3. If we cut another RC this would actually be fixed in 2.3.0. > taskBinarySerialization and task partition calculation in > DAGScheduler.submitMissingTasks should use the same RDD checkpoint status > --- > > Key: SPARK-23053 > URL: https://issues.apache.org/jira/browse/SPARK-23053 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.1.0 >Reporter: huangtengfei >Priority: Major > Fix For: 2.2.2, 2.3.1, 2.4.0 > > > When concurrent jobs use the same RDD that is marked for checkpointing, one > job may finish and start RDD.doCheckpoint while another job is submitted, in > which case submitStage and submitMissingTasks are called. In > [submitMissingTasks|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L961], > taskBinaryBytes is serialized and the task partitions are calculated; both > depend on the checkpoint status. If the former is calculated before > doCheckpoint finishes while the latter is calculated after it, then when the > task runs, rdd.compute is called, and an RDD with a particular partition type, > such as > [MapWithStateRDD|https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/rdd/MapWithStateRDD.scala], > which casts the partition type, throws a ClassCastException because the > partition parameter is actually a CheckpointRDDPartition. > This error occurs because rdd.doCheckpoint runs in the same thread that > called sc.runJob, while the task serialization occurs in the DAGScheduler's > event loop. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23053) taskBinarySerialization and task partition calculation in DAGScheduler.submitMissingTasks should use the same RDD checkpoint status
[ https://issues.apache.org/jira/browse/SPARK-23053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-23053: - Fix Version/s: 2.4.0 2.3.1 2.2.2 > taskBinarySerialization and task partition calculation in > DAGScheduler.submitMissingTasks should use the same RDD checkpoint status > --- > > Key: SPARK-23053 > URL: https://issues.apache.org/jira/browse/SPARK-23053 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.1.0 >Reporter: huangtengfei >Priority: Major > Fix For: 2.2.2, 2.3.1, 2.4.0 > > > When concurrent jobs use the same RDD that is marked for checkpointing, one > job may finish and start RDD.doCheckpoint while another job is submitted, in > which case submitStage and submitMissingTasks are called. In > [submitMissingTasks|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L961], > taskBinaryBytes is serialized and the task partitions are calculated; both > depend on the checkpoint status. If the former is calculated before > doCheckpoint finishes while the latter is calculated after it, then when the > task runs, rdd.compute is called, and an RDD with a particular partition type, > such as > [MapWithStateRDD|https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/rdd/MapWithStateRDD.scala], > which casts the partition type, throws a ClassCastException because the > partition parameter is actually a CheckpointRDDPartition. > This error occurs because rdd.doCheckpoint runs in the same thread that > called sc.runJob, while the task serialization occurs in the DAGScheduler's > event loop. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
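The SPARK-23053 race described above boils down to two snapshots of the same mutable checkpoint state taken at different times: the serialized task binary captures the pre-checkpoint view, while the partitions are computed from the post-checkpoint view. A toy sketch of that mismatch in plain Python (not Spark code; `ToyRDD` and the partition-kind strings are invented for illustration):

```python
import pickle

class ToyRDD:
    """Stands in for an RDD whose partition layout changes once it is checkpointed."""
    def __init__(self, num_partitions):
        self.checkpointed = False
        self.num_partitions = num_partitions

    def partitions(self):
        # The partition type depends on checkpoint status, like an RDD whose
        # partitions become CheckpointRDDPartition after doCheckpoint.
        kind = "CheckpointRDDPartition" if self.checkpointed else "MapWithStateRDDPartition"
        return [(kind, i) for i in range(self.num_partitions)]

rdd = ToyRDD(2)

# Snapshot 1: the task binary is serialized *before* doCheckpoint finishes.
task_binary = pickle.dumps(rdd)

# doCheckpoint completes concurrently (sequentially here, for the sketch).
rdd.checkpointed = True

# Snapshot 2: the partitions are computed *after* doCheckpoint finished.
parts = rdd.partitions()

deserialized = pickle.loads(task_binary)
# Mismatch: the deserialized task still believes the RDD is not checkpointed,
# but the partitions it will be handed are checkpoint partitions.
```

Reading both pieces of state under one consistent view (as the fix does) removes the window in which the two snapshots can disagree.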
[jira] [Assigned] (SPARK-23189) reflect stage level blacklisting on executor tab
[ https://issues.apache.org/jira/browse/SPARK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-23189: Assignee: Attila Zsolt Piros > reflect stage level blacklisting on executor tab > - > > Key: SPARK-23189 > URL: https://issues.apache.org/jira/browse/SPARK-23189 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.1 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 2.4.0 > > Attachments: afterStageCompleted.png, backlisted.png, > multiple_stages.png, multiple_stages_1.png, multiple_stages_2.png, > multiple_stages_3.png > > > This issue came up while working on SPARK-22577, where the conclusion was > that not only should the stage tab reflect stage- and application-level > blacklisting, but the executor tab should also be extended with stage-level > blacklisting information. > As [~irashid] and [~tgraves] discussed, the blacklisted stages should be > listed for an executor like "*stage[ , ,...]*". One idea was to list only the > 3 most recently blacklisted stages; another was to list all the active stages > which are blacklisted. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23189) reflect stage level blacklisting on executor tab
[ https://issues.apache.org/jira/browse/SPARK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-23189. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20408 [https://github.com/apache/spark/pull/20408] > reflect stage level blacklisting on executor tab > - > > Key: SPARK-23189 > URL: https://issues.apache.org/jira/browse/SPARK-23189 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.1 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 2.4.0 > > Attachments: afterStageCompleted.png, backlisted.png, > multiple_stages.png, multiple_stages_1.png, multiple_stages_2.png, > multiple_stages_3.png > > > This issue came up while working on SPARK-22577, where the conclusion was > that not only should the stage tab reflect stage- and application-level > blacklisting, but the executor tab should also be extended with stage-level > blacklisting information. > As [~irashid] and [~tgraves] discussed, the blacklisted stages should be > listed for an executor like "*stage[ , ,...]*". One idea was to list only the > 3 most recently blacklisted stages; another was to list all the active stages > which are blacklisted. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work
[ https://issues.apache.org/jira/browse/SPARK-23413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-23413: --- Description: Sorting tasks by Host / Executor ID throws exceptions: {code} java.lang.IllegalArgumentException: Invalid sort column: Executor ID at org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) {code} {code} java.lang.IllegalArgumentException: Invalid sort column: Host at org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at {code} !image-2018-02-13-16-50-32-600.png! 
was: Sorting tasks by Host / Executor ID throws exceptions: {noformat} java.lang.IllegalArgumentException: Invalid sort column: Executor ID at org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) {noformat} {noformat} java.lang.IllegalArgumentException: Invalid sort column: Host at org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at {noformat} !image-2018-02-13-16-50-32-600.png! 
> Sorting tasks by Host / Executor ID on the Stage page does not work > --- > > Key: SPARK-23413 > URL: https://issues.apache.org/jira/browse/SPARK-23413 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0, 2.4.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Sorting tasks by Host / Executor ID throws exceptions: > {code} > java.lang.IllegalArgumentException: Invalid sort column: Executor ID at > org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at > org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at > org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at > org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at > org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at > org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at > org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) > {code} > {code} > java.lang.IllegalArgumentException: Invalid sort column: Host at > org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at >
[jira] [Commented] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work
[ https://issues.apache.org/jira/browse/SPARK-23413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362513#comment-16362513 ] Attila Zsolt Piros commented on SPARK-23413: I am working on that. > Sorting tasks by Host / Executor ID on the Stage page does not work > --- > > Key: SPARK-23413 > URL: https://issues.apache.org/jira/browse/SPARK-23413 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0, 2.4.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Sorting tasks by Host / Executor ID throws exceptions: > {noformat} > java.lang.IllegalArgumentException: Invalid sort column: Executor ID at > org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at > org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at > org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at > org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at > org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at > org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at > org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) > {noformat} > {noformat} > java.lang.IllegalArgumentException: Invalid sort column: Host at > org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at > org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at > org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at > org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at > org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at > org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at > org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at > 
org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > {noformat} > !image-2018-02-13-16-50-32-600.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work
Attila Zsolt Piros created SPARK-23413: -- Summary: Sorting tasks by Host / Executor ID on the Stage page does not work Key: SPARK-23413 URL: https://issues.apache.org/jira/browse/SPARK-23413 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.3.0, 2.4.0 Reporter: Attila Zsolt Piros Sorting tasks by Host / Executor ID throws exceptions: {noformat} java.lang.IllegalArgumentException: Invalid sort column: Executor ID at org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) {noformat} {noformat} java.lang.IllegalArgumentException: Invalid sort column: Host at org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at {noformat} !image-2018-02-13-16-50-32-600.png! 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23412) Add cosine distance measure to BisectingKMeans
[ https://issues.apache.org/jira/browse/SPARK-23412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362493#comment-16362493 ] Apache Spark commented on SPARK-23412: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/20600 > Add cosine distance measure to BisectingKMeans > -- > > Key: SPARK-23412 > URL: https://issues.apache.org/jira/browse/SPARK-23412 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Minor > > SPARK-22119 introduced cosine distance for KMeans. > This ticket is to support the cosine distance measure on BisectingKMeans too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23412) Add cosine distance measure to BisectingKMeans
[ https://issues.apache.org/jira/browse/SPARK-23412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23412: Assignee: Apache Spark > Add cosine distance measure to BisectingKMeans > -- > > Key: SPARK-23412 > URL: https://issues.apache.org/jira/browse/SPARK-23412 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Apache Spark >Priority: Minor > > SPARK-22119 introduced cosine distance for KMeans. > This ticket is to support the cosine distance measure on BisectingKMeans too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23412) Add cosine distance measure to BisectingKMeans
[ https://issues.apache.org/jira/browse/SPARK-23412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23412: Assignee: (was: Apache Spark) > Add cosine distance measure to BisectingKMeans > -- > > Key: SPARK-23412 > URL: https://issues.apache.org/jira/browse/SPARK-23412 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Minor > > SPARK-22119 introduced cosine distance for KMeans. > This ticket is to support the cosine distance measure on BisectingKMeans too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23412) Add cosine distance measure to BisectingKMeans
Marco Gaido created SPARK-23412: --- Summary: Add cosine distance measure to BisectingKMeans Key: SPARK-23412 URL: https://issues.apache.org/jira/browse/SPARK-23412 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.4.0 Reporter: Marco Gaido SPARK-22119 introduced cosine distance for KMeans. This ticket is to support the cosine distance measure on BisectingKMeans too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
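For context, the distance measure in question (added for KMeans by SPARK-22119 and proposed here for BisectingKMeans) is plain cosine distance. A minimal Python sketch, independent of Spark's actual Scala implementation:

```python
import math

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity: 0 for parallel vectors,
    # 1 for orthogonal ones, up to 2 for opposite ones.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))  # ~0.0 (parallel)
```

Unlike Euclidean distance, this measure ignores vector magnitude, which is why it needs its own cluster-cost and centroid-update handling inside the clustering code rather than being a drop-in swap.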
[jira] [Assigned] (SPARK-20659) Remove StorageStatus, or make it private.
[ https://issues.apache.org/jira/browse/SPARK-20659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-20659: -- Assignee: Attila Zsolt Piros > Remove StorageStatus, or make it private. > - > > Key: SPARK-20659 > URL: https://issues.apache.org/jira/browse/SPARK-20659 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 2.4.0 > > > With the work being done in SPARK-18085, StorageStatus is not used anymore by > the UI. It's still used in a couple of other places, though: > - {{SparkContext.getExecutorStorageStatus}} > - {{BlockManagerSource}} (a metrics source) > Both could be changed to use the REST API types; the first one could be > replaced with a new method in {{SparkStatusTracker}}, which I also think is a > better place for it anyway. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20659) Remove StorageStatus, or make it private.
[ https://issues.apache.org/jira/browse/SPARK-20659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-20659. Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20546 [https://github.com/apache/spark/pull/20546] > Remove StorageStatus, or make it private. > - > > Key: SPARK-20659 > URL: https://issues.apache.org/jira/browse/SPARK-20659 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 2.4.0 > > > With the work being done in SPARK-18085, StorageStatus is not used anymore by > the UI. It's still used in a couple of other places, though: > - {{SparkContext.getExecutorStorageStatus}} > - {{BlockManagerSource}} (a metrics source) > Both could be changed to use the REST API types; the first one could be > replaced with a new method in {{SparkStatusTracker}}, which I also think is a > better place for it anyway. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23411) Deprecate SparkContext.getExecutorStorageStatus
[ https://issues.apache.org/jira/browse/SPARK-23411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-23411: --- Issue Type: Improvement (was: Bug) > Deprecate SparkContext.getExecutorStorageStatus > --- > > Key: SPARK-23411 > URL: https://issues.apache.org/jira/browse/SPARK-23411 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Minor > > This is another developer API in SparkContext that is better moved to a more > monitoring-minded place such as {{SparkStatusTracker}}. The same information > is also already available through the REST API. > Moreover, it exposes {{RDDInfo}}, which is a mutable internal Spark type, and > exposing that to applications is kinda sketchy. (It can't be made private, > though, since it's exposed indirectly in events posted to the listener bus.) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23411) Deprecate SparkContext.getExecutorStorageStatus
Marcelo Vanzin created SPARK-23411: -- Summary: Deprecate SparkContext.getExecutorStorageStatus Key: SPARK-23411 URL: https://issues.apache.org/jira/browse/SPARK-23411 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: Marcelo Vanzin This is another developer API in SparkContext that is better moved to a more monitoring-minded place such as {{SparkStatusTracker}}. The same information is also already available through the REST API. Moreover, it exposes {{RDDInfo}}, which is a mutable internal Spark type, and exposing that to applications is kinda sketchy. (It can't be made private, though, since it's exposed indirectly in events posted to the listener bus.) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21501) Spark shuffle index cache size should be memory based
[ https://issues.apache.org/jira/browse/SPARK-21501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362424#comment-16362424 ] Xun REN commented on SPARK-21501: - Hi guys, could you tell me how to figure out how much memory the NM with the Spark shuffle service has used? And how to find out how many reducers a Spark job has used? I have run into the same problem recently, and I want to get a list of Spark jobs sorted by number of reducers. Thanks. Regards, Xun REN. > Spark shuffle index cache size should be memory based > - > > Key: SPARK-21501 > URL: https://issues.apache.org/jira/browse/SPARK-21501 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: Thomas Graves >Assignee: Sanket Reddy >Priority: Major > Fix For: 2.3.0 > > > Right now the Spark shuffle service has a cache for index files. It is based > on the number of files cached (spark.shuffle.service.index.cache.entries). This can > cause issues if people have a lot of reducers, because the size of each entry > can fluctuate based on the number of reducers. > We saw an issue with a job that had 17 reducers and it caused the NM with the > Spark shuffle service to use 700-800MB of memory in the NM by itself. > We should change this cache to be memory based and only allow a certain > amount of memory to be used. When I say memory based, I mean the cache should have a > limit of, say, 100MB.
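The proposal above — bound the cache by total bytes instead of entry count — amounts to a size-weighted LRU cache. This is an illustrative Python sketch, not the shuffle service's actual (Scala/Java) implementation; the class name, sizes, and keys are made up:

```python
from collections import OrderedDict

class SizeBoundedCache:
    """LRU cache that evicts by total cached bytes, not entry count, so
    entries of fluctuating size (index files for many reducers) cannot
    blow past a fixed memory budget such as 100MB."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.total_bytes = 0
        self._entries = OrderedDict()  # key -> (value, size_in_bytes)

    def put(self, key, value, size):
        if key in self._entries:
            self.total_bytes -= self._entries.pop(key)[1]
        self._entries[key] = (value, size)
        self.total_bytes += size
        # Evict least-recently-used entries until back under budget.
        while self.total_bytes > self.max_bytes:
            _, (_, evicted_size) = self._entries.popitem(last=False)
            self.total_bytes -= evicted_size

    def get(self, key):
        value, size = self._entries.pop(key)
        self._entries[key] = (value, size)  # mark as most recently used
        return value

    def __contains__(self, key):
        return key in self._entries

cache = SizeBoundedCache(max_bytes=100)
cache.put("shuffle_0.index", b"...", size=60)
cache.put("shuffle_1.index", b"...", size=60)  # over budget: evicts shuffle_0
```

The count-based cache fails precisely because `size` varies per entry; weighing by bytes makes the worst-case footprint independent of the number of reducers.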
[jira] [Updated] (SPARK-21232) New built-in SQL function - Data_Type
[ https://issues.apache.org/jira/browse/SPARK-21232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21232: -- Fix Version/s: (was: 2.2.2) (was: 2.3.0) > New built-in SQL function - Data_Type > - > > Key: SPARK-21232 > URL: https://issues.apache.org/jira/browse/SPARK-21232 > Project: Spark > Issue Type: Improvement > Components: PySpark, SparkR, SQL >Affects Versions: 2.1.1 >Reporter: Mario Molina >Priority: Minor > > This function returns the data type of a given column. > {code:java} > data_type("a") > // returns string > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23395) Add an option to return an empty DataFrame from an RDD generated by a Hadoop file when there are no usable paths
[ https://issues.apache.org/jira/browse/SPARK-23395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-23395: -- Target Version/s: (was: 2.2.0, 2.2.1) > Add an option to return an empty DataFrame from an RDD generated by a Hadoop > file when there are no usable paths > > > Key: SPARK-23395 > URL: https://issues.apache.org/jira/browse/SPARK-23395 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.2.0, 2.2.1 >Reporter: Jens Rabe >Priority: Minor > Labels: DataFrame, HadoopInputFormat, RDD > > When using file-based data from custom formats, Spark's ability to use > Hadoop's FileInputFormats is very handy. However, when the path they are > pointed at contains no usable data, they throw an IOException saying "No > input paths specified in job". > It would be a nice feature if the DataFrame API somehow could capture this > and return an empty DataFrame instead of failing the job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23410) Unable to read jsons in charset different from UTF-8
Maxim Gekk created SPARK-23410: -- Summary: Unable to read jsons in charset different from UTF-8 Key: SPARK-23410 URL: https://issues.apache.org/jira/browse/SPARK-23410 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.3.0 Reporter: Maxim Gekk Currently the JSON parser is forced to read JSON files in UTF-8. This behavior breaks backward compatibility with Spark 2.2.1 and earlier versions, which can read JSON files in UTF-16, UTF-32 and other encodings thanks to the auto-detection mechanism of the Jackson library. We need to give users back the ability to read JSON files in a specified charset and/or to detect the charset automatically, as before.
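The incompatibility is easy to reproduce outside Spark. A small Python sketch (file name and record are made up): a JSON document written in UTF-16 decodes fine when the charset is given explicitly, but a reader that assumes UTF-8 fails on it, mirroring what the ticket describes for Spark 2.3.0's JSON parser.

```python
import json
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "utf16_example.json")
with open(path, "w", encoding="utf-16") as f:
    f.write('{"id": 1, "name": "example"}')

# Explicit charset: decodes fine.
with open(path, encoding="utf-16") as f:
    record = json.load(f)
print(record["name"])  # example

# UTF-8-only reader: fails on the UTF-16 bytes (BOM + NUL-padded chars).
try:
    with open(path, encoding="utf-8") as f:
        json.load(f)
except (UnicodeDecodeError, ValueError) as e:
    print("failed as expected:", type(e).__name__)
```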
[jira] [Created] (SPARK-23409) RandomForest/DecisionTree (syntactic) pruning of redundant subtrees
Alessandro Solimando created SPARK-23409: Summary: RandomForest/DecisionTree (syntactic) pruning of redundant subtrees Key: SPARK-23409 URL: https://issues.apache.org/jira/browse/SPARK-23409 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 2.2.1 Reporter: Alessandro Solimando

Improvement: eliminate redundancy from decision trees where all the leaves of a given subtree share the same prediction.

Benefits:
* Model interpretability
* Faster unitary model invocation (relevant for massive )
* Smaller model memory footprint

For instance, consider the following decision tree.

{panel:title=Original Decision Tree}
{noformat}
DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with 15 nodes
  If (feature 1 <= 0.5)
   If (feature 2 <= 0.5)
    If (feature 0 <= 0.5)
     Predict: 0.0
    Else (feature 0 > 0.5)
     Predict: 0.0
   Else (feature 2 > 0.5)
    If (feature 0 <= 0.5)
     Predict: 0.0
    Else (feature 0 > 0.5)
     Predict: 0.0
  Else (feature 1 > 0.5)
   If (feature 2 <= 0.5)
    If (feature 0 <= 0.5)
     Predict: 1.0
    Else (feature 0 > 0.5)
     Predict: 1.0
   Else (feature 2 > 0.5)
    If (feature 0 <= 0.5)
     Predict: 0.0
    Else (feature 0 > 0.5)
     Predict: 0.0
{noformat}
{panel}

The proposed method, taking the first tree as input, aims at producing the following (semantically equivalent) tree as output:

{panel:title=Pruned Decision Tree}
{noformat}
DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with 15 nodes
  If (feature 1 <= 0.5)
   Predict: 0.0
  Else (feature 1 > 0.5)
   If (feature 2 <= 0.5)
    Predict: 1.0
   Else (feature 2 > 0.5)
    Predict: 0.0
{noformat}
{panel}
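The pruning described above is a simple bottom-up pass: if both children of a node are leaves with the same prediction, replace the node with one such leaf, and repeat up the tree. A Python sketch on a toy binary-tree representation (the class and field names are illustrative, not Spark MLlib's internal node API):

```python
class Node:
    # Leaf: prediction set, no children. Internal: split label + children.
    def __init__(self, prediction=None, split=None, left=None, right=None):
        self.prediction, self.split = prediction, split
        self.left, self.right = left, right

    def is_leaf(self):
        return self.left is None and self.right is None

def prune(node):
    """Collapse every subtree whose leaves all share one prediction."""
    if node.is_leaf():
        return node
    node.left, node.right = prune(node.left), prune(node.right)
    if (node.left.is_leaf() and node.right.is_leaf()
            and node.left.prediction == node.right.prediction):
        return Node(prediction=node.left.prediction)  # collapse subtree
    return node

# The ticket's example: the whole "feature 1 <= 0.5" side predicts 0.0.
leaf = lambda p: Node(prediction=p)
tree = Node(split="feature 1 <= 0.5",
            left=Node(split="feature 2 <= 0.5",
                      left=Node(split="feature 0 <= 0.5",
                                left=leaf(0.0), right=leaf(0.0)),
                      right=Node(split="feature 0 <= 0.5",
                                 left=leaf(0.0), right=leaf(0.0))),
            right=Node(split="feature 2 <= 0.5",
                       left=Node(split="feature 0 <= 0.5",
                                 left=leaf(1.0), right=leaf(1.0)),
                       right=Node(split="feature 0 <= 0.5",
                                  left=leaf(0.0), right=leaf(0.0))))
pruned = prune(tree)
print(pruned.left.is_leaf(), pruned.left.prediction)  # True 0.0
```

Because the children are pruned before the parent is inspected, chains of redundant splits collapse in a single pass, reproducing the pruned tree shown in the ticket.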
[jira] [Created] (SPARK-23408) Flaky test: StreamingOuterJoinSuite.left outer early state exclusion on right
Marcelo Vanzin created SPARK-23408: -- Summary: Flaky test: StreamingOuterJoinSuite.left outer early state exclusion on right Key: SPARK-23408 URL: https://issues.apache.org/jira/browse/SPARK-23408 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 2.4.0 Reporter: Marcelo Vanzin Seen on an unrelated PR. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87386/testReport/org.apache.spark.sql.streaming/StreamingOuterJoinSuite/left_outer_early_state_exclusion_on_right/ {noformat} sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: Assert on query failed: Check total state rows = List(4), updated state rows = List(4): Array(1) did not equal List(4) incorrect updates rows org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) org.apache.spark.sql.streaming.StateStoreMetricsTest$$anonfun$assertNumStateRows$1.apply(StateStoreMetricsTest.scala:28) org.apache.spark.sql.streaming.StateStoreMetricsTest$$anonfun$assertNumStateRows$1.apply(StateStoreMetricsTest.scala:23) org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1$$anonfun$apply$14.apply$mcZ$sp(StreamTest.scala:568) org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:371) org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:568) org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:432) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) == Progress == AddData to MemoryStream[value#19652]: 3,4,5 AddData to MemoryStream[value#19662]: 1,2,3 CheckLastBatch: [3,10,6,9] => AssertOnQuery(, Check total state rows = List(4), updated state rows = List(4)) AddData to MemoryStream[value#19652]: 20 AddData to MemoryStream[value#19662]: 21 CheckLastBatch: AddData to 
MemoryStream[value#19662]: 20 CheckLastBatch: [20,30,40,60],[4,10,8,null],[5,10,10,null] == Stream == Output Mode: Append Stream state: {MemoryStream[value#19652]: 0,MemoryStream[value#19662]: 0} Thread state: alive Thread stack trace: java.lang.Thread.sleep(Native Method) org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:152) org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:120) org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279) org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) {noformat} No other failures in the history, though. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23364) 'desc table' command in spark-sql add column head display
[ https://issues.apache.org/jira/browse/SPARK-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23364. --- Resolution: Not A Problem > 'desc table' command in spark-sql add column head display > - > > Key: SPARK-23364 > URL: https://issues.apache.org/jira/browse/SPARK-23364 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: guoxiaolongzte >Priority: Minor > Attachments: 1.png, 2.png > > > fix before: > !2.png! > fix after: > !1.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23407) add a config to try to inline all mutable states during codegen
[ https://issues.apache.org/jira/browse/SPARK-23407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23407: Assignee: Wenchen Fan (was: Apache Spark) > add a config to try to inline all mutable states during codegen > --- > > Key: SPARK-23407 > URL: https://issues.apache.org/jira/browse/SPARK-23407 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23407) add a config to try to inline all mutable states during codegen
[ https://issues.apache.org/jira/browse/SPARK-23407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23407: Assignee: Apache Spark (was: Wenchen Fan) > add a config to try to inline all mutable states during codegen > --- > > Key: SPARK-23407 > URL: https://issues.apache.org/jira/browse/SPARK-23407 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23407) add a config to try to inline all mutable states during codegen
[ https://issues.apache.org/jira/browse/SPARK-23407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362218#comment-16362218 ] Apache Spark commented on SPARK-23407: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/20599 > add a config to try to inline all mutable states during codegen > --- > > Key: SPARK-23407 > URL: https://issues.apache.org/jira/browse/SPARK-23407 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12140) Support Streaming UI in HistoryServer
[ https://issues.apache.org/jira/browse/SPARK-12140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362216#comment-16362216 ] German Schiavon Matteo commented on SPARK-12140: Hi guys, is there any progress about this? [~jerryshao] > Support Streaming UI in HistoryServer > - > > Key: SPARK-12140 > URL: https://issues.apache.org/jira/browse/SPARK-12140 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Priority: Major > > SPARK-11206 added infrastructure that would allow the streaming UI to be > shown in the History Server. We should add the necessary code to make that > happen, although it requires some changes to how events and listeners are > used. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23396) Spark HistoryServer will OOM if the event log is big
[ https://issues.apache.org/jira/browse/SPARK-23396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23396. --- Resolution: Not A Problem > Spark HistoryServer will OOM if the event log is big > > > Key: SPARK-23396 > URL: https://issues.apache.org/jira/browse/SPARK-23396 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: KaiXinXIaoLei >Priority: Major > Attachments: eventlog.png, historyServer.png > > > If the event log is big, the HistoryServer web UI will run out of memory. My > event log size is 5.1G: > > !eventlog.png! > When I open the web UI, it OOMs. > > !historyServer.png!