[jira] [Commented] (SPARK-23402) Dataset write method not working as expected for postgresql database

2018-02-13 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363584#comment-16363584
 ] 

kevin yu commented on SPARK-23402:
--

Thanks, I will install 9.5.8 and try again.

Sent from my iPhone



> Dataset write method not working as expected for postgresql database
> 
>
> Key: SPARK-23402
> URL: https://issues.apache.org/jira/browse/SPARK-23402
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
> Environment: PostgreSQL: 9.5.8 (10 + Also same issue)
> OS: Cent OS 7 & Windows 7,8
> JDBC: 9.4-1201-jdbc41
>  
> Spark:  I executed in both 2.1.0 and 2.2.1
> Mode: Standalone
> OS: Windows 7
>Reporter: Pallapothu Jyothi Swaroop
>Priority: Major
> Attachments: Emsku[1].jpg
>
>
> I am using Spark's Dataset write to insert data into an existing PostgreSQL 
> table, calling the write method with SaveMode.Append. Even though append mode 
> is specified, I get an exception saying the table already exists.
> It's strange: when I point the same options at SQL Server or Oracle, append 
> mode works as expected.
>  
> *Database Properties:*
> {{destinationProps.put("driver", "org.postgresql.Driver"); 
> destinationProps.put("url", "jdbc:postgresql://127.0.0.1:30001/dbmig"); 
> destinationProps.put("user", "dbmig");}}
> {{destinationProps.put("password", "dbmig");}}
>  
> *Dataset Write Code:*
> {{valueAnalysisDataset.write().mode(SaveMode.Append).jdbc(destinationDbMap.get("url"),
>  "dqvalue", destinationdbProperties);}} 
>  
>  
> {{Exception in thread "main" org.postgresql.util.PSQLException: ERROR: 
> relation "dqvalue" already exists at 
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2412)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2125)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:297) 
> at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428) at 
> org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354) at 
> org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:301) at 
> org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:287) at 
> org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:264) at 
> org.postgresql.jdbc.PgStatement.executeUpdate(PgStatement.java:244) at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:806)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) 
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609) 
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233) at 
> org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:460) at 
> com.ads.dqam.action.impl.PostgresValueAnalysis.persistValueAnalysis(PostgresValueAnalysis.java:25)
>  at 
> com.ads.dqam.action.AbstractValueAnalysis.persistAnalysis(AbstractValueAnalysis.java:81)
>  at com.ads.dqam.Analysis.doAnalysis(Analysis.java:32) at 
> com.ads.dqam.Client.main(Client.java:71)}}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23402) Dataset write method not working as expected for postgresql database

2018-02-13 Thread Pallapothu Jyothi Swaroop (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363555#comment-16363555
 ] 

Pallapothu Jyothi Swaroop edited comment on SPARK-23402 at 2/14/18 6:48 AM:


[~kevinyu98]

Thanks for checking again. I tested with 9.5.4, and append mode works without 
an exception.

I analyzed something that may be useful for you.


Can you check the following Scala file?
https://github.com/apache/spark/blob/v2.2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala


Below is the statement that checks whether the table exists; this is where it 
is failing:
  val tableExists = JdbcUtils.tableExists(conn, options)

But I am not sure why it is failing. I executed the table-exists SQL taken from 
the Postgres dialect, and it runs successfully against the database.


was (Author: swaroopp):
Thanks for checking again. I tested with 9.5.4, and append mode works without 
an exception.

I analyzed something that may be useful for you.


Can you check the following Scala file?
https://github.com/apache/spark/blob/v2.2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala


Below is the statement that checks whether the table exists; this is where it 
is failing:
  val tableExists = JdbcUtils.tableExists(conn, options)

But I am not sure why it is failing. I executed the table-exists SQL taken from 
the Postgres dialect, and it runs successfully against the database.

> Dataset write method not working as expected for postgresql database
> 
>
> Key: SPARK-23402
> URL: https://issues.apache.org/jira/browse/SPARK-23402
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
> Environment: PostgreSQL: 9.5.8 (10 + Also same issue)
> OS: Cent OS 7 & Windows 7,8
> JDBC: 9.4-1201-jdbc41
>  
> Spark:  I executed in both 2.1.0 and 2.2.1
> Mode: Standalone
> OS: Windows 7
>Reporter: Pallapothu Jyothi Swaroop
>Priority: Major
> Attachments: Emsku[1].jpg
>
>
> I am using Spark's Dataset write to insert data into an existing PostgreSQL 
> table, calling the write method with SaveMode.Append. Even though append mode 
> is specified, I get an exception saying the table already exists.
> It's strange: when I point the same options at SQL Server or Oracle, append 
> mode works as expected.
>  
> *Database Properties:*
> {{destinationProps.put("driver", "org.postgresql.Driver"); 
> destinationProps.put("url", "jdbc:postgresql://127.0.0.1:30001/dbmig"); 
> destinationProps.put("user", "dbmig");}}
> {{destinationProps.put("password", "dbmig");}}
>  
> *Dataset Write Code:*
> {{valueAnalysisDataset.write().mode(SaveMode.Append).jdbc(destinationDbMap.get("url"),
>  "dqvalue", destinationdbProperties);}} 
>  
>  
> {{Exception in thread "main" org.postgresql.util.PSQLException: ERROR: 
> relation "dqvalue" already exists at 
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2412)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2125)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:297) 
> at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428) at 
> org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354) at 
> org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:301) at 
> org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:287) at 
> org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:264) at 
> org.postgresql.jdbc.PgStatement.executeUpdate(PgStatement.java:244) at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:806)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> 

[jira] [Commented] (SPARK-23402) Dataset write method not working as expected for postgresql database

2018-02-13 Thread Pallapothu Jyothi Swaroop (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363555#comment-16363555
 ] 

Pallapothu Jyothi Swaroop commented on SPARK-23402:
---

Thanks for checking again. I tested with 9.5.4, and append mode works without 
an exception.

I analyzed something that may be useful for you.


Can you check the following Scala file?
https://github.com/apache/spark/blob/v2.2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala


Below is the statement that checks whether the table exists; this is where it 
is failing:
  val tableExists = JdbcUtils.tableExists(conn, options)

But I am not sure why it is failing. I executed the table-exists SQL taken from 
the Postgres dialect, and it runs successfully against the database.
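
For anyone who wants to repeat that check outside Spark, a minimal sketch 
follows (it assumes the PostgreSQL JDBC driver is on the classpath; the probe 
mirrors the PostgresDialect-style "SELECT 1 FROM <table> LIMIT 1" existence 
query, which may differ between Spark versions, and the connection details are 
the ones quoted in the report):

{code:scala}
import java.sql.DriverManager

val conn = DriverManager.getConnection(
  "jdbc:postgresql://127.0.0.1:30001/dbmig", "dbmig", "dbmig")
try {
  val stmt = conn.prepareStatement("SELECT 1 FROM dqvalue LIMIT 1")
  try {
    stmt.executeQuery()              // succeeds when the relation is reachable
    println("table exists")
  } catch {
    case e: java.sql.SQLException =>
      println(s"existence check failed: ${e.getMessage}")
  } finally {
    stmt.close()
  }
} finally {
  conn.close()
}
{code}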

> Dataset write method not working as expected for postgresql database
> 
>
> Key: SPARK-23402
> URL: https://issues.apache.org/jira/browse/SPARK-23402
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
> Environment: PostgreSQL: 9.5.8 (10 + Also same issue)
> OS: Cent OS 7 & Windows 7,8
> JDBC: 9.4-1201-jdbc41
>  
> Spark:  I executed in both 2.1.0 and 2.2.1
> Mode: Standalone
> OS: Windows 7
>Reporter: Pallapothu Jyothi Swaroop
>Priority: Major
> Attachments: Emsku[1].jpg
>
>
> I am using Spark's Dataset write to insert data into an existing PostgreSQL 
> table, calling the write method with SaveMode.Append. Even though append mode 
> is specified, I get an exception saying the table already exists.
> It's strange: when I point the same options at SQL Server or Oracle, append 
> mode works as expected.
>  
> *Database Properties:*
> {{destinationProps.put("driver", "org.postgresql.Driver"); 
> destinationProps.put("url", "jdbc:postgresql://127.0.0.1:30001/dbmig"); 
> destinationProps.put("user", "dbmig");}}
> {{destinationProps.put("password", "dbmig");}}
>  
> *Dataset Write Code:*
> {{valueAnalysisDataset.write().mode(SaveMode.Append).jdbc(destinationDbMap.get("url"),
>  "dqvalue", destinationdbProperties);}} 
>  
>  
> {{Exception in thread "main" org.postgresql.util.PSQLException: ERROR: 
> relation "dqvalue" already exists at 
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2412)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2125)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:297) 
> at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428) at 
> org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354) at 
> org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:301) at 
> org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:287) at 
> org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:264) at 
> org.postgresql.jdbc.PgStatement.executeUpdate(PgStatement.java:244) at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:806)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) 
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609) 
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233) at 
> org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:460) at 
> com.ads.dqam.action.impl.PostgresValueAnalysis.persistValueAnalysis(PostgresValueAnalysis.java:25)
>  at 
> 

[jira] [Commented] (SPARK-23402) Dataset write method not working as expected for postgresql database

2018-02-13 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363539#comment-16363539
 ] 

kevin yu commented on SPARK-23402:
--

Yes, I created an empty table (emptytable) in a database (mydb) in PostgreSQL, 
then ran the above statement from spark-shell, and it works fine. The only 
difference I see is that my Postgres is at 9.5.6, while yours is at 9.5.8+.
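
For context, a spark-shell sketch of the kind of append write being exercised 
here; this is a hedged reconstruction, not kevin's actual test, and the schema, 
sample rows, table name, and connection settings are illustrative:

{code:scala}
import java.util.Properties
import org.apache.spark.sql.SaveMode
import spark.implicits._   // provided by the spark-shell session

val props = new Properties()
props.setProperty("user", "dbmig")
props.setProperty("password", "dbmig")
props.setProperty("driver", "org.postgresql.Driver")

// The target table already exists in PostgreSQL; with SaveMode.Append the
// writer should only insert rows, not attempt another CREATE TABLE.
val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
df.write
  .mode(SaveMode.Append)
  .jdbc("jdbc:postgresql://127.0.0.1:30001/dbmig", "emptytable", props)
{code}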

> Dataset write method not working as expected for postgresql database
> 
>
> Key: SPARK-23402
> URL: https://issues.apache.org/jira/browse/SPARK-23402
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
> Environment: PostgreSQL: 9.5.8 (10 + Also same issue)
> OS: Cent OS 7 & Windows 7,8
> JDBC: 9.4-1201-jdbc41
>  
> Spark:  I executed in both 2.1.0 and 2.2.1
> Mode: Standalone
> OS: Windows 7
>Reporter: Pallapothu Jyothi Swaroop
>Priority: Major
> Attachments: Emsku[1].jpg
>
>
> I am using Spark's Dataset write to insert data into an existing PostgreSQL 
> table, calling the write method with SaveMode.Append. Even though append mode 
> is specified, I get an exception saying the table already exists.
> It's strange: when I point the same options at SQL Server or Oracle, append 
> mode works as expected.
>  
> *Database Properties:*
> {{destinationProps.put("driver", "org.postgresql.Driver"); 
> destinationProps.put("url", "jdbc:postgresql://127.0.0.1:30001/dbmig"); 
> destinationProps.put("user", "dbmig");}}
> {{destinationProps.put("password", "dbmig");}}
>  
> *Dataset Write Code:*
> {{valueAnalysisDataset.write().mode(SaveMode.Append).jdbc(destinationDbMap.get("url"),
>  "dqvalue", destinationdbProperties);}} 
>  
>  
> {{Exception in thread "main" org.postgresql.util.PSQLException: ERROR: 
> relation "dqvalue" already exists at 
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2412)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2125)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:297) 
> at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428) at 
> org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354) at 
> org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:301) at 
> org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:287) at 
> org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:264) at 
> org.postgresql.jdbc.PgStatement.executeUpdate(PgStatement.java:244) at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:806)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) 
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609) 
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233) at 
> org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:460) at 
> com.ads.dqam.action.impl.PostgresValueAnalysis.persistValueAnalysis(PostgresValueAnalysis.java:25)
>  at 
> com.ads.dqam.action.AbstractValueAnalysis.persistAnalysis(AbstractValueAnalysis.java:81)
>  at com.ads.dqam.Analysis.doAnalysis(Analysis.java:32) at 
> com.ads.dqam.Client.main(Client.java:71)}}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-23368) Avoid unnecessary Exchange or Sort after projection

2018-02-13 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated SPARK-23368:

Summary: Avoid unnecessary Exchange or Sort after projection  (was: 
OutputOrdering and OutputPartitioning in ProjectExec should reflect the 
projected columns)

> Avoid unnecessary Exchange or Sort after projection
> ---
>
> Key: SPARK-23368
> URL: https://issues.apache.org/jira/browse/SPARK-23368
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maryann Xue
>Priority: Minor
>
> After column rename projection, the ProjectExec's outputOrdering and 
> outputPartitioning should reflect the projected columns as well. For example,
> {code:java}
> SELECT b1
> FROM (
> SELECT a a1, b b1
> FROM testData2
> ORDER BY a
> )
> ORDER BY a1{code}
> The inner query is ordered on a1 as well. If we had a rule to eliminate Sort 
> on sorted result, together with this fix, the order-by in the outer query 
> could have been optimized out.
>  
> Similarly, the below query
> {code:java}
> SELECT *
> FROM (
> SELECT t1.a a1, t2.a a2, t1.b b1, t2.b b2
> FROM testData2 t1
> LEFT JOIN testData2 t2
> ON t1.a = t2.a
> )
> JOIN testData2 t3
> ON a1 = t3.a{code}
> is equivalent to
> {code:java}
> SELECT *
> FROM testData2 t1
> LEFT JOIN testData2 t2
> ON t1.a = t2.a
> JOIN testData2 t3
> ON t1.a = t3.a{code}
> , so the unnecessary sorting and hash-partitioning that have been optimized 
> out for the second query should have been eliminated in the first query as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23402) Dataset write method not working as expected for postgresql database

2018-02-13 Thread Pallapothu Jyothi Swaroop (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363512#comment-16363512
 ] 

Pallapothu Jyothi Swaroop commented on SPARK-23402:
---

[~kevinyu98] Did you create the table before executing the above instructions? 
It only throws the exception when the table already exists in the database. 
Please run the above statements again; you should get the exception. Let me 
know whether the issue is replicated.

> Dataset write method not working as expected for postgresql database
> 
>
> Key: SPARK-23402
> URL: https://issues.apache.org/jira/browse/SPARK-23402
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
> Environment: PostgreSQL: 9.5.8 (10 + Also same issue)
> OS: Cent OS 7 & Windows 7,8
> JDBC: 9.4-1201-jdbc41
>  
> Spark:  I executed in both 2.1.0 and 2.2.1
> Mode: Standalone
> OS: Windows 7
>Reporter: Pallapothu Jyothi Swaroop
>Priority: Major
> Attachments: Emsku[1].jpg
>
>
> I am using Spark's Dataset write to insert data into an existing PostgreSQL 
> table, calling the write method with SaveMode.Append. Even though append mode 
> is specified, I get an exception saying the table already exists.
> It's strange: when I point the same options at SQL Server or Oracle, append 
> mode works as expected.
>  
> *Database Properties:*
> {{destinationProps.put("driver", "org.postgresql.Driver"); 
> destinationProps.put("url", "jdbc:postgresql://127.0.0.1:30001/dbmig"); 
> destinationProps.put("user", "dbmig");}}
> {{destinationProps.put("password", "dbmig");}}
>  
> *Dataset Write Code:*
> {{valueAnalysisDataset.write().mode(SaveMode.Append).jdbc(destinationDbMap.get("url"),
>  "dqvalue", destinationdbProperties);}} 
>  
>  
> {{Exception in thread "main" org.postgresql.util.PSQLException: ERROR: 
> relation "dqvalue" already exists at 
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2412)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2125)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:297) 
> at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428) at 
> org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354) at 
> org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:301) at 
> org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:287) at 
> org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:264) at 
> org.postgresql.jdbc.PgStatement.executeUpdate(PgStatement.java:244) at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:806)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) 
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609) 
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233) at 
> org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:460) at 
> com.ads.dqam.action.impl.PostgresValueAnalysis.persistValueAnalysis(PostgresValueAnalysis.java:25)
>  at 
> com.ads.dqam.action.AbstractValueAnalysis.persistAnalysis(AbstractValueAnalysis.java:81)
>  at com.ads.dqam.Analysis.doAnalysis(Analysis.java:32) at 
> com.ads.dqam.Client.main(Client.java:71)}}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org

[jira] [Created] (SPARK-23420) Datasource loading not handling paths with regex chars.

2018-02-13 Thread Mitchell (JIRA)
Mitchell created SPARK-23420:


 Summary: Datasource loading not handling paths with regex chars.
 Key: SPARK-23420
 URL: https://issues.apache.org/jira/browse/SPARK-23420
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.2.1
Reporter: Mitchell


Greetings, during some recent testing I ran across an issue when attempting to 
load files whose paths contain glob/regex characters such as []()*. The files 
are valid in the various storage systems, and the normal Hadoop APIs all access 
them properly.

When my code is executed, I get the following stack trace.

18/02/14 04:52:46 ERROR yarn.ApplicationMaster: User class threw exception: 
java.io.IOException: Illegal file pattern: Unmatched closing ')' near index 130 
A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
 ^ java.io.IOException: Illegal file pattern: Unmatched closing ')' near index 
130 
A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
 ^ at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:71) at 
org.apache.hadoop.fs.GlobFilter.<init>(GlobFilter.java:50) at 
org.apache.hadoop.fs.Globber.doGlob(Globber.java:210) at 
org.apache.hadoop.fs.Globber.glob(Globber.java:149) at 
org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1955) at 
org.apache.hadoop.fs.s3a.S3AFileSystem.globStatus(S3AFileSystem.java:2477) at 
org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:234) at 
org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:244)
 at 
org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:618)
 at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
 at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
 at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
 at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
 at scala.collection.immutable.List.foreach(List.scala:381) at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at 
scala.collection.immutable.List.flatMap(List.scala:344) at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:349)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at 
org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533) at 
org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412) at 
com.sap.profile.SparkProfileTask.main(SparkProfileTask.java:95) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)
 Caused by: java.util.regex.PatternSyntaxException: Unmatched closing ')' near 
index 130 
A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
 ^ at java.util.regex.Pattern.error(Pattern.java:1955) at 
java.util.regex.Pattern.compile(Pattern.java:1700) at 
java.util.regex.Pattern.<init>(Pattern.java:1351) at 
java.util.regex.Pattern.compile(Pattern.java:1054) at 
org.apache.hadoop.fs.GlobPattern.set(GlobPattern.java:156) at 
org.apache.hadoop.fs.GlobPattern.<init>(GlobPattern.java:42) at 
org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:67) ... 25 more 18/02/14 
04:52:46 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, 
(reason: User class threw exception: java.io.IOException: Illegal file pattern: 
Unmatched closing ')' near index 130 
A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
 ^) 18/02/14 04:52:46 INFO spark.SparkContext: Invoking stop() from shutdown 
hook

 

Code is as follows ...

Dataset input = sqlContext.read().option("header", "true").option("sep", 
",").option("quote", "\"").option("charset", "utf8").option("escape", 

[jira] [Updated] (SPARK-23230) When hive.default.fileformat is other kinds of file types, create textfile table cause a serde error

2018-02-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23230:

Fix Version/s: 2.2.2

> When hive.default.fileformat is other kinds of file types, create textfile 
> table cause a serde error
> 
>
> Key: SPARK-23230
> URL: https://issues.apache.org/jira/browse/SPARK-23230
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Minor
> Fix For: 2.2.2, 2.3.0
>
>
> When hive.default.fileformat is set to another file type, creating a textfile 
> table causes a serde error.
>  We should treat the default serde of both textfile and sequencefile tables as 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.
> {code:java}
> set hive.default.fileformat=orc;
> create table tbl( i string ) stored as textfile;
> desc formatted tbl;
> Serde Library org.apache.hadoop.hive.ql.io.orc.OrcSerde
> InputFormat  org.apache.hadoop.mapred.TextInputFormat
> OutputFormat  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat{code}
>  
> {code:java}
> set hive.default.fileformat=orc;
> create table tbl stored as textfile
> as
> select  1
> {code}
> {{It failed because it used the wrong SERDE}}
> {code:java}
> Caused by: java.lang.ClassCastException: 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow cannot be cast to 
> org.apache.hadoop.io.BytesWritable
>   at 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat$1.write(HiveIgnoreKeyTextOutputFormat.java:91)
>   at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:327)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
>   ... 16 more
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23399) Register a task completion listener first for OrcColumnarBatchReader

2018-02-13 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23399.
-
   Resolution: Fixed
Fix Version/s: 2.3.1

Issue resolved by pull request 20590
[https://github.com/apache/spark/pull/20590]

> Register a task completion listener first for OrcColumnarBatchReader
> 
>
> Key: SPARK-23399
> URL: https://issues.apache.org/jira/browse/SPARK-23399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.1
>
>
> This is related to SPARK-23390.
> Currently, there is an open-file leak in OrcColumnarBatchReader.
> {code}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}
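
The general pattern behind this kind of fix, as a hedged sketch rather than 
the actual patch: register the task completion listener before any 
initialization that opens a file, so the handle is released even if the task 
is killed mid-initialization. The wrapper class and its open() callback are 
illustrative, not Spark code:

{code:scala}
import org.apache.spark.TaskContext

// `open` stands in for whatever allocates the leaky resource (here, the
// ORC reader's underlying file stream).
class LeakFreeReader(open: () => AutoCloseable) {
  private var resource: AutoCloseable = _

  def initialize(): Unit = {
    val ctx = TaskContext.get()
    if (ctx != null) {
      // Register the cleanup hook first, so a task killed during or right
      // after open() still closes the file.
      ctx.addTaskCompletionListener { _: TaskContext =>
        if (resource != null) resource.close()
      }
    }
    resource = open()
  }
}
{code}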



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23399) Register a task completion listener first for OrcColumnarBatchReader

2018-02-13 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23399:
---

Assignee: Dongjoon Hyun

> Register a task completion listener first for OrcColumnarBatchReader
> 
>
> Key: SPARK-23399
> URL: https://issues.apache.org/jira/browse/SPARK-23399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.1
>
>
> This is related to SPARK-23390.
> Currently, there is an open-file leak in OrcColumnarBatchReader.
> {code}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23419) data source v2 write path should re-throw interruption exceptions directly

2018-02-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363385#comment-16363385
 ] 

Apache Spark commented on SPARK-23419:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/20605

> data source v2 write path should re-throw interruption exceptions directly
> --
>
> Key: SPARK-23419
> URL: https://issues.apache.org/jira/browse/SPARK-23419
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23419) data source v2 write path should re-throw interruption exceptions directly

2018-02-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23419:


Assignee: Apache Spark  (was: Wenchen Fan)

> data source v2 write path should re-throw interruption exceptions directly
> --
>
> Key: SPARK-23419
> URL: https://issues.apache.org/jira/browse/SPARK-23419
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23419) data source v2 write path should re-throw interruption exceptions directly

2018-02-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23419:


Assignee: Wenchen Fan  (was: Apache Spark)

> data source v2 write path should re-throw interruption exceptions directly
> --
>
> Key: SPARK-23419
> URL: https://issues.apache.org/jira/browse/SPARK-23419
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23416) flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false

2018-02-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363386#comment-16363386
 ] 

Apache Spark commented on SPARK-23416:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/20605

> flaky test: 
> org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress
>  test for failOnDataLoss=false
> --
>
> Key: SPARK-23416
> URL: https://issues.apache.org/jira/browse/SPARK-23416
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Minor
>
> I suspect this is a race condition latent in the DataSourceV2 write path, or 
> at least the interaction of that write path with StreamTest.
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/]
> h3. Error Message
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted.
> h3. Stacktrace
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted. at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
>  Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing 
> job aborted. at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at 
> org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:483)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:482)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> 

[jira] [Updated] (SPARK-23419) data source v2 write path should re-throw interruption exceptions directly

2018-02-13 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-23419:

Summary: data source v2 write path should re-throw interruption exceptions 
directly  (was: data source v2 write path should re-throw fatal errors directly)

> data source v2 write path should re-throw interruption exceptions directly
> --
>
> Key: SPARK-23419
> URL: https://issues.apache.org/jira/browse/SPARK-23419
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23377) Bucketizer with multiple columns persistence bug

2018-02-13 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363367#comment-16363367
 ] 

Liang-Chi Hsieh commented on SPARK-23377:
-

I agree with what [~mlnick] said.

> Bucketizer with multiple columns persistence bug
> 
>
> Key: SPARK-23377
> URL: https://issues.apache.org/jira/browse/SPARK-23377
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Critical
>
> A Bucketizer with multiple input/output columns gets "inputCol" set to the 
> default value on write -> read, which causes it to throw an error on 
> transform. Here's an example.
> {code:java}
> import org.apache.spark.ml.feature._
> val splits = Array(Double.NegativeInfinity, 0, 10, 100, 
> Double.PositiveInfinity)
> val bucketizer = new Bucketizer()
>   .setSplitsArray(Array(splits, splits))
>   .setInputCols(Array("foo1", "foo2"))
>   .setOutputCols(Array("bar1", "bar2"))
> val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
> bucketizer.transform(data)
> val path = "/temp/bucketrizer-persist-test"
> bucketizer.write.overwrite.save(path)
> val bucketizerAfterRead = Bucketizer.read.load(path)
> println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
> // This line throws an error because "outputCol" is set
> bucketizerAfterRead.transform(data)
> {code}
> And the trace:
> {code:java}
> java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has 
> the inputCols Param set for multi-column transform. The following Params are 
> not applicable and should not be set: outputCol.
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
>   at 
> line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-6079631:17)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23419) data source v2 write path should re-throw fatal errors directly

2018-02-13 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-23419:

Summary: data source v2 write path should re-throw fatal errors directly  
(was: data source v2 write path should re-throw FetchFailedException directly)

> data source v2 write path should re-throw fatal errors directly
> ---
>
> Key: SPARK-23419
> URL: https://issues.apache.org/jira/browse/SPARK-23419
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23419) data source v2 write path should re-throw FetchFailedException directly

2018-02-13 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-23419:
---

 Summary: data source v2 write path should re-throw 
FetchFailedException directly
 Key: SPARK-23419
 URL: https://issues.apache.org/jira/browse/SPARK-23419
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12140) Support Streaming UI in HistoryServer

2018-02-13 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363300#comment-16363300
 ] 

Saisai Shao commented on SPARK-12140:
-

[~gschiavon] the community has concerns about supporting the Streaming UI in the 
history server, mainly due to scalability issues, so there has been no progress 
on it.

> Support Streaming UI in HistoryServer
> -
>
> Key: SPARK-12140
> URL: https://issues.apache.org/jira/browse/SPARK-12140
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> SPARK-11206 added infrastructure that would allow the streaming UI to be 
> shown in the History Server. We should add the necessary code to make that 
> happen, although it requires some changes to how events and listeners are 
> used.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-13 Thread Márcio Furlani Carmona (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363217#comment-16363217
 ] 

Márcio Furlani Carmona commented on SPARK-23308:


{quote}if your input stream is doing abort/reopen on seek & positioned read, 
then you get many more S3 requests when reading columnar data
{quote}
Good point! I'll see if I can get the actual S3 TPS we're reaching.

 

Regarding the SSE-KMS, we're not using it for encryption, so that shouldn't be 
a problem.

> ignoreCorruptFiles should not ignore retryable IOException
> --
>
> Key: SPARK-23308
> URL: https://issues.apache.org/jira/browse/SPARK-23308
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Márcio Furlani Carmona
>Priority: Minor
>
> When `spark.sql.files.ignoreCorruptFiles` is set, it ignores any kind of 
> RuntimeException or IOException, but some IOExceptions may happen even if the 
> file is not corrupted.
> One example is SocketTimeoutException, which can be retried to fetch the data; 
> it does not mean the data is corrupted.
>  
> See: 
> https://github.com/apache/spark/blob/e30e2698a2193f0bbdcd4edb884710819ab6397c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L163
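
A hedged illustration of the distinction being proposed, not Spark's actual 
code: a transient network timeout can simply be retried, while any other 
exception keeps propagating and can then be handled by the existing 
ignoreCorruptFiles logic. The helper name and retry count are made up:

{code:scala}
import java.net.SocketTimeoutException

// Retry a transient timeout a few times instead of treating the read as a
// corrupt-file situation; any other exception propagates unchanged.
def withRetries[T](attemptsLeft: Int = 3)(body: => T): T =
  try body
  catch {
    case _: SocketTimeoutException if attemptsLeft > 1 =>
      withRetries(attemptsLeft - 1)(body)
  }

// Example: wrap the per-file read so only genuinely corrupt files are skipped.
// val bytes = withRetries() { openAndRead(path) }   // openAndRead is hypothetical
{code}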



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23235) Add executor Threaddump to api

2018-02-13 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-23235.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20474
[https://github.com/apache/spark/pull/20474]

> Add executor Threaddump to api
> --
>
> Key: SPARK-23235
> URL: https://issues.apache.org/jira/browse/SPARK-23235
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Attila Zsolt Piros
>Priority: Minor
>  Labels: newbie
> Fix For: 2.4.0
>
>
> It looks like the thread dump {{/executors/threadDump/?executorId=[id]}} is 
> only available in the UI, not in the REST API at all.  This is especially a 
> pain because that page in the UI has extra formatting, which makes it awkward 
> to send the output to somebody else (most likely you click "expand all" and 
> then copy-paste that, which is OK but formatted weirdly).  We might also just 
> want a "format=raw" option even in the UI.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23235) Add executor Threaddump to api

2018-02-13 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-23235:


Assignee: Attila Zsolt Piros

> Add executor Threaddump to api
> --
>
> Key: SPARK-23235
> URL: https://issues.apache.org/jira/browse/SPARK-23235
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Attila Zsolt Piros
>Priority: Minor
>  Labels: newbie
> Fix For: 2.4.0
>
>
> It looks like the thread dump {{/executors/threadDump/?executorId=[id]}} is 
> only available in the UI, not in the REST API at all.  This is especially a 
> pain because that page in the UI has extra formatting, which makes it awkward 
> to send the output to somebody else (most likely you click "expand all" and 
> then copy-paste that, which is OK but formatted weirdly).  We might also just 
> want a "format=raw" option even in the UI.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23351) checkpoint corruption in long running application

2018-02-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363133#comment-16363133
 ] 

Sean Owen commented on SPARK-23351:
---

[~davidahern] for the record, it's usually the opposite. Vendor branches like 
Cloudera's say "2.2.0" but should be read as "2.2.x". As a rule they vary from 
the upstream branch only in which commits go into a maintenance branch. This 
fix could be in a vendor 2.2.x release without being in an upstream one (or 
vice versa), or appear earlier in a vendor's maintenance branch. That's the 
value proposition, in theory; in this particular case I have no idea.

> checkpoint corruption in long running application
> -
>
> Key: SPARK-23351
> URL: https://issues.apache.org/jira/browse/SPARK-23351
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: David Ahern
>Priority: Major
>
> Hi, after leaving my (somewhat high-volume) Structured Streaming application 
> running for some time, I get the following exception.  The same exception 
> also repeats when I try to restart the application.  The only way to get the 
> application running again is to clear the checkpoint directory, which is far 
> from ideal.
> Maybe a stream is not being flushed/closed properly internally by Spark when 
> checkpointing?
>  
>  User class threw exception: 
> org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to 
> stage failure: Task 55 in stage 1.0 failed 4 times, most recent failure: Lost 
> task 55.3 in stage 1.0 (TID 240, gbslixaacspa04u.metis.prd, executor 2): 
> java.io.EOFException
>  at java.io.DataInputStream.readInt(DataInputStream.java:392)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:481)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358)
>  at scala.Option.getOrElse(Option.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:265)
>  at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:200)
>  at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:108)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23365) DynamicAllocation with failure in straggler task can lead to a hung spark job

2018-02-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23365:


Assignee: (was: Apache Spark)

> DynamicAllocation with failure in straggler task can lead to a hung spark job
> -
>
> Key: SPARK-23365
> URL: https://issues.apache.org/jira/browse/SPARK-23365
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Imran Rashid
>Priority: Major
>
> Dynamic Allocation can lead to a spark app getting stuck with 0 executors 
> requested when the executors in the last tasks of a taskset fail (eg. with an 
> OOM).
> This happens when {{ExecutorAllocationManager}}'s internal target number of 
> executors gets out of sync with {{CoarseGrainedSchedulerBackend}}'s target 
> number.  {{EAM}} updates the {{CGSB}} in two ways: (1) it tracks how many 
> tasks are active or pending in submitted stages, and computes how many 
> executors would be needed for them.  And as tasks finish, it will actively 
> decrease that count, informing the {{CGSB}} along the way.  (2) When it 
> decides executors are inactive for long enough, then it requests that 
> {{CGSB}} kill the executors -- this also tells the {{CGSB}} to update its 
> target number of executors: 
> https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L622
> So when there is just one task left, you could have the following sequence of 
> events:
> (1) the {{EAM}} sets the desired number of executors to 1, and updates the 
> {{CGSB}} too
> (2) while that final task is still running, the other executors cross the 
> idle timeout, and the {{EAM}} requests the {{CGSB}} kill them
> (3) now the {{EAM}} has a target of 1 executor, and the {{CGSB}} has a target 
> of 0 executors
> If the final task completed normally now, everything would be OK; the next 
> taskset would get submitted, the {{EAM}} would increase the target number of 
> executors and it would update the {{CGSB}}.
> But if the executor for that final task failed (eg. an OOM), then the {{EAM}} 
> thinks it [doesn't need to update 
> anything|https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L384-L386],
>  because its target is already 1, which is all it needs for that final task; 
> and the {{CGSB}} doesn't update anything either since its target is 0.
> I think you can determine if this is the cause of a stuck app by looking for
> {noformat}
> yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
> {noformat}
> in the logs of the ApplicationMaster (at least on yarn).
> You can reproduce this with this test app, run with {{--conf 
> "spark.dynamicAllocation.minExecutors=1" --conf 
> "spark.dynamicAllocation.maxExecutors=5" --conf 
> "spark.dynamicAllocation.executorIdleTimeout=5s"}}
> {code}
> import org.apache.spark.SparkEnv
> sc.setLogLevel("INFO")
> sc.parallelize(1 to 1, 1).count()
> val execs = sc.parallelize(1 to 1000, 1000).map { _ => 
> SparkEnv.get.executorId}.collect().toSet
> val badExec = execs.head
> println("will kill exec " + badExec)
> new Thread() {
>   override def run(): Unit = {
> Thread.sleep(1)
> println("about to kill exec " + badExec)
> sc.killExecutor(badExec)
>   }
> }.start()
> sc.parallelize(1 to 5, 5).mapPartitions { itr =>
>   val exec = SparkEnv.get.executorId
>   if (exec == badExec) {
> Thread.sleep(2) // long enough that all the other tasks finish, and 
> the executors cross the idle timeout
> // meanwhile, something else should kill this executor
> itr
>   } else {
> itr
>   }
> }.collect()
> {code}
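To make the sequence of events concrete, here is a toy model of the desync in plain Scala (illustrative only, not Spark code):

{code:scala}
object AllocationDesyncToy {
  var eamTarget = 1    // ExecutorAllocationManager: one task left, so target is 1
  var cgsbTarget = 1   // CoarseGrainedSchedulerBackend: synced in step (1)

  // Step (2): killing idle executors also lowers the backend's target.
  def killIdleExecutors(n: Int): Unit = cgsbTarget = math.max(0, cgsbTarget - n)

  // The EAM only pushes a new total to the backend when its own target changes.
  def eamRecompute(needed: Int): Unit =
    if (needed != eamTarget) { eamTarget = needed; cgsbTarget = needed }

  def main(args: Array[String]): Unit = {
    killIdleExecutors(1)  // an idle executor crosses the timeout and is killed
    eamRecompute(1)       // the final task's executor dies; EAM still needs 1, so no sync
    println(s"EAM target = $eamTarget, CGSB target = $cgsbTarget")  // 1 vs 0: hung job
  }
}
{code}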



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23365) DynamicAllocation with failure in straggler task can lead to a hung spark job

2018-02-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23365:


Assignee: Apache Spark

> DynamicAllocation with failure in straggler task can lead to a hung spark job
> -
>
> Key: SPARK-23365
> URL: https://issues.apache.org/jira/browse/SPARK-23365
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Imran Rashid
>Assignee: Apache Spark
>Priority: Major
>
> Dynamic Allocation can lead to a spark app getting stuck with 0 executors 
> requested when the executors in the last tasks of a taskset fail (eg. with an 
> OOM).
> This happens when {{ExecutorAllocationManager}}'s internal target number of 
> executors gets out of sync with {{CoarseGrainedSchedulerBackend}}'s target 
> number.  {{EAM}} updates the {{CGSB}} in two ways: (1) it tracks how many 
> tasks are active or pending in submitted stages, and computes how many 
> executors would be needed for them.  And as tasks finish, it will actively 
> decrease that count, informing the {{CGSB}} along the way.  (2) When it 
> decides executors are inactive for long enough, then it requests that 
> {{CGSB}} kill the executors -- this also tells the {{CGSB}} to update its 
> target number of executors: 
> https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L622
> So when there is just one task left, you could have the following sequence of 
> events:
> (1) the {{EAM}} sets the desired number of executors to 1, and updates the 
> {{CGSB}} too
> (2) while that final task is still running, the other executors cross the 
> idle timeout, and the {{EAM}} requests the {{CGSB}} kill them
> (3) now the {{EAM}} has a target of 1 executor, and the {{CGSB}} has a target 
> of 0 executors
> If the final task completed normally now, everything would be OK; the next 
> taskset would get submitted, the {{EAM}} would increase the target number of 
> executors and it would update the {{CGSB}}.
> But if the executor for that final task failed (eg. an OOM), then the {{EAM}} 
> thinks it [doesn't need to update 
> anything|https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L384-L386],
>  because its target is already 1, which is all it needs for that final task; 
> and the {{CGSB}} doesn't update anything either since its target is 0.
> I think you can determine if this is the cause of a stuck app by looking for
> {noformat}
> yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
> {noformat}
> in the logs of the ApplicationMaster (at least on yarn).
> You can reproduce this with this test app, run with {{--conf 
> "spark.dynamicAllocation.minExecutors=1" --conf 
> "spark.dynamicAllocation.maxExecutors=5" --conf 
> "spark.dynamicAllocation.executorIdleTimeout=5s"}}
> {code}
> import org.apache.spark.SparkEnv
> sc.setLogLevel("INFO")
> sc.parallelize(1 to 1, 1).count()
> val execs = sc.parallelize(1 to 1000, 1000).map { _ => 
> SparkEnv.get.executorId}.collect().toSet
> val badExec = execs.head
> println("will kill exec " + badExec)
> new Thread() {
>   override def run(): Unit = {
> Thread.sleep(1)
> println("about to kill exec " + badExec)
> sc.killExecutor(badExec)
>   }
> }.start()
> sc.parallelize(1 to 5, 5).mapPartitions { itr =>
>   val exec = SparkEnv.get.executorId
>   if (exec == badExec) {
> Thread.sleep(2) // long enough that all the other tasks finish, and 
> the executors cross the idle timeout
> // meanwhile, something else should kill this executor
> itr
>   } else {
> itr
>   }
> }.collect()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23365) DynamicAllocation with failure in straggler task can lead to a hung spark job

2018-02-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363130#comment-16363130
 ] 

Apache Spark commented on SPARK-23365:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/20604

> DynamicAllocation with failure in straggler task can lead to a hung spark job
> -
>
> Key: SPARK-23365
> URL: https://issues.apache.org/jira/browse/SPARK-23365
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Imran Rashid
>Priority: Major
>
> Dynamic Allocation can lead to a spark app getting stuck with 0 executors 
> requested when the executors in the last tasks of a taskset fail (eg. with an 
> OOM).
> This happens when {{ExecutorAllocationManager}}'s internal target number of 
> executors gets out of sync with {{CoarseGrainedSchedulerBackend}}'s target 
> number.  {{EAM}} updates the {{CGSB}} in two ways: (1) it tracks how many 
> tasks are active or pending in submitted stages, and computes how many 
> executors would be needed for them.  And as tasks finish, it will actively 
> decrease that count, informing the {{CGSB}} along the way.  (2) When it 
> decides executors are inactive for long enough, then it requests that 
> {{CGSB}} kill the executors -- this also tells the {{CGSB}} to update its 
> target number of executors: 
> https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L622
> So when there is just one task left, you could have the following sequence of 
> events:
> (1) the {{EAM}} sets the desired number of executors to 1, and updates the 
> {{CGSB}} too
> (2) while that final task is still running, the other executors cross the 
> idle timeout, and the {{EAM}} requests the {{CGSB}} kill them
> (3) now the {{EAM}} has a target of 1 executor, and the {{CGSB}} has a target 
> of 0 executors
> If the final task completed normally now, everything would be OK; the next 
> taskset would get submitted, the {{EAM}} would increase the target number of 
> executors and it would update the {{CGSB}}.
> But if the executor for that final task failed (eg. an OOM), then the {{EAM}} 
> thinks it [doesn't need to update 
> anything|https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L384-L386],
>  because its target is already 1, which is all it needs for that final task; 
> and the {{CGSB}} doesn't update anything either since its target is 0.
> I think you can determine if this is the cause of a stuck app by looking for
> {noformat}
> yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
> {noformat}
> in the logs of the ApplicationMaster (at least on yarn).
> You can reproduce this with this test app, run with {{--conf 
> "spark.dynamicAllocation.minExecutors=1" --conf 
> "spark.dynamicAllocation.maxExecutors=5" --conf 
> "spark.dynamicAllocation.executorIdleTimeout=5s"}}
> {code}
> import org.apache.spark.SparkEnv
> sc.setLogLevel("INFO")
> sc.parallelize(1 to 1, 1).count()
> val execs = sc.parallelize(1 to 1000, 1000).map { _ => 
> SparkEnv.get.executorId}.collect().toSet
> val badExec = execs.head
> println("will kill exec " + badExec)
> new Thread() {
>   override def run(): Unit = {
> Thread.sleep(1)
> println("about to kill exec " + badExec)
> sc.killExecutor(badExec)
>   }
> }.start()
> sc.parallelize(1 to 5, 5).mapPartitions { itr =>
>   val exec = SparkEnv.get.executorId
>   if (exec == badExec) {
> Thread.sleep(2) // long enough that all the other tasks finish, and 
> the executors cross the idle timeout
> // meanwhile, something else should kill this executor
> itr
>   } else {
> itr
>   }
> }.collect()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23418) DataSourceV2 should not allow userSpecifiedSchema without ReadSupportWithSchema

2018-02-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23418:


Assignee: Apache Spark

> DataSourceV2 should not allow userSpecifiedSchema without 
> ReadSupportWithSchema
> ---
>
> Key: SPARK-23418
> URL: https://issues.apache.org/jira/browse/SPARK-23418
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Apache Spark
>Priority: Major
>
> DataSourceV2 currently does not reject user-specified schemas when a source 
> does not implement ReadSupportWithSchema. This is confusing behavior. Here's 
> a quote from a discussion on SPARK-23203:
> {quote}I think this will cause confusion when source schemas change. Also, I 
> can't think of a situation where it is a good idea to pass a schema that is 
> ignored.
> Here's an example of how this will be confusing: think of a job that supplies 
> a schema identical to the table's schema and runs fine, so it goes into 
> production. What happens when the table's schema changes? If someone adds a 
> column to the table, then the job will start failing and report that the 
> source doesn't support user-supplied schemas, even though it had previously 
> worked just fine with a user-supplied schema. In addition, the change to the 
> table is actually compatible with the old job because the new column will be 
> removed by a projection.
> To fix this situation, it may be tempting to use the user-supplied schema as 
> an initial projection. But that doesn't make sense because we don't need two 
> projection mechanisms. If we used this as a second way to project, it would 
> be confusing that you can't actually leave out columns (at least for CSV) and 
> it would be odd that using this path you can coerce types, which should 
> usually be done by Spark.
> I think it is best not to allow a user-supplied schema when it isn't 
> supported by a source.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23418) DataSourceV2 should not allow userSpecifiedSchema without ReadSupportWithSchema

2018-02-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23418:


Assignee: (was: Apache Spark)

> DataSourceV2 should not allow userSpecifiedSchema without 
> ReadSupportWithSchema
> ---
>
> Key: SPARK-23418
> URL: https://issues.apache.org/jira/browse/SPARK-23418
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> DataSourceV2 currently does not reject user-specified schemas when a source 
> does not implement ReadSupportWithSchema. This is confusing behavior. Here's 
> a quote from a discussion on SPARK-23203:
> {quote}I think this will cause confusion when source schemas change. Also, I 
> can't think of a situation where it is a good idea to pass a schema that is 
> ignored.
> Here's an example of how this will be confusing: think of a job that supplies 
> a schema identical to the table's schema and runs fine, so it goes into 
> production. What happens when the table's schema changes? If someone adds a 
> column to the table, then the job will start failing and report that the 
> source doesn't support user-supplied schemas, even though it had previously 
> worked just fine with a user-supplied schema. In addition, the change to the 
> table is actually compatible with the old job because the new column will be 
> removed by a projection.
> To fix this situation, it may be tempting to use the user-supplied schema as 
> an initial projection. But that doesn't make sense because we don't need two 
> projection mechanisms. If we used this as a second way to project, it would 
> be confusing that you can't actually leave out columns (at least for CSV) and 
> it would be odd that using this path you can coerce types, which should 
> usually be done by Spark.
> I think it is best not to allow a user-supplied schema when it isn't 
> supported by a source.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23418) DataSourceV2 should not allow userSpecifiedSchema without ReadSupportWithSchema

2018-02-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363127#comment-16363127
 ] 

Apache Spark commented on SPARK-23418:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/20603

> DataSourceV2 should not allow userSpecifiedSchema without 
> ReadSupportWithSchema
> ---
>
> Key: SPARK-23418
> URL: https://issues.apache.org/jira/browse/SPARK-23418
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> DataSourceV2 currently does not reject user-specified schemas when a source 
> does not implement ReadSupportWithSchema. This is confusing behavior. Here's 
> a quote from a discussion on SPARK-23203:
> {quote}I think this will cause confusion when source schemas change. Also, I 
> can't think of a situation where it is a good idea to pass a schema that is 
> ignored.
> Here's an example of how this will be confusing: think of a job that supplies 
> a schema identical to the table's schema and runs fine, so it goes into 
> production. What happens when the table's schema changes? If someone adds a 
> column to the table, then the job will start failing and report that the 
> source doesn't support user-supplied schemas, even though it had previously 
> worked just fine with a user-supplied schema. In addition, the change to the 
> table is actually compatible with the old job because the new column will be 
> removed by a projection.
> To fix this situation, it may be tempting to use the user-supplied schema as 
> an initial projection. But that doesn't make sense because we don't need two 
> projection mechanisms. If we used this as a second way to project, it would 
> be confusing that you can't actually leave out columns (at least for CSV) and 
> it would be odd that using this path you can coerce types, which should 
> usually be done by Spark.
> I think it is best not to allow a user-supplied schema when it isn't 
> supported by a source.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23418) DataSourceV2 should not allow userSpecifiedSchema without ReadSupportWithSchema

2018-02-13 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-23418:
-

 Summary: DataSourceV2 should not allow userSpecifiedSchema without 
ReadSupportWithSchema
 Key: SPARK-23418
 URL: https://issues.apache.org/jira/browse/SPARK-23418
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Ryan Blue


DataSourceV2 currently does not reject user-specified schemas when a source 
does not implement ReadSupportWithSchema. This is confusing behavior. Here's a 
quote from a discussion on SPARK-23203:
{quote}I think this will cause confusion when source schemas change. Also, I 
can't think of a situation where it is a good idea to pass a schema that is 
ignored.

Here's an example of how this will be confusing: think of a job that supplies a 
schema identical to the table's schema and runs fine, so it goes into 
production. What happens when the table's schema changes? If someone adds a 
column to the table, then the job will start failing and report that the source 
doesn't support user-supplied schemas, even though it had previously worked 
just fine with a user-supplied schema. In addition, the change to the table is 
actually compatible with the old job because the new column will be removed by 
a projection.

To fix this situation, it may be tempting to use the user-supplied schema as an 
initial projection. But that doesn't make sense because we don't need two 
projection mechanisms. If we used this as a second way to project, it would be 
confusing that you can't actually leave out columns (at least for CSV) and it 
would be odd that using this path you can coerce types, which should usually be 
done by Spark.

I think it is best not to allow a user-supplied schema when it isn't supported 
by a source.
{quote}
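A minimal sketch of the check this ticket asks for, assuming the Spark 2.3 DataSourceV2 interfaces ({{org.apache.spark.sql.sources.v2.DataSourceV2}} and {{ReadSupportWithSchema}}) as I understand them; illustrative only, not the actual patch:

{code:scala}
import org.apache.spark.sql.sources.v2.{DataSourceV2, ReadSupportWithSchema}
import org.apache.spark.sql.types.StructType

object SchemaSupportCheck {
  // Fail fast instead of silently ignoring a schema the source cannot honor.
  def validate(source: DataSourceV2, userSpecifiedSchema: Option[StructType]): Unit = {
    userSpecifiedSchema.foreach { _ =>
      if (!source.isInstanceOf[ReadSupportWithSchema]) {
        throw new UnsupportedOperationException(
          s"${source.getClass.getName} does not support user-specified schemas")
      }
    }
  }
}
{code}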



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23351) checkpoint corruption in long running application

2018-02-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363115#comment-16363115
 ] 

Shixiong Zhu commented on SPARK-23351:
--

[~davidahern] It's better to ask the vendor for support. They may have a 
different release cycle.

> checkpoint corruption in long running application
> -
>
> Key: SPARK-23351
> URL: https://issues.apache.org/jira/browse/SPARK-23351
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: David Ahern
>Priority: Major
>
> Hi, after leaving my (somewhat high volume) Structured Streaming application 
> running for some time, I get the following exception.  The same exception 
> also repeats when I try to restart the application.  The only way to get the 
> application running again is to clear the checkpoint directory, which is far 
> from ideal.
> Maybe a stream is not being flushed/closed properly internally by Spark when 
> checkpointing?
>  
>  User class threw exception: 
> org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to 
> stage failure: Task 55 in stage 1.0 failed 4 times, most recent failure: Lost 
> task 55.3 in stage 1.0 (TID 240, gbslixaacspa04u.metis.prd, executor 2): 
> java.io.EOFException
>  at java.io.DataInputStream.readInt(DataInputStream.java:392)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:481)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358)
>  at scala.Option.getOrElse(Option.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:265)
>  at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:200)
>  at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:108)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23402) Dataset write method not working as expected for postgresql database

2018-02-13 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363110#comment-16363110
 ] 

kevin yu commented on SPARK-23402:
--

I just tried with PostgreSQL 9.5.6 on x86_64-pc-linux-gnu, with JDBC driver 
postgresql-9.4.1210, and I can't reproduce the error:

scala> val df1 = Seq((1)).toDF("c1")
df1: org.apache.spark.sql.DataFrame = [c1: int]

scala> df1.write.mode(SaveMode.Append).jdbc("jdbc:postgresql://9.30.167.220:5432/mydb", "emptytable", destProperties)

scala> val df3 = spark.read.jdbc("jdbc:postgresql://9.30.167.220:5432/mydb", "emptytable", destProperties).show
+---+
| c1|
+---+
|  1|
+---+

df3: Unit = ()

> Dataset write method not working as expected for postgresql database
> 
>
> Key: SPARK-23402
> URL: https://issues.apache.org/jira/browse/SPARK-23402
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
> Environment: PostgreSQL: 9.5.8 (10 + Also same issue)
> OS: Cent OS 7 & Windows 7,8
> JDBC: 9.4-1201-jdbc41
>  
> Spark:  I executed in both 2.1.0 and 2.2.1
> Mode: Standalone
> OS: Windows 7
>Reporter: Pallapothu Jyothi Swaroop
>Priority: Major
> Attachments: Emsku[1].jpg
>
>
> I am using spark dataset write to insert data on postgresql existing table. 
> For this I am using  write method mode as append mode. While using i am 
> getting exception like table already exists. But, I gave option as append 
> mode.
> It's strange. When i change options to sqlserver/oracle append mode is 
> working as expected.
>  
> *Database Properties:*
> {{destinationProps.put("driver", "org.postgresql.Driver"); 
> destinationProps.put("url", "jdbc:postgresql://127.0.0.1:30001/dbmig"); 
> destinationProps.put("user", "dbmig");}}
> {{destinationProps.put("password", "dbmig");}}
>  
> *Dataset Write Code:*
> {{valueAnalysisDataset.write().mode(SaveMode.Append).jdbc(destinationDbMap.get("url"),
>  "dqvalue", destinationdbProperties);}} 
>  
>  
> {{Exception in thread "main" org.postgresql.util.PSQLException: ERROR: 
> relation "dqvalue" already exists at 
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2412)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2125)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:297) 
> at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428) at 
> org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354) at 
> org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:301) at 
> org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:287) at 
> org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:264) at 
> org.postgresql.jdbc.PgStatement.executeUpdate(PgStatement.java:244) at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:806)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) 
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609) 
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233) at 
> org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:460) at 
> com.ads.dqam.action.impl.PostgresValueAnalysis.persistValueAnalysis(PostgresValueAnalysis.java:25)
>  

[jira] [Commented] (SPARK-23411) Deprecate SparkContext.getRDDStorageInfo

2018-02-13 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363037#comment-16363037
 ] 

Marcelo Vanzin commented on SPARK-23411:


Argh, copied the wrong method name. Fixed.

> Deprecate SparkContext.getRDDStorageInfo
> 
>
> Key: SPARK-23411
> URL: https://issues.apache.org/jira/browse/SPARK-23411
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> This is another developer API in SparkContext that is better moved to a more 
> monitoring-minded place such as {{SparkStatusTracker}}. The same information 
> is also already available through the REST API.
> Moreover, it exposes {{RDDInfo}}, which is a mutable internal Spark type, and 
> exposing that to applications is kinda sketchy. (It can't be made private, 
> though, since it's exposed indirectly in events posted to the listener bus.)
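For reference, a hedged spark-shell sketch of the developer API in question (assuming a live {{SparkContext}} named {{sc}}); the suggestion here is to read the same data from {{SparkStatusTracker}} or the REST API instead:

{code:scala}
val rdd = sc.parallelize(1 to 100).cache()
rdd.count()

// Returns Array[RDDInfo], a mutable internal type -- part of the reason for deprecation.
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached")
}
{code}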



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23411) Deprecate SparkContext.getRDDStorageInfo

2018-02-13 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-23411:
---
Summary: Deprecate SparkContext.getRDDStorageInfo  (was: Deprecate 
SparkContext.getExecutorStorageStatus)

> Deprecate SparkContext.getRDDStorageInfo
> 
>
> Key: SPARK-23411
> URL: https://issues.apache.org/jira/browse/SPARK-23411
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> This is another developer API in SparkContext that is better moved to a more 
> monitoring-minded place such as {{SparkStatusTracker}}. The same information 
> is also already available through the REST API.
> Moreover, it exposes {{RDDInfo}}, which is a mutable internal Spark type, and 
> exposing that to applications is kinda sketchy. (It can't be made private, 
> though, since it's exposed indirectly in events posted to the listener bus.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23414) Plotting using matplotlib in MLlib pyspark

2018-02-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363029#comment-16363029
 ] 

Sean Owen commented on SPARK-23414:
---

matplotlib doesn't interact with Spark, so issues with using it are unlikely to 
be relevant to Spark itself anyway.

> Plotting using matplotlib in MLlib pyspark 
> ---
>
> Key: SPARK-23414
> URL: https://issues.apache.org/jira/browse/SPARK-23414
> Project: Spark
>  Issue Type: Question
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Waleed Esmail
>Priority: Major
>
> Dear MLlib experts,
> I just want to plot a fancy confusion matrix (true values vs predicted 
> values) like the one produced by seaborn module in python, so I did the 
> following:
> {code:python}
> # imports needed to run this snippet
> import numpy as np
> import matplotlib.pyplot as plt
> import seaborn as sns
> from pyspark.ml import Pipeline
> from pyspark.ml.feature import StringIndexer, VectorIndexer
> from pyspark.ml.classification import DecisionTreeClassifier
> labelIndexer = StringIndexer(inputCol="label", 
> outputCol="indexedLabel").fit(output)
> # Automatically identify categorical features, and index them.
> # We specify maxCategories so features with > 4 distinct values are treated 
> as continuous.
> featureIndexer = VectorIndexer(inputCol="features", 
> outputCol="indexedFeatures").fit(output)
> # Split the data into training and test sets (30% held out for testing)
> (trainingData, testData) = output.randomSplit([0.7, 0.3])
> dt = DecisionTreeClassifier(labelCol="indexedLabel", 
> featuresCol="indexedFeatures", maxDepth=15)
> # Chain indexers and tree in a Pipeline
> pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])
> # Train model.  This also runs the indexers.
> model = pipeline.fit(trainingData)
> # Make predictions.
> predictions = model.transform(testData)
> predictionAndLabels = predictions.select("prediction", "indexedLabel")
> y_predicted = np.array(predictions.select("prediction").collect())
> y_test = np.array(predictions.select("indexedLabel").collect())
> from sklearn.metrics import confusion_matrix
> import matplotlib.ticker as ticker
> figcm, ax = plt.subplots()
> cm = confusion_matrix(y_test, y_predicted)
> # for normalization
> cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
> sns.heatmap(cm, square=True, annot=True, cbar=False)
> plt.xlabel('prediction')
> plt.ylabel('true value')
> {code}
> Is this the right way to do it? Please note that I am new to Spark and MLlib.
>  
> Thank you in advance,



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23414) Plotting using matplotlib in MLlib pyspark

2018-02-13 Thread Waleed Esmail (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363025#comment-16363025
 ] 

Waleed Esmail commented on SPARK-23414:
---

I am sorry, I didn't get it. What do you mean by "orthogonal"?

> Plotting using matplotlib in MLlib pyspark 
> ---
>
> Key: SPARK-23414
> URL: https://issues.apache.org/jira/browse/SPARK-23414
> Project: Spark
>  Issue Type: Question
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Waleed Esmail
>Priority: Major
>
> Dear MLlib experts,
> I just want to plot a fancy confusion matrix (true values vs predicted 
> values) like the one produced by seaborn module in python, so I did the 
> following:
> {code:python}
> # imports needed to run this snippet
> import numpy as np
> import matplotlib.pyplot as plt
> import seaborn as sns
> from pyspark.ml import Pipeline
> from pyspark.ml.feature import StringIndexer, VectorIndexer
> from pyspark.ml.classification import DecisionTreeClassifier
> labelIndexer = StringIndexer(inputCol="label", 
> outputCol="indexedLabel").fit(output)
> # Automatically identify categorical features, and index them.
> # We specify maxCategories so features with > 4 distinct values are treated 
> as continuous.
> featureIndexer = VectorIndexer(inputCol="features", 
> outputCol="indexedFeatures").fit(output)
> # Split the data into training and test sets (30% held out for testing)
> (trainingData, testData) = output.randomSplit([0.7, 0.3])
> dt = DecisionTreeClassifier(labelCol="indexedLabel", 
> featuresCol="indexedFeatures", maxDepth=15)
> # Chain indexers and tree in a Pipeline
> pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])
> # Train model.  This also runs the indexers.
> model = pipeline.fit(trainingData)
> # Make predictions.
> predictions = model.transform(testData)
> predictionAndLabels = predictions.select("prediction", "indexedLabel")
> y_predicted = np.array(predictions.select("prediction").collect())
> y_test = np.array(predictions.select("indexedLabel").collect())
> from sklearn.metrics import confusion_matrix
> import matplotlib.ticker as ticker
> figcm, ax = plt.subplots()
> cm = confusion_matrix(y_test, y_predicted)
> # for normalization
> cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
> sns.heatmap(cm, square=True, annot=True, cbar=False)
> plt.xlabel('prediction')
> plt.ylabel('true value')
> {code}
> Is this the right way to do it? Please note that I am new to Spark and MLlib.
>  
> Thank you in advance,



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23417) pyspark tests give wrong sbt instructions

2018-02-13 Thread Jose Torres (JIRA)
Jose Torres created SPARK-23417:
---

 Summary: pyspark tests give wrong sbt instructions
 Key: SPARK-23417
 URL: https://issues.apache.org/jira/browse/SPARK-23417
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Jose Torres


When running python/run-tests, the script indicates that I must run "'build/sbt 
assembly/package streaming-kafka-0-8-assembly/assembly' or 'build/mvn 
-Pkafka-0-8 package'". The sbt command fails:

 

[error] Expected ID character
[error] Not a valid command: streaming-kafka-0-8-assembly
[error] Expected project ID
[error] Expected configuration
[error] Expected ':' (if selecting a configuration)
[error] Expected key
[error] Not a valid key: streaming-kafka-0-8-assembly
[error] streaming-kafka-0-8-assembly/assembly
[error] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23400) Add the extra constructors for ScalaUDF

2018-02-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-23400:
-
Fix Version/s: 2.4.0

> Add the extra constructors for ScalaUDF
> ---
>
> Key: SPARK-23400
> URL: https://issues.apache.org/jira/browse/SPARK-23400
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> Over the last few releases, we changed the interface of ScalaUDF. 
> Unfortunately, some Spark packages (spark-deep-learning) are using our 
> internal class `ScalaUDF`. In release 2.3, we added new parameters to this 
> class. Users hit binary compatibility issues and got the exception:
> > java.lang.NoSuchMethodError: 
> > org.apache.spark.sql.catalyst.expressions.ScalaUDF.init(Ljava/lang/Object;Lorg/apache/spark/sql/types/DataType;Lscala/collection/Seq;Lscala/collection/Seq;Lscala/Option;)V
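A generic, hypothetical sketch of the fix pattern (not the real {{ScalaUDF}} class): keeping the old parameter list around as an auxiliary constructor restores binary compatibility for code compiled against the previous release.

{code:scala}
class MyExpr(
    val function: AnyRef,
    val children: Seq[String],
    val inputTypes: Seq[String],
    val udfName: Option[String]) {  // parameter added in the newer release

  // Extra constructor matching the old signature, so bytecode that invokes the
  // 3-argument <init> still links.
  def this(function: AnyRef, children: Seq[String], inputTypes: Seq[String]) =
    this(function, children, inputTypes, None)
}
{code}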



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23400) Add the extra constructors for ScalaUDF

2018-02-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-23400.
--
   Resolution: Fixed
Fix Version/s: 2.3.1

Issue resolved by pull request 20591
[https://github.com/apache/spark/pull/20591]

> Add the extra constructors for ScalaUDF
> ---
>
> Key: SPARK-23400
> URL: https://issues.apache.org/jira/browse/SPARK-23400
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.3.1
>
>
> Over the last few releases, we changed the interface of ScalaUDF. 
> Unfortunately, some Spark packages (spark-deep-learning) are using our 
> internal class `ScalaUDF`. In release 2.3, we added new parameters to this 
> class. Users hit binary compatibility issues and got the exception:
> > java.lang.NoSuchMethodError: 
> > org.apache.spark.sql.catalyst.expressions.ScalaUDF.init(Ljava/lang/Object;Lorg/apache/spark/sql/types/DataType;Lscala/collection/Seq;Lscala/collection/Seq;Lscala/Option;)V



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23416) flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false

2018-02-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362950#comment-16362950
 ] 

Apache Spark commented on SPARK-23416:
--

User 'jose-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/20602

> flaky test: 
> org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress
>  test for failOnDataLoss=false
> --
>
> Key: SPARK-23416
> URL: https://issues.apache.org/jira/browse/SPARK-23416
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Minor
>
> I suspect this is a race condition latent in the DataSourceV2 write path, or 
> at least the interaction of that write path with StreamTest.
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/]
> h3. Error Message
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted.
> h3. Stacktrace
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted. at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
>  Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing 
> job aborted. at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at 
> org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:483)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:482)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> 

[jira] [Assigned] (SPARK-23416) flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false

2018-02-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23416:


Assignee: Apache Spark

> flaky test: 
> org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress
>  test for failOnDataLoss=false
> --
>
> Key: SPARK-23416
> URL: https://issues.apache.org/jira/browse/SPARK-23416
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>Priority: Minor
>
> I suspect this is a race condition latent in the DataSourceV2 write path, or 
> at least the interaction of that write path with StreamTest.
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/]
> h3. Error Message
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted.
> h3. Stacktrace
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted. at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
>  Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing 
> job aborted. at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at 
> org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:483)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:482)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> 

[jira] [Assigned] (SPARK-23416) flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false

2018-02-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23416:


Assignee: (was: Apache Spark)

> flaky test: 
> org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress
>  test for failOnDataLoss=false
> --
>
> Key: SPARK-23416
> URL: https://issues.apache.org/jira/browse/SPARK-23416
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Minor
>
> I suspect this is a race condition latent in the DataSourceV2 write path, or 
> at least the interaction of that write path with StreamTest.
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/]
> h3. Error Message
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted.
> h3. Stacktrace
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted. at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
>  Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing 
> job aborted. at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at 
> org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:483)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:482)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> 

[jira] [Comment Edited] (SPARK-23416) flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false

2018-02-13 Thread Jose Torres (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362941#comment-16362941
 ] 

Jose Torres edited comment on SPARK-23416 at 2/13/18 7:48 PM:
--

I think I see the problem.
 * StreamExecution.stop() works by interrupting the stream execution thread. 
This is not safe in general and can throw a wide variety of exceptions.
 * StreamExecution.isInterruptedByStop() solves this problem by implementing a 
whitelist of exceptions which indicate that stop() happened.
 * The v2 write path adds calls to ThreadUtils.awaitResult(), which were not in 
the v1 write path and which (if the interrupt happens to land inside them) throw 
a new exception that isn't accounted for.

I'm going to write a PR to add another whitelist entry. This whole edifice is a 
bit fragile, but I don't have a good solution for that.


was (Author: joseph.torres):
I think I see the problem.
 * StreamExecution.stop() works by interrupting the stream execution thread. 
This is not safe in general, and can throw any variety of exceptions.
 * StreamExecution.isInterruptedByStop() solves this problem by implementing a 
whitelist of exceptions which indicate the stop() happened.
 * The v2 write path adds calls to ThreadUtils.awaitResult(), which weren't in 
the V1 write path and (if the interrupt happens to fall in them) throw a new 
exception which isn't accounted for.

I'm going to write a PR to add another whitelist entry, but this is quite 
fragile.
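
A minimal illustrative sketch of the whitelist pattern described above (the 
method name and exception cases here are simplified assumptions, not Spark's 
actual StreamExecution code):

{code:scala}
import java.io.InterruptedIOException
import java.util.concurrent.ExecutionException

// Decide whether a Throwable was caused by the interrupt that stop() delivers
// to the stream execution thread. Simplified sketch; names are assumptions.
def isInterruptedByStop(e: Throwable): Boolean = e match {
  case _: InterruptedException | _: InterruptedIOException => true
  case ee: ExecutionException
      if ee.getCause != null && ee.getCause.isInstanceOf[InterruptedException] => true
  case _ =>
    // The proposed fix amounts to adding one more case here for whatever
    // exception the v2 write path's ThreadUtils.awaitResult() call surfaces
    // when it is interrupted.
    false
}
{code}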

> flaky test: 
> org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress
>  test for failOnDataLoss=false
> --
>
> Key: SPARK-23416
> URL: https://issues.apache.org/jira/browse/SPARK-23416
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Minor
>
> I suspect this is a race condition latent in the DataSourceV2 write path, or 
> at least the interaction of that write path with StreamTest.
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/]
> h3. Error Message
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted.
> h3. Stacktrace
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted. at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
>  Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing 
> job aborted. at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at 
> org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  

[jira] [Commented] (SPARK-23416) flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false

2018-02-13 Thread Jose Torres (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362941#comment-16362941
 ] 

Jose Torres commented on SPARK-23416:
-

I think I see the problem.
 * StreamExecution.stop() works by interrupting the stream execution thread. 
This is not safe in general and can throw a wide variety of exceptions.
 * StreamExecution.isInterruptedByStop() solves this problem by implementing a 
whitelist of exceptions which indicate that stop() happened.
 * The v2 write path adds calls to ThreadUtils.awaitResult(), which were not in 
the v1 write path and which (if the interrupt happens to land inside them) throw 
a new exception that isn't accounted for.

I'm going to write a PR to add another whitelist entry, but this is quite 
fragile.

> flaky test: 
> org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress
>  test for failOnDataLoss=false
> --
>
> Key: SPARK-23416
> URL: https://issues.apache.org/jira/browse/SPARK-23416
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Minor
>
> I suspect this is a race condition latent in the DataSourceV2 write path, or 
> at least the interaction of that write path with StreamTest.
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/]
> h3. Error Message
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted.
> h3. Stacktrace
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted. at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
>  Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing 
> job aborted. at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at 
> org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:483)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:482)
>  at 
> 

[jira] [Commented] (SPARK-23416) flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false

2018-02-13 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362920#comment-16362920
 ] 

Marco Gaido commented on SPARK-23416:
-

I see this failing with the following stacktrace as well:


{code:java}
sbt.ForkMain$ForkError: org.apache.spark.sql.streaming.StreamingQueryException: 
Query memory [id = cca87cf7-0532-41af-b757-0948ec294c0c, runId = 
c1830af6-1715-4947-bd76-a1a63482280b] terminated with exception: null
at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Caused by: sbt.ForkMain$ForkError: java.lang.InterruptedException: null
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
at 
org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runContinuous(ContinuousExecution.scala:271)
at 
org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runActivatedStream(ContinuousExecution.scala:89)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
... 1 more
{code}

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87401/testReport/org.apache.spark.sql.kafka010/KafkaContinuousSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/

> flaky test: 
> org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress
>  test for failOnDataLoss=false
> --
>
> Key: SPARK-23416
> URL: https://issues.apache.org/jira/browse/SPARK-23416
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Minor
>
> I suspect this is a race condition latent in the DataSourceV2 write path, or 
> at least the interaction of that write path with StreamTest.
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/]
> h3. Error Message
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted.
> h3. Stacktrace
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted. at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
>  Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing 
> job aborted. at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at 

[jira] [Commented] (SPARK-23416) flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false

2018-02-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362913#comment-16362913
 ] 

Dongjoon Hyun commented on SPARK-23416:
---

Thank you for filing this!

> flaky test: 
> org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress
>  test for failOnDataLoss=false
> --
>
> Key: SPARK-23416
> URL: https://issues.apache.org/jira/browse/SPARK-23416
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Minor
>
> I suspect this is a race condition latent in the DataSourceV2 write path, or 
> at least the interaction of that write path with StreamTest.
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/]
> h3. Error Message
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted.
> h3. Stacktrace
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted. at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
>  Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing 
> job aborted. at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at 
> org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:483)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:482)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> 

[jira] [Resolved] (SPARK-23154) Document backwards compatibility guarantees for ML persistence

2018-02-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-23154.
---
   Resolution: Fixed
Fix Version/s: 2.4.0
   2.3.1

Resolved via https://github.com/apache/spark/pull/20592

> Document backwards compatibility guarantees for ML persistence
> --
>
> Key: SPARK-23154
> URL: https://issues.apache.org/jira/browse/SPARK-23154
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> We have (as far as I know) maintained backwards compatibility for ML 
> persistence, but this is not documented anywhere.  I'd like us to document it 
> (for spark.ml, not for spark.mllib).
> I'd recommend something like:
> {quote}
> In general, MLlib maintains backwards compatibility for ML persistence.  
> I.e., if you save an ML model or Pipeline in one version of Spark, then you 
> should be able to load it back and use it in a future version of Spark.  
> However, there are rare exceptions, described below.
> Model persistence: Is a model or Pipeline saved using Apache Spark ML 
> persistence in Spark version X loadable by Spark version Y?
> * Major versions: No guarantees, but best-effort.
> * Minor and patch versions: Yes; these are backwards compatible.
> * Note about the format: There are no guarantees for a stable persistence 
> format, but model loading itself is designed to be backwards compatible.
> Model behavior: Does a model or Pipeline in Spark version X behave 
> identically in Spark version Y?
> * Major versions: No guarantees, but best-effort.
> * Minor and patch versions: Identical behavior, except for bug fixes.
> For both model persistence and model behavior, any breaking changes across a 
> minor version or patch version are reported in the Spark version release 
> notes. If a breakage is not reported in release notes, then it should be 
> treated as a bug to be fixed.
> {quote}
> How does this sound?
> Note: We unfortunately don't have tests for backwards compatibility (which 
> has technical hurdles and can be discussed in [SPARK-15573]).  However, we 
> have made efforts to maintain it during PR review and Spark release QA, and 
> most users expect it.
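
For reference, the save/load round trip that these guarantees cover looks 
roughly like the sketch below (a minimal example; the {{training}} DataFrame, 
column names and paths are placeholders, not part of the proposal):

{code:scala}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Fit a small pipeline; `training` is assumed to be a DataFrame with
// "text" and "label" columns.
val pipeline = new Pipeline().setStages(Array(
  new Tokenizer().setInputCol("text").setOutputCol("words"),
  new HashingTF().setInputCol("words").setOutputCol("features"),
  new LogisticRegression().setMaxIter(10)))
val model = pipeline.fit(training)

// Save with Spark version X ...
model.write.overwrite().save("/tmp/spark-ml/lr-pipeline-model")

// ... and load with Spark version Y. Per the proposed guarantee, this should
// work across minor and patch versions, and on a best-effort basis across
// major versions.
val sameModel = PipelineModel.load("/tmp/spark-ml/lr-pipeline-model")
{code}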



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23344) Add KMeans distanceMeasure param to PySpark

2018-02-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362902#comment-16362902
 ] 

Sean Owen commented on SPARK-23344:
---

Nah, I think the regulars have different views on this, which nevertheless work 
OK for them. It often becomes a problem for people trying to inflate a JIRA 
count, and they generally won't have read or paid attention to threads like 
that anyway.

> Add KMeans distanceMeasure param to PySpark
> ---
>
> Key: SPARK-23344
> URL: https://issues.apache.org/jira/browse/SPARK-23344
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> SPARK-22119 introduced a new parameter for KMeans, i.e. {{distanceMeasure}}. 
> We should also add it to the Python interface.
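
For context, the Scala-side parameter added by SPARK-22119 is used as in the 
sketch below; the PySpark change would simply expose the same param through the 
Python wrapper ({{dataset}} and its "features" column are placeholders here):

{code:scala}
import org.apache.spark.ml.clustering.KMeans

// Scala API introduced by SPARK-22119; `dataset` is assumed to be a DataFrame
// with a "features" vector column.
val kmeans = new KMeans()
  .setK(3)
  .setSeed(1L)
  .setDistanceMeasure("cosine")  // the new param; "euclidean" is the default
val model = kmeans.fit(dataset)
{code}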



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23159) Update Cloudpickle to match version 0.4.3

2018-02-13 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-23159:
-
Description: 
Update PySpark's version of Cloudpickle to match version 0.4.3.  The reasons 
for doing this are:
 * Pick up bug fixes, improvements with newer version
 * Match a specific version as close as possible (Spark has additional changes 
that might be necessary) to make future upgrades easier

There are newer versions of Cloudpickle that can fix bugs with NamedTuple 
pickling (that Spark currently has workarounds for), but these include other 
changes that need some verification before bringing into Spark.  Upgrading 
first to 0.4.3 will help make this verification easier.

Discussion on the mailing list: 
[http://apache-spark-developers-list.1001551.n3.nabble.com/Thoughts-on-Cloudpickle-Update-td23188.html]

Upgrading to the recent release of v0.4.3 will include the following:
 * Fix pickling of named tuples 
[https://github.com/cloudpipe/cloudpickle/pull/113]
 * Built in type constructors for PyPy compatibility 
[https://github.com/cloudpipe/cloudpickle/commit/d84980ccaafc7982a50d4e04064011f401f17d1b]
 * Fix memoryview support [https://github.com/cloudpipe/cloudpickle/pull/122]
 * Improved compatibility with other cloudpickle versions 
[https://github.com/cloudpipe/cloudpickle/pull/128]
 * Several cleanups [https://github.com/cloudpipe/cloudpickle/pull/121] and 
[https://github.com/cloudpipe/cloudpickle/commit/c91aaf110441991307f5097f950764079d0f9652]
 * [MRG] Regression on pickling classes from the __main__ module 
[https://github.com/cloudpipe/cloudpickle/pull/149]
 * BUG: Handle instance methods of builtin types 
[https://github.com/cloudpipe/cloudpickle/pull/154]
 * Fix #129 : do not silence RuntimeError in dump() 
[https://github.com/cloudpipe/cloudpickle/pull/153]

  was:
Update PySpark's version of Cloudpickle to match version 0.4.2.  The reasons 
for doing this are:
 * Pick up bug fixes, improvements with newer version
 * Match a specific version as close as possible (Spark has additional changes 
that might be necessary) to make future upgrades easier

There are newer versions of Cloudpickle that can fix bugs with NamedTuple 
pickling (that Spark currently has workarounds for), but these include other 
changes that need some verification before bringing into Spark.  Upgrading 
first to 0.4.2 will help make this verification easier.

Discussion on the mailing list: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Thoughts-on-Cloudpickle-Update-td23188.html


> Update Cloudpickle to match version 0.4.3
> -
>
> Key: SPARK-23159
> URL: https://issues.apache.org/jira/browse/SPARK-23159
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Update PySpark's version of Cloudpickle to match version 0.4.3.  The reasons 
> for doing this are:
>  * Pick up bug fixes, improvements with newer version
>  * Match a specific version as close as possible (Spark has additional 
> changes that might be necessary) to make future upgrades easier
> There are newer versions of Cloudpickle that can fix bugs with NamedTuple 
> pickling (that Spark currently has workarounds for), but these include other 
> changes that need some verification before bringing into Spark.  Upgrading 
> first to 0.4.3 will help make this verification easier.
> Discussion on the mailing list: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Thoughts-on-Cloudpickle-Update-td23188.html]
> Upgrading to the recent release of v0.4.3 will include the following:
>  * Fix pickling of named tuples 
> [https://github.com/cloudpipe/cloudpickle/pull/113]
>  * Built in type constructors for PyPy compatibility 
> [https://github.com/cloudpipe/cloudpickle/commit/d84980ccaafc7982a50d4e04064011f401f17d1b]
>  * Fix memoryview support [https://github.com/cloudpipe/cloudpickle/pull/122]
>  * Improved compatibility with other cloudpickle versions 
> [https://github.com/cloudpipe/cloudpickle/pull/128]
>  * Several cleanups [https://github.com/cloudpipe/cloudpickle/pull/121] and 
> [https://github.com/cloudpipe/cloudpickle/commit/c91aaf110441991307f5097f950764079d0f9652]
>  * [MRG] Regression on pickling classes from the __main__ module 
> [https://github.com/cloudpipe/cloudpickle/pull/149]
>  * BUG: Handle instance methods of builtin types 
> [https://github.com/cloudpipe/cloudpickle/pull/154]
>  * Fix #129 : do not silence RuntimeError in dump() 
> [https://github.com/cloudpipe/cloudpickle/pull/153]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For 

[jira] [Updated] (SPARK-23159) Update Cloudpickle to match version 0.4.3

2018-02-13 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-23159:
-
Summary: Update Cloudpickle to match version 0.4.3  (was: Update 
Cloudpickle to match version 0.4.2)

> Update Cloudpickle to match version 0.4.3
> -
>
> Key: SPARK-23159
> URL: https://issues.apache.org/jira/browse/SPARK-23159
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Update PySpark's version of Cloudpickle to match version 0.4.2.  The reasons 
> for doing this are:
>  * Pick up bug fixes, improvements with newer version
>  * Match a specific version as close as possible (Spark has additional 
> changes that might be necessary) to make future upgrades easier
> There are newer versions of Cloudpickle that can fix bugs with NamedTuple 
> pickling (that Spark currently has workarounds for), but these include other 
> changes that need some verification before bringing into Spark.  Upgrading 
> first to 0.4.2 will help make this verification easier.
> Discussion on the mailing list: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Thoughts-on-Cloudpickle-Update-td23188.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23416) flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false

2018-02-13 Thread Jose Torres (JIRA)
Jose Torres created SPARK-23416:
---

 Summary: flaky test: 
org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress
 test for failOnDataLoss=false
 Key: SPARK-23416
 URL: https://issues.apache.org/jira/browse/SPARK-23416
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Jose Torres


I suspect this is a race condition latent in the DataSourceV2 write path, or at 
least the interaction of that write path with StreamTest.

[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/]
h3. Error Message

org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
16b2a2b1-acdd-44ec-902f-531169193169, runId = 
9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
aborted.
h3. Stacktrace

sbt.ForkMain$ForkError: org.apache.spark.sql.streaming.StreamingQueryException: 
Query [id = 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
aborted. at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
 at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
 Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing 
job aborted. at 
org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) 
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) 
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
 at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) at 
org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
 at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at 
org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488)
 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
 at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:483)
 at 
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
 at 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
 at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:482)
 at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
 at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
 at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
 at 
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
 at 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
 at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
 at 
org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
 at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
 at 

[jira] [Created] (SPARK-23415) BufferHolderSparkSubmitSuite is flaky

2018-02-13 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-23415:
-

 Summary: BufferHolderSparkSubmitSuite is flaky
 Key: SPARK-23415
 URL: https://issues.apache.org/jira/browse/SPARK-23415
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Dongjoon Hyun


The test suite sometimes fails due to a 60-second timeout.

{code}
Error Message
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
failAfter did not complete within 60 seconds.
Stacktrace
sbt.ForkMain$ForkError: 
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
failAfter did not complete within 60 seconds.
{code}


- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87380/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4206/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-13 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-23410:
---
Shepherd: Herman van Hovell

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently the JSON parser is forced to read JSON files in UTF-8. This 
> behavior breaks backward compatibility with Spark 2.2.1 and earlier versions, 
> which could read JSON files in UTF-16, UTF-32 and other encodings thanks to 
> the auto-detection mechanism of the Jackson library. We need to give users 
> back the ability to read JSON files in a specified charset and/or to detect 
> the charset automatically, as before.
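
A sketch of the kind of user-facing behavior being requested (illustrative 
only: the {{encoding}} option name is an assumption made for this example, not 
a committed API; {{spark}} is a SparkSession):

{code:scala}
// Hypothetical illustration of reading a non-UTF-8 JSON file with an explicit
// charset instead of relying on forced UTF-8.
val df = spark.read
  .option("multiLine", true)
  .option("encoding", "UTF-16LE")
  .json("/path/to/utf16.json")

df.show()
{code}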



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work

2018-02-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23413:


Assignee: Apache Spark

> Sorting tasks by Host / Executor ID on the Stage page does not work
> ---
>
> Key: SPARK-23413
> URL: https://issues.apache.org/jira/browse/SPARK-23413
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Attila Zsolt Piros
>Assignee: Apache Spark
>Priority: Major
>
> Sorting tasks by Host / Executor ID throws exceptions: 
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Executor ID at 
> org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
> org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
> org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
> org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
> org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
> org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
> {code}
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Host at 
> org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
> org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
> org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
> org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
> org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
> org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at 
> {code}
>   !image-2018-02-13-16-50-32-600.png!
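
A purely hypothetical sketch of the lookup pattern the exception points at (the 
map contents and names below are invented for illustration and are not the 
actual ApiHelper code): sorting fails when a UI column header has no 
corresponding index name registered.

{code:scala}
// Hypothetical illustration: a header-to-index-name map that is missing entries
// for "Host" and "Executor ID" produces exactly this kind of error.
val columnToIndex: Map[String, String] = Map(
  "Index" -> "taskIndex",
  "Duration" -> "duration"
  // "Host" -> "host",             // missing entry
  // "Executor ID" -> "executorId" // missing entry
)

def indexName(sortColumn: String): String =
  columnToIndex.getOrElse(sortColumn,
    throw new IllegalArgumentException(s"Invalid sort column: $sortColumn"))
{code}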



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work

2018-02-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362830#comment-16362830
 ] 

Apache Spark commented on SPARK-23413:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/20601

> Sorting tasks by Host / Executor ID on the Stage page does not work
> ---
>
> Key: SPARK-23413
> URL: https://issues.apache.org/jira/browse/SPARK-23413
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Sorting tasks by Host / Executor ID throws exceptions: 
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Executor ID at 
> org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
> org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
> org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
> org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
> org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
> org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
> {code}
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Host at 
> org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
> org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
> org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
> org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
> org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
> org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at 
> {code}
>   !image-2018-02-13-16-50-32-600.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work

2018-02-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23413:


Assignee: (was: Apache Spark)

> Sorting tasks by Host / Executor ID on the Stage page does not work
> ---
>
> Key: SPARK-23413
> URL: https://issues.apache.org/jira/browse/SPARK-23413
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Sorting tasks by Host / Executor ID throws exceptions: 
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Executor ID at 
> org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
> org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
> org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
> org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
> org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
> org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
> {code}
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Host at 
> org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
> org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
> org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
> org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
> org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
> org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at 
> {code}
>   !image-2018-02-13-16-50-32-600.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23344) Add KMeans distanceMeasure param to PySpark

2018-02-13 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362774#comment-16362774
 ] 

Marco Gaido commented on SPARK-23344:
-

I see. Indeed, I think it would be good for the community to decide on a 
"standard" approach for this. Do you think it is worth opening a thread about 
this topic on the dev mailing list?

> Add KMeans distanceMeasure param to PySpark
> ---
>
> Key: SPARK-23344
> URL: https://issues.apache.org/jira/browse/SPARK-23344
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> SPARK-22119 introduced a new parameter for KMeans, i.e. {{distanceMeasure}}. 
> We should also add it to the Python interface.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23344) Add KMeans distanceMeasure param to PySpark

2018-02-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362760#comment-16362760
 ] 

Sean Owen commented on SPARK-23344:
---

My general rule of thumb is: how likely is it that you would resolve one issue 
independently of the other, or cherry-pick the fix for one but not the other? If 
they'd pretty much always go together, then they seem like one logical issue. A 
lot of other concerns modify that; for example, very large changes should 
probably be broken down even if they go together. This one is ambiguous.

> Add KMeans distanceMeasure param to PySpark
> ---
>
> Key: SPARK-23344
> URL: https://issues.apache.org/jira/browse/SPARK-23344
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> SPARK-22119 introduced a new parameter for KMeans, i.e. {{distanceMeasure}}. 
> We should also add it to the Python interface.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23344) Add KMeans distanceMeasure param to PySpark

2018-02-13 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362758#comment-16362758
 ] 

Marco Gaido commented on SPARK-23344:
-

[~srowen] I did it this way because I have always seen it done like this. I am 
not sure about the reason; maybe it is meant to make review easier. What would 
you suggest?

> Add KMeans distanceMeasure param to PySpark
> ---
>
> Key: SPARK-23344
> URL: https://issues.apache.org/jira/browse/SPARK-23344
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> SPARK-22119 introduced a new parameter for KMeans, i.e. {{distanceMeasure}}. 
> We should also add it to the Python interface.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23292) python tests related to pandas are skipped with python 2

2018-02-13 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23292:
---
Summary: python tests related to pandas are skipped with python 2  (was: 
python tests related to pandas are skipped)

> python tests related to pandas are skipped with python 2
> 
>
> Key: SPARK-23292
> URL: https://issues.apache.org/jira/browse/SPARK-23292
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Yin Huai
>Priority: Critical
>
> I was running python tests and found that 
> [pyspark.sql.tests.GroupbyAggPandasUDFTests.test_unsupported_types|https://github.com/apache/spark/blob/52e00f70663a87b5837235bdf72a3e6f84e11411/python/pyspark/sql/tests.py#L4528-L4548]
>  does not run with Python 2 because the test uses "assertRaisesRegex" 
> (supported by Python 3) instead of "assertRaisesRegexp" (supported by Python 
> 2). However, the Spark Jenkins build does not fail because of this issue (see the run 
> history at 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/]).
>  After looking into this issue, it [seems the test script will skip tests related to 
> pandas if pandas is not 
> installed|https://github.com/apache/spark/blob/2ac895be909de7e58e1051dc2a1bba98a25bf4be/python/pyspark/sql/tests.py#L51-L63],
>  which means that Jenkins does not have pandas installed. 
>  
> Since pyarrow related tests have the same skipping logic, we will need to 
> check if jenkins has pyarrow installed correctly as well. 
>  
> Since features using pandas and pyarrow are in 2.3, we should fix the test 
> issue and make sure all tests pass before we make the release.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23217) Add cosine distance measure to ClusteringEvaluator

2018-02-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23217.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20396
[https://github.com/apache/spark/pull/20396]

> Add cosine distance measure to ClusteringEvaluator
> --
>
> Key: SPARK-23217
> URL: https://issues.apache.org/jira/browse/SPARK-23217
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: SPARK-23217.pdf
>
>
> SPARK-22119 introduced the cosine distance measure for KMeans. Therefore it 
> would be useful to provide also an implementation of ClusteringEvaluator 
> using the cosine distance measure.
>  
> Attached you can find a design document for it.
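
For reference, usage would look roughly like the sketch below once the cosine 
measure is available (the {{predictions}} DataFrame and its column names are 
placeholders):

{code:scala}
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// `predictions` is assumed to be the output of a clustering model's transform(),
// containing "features" and "prediction" columns.
val evaluator = new ClusteringEvaluator()
  .setDistanceMeasure("cosine")  // the measure this issue adds
val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette (cosine) = $silhouette")
{code}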



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23217) Add cosine distance measure to ClusteringEvaluator

2018-02-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-23217:
-

Assignee: Marco Gaido

> Add cosine distance measure to ClusteringEvaluator
> --
>
> Key: SPARK-23217
> URL: https://issues.apache.org/jira/browse/SPARK-23217
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: SPARK-23217.pdf
>
>
> SPARK-22119 introduced the cosine distance measure for KMeans. Therefore it 
> would be useful to provide also an implementation of ClusteringEvaluator 
> using the cosine distance measure.
>  
> Attached you can find a design document for it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23392) Add some test case for images feature

2018-02-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23392.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20583
[https://github.com/apache/spark/pull/20583]

> Add some test case for images feature
> -
>
> Key: SPARK-23392
> URL: https://issues.apache.org/jira/browse/SPARK-23392
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: xubo245
>Assignee: xubo245
>Priority: Trivial
> Fix For: 2.4.0
>
>
> Add some test case for images feature: SPARK-21866



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23392) Add some test case for images feature

2018-02-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-23392:
-

Assignee: xubo245

> Add some test case for images feature
> -
>
> Key: SPARK-23392
> URL: https://issues.apache.org/jira/browse/SPARK-23392
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: xubo245
>Assignee: xubo245
>Priority: Trivial
> Fix For: 2.4.0
>
>
> Add some test case for images feature: SPARK-21866



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23388) Support for Parquet Binary DecimalType in VectorizedColumnReader

2018-02-13 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362730#comment-16362730
 ] 

Sameer Agarwal commented on SPARK-23388:


yes, I agree

> Support for Parquet Binary DecimalType in VectorizedColumnReader
> 
>
> Key: SPARK-23388
> URL: https://issues.apache.org/jira/browse/SPARK-23388
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: James Thompson
>Assignee: James Thompson
>Priority: Major
> Fix For: 2.3.1
>
>
> The following commit to spark removed support for decimal binary types: 
> [https://github.com/apache/spark/commit/9c29c557635caf739fde942f53255273aac0d7b1#diff-7bdf5fd0ce0b1ccbf4ecf083611976e6R428]
> As per the parquet spec, decimal can be used to annotate binary types, so 
> support should be re-added: 
> [https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23382) Spark Streaming ui about the contents of the form need to have hidden and show features, when the table records very much.

2018-02-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-23382:
-

Assignee: guoxiaolongzte

> Spark Streaming ui about the contents of the form need to have hidden and 
> show features, when the table records very much.
> --
>
> Key: SPARK-23382
> URL: https://issues.apache.org/jira/browse/SPARK-23382
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Assignee: guoxiaolongzte
>Priority: Minor
> Fix For: 2.4.0
>
>
> The tables on the Spark Streaming UI should support hiding and showing their 
> contents when they contain many records.
> For the motivation, please refer to 
> https://issues.apache.org/jira/browse/SPARK-23024



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23382) Spark Streaming ui about the contents of the form need to have hidden and show features, when the table records very much.

2018-02-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23382.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20570
[https://github.com/apache/spark/pull/20570]

> Spark Streaming ui about the contents of the form need to have hidden and 
> show features, when the table records very much.
> --
>
> Key: SPARK-23382
> URL: https://issues.apache.org/jira/browse/SPARK-23382
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Assignee: guoxiaolongzte
>Priority: Minor
> Fix For: 2.4.0
>
>
> The tables on the Spark Streaming UI should support hiding and showing their 
> contents when they contain many records.
> For the motivation, please refer to 
> https://issues.apache.org/jira/browse/SPARK-23024



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23340) Upgrade Apache ORC to 1.4.3

2018-02-13 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23340:
--
Summary: Upgrade Apache ORC to 1.4.3  (was: Empty float/double array 
columns in ORC file should not raise EOFException)

> Upgrade Apache ORC to 1.4.3
> ---
>
> Key: SPARK-23340
> URL: https://issues.apache.org/jira/browse/SPARK-23340
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> This issue updates the Apache ORC dependencies to 1.4.3, released on February 9th.
> The Apache ORC 1.4.2 release removed unnecessary dependencies, and 1.4.3 adds 5 
> more patches including bug fixes (https://s.apache.org/Fll8).
> In particular, ORC-285, shown below, is fixed in 1.4.3.
> {code}
> scala> val df = Seq(Array.empty[Float]).toDF()
> scala> df.write.format("orc").save("/tmp/floatarray")
> scala> spark.read.orc("/tmp/floatarray")
> res1: org.apache.spark.sql.DataFrame = [value: array]
> scala> spark.read.orc("/tmp/floatarray").show()
> 18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.io.IOException: Error reading file: 
> file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
>   at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
>   at 
> org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
> ...
> Caused by: java.io.EOFException: Read past EOF for compressed stream Stream 
> for column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23414) Plotting using matplotlib in MLlib pyspark

2018-02-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23414.
---
Resolution: Invalid

Questions should go to the mailing list. Using matplotlib would be out of scope 
and pretty orthogonal to using Spark.

> Plotting using matplotlib in MLlib pyspark 
> ---
>
> Key: SPARK-23414
> URL: https://issues.apache.org/jira/browse/SPARK-23414
> Project: Spark
>  Issue Type: Question
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Waleed Esmail
>Priority: Major
>
> Dear MLlib experts,
> I just want to plot a fancy confusion matrix (true values vs predicted 
> values) like the one produced by seaborn module in python, so I did the 
> following:
> {code:java}
> labelIndexer = StringIndexer(inputCol="label", 
> outputCol="indexedLabel").fit(output)
> # Automatically identify categorical features, and index them.
> # We specify maxCategories so features with > 4 distinct values are treated 
> as continuous.
> featureIndexer = VectorIndexer(inputCol="features", 
> outputCol="indexedFeatures").fit(output)
> # Split the data into training and test sets (30% held out for testing)
> (trainingData, testData) = output.randomSplit([0.7, 0.3])
> dt = DecisionTreeClassifier(labelCol="indexedLabel", 
> featuresCol="indexedFeatures", maxDepth=15)
> # Chain indexers and tree in a Pipeline
> pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])
> # Train model.  This also runs the indexers.
> model = pipeline.fit(trainingData)
> # Make predictions.
> predictions = model.transform(testData)
> predictionAndLabels = predictions.select("prediction", "indexedLabel")
> y_predicted = np.array(predictions.select("prediction").collect())
> y_test = np.array(predictions.select("indexedLabel").collect())
> from sklearn.metrics import confusion_matrix
> import matplotlib.ticker as ticker
> figcm, ax = plt.subplots()
> cm = confusion_matrix(y_test, y_predicted)
> # for normalization
> cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
> sns.heatmap(cm, square=True, annot=True, cbar=False)
> plt.xlabel('predication')
> plt.ylabel('true value')
> {code}
> Is this the right way to do it? Please note that I am new to Spark and MLlib.
>  
> thank you in advance,
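For what it is worth, a minimal sketch of a Spark-side alternative (assuming the 
{{predictions}} DataFrame from the pipeline above, with "prediction" and 
"indexedLabel" columns): build the confusion matrix with MulticlassMetrics 
instead of collecting both columns and using sklearn.
{code:python}
from pyspark.mllib.evaluation import MulticlassMetrics

# RDD of (prediction, label) pairs, as expected by MulticlassMetrics.
pred_and_labels = predictions.select("prediction", "indexedLabel") \
    .rdd.map(lambda row: (float(row[0]), float(row[1])))

metrics = MulticlassMetrics(pred_and_labels)
cm = metrics.confusionMatrix().toArray()  # numpy array, ready for seaborn/matplotlib
{code}
The resulting array can be normalized and passed to sns.heatmap exactly as in 
the snippet above.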



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23414) Plotting using matplotlib in MLlib pyspark

2018-02-13 Thread Waleed Esmail (JIRA)
Waleed Esmail created SPARK-23414:
-

 Summary: Plotting using matplotlib in MLlib pyspark 
 Key: SPARK-23414
 URL: https://issues.apache.org/jira/browse/SPARK-23414
 Project: Spark
  Issue Type: Question
  Components: MLlib
Affects Versions: 2.2.1
Reporter: Waleed Esmail


Dear MLlib experts,

I just want to plot a fancy confusion matrix (true values vs predicted values) 
like the one produced by seaborn module in python, so I did the following:
{code:java}
labelIndexer = StringIndexer(inputCol="label", 
outputCol="indexedLabel").fit(output)
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as 
continuous.
featureIndexer = VectorIndexer(inputCol="features", 
outputCol="indexedFeatures").fit(output)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = output.randomSplit([0.7, 0.3])


dt = DecisionTreeClassifier(labelCol="indexedLabel", 
featuresCol="indexedFeatures", maxDepth=15)

# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])
# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)
predictionAndLabels = predictions.select("prediction", "indexedLabel")

y_predicted = np.array(predictions.select("prediction").collect())
y_test = np.array(predictions.select("indexedLabel").collect())



from sklearn.metrics import confusion_matrix
import matplotlib.ticker as ticker

figcm, ax = plt.subplots()
cm = confusion_matrix(y_test, y_predicted)
# for normalization
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm, square=True, annot=True, cbar=False)
plt.xlabel('predication')
plt.ylabel('true value')
{code}
Is this the right way to do it? Please note that I am new to Spark and MLlib.

 

thank you in advance,



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23411) Deprecate SparkContext.getExecutorStorageStatus

2018-02-13 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362597#comment-16362597
 ] 

Marco Gaido commented on SPARK-23411:
-

I think this method was removed in SPARK-20659, so I think this is a duplicate. 
But maybe I am missing something.

> Deprecate SparkContext.getExecutorStorageStatus
> ---
>
> Key: SPARK-23411
> URL: https://issues.apache.org/jira/browse/SPARK-23411
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> This is another developer API in SparkContext that is better moved to a more 
> monitoring-minded place such as {{SparkStatusTracker}}. The same information 
> is also already available through the REST API.
> Moreover, it exposes {{RDDInfo}}, which is a mutable internal Spark type, and 
> exposing that to applications is kinda sketchy. (It can't be made private, 
> though, since it's exposed indirectly in events posted to the listener bus.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23053) taskBinarySerialization and task partitions calculate in DagScheduler.submitMissingTasks should keep the same RDD checkpoint status

2018-02-13 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-23053.
--
Resolution: Fixed

> taskBinarySerialization and task partitions calculate in 
> DagScheduler.submitMissingTasks should keep the same RDD checkpoint status
> ---
>
> Key: SPARK-23053
> URL: https://issues.apache.org/jira/browse/SPARK-23053
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.1.0
>Reporter: huangtengfei
>Assignee: huangtengfei
>Priority: Major
> Fix For: 2.2.2, 2.3.1, 2.4.0
>
>
> When concurrent jobs run on the same RDD that is marked for checkpointing, one 
> job may finish and start RDD.doCheckpoint while another job is being submitted, 
> which calls submitStage and submitMissingTasks. In 
> [submitMissingTasks|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L961],
>  taskBinaryBytes is serialized and the task partitions are calculated, and both 
> depend on the checkpoint status. If the former is computed before doCheckpoint 
> finishes while the latter is computed after it finishes, then when the task runs 
> and rdd.compute is called, RDDs with a particular partition type, such as 
> [MapWithStateRDD|https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/rdd/MapWithStateRDD.scala],
>  which cast the partition type, throw a ClassCastException because the partition 
> parameter is actually a CheckpointRDDPartition.
> This error occurs because rdd.doCheckpoint runs in the same thread that called 
> sc.runJob, while the task serialization happens in the DAGScheduler's event loop.
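As an illustration of the submission pattern behind the race (not a deterministic 
reproduction, and a plain RDD will not hit the MapWithStateRDD-specific 
ClassCastException; {{sc}} is assumed to be an active SparkContext):
{code:python}
import threading

sc.setCheckpointDir("/tmp/checkpoints")
rdd = sc.parallelize(range(1000)).map(lambda x: x * 2)
rdd.checkpoint()

# Two jobs on the same checkpointed RDD submitted concurrently: the first
# job to finish runs rdd.doCheckpoint() on its calling thread while the
# DAGScheduler event loop may still be serializing tasks for the other.
t1 = threading.Thread(target=rdd.count)
t2 = threading.Thread(target=rdd.collect)
t1.start(); t2.start()
t1.join(); t2.join()
{code}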



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23053) taskBinarySerialization and task partitions calculate in DagScheduler.submitMissingTasks should keep the same RDD checkpoint status

2018-02-13 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-23053:


Assignee: huangtengfei

> taskBinarySerialization and task partitions calculate in 
> DagScheduler.submitMissingTasks should keep the same RDD checkpoint status
> ---
>
> Key: SPARK-23053
> URL: https://issues.apache.org/jira/browse/SPARK-23053
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.1.0
>Reporter: huangtengfei
>Assignee: huangtengfei
>Priority: Major
> Fix For: 2.2.2, 2.3.1, 2.4.0
>
>
> When concurrent jobs run on the same RDD that is marked for checkpointing, one 
> job may finish and start RDD.doCheckpoint while another job is being submitted, 
> which calls submitStage and submitMissingTasks. In 
> [submitMissingTasks|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L961],
>  taskBinaryBytes is serialized and the task partitions are calculated, and both 
> depend on the checkpoint status. If the former is computed before doCheckpoint 
> finishes while the latter is computed after it finishes, then when the task runs 
> and rdd.compute is called, RDDs with a particular partition type, such as 
> [MapWithStateRDD|https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/rdd/MapWithStateRDD.scala],
>  which cast the partition type, throw a ClassCastException because the partition 
> parameter is actually a CheckpointRDDPartition.
> This error occurs because rdd.doCheckpoint runs in the same thread that called 
> sc.runJob, while the task serialization happens in the DAGScheduler's event loop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23053) taskBinarySerialization and task partitions calculate in DagScheduler.submitMissingTasks should keep the same RDD checkpoint status

2018-02-13 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362556#comment-16362556
 ] 

Imran Rashid commented on SPARK-23053:
--

Fixed by https://github.com/apache/spark/pull/20244

I set the fix version to 2.3.1, because we're in the middle of voting for RC3.  
If we cut another RC this would actually be fixed in 2.3.0.

> taskBinarySerialization and task partitions calculate in 
> DagScheduler.submitMissingTasks should keep the same RDD checkpoint status
> ---
>
> Key: SPARK-23053
> URL: https://issues.apache.org/jira/browse/SPARK-23053
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.1.0
>Reporter: huangtengfei
>Priority: Major
> Fix For: 2.2.2, 2.3.1, 2.4.0
>
>
> When concurrent jobs run on the same RDD that is marked for checkpointing, one 
> job may finish and start RDD.doCheckpoint while another job is being submitted, 
> which calls submitStage and submitMissingTasks. In 
> [submitMissingTasks|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L961],
>  taskBinaryBytes is serialized and the task partitions are calculated, and both 
> depend on the checkpoint status. If the former is computed before doCheckpoint 
> finishes while the latter is computed after it finishes, then when the task runs 
> and rdd.compute is called, RDDs with a particular partition type, such as 
> [MapWithStateRDD|https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/rdd/MapWithStateRDD.scala],
>  which cast the partition type, throw a ClassCastException because the partition 
> parameter is actually a CheckpointRDDPartition.
> This error occurs because rdd.doCheckpoint runs in the same thread that called 
> sc.runJob, while the task serialization happens in the DAGScheduler's event loop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23053) taskBinarySerialization and task partitions calculate in DagScheduler.submitMissingTasks should keep the same RDD checkpoint status

2018-02-13 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-23053:
-
Fix Version/s: 2.4.0
   2.3.1
   2.2.2

> taskBinarySerialization and task partitions calculate in 
> DagScheduler.submitMissingTasks should keep the same RDD checkpoint status
> ---
>
> Key: SPARK-23053
> URL: https://issues.apache.org/jira/browse/SPARK-23053
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.1.0
>Reporter: huangtengfei
>Priority: Major
> Fix For: 2.2.2, 2.3.1, 2.4.0
>
>
> When concurrent jobs run on the same RDD that is marked for checkpointing, one 
> job may finish and start RDD.doCheckpoint while another job is being submitted, 
> which calls submitStage and submitMissingTasks. In 
> [submitMissingTasks|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L961],
>  taskBinaryBytes is serialized and the task partitions are calculated, and both 
> depend on the checkpoint status. If the former is computed before doCheckpoint 
> finishes while the latter is computed after it finishes, then when the task runs 
> and rdd.compute is called, RDDs with a particular partition type, such as 
> [MapWithStateRDD|https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/rdd/MapWithStateRDD.scala],
>  which cast the partition type, throw a ClassCastException because the partition 
> parameter is actually a CheckpointRDDPartition.
> This error occurs because rdd.doCheckpoint runs in the same thread that called 
> sc.runJob, while the task serialization happens in the DAGScheduler's event loop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23189) reflect stage level blacklisting on executor tab

2018-02-13 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-23189:


Assignee: Attila Zsolt Piros

> reflect stage level blacklisting on executor tab 
> -
>
> Key: SPARK-23189
> URL: https://issues.apache.org/jira/browse/SPARK-23189
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: afterStageCompleted.png, backlisted.png, 
> multiple_stages.png, multiple_stages_1.png, multiple_stages_2.png, 
> multiple_stages_3.png
>
>
> This issue came up while working on SPARK-22577, where the conclusion was that 
> not only should the stage tab reflect stage- and application-level blacklisting, 
> but the executor tab should also be extended with stage-level blacklisting 
> information.
> As [~irashid] and [~tgraves] discussed, the blacklisted stages should be listed 
> for an executor like "*stage[ , ,...]*". One idea was to list only the 3 most 
> recently blacklisted stages; another was to list all the active stages that are 
> blacklisted.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23189) reflect stage level blacklisting on executor tab

2018-02-13 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-23189.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20408
[https://github.com/apache/spark/pull/20408]

> reflect stage level blacklisting on executor tab 
> -
>
> Key: SPARK-23189
> URL: https://issues.apache.org/jira/browse/SPARK-23189
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: afterStageCompleted.png, backlisted.png, 
> multiple_stages.png, multiple_stages_1.png, multiple_stages_2.png, 
> multiple_stages_3.png
>
>
> This issue came up while working on SPARK-22577, where the conclusion was that 
> not only should the stage tab reflect stage- and application-level blacklisting, 
> but the executor tab should also be extended with stage-level blacklisting 
> information.
> As [~irashid] and [~tgraves] discussed, the blacklisted stages should be listed 
> for an executor like "*stage[ , ,...]*". One idea was to list only the 3 most 
> recently blacklisted stages; another was to list all the active stages that are 
> blacklisted.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work

2018-02-13 Thread Attila Zsolt Piros (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-23413:
---
Description: 
Sorting tasks by Host / Executor ID throws exceptions: 
{code}
java.lang.IllegalArgumentException: Invalid sort column: Executor ID at 
org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
{code}

{code}
java.lang.IllegalArgumentException: Invalid sort column: Host at 
org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at 
{code}

  !image-2018-02-13-16-50-32-600.png!

  was:
Sorting tasks by Host / Executor ID throws exceptions: 
{noformat}
java.lang.IllegalArgumentException: Invalid sort column: Executor ID at 
org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
{noformat}
{noformat}
java.lang.IllegalArgumentException: Invalid sort column: Host at 
org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at 
{noformat}
  !image-2018-02-13-16-50-32-600.png!


> Sorting tasks by Host / Executor ID on the Stage page does not work
> ---
>
> Key: SPARK-23413
> URL: https://issues.apache.org/jira/browse/SPARK-23413
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Sorting tasks by Host / Executor ID throws exceptions: 
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Executor ID at 
> org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
> org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
> org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
> org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
> org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
> org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
> {code}
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Host at 
> org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
> 

[jira] [Commented] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work

2018-02-13 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362513#comment-16362513
 ] 

Attila Zsolt Piros commented on SPARK-23413:


I am working on that.

> Sorting tasks by Host / Executor ID on the Stage page does not work
> ---
>
> Key: SPARK-23413
> URL: https://issues.apache.org/jira/browse/SPARK-23413
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Sorting tasks by Host / Executor ID throws exceptions: 
> {noformat}
> java.lang.IllegalArgumentException: Invalid sort column: Executor ID at 
> org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
> org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
> org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
> org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
> org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
> org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
> {noformat}
> {noformat}
> java.lang.IllegalArgumentException: Invalid sort column: Host at 
> org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
> org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
> org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
> org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
> org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
> org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at 
> {noformat}
>   !image-2018-02-13-16-50-32-600.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work

2018-02-13 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-23413:
--

 Summary: Sorting tasks by Host / Executor ID on the Stage page 
does not work
 Key: SPARK-23413
 URL: https://issues.apache.org/jira/browse/SPARK-23413
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.0, 2.4.0
Reporter: Attila Zsolt Piros


Sorting tasks by Host / Executor ID throws exceptions: 
{noformat}
java.lang.IllegalArgumentException: Invalid sort column: Executor ID at 
org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
{noformat}
{noformat}
java.lang.IllegalArgumentException: Invalid sort column: Host at 
org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at 
{noformat}
  !image-2018-02-13-16-50-32-600.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23412) Add cosine distance measure to BisectingKMeans

2018-02-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362493#comment-16362493
 ] 

Apache Spark commented on SPARK-23412:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/20600

> Add cosine distance measure to BisectingKMeans
> --
>
> Key: SPARK-23412
> URL: https://issues.apache.org/jira/browse/SPARK-23412
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Minor
>
> SPARK-22119 introduced cosine distance for KMeans.
> This ticket is to support the cosine distance measure on BisectingKMeans too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23412) Add cosine distance measure to BisectingKMeans

2018-02-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23412:


Assignee: Apache Spark

> Add cosine distance measure to BisectingKMeans
> --
>
> Key: SPARK-23412
> URL: https://issues.apache.org/jira/browse/SPARK-23412
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-22119 introduced cosine distance for KMeans.
> This ticket is to support the cosine distance measure on BisectingKMeans too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23412) Add cosine distance measure to BisectingKMeans

2018-02-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23412:


Assignee: (was: Apache Spark)

> Add cosine distance measure to BisectingKMeans
> --
>
> Key: SPARK-23412
> URL: https://issues.apache.org/jira/browse/SPARK-23412
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Minor
>
> SPARK-22119 introduced cosine distance for KMeans.
> This ticket is to support the cosine distance measure on BisectingKMeans too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23412) Add cosine distance measure to BisectingKMeans

2018-02-13 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-23412:
---

 Summary: Add cosine distance measure to BisectingKMeans
 Key: SPARK-23412
 URL: https://issues.apache.org/jira/browse/SPARK-23412
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.4.0
Reporter: Marco Gaido


SPARK-22119 introduced cosine distance for KMeans.

This ticket is to support the cosine distance measure on BisectingKMeans too.
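A minimal sketch of the proposed usage, mirroring KMeans (assumes an active 
SparkSession, a DataFrame {{df}} with a "features" vector column, and that 
BisectingKMeans gains the {{distanceMeasure}} param this ticket would add):
{code:python}
from pyspark.ml.clustering import BisectingKMeans

# distanceMeasure defaults to "euclidean" today; "cosine" is the proposal here.
bkm = BisectingKMeans(k=4, distanceMeasure="cosine")
model = bkm.fit(df)
centers = model.clusterCenters()
{code}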



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20659) Remove StorageStatus, or make it private.

2018-02-13 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-20659:
--

Assignee: Attila Zsolt Piros

> Remove StorageStatus, or make it private.
> -
>
> Key: SPARK-20659
> URL: https://issues.apache.org/jira/browse/SPARK-20659
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.0
>
>
> With the work being done in SPARK-18085, StorageStatus is not used anymore by 
> the UI. It's still used in a couple of other places, though:
> - {{SparkContext.getExecutorStorageStatus}}
> - {{BlockManagerSource}} (a metrics source)
> Both could be changed to use the REST API types; the first one could be 
> replaced with a new method in {{SparkStatusTracker}}, which I also think is a 
> better place for it anyway.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20659) Remove StorageStatus, or make it private.

2018-02-13 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-20659.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20546
[https://github.com/apache/spark/pull/20546]

> Remove StorageStatus, or make it private.
> -
>
> Key: SPARK-20659
> URL: https://issues.apache.org/jira/browse/SPARK-20659
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.0
>
>
> With the work being done in SPARK-18085, StorageStatus is not used anymore by 
> the UI. It's still used in a couple of other places, though:
> - {{SparkContext.getExecutorStorageStatus}}
> - {{BlockManagerSource}} (a metrics source)
> Both could be changed to use the REST API types; the first one could be 
> replaced with a new method in {{SparkStatusTracker}}, which I also think is a 
> better place for it anyway.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23411) Deprecate SparkContext.getExecutorStorageStatus

2018-02-13 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-23411:
---
Issue Type: Improvement  (was: Bug)

> Deprecate SparkContext.getExecutorStorageStatus
> ---
>
> Key: SPARK-23411
> URL: https://issues.apache.org/jira/browse/SPARK-23411
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> This is another developer API in SparkContext that is better moved to a more 
> monitoring-minded place such as {{SparkStatusTracker}}. The same information 
> is also already available through the REST API.
> Moreover, it exposes {{RDDInfo}}, which is a mutable internal Spark type, and 
> exposing that to applications is kinda sketchy. (It can't be made private, 
> though, since it's exposed indirectly in events posted to the listener bus.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23411) Deprecate SparkContext.getExecutorStorageStatus

2018-02-13 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-23411:
--

 Summary: Deprecate SparkContext.getExecutorStorageStatus
 Key: SPARK-23411
 URL: https://issues.apache.org/jira/browse/SPARK-23411
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Marcelo Vanzin


This is another developer API in SparkContext that is better moved to a more 
monitoring-minded place such as {{SparkStatusTracker}}. The same information is 
also already available through the REST API.

Moreover, it exposes {{RDDInfo}}, which is a mutable internal Spark type, and 
exposing that to applications is kinda sketchy. (It can't be made private, 
though, since it's exposed indirectly in events posted to the listener bus.)
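For reference, a minimal sketch of the REST alternative mentioned above (assumes 
the application UI is reachable on the default port 4040, that the {{requests}} 
package is available, and an active SparkContext {{sc}}; adjust host and port for 
your deployment):
{code:python}
import requests

app_id = sc.applicationId
url = "http://localhost:4040/api/v1/applications/{}/executors".format(app_id)

# One entry per executor (plus the driver), with storage/memory metrics.
for executor in requests.get(url).json():
    print(executor["id"], executor["memoryUsed"], executor["diskUsed"])
{code}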



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21501) Spark shuffle index cache size should be memory based

2018-02-13 Thread Xun REN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362424#comment-16362424
 ] 

Xun REN commented on SPARK-21501:
-

Hi guys,

Could you tell me how to figure out how much memory the NM with the Spark shuffle 
service has used? And how can I find out how many reducers a Spark job has used?

I have run into the same problem recently and I want to get a list of Spark jobs 
sorted by number of reducers.

 

Thanks.

Regards,

Xun REN.

> Spark shuffle index cache size should be memory based
> -
>
> Key: SPARK-21501
> URL: https://issues.apache.org/jira/browse/SPARK-21501
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>Assignee: Sanket Reddy
>Priority: Major
> Fix For: 2.3.0
>
>
> Right now the spark shuffle service has a cache for index files. It is based 
> on a # of files cached (spark.shuffle.service.index.cache.entries). This can 
> cause issues if people have a lot of reducers because the size of each entry 
> can fluctuate based on the # of reducers. 
> We saw an issue with a job that had 17 reducers and it caused the NM with 
> the spark shuffle service to use 700-800MB of memory in the NM by itself.
> We should change this cache to be memory based and only allow a certain amount 
> of memory to be used. When I say memory based I mean the cache should have a 
> limit of, say, 100MB.
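Purely as an illustration of the proposal (the class and method names below are 
hypothetical, not Spark or Guava APIs): an LRU cache bounded by total bytes 
rather than by entry count.
{code:python}
from collections import OrderedDict

class BoundedIndexCache:
    """Illustrative byte-bounded LRU cache: entries are evicted once the
    total cached size exceeds max_bytes, regardless of how many there are."""

    def __init__(self, max_bytes=100 * 1024 * 1024):
        self.max_bytes = max_bytes
        self.current_bytes = 0
        self.entries = OrderedDict()  # index file path -> (size, data)

    def put(self, path, data):
        if path in self.entries:
            self.current_bytes -= self.entries.pop(path)[0]
        self.entries[path] = (len(data), data)
        self.current_bytes += len(data)
        while self.current_bytes > self.max_bytes and self.entries:
            _, (evicted_size, _) = self.entries.popitem(last=False)
            self.current_bytes -= evicted_size

    def get(self, path):
        if path not in self.entries:
            return None
        self.entries.move_to_end(path)  # mark as most recently used
        return self.entries[path][1]
{code}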



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21232) New built-in SQL function - Data_Type

2018-02-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21232:
--
Fix Version/s: (was: 2.2.2)
   (was: 2.3.0)

> New built-in SQL function - Data_Type
> -
>
> Key: SPARK-21232
> URL: https://issues.apache.org/jira/browse/SPARK-21232
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR, SQL
>Affects Versions: 2.1.1
>Reporter: Mario Molina
>Priority: Minor
>
> This function returns the data type of a given column.
> {code:java}
> data_type("a")
> // returns string
> {code}
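For comparison, what is already achievable today without the proposed function (a 
minimal PySpark sketch; the column name "a" is just an example and {{spark}} is 
an active SparkSession):
{code:python}
df = spark.createDataFrame([("x", 1)], ["a", "b"])

dict(df.dtypes)["a"]     # 'string'
df.schema["a"].dataType  # StringType
{code}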



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23395) Add an option to return an empty DataFrame from an RDD generated by a Hadoop file when there are no usable paths

2018-02-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23395:
--
Target Version/s:   (was: 2.2.0, 2.2.1)

> Add an option to return an empty DataFrame from an RDD generated by a Hadoop 
> file when there are no usable paths
> 
>
> Key: SPARK-23395
> URL: https://issues.apache.org/jira/browse/SPARK-23395
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jens Rabe
>Priority: Minor
>  Labels: DataFrame, HadoopInputFormat, RDD
>
> When using file-based data from custom formats, Spark's ability to use 
> Hadoop's FileInputFormats is very handy. However, when the path they are 
> pointed at contains no usable data, they throw an IOException saying "No 
> input paths specified in job".
> It would be a nice feature if the DataFrame API somehow could capture this 
> and return an empty DataFrame instead of failing the job.
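A sketch of a user-side workaround in the meantime (the helper name and the 
error-message check are assumptions, not an existing API; {{spark}} is an active 
SparkSession and the input format/classes shown are only examples):
{code:python}
from py4j.protocol import Py4JJavaError
from pyspark.sql.types import StructType, StructField, StringType

def hadoop_rdd_or_none(sc, path, input_format, key_class, value_class):
    """Return the Hadoop-file RDD, or None when the path has no usable input."""
    try:
        rdd = sc.newAPIHadoopFile(path, input_format, key_class, value_class)
        rdd.getNumPartitions()  # forces split computation; raises if no input paths
        return rdd
    except Py4JJavaError as e:
        if "No input paths specified in job" in str(e):
            return None
        raise

schema = StructType([StructField("line", StringType())])
rdd = hadoop_rdd_or_none(
    spark.sparkContext, "/data/custom/*",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text")
df = (spark.createDataFrame(rdd.map(lambda kv: (kv[1],)), schema)
      if rdd is not None else spark.createDataFrame([], schema))
{code}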



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-13 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-23410:
--

 Summary: Unable to read jsons in charset different from UTF-8
 Key: SPARK-23410
 URL: https://issues.apache.org/jira/browse/SPARK-23410
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.3.0
Reporter: Maxim Gekk


Currently the JSON parser is forced to read JSON files in UTF-8. This behavior 
breaks backward compatibility with Spark 2.2.1 and previous versions, which could 
read JSON files in UTF-16, UTF-32 and other encodings thanks to the auto-detection 
mechanism of the Jackson library. Users need a way to read JSON files in a 
specified charset and/or to have the charset detected automatically, as before.
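Until that is addressed, a possible workaround sketch (not the eventual fix) is to 
decode the files manually with the desired charset and feed the resulting lines to 
the JSON reader; the path and charset below are examples, and {{sc}}/{{spark}} are 
an active SparkContext/SparkSession:
{code:python}
# Read the raw bytes, decode them as UTF-16, and let the JSON reader parse
# the resulting lines.
lines = sc.binaryFiles("/data/events-utf16.json").flatMap(
    lambda path_and_bytes: path_and_bytes[1].decode("utf-16").splitlines())
df = spark.read.json(lines)
{code}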



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23409) RandomForest/DecisionTree (syntactic) pruning of redundant subtrees

2018-02-13 Thread Alessandro Solimando (JIRA)
Alessandro Solimando created SPARK-23409:


 Summary: RandomForest/DecisionTree (syntactic) pruning of 
redundant subtrees
 Key: SPARK-23409
 URL: https://issues.apache.org/jira/browse/SPARK-23409
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.2.1
 Environment: 

Reporter: Alessandro Solimando


Improvement: redundancy elimination from decision trees where all the leaves of 
a given subtree share the same prediction.

Benefits:
* Model interpretability
* Faster unitary model invocation (relevant for massive )
* Smaller model memory footprint

For instance, consider the following decision tree.

{panel:title=Original Decision Tree}

{noformat}
DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with 15 nodes
  If (feature 1 <= 0.5)
   If (feature 2 <= 0.5)
If (feature 0 <= 0.5)
 Predict: 0.0
Else (feature 0 > 0.5)
 Predict: 0.0
   Else (feature 2 > 0.5)
If (feature 0 <= 0.5)
 Predict: 0.0
Else (feature 0 > 0.5)
 Predict: 0.0
  Else (feature 1 > 0.5)
   If (feature 2 <= 0.5)
If (feature 0 <= 0.5)
 Predict: 1.0
Else (feature 0 > 0.5)
 Predict: 1.0
   Else (feature 2 > 0.5)
If (feature 0 <= 0.5)
 Predict: 0.0
Else (feature 0 > 0.5)
 Predict: 0.0
{noformat}

{panel}

The proposed method takes the first tree as input and aims to produce the 
following (semantically equivalent) tree as output:

{panel:title=Pruned Decision Tree}

{noformat}
DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with 15 nodes
  If (feature 1 <= 0.5)
   Predict: 0.0
  Else (feature 1 > 0.5)
   If (feature 2 <= 0.5)
Predict: 1.0
   Else (feature 2 > 0.5)
Predict: 0.0
{noformat}

{panel}
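A minimal sketch of the pruning pass itself (the node classes below are 
hypothetical stand-ins for illustration, not Spark's internal tree types): 
collapse, bottom-up, any subtree whose two pruned children are leaves with the 
same prediction.
{code:python}
class Leaf:
    def __init__(self, prediction):
        self.prediction = prediction

class Internal:
    def __init__(self, split, left, right):
        self.split, self.left, self.right = split, left, right

def prune(node):
    # Leaves are already minimal.
    if isinstance(node, Leaf):
        return node
    left, right = prune(node.left), prune(node.right)
    # Both branches agree on the prediction: the split is redundant.
    if (isinstance(left, Leaf) and isinstance(right, Leaf)
            and left.prediction == right.prediction):
        return Leaf(left.prediction)
    return Internal(node.split, left, right)
{code}
Applied to the original tree above, this yields exactly the pruned tree shown in 
the second panel.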




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23408) Flaky test: StreamingOuterJoinSuite.left outer early state exclusion on right

2018-02-13 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-23408:
--

 Summary: Flaky test: StreamingOuterJoinSuite.left outer early 
state exclusion on right
 Key: SPARK-23408
 URL: https://issues.apache.org/jira/browse/SPARK-23408
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 2.4.0
Reporter: Marcelo Vanzin


Seen on an unrelated PR.

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87386/testReport/org.apache.spark.sql.streaming/StreamingOuterJoinSuite/left_outer_early_state_exclusion_on_right/

{noformat}
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
Assert on query failed: Check total state rows = List(4), updated state rows = 
List(4): Array(1) did not equal List(4) incorrect updates rows
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)

org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)

org.apache.spark.sql.streaming.StateStoreMetricsTest$$anonfun$assertNumStateRows$1.apply(StateStoreMetricsTest.scala:28)

org.apache.spark.sql.streaming.StateStoreMetricsTest$$anonfun$assertNumStateRows$1.apply(StateStoreMetricsTest.scala:23)

org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1$$anonfun$apply$14.apply$mcZ$sp(StreamTest.scala:568)

org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:371)

org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:568)

org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:432)

scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)


== Progress ==
   AddData to MemoryStream[value#19652]: 3,4,5
   AddData to MemoryStream[value#19662]: 1,2,3
   CheckLastBatch: [3,10,6,9]
=> AssertOnQuery(, Check total state rows = List(4), updated state 
rows = List(4))
   AddData to MemoryStream[value#19652]: 20
   AddData to MemoryStream[value#19662]: 21
   CheckLastBatch: 
   AddData to MemoryStream[value#19662]: 20
   CheckLastBatch: [20,30,40,60],[4,10,8,null],[5,10,10,null]

== Stream ==
Output Mode: Append
Stream state: {MemoryStream[value#19652]: 0,MemoryStream[value#19662]: 0}
Thread state: alive
Thread stack trace: java.lang.Thread.sleep(Native Method)
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:152)
org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:120)
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
{noformat}

No other failures in the history, though.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23364) 'desc table' command in spark-sql add column head display

2018-02-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23364.
---
Resolution: Not A Problem

> 'desc table' command in spark-sql add column head display
> -
>
> Key: SPARK-23364
> URL: https://issues.apache.org/jira/browse/SPARK-23364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> fix before: 
>  !2.png! 
> fix after:
>  !1.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23407) add a config to try to inline all mutable states during codegen

2018-02-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23407:


Assignee: Wenchen Fan  (was: Apache Spark)

> add a config to try to inline all mutable states during codegen
> ---
>
> Key: SPARK-23407
> URL: https://issues.apache.org/jira/browse/SPARK-23407
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23407) add a config to try to inline all mutable states during codegen

2018-02-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23407:


Assignee: Apache Spark  (was: Wenchen Fan)

> add a config to try to inline all mutable states during codegen
> ---
>
> Key: SPARK-23407
> URL: https://issues.apache.org/jira/browse/SPARK-23407
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23407) add a config to try to inline all mutable states during codegen

2018-02-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362218#comment-16362218
 ] 

Apache Spark commented on SPARK-23407:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/20599

> add a config to try to inline all mutable states during codegen
> ---
>
> Key: SPARK-23407
> URL: https://issues.apache.org/jira/browse/SPARK-23407
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12140) Support Streaming UI in HistoryServer

2018-02-13 Thread German Schiavon Matteo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362216#comment-16362216
 ] 

German Schiavon Matteo commented on SPARK-12140:


Hi guys, is there any progress on this? [~jerryshao] 

> Support Streaming UI in HistoryServer
> -
>
> Key: SPARK-12140
> URL: https://issues.apache.org/jira/browse/SPARK-12140
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> SPARK-11206 added infrastructure that would allow the streaming UI to be 
> shown in the History Server. We should add the necessary code to make that 
> happen, although it requires some changes to how events and listeners are 
> used.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23396) Spark HistoryServer will OOM if the event log is big

2018-02-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23396.
---
Resolution: Not A Problem

> Spark HistoryServer will OOM if the event log is big
> 
>
> Key: SPARK-23396
> URL: https://issues.apache.org/jira/browse/SPARK-23396
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: KaiXinXIaoLei
>Priority: Major
> Attachments: eventlog.png, historyServer.png
>
>
> If the event log is big, the HistoryServer web UI will run out of memory. My 
> event log size is 5.1G:
>  
> !eventlog.png!
> When I open the web UI, it runs out of memory (OOM).
>  
> !historyServer.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


