[jira] [Assigned] (SPARK-17813) Maximum data per trigger

2016-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17813:


Assignee: (was: Apache Spark)

> Maximum data per trigger
> 
>
> Key: SPARK-17813
> URL: https://issues.apache.org/jira/browse/SPARK-17813
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> At any given point in a streaming query execution, we process all available 
> data.  This maximizes throughput at the cost of latency.  We should add 
> something similar to the {{maxFilesPerTrigger}} option available for files.
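
For context, the analogous cap that already exists for file sources is the {{maxFilesPerTrigger}} read option. A minimal Scala sketch of that existing option (the schema, path, and option value are illustrative assumptions, not taken from this issue):

{noformat}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Sketch: cap how much data each streaming trigger processes for a file source.
// The issue asks for a similar per-trigger limit on other sources.
val spark = SparkSession.builder().appName("max-per-trigger-sketch").getOrCreate()

val schema = new StructType()
  .add("id", LongType)
  .add("event", StringType)

val events = spark.readStream
  .schema(schema)
  .option("maxFilesPerTrigger", "10") // process at most 10 new files per trigger
  .json("/data/events")               // assumed input directory
{noformat}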



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17813) Maximum data per trigger

2016-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17813:


Assignee: Apache Spark

> Maximum data per trigger
> 
>
> Key: SPARK-17813
> URL: https://issues.apache.org/jira/browse/SPARK-17813
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>
> At any given point in a streaming query execution, we process all available 
> data.  This maximizes throughput at the cost of latency.  We should add 
> something similar to the {{maxFilesPerTrigger}} option available for files.






[jira] [Commented] (SPARK-17813) Maximum data per trigger

2016-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584515#comment-15584515
 ] 

Apache Spark commented on SPARK-17813:
--

User 'koeninger' has created a pull request for this issue:
https://github.com/apache/spark/pull/15527

> Maximum data per trigger
> 
>
> Key: SPARK-17813
> URL: https://issues.apache.org/jira/browse/SPARK-17813
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> At any given point in a streaming query execution, we process all available 
> data.  This maximizes throughput at the cost of latency.  We should add 
> something similar to the {{maxFilesPerTrigger}} option available for files.






[jira] [Assigned] (SPARK-17986) SQLTransformer leaks temporary tables

2016-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17986:


Assignee: Apache Spark

> SQLTransformer leaks temporary tables
> -
>
> Key: SPARK-17986
> URL: https://issues.apache.org/jira/browse/SPARK-17986
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1
>Reporter: Drew Robb
>Assignee: Apache Spark
>Priority: Minor
>
> The SQLTransformer creates a temporary table when called, and does not delete 
> this temporary table. When using a SQLTransformer in a long running Spark 
> Streaming task, these temporary tables accumulate.
> I believe that the fix would be as simple as calling  
> `dataset.sparkSession.catalog.dropTempView(tableName)` in the last part of 
> `transform`:
> https://github.com/apache/spark/blob/v2.0.1/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala#L65.
>  
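
A standalone sketch of the cleanup being proposed (the session, sample data, view name, and SQL here are illustrative, not the actual SQLTransformer source):

{noformat}
import org.apache.spark.sql.SparkSession

// Sketch: register a temp view, run the SQL, then drop the view so that a
// long-running application does not accumulate temporary tables in the catalog.
val spark = SparkSession.builder().appName("sqltransformer-cleanup-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1, 2.0), (2, 3.0)).toDF("id", "v")

val tableName = "sqltransformer_tmp"           // illustrative name
df.createOrReplaceTempView(tableName)
val result = spark.sql(s"SELECT *, v * 2.0 AS v2 FROM $tableName")
// The proposed fix: drop the temp view once the result DataFrame has been built.
spark.catalog.dropTempView(tableName)
result.show()
{noformat}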






[jira] [Assigned] (SPARK-17986) SQLTransformer leaks temporary tables

2016-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17986:


Assignee: (was: Apache Spark)

> SQLTransformer leaks temporary tables
> -
>
> Key: SPARK-17986
> URL: https://issues.apache.org/jira/browse/SPARK-17986
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1
>Reporter: Drew Robb
>Priority: Minor
>
> The SQLTransformer creates a temporary table when called, and does not delete 
> this temporary table. When using a SQLTransformer in a long running Spark 
> Streaming task, these temporary tables accumulate.
> I believe that the fix would be as simple as calling  
> `dataset.sparkSession.catalog.dropTempView(tableName)` in the last part of 
> `transform`:
> https://github.com/apache/spark/blob/v2.0.1/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala#L65.
>  






[jira] [Commented] (SPARK-17986) SQLTransformer leaks temporary tables

2016-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584466#comment-15584466
 ] 

Apache Spark commented on SPARK-17986:
--

User 'drewrobb' has created a pull request for this issue:
https://github.com/apache/spark/pull/15526

> SQLTransformer leaks temporary tables
> -
>
> Key: SPARK-17986
> URL: https://issues.apache.org/jira/browse/SPARK-17986
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1
>Reporter: Drew Robb
>Priority: Minor
>
> The SQLTransformer creates a temporary table when called, and does not delete 
> this temporary table. When using a SQLTransformer in a long running Spark 
> Streaming task, these temporary tables accumulate.
> I believe that the fix would be as simple as calling  
> `dataset.sparkSession.catalog.dropTempView(tableName)` in the last part of 
> `transform`:
> https://github.com/apache/spark/blob/v2.0.1/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala#L65.
>  






[jira] [Updated] (SPARK-17986) SQLTransformer leaks temporary tables

2016-10-17 Thread Drew Robb (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Robb updated SPARK-17986:
--
Description: 
The SQLTransformer creates a temporary table when called, and does not delete 
this temporary table. When using a SQLTransformer in a long running Spark 
Streaming task, these temporary tables accumulate.

I believe that the fix would be as simple as calling  
`dataset.sparkSession.catalog.dropTempView(tableName)` in the last part of 
`transform`:
https://github.com/apache/spark/blob/v2.0.1/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala#L65.
 

  was:
The SQLTransformer creates a temporary table when called, and does not delete 
this temporary table. When using a SQLTransformer in a long running Spark 
Streaming task, these temporary tables accumulate.

I believe that the fix would be as simple as calling  
`dataset.sparkSession.catalog.dropTempView(tableName)` in the last part of 
`transform`:
https://github.com/apache/spark/blob/v2.0.1/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala#L65.
 I would be happy to attempt this fix myself if someone could validate this 
issue.


> SQLTransformer leaks temporary tables
> -
>
> Key: SPARK-17986
> URL: https://issues.apache.org/jira/browse/SPARK-17986
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1
>Reporter: Drew Robb
>Priority: Minor
>
> The SQLTransformer creates a temporary table when called, and does not delete 
> this temporary table. When using a SQLTransformer in a long running Spark 
> Streaming task, these temporary tables accumulate.
> I believe that the fix would be as simple as calling  
> `dataset.sparkSession.catalog.dropTempView(tableName)` in the last part of 
> `transform`:
> https://github.com/apache/spark/blob/v2.0.1/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala#L65.
>  






[jira] [Closed] (SPARK-17956) ProjectExec has incorrect outputOrdering property

2016-10-17 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh closed SPARK-17956.
---
Resolution: Won't Fix

> ProjectExec has incorrect outputOrdering property
> -
>
> Key: SPARK-17956
> URL: https://issues.apache.org/jira/browse/SPARK-17956
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Currently ProjectExec simply takes the child plan's outputOrdering as its 
> outputOrdering. In some cases this leads to an incorrect outputOrdering. This 
> also applies to TakeOrderedAndProjectExec.






[jira] [Assigned] (SPARK-17974) Refactor FileCatalog classes to simplify the inheritance tree

2016-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17974:


Assignee: Apache Spark  (was: Eric Liang)

> Refactor FileCatalog classes to simplify the inheritance tree
> -
>
> Key: SPARK-17974
> URL: https://issues.apache.org/jira/browse/SPARK-17974
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 2.1.0
>
>
> This is a follow-up item for https://github.com/apache/spark/pull/14690 which 
> adds support for metastore partition pruning of converted hive tables.






[jira] [Assigned] (SPARK-17974) Refactor FileCatalog classes to simplify the inheritance tree

2016-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17974:


Assignee: Eric Liang  (was: Apache Spark)

> Refactor FileCatalog classes to simplify the inheritance tree
> -
>
> Key: SPARK-17974
> URL: https://issues.apache.org/jira/browse/SPARK-17974
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> This is a follow-up item for https://github.com/apache/spark/pull/14690 which 
> adds support for metastore partition pruning of converted hive tables.






[jira] [Reopened] (SPARK-17974) Refactor FileCatalog classes to simplify the inheritance tree

2016-10-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reopened SPARK-17974:
-

Reopening since the previous commit was not tested by Jenkins (failed Scala 
linter).


> Refactor FileCatalog classes to simplify the inheritance tree
> -
>
> Key: SPARK-17974
> URL: https://issues.apache.org/jira/browse/SPARK-17974
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> This is a follow-up item for https://github.com/apache/spark/pull/14690 which 
> adds support for metastore partition pruning of converted hive tables.






[jira] [Commented] (SPARK-17862) Feature flag SPARK-16980

2016-10-17 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584374#comment-15584374
 ] 

Reynold Xin commented on SPARK-17862:
-

cc [~ekhliang] this was done right? Can you put the flag here?


> Feature flag SPARK-16980
> 
>
> Key: SPARK-17862
> URL: https://issues.apache.org/jira/browse/SPARK-17862
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>







[jira] [Updated] (SPARK-17970) store partition spec in metastore for data source table

2016-10-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17970:

Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-17861

> store partition spec in metastore for data source table
> ---
>
> Key: SPARK-17970
> URL: https://issues.apache.org/jira/browse/SPARK-17970
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Updated] (SPARK-17974) Refactor FileCatalog classes to simplify the inheritance tree

2016-10-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17974:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-17861

> Refactor FileCatalog classes to simplify the inheritance tree
> -
>
> Key: SPARK-17974
> URL: https://issues.apache.org/jira/browse/SPARK-17974
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> This is a follow-up item for https://github.com/apache/spark/pull/14690 which 
> adds support for metastore partition pruning of converted hive tables.






[jira] [Resolved] (SPARK-17974) Refactor FileCatalog classes to simplify the inheritance tree

2016-10-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17974.
-
   Resolution: Fixed
 Assignee: Eric Liang
Fix Version/s: 2.1.0

> Refactor FileCatalog classes to simplify the inheritance tree
> -
>
> Key: SPARK-17974
> URL: https://issues.apache.org/jira/browse/SPARK-17974
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> This is a follow-up item for https://github.com/apache/spark/pull/14690 which 
> adds support for metastore partition pruning of converted hive tables.






[jira] [Comment Edited] (SPARK-14212) Add configuration element for --packages option

2016-10-17 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584142#comment-15584142
 ] 

Marcelo Vanzin edited comment on SPARK-14212 at 10/18/16 3:50 AM:
--

SPARK-15760 added the docs to 2.0 only, but I'm pretty sure the options were 
there in 1.6.


was (Author: vanzin):
SPARK-15760 added the docs to 2.0 only, but I'm pretty sure the options were 
these in 1.6.

> Add configuration element for --packages option
> ---
>
> Key: SPARK-14212
> URL: https://issues.apache.org/jira/browse/SPARK-14212
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, PySpark
>Affects Versions: 1.6.1
>Reporter: Russell Jurney
>Priority: Trivial
>  Labels: config, starter
>
> I use PySpark with the --packages option, for instance to load support for 
> CSV: 
> pyspark --packages com.databricks:spark-csv_2.10:1.4.0
> I would like to not have to set this every time at the command line, so a 
> corresponding entry for --packages in the configuration file, 
> spark-defaults.conf, would be good to have.
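
For reference, a sketch of what such an entry could look like, assuming {{spark.jars.packages}} is the corresponding property (the coordinates are just the CSV example above):

{noformat}
# spark-defaults.conf (sketch; property name assumed, coordinates from the example above)
spark.jars.packages  com.databricks:spark-csv_2.10:1.4.0
{noformat}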






[jira] [Resolved] (SPARK-17620) hive.default.fileformat=orc does not set OrcSerde

2016-10-17 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-17620.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15495
[https://github.com/apache/spark/pull/15495]

> hive.default.fileformat=orc does not set OrcSerde
> -
>
> Key: SPARK-17620
> URL: https://issues.apache.org/jira/browse/SPARK-17620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Brian Cho
>Assignee: Dilip Biswal
>Priority: Minor
> Fix For: 2.1.0
>
>
> Setting {{hive.default.fileformat=orc}} does not set OrcSerde. This behavior 
> is inconsistent with {{STORED AS ORC}}. This means we cannot set a default 
> behavior for creating tables using orc.
> The behavior using stored as:
> {noformat}
> scala> spark.sql("CREATE TABLE tmp_stored_as(id INT) STORED AS ORC")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_stored_as").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}
> Behavior setting default conf (SerDe Library is not set properly):
> {noformat}
> scala> spark.sql("SET hive.default.fileformat=orc")
> res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("CREATE TABLE tmp_default(id INT)")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}






[jira] [Created] (SPARK-17986) SQLTransformer leaks temporary tables

2016-10-17 Thread Drew Robb (JIRA)
Drew Robb created SPARK-17986:
-

 Summary: SQLTransformer leaks temporary tables
 Key: SPARK-17986
 URL: https://issues.apache.org/jira/browse/SPARK-17986
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.0.1
Reporter: Drew Robb
Priority: Minor


The SQLTransformer creates a temporary table when called, and does not delete 
this temporary table. When using a SQLTransformer in a long running Spark 
Streaming task, these temporary tables accumulate.

I believe that the fix would be as simple as calling  
`dataset.sparkSession.catalog.dropTempView(tableName)` in the last part of 
`transform`:
https://github.com/apache/spark/blob/v2.0.1/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala#L65.
 I would be happy to attempt this fix myself if someone could validate this 
issue.






[jira] [Created] (SPARK-17985) Bump commons-lang3 version to 3.5.

2016-10-17 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-17985:
-

 Summary: Bump commons-lang3 version to 3.5.
 Key: SPARK-17985
 URL: https://issues.apache.org/jira/browse/SPARK-17985
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Takuya Ueshin


{{SerializationUtils.clone()}} in commons-lang3 (< 3.5) has a thread-safety bug: it 
sometimes gets stuck because of a race condition while initializing a hash map.
See https://issues.apache.org/jira/browse/LANG-1251.
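
An illustrative sketch (not from the issue) of the affected usage pattern: concurrent {{SerializationUtils.clone()}} calls, which on commons-lang3 < 3.5 can get stuck because of the hash-map initialization race described above.

{noformat}
import org.apache.commons.lang3.SerializationUtils

// Sketch only: several threads cloning serializable objects at the same time.
// On commons-lang3 < 3.5 this could hang (LANG-1251); 3.5 fixes it.
case class Payload(id: Int, name: String)

object CloneConcurrently {
  def main(args: Array[String]): Unit = {
    val threads = (1 to 4).map { i =>
      new Thread(new Runnable {
        override def run(): Unit = {
          val copy = SerializationUtils.clone(Payload(i, s"item-$i"))
          println(copy)
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
  }
}
{noformat}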






[jira] [Assigned] (SPARK-17985) Bump commons-lang3 version to 3.5.

2016-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17985:


Assignee: Apache Spark

> Bump commons-lang3 version to 3.5.
> --
>
> Key: SPARK-17985
> URL: https://issues.apache.org/jira/browse/SPARK-17985
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>
> {{SerializationUtils.clone()}} in commons-lang3 (< 3.5) has a thread-safety bug: it 
> sometimes gets stuck because of a race condition while initializing a hash map.
> See https://issues.apache.org/jira/browse/LANG-1251.






[jira] [Assigned] (SPARK-17985) Bump commons-lang3 version to 3.5.

2016-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17985:


Assignee: (was: Apache Spark)

> Bump commons-lang3 version to 3.5.
> --
>
> Key: SPARK-17985
> URL: https://issues.apache.org/jira/browse/SPARK-17985
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Takuya Ueshin
>
> {{SerializationUtils.clone()}} in commons-lang3 (< 3.5) has a thread-safety bug: it 
> sometimes gets stuck because of a race condition while initializing a hash map.
> See https://issues.apache.org/jira/browse/LANG-1251.






[jira] [Commented] (SPARK-17985) Bump commons-lang3 version to 3.5.

2016-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584263#comment-15584263
 ] 

Apache Spark commented on SPARK-17985:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15525

> Bump commons-lang3 version to 3.5.
> --
>
> Key: SPARK-17985
> URL: https://issues.apache.org/jira/browse/SPARK-17985
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Takuya Ueshin
>
> {{SerializationUtils.clone()}} in commons-lang3 (< 3.5) has a thread-safety bug: it 
> sometimes gets stuck because of a race condition while initializing a hash map.
> See https://issues.apache.org/jira/browse/LANG-1251.






[jira] [Updated] (SPARK-17984) Add support for numa aware

2016-10-17 Thread quanfuwang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

quanfuwang updated SPARK-17984:
---
Description: 
This Jira is target to add support numa aware feature which can help improve 
performance by making core access local memory rather than remote one. 

 A patch is being developed, see https://github.com/apache/spark/pull/15524.
And the whole task includes 3 subtasks and will be developed iteratively:
Numa aware support for Yarn based deployment mode
Numa aware support for Mesos based deployment mode
Numa aware support for Standalone based deployment mode

  was:
This Jira is target to add support numa aware feature which can help improve 
performance by making core access local memory rather than remote one. 

 A patch is being developed, see https://github.com/apache/spark/pull/15524.
And the whole task includes 3 subtask and will be developed iteratively:
Numa aware support for Yarn based deployment mode
Numa aware support for Mesos based deployment mode
Numa aware support for Standalone based deployment mode


> Add support for numa aware
> --
>
> Key: SPARK-17984
> URL: https://issues.apache.org/jira/browse/SPARK-17984
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Mesos, YARN
>Affects Versions: 2.0.1
> Environment: Cluster Topo: 1 Master + 4 Slaves
> CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores)
> Memory: 128GB(2 NUMA Nodes)
> SW Version: Hadoop-5.7.0 + Spark-2.0.0
>Reporter: quanfuwang
> Fix For: 2.0.1
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> This JIRA aims to add a NUMA-aware feature, which can help improve 
> performance by having cores access local memory rather than remote memory. 
> A patch is being developed; see https://github.com/apache/spark/pull/15524.
> The whole task includes 3 subtasks and will be developed iteratively:
> NUMA-aware support for the YARN-based deployment mode
> NUMA-aware support for the Mesos-based deployment mode
> NUMA-aware support for the Standalone-based deployment mode






[jira] [Updated] (SPARK-17984) Add support for numa aware feature

2016-10-17 Thread quanfuwang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

quanfuwang updated SPARK-17984:
---
Summary: Add support for numa aware feature  (was: Add support for numa 
aware)

> Add support for numa aware feature
> --
>
> Key: SPARK-17984
> URL: https://issues.apache.org/jira/browse/SPARK-17984
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Mesos, YARN
>Affects Versions: 2.0.1
> Environment: Cluster Topo: 1 Master + 4 Slaves
> CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores)
> Memory: 128GB(2 NUMA Nodes)
> SW Version: Hadoop-5.7.0 + Spark-2.0.0
>Reporter: quanfuwang
> Fix For: 2.0.1
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> This JIRA aims to add a NUMA-aware feature, which can help improve 
> performance by having cores access local memory rather than remote memory. 
> A patch is being developed; see https://github.com/apache/spark/pull/15524.
> The whole task includes 3 subtasks and will be developed iteratively:
> NUMA-aware support for the YARN-based deployment mode
> NUMA-aware support for the Mesos-based deployment mode
> NUMA-aware support for the Standalone-based deployment mode






[jira] [Resolved] (SPARK-5230) Print usage for spark-submit and spark-class in Windows

2016-10-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-5230.
---
Resolution: Done

Pretty sure I implemented this somewhere in the 1.x line.

> Print usage for spark-submit and spark-class in Windows
> ---
>
> Key: SPARK-5230
> URL: https://issues.apache.org/jira/browse/SPARK-5230
> Project: Spark
>  Issue Type: Improvement
>  Components: Windows
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Priority: Minor
>
> We currently only print the usage in `bin/spark-shell2.cmd`. We should do it 
> for `bin/spark-submit2.cmd` and `bin/spark-class2.cmd` too.






[jira] [Resolved] (SPARK-5925) YARN - Spark progress bar stucks at 10% but after finishing shows 100%

2016-10-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-5925.
---
Resolution: Won't Fix

I don't think this can be fixed in Spark at all. There's no way to know 
beforehand how many jobs or tasks or stages an app will run. Imagine a long 
running spark-shell where the user is running a lot of small jobs... what's the 
progress of the overall app?

There's just a mismatch between the YARN API and how Spark works. The YARN API 
makes a lot of sense for MapReduce apps. It doesn't make sense for Spark. 
Unless Spark exposes its own API for applications to report progress and proxy 
that information to YARN, but I don't see that happening.

> YARN - Spark progress bar stucks at 10% but after finishing shows 100%
> --
>
> Key: SPARK-5925
> URL: https://issues.apache.org/jira/browse/SPARK-5925
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.2.1
>Reporter: Laszlo Fesus
>Priority: Minor
>
> I did set up a yarn cluster (CDH5) and spark (1.2.1), and also started Spark 
> History Server. Now I am able to click on more details on yarn's web 
> interface and get redirected to the appropriate spark logs during both job 
> execution and also after the job has finished. 
> My only concern is that while a spark job is being executed (either 
> yarn-client or yarn-cluster), the progress bar gets stuck at 10% and doesn't 
> increase as it does for MapReduce jobs. After finishing, it shows 100% properly, 
> but we lose the real-time tracking capability of the status bar. 
> I also tested the yarn RESTful web interface, and it again reports 10% during 
> (yarn) spark job execution, and works well again after finishing. (I suppose 
> for the time being I should have a look at Spark Job Server and see if it's 
> possible to track the job via its RESTful web interface.)
> Did anyone else experience this behaviour? Thanks in advance.






[jira] [Resolved] (SPARK-6108) No application number limit in spark history server

2016-10-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-6108.
---
Resolution: Won't Fix

There are many ways currently to control how many applications are kept around; 
the SHS can even clean up old logs.

HDFS overhead is less of a problem since we started using a single file for 
event logs.
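
For example, a sketch of the cleanup-related settings referred to above, in the history server's configuration (values are illustrative):

{noformat}
# Periodically delete event logs older than maxAge:
spark.history.fs.cleaner.enabled   true
spark.history.fs.cleaner.interval  1d
spark.history.fs.cleaner.maxAge    7d
# Cap how many application UIs are kept in memory:
spark.history.retainedApplications 50
{noformat}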

> No application number limit in spark history server
> ---
>
> Key: SPARK-6108
> URL: https://issues.apache.org/jira/browse/SPARK-6108
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.2.1
>Reporter: Xia Hu
>Priority: Minor
>
> There isn't a limit on the number of applications in the spark history server. The 
> only limit I found is "spark.history.retainedApplications", but this one only 
> controls how many apps can be stored in memory. 
> But I think a limit on the number of history applications is needed, because if 
> the number is too big, it is inconvenient for both HDFS and the history server. 






[jira] [Resolved] (SPARK-7882) HBase Input Format Example does not allow passing ZK parent node

2016-10-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-7882.
---
Resolution: Not A Problem

HBase examples are not included anymore.

> HBase Input Format Example does not allow passing ZK parent node
> 
>
> Key: SPARK-7882
> URL: https://issues.apache.org/jira/browse/SPARK-7882
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Reporter: Ram Sriharsha
>Assignee: Ram Sriharsha
>Priority: Minor
>
> HBase Input Format example here:
> https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_inputformat.py#L52
> precludes passing a fourth parameter (zk.node.parent) even though down the 
> line there is code checking for a possible fourth parameter and interpreting 
> it as zk.node.parent here:
> https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_inputformat.py#L71






[jira] [Updated] (SPARK-17984) Add support for numa aware

2016-10-17 Thread quanfuwang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

quanfuwang updated SPARK-17984:
---
Issue Type: New Feature  (was: Task)

> Add support for numa aware
> --
>
> Key: SPARK-17984
> URL: https://issues.apache.org/jira/browse/SPARK-17984
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Mesos, YARN
>Affects Versions: 2.0.1
> Environment: Cluster Topo: 1 Master + 4 Slaves
> CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores)
> Memory: 128GB(2 NUMA Nodes)
> SW Version: Hadoop-5.7.0 + Spark-2.0.0
>Reporter: quanfuwang
> Fix For: 2.0.1
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> This JIRA aims to add a NUMA-aware feature, which can help improve 
> performance by having cores access local memory rather than remote memory. 
> A patch is being developed; see https://github.com/apache/spark/pull/15524.
> The whole task includes 3 subtasks and will be developed iteratively:
> NUMA-aware support for the YARN-based deployment mode
> NUMA-aware support for the Mesos-based deployment mode
> NUMA-aware support for the Standalone-based deployment mode






[jira] [Updated] (SPARK-17984) Add support for numa aware

2016-10-17 Thread quanfuwang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

quanfuwang updated SPARK-17984:
---
Shepherd:   (was: quanfuwang)

> Add support for numa aware
> --
>
> Key: SPARK-17984
> URL: https://issues.apache.org/jira/browse/SPARK-17984
> Project: Spark
>  Issue Type: Task
>  Components: Deploy, Mesos, YARN
>Affects Versions: 2.0.1
> Environment: Cluster Topo: 1 Master + 4 Slaves
> CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores)
> Memory: 128GB(2 NUMA Nodes)
> SW Version: Hadoop-5.7.0 + Spark-2.0.0
>Reporter: quanfuwang
> Fix For: 2.0.1
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> This JIRA aims to add a NUMA-aware feature, which can help improve 
> performance by having cores access local memory rather than remote memory. 
> A patch is being developed; see https://github.com/apache/spark/pull/15524.
> The whole task includes 3 subtasks and will be developed iteratively:
> NUMA-aware support for the YARN-based deployment mode
> NUMA-aware support for the Mesos-based deployment mode
> NUMA-aware support for the Standalone-based deployment mode






[jira] [Resolved] (SPARK-8122) ParquetRelation.enableLogForwarding() may fail to configure loggers

2016-10-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-8122.
---
Resolution: Won't Fix

This code doesn't exist anymore in 2.x at least, so I'll assume this won't be 
fixed in old maintenance releases.

> ParquetRelation.enableLogForwarding() may fail to configure loggers
> ---
>
> Key: SPARK-8122
> URL: https://issues.apache.org/jira/browse/SPARK-8122
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Konstantin Shaposhnikov
>Priority: Minor
>
> _enableLogForwarding()_ doesn't hold on to the created loggers, which can be 
> garbage collected, and then all configuration changes are gone. From 
> https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html 
> javadocs:  _It is important to note that the Logger returned by one of the 
> getLogger factory methods may be garbage collected at any time if a strong 
> reference to the Logger is not kept._
> All created logger references need to be kept, e.g. in static variables.
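
A minimal sketch of the pattern suggested in the last sentence (the object and logger names are illustrative): keep a strong, long-lived reference to the configured java.util.logging Logger so it cannot be garbage collected.

{noformat}
import java.util.logging.{Level, Logger}

// Sketch: the singleton object holds the configured JUL logger for the lifetime
// of the JVM, so the level/handler configuration is not lost to GC.
object LoggerHolder {
  val parquetLogger: Logger = Logger.getLogger("parquet")
  parquetLogger.setLevel(Level.WARNING)
}
{noformat}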






[jira] [Commented] (SPARK-17984) Add support for numa aware

2016-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584206#comment-15584206
 ] 

Apache Spark commented on SPARK-17984:
--

User 'quanfuw' has created a pull request for this issue:
https://github.com/apache/spark/pull/15524

> Add support for numa aware
> --
>
> Key: SPARK-17984
> URL: https://issues.apache.org/jira/browse/SPARK-17984
> Project: Spark
>  Issue Type: Task
>  Components: Deploy, Mesos, YARN
>Affects Versions: 2.0.1
> Environment: Cluster Topo: 1 Master + 4 Slaves
> CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores)
> Memory: 128GB(2 NUMA Nodes)
> SW Version: Hadoop-5.7.0 + Spark-2.0.0
>Reporter: quanfuwang
> Fix For: 2.0.1
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> This JIRA aims to add a NUMA-aware feature, which can help improve 
> performance by having cores access local memory rather than remote memory. 
> A patch is being developed; see https://github.com/apache/spark/pull/15524.
> The whole task includes 3 subtasks and will be developed iteratively:
> NUMA-aware support for the YARN-based deployment mode
> NUMA-aware support for the Mesos-based deployment mode
> NUMA-aware support for the Standalone-based deployment mode






[jira] [Assigned] (SPARK-17984) Add support for numa aware

2016-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17984:


Assignee: (was: Apache Spark)

> Add support for numa aware
> --
>
> Key: SPARK-17984
> URL: https://issues.apache.org/jira/browse/SPARK-17984
> Project: Spark
>  Issue Type: Task
>  Components: Deploy, Mesos, YARN
>Affects Versions: 2.0.1
> Environment: Cluster Topo: 1 Master + 4 Slaves
> CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores)
> Memory: 128GB(2 NUMA Nodes)
> SW Version: Hadoop-5.7.0 + Spark-2.0.0
>Reporter: quanfuwang
> Fix For: 2.0.1
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> This JIRA aims to add a NUMA-aware feature, which can help improve 
> performance by having cores access local memory rather than remote memory. 
> A patch is being developed; see https://github.com/apache/spark/pull/15524.
> The whole task includes 3 subtasks and will be developed iteratively:
> NUMA-aware support for the YARN-based deployment mode
> NUMA-aware support for the Mesos-based deployment mode
> NUMA-aware support for the Standalone-based deployment mode






[jira] [Assigned] (SPARK-17984) Add support for numa aware

2016-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17984:


Assignee: Apache Spark

> Add support for numa aware
> --
>
> Key: SPARK-17984
> URL: https://issues.apache.org/jira/browse/SPARK-17984
> Project: Spark
>  Issue Type: Task
>  Components: Deploy, Mesos, YARN
>Affects Versions: 2.0.1
> Environment: Cluster Topo: 1 Master + 4 Slaves
> CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores)
> Memory: 128GB(2 NUMA Nodes)
> SW Version: Hadoop-5.7.0 + Spark-2.0.0
>Reporter: quanfuwang
>Assignee: Apache Spark
> Fix For: 2.0.1
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> This JIRA aims to add a NUMA-aware feature, which can help improve 
> performance by having cores access local memory rather than remote memory. 
> A patch is being developed; see https://github.com/apache/spark/pull/15524.
> The whole task includes 3 subtasks and will be developed iteratively:
> NUMA-aware support for the YARN-based deployment mode
> NUMA-aware support for the Mesos-based deployment mode
> NUMA-aware support for the Standalone-based deployment mode






[jira] [Resolved] (SPARK-12280) "--packages" command doesn't work in "spark-submit"

2016-10-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-12280.

Resolution: Cannot Reproduce

Please reopen with more info if you're still running into issues. Lots of 
people use this command line option and haven't run into problems.

> "--packages" command doesn't work in "spark-submit"
> ---
>
> Key: SPARK-12280
> URL: https://issues.apache.org/jira/browse/SPARK-12280
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Reporter: Anton Loss
>Priority: Minor
>
> When running "spark-shell", the "--packages" option works as expected, but 
> with "spark-submit" it produces the following stack trace:
> 15/12/11 17:05:48 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 15/12/11 17:05:51 WARN Client: Resource 
> file:/home/anton/data-tools-1.0-SNAPSHOT-jar-with-dependencies.jar added 
> multiple times to distributed cache.
> Exception in thread "main" java.io.FileNotFoundException: Requested file 
> maprfs:///home/mapr/.ivy2/jars/com.databricks_spark-csv_2.11-1.3.0.jar does 
> not exist.
>   at 
> com.mapr.fs.MapRFileSystem.getMapRFileStatus(MapRFileSystem.java:1332)
>   at com.mapr.fs.MapRFileSystem.getFileStatus(MapRFileSystem.java:942)
>   at com.mapr.fs.MFS.getFileStatus(MFS.java:151)
>   at 
> org.apache.hadoop.fs.AbstractFileSystem.resolvePath(AbstractFileSystem.java:467)
>   at org.apache.hadoop.fs.FileContext$25.next(FileContext.java:2193)
>   at org.apache.hadoop.fs.FileContext$25.next(FileContext.java:2189)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.resolve(FileContext.java:2189)
>   at org.apache.hadoop.fs.FileContext.resolvePath(FileContext.java:601)
>   at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:242)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$1.apply(Client.scala:366)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$1.apply(Client.scala:360)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6.apply(Client.scala:360)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6.apply(Client.scala:358)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:358)
>   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)
>   at org.apache.spark.deploy.yarn.Client.run(Client.scala:842)
>   at org.apache.spark.deploy.yarn.Client$.main(Client.scala:881)
>   at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> It seems it's looking in the wrong place, as the jar is clearly present here:
> file:///home/mapr/.ivy2/jars/com.databricks_spark-csv_2.11-1.3.0.jar






[jira] [Updated] (SPARK-17984) Add support for numa aware

2016-10-17 Thread quanfuwang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

quanfuwang updated SPARK-17984:
---
Description: 
This Jira is target to add support numa aware feature which can help improve 
performance by making core access local memory rather than remote one. 

 A patch is being developed, see https://github.com/apache/spark/pull/15524.
And the whole task includes 3 subtask and will be developed iteratively:
Numa aware support for Yarn based deployment mode
Numa aware support for Mesos based deployment mode
Numa aware support for Standalone based deployment mode

  was:
This Jira is target to add support numa aware feature which can help improve 
performance by making core access local memory rather the remote one. 

 A patch is being developed, see https://github.com/apache/spark/pull/15524.
And the whole task includes 3 subtask and will be developed iteratively:
Numa aware support for Yarn based deployment mode
Numa aware support for Mesos based deployment mode
Numa aware support for Standalone based deployment mode


> Add support for numa aware
> --
>
> Key: SPARK-17984
> URL: https://issues.apache.org/jira/browse/SPARK-17984
> Project: Spark
>  Issue Type: Task
>  Components: Deploy, Mesos, YARN
>Affects Versions: 2.0.1
> Environment: Cluster Topo: 1 Master + 4 Slaves
> CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores)
> Memory: 128GB(2 NUMA Nodes)
> SW Version: Hadoop-5.7.0 + Spark-2.0.0
>Reporter: quanfuwang
> Fix For: 2.0.1
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> This JIRA aims to add a NUMA-aware feature, which can help improve 
> performance by having cores access local memory rather than remote memory. 
> A patch is being developed; see https://github.com/apache/spark/pull/15524.
> The whole task includes 3 subtasks and will be developed iteratively:
> NUMA-aware support for the YARN-based deployment mode
> NUMA-aware support for the Mesos-based deployment mode
> NUMA-aware support for the Standalone-based deployment mode






[jira] [Updated] (SPARK-17984) Add support for numa aware

2016-10-17 Thread quanfuwang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

quanfuwang updated SPARK-17984:
---
Description: 
This Jira is target to add support numa aware feature which can help improve 
performance by making core access local memory rather the remote one. 

 A patch is being developed, see https://github.com/apache/spark/pull/15524.
And the whole task includes 3 subtask and will be developed iteratively:
Numa aware support for Yarn based deployment mode
Numa aware support for Mesos based deployment mode
Numa aware support for Standalone based deployment mode

  was:
This Jira is target to add support numa aware feature which make can help 
improve performance by making core access local memory rather the remote one. 

 A patch is being developed, see https://github.com/apache/spark/pull/15524.
And the whole task includes 3 subtask and will be developed iteratively:
Numa aware support for Yarn based deployment mode
Numa aware support for Mesos based deployment mode
Numa aware support for Standalone based deployment mode


> Add support for numa aware
> --
>
> Key: SPARK-17984
> URL: https://issues.apache.org/jira/browse/SPARK-17984
> Project: Spark
>  Issue Type: Task
>  Components: Deploy, Mesos, YARN
>Affects Versions: 2.0.1
> Environment: Cluster Topo: 1 Master + 4 Slaves
> CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores)
> Memory: 128GB(2 NUMA Nodes)
> SW Version: Hadoop-5.7.0 + Spark-2.0.0
>Reporter: quanfuwang
> Fix For: 2.0.1
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> This JIRA aims to add a NUMA-aware feature, which can help improve 
> performance by having cores access local memory rather than remote memory. 
> A patch is being developed; see https://github.com/apache/spark/pull/15524.
> The whole task includes 3 subtasks and will be developed iteratively:
> NUMA-aware support for the YARN-based deployment mode
> NUMA-aware support for the Mesos-based deployment mode
> NUMA-aware support for the Standalone-based deployment mode






[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails when select statement contains limit clause

2016-10-17 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584178#comment-15584178
 ] 

Franck Tago commented on SPARK-17982:
-

== SQL ==
SELECT `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
`gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
`gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
`gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
gen_subquery_1
^^^

  at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
  at 
org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
  at 
org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:189)
  ... 64 more


> Spark 2.0.0  CREATE VIEW statement fails when select statement contains limit 
> clause
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>
> The following statement fails in the spark shell:
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> This appears to be a limitation of the CREATE VIEW statement.






[jira] [Created] (SPARK-17984) Add support for numa aware

2016-10-17 Thread quanfuwang (JIRA)
quanfuwang created SPARK-17984:
--

 Summary: Add support for numa aware
 Key: SPARK-17984
 URL: https://issues.apache.org/jira/browse/SPARK-17984
 Project: Spark
  Issue Type: Task
  Components: Deploy, Mesos, YARN
Affects Versions: 2.0.1
 Environment: Cluster Topo: 1 Master + 4 Slaves
CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores)
Memory: 128GB(2 NUMA Nodes)
SW Version: Hadoop-5.7.0 + Spark-2.0.0
Reporter: quanfuwang
 Fix For: 2.0.1


This JIRA aims to add a NUMA-aware feature, which can help improve performance 
by having cores access local memory rather than remote memory.

A patch is being developed; see https://github.com/apache/spark/pull/15524.
The whole task includes 3 subtasks and will be developed iteratively:
NUMA-aware support for the YARN-based deployment mode
NUMA-aware support for the Mesos-based deployment mode
NUMA-aware support for the Standalone-based deployment mode



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets

2016-10-17 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584172#comment-15584172
 ] 

Cody Koeninger commented on SPARK-17147:


Well, are you using compacted topics?

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets
> 
>
> Key: SPARK-17147
> URL: https://issues.apache.org/jira/browse/SPARK-17147
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Robert Conrad
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset 
> will always be just an increment of 1 above the previous offset. 
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
> {{nextOffset = offset + 1}}
> to:
> {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
> {{requestOffset += 1}}
> to:
> {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.
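
A minimal sketch of the workaround described above (illustrative only; the helper 
name is made up and this is not the actual KafkaRDD/CachedKafkaConsumer code): 
derive the next expected offset from the record Kafka actually returned, so the 
gaps left by compaction don't violate the consecutive-offset assumption.

{code}
import org.apache.kafka.clients.consumer.ConsumerRecord

// Instead of assuming nextOffset = previousOffset + 1, advance from the
// offset of the record that was actually returned; compacted topics can
// legitimately skip offsets.
def nextExpectedOffset[K, V](record: ConsumerRecord[K, V]): Long =
  record.offset + 1
{code}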



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets

2016-10-17 Thread Justin Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584169#comment-15584169
 ] 

Justin Miller commented on SPARK-17147:
---

Could this possibly be related to why I'm seeing the following?

{noformat}
16/10/18 02:11:02 WARN TaskSetManager: Lost task 6.0 in stage 2.0 (TID 5823, 
ip-172-20-222-162.int.protectwise.net): java.lang.IllegalStateException: This 
consumer has already been closed.
at 
org.apache.kafka.clients.consumer.KafkaConsumer.ensureNotClosed(KafkaConsumer.java:1417)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1428)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:929)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:73)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
{noformat}

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets
> 
>
> Key: SPARK-17147
> URL: https://issues.apache.org/jira/browse/SPARK-17147
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Robert Conrad
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset 
> will always be just an increment of 1 above the previous offset. 
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
> {{nextOffset = offset + 1}}
> to:
> {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
> {{requestOffset += 1}}
> to:
> {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17504) Spark App Handle from SparkLauncher always returns UNKNOWN app state when used with Mesos in Client Mode

2016-10-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-17504.

Resolution: Duplicate

> Spark App Handle from SparkLauncher always returns UNKNOWN app state when 
> used with Mesos in Client Mode 
> -
>
> Key: SPARK-17504
> URL: https://issues.apache.org/jira/browse/SPARK-17504
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.0
>Reporter: Adam Jakubowski
>Priority: Minor
>
> Spark App Handle returned from Spark Launcher when used with Mesos in Client 
> Mode always returns UNKNOWN app state. Even if I kill the process it won't 
> change to LOST state.
> It works with YARN cluster and Spark Standalone.
> Expected behaviour:
> Spark App Handle .getState() should go through CONNECTED, SUBMITTED, RUNNING, 
> FINISHED states and not yield UNKNOWN every time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14212) Add configuration element for --packages option

2016-10-17 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584142#comment-15584142
 ] 

Marcelo Vanzin commented on SPARK-14212:


SPARK-15760 added the docs to 2.0 only, but I'm pretty sure the options 
already existed in 1.6.
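
For reference, the corresponding configuration key is {{spark.jars.packages}}; a 
minimal spark-defaults.conf sketch (the coordinates below are just the example 
from the report):

{code}
# conf/spark-defaults.conf -- equivalent to passing --packages on the command line
spark.jars.packages  com.databricks:spark-csv_2.10:1.4.0
{code}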

> Add configuration element for --packages option
> ---
>
> Key: SPARK-14212
> URL: https://issues.apache.org/jira/browse/SPARK-14212
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, PySpark
>Affects Versions: 1.6.1
>Reporter: Russell Jurney
>Priority: Trivial
>  Labels: config, starter
>
> I use PySpark with the --packages option, for instance to load support for 
> CSV: 
> pyspark --packages com.databricks:spark-csv_2.10:1.4.0
> I would like to not have to set this every time at the command line, so a 
> corresponding element for --packages in the configuration file 
> spark-defaults.conf would be good to have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4160) Standalone cluster mode does not upload all needed jars to driver node

2016-10-17 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584137#comment-15584137
 ] 

Marcelo Vanzin commented on SPARK-4160:
---

You don't need to ask for permission to work on things.

> Standalone cluster mode does not upload all needed jars to driver node
> --
>
> Key: SPARK-4160
> URL: https://issues.apache.org/jira/browse/SPARK-4160
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Marcelo Vanzin
>
> If you look at the code in {{DriverRunner.scala}}, there is code to download 
> the main application jar from the launcher node. But that's the only jar 
> that's downloaded - if the driver depends on one of the jars or files 
> specified via {{spark-submit --jars  --files }}, it won't be able 
> to run.
> It should be possible to use the same mechanism to distribute the other files 
> to the driver node, even if that's not the most efficient way of doing it. 
> That way, at least, you don't need any external dependencies to be able to 
> distribute the files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17983) Can't filter over mixed case parquet columns of converted Hive tables

2016-10-17 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-17983:
---
Description: 
We should probably revive https://github.com/apache/spark/pull/14750 in order 
to fix this issue and related classes of issues.

The only other alternatives are (1) reconciling on-disk schemas with metastore 
schema at planning time, which seems pretty messy, and (2) fixing all the 
datasources to support case-insensitive matching, which also has issues.

Reproduction:
{code}
  private def setupPartitionedTable(tableName: String, dir: File): Unit = {
spark.range(5).selectExpr("id as normalCol", "id as partCol1", "id as 
partCol2").write
  .partitionBy("partCol1", "partCol2")
  .mode("overwrite")
  .parquet(dir.getAbsolutePath)

spark.sql(s"""
  |create external table $tableName (normalCol long)
  |partitioned by (partCol1 int, partCol2 int)
  |stored as parquet
  |location "${dir.getAbsolutePath}.stripMargin)
spark.sql(s"msck repair table $tableName")
  }

  test("filter by mixed case col") {
withTable("test") {
  withTempDir { dir =>
setupPartitionedTable("test", dir)
val df = spark.sql("select * from test where normalCol = 3")
assert(df.count() == 1)
  }
}
  }
{code}
cc [~cloud_fan]

  was:
We should probably revive https://github.com/apache/spark/pull/14750 in order 
to fix this issue and related classes of issues.

The only other alternatives are (1) reconciling on-disk schemas with metastore 
schema at planning time, which seems pretty messy, and (2) fixing all the 
datasources to support case-insensitive matching, which also has issues.

cc [~cloud_fan]


> Can't filter over mixed case parquet columns of converted Hive tables
> -
>
> Key: SPARK-17983
> URL: https://issues.apache.org/jira/browse/SPARK-17983
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> We should probably revive https://github.com/apache/spark/pull/14750 in order 
> to fix this issue and related classes of issues.
> The only other alternatives are (1) reconciling on-disk schemas with 
> metastore schema at planning time, which seems pretty messy, and (2) fixing 
> all the datasources to support case-insensitive matching, which also has 
> issues.
> Reproduction:
> {code}
>   private def setupPartitionedTable(tableName: String, dir: File): Unit = {
> spark.range(5).selectExpr("id as normalCol", "id as partCol1", "id as 
> partCol2").write
>   .partitionBy("partCol1", "partCol2")
>   .mode("overwrite")
>   .parquet(dir.getAbsolutePath)
> spark.sql(s"""
>   |create external table $tableName (normalCol long)
>   |partitioned by (partCol1 int, partCol2 int)
>   |stored as parquet
>   |location "${dir.getAbsolutePath}.stripMargin)
> spark.sql(s"msck repair table $tableName")
>   }
>   test("filter by mixed case col") {
> withTable("test") {
>   withTempDir { dir =>
> setupPartitionedTable("test", dir)
> val df = spark.sql("select * from test where normalCol = 3")
> assert(df.count() == 1)
>   }
> }
>   }
> {code}
> cc [~cloud_fan]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails when select statement contains limit clause

2016-10-17 Thread Franck Tago (JIRA)
Franck Tago created SPARK-17982:
---

 Summary: Spark 2.0.0  CREATE VIEW statement fails when select 
statement contains limit clause
 Key: SPARK-17982
 URL: https://issues.apache.org/jira/browse/SPARK-17982
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1, 2.0.0
 Environment: spark 2.0.0
Reporter: Franck Tago


The following statement fails in the spark shell.

scala> spark.sql("CREATE VIEW DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where 
(WHERE_ID , WHERE_NAME ) AS SELECT `where`.id,`where`.name FROM DEFAULT.`where` 
limit 2")

scala> spark.sql("CREATE VIEW DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where 
(WHERE_ID , WHERE_NAME ) AS SELECT `where`.id,`where`.name FROM DEFAULT.`where` 
limit 2")
java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
`gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
`gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
`gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
`gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
gen_subquery_1
  at 
org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
  at 
org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
  at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
  at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
  at org.apache.spark.sql.Dataset.(Dataset.scala:186)
  at org.apache.spark.sql.Dataset.(Dataset.scala:167)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)


This appears to be a limitation of the CREATE VIEW statement.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10872) Derby error (XSDB6) when creating new HiveContext after restarting SparkContext

2016-10-17 Thread Angus Gerry (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584052#comment-15584052
 ] 

Angus Gerry commented on SPARK-10872:
-

Hi [~srowen],

I'm chasing down something in our code base at the moment that might be 
tangentially related to this issue. In our tests, we start and stop a new 
{{TestHiveContext}} for each test suite. Our builds recently started failing 
with this stack trace, ultimately caused by an {{IOException}} because "Too 
many open files"
{noformat}
java.lang.IllegalStateException: failed to create a child event loop
at 
io.netty.util.concurrent.MultithreadEventExecutorGroup.(MultithreadEventExecutorGroup.java:68)
at 
io.netty.channel.MultithreadEventLoopGroup.(MultithreadEventLoopGroup.java:49)
at io.netty.channel.nio.NioEventLoopGroup.(NioEventLoopGroup.java:61)
at io.netty.channel.nio.NioEventLoopGroup.(NioEventLoopGroup.java:52)
at org.apache.spark.network.util.NettyUtils.createEventLoop(NettyUtils.java:56)
at 
org.apache.spark.network.client.TransportClientFactory.(TransportClientFactory.java:104)
at 
org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:88)
at 
org.apache.spark.network.netty.NettyBlockTransferService.init(NettyBlockTransferService.scala:63)
at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:177)
at org.apache.spark.SparkContext.(SparkContext.scala:536)
...
Cause: io.netty.channel.ChannelException: failed to open a new selector
at io.netty.channel.nio.NioEventLoop.openSelector(NioEventLoop.java:128)
at io.netty.channel.nio.NioEventLoop.(NioEventLoop.java:120)
at io.netty.channel.nio.NioEventLoopGroup.newChild(NioEventLoopGroup.java:87)
at 
io.netty.util.concurrent.MultithreadEventExecutorGroup.(MultithreadEventExecutorGroup.java:64)
at 
io.netty.channel.MultithreadEventLoopGroup.(MultithreadEventLoopGroup.java:49)
at io.netty.channel.nio.NioEventLoopGroup.(NioEventLoopGroup.java:61)
at io.netty.channel.nio.NioEventLoopGroup.(NioEventLoopGroup.java:52)
at org.apache.spark.network.util.NettyUtils.createEventLoop(NettyUtils.java:56)
at 
org.apache.spark.network.client.TransportClientFactory.(TransportClientFactory.java:104)
at 
org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:88)
...
Cause: java.io.IOException: Too many open files
at sun.nio.ch.IOUtil.makePipe(Native Method)
at sun.nio.ch.EPollSelectorImpl.(EPollSelectorImpl.java:65)
at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36)
at io.netty.channel.nio.NioEventLoop.openSelector(NioEventLoop.java:126)
at io.netty.channel.nio.NioEventLoop.(NioEventLoop.java:120)
at io.netty.channel.nio.NioEventLoopGroup.newChild(NioEventLoopGroup.java:87)
at 
io.netty.util.concurrent.MultithreadEventExecutorGroup.(MultithreadEventExecutorGroup.java:64)
at 
io.netty.channel.MultithreadEventLoopGroup.(MultithreadEventLoopGroup.java:49)
at io.netty.channel.nio.NioEventLoopGroup.(NioEventLoopGroup.java:61)
at io.netty.channel.nio.NioEventLoopGroup.(NioEventLoopGroup.java:52)
{noformat}

Running our test suite locally, and keeping an eye on the jvm process with 
lsof, I can see that the number of open file handles continues to grow larger 
and larger, and over 75% of the paths look something like this: 
{{/tmp/spark-a0ff08e6-ae94-42ad-8a9c-bc43dee0b283/metastore/seg0/c530.dat}}

My initial tracing through the code indicates that even though we're stopping 
the context, it's not closing its connection to the {{executionHive}} object, 
which runs as a derby DB in a tmp directory as above.

This is where my 'tangentially related' comes in - if the context were actually 
closing its derby DB connections, then we mightn't be hitting the issue at all.

FWIW the [programming 
guide|http://spark.apache.org/docs/latest/programming-guide.html#initializing-spark]
 does state the following, which at the very least _implies_ that stopping and 
then subsequently starting a context within one JVM is supported.
{quote}
Only one SparkContext may be active per JVM. You must stop() the active 
SparkContext before creating a new one.
{quote}

Personally I don't much care about said support other than needing it for our 
tests. If [~belevtsoff] doesn't start working on a PR for this, I'll start 
trying to work on a fix for my problems shortly.
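
For context, a minimal sketch of the stop-then-recreate pattern the tests rely on 
(plain SparkContext shown; the TestHiveContext setup is omitted and the config 
values are illustrative):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("restart-demo")
val sc1 = new SparkContext(conf)
sc1.stop()                       // should release all resources, incl. metastore files
val sc2 = new SparkContext(conf) // a second context in the same JVM
sc2.stop()
{code}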

> Derby error (XSDB6) when creating new HiveContext after restarting 
> SparkContext
> ---
>
> Key: SPARK-10872
> URL: https://issues.apache.org/jira/browse/SPARK-10872
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: Dmytro Bielievtsov
>
> Starting from spark 1.4.0 (works well on 1.3.1), the following code fails 
> with "XSDB6: Another instance of Derby may have already booted the 

[jira] [Comment Edited] (SPARK-17950) Match SparseVector behavior with DenseVector

2016-10-17 Thread AbderRahman Sobh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583877#comment-15583877
 ] 

AbderRahman Sobh edited comment on SPARK-17950 at 10/18/16 12:07 AM:
-

Yes, the full array needs to be expanded since the numpy functions potentially 
need to operate on every value in the array. There is room for another 
implementation that instead simply mimics the numpy functions (and their 
handles) and provides smarter implementations for solving means and such when 
using a SparseVector. If that is preferable, I can modify the code to do that 
instead.

Note also that the unpacked array is automatically cleared out after the call.


was (Author: itg-abby):
Yes, the full array needs to be expanded since the numpy functions potentially 
need to operate on every value in the array. There is room for another 
implementation that instead simply mimics the numpy functions (and their 
handles) and provides smarter implementations for solving means and such when 
using a SparseVector. If that is preferable, I can modify the code to do that 
instead.

> Match SparseVector behavior with DenseVector
> 
>
> Key: SPARK-17950
> URL: https://issues.apache.org/jira/browse/SPARK-17950
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 2.0.1
>Reporter: AbderRahman Sobh
>Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Simply added the `__getattr__` to SparseVector that DenseVector has, but 
> calls self.toArray() instead of storing a vector all the time in self.array
> This allows for use of numpy functions on the values of a SparseVector in the 
> same direct way that users interact with DenseVectors.
>  i.e. you can simply call SparseVector.mean() to average the values in the 
> entire vector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17950) Match SparseVector behavior with DenseVector

2016-10-17 Thread AbderRahman Sobh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583877#comment-15583877
 ] 

AbderRahman Sobh edited comment on SPARK-17950 at 10/18/16 12:07 AM:
-

Yes, the full array needs to be expanded since the numpy functions potentially 
need to operate on every value in the array. There is room for another 
implementation that instead simply mimics the numpy functions (and their 
handles) and provides smarter implementations for solving means and such when 
using a SparseVector. If that is preferable, I can modify the code to do that 
instead.


was (Author: itg-abby):
Yes, the full array needs to be expanded since the numpy functions potentially 
need to operate on every value in the array. There is room for another 
implementation that instead simply mimics the numpy functions (and their 
handles) and provides smarter implementations for solving means and such when 
using a SparseVector. If that is preferable, I can modify the code to do that 
instead.

I also just realized that I am not 100% sure if the garbage collection works as 
I am expecting. My assumption was that Python would automatically clean up 
after using the array, but since it is technically inside of the object's magic 
method I cannot tell if it might need another line to explicitly clear the 
array out.

> Match SparseVector behavior with DenseVector
> 
>
> Key: SPARK-17950
> URL: https://issues.apache.org/jira/browse/SPARK-17950
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 2.0.1
>Reporter: AbderRahman Sobh
>Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Simply added the `__getattr__` to SparseVector that DenseVector has, but 
> calls self.toArray() instead of storing a vector all the time in self.array
> This allows for use of numpy functions on the values of a SparseVector in the 
> same direct way that users interact with DenseVectors.
>  i.e. you can simply call SparseVector.mean() to average the values in the 
> entire vector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17950) Match SparseVector behavior with DenseVector

2016-10-17 Thread AbderRahman Sobh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583877#comment-15583877
 ] 

AbderRahman Sobh edited comment on SPARK-17950 at 10/18/16 12:05 AM:
-

Yes, the full array needs to be expanded since the numpy functions potentially 
need to operate on every value in the array. There is room for another 
implementation that instead simply mimics the numpy functions (and their 
handles) and provides smarter implementations for solving means and such when 
using a SparseVector. If that is preferable, I can modify the code to do that 
instead.

I also just realized that I am not 100% sure if the garbage collection works as 
I am expecting. My assumption was that Python would automatically clean up 
after using the array, but since it is technically inside of the object's magic 
method I cannot tell if it might need another line to explicitly clear the 
array out.


was (Author: itg-abby):
Yes, the full array needs to be expanded since the numpy functions potentially 
need to operate on every value in the array. There is room for another 
implementation that instead simply mimics the numpy functions (and their 
handles) and provides smarter implementations for solving means and such when 
using a SparseVector. If that is preferable, I can modify the code to do that 
instead.

I also just realized that I am not 100% sure if the garbage collection works as 
I am expecting. My assumption was that Python would automatically clean up 
after using the array, but since it is technically inside of the object it 
might need another line to explicitly clear the array out?

> Match SparseVector behavior with DenseVector
> 
>
> Key: SPARK-17950
> URL: https://issues.apache.org/jira/browse/SPARK-17950
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 2.0.1
>Reporter: AbderRahman Sobh
>Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Simply added the `__getattr__` to SparseVector that DenseVector has, but 
> calls self.toArray() instead of storing a vector all the time in self.array
> This allows for use of numpy functions on the values of a SparseVector in the 
> same direct way that users interact with DenseVectors.
>  i.e. you can simply call SparseVector.mean() to average the values in the 
> entire vector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17731) Metrics for Structured Streaming

2016-10-17 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-17731:
--
Fix Version/s: 2.0.2

> Metrics for Structured Streaming
> 
>
> Key: SPARK-17731
> URL: https://issues.apache.org/jira/browse/SPARK-17731
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.0.2, 2.1.0
>
>
> Metrics are needed for monitoring structured streaming apps. Here is the 
> design doc for implementing the necessary metrics.
> https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17950) Match SparseVector behavior with DenseVector

2016-10-17 Thread AbderRahman Sobh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583877#comment-15583877
 ] 

AbderRahman Sobh commented on SPARK-17950:
--

Yes, the full array needs to be expanded since the numpy functions potentially 
need to operate on every value in the array. There is room for another 
implementation that instead simply mimics the numpy functions (and their 
handles) and provides smarter implementations for solving means and such when 
using a SparseVector. If that is preferable, I can modify the code to do that 
instead.

I also just realized that I am not 100% sure if the garbage collection works as 
I am expecting. My assumption was that Python would automatically clean up 
after using the array, but since it is technically inside of the object it 
might need another line to explicitly clear the array out?

> Match SparseVector behavior with DenseVector
> 
>
> Key: SPARK-17950
> URL: https://issues.apache.org/jira/browse/SPARK-17950
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 2.0.1
>Reporter: AbderRahman Sobh
>Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Simply added the `__getattr__` to SparseVector that DenseVector has, but 
> calls self.toArray() instead of storing a vector all the time in self.array
> This allows for use of numpy functions on the values of a SparseVector in the 
> same direct way that users interact with DenseVectors.
>  i.e. you can simply call SparseVector.mean() to average the values in the 
> entire vector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17981) Incorrectly Set Nullability to False in FilterExec

2016-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17981:


Assignee: Apache Spark  (was: Xiao Li)

> Incorrectly Set Nullability to False in FilterExec
> --
>
> Key: SPARK-17981
> URL: https://issues.apache.org/jira/browse/SPARK-17981
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Critical
>
> When FilterExec contains isNotNull, which could be inferred and pushed down 
> or user-specified, we convert the nullability of the involved columns if the 
> top-layer expression is null-intolerant. However, this is not always true: if the 
> top-layer expression is not a leaf expression, it can still tolerate nulls 
> when it has a null-tolerant child expression. 
> For example, cast(coalesce(a#5, a#15) as double). Although cast is a 
> null-intolerant expression, coalesce is obviously null-tolerant. 
> When the nullability is wrong, we could generate incorrect results in 
> different cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows

2016-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17957:


Assignee: Apache Spark  (was: Xiao Li)

> Calling outer join and na.fill(0) and then inner join will miss rows
> 
>
> Key: SPARK-17957
> URL: https://issues.apache.org/jira/browse/SPARK-17957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Spark 2.0.1, Mac, Local
>Reporter: Linbo
>Assignee: Apache Spark
>Priority: Critical
>  Labels: correctness
>
> I reported a similar bug two months ago and it was fixed in Spark 2.0.1: 
> https://issues.apache.org/jira/browse/SPARK-17060 But I have found a new bug: when 
> I insert a na.fill(0) call between the outer join and the inner join in the same 
> workflow as SPARK-17060, I get a wrong result.
> {code:title=spark-shell|borderStyle=solid}
> scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b")
> a: org.apache.spark.sql.DataFrame = [a: int, b: int]
> scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c")
> b: org.apache.spark.sql.DataFrame = [a: int, c: int]
> scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0)
> ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> val c = Seq((3, 1)).toDF("a", "d")
> c: org.apache.spark.sql.DataFrame = [a: int, d: int]
> scala> c.show
> +---+---+
> |  a|  d|
> +---+---+
> |  3|  1|
> +---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> +---+---+---+---+
> {code}
> And again, if I use persist, the result is correct. I think the problem is in the 
> join optimizer, similar to this PR: https://github.com/apache/spark/pull/14661
> {code:title=spark-shell|borderStyle=solid}
> scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist
> ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int 
> ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> |  3|  0|  4|  1|
> +---+---+---+---+
> {code}
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows

2016-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583874#comment-15583874
 ] 

Apache Spark commented on SPARK-17957:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/15523

> Calling outer join and na.fill(0) and then inner join will miss rows
> 
>
> Key: SPARK-17957
> URL: https://issues.apache.org/jira/browse/SPARK-17957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Spark 2.0.1, Mac, Local
>Reporter: Linbo
>Assignee: Xiao Li
>Priority: Critical
>  Labels: correctness
>
> I reported a similar bug two months ago and it was fixed in Spark 2.0.1: 
> https://issues.apache.org/jira/browse/SPARK-17060 But I have found a new bug: when 
> I insert a na.fill(0) call between the outer join and the inner join in the same 
> workflow as SPARK-17060, I get a wrong result.
> {code:title=spark-shell|borderStyle=solid}
> scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b")
> a: org.apache.spark.sql.DataFrame = [a: int, b: int]
> scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c")
> b: org.apache.spark.sql.DataFrame = [a: int, c: int]
> scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0)
> ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> val c = Seq((3, 1)).toDF("a", "d")
> c: org.apache.spark.sql.DataFrame = [a: int, d: int]
> scala> c.show
> +---+---+
> |  a|  d|
> +---+---+
> |  3|  1|
> +---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> +---+---+---+---+
> {code}
> And again, if I use persist, the result is correct. I think the problem is in the 
> join optimizer, similar to this PR: https://github.com/apache/spark/pull/14661
> {code:title=spark-shell|borderStyle=solid}
> scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist
> ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int 
> ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> |  3|  0|  4|  1|
> +---+---+---+---+
> {code}
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17368) Scala value classes create encoder problems and break at runtime

2016-10-17 Thread Aris Vlasakakis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583873#comment-15583873
 ] 

Aris Vlasakakis commented on SPARK-17368:
-

That is great, thank you for the help with this.

> Scala value classes create encoder problems and break at runtime
> 
>
> Key: SPARK-17368
> URL: https://issues.apache.org/jira/browse/SPARK-17368
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: JDK 8 on MacOS
> Scala 2.11.8
> Spark 2.0.0
>Reporter: Aris Vlasakakis
>Assignee: Jakob Odersky
> Fix For: 2.1.0
>
>
> Using Scala value classes as the inner type for Datasets breaks in Spark 2.0 
> and 1.6.X.
> This simple Spark 2 application demonstrates that the code will compile, but 
> will break at runtime with the error. The value class is of course 
> *FeatureId*, as it extends AnyVal.
> {noformat}
> Exception in thread "main" java.lang.RuntimeException: Error while encoding: 
> java.lang.RuntimeException: Couldn't find v on int
> assertnotnull(input[0, int, true], top level non-flat input object).v AS v#0
> +- assertnotnull(input[0, int, true], top level non-flat input object).v
>+- assertnotnull(input[0, int, true], top level non-flat input object)
>   +- input[0, int, true]".
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:279)
> at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> {noformat}
> Test code for Spark 2.0.0:
> {noformat}
> import org.apache.spark.sql.{Dataset, SparkSession}
> object BreakSpark {
>   case class FeatureId(v: Int) extends AnyVal
>   def main(args: Array[String]): Unit = {
> val seq = Seq(FeatureId(1), FeatureId(2), FeatureId(3))
> val spark = SparkSession.builder.getOrCreate()
> import spark.implicits._
> spark.sparkContext.setLogLevel("warn")
> val ds: Dataset[FeatureId] = spark.createDataset(seq)
> println(s"BREAK HERE: ${ds.count}")
>   }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows

2016-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17957:


Assignee: Xiao Li  (was: Apache Spark)

> Calling outer join and na.fill(0) and then inner join will miss rows
> 
>
> Key: SPARK-17957
> URL: https://issues.apache.org/jira/browse/SPARK-17957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Spark 2.0.1, Mac, Local
>Reporter: Linbo
>Assignee: Xiao Li
>Priority: Critical
>  Labels: correctness
>
> I reported a similar bug two months ago and it was fixed in Spark 2.0.1: 
> https://issues.apache.org/jira/browse/SPARK-17060 But I have found a new bug: when 
> I insert a na.fill(0) call between the outer join and the inner join in the same 
> workflow as SPARK-17060, I get a wrong result.
> {code:title=spark-shell|borderStyle=solid}
> scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b")
> a: org.apache.spark.sql.DataFrame = [a: int, b: int]
> scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c")
> b: org.apache.spark.sql.DataFrame = [a: int, c: int]
> scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0)
> ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> val c = Seq((3, 1)).toDF("a", "d")
> c: org.apache.spark.sql.DataFrame = [a: int, d: int]
> scala> c.show
> +---+---+
> |  a|  d|
> +---+---+
> |  3|  1|
> +---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> +---+---+---+---+
> {code}
> And again, if I use persist, the result is correct. I think the problem is in the 
> join optimizer, similar to this PR: https://github.com/apache/spark/pull/14661
> {code:title=spark-shell|borderStyle=solid}
> scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist
> ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int 
> ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> |  3|  0|  4|  1|
> +---+---+---+---+
> {code}
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17981) Incorrectly Set Nullability to False in FilterExec

2016-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17981:


Assignee: Xiao Li  (was: Apache Spark)

> Incorrectly Set Nullability to False in FilterExec
> --
>
> Key: SPARK-17981
> URL: https://issues.apache.org/jira/browse/SPARK-17981
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
>
> When FilterExec contains isNotNull, which could be inferred and pushed down 
> or user-specified, we convert the nullability of the involved columns if the 
> top-layer expression is null-intolerant. However, this is not always true: if the 
> top-layer expression is not a leaf expression, it can still tolerate nulls 
> when it has a null-tolerant child expression. 
> For example, cast(coalesce(a#5, a#15) as double). Although cast is a 
> null-intolerant expression, coalesce is obviously null-tolerant. 
> When the nullability is wrong, we could generate incorrect results in 
> different cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17981) Incorrectly Set Nullability to False in FilterExec

2016-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583872#comment-15583872
 ] 

Apache Spark commented on SPARK-17981:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/15523

> Incorrectly Set Nullability to False in FilterExec
> --
>
> Key: SPARK-17981
> URL: https://issues.apache.org/jira/browse/SPARK-17981
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
>
> When FilterExec contains isNotNull, which could be inferred and pushed down 
> or user-specified, we convert the nullability of the involved columns if the 
> top-layer expression is null-intolerant. However, this is not always true: if the 
> top-layer expression is not a leaf expression, it can still tolerate nulls 
> when it has a null-tolerant child expression. 
> For example, cast(coalesce(a#5, a#15) as double). Although cast is a 
> null-intolerant expression, coalesce is obviously null-tolerant. 
> When the nullability is wrong, we could generate incorrect results in 
> different cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17981) Incorrectly Set Nullability to False in FilterExec

2016-10-17 Thread Xiao Li (JIRA)
Xiao Li created SPARK-17981:
---

 Summary: Incorrectly Set Nullability to False in FilterExec
 Key: SPARK-17981
 URL: https://issues.apache.org/jira/browse/SPARK-17981
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1, 2.1.0
Reporter: Xiao Li
Assignee: Xiao Li
Priority: Critical


When FilterExec contains isNotNull, which could be inferred and pushed down or 
user-specified, we convert the nullability of the involved columns if the 
top-layer expression is null-intolerant. However, this is not always true: if the 
top-layer expression is not a leaf expression, it can still tolerate nulls 
when it has a null-tolerant child expression. 

For example, cast(coalesce(a#5, a#15) as double). Although cast is a 
null-intolerant expression, coalesce is obviously null-tolerant. 

When the nullability is wrong, we could generate incorrect results in different 
cases.
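
A small spark-shell illustration of the null-tolerance point (a sketch only, not a 
reproduction of the incorrect-results case):

{code}
import spark.implicits._

// coalesce tolerates a null input, so cast(coalesce(a, b) as double) can be
// non-null even when one of the referenced columns is null.
val df = Seq[(Option[Double], Option[Double])]((Some(1.0), None), (None, Some(2.0)))
  .toDF("a", "b")
df.selectExpr("cast(coalesce(a, b) as double) as c").show()
// Both rows produce a non-null c, even though each row has a null in a or b.
{code}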



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15708) Tasks table in Detailed Stage page shows ip instead of hostname under Executor ID/Host

2016-10-17 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583751#comment-15583751
 ] 

Alex Bozarth commented on SPARK-15708:
--

I'm not sure closing this as cannot reproduce was correct, but I'm not sure how 
it could be fixed either. Due to the nature of those tables they get the host 
string from entirely different places in code. For the task table it's stored 
in {{TaskInfo}} but for the Agg. Metrics tables it's stored in 
{{BlockManagerId}}. The better question is when can these two end up with 
different host strings (IP vs hostname) and why. [~tgraves], is this something 
you would want fixed, or was it just a behavioral oddity?

> Tasks table in Detailed Stage page shows ip instead of hostname under 
> Executor ID/Host
> --
>
> Key: SPARK-15708
> URL: https://issues.apache.org/jira/browse/SPARK-15708
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Priority: Minor
>
> If you go to the detailed Stages page in Spark 2.0, the Tasks table under the 
> Executor ID/Host column shows the host as an IP address rather than a 
> fully qualified hostname.
> The table above it (Aggregated Metrics by Executor) shows the "Address" as 
> the full hostname.
> I'm running spark on yarn on latest branch-2.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17979) Remove deprecated support for config SPARK_YARN_USER_ENV

2016-10-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17979:
--
  Priority: Trivial  (was: Major)
Issue Type: Improvement  (was: Bug)

(Please set fields appropriately.)
There are a number of deprecated env variables that can be removed. Can you 
look through the others and identify a logical set to remove together? It may not 
be all of them, but it is probably more than this one.

> Remove deprecated support for config SPARK_YARN_USER_ENV 
> -
>
> Key: SPARK-17979
> URL: https://issues.apache.org/jira/browse/SPARK-17979
> Project: Spark
>  Issue Type: Improvement
>Reporter: Kishor Patil
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17971) Unix timestamp handling in Spark SQL not allowing calculations on UTC times

2016-10-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583676#comment-15583676
 ] 

Sean Owen commented on SPARK-17971:
---

I'll say that I find the semantics of the Hive QL datetime + timezone functions 
odd, and Spark SQL is just mimicing them. For example, the behavior of 
from_utc_timestamp is already hard to understand because it operates on longs, 
essentially, and these can only really be thought of as absolute time since the 
epoch, not a quantity with a time zone inside that can vary. That is, what's 
the "non-UTC" timestamp that comes out?

So from_utc_timestamp(x, "PST") will return a timestamp whose value is smaller 
by 8 * 3600 * 1000 because PST is GMT-8 (GMT vs UTC issue noted). But what does 
that even mean? it's still a "UTC" timestamp, just an 8-hour earlier one. It's 
the timestamp whose UTC-hour would equal the PST-hour of timestamp x.

hour() et al will answer with respect to the current system timezone, yes. If your 
system is in PST, and you want to know the UTC-hour of a timestamp x, then you 
need a time whose PST-hour matches the UTC-hour of x. That's the reverse. I 
believe you want:

select hour(to_utc_timestamp(cast(1476354405 as timestamp), "PST"))

That works for me. Of course you can programmatically insert 
TimeZone.getDefault.getID instead of "PST". I believe that then works as 
desired everywhere. It has some logic in that it reads as "the hour of a UTC 
timestamp ..." but it's not straightforward IMHO. But, there are tools for this 
and these are those tools

Hive has the same, and so I think this would be considered working as intended.

I looked at MySQL just now and it seems to have similar behaviors, with 
somewhat different methods, FWIW.
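
A minimal spark-shell sketch of that suggestion (the epoch-seconds value is the one 
from the example above):

{code}
// Extract the UTC hour of an epoch-seconds value regardless of the JVM's
// default time zone, by shifting with to_utc_timestamp first.
val tz = java.util.TimeZone.getDefault.getID
spark.sql(
  s"select hour(to_utc_timestamp(cast(1476354405 as timestamp), '$tz')) as utc_hour"
).show()
{code}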

> Unix timestamp handling in Spark SQL not allowing calculations on UTC times
> ---
>
> Key: SPARK-17971
> URL: https://issues.apache.org/jira/browse/SPARK-17971
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2
> Environment: MacOS X JDK 7
>Reporter: Gabriele Del Prete
>
> In our Spark data pipeline we store timed events using a bigint column called 
> 'timestamp', the values contained being Unix timestamp time points.
> Our datacenter servers' Java VMs are all set up to start with the timezone set to 
> UTC, while developers' computers are all in the US Eastern timezone. 
> Given how Spark SQL datetime functions work, it's impossible to do 
> calculations (eg. extract and compare hours, year-month-date triplets) using 
> UTC values:
> - from_unixtime takes a bigint unix timestamp and forces it to the computer's 
> local timezone;
> - casting the bigint column to timestamp does the same (it converts it to the 
> local timezone)
> - from_utc_timestamp works in the same way, the only difference being that it 
> gets a string as input instead of a bigint.
> The result of all of this is that it's impossible to extract individual 
> fields of a UTC timestamp, since all timestamps always get converted to the 
> local timezone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17980) Fix refreshByPath for converted Hive tables

2016-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17980:


Assignee: Apache Spark

> Fix refreshByPath for converted Hive tables
> ---
>
> Key: SPARK-17980
> URL: https://issues.apache.org/jira/browse/SPARK-17980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Assignee: Apache Spark
>Priority: Minor
>
> There is a small bug introduced in https://github.com/apache/spark/pull/14690 
> which broke refreshByPath with converted hive tables (though, it turns out it 
> was very difficult to refresh converted hive tables anyways, since you had to 
> specify the exact path of one of the partitions).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17980) Fix refreshByPath for converted Hive tables

2016-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583652#comment-15583652
 ] 

Apache Spark commented on SPARK-17980:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/15521

> Fix refreshByPath for converted Hive tables
> ---
>
> Key: SPARK-17980
> URL: https://issues.apache.org/jira/browse/SPARK-17980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Minor
>
> There is a small bug introduced in https://github.com/apache/spark/pull/14690 
> which broke refreshByPath with converted hive tables (though, it turns out it 
> was very difficult to refresh converted hive tables anyways, since you had to 
> specify the exact path of one of the partitions).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17980) Fix refreshByPath for converted Hive tables

2016-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17980:


Assignee: (was: Apache Spark)

> Fix refreshByPath for converted Hive tables
> ---
>
> Key: SPARK-17980
> URL: https://issues.apache.org/jira/browse/SPARK-17980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Minor
>
> There is a small bug introduced in https://github.com/apache/spark/pull/14690 
> which broke refreshByPath with converted hive tables (though, it turns out it 
> was very difficult to refresh converted hive tables anyways, since you had to 
> specify the exact path of one of the partitions).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17980) Fix refreshByPath for converted Hive tables

2016-10-17 Thread Eric Liang (JIRA)
Eric Liang created SPARK-17980:
--

 Summary: Fix refreshByPath for converted Hive tables
 Key: SPARK-17980
 URL: https://issues.apache.org/jira/browse/SPARK-17980
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Eric Liang
Priority: Minor


There is a small bug introduced in https://github.com/apache/spark/pull/14690 
which broke refreshByPath with converted hive tables (though, it turns out it 
was very difficult to refresh converted hive tables anyways, since you had to 
specify the exact path of one of the partitions).
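
For context, a minimal sketch of the API involved (the path is illustrative):

{code}
// Invalidate cached metadata and data for any table whose files live under `path`.
// Per the description above, with converted Hive tables this only worked reliably
// when given the exact path of one of the partitions.
val path = "/tmp/warehouse/mytable"   // illustrative location
spark.catalog.refreshByPath(path)
{code}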



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7721) Generate test coverage report from Python

2016-10-17 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583642#comment-15583642
 ] 

Josh Rosen commented on SPARK-7721:
---

IIRC when I looked into this I hit problems with the HTML Publisher Plugin not 
being able to properly publish / serve HTML reports which weren't present on 
the Jenkins master because the underlying files weren't being archived properly 
from the remote build workspaces. From a cursory Google search, it looks like 
other folks have hit similar problems with this: 
https://issues.jenkins-ci.org/browse/JENKINS-6780 
https://issues.jenkins-ci.org/browse/JENKINS-15301

Ideally we could use the Codecov service to aggregate and publish these 
reports. Last month I opened a ticket with Apache Infra to ask about obtaining 
the token which would let us push results to that service, but they haven't 
responded back to my latest comment yet: 
https://issues.apache.org/jira/browse/INFRA-12640

Alternatively, we could write a one-off shell script to archive the reports to a 
public S3 bucket and serve them as static files.

> Generate test coverage report from Python
> -
>
> Key: SPARK-7721
> URL: https://issues.apache.org/jira/browse/SPARK-7721
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Reporter: Reynold Xin
>
> Would be great to have a test coverage report for Python. Compared with Scala, 
> it is trickier to understand the coverage without coverage reports in Python 
> because we employ both docstring tests and unit tests in test files. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python

2016-10-17 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583638#comment-15583638
 ] 

Davies Liu commented on SPARK-10915:


Currently all the aggregate functions are implemented in Scala and execute 
one row at a time. This will not work for Python UDAFs; the per-row overhead 
between the JVM and the Python process would make it super slow.
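
To make the row-at-a-time contract concrete, here is a minimal Scala sketch using the public JVM-side {{UserDefinedAggregateFunction}} API; its {{update}} method is invoked once per input row, which is exactly the boundary that would be expensive to cross from a Python process for every row. The class and column names below are made up for illustration.

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class LongSum extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("sum", LongType) :: Nil)
  def dataType: DataType = LongType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
  }

  // Called once per input row: this is the row-at-a-time execution model.
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)
  }

  // Combines partial aggregates from different partitions.
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
  }

  def evaluate(buffer: Row): Any = buffer.getLong(0)
}
{code}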

> Add support for UDAFs in Python
> ---
>
> Key: SPARK-10915
> URL: https://issues.apache.org/jira/browse/SPARK-10915
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Justin Uang
>
> This should support Python-defined lambdas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17979) Remove deprecated support for config SPARK_YARN_USER_ENV

2016-10-17 Thread Kishor Patil (JIRA)
Kishor Patil created SPARK-17979:


 Summary: Remove deprecated support for config SPARK_YARN_USER_ENV 
 Key: SPARK-17979
 URL: https://issues.apache.org/jira/browse/SPARK-17979
 Project: Spark
  Issue Type: Bug
Reporter: Kishor Patil






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3132) Avoid serialization for Array[Byte] in TorrentBroadcast

2016-10-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-3132.
--
Resolution: Not A Problem

Marking it as not-a-problem for now given Josh's comment.


> Avoid serialization for Array[Byte] in TorrentBroadcast
> ---
>
> Key: SPARK-3132
> URL: https://issues.apache.org/jira/browse/SPARK-3132
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>
> If the input data is a byte array, we should allow TorrentBroadcast to skip 
> serializing and compressing the input.
> To do this, we should add a new parameter (shortCircuitByteArray) to 
> TorrentBroadcast, and then avoid serialization if the input is a byte array 
> and shortCircuitByteArray is true.
> We should then also do compression in task serialization itself instead of 
> doing it in TorrentBroadcast.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3132) Avoid serialization for Array[Byte] in TorrentBroadcast

2016-10-17 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583605#comment-15583605
 ] 

Josh Rosen commented on SPARK-3132:
---

I don't think that this is being actively worked on. I remember doing a POC 
prototype of using a custom {{Serializer}} for byte arrays and found that doing 
that by itself didn't seem to result in huge performance gains, but if we can 
manage to skip JVM-side compression of already-compressed Python arrays then I 
could see that being a reasonable small win.
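
For illustration, a minimal standalone sketch (not Spark's actual TorrentBroadcast internals) of the short-circuit idea described in this ticket: if the payload is already a byte array, skip serialization and just chunk it; otherwise serialize first. Plain Java serialization and the chunk size are stand-ins for whatever the real code path uses.

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

def toByteChunks(obj: Any, chunkSize: Int): Iterator[Array[Byte]] = obj match {
  case bytes: Array[Byte] =>
    // Already an opaque blob (e.g. pickled Python data): just split it into chunks.
    bytes.grouped(chunkSize)
  case other =>
    // Fallback: serialize first (assumes a Serializable payload), then chunk.
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(other)
    oos.close()
    bos.toByteArray.grouped(chunkSize)
}
{code}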

> Avoid serialization for Array[Byte] in TorrentBroadcast
> ---
>
> Key: SPARK-3132
> URL: https://issues.apache.org/jira/browse/SPARK-3132
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>
> If the input data is a byte array, we should allow TorrentBroadcast to skip 
> serializing and compressing the input.
> To do this, we should add a new parameter (shortCircuitByteArray) to 
> TorrentBroadcast, and then avoid serialization if the input is a byte array 
> and shortCircuitByteArray is true.
> We should then also do compression in task serialization itself instead of 
> doing it in TorrentBroadcast.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3132) Avoid serialization for Array[Byte] in TorrentBroadcast

2016-10-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3132:
--
Assignee: (was: Davies Liu)

> Avoid serialization for Array[Byte] in TorrentBroadcast
> ---
>
> Key: SPARK-3132
> URL: https://issues.apache.org/jira/browse/SPARK-3132
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>
> If the input data is a byte array, we should allow TorrentBroadcast to skip 
> serializing and compressing the input.
> To do this, we should add a new parameter (shortCircuitByteArray) to 
> TorrentBroadcast, and then avoid serialization if the input is a byte array 
> and shortCircuitByteArray is true.
> We should then also do compression in task serialization itself instead of 
> doing it in TorrentBroadcast.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4160) Standalone cluster mode does not upload all needed jars to driver node

2016-10-17 Thread Amit Assudani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583531#comment-15583531
 ] 

Amit Assudani commented on SPARK-4160:
--

I can work on fixing this. Let me know.

> Standalone cluster mode does not upload all needed jars to driver node
> --
>
> Key: SPARK-4160
> URL: https://issues.apache.org/jira/browse/SPARK-4160
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Marcelo Vanzin
>
> If you look at the code in {{DriverRunner.scala}}, there is code to download 
> the main application jar from the launcher node. But that's the only jar 
> that's downloaded - if the driver depends on one of the jars or files 
> specified via {{spark-submit --jars  --files }}, it won't be able 
> to run.
> It should be possible to use the same mechanism to distribute the other files 
> to the driver node, even if that's not the most efficient way of doing it. 
> That way, at least, you don't need any external dependencies to be able to 
> distribute the files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17976) Global options to spark-submit should not be position-sensitive

2016-10-17 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas closed SPARK-17976.

Resolution: Not A Problem

Ah, makes perfect sense. Would have realized that myself if I had held off on 
reporting this for just a day or so. Apologies.

> Global options to spark-submit should not be position-sensitive
> ---
>
> Key: SPARK-17976
> URL: https://issues.apache.org/jira/browse/SPARK-17976
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> It is maddening that this does what you expect:
> {code}
> spark-submit --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 \
> file.py 
> {code}
> whereas this doesn't because {{--packages}} is totally ignored:
> {code}
> spark-submit file.py \
> --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11
> {code}
> Ideally, global options should be valid no matter where they are specified.
> If that's too much work, then I think at the very least {{spark-submit}} 
> should display a warning that some input is being ignored. (Ideally, it 
> should error out, but that's probably not possible for 
> backwards-compatibility reasons at this point.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17977) DataFrameReader and DataStreamReader should have an ancestor class

2016-10-17 Thread Amit Assudani (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Assudani updated SPARK-17977:
--
Affects Version/s: 2.0.0

> DataFrameReader and DataStreamReader should have an ancestor class
> --
>
> Key: SPARK-17977
> URL: https://issues.apache.org/jira/browse/SPARK-17977
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Amit Assudani
>Priority: Critical
>
> There should be an ancestor class of DataFrameReader and DataStreamReader to 
> configure common options / format and use common methods. Most of the methods 
> are exactly the same and take exactly the same arguments. This would help in 
> creating utilities / generic code shared between streaming and batch applications. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17978) --jars option in spark-submit does not load jars for driver in spark - standalone mode

2016-10-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-17978.

Resolution: Duplicate

> --jars option in spark-submit does not load jars for driver in spark - 
> standalone mode 
> ---
>
> Key: SPARK-17978
> URL: https://issues.apache.org/jira/browse/SPARK-17978
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Amit Assudani
>
> Additional jars (jar location URLs) provided using the --jars option in 
> spark-submit are not retrieved and loaded in DriverWrapper, making them 
> unavailable for the application driver to find. This is handled for executors. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17978) --jars option in spark-submit does not load jars for driver in spark - standalone mode

2016-10-17 Thread Amit Assudani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583508#comment-15583508
 ] 

Amit Assudani commented on SPARK-17978:
---

I can fix this and send a PR. Let me know.

> --jars option in spark-submit does not load jars for driver in spark - 
> standalone mode 
> ---
>
> Key: SPARK-17978
> URL: https://issues.apache.org/jira/browse/SPARK-17978
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Amit Assudani
>
> Additional jars (jar location URLs) provided using the --jars option in 
> spark-submit are not retrieved and loaded in DriverWrapper, making them 
> unavailable for the application driver to find. This is handled for executors. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17976) Global options to spark-submit should not be position-sensitive

2016-10-17 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583506#comment-15583506
 ] 

Marcelo Vanzin commented on SPARK-17976:


They are not being ignored. They are being passed as arguments to "file.py".

A long time ago it was decided that the "resource" (i.e. the jar file or python 
file) would separate Spark options from application options. This was chosen 
for backwards compatibility; another option would be to use an explicit 
separator (e.g. "\-\-") but that would not be compatible with existing user 
scripts.

So unless you have a suggestion on how to differentiate Spark options from app 
options without the need for an explicit separator, this should probably be 
closed.
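
To make the behavior concrete, a small sketch (hypothetical object and artifact names): anything placed after the application resource is forwarded to the application's own argument list instead of being parsed by spark-submit.

{code}
object ArgsProbe {
  def main(args: Array[String]): Unit = {
    // Invoked as:  spark-submit argsprobe.jar --packages foo:bar:1.0
    // this prints "--packages" and "foo:bar:1.0", because spark-submit stops
    // parsing its own options at the application jar and hands everything
    // after it to the application untouched.
    args.foreach(println)
  }
}
{code}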

> Global options to spark-submit should not be position-sensitive
> ---
>
> Key: SPARK-17976
> URL: https://issues.apache.org/jira/browse/SPARK-17976
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> It is maddening that this does what you expect:
> {code}
> spark-submit --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 \
> file.py 
> {code}
> whereas this doesn't because {{--packages}} is totally ignored:
> {code}
> spark-submit file.py \
> --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11
> {code}
> Ideally, global options should be valid no matter where they are specified.
> If that's too much work, then I think at the very least {{spark-submit}} 
> should display a warning that some input is being ignored. (Ideally, it 
> should error out, but that's probably not possible for 
> backwards-compatibility reasons at this point.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17978) --jars option in spark-submit does not load jars for driver in spark - standalone mode

2016-10-17 Thread Amit Assudani (JIRA)
Amit Assudani created SPARK-17978:
-

 Summary: --jars option in spark-submit does not load jars for 
driver in spark - standalone mode 
 Key: SPARK-17978
 URL: https://issues.apache.org/jira/browse/SPARK-17978
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Spark Submit
Affects Versions: 2.0.1, 2.0.0, 1.6.2, 1.6.1
Reporter: Amit Assudani


Additional jars (jar location URLs) provided using the --jars option in 
spark-submit are not retrieved and loaded in DriverWrapper, making them 
unavailable for the application driver to find. This is handled for executors. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17977) DataFrameReader and DataStreamReader should have an ancestor class

2016-10-17 Thread Amit Assudani (JIRA)
Amit Assudani created SPARK-17977:
-

 Summary: DataFrameReader and DataStreamReader should have an 
ancestor class
 Key: SPARK-17977
 URL: https://issues.apache.org/jira/browse/SPARK-17977
 Project: Spark
  Issue Type: Wish
  Components: SQL
Affects Versions: 2.0.1
Reporter: Amit Assudani
Priority: Critical


There should be an ancestor class of DataFrameReader and DataStreamReader to 
configure common options / format and use common methods. Most of the methods 
are exactly the same and take exactly the same arguments. This would help in 
creating utilities / generic code shared between streaming and batch applications. 
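
A purely hypothetical sketch of the kind of shared ancestor this wish describes (not an existing Spark class; all names are invented for illustration):

{code}
import org.apache.spark.sql.types.StructType

// F is the concrete reader type, so the fluent setters keep their precise return type.
trait CommonReaderLike[F] {
  def format(source: String): F
  def schema(schema: StructType): F
  def option(key: String, value: String): F
  def options(opts: Map[String, String]): F
}
{code}

Utilities could then be written once against such a trait and applied to either a batch or a streaming reader.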



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17976) Global options to spark-submit should not be position-sensitive

2016-10-17 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-17976:


 Summary: Global options to spark-submit should not be 
position-sensitive
 Key: SPARK-17976
 URL: https://issues.apache.org/jira/browse/SPARK-17976
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 2.0.1, 2.0.0
Reporter: Nicholas Chammas
Priority: Minor


It is maddening that this does what you expect:

{code}
spark-submit --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 \
file.py 
{code}

whereas this doesn't because {{--packages}} is totally ignored:

{code}
spark-submit file.py \
--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11
{code}

Ideally, global options should be valid no matter where they are specified.

If that's too much work, then I think at the very least {{spark-submit}} should 
display a warning that some input is being ignored. (Ideally, it should error 
out, but that's probably not possible for backwards-compatibility reasons at 
this point.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2016-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583453#comment-15583453
 ] 

Apache Spark commented on SPARK-13747:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/15520

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> Running the following code may fail
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.
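
As a side note, one possible workaround sketch while a fix lands (assuming a spark-shell style session where {{sc}} and the SQL implicits are in scope): submit the concurrent queries from a dedicated fixed-size thread pool instead of the global ForkJoinPool, so a suspended runJob never shares its thread (and its thread-local properties) with another query.

{code}
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

// A dedicated pool: blocked threads simply block, no other task is run on them.
implicit val queryPool: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

val jobs = (1 to 100).map { _ =>
  Future { sc.parallelize(1 to 5).map(i => (i, i)).toDF("a", "b").count() }
}
Await.result(Future.sequence(jobs), Duration.Inf)
{code}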



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2016-10-17 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583454#comment-15583454
 ] 

Shixiong Zhu commented on SPARK-13747:
--

[~chinwei] Could you test https://github.com/apache/spark/pull/15520 and see if 
the error is gone?

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>
> Running the following code may fail
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2016-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13747:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>
> Running the following code may fail
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2016-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13747:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> Running the following code may fail
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2016-10-17 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reopened SPARK-13747:
--
  Assignee: Shixiong Zhu  (was: Andrew Or)

There are other places that need to be fixed.

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> Running the following code may fail
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17944) sbin/start-* scripts use of `hostname -f` fail with Solaris

2016-10-17 Thread Erik O'Shaughnessy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik O'Shaughnessy updated SPARK-17944:
---
Component/s: Deploy

> sbin/start-* scripts use of `hostname -f` fail with Solaris 
> 
>
> Key: SPARK-17944
> URL: https://issues.apache.org/jira/browse/SPARK-17944
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.1
> Environment: Solaris 10, Solaris 11
>Reporter: Erik O'Shaughnessy
>Priority: Trivial
>
> {{$SPARK_HOME/sbin/start-master.sh}} fails:
> {noformat}
> $ ./start-master.sh 
> usage: hostname [[-t] system_name]
>hostname [-D]
> starting org.apache.spark.deploy.master.Master, logging to 
> /home/eoshaugh/local/spark/logs/spark-eoshaugh-org.apache.spark.deploy.master.Master-1-m7-16-002-ld1.out
> failed to launch org.apache.spark.deploy.master.Master:
> --properties-file FILE Path to a custom Spark properties file.
>Default is conf/spark-defaults.conf.
> full log in 
> /home/eoshaugh/local/spark/logs/spark-eoshaugh-org.apache.spark.deploy.master.Master-1-m7-16-002-ld1.out
> {noformat}
> I found SPARK-17546 which changed the invocation of hostname in 
> sbin/start-master.sh, sbin/start-slaves.sh and sbin/start-mesos-dispatcher.sh 
> to include the flag {{-f}}, which is not a valid command line option for the 
> Solaris hostname implementation. 
> As a workaround, Solaris users can substitute:
> {noformat}
> `/usr/sbin/check-hostname | awk '{print $NF}'`
> {noformat}
> Admittedly not an obvious fix, but it provides equivalent functionality. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15689) Data source API v2

2016-10-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15689:

Description: 
This ticket tracks progress in creating the v2 of data source API. This new API 
should focus on:

1. Have a small surface so it is easy to freeze and maintain compatibility for 
a long time. Ideally, this API should survive architectural rewrites and 
user-facing API revamps of Spark.

2. Have a well-defined column batch interface for high performance. Convenience 
methods should exist to convert row-oriented formats into column batches for 
data source developers.

3. Still support filter push down, similar to the existing API.

4. Nice-to-have: support additional common operators, including limit and 
sampling.


Note that both 1 and 2 are problems that the current data source API (v1) 
suffers. The current data source API has a wide surface with dependency on 
DataFrame/SQLContext, making the data source API compatibility depending on the 
upper level API. The current data source API is also only row oriented and has 
to go through an expensive external data type conversion to internal data type.


  was:
This ticket tracks progress in creating the v2 of data source API. This new API 
should focus on:

1. Have a small surface so it is easy to freeze and maintain compatibility for 
a long time. Ideally, this API should survive architectural rewrites and 
user-facing API revamps of Spark.

2. Have a well-defined column batch interface for high performance. Convenience 
methods should exist to convert row-oriented formats into column batches for 
data source developers.

3. Still support filter push down, similar to the existing API.

4. Support sampling.


Note that both 1 and 2 are problems that the current data source API (v1) 
suffers. The current data source API has a wide surface with dependency on 
DataFrame/SQLContext, making the data source API compatibility depending on the 
upper level API. The current data source API is also only row oriented and has 
to go through an expensive external data type conversion to internal data type.



> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers. The current data source API has a wide surface with dependency on 
> DataFrame/SQLContext, making the data source API compatibility depending on 
> the upper level API. The current data source API is also only row oriented 
> and has to go through an expensive external data type conversion to internal 
> data type.
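
To make the goals above slightly more tangible, a purely illustrative, hypothetical sketch of a tiny reader-side surface with a columnar batch interface and filter push-down (this is not Spark's data source API, v1 or v2; every name below is invented):

{code}
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

// Stand-in for a real column-batch abstraction.
trait ColumnBatch

trait DataSourceReaderSketch {
  // Small, stable surface: schema, push-down, and a columnar scan.
  def readSchema(): StructType
  // Returns the filters the source could NOT handle, so Spark re-applies them.
  def pushFilters(filters: Array[Filter]): Array[Filter]
  def scan(): Iterator[ColumnBatch]
}
{code}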



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15689) Data source API v2

2016-10-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15689:

Description: 
This ticket tracks progress in creating the v2 of data source API. This new API 
should focus on:

1. Have a small surface so it is easy to freeze and maintain compatibility for 
a long time. Ideally, this API should survive architectural rewrites and 
user-facing API revamps of Spark.

2. Have a well-defined column batch interface for high performance. Convenience 
methods should exist to convert row-oriented formats into column batches for 
data source developers.

3. Still support filter push down, similar to the existing API.

4. Support sampling.


Note that both 1 and 2 are problems that the current data source API (v1) 
suffers. The current data source API has a wide surface with dependency on 
DataFrame/SQLContext, making the data source API compatibility depending on the 
upper level API. The current data source API is also only row oriented and has 
to go through an expensive external data type conversion to internal data type.


  was:
This ticket tracks progress in creating the v2 of data source API. This new API 
should focus on:

1. Have a small surface so it is easy to freeze and maintain compatibility for 
a long time. Ideally, this API should survive architectural rewrites and 
user-facing API revamps of Spark.

2. Have a well-defined column batch interface for high performance. Convenience 
methods should exist to convert row-oriented formats into column batches for 
data source developers.

3. Still support filter push down, similar to the existing API.


Note that both 1 and 2 are problems that the current data source API (v1) 
suffers. The current data source API has a wide surface with dependency on 
DataFrame/SQLContext, making the data source API compatibility depending on the 
upper level API. The current data source API is also only row oriented and has 
to go through an expensive external data type conversion to internal data type.



> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Support sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers. The current data source API has a wide surface with dependency on 
> DataFrame/SQLContext, making the data source API compatibility depending on 
> the upper level API. The current data source API is also only row oriented 
> and has to go through an expensive external data type conversion to internal 
> data type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17911) Scheduler does not need messageScheduler for ResubmitFailedStages

2016-10-17 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583334#comment-15583334
 ] 

Mark Hamstra commented on SPARK-17911:
--

I think we're pretty much on the same page when it comes to the net effects of 
just eliminating the RESUBMIT_TIMEOUT delay.  I need to find some time to think 
about what something better than the current delayed-resubmit-event approach 
would look like.

> Scheduler does not need messageScheduler for ResubmitFailedStages
> -
>
> Key: SPARK-17911
> URL: https://issues.apache.org/jira/browse/SPARK-17911
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>
> It's not totally clear what the purpose of the {{messageScheduler}} is in 
> {{DAGScheduler}}.  It can perhaps be eliminated completely; or perhaps we 
> should just clearly document its purpose.
> This comes from a long discussion w/ [~markhamstra] on an unrelated PR here: 
> https://github.com/apache/spark/pull/15335/files/c80ad22a242255cac91cce2c7c537f9b21100f70#diff-6a9ff7fb74fd490a50462d45db2d5e11
> But it's tricky, so breaking it out here for archiving the discussion.
> Note: this issue requires a decision on what to do before a code change, so 
> let's just discuss it on JIRA first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN

2016-10-17 Thread Jeff Stein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Stein updated SPARK-17975:
---
Description: 
I'm able to reproduce the error consistently with a 2000 record text file with 
each record having 1-5 terms and checkpointing enabled. It looks like the 
problem was introduced with the resolution for SPARK-13355.

The EdgeRDD class seems to be lying about its type in a way that causes the 
RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an RDD 
of Edge elements.

{code}
val spark = SparkSession.builder.appName("lda").getOrCreate()
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
val data: RDD[(Long, Vector)] = // snip
data.setName("data").cache()
val lda = new LDA
val optimizer = new EMLDAOptimizer
lda.setOptimizer(optimizer)
  .setK(10)
  .setMaxIterations(400)
  .setAlpha(-1)
  .setBeta(-1)
  .setCheckpointInterval(7)
val ldaModel = lda.run(data)
{code}

{noformat}
16/10/16 23:53:54 WARN TaskSetManager: Lost task 3.0 in stage 348.0 (TID 1225, 
server2.domain): java.lang.ClassCastException: scala.Tuple2 cannot be cast to 
org.apache.spark.graphx.Edge
at 
org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
{noformat}

  was:
I'm able to reproduce the error consistently with a 2000 record text file with 
each record having 1-5 terms and checkpointing enabled. It looks like the 
problem was introduced with the resolution for SPARK-13355.

The EdgeRDD class seems to be lying about its type in a way that causes the 
RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an RDD 
of Edge elements.

{code}
val spark = SparkSession.builder.appName("lda").getOrCreate()
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
val data: RDD[(Long, Vector)] = // snip
data.setName("data").cache()
val lda = new LDA
val optimizer = new EMLDAOptimizer
lda.setOptimizer(optimizer)
  .setK(10)
  .setMaxIterations(400)
  .setAlpha(-1)
  .setBeta(-1)
  .setCheckpointInterval(7)
val ldaModel = lda.run(data)
{code}


> EMLDAOptimizer fails with ClassCastException on YARN

[jira] [Commented] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN

2016-10-17 Thread Jeff Stein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583284#comment-15583284
 ] 

Jeff Stein commented on SPARK-17975:


Another issue that seems to be related to EdgeRDD partition problems.

> EMLDAOptimizer fails with ClassCastException on YARN
> 
>
> Key: SPARK-17975
> URL: https://issues.apache.org/jira/browse/SPARK-17975
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.1
> Environment: Centos 6, CDH 5.7, Java 1.7u80
>Reporter: Jeff Stein
>
> I'm able to reproduce the error consistently with a 2000 record text file 
> with each record having 1-5 terms and checkpointing enabled. It looks like 
> the problem was introduced with the resolution for SPARK-13355.
> The EdgeRDD class seems to be lying about its type in a way that causes the 
> RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an 
> RDD of Edge elements.
> {code}
> val spark = SparkSession.builder.appName("lda").getOrCreate()
> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
> val data: RDD[(Long, Vector)] = // snip
> data.setName("data").cache()
> val lda = new LDA
> val optimizer = new EMLDAOptimizer
> lda.setOptimizer(optimizer)
>   .setK(10)
>   .setMaxIterations(400)
>   .setAlpha(-1)
>   .setBeta(-1)
>   .setCheckpointInterval(7)
> val ldaModel = lda.run(data)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN

2016-10-17 Thread Jeff Stein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583284#comment-15583284
 ] 

Jeff Stein edited comment on SPARK-17975 at 10/17/16 8:04 PM:
--

Adding a link to another issue that seems to be related to EdgeRDD partition 
problems.


was (Author: jvstein):
Another issue that seems to be related to EdgeRDD partition problems.

> EMLDAOptimizer fails with ClassCastException on YARN
> 
>
> Key: SPARK-17975
> URL: https://issues.apache.org/jira/browse/SPARK-17975
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.1
> Environment: Centos 6, CDH 5.7, Java 1.7u80
>Reporter: Jeff Stein
>
> I'm able to reproduce the error consistently with a 2000 record text file 
> with each record having 1-5 terms and checkpointing enabled. It looks like 
> the problem was introduced with the resolution for SPARK-13355.
> The EdgeRDD class seems to be lying about its type in a way that causes the 
> RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an 
> RDD of Edge elements.
> {code}
> val spark = SparkSession.builder.appName("lda").getOrCreate()
> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
> val data: RDD[(Long, Vector)] = // snip
> data.setName("data").cache()
> val lda = new LDA
> val optimizer = new EMLDAOptimizer
> lda.setOptimizer(optimizer)
>   .setK(10)
>   .setMaxIterations(400)
>   .setAlpha(-1)
>   .setBeta(-1)
>   .setCheckpointInterval(7)
> val ldaModel = lda.run(data)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17971) Unix timestamp handling in Spark SQL not allowing calculations on UTC times

2016-10-17 Thread Gabriele Del Prete (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583274#comment-15583274
 ] 

Gabriele Del Prete commented on SPARK-17971:


I already tried that, and I could not make it work. 

from_utc_timestamp can't accept a bigint column as input, only a timestamp 
column, and if I cast my bigint column to timestamp, the returned timestamp is 
shifted into the local node's timezone.

Unix time 1476354405 is ~ 2016-10-13 at *10*:26 UTC

*select hour(from_utc_timestamp(cast(1476354405 as timestamp), "UTC"));*

when run on our servers (set to UTC) returns *10*, but when run on my personal dev 
machine (set to US/Eastern) it returns *6*.

> Unix timestamp handling in Spark SQL not allowing calculations on UTC times
> ---
>
> Key: SPARK-17971
> URL: https://issues.apache.org/jira/browse/SPARK-17971
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2
> Environment: MacOS X JDK 7
>Reporter: Gabriele Del Prete
>
> In our Spark data pipeline we store timed events using a bigint column called 
> 'timestamp', the values contained being Unix timestamp time points.
> Our datacenter servers' Java VMs are all set up to start with the timezone set to 
> UTC, while developers' computers are all in the US Eastern timezone. 
> Given how Spark SQL datetime functions work, it's impossible to do 
> calculations (e.g. extract and compare hours, year-month-date triplets) using 
> UTC values:
> - from_unixtime takes a bigint unix timestamp and forces it to the computer's 
> local timezone;
> - casting the bigint column to timestamp does the same (it converts it to the 
> local timezone)
> - from_utc_timestamp works in the same way, the only difference being that it 
> gets a string as input instead of a bigint.
> The result of all of this is that it's impossible to extract individual 
> fields of a UTC timestamp, since all timestamps always get converted to the 
> local timezone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN

2016-10-17 Thread Jeff Stein (JIRA)
Jeff Stein created SPARK-17975:
--

 Summary: EMLDAOptimizer fails with ClassCastException on YARN
 Key: SPARK-17975
 URL: https://issues.apache.org/jira/browse/SPARK-17975
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.0.1
 Environment: Centos 6, CDH 5.7, Java 1.7u80
Reporter: Jeff Stein


I'm able to reproduce the error consistently with a 2000 record text file with 
each record having 1-5 terms and checkpointing enabled. It looks like the 
problem was introduced with the resolution for SPARK-13355.

The EdgeRDD class seems to be lying about its type in a way that causes the 
RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an RDD 
of Edge elements.

{code}
val spark = SparkSession.builder.appName("lda").getOrCreate()
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
val data: RDD[(Long, Vector)] = // snip
data.setName("data").cache()
val lda = new LDA
val optimizer = new EMLDAOptimizer
lda.setOptimizer(optimizer)
  .setK(10)
  .setMaxIterations(400)
  .setAlpha(-1)
  .setBeta(-1)
  .setCheckpointInterval(7)
val ldaModel = lda.run(data)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python

2016-10-17 Thread Tobi Bosede (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583180#comment-15583180
 ] 

Tobi Bosede commented on SPARK-10915:
-

Thanks Davies. Someone also mentioned collect on the mailing list. I think I 
will use pandas' pivot for now rather than collect and create a UDF. (Hopefully 
I have enough memory.) 
So how are the current (built-in) aggregate functions implemented? They 
are batch, right?

> Add support for UDAFs in Python
> ---
>
> Key: SPARK-10915
> URL: https://issues.apache.org/jira/browse/SPARK-10915
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Justin Uang
>
> This should support Python-defined lambdas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17971) Unix timestamp handling in Spark SQL not allowing calculations on UTC times

2016-10-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583176#comment-15583176
 ] 

Sean Owen commented on SPARK-17971:
---

Oops, I copied the wrong link. I meant:

https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#from_utc_timestamp(org.apache.spark.sql.Column,%20java.lang.String)

A UNIX timestamp defines the same point in time and does not depend on a 
timezone to interpret it. I think we are clear on that, and it isn't the point. 
You just need the methods that don't use the system timezone.
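
For instance, a minimal timezone-independent sketch that extracts UTC fields from the epoch seconds with plain integer arithmetic, sidestepping the JVM timezone entirely (assumes a DataFrame {{events}} with a bigint column named {{timestamp}}; both names are placeholders):

{code}
import org.apache.spark.sql.functions._

val withUtcFields = events
  // 86400 seconds per day, 3600 per hour: hour-of-day in UTC, independent of the JVM timezone.
  .withColumn("utc_hour", ((col("timestamp") % 86400L) / 3600L).cast("int"))
  // Whole days since 1970-01-01, usable as a timezone-free date bucket.
  .withColumn("utc_day", (col("timestamp") / 86400L).cast("long"))
{code}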

> Unix timestamp handling in Spark SQL not allowing calculations on UTC times
> ---
>
> Key: SPARK-17971
> URL: https://issues.apache.org/jira/browse/SPARK-17971
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2
> Environment: MacOS X JDK 7
>Reporter: Gabriele Del Prete
>
> In our Spark data pipeline we store timed events using a bigint column called 
> 'timestamp', the values contained being Unix timestamp time points.
> Our datacenter servers' Java VMs are all set up to start with the timezone set to 
> UTC, while developers' computers are all in the US Eastern timezone. 
> Given how Spark SQL datetime functions work, it's impossible to do 
> calculations (e.g. extract and compare hours, year-month-date triplets) using 
> UTC values:
> - from_unixtime takes a bigint unix timestamp and forces it to the computer's 
> local timezone;
> - casting the bigint column to timestamp does the same (it converts it to the 
> local timezone)
> - from_utc_timestamp works in the same way, the only difference being that it 
> gets a string as input instead of a bigint.
> The result of all of this is that it's impossible to extract individual 
> fields of a UTC timestamp, since all timestamps always get converted to the 
> local timezone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17974) Refactor FileCatalog classes to simplify the inheritance tree

2016-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583170#comment-15583170
 ] 

Apache Spark commented on SPARK-17974:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/15518

> Refactor FileCatalog classes to simplify the inheritance tree
> -
>
> Key: SPARK-17974
> URL: https://issues.apache.org/jira/browse/SPARK-17974
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Minor
>
> This is a follow-up item for https://github.com/apache/spark/pull/14690 which 
> adds support for metastore partition pruning of converted hive tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17974) Refactor FileCatalog classes to simplify the inheritance tree

2016-10-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17974:


Assignee: Apache Spark

> Refactor FileCatalog classes to simplify the inheritance tree
> -
>
> Key: SPARK-17974
> URL: https://issues.apache.org/jira/browse/SPARK-17974
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Assignee: Apache Spark
>Priority: Minor
>
> This is a follow-up item for https://github.com/apache/spark/pull/14690 which 
> adds support for metastore partition pruning of converted hive tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


