[jira] [Assigned] (SPARK-17813) Maximum data per trigger
[ https://issues.apache.org/jira/browse/SPARK-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17813: Assignee: (was: Apache Spark) > Maximum data per trigger > > > Key: SPARK-17813 > URL: https://issues.apache.org/jira/browse/SPARK-17813 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > At any given point in a streaming query execution, we process all available > data. This maximizes throughput at the cost of latency. We should add > something similar to the {{maxFilesPerTrigger}} option available for files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17813) Maximum data per trigger
[ https://issues.apache.org/jira/browse/SPARK-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17813: Assignee: Apache Spark > Maximum data per trigger > > > Key: SPARK-17813 > URL: https://issues.apache.org/jira/browse/SPARK-17813 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Apache Spark > > At any given point in a streaming query execution, we process all available > data. This maximizes throughput at the cost of latency. We should add > something similar to the {{maxFilesPerTrigger}} option available for files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17813) Maximum data per trigger
[ https://issues.apache.org/jira/browse/SPARK-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584515#comment-15584515 ] Apache Spark commented on SPARK-17813: -- User 'koeninger' has created a pull request for this issue: https://github.com/apache/spark/pull/15527 > Maximum data per trigger > > > Key: SPARK-17813 > URL: https://issues.apache.org/jira/browse/SPARK-17813 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > At any given point in a streaming query execution, we process all available > data. This maximizes throughput at the cost of latency. We should add > something similar to the {{maxFilesPerTrigger}} option available for files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
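For context, a minimal sketch of the existing rate limit this ticket wants to generalize (the input path is illustrative, and a Spark 2.x session named {{spark}} is assumed): the file source already lets {{maxFilesPerTrigger}} bound how much data each micro-batch reads, and the request here is an analogous cap for other sources.
{noformat}
// Bound each micro-batch of the file source; the path below is illustrative.
val lines = spark.readStream
  .format("text")
  .option("maxFilesPerTrigger", "1") // read at most one new file per trigger
  .load("/path/to/input")

// Write to the console sink just to observe the per-trigger batches.
val query = lines.writeStream
  .format("console")
  .start()
{noformat}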
[jira] [Assigned] (SPARK-17986) SQLTransformer leaks temporary tables
[ https://issues.apache.org/jira/browse/SPARK-17986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17986: Assignee: Apache Spark > SQLTransformer leaks temporary tables > - > > Key: SPARK-17986 > URL: https://issues.apache.org/jira/browse/SPARK-17986 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.1 >Reporter: Drew Robb >Assignee: Apache Spark >Priority: Minor > > The SQLTransformer creates a temporary table when called, and does not delete > this temporary table. When using a SQLTransformer in a long running Spark > Streaming task, these temporary tables accumulate. > I believe that the fix would be as simple as calling > `dataset.sparkSession.catalog.dropTempView(tableName)` in the last part of > `transform`: > https://github.com/apache/spark/blob/v2.0.1/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala#L65. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17986) SQLTransformer leaks temporary tables
[ https://issues.apache.org/jira/browse/SPARK-17986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17986: Assignee: (was: Apache Spark) > SQLTransformer leaks temporary tables > - > > Key: SPARK-17986 > URL: https://issues.apache.org/jira/browse/SPARK-17986 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.1 >Reporter: Drew Robb >Priority: Minor > > The SQLTransformer creates a temporary table when called, and does not delete > this temporary table. When using a SQLTransformer in a long running Spark > Streaming task, these temporary tables accumulate. > I believe that the fix would be as simple as calling > `dataset.sparkSession.catalog.dropTempView(tableName)` in the last part of > `transform`: > https://github.com/apache/spark/blob/v2.0.1/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala#L65. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17986) SQLTransformer leaks temporary tables
[ https://issues.apache.org/jira/browse/SPARK-17986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584466#comment-15584466 ] Apache Spark commented on SPARK-17986: -- User 'drewrobb' has created a pull request for this issue: https://github.com/apache/spark/pull/15526 > SQLTransformer leaks temporary tables > - > > Key: SPARK-17986 > URL: https://issues.apache.org/jira/browse/SPARK-17986 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.1 >Reporter: Drew Robb >Priority: Minor > > The SQLTransformer creates a temporary table when called, and does not delete > this temporary table. When using a SQLTransformer in a long running Spark > Streaming task, these temporary tables accumulate. > I believe that the fix would be as simple as calling > `dataset.sparkSession.catalog.dropTempView(tableName)` in the last part of > `transform`: > https://github.com/apache/spark/blob/v2.0.1/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala#L65. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17986) SQLTransformer leaks temporary tables
[ https://issues.apache.org/jira/browse/SPARK-17986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Robb updated SPARK-17986: -- Description: The SQLTransformer creates a temporary table when called, and does not delete this temporary table. When using a SQLTransformer in a long running Spark Streaming task, these temporary tables accumulate. I believe that the fix would be as simple as calling `dataset.sparkSession.catalog.dropTempView(tableName)` in the last part of `transform`: https://github.com/apache/spark/blob/v2.0.1/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala#L65. was: The SQLTransformer creates a temporary table when called, and does not delete this temporary table. When using a SQLTransformer in a long running Spark Streaming task, these temporary tables accumulate. I believe that the fix would be as simple as calling `dataset.sparkSession.catalog.dropTempView(tableName)` in the last part of `transform`: https://github.com/apache/spark/blob/v2.0.1/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala#L65. I would be happy to attempt this fix myself if someone could validate this issue. > SQLTransformer leaks temporary tables > - > > Key: SPARK-17986 > URL: https://issues.apache.org/jira/browse/SPARK-17986 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.1 >Reporter: Drew Robb >Priority: Minor > > The SQLTransformer creates a temporary table when called, and does not delete > this temporary table. When using a SQLTransformer in a long running Spark > Streaming task, these temporary tables accumulate. > I believe that the fix would be as simple as calling > `dataset.sparkSession.catalog.dropTempView(tableName)` in the last part of > `transform`: > https://github.com/apache/spark/blob/v2.0.1/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala#L65. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
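A hedged sketch of the cleanup proposed in the description (standalone code, not the SQLTransformer internals or the merged patch; the helper name and the {{__THIS__}} placeholder handling are only approximations): register the temp view, run the SQL, then drop the view so repeated calls in a long-running job do not accumulate tables.
{noformat}
import org.apache.spark.sql.{DataFrame, Dataset}

def transformWithCleanup(dataset: Dataset[_], statement: String): DataFrame = {
  // Unique temp view name per call, mirroring how the transformer isolates inputs.
  val tableName = "sql_transformer_" + java.util.UUID.randomUUID().toString.replace("-", "")
  dataset.createOrReplaceTempView(tableName)
  val result = dataset.sparkSession.sql(statement.replace("__THIS__", tableName))
  // Proposed cleanup: sql() has already analyzed the plan, so dropping the view
  // here does not invalidate the returned DataFrame.
  dataset.sparkSession.catalog.dropTempView(tableName)
  result
}
{noformat}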
[jira] [Closed] (SPARK-17956) ProjectExec has incorrect outputOrdering property
[ https://issues.apache.org/jira/browse/SPARK-17956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh closed SPARK-17956. --- Resolution: Won't Fix > ProjectExec has incorrect outputOrdering property > - > > Key: SPARK-17956 > URL: https://issues.apache.org/jira/browse/SPARK-17956 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > > Currently ProjectExec simply takes child plan's outputOrdering as its > outputOrdering. In some cases, this leads to incorrect outputOrdering. This > applies to TakeOrderedAndProjectExec too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17974) Refactor FileCatalog classes to simplify the inheritance tree
[ https://issues.apache.org/jira/browse/SPARK-17974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17974: Assignee: Apache Spark (was: Eric Liang) > Refactor FileCatalog classes to simplify the inheritance tree > - > > Key: SPARK-17974 > URL: https://issues.apache.org/jira/browse/SPARK-17974 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Assignee: Apache Spark >Priority: Minor > Fix For: 2.1.0 > > > This is a follow-up item for https://github.com/apache/spark/pull/14690 which > adds support for metastore partition pruning of converted hive tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17974) Refactor FileCatalog classes to simplify the inheritance tree
[ https://issues.apache.org/jira/browse/SPARK-17974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17974: Assignee: Eric Liang (was: Apache Spark) > Refactor FileCatalog classes to simplify the inheritance tree > - > > Key: SPARK-17974 > URL: https://issues.apache.org/jira/browse/SPARK-17974 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Minor > Fix For: 2.1.0 > > > This is a follow-up item for https://github.com/apache/spark/pull/14690 which > adds support for metastore partition pruning of converted hive tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-17974) Refactor FileCatalog classes to simplify the inheritance tree
[ https://issues.apache.org/jira/browse/SPARK-17974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reopened SPARK-17974: - Reopening since the previous commit was not tested by Jenkins (failed Scala linter). > Refactor FileCatalog classes to simplify the inheritance tree > - > > Key: SPARK-17974 > URL: https://issues.apache.org/jira/browse/SPARK-17974 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Minor > Fix For: 2.1.0 > > > This is a follow-up item for https://github.com/apache/spark/pull/14690 which > adds support for metastore partition pruning of converted hive tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17862) Feature flag SPARK-16980
[ https://issues.apache.org/jira/browse/SPARK-17862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584374#comment-15584374 ] Reynold Xin commented on SPARK-17862: - cc [~ekhliang] this was done right? Can you put the flag here? > Feature flag SPARK-16980 > > > Key: SPARK-17862 > URL: https://issues.apache.org/jira/browse/SPARK-17862 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17970) store partition spec in metastore for data source table
[ https://issues.apache.org/jira/browse/SPARK-17970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17970: Issue Type: Sub-task (was: New Feature) Parent: SPARK-17861 > store partition spec in metastore for data source table > --- > > Key: SPARK-17970 > URL: https://issues.apache.org/jira/browse/SPARK-17970 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17974) Refactor FileCatalog classes to simplify the inheritance tree
[ https://issues.apache.org/jira/browse/SPARK-17974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17974: Issue Type: Sub-task (was: Improvement) Parent: SPARK-17861 > Refactor FileCatalog classes to simplify the inheritance tree > - > > Key: SPARK-17974 > URL: https://issues.apache.org/jira/browse/SPARK-17974 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Minor > Fix For: 2.1.0 > > > This is a follow-up item for https://github.com/apache/spark/pull/14690 which > adds support for metastore partition pruning of converted hive tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17974) Refactor FileCatalog classes to simplify the inheritance tree
[ https://issues.apache.org/jira/browse/SPARK-17974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-17974. - Resolution: Fixed Assignee: Eric Liang Fix Version/s: 2.1.0 > Refactor FileCatalog classes to simplify the inheritance tree > - > > Key: SPARK-17974 > URL: https://issues.apache.org/jira/browse/SPARK-17974 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Minor > Fix For: 2.1.0 > > > This is a follow-up item for https://github.com/apache/spark/pull/14690 which > adds support for metastore partition pruning of converted hive tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14212) Add configuration element for --packages option
[ https://issues.apache.org/jira/browse/SPARK-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584142#comment-15584142 ] Marcelo Vanzin edited comment on SPARK-14212 at 10/18/16 3:50 AM: -- SPARK-15760 added the docs to 2.0 only, but I'm pretty sure the options were there in 1.6. was (Author: vanzin): SPARK-15760 added the docs to 2.0 only, but I'm pretty sure the options were these in 1.6. > Add configuration element for --packages option > --- > > Key: SPARK-14212 > URL: https://issues.apache.org/jira/browse/SPARK-14212 > Project: Spark > Issue Type: New Feature > Components: Documentation, PySpark >Affects Versions: 1.6.1 >Reporter: Russell Jurney >Priority: Trivial > Labels: config, starter > > I use PySpark with the --packages option, for instance to load support for > CSV: > pyspark --packages com.databricks:spark-csv_2.10:1.4.0 > I would like to not have to set this every time at the command line, so a > corresponding element for --packages in the configuration file > spark-defaults.conf, would be good to have. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17620) hive.default.fileformat=orc does not set OrcSerde
[ https://issues.apache.org/jira/browse/SPARK-17620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-17620. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15495 [https://github.com/apache/spark/pull/15495] > hive.default.fileformat=orc does not set OrcSerde > - > > Key: SPARK-17620 > URL: https://issues.apache.org/jira/browse/SPARK-17620 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Brian Cho >Assignee: Dilip Biswal >Priority: Minor > Fix For: 2.1.0 > > > Setting {{hive.default.fileformat=orc}} does not set OrcSerde. This behavior > is inconsistent with {{STORED AS ORC}}. This means we cannot set a default > behavior for creating tables using orc. > The behavior using stored as: > {noformat} > scala> spark.sql("CREATE TABLE tmp_stored_as(id INT) STORED AS ORC") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("DESC FORMATTED tmp_stored_as").collect.foreach(println) > ... > [# Storage Information,,] > [SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,] > [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,] > [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,] > ... > {noformat} > Behavior setting default conf (SerDe Library is not set properly): > {noformat} > scala> spark.sql("SET hive.default.fileformat=orc") > res2: org.apache.spark.sql.DataFrame = [key: string, value: string] > scala> spark.sql("CREATE TABLE tmp_default(id INT)") > res3: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println) > ... > [# Storage Information,,] > [SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,] > [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,] > [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,] > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17986) SQLTransformer leaks temporary tables
Drew Robb created SPARK-17986: - Summary: SQLTransformer leaks temporary tables Key: SPARK-17986 URL: https://issues.apache.org/jira/browse/SPARK-17986 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.0.1 Reporter: Drew Robb Priority: Minor The SQLTransformer creates a temporary table when called, and does not delete this temporary table. When using a SQLTransformer in a long running Spark Streaming task, these temporary tables accumulate. I believe that the fix would be as simple as calling `dataset.sparkSession.catalog.dropTempView(tableName)` in the last part of `transform`: https://github.com/apache/spark/blob/v2.0.1/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala#L65. I would be happy to attempt this fix myself if someone could validate this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17985) Bump commons-lang3 version to 3.5.
Takuya Ueshin created SPARK-17985: - Summary: Bump commons-lang3 version to 3.5. Key: SPARK-17985 URL: https://issues.apache.org/jira/browse/SPARK-17985 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Takuya Ueshin {{SerializationUtils.clone()}} of commons-lang3 (<3.5) has a bug that breaks thread safety: it sometimes gets stuck due to a race condition when initializing an internal hash map. See https://issues.apache.org/jira/browse/LANG-1251. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17985) Bump commons-lang3 version to 3.5.
[ https://issues.apache.org/jira/browse/SPARK-17985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17985: Assignee: Apache Spark > Bump commons-lang3 version to 3.5. > -- > > Key: SPARK-17985 > URL: https://issues.apache.org/jira/browse/SPARK-17985 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Takuya Ueshin >Assignee: Apache Spark > > {{SerializationUtils.clone()}} of commons-lang3 (<3.5) has a bug that breaks > thread safety: it sometimes gets stuck due to a race condition when > initializing an internal hash map. > See https://issues.apache.org/jira/browse/LANG-1251. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17985) Bump commons-lang3 version to 3.5.
[ https://issues.apache.org/jira/browse/SPARK-17985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17985: Assignee: (was: Apache Spark) > Bump commons-lang3 version to 3.5. > -- > > Key: SPARK-17985 > URL: https://issues.apache.org/jira/browse/SPARK-17985 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Takuya Ueshin > > {{SerializationUtils.clone()}} of commons-lang3 (<3.5) has a bug that breaks > thread safety: it sometimes gets stuck due to a race condition when > initializing an internal hash map. > See https://issues.apache.org/jira/browse/LANG-1251. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17985) Bump commons-lang3 version to 3.5.
[ https://issues.apache.org/jira/browse/SPARK-17985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584263#comment-15584263 ] Apache Spark commented on SPARK-17985: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/15525 > Bump commons-lang3 version to 3.5. > -- > > Key: SPARK-17985 > URL: https://issues.apache.org/jira/browse/SPARK-17985 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Takuya Ueshin > > {{SerializationUtils.clone()}} of commons-lang3 (<3.5) has a bug that breaks > thread safety: it sometimes gets stuck due to a race condition when > initializing an internal hash map. > See https://issues.apache.org/jira/browse/LANG-1251. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
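A hedged illustration of how the LANG-1251 race referenced above can surface (the payload class and thread count are illustrative): before 3.5, concurrent first-time calls to {{SerializationUtils.clone()}} could contend on a lazily initialized internal map and get stuck, which is why the ticket bumps the dependency.
{noformat}
import org.apache.commons.lang3.SerializationUtils

// Simple serializable payload used only for this illustration.
case class Payload(id: Int, name: String)

object CloneRaceDemo {
  def main(args: Array[String]): Unit = {
    val threads = (1 to 8).map { i =>
      new Thread(new Runnable {
        override def run(): Unit = {
          // On commons-lang3 < 3.5 this concurrent first use could hang;
          // 3.5 fixes the initialization race.
          val copy = SerializationUtils.clone(Payload(i, "row-" + i))
          assert(copy.id == i)
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
  }
}
{noformat}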
[jira] [Updated] (SPARK-17984) Add support for numa aware
[ https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] quanfuwang updated SPARK-17984: --- Description: This Jira is target to add support numa aware feature which can help improve performance by making core access local memory rather than remote one. A patch is being developed, see https://github.com/apache/spark/pull/15524. And the whole task includes 3 subtasks and will be developed iteratively: Numa aware support for Yarn based deployment mode Numa aware support for Mesos based deployment mode Numa aware support for Standalone based deployment mode was: This Jira is target to add support numa aware feature which can help improve performance by making core access local memory rather than remote one. A patch is being developed, see https://github.com/apache/spark/pull/15524. And the whole task includes 3 subtask and will be developed iteratively: Numa aware support for Yarn based deployment mode Numa aware support for Mesos based deployment mode Numa aware support for Standalone based deployment mode > Add support for numa aware > -- > > Key: SPARK-17984 > URL: https://issues.apache.org/jira/browse/SPARK-17984 > Project: Spark > Issue Type: New Feature > Components: Deploy, Mesos, YARN >Affects Versions: 2.0.1 > Environment: Cluster Topo: 1 Master + 4 Slaves > CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores) > Memory: 128GB(2 NUMA Nodes) > SW Version: Hadoop-5.7.0 + Spark-2.0.0 >Reporter: quanfuwang > Fix For: 2.0.1 > > Original Estimate: 672h > Remaining Estimate: 672h > > This Jira is target to add support numa aware feature which can help improve > performance by making core access local memory rather than remote one. > A patch is being developed, see https://github.com/apache/spark/pull/15524. > And the whole task includes 3 subtasks and will be developed iteratively: > Numa aware support for Yarn based deployment mode > Numa aware support for Mesos based deployment mode > Numa aware support for Standalone based deployment mode -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17984) Add support for numa aware feature
[ https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] quanfuwang updated SPARK-17984: --- Summary: Add support for numa aware feature (was: Add support for numa aware) > Add support for numa aware feature > -- > > Key: SPARK-17984 > URL: https://issues.apache.org/jira/browse/SPARK-17984 > Project: Spark > Issue Type: New Feature > Components: Deploy, Mesos, YARN >Affects Versions: 2.0.1 > Environment: Cluster Topo: 1 Master + 4 Slaves > CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores) > Memory: 128GB(2 NUMA Nodes) > SW Version: Hadoop-5.7.0 + Spark-2.0.0 >Reporter: quanfuwang > Fix For: 2.0.1 > > Original Estimate: 672h > Remaining Estimate: 672h > > This Jira is target to add support numa aware feature which can help improve > performance by making core access local memory rather than remote one. > A patch is being developed, see https://github.com/apache/spark/pull/15524. > And the whole task includes 3 subtasks and will be developed iteratively: > Numa aware support for Yarn based deployment mode > Numa aware support for Mesos based deployment mode > Numa aware support for Standalone based deployment mode -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5230) Print usage for spark-submit and spark-class in Windows
[ https://issues.apache.org/jira/browse/SPARK-5230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-5230. --- Resolution: Done Pretty sure I implemented this somewhere in the 1.x line. > Print usage for spark-submit and spark-class in Windows > --- > > Key: SPARK-5230 > URL: https://issues.apache.org/jira/browse/SPARK-5230 > Project: Spark > Issue Type: Improvement > Components: Windows >Affects Versions: 1.0.0 >Reporter: Andrew Or >Priority: Minor > > We currently only print the usage in `bin/spark-shell2.cmd`. We should do it > for `bin/spark-submit2.cmd` and `bin/spark-class2.cmd` too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5925) YARN - Spark progress bar stucks at 10% but after finishing shows 100%
[ https://issues.apache.org/jira/browse/SPARK-5925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-5925. --- Resolution: Won't Fix I don't think this can be fixed in Spark at all. There's no way to know beforehand how many jobs or tasks or stages an app will run. Imagine a long running spark-shell where the user is running a lot of small jobs... what's the progress of the overall app? There's just a mismatch between the YARN API and how Spark works. The YARN API makes a lot of sense for MapReduce apps. It doesn't make sense for Spark. Unless Spark exposes its own API for applications to report progress and proxy that information to YARN, but I don't see that happening. > YARN - Spark progress bar stucks at 10% but after finishing shows 100% > -- > > Key: SPARK-5925 > URL: https://issues.apache.org/jira/browse/SPARK-5925 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.2.1 >Reporter: Laszlo Fesus >Priority: Minor > > I did set up a yarn cluster (CDH5) and spark (1.2.1), and also started Spark > History Server. Now I am able to click on more details on yarn's web > interface and get redirected to the appropriate spark logs during both job > execution and also after the job has finished. > My only concern is that while a spark job is being executed (either > yarn-client or yarn-cluster), the progress bar stucks at 10% and doesn't > increase as for MapReduce jobs. After finishing, it shows 100% properly, but > we are loosing the real-time tracking capability of the status bar. > Also tested yarn restful web interface, and it retrieves again 10% during > (yarn) spark job execution, and works well again after finishing. (I suppose > for the while being I should have a look on Spark Job Server and see if it's > possible to track the job via its restful web interface.) > Did anyone else experience this behaviour? Thanks in advance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6108) No application number limit in spark history server
[ https://issues.apache.org/jira/browse/SPARK-6108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-6108. --- Resolution: Won't Fix There are many ways currently to control how many applications are kept around; the SHS can even clean up old logs. HDFS overhead is less of a problem since we started using a single file for event logs. > No application number limit in spark history server > --- > > Key: SPARK-6108 > URL: https://issues.apache.org/jira/browse/SPARK-6108 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.2.1 >Reporter: Xia Hu >Priority: Minor > > There isn't a limit for the application number in spark history server. The > only limit I found is "spark.history.retainedApplications", but this one only > controls how many apps could be stored in memory. > But I think a history application number limit is needed, for if it's number > is too big, it can be inconvenient for both HDFS and history server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7882) HBase Input Format Example does not allow passing ZK parent node
[ https://issues.apache.org/jira/browse/SPARK-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-7882. --- Resolution: Not A Problem HBase examples are not included anymore. > HBase Input Format Example does not allow passing ZK parent node > > > Key: SPARK-7882 > URL: https://issues.apache.org/jira/browse/SPARK-7882 > Project: Spark > Issue Type: Bug > Components: Examples >Reporter: Ram Sriharsha >Assignee: Ram Sriharsha >Priority: Minor > > HBase Input Format example here: > https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_inputformat.py#L52 > precludes passing a fourth parameter (zk.node.parent) even though down the > line there is code checking for a possible fourth parameter and interpreting > it as zk.node.parent here : > https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_inputformat.py#L71 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17984) Add support for numa aware
[ https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] quanfuwang updated SPARK-17984: --- Issue Type: New Feature (was: Task) > Add support for numa aware > -- > > Key: SPARK-17984 > URL: https://issues.apache.org/jira/browse/SPARK-17984 > Project: Spark > Issue Type: New Feature > Components: Deploy, Mesos, YARN >Affects Versions: 2.0.1 > Environment: Cluster Topo: 1 Master + 4 Slaves > CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores) > Memory: 128GB(2 NUMA Nodes) > SW Version: Hadoop-5.7.0 + Spark-2.0.0 >Reporter: quanfuwang > Fix For: 2.0.1 > > Original Estimate: 672h > Remaining Estimate: 672h > > This Jira is target to add support numa aware feature which can help improve > performance by making core access local memory rather than remote one. > A patch is being developed, see https://github.com/apache/spark/pull/15524. > And the whole task includes 3 subtask and will be developed iteratively: > Numa aware support for Yarn based deployment mode > Numa aware support for Mesos based deployment mode > Numa aware support for Standalone based deployment mode -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17984) Add support for numa aware
[ https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] quanfuwang updated SPARK-17984: --- Shepherd: (was: quanfuwang) > Add support for numa aware > -- > > Key: SPARK-17984 > URL: https://issues.apache.org/jira/browse/SPARK-17984 > Project: Spark > Issue Type: Task > Components: Deploy, Mesos, YARN >Affects Versions: 2.0.1 > Environment: Cluster Topo: 1 Master + 4 Slaves > CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores) > Memory: 128GB(2 NUMA Nodes) > SW Version: Hadoop-5.7.0 + Spark-2.0.0 >Reporter: quanfuwang > Fix For: 2.0.1 > > Original Estimate: 672h > Remaining Estimate: 672h > > This Jira is target to add support numa aware feature which can help improve > performance by making core access local memory rather than remote one. > A patch is being developed, see https://github.com/apache/spark/pull/15524. > And the whole task includes 3 subtask and will be developed iteratively: > Numa aware support for Yarn based deployment mode > Numa aware support for Mesos based deployment mode > Numa aware support for Standalone based deployment mode -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8122) ParquetRelation.enableLogForwarding() may fail to configure loggers
[ https://issues.apache.org/jira/browse/SPARK-8122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-8122. --- Resolution: Won't Fix This code doesn't exist anymore in 2.x at least, so I'll assume this won't be fixed in old maintenance releases. > ParquetRelation.enableLogForwarding() may fail to configure loggers > --- > > Key: SPARK-8122 > URL: https://issues.apache.org/jira/browse/SPARK-8122 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Konstantin Shaposhnikov >Priority: Minor > > _enableLogForwarding()_ doesn't hold to the created loggers that can be > garbage collected and all configuration changes will be gone. From > https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html > javadocs: _It is important to note that the Logger returned by one of the > getLogger factory methods may be garbage collected at any time if a strong > reference to the Logger is not kept._ > All created logger references need to be kept, e.g. in static variables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
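A hedged sketch of the pattern the description asks for (the logger names are illustrative, not the ones the Parquet integration actually configures): keep strong references to any {{java.util.logging}} loggers you configure, for example in a singleton, so the JVM cannot garbage-collect them and silently drop the configuration.
{noformat}
import java.util.logging.{Level, Logger}

object ParquetLogConfig {
  // Holding the loggers in a field keeps them strongly reachable, so the level
  // set below is not lost if the loggers would otherwise be garbage collected.
  private val retainedLoggers: Seq[Logger] =
    Seq(Logger.getLogger("parquet"), Logger.getLogger("org.apache.parquet"))

  def init(): Unit = retainedLoggers.foreach(_.setLevel(Level.INFO))
}
{noformat}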
[jira] [Commented] (SPARK-17984) Add support for numa aware
[ https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584206#comment-15584206 ] Apache Spark commented on SPARK-17984: -- User 'quanfuw' has created a pull request for this issue: https://github.com/apache/spark/pull/15524 > Add support for numa aware > -- > > Key: SPARK-17984 > URL: https://issues.apache.org/jira/browse/SPARK-17984 > Project: Spark > Issue Type: Task > Components: Deploy, Mesos, YARN >Affects Versions: 2.0.1 > Environment: Cluster Topo: 1 Master + 4 Slaves > CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores) > Memory: 128GB(2 NUMA Nodes) > SW Version: Hadoop-5.7.0 + Spark-2.0.0 >Reporter: quanfuwang > Fix For: 2.0.1 > > Original Estimate: 672h > Remaining Estimate: 672h > > This Jira is target to add support numa aware feature which can help improve > performance by making core access local memory rather than remote one. > A patch is being developed, see https://github.com/apache/spark/pull/15524. > And the whole task includes 3 subtask and will be developed iteratively: > Numa aware support for Yarn based deployment mode > Numa aware support for Mesos based deployment mode > Numa aware support for Standalone based deployment mode -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17984) Add support for numa aware
[ https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17984: Assignee: (was: Apache Spark) > Add support for numa aware > -- > > Key: SPARK-17984 > URL: https://issues.apache.org/jira/browse/SPARK-17984 > Project: Spark > Issue Type: Task > Components: Deploy, Mesos, YARN >Affects Versions: 2.0.1 > Environment: Cluster Topo: 1 Master + 4 Slaves > CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores) > Memory: 128GB(2 NUMA Nodes) > SW Version: Hadoop-5.7.0 + Spark-2.0.0 >Reporter: quanfuwang > Fix For: 2.0.1 > > Original Estimate: 672h > Remaining Estimate: 672h > > This Jira is target to add support numa aware feature which can help improve > performance by making core access local memory rather than remote one. > A patch is being developed, see https://github.com/apache/spark/pull/15524. > And the whole task includes 3 subtask and will be developed iteratively: > Numa aware support for Yarn based deployment mode > Numa aware support for Mesos based deployment mode > Numa aware support for Standalone based deployment mode -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17984) Add support for numa aware
[ https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17984: Assignee: Apache Spark > Add support for numa aware > -- > > Key: SPARK-17984 > URL: https://issues.apache.org/jira/browse/SPARK-17984 > Project: Spark > Issue Type: Task > Components: Deploy, Mesos, YARN >Affects Versions: 2.0.1 > Environment: Cluster Topo: 1 Master + 4 Slaves > CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores) > Memory: 128GB(2 NUMA Nodes) > SW Version: Hadoop-5.7.0 + Spark-2.0.0 >Reporter: quanfuwang >Assignee: Apache Spark > Fix For: 2.0.1 > > Original Estimate: 672h > Remaining Estimate: 672h > > This Jira is target to add support numa aware feature which can help improve > performance by making core access local memory rather than remote one. > A patch is being developed, see https://github.com/apache/spark/pull/15524. > And the whole task includes 3 subtask and will be developed iteratively: > Numa aware support for Yarn based deployment mode > Numa aware support for Mesos based deployment mode > Numa aware support for Standalone based deployment mode -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12280) "--packages" command doesn't work in "spark-submit"
[ https://issues.apache.org/jira/browse/SPARK-12280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-12280. Resolution: Cannot Reproduce Please reopen with more info if you're still running into issues. Lots of people use this command line option and haven't run into problems. > "--packages" command doesn't work in "spark-submit" > --- > > Key: SPARK-12280 > URL: https://issues.apache.org/jira/browse/SPARK-12280 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Reporter: Anton Loss >Priority: Minor > > when running "spark-shell", then "--packages" option works as expected, but > with "spark-submit" it produces following stacktrace > 15/12/11 17:05:48 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 15/12/11 17:05:51 WARN Client: Resource > file:/home/anton/data-tools-1.0-SNAPSHOT-jar-with-dependencies.jar added > multiple times to distributed cache. > Exception in thread "main" java.io.FileNotFoundException: Requested file > maprfs:///home/mapr/.ivy2/jars/com.databricks_spark-csv_2.11-1.3.0.jar does > not exist. > at > com.mapr.fs.MapRFileSystem.getMapRFileStatus(MapRFileSystem.java:1332) > at com.mapr.fs.MapRFileSystem.getFileStatus(MapRFileSystem.java:942) > at com.mapr.fs.MFS.getFileStatus(MFS.java:151) > at > org.apache.hadoop.fs.AbstractFileSystem.resolvePath(AbstractFileSystem.java:467) > at org.apache.hadoop.fs.FileContext$25.next(FileContext.java:2193) > at org.apache.hadoop.fs.FileContext$25.next(FileContext.java:2189) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.resolve(FileContext.java:2189) > at org.apache.hadoop.fs.FileContext.resolvePath(FileContext.java:601) > at > org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:242) > at > org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$1.apply(Client.scala:366) > at > org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$1.apply(Client.scala:360) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6.apply(Client.scala:360) > at > org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6.apply(Client.scala:358) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:358) > at > org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115) > at org.apache.spark.deploy.yarn.Client.run(Client.scala:842) > at org.apache.spark.deploy.yarn.Client$.main(Client.scala:881) > at org.apache.spark.deploy.yarn.Client.main(Client.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193) > at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > it seems it's looking in the wrong place, as jar is clearly present here > file:///home/mapr/.ivy2/jars/com.databricks_spark-csv_2.11-1.3.0.jar -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17984) Add support for numa aware
[ https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] quanfuwang updated SPARK-17984: --- Description: This Jira is target to add support numa aware feature which can help improve performance by making core access local memory rather than remote one. A patch is being developed, see https://github.com/apache/spark/pull/15524. And the whole task includes 3 subtask and will be developed iteratively: Numa aware support for Yarn based deployment mode Numa aware support for Mesos based deployment mode Numa aware support for Standalone based deployment mode was: This Jira is target to add support numa aware feature which can help improve performance by making core access local memory rather the remote one. A patch is being developed, see https://github.com/apache/spark/pull/15524. And the whole task includes 3 subtask and will be developed iteratively: Numa aware support for Yarn based deployment mode Numa aware support for Mesos based deployment mode Numa aware support for Standalone based deployment mode > Add support for numa aware > -- > > Key: SPARK-17984 > URL: https://issues.apache.org/jira/browse/SPARK-17984 > Project: Spark > Issue Type: Task > Components: Deploy, Mesos, YARN >Affects Versions: 2.0.1 > Environment: Cluster Topo: 1 Master + 4 Slaves > CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores) > Memory: 128GB(2 NUMA Nodes) > SW Version: Hadoop-5.7.0 + Spark-2.0.0 >Reporter: quanfuwang > Fix For: 2.0.1 > > Original Estimate: 672h > Remaining Estimate: 672h > > This Jira is target to add support numa aware feature which can help improve > performance by making core access local memory rather than remote one. > A patch is being developed, see https://github.com/apache/spark/pull/15524. > And the whole task includes 3 subtask and will be developed iteratively: > Numa aware support for Yarn based deployment mode > Numa aware support for Mesos based deployment mode > Numa aware support for Standalone based deployment mode -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17984) Add support for numa aware
[ https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] quanfuwang updated SPARK-17984: --- Description: This Jira is target to add support numa aware feature which can help improve performance by making core access local memory rather the remote one. A patch is being developed, see https://github.com/apache/spark/pull/15524. And the whole task includes 3 subtask and will be developed iteratively: Numa aware support for Yarn based deployment mode Numa aware support for Mesos based deployment mode Numa aware support for Standalone based deployment mode was: This Jira is target to add support numa aware feature which make can help improve performance by making core access local memory rather the remote one. A patch is being developed, see https://github.com/apache/spark/pull/15524. And the whole task includes 3 subtask and will be developed iteratively: Numa aware support for Yarn based deployment mode Numa aware support for Mesos based deployment mode Numa aware support for Standalone based deployment mode > Add support for numa aware > -- > > Key: SPARK-17984 > URL: https://issues.apache.org/jira/browse/SPARK-17984 > Project: Spark > Issue Type: Task > Components: Deploy, Mesos, YARN >Affects Versions: 2.0.1 > Environment: Cluster Topo: 1 Master + 4 Slaves > CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores) > Memory: 128GB(2 NUMA Nodes) > SW Version: Hadoop-5.7.0 + Spark-2.0.0 >Reporter: quanfuwang > Fix For: 2.0.1 > > Original Estimate: 672h > Remaining Estimate: 672h > > This Jira is target to add support numa aware feature which can help improve > performance by making core access local memory rather the remote one. > A patch is being developed, see https://github.com/apache/spark/pull/15524. > And the whole task includes 3 subtask and will be developed iteratively: > Numa aware support for Yarn based deployment mode > Numa aware support for Mesos based deployment mode > Numa aware support for Standalone based deployment mode -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails when select statement contains limit clause
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584178#comment-15584178 ] Franck Tago commented on SPARK-17982: - == SQL == SELECT `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS gen_subquery_1 ^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582) at org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:189) ... 64 more > Spark 2.0.0 CREATE VIEW statement fails when select statement contains limit > clause > > > Key: SPARK-17982 > URL: https://issues.apache.org/jira/browse/SPARK-17982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark 2.0.0 >Reporter: Franck Tago > > The following statement fails in the spark shell . > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT > `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT > `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT > `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS > `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS > gen_subquery_1 > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at org.apache.spark.sql.Dataset.(Dataset.scala:186) > at org.apache.spark.sql.Dataset.(Dataset.scala:167) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) > This appears to be a limitation of the create view statement . 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17984) Add support for numa aware
quanfuwang created SPARK-17984: -- Summary: Add support for numa aware Key: SPARK-17984 URL: https://issues.apache.org/jira/browse/SPARK-17984 Project: Spark Issue Type: Task Components: Deploy, Mesos, YARN Affects Versions: 2.0.1 Environment: Cluster Topo: 1 Master + 4 Slaves CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores) Memory: 128GB(2 NUMA Nodes) SW Version: Hadoop-5.7.0 + Spark-2.0.0 Reporter: quanfuwang Fix For: 2.0.1 This Jira targets adding support for a NUMA-aware feature, which can help improve performance by making cores access local memory rather than remote memory. A patch is being developed, see https://github.com/apache/spark/pull/15524. The whole task includes 3 subtasks and will be developed iteratively: Numa aware support for Yarn based deployment mode Numa aware support for Mesos based deployment mode Numa aware support for Standalone based deployment mode -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584172#comment-15584172 ] Cody Koeninger commented on SPARK-17147: Well, are you using compacted topics? > Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets > > > Key: SPARK-17147 > URL: https://issues.apache.org/jira/browse/SPARK-17147 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Robert Conrad > > When Kafka does log compaction offsets often end up with gaps, meaning the > next requested offset will be frequently not be offset+1. The logic in > KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset > will always be just an increment of 1 above the previous offset. > I have worked around this problem by changing CachedKafkaConsumer to use the > returned record's offset, from: > {{nextOffset = offset + 1}} > to: > {{nextOffset = record.offset + 1}} > and changed KafkaRDD from: > {{requestOffset += 1}} > to: > {{requestOffset = r.offset() + 1}} > (I also had to change some assert logic in CachedKafkaConsumer). > There's a strong possibility that I have misconstrued how to use the > streaming kafka consumer, and I'm happy to close this out if that's the case. > If, however, it is supposed to support non-consecutive offsets (e.g. due to > log compaction) I am also happy to contribute a PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584169#comment-15584169 ] Justin Miller commented on SPARK-17147: --- Could this possibly be related to why I'm seeing the following? 16/10/18 02:11:02 WARN TaskSetManager: Lost task 6.0 in stage 2.0 (TID 5823, ip-172-20-222-162.int.protectwise.net): java.lang.IllegalStateException: This consumer has already been closed. at org.apache.kafka.clients.consumer.KafkaConsumer.ensureNotClosed(KafkaConsumer.java:1417) at org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1428) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:929) at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99) at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:73) at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets > > > Key: SPARK-17147 > URL: https://issues.apache.org/jira/browse/SPARK-17147 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Robert Conrad > > When Kafka does log compaction offsets often end up with gaps, meaning the > next requested offset will be frequently not be offset+1. The logic in > KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset > will always be just an increment of 1 above the previous offset. > I have worked around this problem by changing CachedKafkaConsumer to use the > returned record's offset, from: > {{nextOffset = offset + 1}} > to: > {{nextOffset = record.offset + 1}} > and changed KafkaRDD from: > {{requestOffset += 1}} > to: > {{requestOffset = r.offset() + 1}} > (I also had to change some assert logic in CachedKafkaConsumer). > There's a strong possibility that I have misconstrued how to use the > streaming kafka consumer, and I'm happy to close this out if that's the case. > If, however, it is supposed to support non-consecutive offsets (e.g. due to > log compaction) I am also happy to contribute a PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
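A hedged sketch of the workaround the issue describes (standalone code, not the actual KafkaRDD/CachedKafkaConsumer sources): with compacted topics the broker can return records whose offsets have gaps, so the next offset to request should be derived from the last record actually seen rather than from {{previousOffset + 1}}.
{noformat}
import org.apache.kafka.clients.consumer.ConsumerRecord

// Track the next offset to request from the records actually returned; with log
// compaction, record.offset can jump past the naive previousOffset + 1 value.
def nextOffsetAfter(records: Iterable[ConsumerRecord[Array[Byte], Array[Byte]]],
                    startOffset: Long): Long = {
  var nextOffset = startOffset
  for (record <- records) {
    nextOffset = record.offset + 1
  }
  nextOffset
}
{noformat}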
[jira] [Resolved] (SPARK-17504) Spark App Handle from SparkLauncher always returns UNKNOWN app state when used with Mesos in Client Mode
[ https://issues.apache.org/jira/browse/SPARK-17504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-17504. Resolution: Duplicate > Spark App Handle from SparkLauncher always returns UNKNOWN app state when > used with Mesos in Client Mode > - > > Key: SPARK-17504 > URL: https://issues.apache.org/jira/browse/SPARK-17504 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.0.0 >Reporter: Adam Jakubowski >Priority: Minor > > Spark App Handle returned from Spark Launcher when used with Mesos in Client > Mode always returns UNKNOWN app state. Even if I kill the process it won't > change to LOST state. > It works with YARN cluster and Spark Standalone. > Expected behaviour: > Spark App Handle .getState() should go through CONNECTED, SUBMITTED, RUNNING, > FINISHED states and not yield UNKNOWN every time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14212) Add configuration element for --packages option
[ https://issues.apache.org/jira/browse/SPARK-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584142#comment-15584142 ] Marcelo Vanzin commented on SPARK-14212: SPARK-15760 added the docs to 2.0 only, but I'm pretty sure the options were these in 1.6. > Add configuration element for --packages option > --- > > Key: SPARK-14212 > URL: https://issues.apache.org/jira/browse/SPARK-14212 > Project: Spark > Issue Type: New Feature > Components: Documentation, PySpark >Affects Versions: 1.6.1 >Reporter: Russell Jurney >Priority: Trivial > Labels: config, starter > > I use PySpark with the --packages option, for instance to load support for > CSV: > pyspark --packages com.databricks:spark-csv_2.10:1.4.0 > I would like to not have to set this every time at the command line, so a > corresponding element for --packages in the configuration file > spark-defaults.conf, would be good to have. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
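For reference, the configuration property corresponding to {{--packages}} is {{spark.jars.packages}} (one of the options whose documentation SPARK-15760 added). A sketch of what the request amounts to, set once in {{conf/spark-defaults.conf}} instead of on every command line, reusing the coordinate from the description:
{code}
# conf/spark-defaults.conf
spark.jars.packages  com.databricks:spark-csv_2.10:1.4.0
{code}
With that in place, a plain {{pyspark}} invocation should resolve the package without the {{--packages}} flag.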
[jira] [Commented] (SPARK-4160) Standalone cluster mode does not upload all needed jars to driver node
[ https://issues.apache.org/jira/browse/SPARK-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584137#comment-15584137 ] Marcelo Vanzin commented on SPARK-4160: --- You don't need to ask for permission to work on things. > Standalone cluster mode does not upload all needed jars to driver node > -- > > Key: SPARK-4160 > URL: https://issues.apache.org/jira/browse/SPARK-4160 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Marcelo Vanzin > > If you look at the code in {{DriverRunner.scala}}, there is code to download > the main application jar from the launcher node. But that's the only jar > that's downloaded - if the driver depends on one of the jars or files > specified via {{spark-submit --jars --files }}, it won't be able > to run. > It should be possible to use the same mechanism to distribute the other files > to the driver node, even if that's not the most efficient way of doing it. > That way, at least, you don't need any external dependencies to be able to > distribute the files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17983) Can't filter over mixed case parquet columns of converted Hive tables
[ https://issues.apache.org/jira/browse/SPARK-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-17983: --- Description: We should probably revive https://github.com/apache/spark/pull/14750 in order to fix this issue and related classes of issues. The only other alternatives are (1) reconciling on-disk schemas with metastore schema at planning time, which seems pretty messy, and (2) fixing all the datasources to support case-insensitive matching, which also has issues. Reproduction: {code} private def setupPartitionedTable(tableName: String, dir: File): Unit = { spark.range(5).selectExpr("id as normalCol", "id as partCol1", "id as partCol2").write .partitionBy("partCol1", "partCol2") .mode("overwrite") .parquet(dir.getAbsolutePath) spark.sql(s""" |create external table $tableName (normalCol long) |partitioned by (partCol1 int, partCol2 int) |stored as parquet |location "${dir.getAbsolutePath}.stripMargin) spark.sql(s"msck repair table $tableName") } test("filter by mixed case col") { withTable("test") { withTempDir { dir => setupPartitionedTable("test", dir) val df = spark.sql("select * from test where normalCol = 3") assert(df.count() == 1) } } } {code} cc [~cloud_fan] was: We should probably revive https://github.com/apache/spark/pull/14750 in order to fix this issue and related classes of issues. The only other alternatives are (1) reconciling on-disk schemas with metastore schema at planning time, which seems pretty messy, and (2) fixing all the datasources to support case-insensitive matching, which also has issues. cc [~cloud_fan] > Can't filter over mixed case parquet columns of converted Hive tables > - > > Key: SPARK-17983 > URL: https://issues.apache.org/jira/browse/SPARK-17983 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Critical > > We should probably revive https://github.com/apache/spark/pull/14750 in order > to fix this issue and related classes of issues. > The only other alternatives are (1) reconciling on-disk schemas with > metastore schema at planning time, which seems pretty messy, and (2) fixing > all the datasources to support case-insensitive matching, which also has > issues. > Reproduction: > {code} > private def setupPartitionedTable(tableName: String, dir: File): Unit = { > spark.range(5).selectExpr("id as normalCol", "id as partCol1", "id as > partCol2").write > .partitionBy("partCol1", "partCol2") > .mode("overwrite") > .parquet(dir.getAbsolutePath) > spark.sql(s""" > |create external table $tableName (normalCol long) > |partitioned by (partCol1 int, partCol2 int) > |stored as parquet > |location "${dir.getAbsolutePath}.stripMargin) > spark.sql(s"msck repair table $tableName") > } > test("filter by mixed case col") { > withTable("test") { > withTempDir { dir => > setupPartitionedTable("test", dir) > val df = spark.sql("select * from test where normalCol = 3") > assert(df.count() == 1) > } > } > } > {code} > cc [~cloud_fan] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails when select statement contains limit clause
Franck Tago created SPARK-17982: --- Summary: Spark 2.0.0 CREATE VIEW statement fails when select statement contains limit clause Key: SPARK-17982 URL: https://issues.apache.org/jira/browse/SPARK-17982 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1, 2.0.0 Environment: spark 2.0.0 Reporter: Franck Tago The following statement fails in the spark shell . scala> spark.sql("CREATE VIEW DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") scala> spark.sql("CREATE VIEW DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS gen_subquery_1 at org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) at org.apache.spark.sql.Dataset.(Dataset.scala:186) at org.apache.spark.sql.Dataset.(Dataset.scala:167) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) This appears to be a limitation of the create view statement . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10872) Derby error (XSDB6) when creating new HiveContext after restarting SparkContext
[ https://issues.apache.org/jira/browse/SPARK-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584052#comment-15584052 ] Angus Gerry commented on SPARK-10872: - Hi [~srowen], I'm chasing down something in our code base at the moment that might be tangentially related to this issue. In our tests, we start and stop a new {{TestHiveContext}} for each test suite. Our builds recently started failing with this stack trace, ultimately caused by an {{IOException}} because "Too many open files" {noformat} java.lang.IllegalStateException: failed to create a child event loop at io.netty.util.concurrent.MultithreadEventExecutorGroup.(MultithreadEventExecutorGroup.java:68) at io.netty.channel.MultithreadEventLoopGroup.(MultithreadEventLoopGroup.java:49) at io.netty.channel.nio.NioEventLoopGroup.(NioEventLoopGroup.java:61) at io.netty.channel.nio.NioEventLoopGroup.(NioEventLoopGroup.java:52) at org.apache.spark.network.util.NettyUtils.createEventLoop(NettyUtils.java:56) at org.apache.spark.network.client.TransportClientFactory.(TransportClientFactory.java:104) at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:88) at org.apache.spark.network.netty.NettyBlockTransferService.init(NettyBlockTransferService.scala:63) at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:177) at org.apache.spark.SparkContext.(SparkContext.scala:536) ... Cause: io.netty.channel.ChannelException: failed to open a new selector at io.netty.channel.nio.NioEventLoop.openSelector(NioEventLoop.java:128) at io.netty.channel.nio.NioEventLoop.(NioEventLoop.java:120) at io.netty.channel.nio.NioEventLoopGroup.newChild(NioEventLoopGroup.java:87) at io.netty.util.concurrent.MultithreadEventExecutorGroup.(MultithreadEventExecutorGroup.java:64) at io.netty.channel.MultithreadEventLoopGroup.(MultithreadEventLoopGroup.java:49) at io.netty.channel.nio.NioEventLoopGroup.(NioEventLoopGroup.java:61) at io.netty.channel.nio.NioEventLoopGroup.(NioEventLoopGroup.java:52) at org.apache.spark.network.util.NettyUtils.createEventLoop(NettyUtils.java:56) at org.apache.spark.network.client.TransportClientFactory.(TransportClientFactory.java:104) at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:88) ... 
Cause: java.io.IOException: Too many open files at sun.nio.ch.IOUtil.makePipe(Native Method) at sun.nio.ch.EPollSelectorImpl.(EPollSelectorImpl.java:65) at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36) at io.netty.channel.nio.NioEventLoop.openSelector(NioEventLoop.java:126) at io.netty.channel.nio.NioEventLoop.(NioEventLoop.java:120) at io.netty.channel.nio.NioEventLoopGroup.newChild(NioEventLoopGroup.java:87) at io.netty.util.concurrent.MultithreadEventExecutorGroup.(MultithreadEventExecutorGroup.java:64) at io.netty.channel.MultithreadEventLoopGroup.(MultithreadEventLoopGroup.java:49) at io.netty.channel.nio.NioEventLoopGroup.(NioEventLoopGroup.java:61) at io.netty.channel.nio.NioEventLoopGroup.(NioEventLoopGroup.java:52) {noformat} Running our test suite locally, and keeping an eye on the jvm process with lsof, I can see that the number of open file handles continues to grow larger and larger, and over 75% of the paths look something like this: {{/tmp/spark-a0ff08e6-ae94-42ad-8a9c-bc43dee0b283/metastore/seg0/c530.dat}} My initial tracing through the code indicates that even though we're stopping the context, it's not closing its connection to the {{executionHive}} object, which runs as a derby DB in a tmp directory as above. This is where my 'tangentially related' comes in - if the context were actually closing its derby DB connections, then we mightn't be hitting the issue at all. FWIW the [programming guide|http://spark.apache.org/docs/latest/programming-guide.html#initializing-spark] does state the following, which at the very least _implies_ that stopping and then subsequently starting a context within one JVM is supported. {quote} Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one. {quote} Personally I don't much care about said support other than needing it for our tests. If [~belevtsoff] doesn't start working on a PR for this, I'll start trying to work on a fix for my problems shortly. > Derby error (XSDB6) when creating new HiveContext after restarting > SparkContext > --- > > Key: SPARK-10872 > URL: https://issues.apache.org/jira/browse/SPARK-10872 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.0, 1.4.1, 1.5.0 >Reporter: Dmytro Bielievtsov > > Starting from spark 1.4.0 (works well on 1.3.1), the following code fails > with "XSDB6: Another instance of Derby may have already booted the
[jira] [Comment Edited] (SPARK-17950) Match SparseVector behavior with DenseVector
[ https://issues.apache.org/jira/browse/SPARK-17950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583877#comment-15583877 ] AbderRahman Sobh edited comment on SPARK-17950 at 10/18/16 12:07 AM: - Yes, the full array needs to be expanded since the numpy functions potentially need to operate on every value in the array. There is room for another implementation that instead simply mimics the numpy functions (and their handles) and provides smarter implementations for solving means and such when using a SparseVector. If that is preferable, I can modify the code to do that instead. Note also that the unpacked array is automatically cleared out after the call. was (Author: itg-abby): Yes, the full array needs to be expanded since the numpy functions potentially need to operate on every value in the array. There is room for another implementation that instead simply mimics the numpy functions (and their handles) and provides smarter implementations for solving means and such when using a SparseVector. If that is preferable, I can modify the code to do that instead. > Match SparseVector behavior with DenseVector > > > Key: SPARK-17950 > URL: https://issues.apache.org/jira/browse/SPARK-17950 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 2.0.1 >Reporter: AbderRahman Sobh >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > Simply added the `__getattr__` to SparseVector that DenseVector has, but > calls self.toArray() instead of storing a vector all the time in self.array > This allows for use of numpy functions on the values of a SparseVector in the > same direct way that users interact with DenseVectors. > i.e. you can simply call SparseVector.mean() to average the values in the > entire vector. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17950) Match SparseVector behavior with DenseVector
[ https://issues.apache.org/jira/browse/SPARK-17950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583877#comment-15583877 ] AbderRahman Sobh edited comment on SPARK-17950 at 10/18/16 12:07 AM: - Yes, the full array needs to be expanded since the numpy functions potentially need to operate on every value in the array. There is room for another implementation that instead simply mimics the numpy functions (and their handles) and provides smarter implementations for solving means and such when using a SparseVector. If that is preferable, I can modify the code to do that instead. was (Author: itg-abby): Yes, the full array needs to be expanded since the numpy functions potentially need to operate on every value in the array. There is room for another implementation that instead simply mimics the numpy functions (and their handles) and provides smarter implementations for solving means and such when using a SparseVector. If that is preferable, I can modify the code to do that instead. I also just realized that I am not 100% sure if the garbage collection works as I am expecting. My assumption was that Python would automatically clean up after using the array, but since it is technically inside of the object's magic method I cannot tell if it might need another line to explicitly clear the array out. > Match SparseVector behavior with DenseVector > > > Key: SPARK-17950 > URL: https://issues.apache.org/jira/browse/SPARK-17950 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 2.0.1 >Reporter: AbderRahman Sobh >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > Simply added the `__getattr__` to SparseVector that DenseVector has, but > calls self.toArray() instead of storing a vector all the time in self.array > This allows for use of numpy functions on the values of a SparseVector in the > same direct way that users interact with DenseVectors. > i.e. you can simply call SparseVector.mean() to average the values in the > entire vector. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17950) Match SparseVector behavior with DenseVector
[ https://issues.apache.org/jira/browse/SPARK-17950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583877#comment-15583877 ] AbderRahman Sobh edited comment on SPARK-17950 at 10/18/16 12:05 AM: - Yes, the full array needs to be expanded since the numpy functions potentially need to operate on every value in the array. There is room for another implementation that instead simply mimics the numpy functions (and their handles) and provides smarter implementations for solving means and such when using a SparseVector. If that is preferable, I can modify the code to do that instead. I also just realized that I am not 100% sure if the garbage collection works as I am expecting. My assumption was that Python would automatically clean up after using the array, but since it is technically inside of the object's magic method I cannot tell if it might need another line to explicitly clear the array out. was (Author: itg-abby): Yes, the full array needs to be expanded since the numpy functions potentially need to operate on every value in the array. There is room for another implementation that instead simply mimics the numpy functions (and their handles) and provides smarter implementations for solving means and such when using a SparseVector. If that is preferable, I can modify the code to do that instead. I also just realized that I am not 100% sure if the garbage collection works as I am expecting. My assumption was that Python would automatically clean up after using the array, but since it is technically inside of the object it might need another line to explicitly clear the array out? > Match SparseVector behavior with DenseVector > > > Key: SPARK-17950 > URL: https://issues.apache.org/jira/browse/SPARK-17950 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 2.0.1 >Reporter: AbderRahman Sobh >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > Simply added the `__getattr__` to SparseVector that DenseVector has, but > calls self.toArray() instead of storing a vector all the time in self.array > This allows for use of numpy functions on the values of a SparseVector in the > same direct way that users interact with DenseVectors. > i.e. you can simply call SparseVector.mean() to average the values in the > entire vector. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17731) Metrics for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-17731: -- Fix Version/s: 2.0.2 > Metrics for Structured Streaming > > > Key: SPARK-17731 > URL: https://issues.apache.org/jira/browse/SPARK-17731 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 2.0.2, 2.1.0 > > > Metrics are needed for monitoring structured streaming apps. Here is the > design doc for implementing the necessary metrics. > https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17950) Match SparseVector behavior with DenseVector
[ https://issues.apache.org/jira/browse/SPARK-17950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583877#comment-15583877 ] AbderRahman Sobh commented on SPARK-17950: -- Yes, the full array needs to be expanded since the numpy functions potentially need to operate on every value in the array. There is room for another implementation that instead simply mimics the numpy functions (and their handles) and provides smarter implementations for solving means and such when using a SparseVector. If that is preferable, I can modify the code to do that instead. I also just realized that I am not 100% sure if the garbage collection works as I am expecting. My assumption was that Python would automatically clean up after using the array, but since it is technically inside of the object it might need another line to explicitly clear the array out? > Match SparseVector behavior with DenseVector > > > Key: SPARK-17950 > URL: https://issues.apache.org/jira/browse/SPARK-17950 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 2.0.1 >Reporter: AbderRahman Sobh >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > Simply added the `__getattr__` to SparseVector that DenseVector has, but > calls self.toArray() instead of storing a vector all the time in self.array > This allows for use of numpy functions on the values of a SparseVector in the > same direct way that users interact with DenseVectors. > i.e. you can simply call SparseVector.mean() to average the values in the > entire vector. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17981) Incorrectly Set Nullability to False in FilterExec
[ https://issues.apache.org/jira/browse/SPARK-17981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17981: Assignee: Apache Spark (was: Xiao Li) > Incorrectly Set Nullability to False in FilterExec > -- > > Key: SPARK-17981 > URL: https://issues.apache.org/jira/browse/SPARK-17981 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Critical > > When FilterExec contains isNotNull, which could be inferred and pushed down > or users specified, we convert the nullability of the involved columns if the > top-layer expression is null-intolerant. However, this is not true, if the > top-layer expression is not a leaf expression, it could still tolerate the > null when it has null-tolerant child expression. > For example, cast(coalesce(a#5, a#15) as double). Although cast is a > null-intolerant expression, but obviously coalesce is a null-tolerant. > When the nullability is wrong, we could generate incorrect results in > different cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows
[ https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17957: Assignee: Apache Spark (was: Xiao Li) > Calling outer join and na.fill(0) and then inner join will miss rows > > > Key: SPARK-17957 > URL: https://issues.apache.org/jira/browse/SPARK-17957 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: Spark 2.0.1, Mac, Local >Reporter: Linbo >Assignee: Apache Spark >Priority: Critical > Labels: correctness > > I reported a similar bug two months ago and it's fixed in Spark 2.0.1: > https://issues.apache.org/jira/browse/SPARK-17060 But I find a new bug: when > I insert a na.fill(0) call between outer join and inner join in the same > workflow in SPARK-17060 I get wrong result. > {code:title=spark-shell|borderStyle=solid} > scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b") > a: org.apache.spark.sql.DataFrame = [a: int, b: int] > scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c") > b: org.apache.spark.sql.DataFrame = [a: int, c: int] > scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0) > ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field] > scala> ab.show > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 2| 0| > | 3| 0| 4| > | 2| 3| 5| > +---+---+---+ > scala> val c = Seq((3, 1)).toDF("a", "d") > c: org.apache.spark.sql.DataFrame = [a: int, d: int] > scala> c.show > +---+---+ > | a| d| > +---+---+ > | 3| 1| > +---+---+ > scala> ab.join(c, "a").show > +---+---+---+---+ > | a| b| c| d| > +---+---+---+---+ > +---+---+---+---+ > {code} > And again if i use persist, the result is correct. I think the problem is > join optimizer similar to this pr: https://github.com/apache/spark/pull/14661 > {code:title=spark-shell|borderStyle=solid} > scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist > ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int > ... 1 more field] > scala> ab.show > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 2| 0| > | 3| 0| 4| > | 2| 3| 5| > +---+---+---+ > scala> ab.join(c, "a").show > +---+---+---+---+ > | a| b| c| d| > +---+---+---+---+ > | 3| 0| 4| 1| > +---+---+---+---+ > {code} > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows
[ https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583874#comment-15583874 ] Apache Spark commented on SPARK-17957: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/15523 > Calling outer join and na.fill(0) and then inner join will miss rows > > > Key: SPARK-17957 > URL: https://issues.apache.org/jira/browse/SPARK-17957 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: Spark 2.0.1, Mac, Local >Reporter: Linbo >Assignee: Xiao Li >Priority: Critical > Labels: correctness > > I reported a similar bug two months ago and it's fixed in Spark 2.0.1: > https://issues.apache.org/jira/browse/SPARK-17060 But I find a new bug: when > I insert a na.fill(0) call between outer join and inner join in the same > workflow in SPARK-17060 I get wrong result. > {code:title=spark-shell|borderStyle=solid} > scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b") > a: org.apache.spark.sql.DataFrame = [a: int, b: int] > scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c") > b: org.apache.spark.sql.DataFrame = [a: int, c: int] > scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0) > ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field] > scala> ab.show > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 2| 0| > | 3| 0| 4| > | 2| 3| 5| > +---+---+---+ > scala> val c = Seq((3, 1)).toDF("a", "d") > c: org.apache.spark.sql.DataFrame = [a: int, d: int] > scala> c.show > +---+---+ > | a| d| > +---+---+ > | 3| 1| > +---+---+ > scala> ab.join(c, "a").show > +---+---+---+---+ > | a| b| c| d| > +---+---+---+---+ > +---+---+---+---+ > {code} > And again if i use persist, the result is correct. I think the problem is > join optimizer similar to this pr: https://github.com/apache/spark/pull/14661 > {code:title=spark-shell|borderStyle=solid} > scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist > ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int > ... 1 more field] > scala> ab.show > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 2| 0| > | 3| 0| 4| > | 2| 3| 5| > +---+---+---+ > scala> ab.join(c, "a").show > +---+---+---+---+ > | a| b| c| d| > +---+---+---+---+ > | 3| 0| 4| 1| > +---+---+---+---+ > {code} > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17368) Scala value classes create encoder problems and break at runtime
[ https://issues.apache.org/jira/browse/SPARK-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583873#comment-15583873 ] Aris Vlasakakis commented on SPARK-17368: - That is great, thank you for the help with this. > Scala value classes create encoder problems and break at runtime > > > Key: SPARK-17368 > URL: https://issues.apache.org/jira/browse/SPARK-17368 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.6.2, 2.0.0 > Environment: JDK 8 on MacOS > Scala 2.11.8 > Spark 2.0.0 >Reporter: Aris Vlasakakis >Assignee: Jakob Odersky > Fix For: 2.1.0 > > > Using Scala value classes as the inner type for Datasets breaks in Spark 2.0 > and 1.6.X. > This simple Spark 2 application demonstrates that the code will compile, but > will break at runtime with the error. The value class is of course > *FeatureId*, as it extends AnyVal. > {noformat} > Exception in thread "main" java.lang.RuntimeException: Error while encoding: > java.lang.RuntimeException: Couldn't find v on int > assertnotnull(input[0, int, true], top level non-flat input object).v AS v#0 > +- assertnotnull(input[0, int, true], top level non-flat input object).v >+- assertnotnull(input[0, int, true], top level non-flat input object) > +- input[0, int, true]". > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:279) > at > org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421) > at > org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421) > {noformat} > Test code for Spark 2.0.0: > {noformat} > import org.apache.spark.sql.{Dataset, SparkSession} > object BreakSpark { > case class FeatureId(v: Int) extends AnyVal > def main(args: Array[String]): Unit = { > val seq = Seq(FeatureId(1), FeatureId(2), FeatureId(3)) > val spark = SparkSession.builder.getOrCreate() > import spark.implicits._ > spark.sparkContext.setLogLevel("warn") > val ds: Dataset[FeatureId] = spark.createDataset(seq) > println(s"BREAK HERE: ${ds.count}") > } > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows
[ https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17957: Assignee: Xiao Li (was: Apache Spark) > Calling outer join and na.fill(0) and then inner join will miss rows > > > Key: SPARK-17957 > URL: https://issues.apache.org/jira/browse/SPARK-17957 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: Spark 2.0.1, Mac, Local >Reporter: Linbo >Assignee: Xiao Li >Priority: Critical > Labels: correctness > > I reported a similar bug two months ago and it's fixed in Spark 2.0.1: > https://issues.apache.org/jira/browse/SPARK-17060 But I find a new bug: when > I insert a na.fill(0) call between outer join and inner join in the same > workflow in SPARK-17060 I get wrong result. > {code:title=spark-shell|borderStyle=solid} > scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b") > a: org.apache.spark.sql.DataFrame = [a: int, b: int] > scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c") > b: org.apache.spark.sql.DataFrame = [a: int, c: int] > scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0) > ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field] > scala> ab.show > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 2| 0| > | 3| 0| 4| > | 2| 3| 5| > +---+---+---+ > scala> val c = Seq((3, 1)).toDF("a", "d") > c: org.apache.spark.sql.DataFrame = [a: int, d: int] > scala> c.show > +---+---+ > | a| d| > +---+---+ > | 3| 1| > +---+---+ > scala> ab.join(c, "a").show > +---+---+---+---+ > | a| b| c| d| > +---+---+---+---+ > +---+---+---+---+ > {code} > And again if i use persist, the result is correct. I think the problem is > join optimizer similar to this pr: https://github.com/apache/spark/pull/14661 > {code:title=spark-shell|borderStyle=solid} > scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist > ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int > ... 1 more field] > scala> ab.show > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 2| 0| > | 3| 0| 4| > | 2| 3| 5| > +---+---+---+ > scala> ab.join(c, "a").show > +---+---+---+---+ > | a| b| c| d| > +---+---+---+---+ > | 3| 0| 4| 1| > +---+---+---+---+ > {code} > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17981) Incorrectly Set Nullability to False in FilterExec
[ https://issues.apache.org/jira/browse/SPARK-17981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17981: Assignee: Xiao Li (was: Apache Spark) > Incorrectly Set Nullability to False in FilterExec > -- > > Key: SPARK-17981 > URL: https://issues.apache.org/jira/browse/SPARK-17981 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Critical > > When FilterExec contains isNotNull, which could be inferred and pushed down > or users specified, we convert the nullability of the involved columns if the > top-layer expression is null-intolerant. However, this is not true, if the > top-layer expression is not a leaf expression, it could still tolerate the > null when it has null-tolerant child expression. > For example, cast(coalesce(a#5, a#15) as double). Although cast is a > null-intolerant expression, but obviously coalesce is a null-tolerant. > When the nullability is wrong, we could generate incorrect results in > different cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17981) Incorrectly Set Nullability to False in FilterExec
[ https://issues.apache.org/jira/browse/SPARK-17981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583872#comment-15583872 ] Apache Spark commented on SPARK-17981: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/15523 > Incorrectly Set Nullability to False in FilterExec > -- > > Key: SPARK-17981 > URL: https://issues.apache.org/jira/browse/SPARK-17981 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Critical > > When FilterExec contains isNotNull, which could be inferred and pushed down > or users specified, we convert the nullability of the involved columns if the > top-layer expression is null-intolerant. However, this is not true, if the > top-layer expression is not a leaf expression, it could still tolerate the > null when it has null-tolerant child expression. > For example, cast(coalesce(a#5, a#15) as double). Although cast is a > null-intolerant expression, but obviously coalesce is a null-tolerant. > When the nullability is wrong, we could generate incorrect results in > different cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17981) Incorrectly Set Nullability to False in FilterExec
Xiao Li created SPARK-17981: --- Summary: Incorrectly Set Nullability to False in FilterExec Key: SPARK-17981 URL: https://issues.apache.org/jira/browse/SPARK-17981 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1, 2.1.0 Reporter: Xiao Li Assignee: Xiao Li Priority: Critical When FilterExec contains an isNotNull predicate, whether inferred and pushed down or specified by users, we mark the involved columns as non-nullable if the top-level expression is null-intolerant. However, this is not always correct: if the top-level expression is not a leaf expression, it can still tolerate nulls when it has a null-tolerant child expression. For example, in cast(coalesce(a#5, a#15) as double), cast is a null-intolerant expression but coalesce is obviously null-tolerant. When the nullability is wrong, we can generate incorrect results in various cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
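To illustrate why the coalesce example is null-tolerant as a whole, a small spark-shell sketch (column names {{a}}, {{b}}, {{c}} are hypothetical): even on the row where {{a}} is null, cast(coalesce(a, b) as double) still produces a value, so treating the overall expression as null-intolerant would be wrong.
{code}
val df = Seq((Some(1), None: Option[Int]), (None: Option[Int], Some(2))).toDF("a", "b")
// coalesce absorbs the null child, so cast still has a non-null input when a is null
df.selectExpr("cast(coalesce(a, b) as double) as c").show()
// +---+
// |  c|
// +---+
// |1.0|
// |2.0|
// +---+
{code}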
[jira] [Commented] (SPARK-15708) Tasks table in Detailed Stage page shows ip instead of hostname under Executor ID/Host
[ https://issues.apache.org/jira/browse/SPARK-15708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583751#comment-15583751 ] Alex Bozarth commented on SPARK-15708: -- I'm not sure closing this as "cannot reproduce" was correct, but I'm not sure how it could be fixed either. Due to the nature of those tables they get the host string from entirely different places in code. For the task table it's stored in {{TaskInfo}} but for the Agg. Metrics tables it's stored in {{BlockManagerId}}. The better question is when can these two end up with different host strings (IP vs hostname) and why. [~tgraves] is this something you would want fixed or was it just a behavioral oddity? > Tasks table in Detailed Stage page shows ip instead of hostname under > Executor ID/Host > -- > > Key: SPARK-15708 > URL: https://issues.apache.org/jira/browse/SPARK-15708 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Thomas Graves >Priority: Minor > > If you go to the detailed Stages page in Spark 2.0, the Tasks table under the > Executor ID/Host columns shows the hostname as an ip address rather than a > fully qualified hostname. > The table above it (Aggregated Metrics by Executor) shows the "Address" as > the full hostname. > I'm running spark on yarn on latest branch-2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17979) Remove deprecated support for config SPARK_YARN_USER_ENV
[ https://issues.apache.org/jira/browse/SPARK-17979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17979: -- Priority: Trivial (was: Major) Issue Type: Improvement (was: Bug) (Please set fields appropriately) There are a number of deprecated env variables that can be removed. Can you look through the others and identify a logical set to remove together? It may not be all of them, but is probably more than this one. > Remove deprecated support for config SPARK_YARN_USER_ENV > - > > Key: SPARK-17979 > URL: https://issues.apache.org/jira/browse/SPARK-17979 > Project: Spark > Issue Type: Improvement >Reporter: Kishor Patil >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17971) Unix timestamp handling in Spark SQL not allowing calculations on UTC times
[ https://issues.apache.org/jira/browse/SPARK-17971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583676#comment-15583676 ] Sean Owen commented on SPARK-17971: --- I'll say that I find the semantics of the Hive QL datetime + timezone functions odd, and Spark SQL is just mimicking them. For example, the behavior of from_utc_timestamp is already hard to understand because it operates on longs, essentially, and these can only really be thought of as absolute time since the epoch, not a quantity with a time zone inside that can vary. That is, what's the "non-UTC" timestamp that comes out? So from_utc_timestamp(x, "PST") will return a timestamp whose value is smaller by 8 * 3600 * 1000 because PST is GMT-8 (GMT vs UTC issue noted). But what does that even mean? It's still a "UTC" timestamp, just an 8-hour earlier one. It's the timestamp whose UTC-hour would equal the PST-hour of timestamp x. hour() et al. will answer with respect to the current system timezone, yes. If your system is in PST, and you want to know the UTC-hour of a timestamp x, then you need a time whose PST-hour matches the UTC-hour of x. That's the reverse. I believe you want: select hour(to_utc_timestamp(cast(1476354405 as timestamp), "PST")) That works for me. Of course you can programmatically insert TimeZone.getDefault.getID instead of "PST". I believe that then works as desired everywhere. It has some logic in that it reads as "the hour of a UTC timestamp ..." but it's not straightforward IMHO. But there are tools for this, and these are those tools. Hive has the same, and so I think this would be considered working as intended. I looked at MySQL just now and it seems to have similar behaviors, with somewhat different methods, FWIW. > Unix timestamp handling in Spark SQL not allowing calculations on UTC times > --- > > Key: SPARK-17971 > URL: https://issues.apache.org/jira/browse/SPARK-17971 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.6.2 > Environment: MacOS X JDK 7 >Reporter: Gabriele Del Prete > > In our Spark data pipeline we store timed events using a bigint column called > 'timestamp', the values contained being Unix timestamp time points. > Our datacenter servers Java VMs are all set up to start with timezone set to > UTC, while developer's computers are all in the US Eastern timezone. > Given how Spark SQL datetime functions work, it's impossible to do > calculations (eg. extract and compare hours, year-month-date triplets) using > UTC values: > - from_unixtime takes a bigint unix timestamp and forces it to the computer's > local timezone; > - casting the bigint column to timestamp does the same (it converts it to the > local timezone) > - from_utc_timestamp works in the same way, the only difference being that it > gets a string as input instead of a bigint. > The result of all of this is that it's impossible to extract individual > fields of a UTC timestamp, since all timestamps always get converted to the > local timezone. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
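A spark-shell (2.x) sketch of the programmatic substitution suggested above, reusing the epoch value from this thread; the intent is to get the UTC hour of an epoch-seconds value regardless of the JVM's time zone:
{code}
import java.util.TimeZone

// cast() interprets the epoch seconds and renders fields in the session's local zone;
// to_utc_timestamp with the local zone ID shifts that wall-clock time back to UTC.
val tz = TimeZone.getDefault.getID
spark.sql(s"select hour(to_utc_timestamp(cast(1476354405 as timestamp), '$tz')) as utc_hour").show()
{code}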
[jira] [Assigned] (SPARK-17980) Fix refreshByPath for converted Hive tables
[ https://issues.apache.org/jira/browse/SPARK-17980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17980: Assignee: Apache Spark > Fix refreshByPath for converted Hive tables > --- > > Key: SPARK-17980 > URL: https://issues.apache.org/jira/browse/SPARK-17980 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Assignee: Apache Spark >Priority: Minor > > There is a small bug introduced in https://github.com/apache/spark/pull/14690 > which broke refreshByPath with converted hive tables (though, it turns out it > was very difficult to refresh converted hive tables anyways, since you had to > specify the exact path of one of the partitions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17980) Fix refreshByPath for converted Hive tables
[ https://issues.apache.org/jira/browse/SPARK-17980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583652#comment-15583652 ] Apache Spark commented on SPARK-17980: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/15521 > Fix refreshByPath for converted Hive tables > --- > > Key: SPARK-17980 > URL: https://issues.apache.org/jira/browse/SPARK-17980 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Minor > > There is a small bug introduced in https://github.com/apache/spark/pull/14690 > which broke refreshByPath with converted hive tables (though, it turns out it > was very difficult to refresh converted hive tables anyways, since you had to > specify the exact path of one of the partitions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17980) Fix refreshByPath for converted Hive tables
[ https://issues.apache.org/jira/browse/SPARK-17980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17980: Assignee: (was: Apache Spark) > Fix refreshByPath for converted Hive tables > --- > > Key: SPARK-17980 > URL: https://issues.apache.org/jira/browse/SPARK-17980 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Minor > > There is a small bug introduced in https://github.com/apache/spark/pull/14690 > which broke refreshByPath with converted hive tables (though, it turns out it > was very difficult to refresh converted hive tables anyways, since you had to > specify the exact path of one of the partitions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17980) Fix refreshByPath for converted Hive tables
Eric Liang created SPARK-17980: -- Summary: Fix refreshByPath for converted Hive tables Key: SPARK-17980 URL: https://issues.apache.org/jira/browse/SPARK-17980 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Eric Liang Priority: Minor There is a small bug introduced in https://github.com/apache/spark/pull/14690 which broke refreshByPath with converted hive tables (though, it turns out it was very difficult to refresh converted hive tables anyways, since you had to specify the exact path of one of the partitions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
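For context, a minimal usage sketch of the API in question; the path and table name below are hypothetical:
{code}
// Invalidate cached data and metadata for any cached relation containing files under this path,
// so that the next read sees the current contents on disk.
spark.catalog.refreshByPath("/warehouse/events/part=2016-10-18")
val df = spark.table("events")
{code}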
[jira] [Commented] (SPARK-7721) Generate test coverage report from Python
[ https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583642#comment-15583642 ] Josh Rosen commented on SPARK-7721: --- IIRC when I looked into this I hit problems with the HTML Publisher Plugin not being able to properly publish / serve HTML reports which weren't present on the Jenkins master because the underlying files weren't being archived properly from the remote build workspaces. From a cursory Google search, it looks like other folks have hit similar problems with this: https://issues.jenkins-ci.org/browse/JENKINS-6780 https://issues.jenkins-ci.org/browse/JENKINS-15301 Ideally we could use the Codecov service to aggregate and publish these reports. Last month I opened a ticket with Apache Infra to ask about obtaining the token which would let us push results to that service, but they haven't responded back to my latest comment yet: https://issues.apache.org/jira/browse/INFRA-12640 Alternatively, we could write some one-off shell to archive the reports to a public S3 bucket and serve them as static files. > Generate test coverage report from Python > - > > Key: SPARK-7721 > URL: https://issues.apache.org/jira/browse/SPARK-7721 > Project: Spark > Issue Type: Test > Components: PySpark, Tests >Reporter: Reynold Xin > > Would be great to have test coverage report for Python. Compared with Scala, > it is tricker to understand the coverage without coverage reports in Python > because we employ both docstring tests and unit tests in test files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python
[ https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583638#comment-15583638 ] Davies Liu commented on SPARK-10915: Currently all the aggregate functions are implemented in Scala and execute one row at a time. This will not work for a Python UDAF: the overhead of crossing between the JVM and the Python process for every row would make it extremely slow. > Add support for UDAFs in Python > --- > > Key: SPARK-10915 > URL: https://issues.apache.org/jira/browse/SPARK-10915 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Justin Uang > > This should support Python-defined lambdas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17979) Remove deprecated support for config SPARK_YARN_USER_ENV
Kishor Patil created SPARK-17979: Summary: Remove deprecated support for config SPARK_YARN_USER_ENV Key: SPARK-17979 URL: https://issues.apache.org/jira/browse/SPARK-17979 Project: Spark Issue Type: Bug Reporter: Kishor Patil -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3132) Avoid serialization for Array[Byte] in TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-3132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-3132. -- Resolution: Not A Problem Marking it as not-a-problem for now given Josh's comment. > Avoid serialization for Array[Byte] in TorrentBroadcast > --- > > Key: SPARK-3132 > URL: https://issues.apache.org/jira/browse/SPARK-3132 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin > > If the input data is a byte array, we should allow TorrentBroadcast to skip > serializing and compressing the input. > To do this, we should add a new parameter (shortCircuitByteArray) to > TorrentBroadcast, and then avoid serialization in if the input is byte array > and shortCircuitByteArray is true. > We should then also do compression in task serialization itself instead of > doing it in TorrentBroadcast. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3132) Avoid serialization for Array[Byte] in TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-3132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583605#comment-15583605 ] Josh Rosen commented on SPARK-3132: --- I don't think that this is being actively worked on. I remember doing a POC prototype of using a custom {{Serializer}} for byte arrays and found that doing that by itself didn't seem to result in huge performance gains, but if we can manage to skip JVM-side compression of already-compressed Python arrays then I could see that being a reasonable small win. > Avoid serialization for Array[Byte] in TorrentBroadcast > --- > > Key: SPARK-3132 > URL: https://issues.apache.org/jira/browse/SPARK-3132 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin > > If the input data is a byte array, we should allow TorrentBroadcast to skip > serializing and compressing the input. > To do this, we should add a new parameter (shortCircuitByteArray) to > TorrentBroadcast, and then avoid serialization in if the input is byte array > and shortCircuitByteArray is true. > We should then also do compression in task serialization itself instead of > doing it in TorrentBroadcast. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3132) Avoid serialization for Array[Byte] in TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-3132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3132: -- Assignee: (was: Davies Liu) > Avoid serialization for Array[Byte] in TorrentBroadcast > --- > > Key: SPARK-3132 > URL: https://issues.apache.org/jira/browse/SPARK-3132 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin > > If the input data is a byte array, we should allow TorrentBroadcast to skip > serializing and compressing the input. > To do this, we should add a new parameter (shortCircuitByteArray) to > TorrentBroadcast, and then avoid serialization in if the input is byte array > and shortCircuitByteArray is true. > We should then also do compression in task serialization itself instead of > doing it in TorrentBroadcast. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4160) Standalone cluster mode does not upload all needed jars to driver node
[ https://issues.apache.org/jira/browse/SPARK-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583531#comment-15583531 ] Amit Assudani commented on SPARK-4160: -- I can work on fixing this. Let me know. > Standalone cluster mode does not upload all needed jars to driver node > -- > > Key: SPARK-4160 > URL: https://issues.apache.org/jira/browse/SPARK-4160 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Marcelo Vanzin > > If you look at the code in {{DriverRunner.scala}}, there is code to download > the main application jar from the launcher node. But that's the only jar > that's downloaded - if the driver depends on one of the jars or files > specified via {{spark-submit --jars --files }}, it won't be able > to run. > It should be possible to use the same mechanism to distribute the other files > to the driver node, even if that's not the most efficient way of doing it. > That way, at least, you don't need any external dependencies to be able to > distribute the files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-17976) Global options to spark-submit should not be position-sensitive
[ https://issues.apache.org/jira/browse/SPARK-17976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas closed SPARK-17976. Resolution: Not A Problem Ah, makes perfect sense. Would have realized that myself if I had held off on reporting this for just a day or so. Apologies. > Global options to spark-submit should not be position-sensitive > --- > > Key: SPARK-17976 > URL: https://issues.apache.org/jira/browse/SPARK-17976 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.0.0, 2.0.1 >Reporter: Nicholas Chammas >Priority: Minor > > It is maddening that this does what you expect: > {code} > spark-submit --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 \ > file.py > {code} > whereas this doesn't because {{--packages}} is totally ignored: > {code} > spark-submit file.py \ > --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 > {code} > Ideally, global options should be valid no matter where they are specified. > If that's too much work, then I think at the very least {{spark-submit}} > should display a warning that some input is being ignored. (Ideally, it > should error out, but that's probably not possible for > backwards-compatibility reasons at this point.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17977) DataFrameReader and DataStreamReader should have an ancestor class
[ https://issues.apache.org/jira/browse/SPARK-17977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Assudani updated SPARK-17977: -- Affects Version/s: 2.0.0 > DataFrameReader and DataStreamReader should have an ancestor class > -- > > Key: SPARK-17977 > URL: https://issues.apache.org/jira/browse/SPARK-17977 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Amit Assudani >Priority: Critical > > There should be an ancestor class of DataFrameReader and DataStreamReader to > configure common options / format and use common methods. Most of the methods > are exactly the same and take exactly the same arguments. This would help create > utilities / generic code that can be shared by stream and batch applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
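As a rough sketch of what the requested ancestor could look like (purely hypothetical; no such trait exists in Spark 2.0.x), the shared surface might be factored out along these lines:

{code}
// Hypothetical common ancestor; method names mirror the overlapping
// surface of DataFrameReader and DataStreamReader.
trait CommonReader[Self <: CommonReader[Self]] { self: Self =>
  protected var sourceFormat: String = ""
  protected var extraOptions: Map[String, String] = Map.empty

  def format(source: String): Self = { sourceFormat = source; self }

  def option(key: String, value: String): Self = {
    extraOptions = extraOptions + (key -> value); self
  }

  def options(opts: Map[String, String]): Self = {
    extraOptions = extraOptions ++ opts; self
  }
}
{code}

A batch reader would then extend CommonReader[DataFrameReader] and add only its load(...) methods, the streaming reader would add its own, and generic configuration code could be written once against the ancestor.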
[jira] [Resolved] (SPARK-17978) --jars option in spark-submit does not load jars for driver in spark - standalone mode
[ https://issues.apache.org/jira/browse/SPARK-17978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-17978. Resolution: Duplicate > --jars option in spark-submit does not load jars for driver in spark - > standalone mode > --- > > Key: SPARK-17978 > URL: https://issues.apache.org/jira/browse/SPARK-17978 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Submit >Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.0.1 >Reporter: Amit Assudani > > Additional jars (jar location URLs) provided via the --jars option in > spark-submit are not retrieved and loaded in DriverWrapper, so they are not > available for the application driver to find. This is handled for executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17978) --jars option in spark-submit does not load jars for driver in spark - standalone mode
[ https://issues.apache.org/jira/browse/SPARK-17978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583508#comment-15583508 ] Amit Assudani commented on SPARK-17978: --- I can fix this and send a PR. Let me know. > --jars option in spark-submit does not load jars for driver in spark - > standalone mode > --- > > Key: SPARK-17978 > URL: https://issues.apache.org/jira/browse/SPARK-17978 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Submit >Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.0.1 >Reporter: Amit Assudani > > Additional jars (jar location URLs) provided via the --jars option in > spark-submit are not retrieved and loaded in DriverWrapper, so they are not > available for the application driver to find. This is handled for executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17976) Global options to spark-submit should not be position-sensitive
[ https://issues.apache.org/jira/browse/SPARK-17976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583506#comment-15583506 ] Marcelo Vanzin commented on SPARK-17976: They are not being ignored. They are being passed as arguments to "file.py". A long time ago it was decided that the "resource" (i.e. the jar file or python file) would separate Spark options from application options. This was chosen for backwards compatibility; another option would be to use an explicit separator (e.g. "\-\-") but that would not be compatible with existing user scripts. So unless you have suggestion on how to differentiate Spark options from app options without the need for an explicit separator, this should probably be closed. > Global options to spark-submit should not be position-sensitive > --- > > Key: SPARK-17976 > URL: https://issues.apache.org/jira/browse/SPARK-17976 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.0.0, 2.0.1 >Reporter: Nicholas Chammas >Priority: Minor > > It is maddening that this does what you expect: > {code} > spark-submit --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 \ > file.py > {code} > whereas this doesn't because {{--packages}} is totally ignored: > {code} > spark-submit file.py \ > --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 > {code} > Ideally, global options should be valid no matter where they are specified. > If that's too much work, then I think at the very least {{spark-submit}} > should display a warning that some input is being ignored. (Ideally, it > should error out, but that's probably not possible for > backwards-compatibility reasons at this point.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17978) --jars option in spark-submit does not load jars for driver in spark - standalone mode
Amit Assudani created SPARK-17978: - Summary: --jars option in spark-submit does not load jars for driver in spark - standalone mode Key: SPARK-17978 URL: https://issues.apache.org/jira/browse/SPARK-17978 Project: Spark Issue Type: Bug Components: Spark Core, Spark Submit Affects Versions: 2.0.1, 2.0.0, 1.6.2, 1.6.1 Reporter: Amit Assudani Additional jars (jar location URLs) provided via the --jars option in spark-submit are not retrieved and loaded in DriverWrapper, so they are not available for the application driver to find. This is handled for executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17977) DataFrameReader and DataStreamReader should have an ancestor class
Amit Assudani created SPARK-17977: - Summary: DataFrameReader and DataStreamReader should have an ancestor class Key: SPARK-17977 URL: https://issues.apache.org/jira/browse/SPARK-17977 Project: Spark Issue Type: Wish Components: SQL Affects Versions: 2.0.1 Reporter: Amit Assudani Priority: Critical There should be an ancestor class of DataFrameReader and DataStreamReader to configure common options / format and use common methods. Most of the methods are exactly the same and take exactly the same arguments. This would help create utilities / generic code that can be shared by stream and batch applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17976) Global options to spark-submit should not be position-sensitive
Nicholas Chammas created SPARK-17976: Summary: Global options to spark-submit should not be position-sensitive Key: SPARK-17976 URL: https://issues.apache.org/jira/browse/SPARK-17976 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 2.0.1, 2.0.0 Reporter: Nicholas Chammas Priority: Minor It is maddening that this does what you expect: {code} spark-submit --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 \ file.py {code} whereas this doesn't because {{--packages}} is totally ignored: {code} spark-submit file.py \ --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 {code} Ideally, global options should be valid no matter where they are specified. If that's too much work, then I think at the very least {{spark-submit}} should display a warning that some input is being ignored. (Ideally, it should error out, but that's probably not possible for backwards-compatibility reasons at this point.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583453#comment-15583453 ] Apache Spark commented on SPARK-13747: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/15520 > Concurrent execution in SQL doesn't work with Scala ForkJoinPool > > > Key: SPARK-13747 > URL: https://issues.apache.org/jira/browse/SPARK-13747 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > > Run the following codes may fail > {code} > (1 to 100).par.foreach { _ => > println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()) > } > java.lang.IllegalArgumentException: spark.sql.execution.id is already set > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) > {code} > This is because SparkContext.runJob can be suspended when using a > ForkJoinPool (e.g.,scala.concurrent.ExecutionContext.Implicits.global) as it > calls Await.ready (introduced by https://github.com/apache/spark/pull/9264). > So when SparkContext.runJob is suspended, ForkJoinPool will run another task > in the same thread, however, the local properties has been polluted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583454#comment-15583454 ] Shixiong Zhu commented on SPARK-13747: -- [~chinwei] Could you test https://github.com/apache/spark/pull/15520 and see if the error is gone? > Concurrent execution in SQL doesn't work with Scala ForkJoinPool > > > Key: SPARK-13747 > URL: https://issues.apache.org/jira/browse/SPARK-13747 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > Fix For: 2.0.0 > > > Run the following codes may fail > {code} > (1 to 100).par.foreach { _ => > println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()) > } > java.lang.IllegalArgumentException: spark.sql.execution.id is already set > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) > {code} > This is because SparkContext.runJob can be suspended when using a > ForkJoinPool (e.g.,scala.concurrent.ExecutionContext.Implicits.global) as it > calls Await.ready (introduced by https://github.com/apache/spark/pull/9264). > So when SparkContext.runJob is suspended, ForkJoinPool will run another task > in the same thread, however, the local properties has been polluted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13747: Assignee: Apache Spark (was: Shixiong Zhu) > Concurrent execution in SQL doesn't work with Scala ForkJoinPool > > > Key: SPARK-13747 > URL: https://issues.apache.org/jira/browse/SPARK-13747 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > Fix For: 2.0.0 > > > Run the following codes may fail > {code} > (1 to 100).par.foreach { _ => > println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()) > } > java.lang.IllegalArgumentException: spark.sql.execution.id is already set > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) > {code} > This is because SparkContext.runJob can be suspended when using a > ForkJoinPool (e.g.,scala.concurrent.ExecutionContext.Implicits.global) as it > calls Await.ready (introduced by https://github.com/apache/spark/pull/9264). > So when SparkContext.runJob is suspended, ForkJoinPool will run another task > in the same thread, however, the local properties has been polluted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13747: Assignee: Shixiong Zhu (was: Apache Spark) > Concurrent execution in SQL doesn't work with Scala ForkJoinPool > > > Key: SPARK-13747 > URL: https://issues.apache.org/jira/browse/SPARK-13747 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > > Run the following codes may fail > {code} > (1 to 100).par.foreach { _ => > println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()) > } > java.lang.IllegalArgumentException: spark.sql.execution.id is already set > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) > {code} > This is because SparkContext.runJob can be suspended when using a > ForkJoinPool (e.g.,scala.concurrent.ExecutionContext.Implicits.global) as it > calls Await.ready (introduced by https://github.com/apache/spark/pull/9264). > So when SparkContext.runJob is suspended, ForkJoinPool will run another task > in the same thread, however, the local properties has been polluted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reopened SPARK-13747: -- Assignee: Shixiong Zhu (was: Andrew Or) There are other places need to be fixed. > Concurrent execution in SQL doesn't work with Scala ForkJoinPool > > > Key: SPARK-13747 > URL: https://issues.apache.org/jira/browse/SPARK-13747 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > > Run the following codes may fail > {code} > (1 to 100).par.foreach { _ => > println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()) > } > java.lang.IllegalArgumentException: spark.sql.execution.id is already set > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) > {code} > This is because SparkContext.runJob can be suspended when using a > ForkJoinPool (e.g.,scala.concurrent.ExecutionContext.Implicits.global) as it > calls Await.ready (introduced by https://github.com/apache/spark/pull/9264). > So when SparkContext.runJob is suspended, ForkJoinPool will run another task > in the same thread, however, the local properties has been polluted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
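As a workaround in the meantime, one sketch (editor's illustration, assuming a spark-shell session where {{spark}} and {{sc}} are in scope) is to run the concurrent queries on a dedicated fixed-size thread pool instead of the ForkJoinPool behind {{.par}} and the global ExecutionContext, so a blocked {{SparkContext.runJob}} never shares its thread-local properties with another task:

{code}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import spark.implicits._

// Fixed pool: plain worker threads are never "borrowed" for other tasks
// while a job is blocked, unlike ForkJoinPool's managed blocking.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

val jobs = (1 to 100).map { _ =>
  Future {
    sc.parallelize(1 to 5).map(i => (i, i)).toDF("a", "b").count()
  }
}
jobs.foreach(f => Await.result(f, 10.minutes))
{code}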
[jira] [Updated] (SPARK-17944) sbin/start-* scripts use of `hostname -f` fail with Solaris
[ https://issues.apache.org/jira/browse/SPARK-17944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik O'Shaughnessy updated SPARK-17944: --- Component/s: Deploy > sbin/start-* scripts use of `hostname -f` fail with Solaris > > > Key: SPARK-17944 > URL: https://issues.apache.org/jira/browse/SPARK-17944 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.0.1 > Environment: Solaris 10, Solaris 11 >Reporter: Erik O'Shaughnessy >Priority: Trivial > > {{$SPARK_HOME/sbin/start-master.sh}} fails: > {noformat} > $ ./start-master.sh > usage: hostname [[-t] system_name] >hostname [-D] > starting org.apache.spark.deploy.master.Master, logging to > /home/eoshaugh/local/spark/logs/spark-eoshaugh-org.apache.spark.deploy.master.Master-1-m7-16-002-ld1.out > failed to launch org.apache.spark.deploy.master.Master: > --properties-file FILE Path to a custom Spark properties file. >Default is conf/spark-defaults.conf. > full log in > /home/eoshaugh/local/spark/logs/spark-eoshaugh-org.apache.spark.deploy.master.Master-1-m7-16-002-ld1.out > {noformat} > I found SPARK-17546 which changed the invocation of hostname in > sbin/start-master.sh, sbin/start-slaves.sh and sbin/start-mesos-dispatcher.sh > to include the flag {{-f}}, which is not a valid command line option for the > Solaris hostname implementation. > As a workaround, Solaris users can substitute: > {noformat} > `/usr/sbin/check-hostname | awk '{print $NF}'` > {noformat} > Admittedly not an obvious fix, but it provides equivalent functionality. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15689) Data source API v2
[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15689: Description: This ticket tracks progress in creating the v2 of data source API. This new API should focus on: 1. Have a small surface so it is easy to freeze and maintain compatibility for a long time. Ideally, this API should survive architectural rewrites and user-facing API revamps of Spark. 2. Have a well-defined column batch interface for high performance. Convenience methods should exist to convert row-oriented formats into column batches for data source developers. 3. Still support filter push down, similar to the existing API. 4. Nice-to-have: support additional common operators, including limit and sampling. Note that both 1 and 2 are problems that the current data source API (v1) suffers. The current data source API has a wide surface with dependency on DataFrame/SQLContext, making the data source API compatibility depending on the upper level API. The current data source API is also only row oriented and has to go through an expensive external data type conversion to internal data type. was: This ticket tracks progress in creating the v2 of data source API. This new API should focus on: 1. Have a small surface so it is easy to freeze and maintain compatibility for a long time. Ideally, this API should survive architectural rewrites and user-facing API revamps of Spark. 2. Have a well-defined column batch interface for high performance. Convenience methods should exist to convert row-oriented formats into column batches for data source developers. 3. Still support filter push down, similar to the existing API. 4. Support sampling. Note that both 1 and 2 are problems that the current data source API (v1) suffers. The current data source API has a wide surface with dependency on DataFrame/SQLContext, making the data source API compatibility depending on the upper level API. The current data source API is also only row oriented and has to go through an expensive external data type conversion to internal data type. > Data source API v2 > -- > > Key: SPARK-15689 > URL: https://issues.apache.org/jira/browse/SPARK-15689 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > This ticket tracks progress in creating the v2 of data source API. This new > API should focus on: > 1. Have a small surface so it is easy to freeze and maintain compatibility > for a long time. Ideally, this API should survive architectural rewrites and > user-facing API revamps of Spark. > 2. Have a well-defined column batch interface for high performance. > Convenience methods should exist to convert row-oriented formats into column > batches for data source developers. > 3. Still support filter push down, similar to the existing API. > 4. Nice-to-have: support additional common operators, including limit and > sampling. > Note that both 1 and 2 are problems that the current data source API (v1) > suffers. The current data source API has a wide surface with dependency on > DataFrame/SQLContext, making the data source API compatibility depending on > the upper level API. The current data source API is also only row oriented > and has to go through an expensive external data type conversion to internal > data type. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15689) Data source API v2
[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15689: Description: This ticket tracks progress in creating the v2 of data source API. This new API should focus on: 1. Have a small surface so it is easy to freeze and maintain compatibility for a long time. Ideally, this API should survive architectural rewrites and user-facing API revamps of Spark. 2. Have a well-defined column batch interface for high performance. Convenience methods should exist to convert row-oriented formats into column batches for data source developers. 3. Still support filter push down, similar to the existing API. 4. Support sampling. Note that both 1 and 2 are problems that the current data source API (v1) suffers. The current data source API has a wide surface with dependency on DataFrame/SQLContext, making the data source API compatibility depending on the upper level API. The current data source API is also only row oriented and has to go through an expensive external data type conversion to internal data type. was: This ticket tracks progress in creating the v2 of data source API. This new API should focus on: 1. Have a small surface so it is easy to freeze and maintain compatibility for a long time. Ideally, this API should survive architectural rewrites and user-facing API revamps of Spark. 2. Have a well-defined column batch interface for high performance. Convenience methods should exist to convert row-oriented formats into column batches for data source developers. 3. Still support filter push down, similar to the existing API. Note that both 1 and 2 are problems that the current data source API (v1) suffers. The current data source API has a wide surface with dependency on DataFrame/SQLContext, making the data source API compatibility depending on the upper level API. The current data source API is also only row oriented and has to go through an expensive external data type conversion to internal data type. > Data source API v2 > -- > > Key: SPARK-15689 > URL: https://issues.apache.org/jira/browse/SPARK-15689 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > This ticket tracks progress in creating the v2 of data source API. This new > API should focus on: > 1. Have a small surface so it is easy to freeze and maintain compatibility > for a long time. Ideally, this API should survive architectural rewrites and > user-facing API revamps of Spark. > 2. Have a well-defined column batch interface for high performance. > Convenience methods should exist to convert row-oriented formats into column > batches for data source developers. > 3. Still support filter push down, similar to the existing API. > 4. Support sampling. > Note that both 1 and 2 are problems that the current data source API (v1) > suffers. The current data source API has a wide surface with dependency on > DataFrame/SQLContext, making the data source API compatibility depending on > the upper level API. The current data source API is also only row oriented > and has to go through an expensive external data type conversion to internal > data type. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
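To make the listed goals concrete, a purely hypothetical sketch of a small-surface, column-batch-oriented reader interface is shown below; these traits are only an illustration of goals 1-4, not the interfaces Spark has adopted.

{code}
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

// Goal 2: a column batch abstraction; the concrete vector type is left open.
trait ColumnBatch {
  def numRows: Int
  def column(i: Int): AnyRef
}

// One partition's worth of work, with no dependency on DataFrame/SQLContext (goal 1).
trait ReadTask {
  def next(): Boolean
  def get(): ColumnBatch
}

trait DataSourceReader {
  def schema: StructType
  // Goal 3: the source absorbs what it can and returns the leftover filters.
  def pushFilters(filters: Array[Filter]): Array[Filter]
  // Goal 4 (nice-to-have): push a row-count limit down to the source.
  def pushLimit(limit: Int): Unit
  def planReadTasks(): Seq[ReadTask]
}
{code}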
[jira] [Commented] (SPARK-17911) Scheduler does not need messageScheduler for ResubmitFailedStages
[ https://issues.apache.org/jira/browse/SPARK-17911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583334#comment-15583334 ] Mark Hamstra commented on SPARK-17911: -- I think we're pretty much on the same page when it comes to the net effects of just eliminating the RESUBMIT_TIMEOUT delay. I need to find some time to think about what something better than the current delayed-resubmit-event approach would look like. > Scheduler does not need messageScheduler for ResubmitFailedStages > - > > Key: SPARK-17911 > URL: https://issues.apache.org/jira/browse/SPARK-17911 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.0.0 >Reporter: Imran Rashid > > Its not totally clear what the purpose of the {{messageScheduler}} is in > {{DAGScheduler}}. It can perhaps be eliminated completely; or perhaps we > should just clearly document its purpose. > This comes from a long discussion w/ [~markhamstra] on an unrelated PR here: > https://github.com/apache/spark/pull/15335/files/c80ad22a242255cac91cce2c7c537f9b21100f70#diff-6a9ff7fb74fd490a50462d45db2d5e11 > But its tricky so breaking it out here for archiving the discussion. > Note: this issue requires a decision on what to do before a code change, so > lets just discuss it on jira first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN
[ https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Stein updated SPARK-17975: --- Description: I'm able to reproduce the error consistently with a 2000 record text file with each record having 1-5 terms and checkpointing enabled. It looks like the problem was introduced with the resolution for SPARK-13355. The EdgeRDD class seems to be lying about it's type in a way that causes RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an RDD of Edge elements. {code} val spark = SparkSession.builder.appName("lda").getOrCreate() spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints") val data: RDD[(Long, Vector)] = // snip data.setName("data").cache() val lda = new LDA val optimizer = new EMLDAOptimizer lda.setOptimizer(optimizer) .setK(10) .setMaxIterations(400) .setAlpha(-1) .setBeta(-1) .setCheckpointInterval(7) val ldaModel = lda.run(data) {code} {noformat} 16/10/16 23:53:54 WARN TaskSetManager: Lost task 3.0 in stage 348.0 (TID 1225, server2.domain): java.lang.ClassCastException: scala.Tuple2 cannot be cast to org.apache.spark.graphx.Edge at org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107) at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332) at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722) {noformat} was: I'm able to reproduce the error consistently with a 2000 record text file with each record having 1-5 terms and checkpointing enabled. It looks like the problem was introduced with the resolution for SPARK-13355. The EdgeRDD class seems to be lying about it's type in a way that causes RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an RDD of Edge elements. {code} val spark = SparkSession.builder.appName("lda").getOrCreate() spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints") val data: RDD[(Long, Vector)] = // snip data.setName("data").cache() val lda = new LDA val optimizer = new EMLDAOptimizer lda.setOptimizer(optimizer) .setK(10) .setMaxIterations(400) .setAlpha(-1) .setBeta(-1) .setCheckpointInterval(7) val ldaModel = lda.run(data) {code} > EMLDAOptimizer
[jira] [Commented] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN
[ https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583284#comment-15583284 ] Jeff Stein commented on SPARK-17975: Another issue that seems to be related to EdgeRDD partition problems. > EMLDAOptimizer fails with ClassCastException on YARN > > > Key: SPARK-17975 > URL: https://issues.apache.org/jira/browse/SPARK-17975 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.1 > Environment: Centos 6, CDH 5.7, Java 1.7u80 >Reporter: Jeff Stein > > I'm able to reproduce the error consistently with a 2000 record text file > with each record having 1-5 terms and checkpointing enabled. It looks like > the problem was introduced with the resolution for SPARK-13355. > The EdgeRDD class seems to be lying about it's type in a way that causes > RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an > RDD of Edge elements. > {code} > val spark = SparkSession.builder.appName("lda").getOrCreate() > spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints") > val data: RDD[(Long, Vector)] = // snip > data.setName("data").cache() > val lda = new LDA > val optimizer = new EMLDAOptimizer > lda.setOptimizer(optimizer) > .setK(10) > .setMaxIterations(400) > .setAlpha(-1) > .setBeta(-1) > .setCheckpointInterval(7) > val ldaModel = lda.run(data) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN
[ https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583284#comment-15583284 ] Jeff Stein edited comment on SPARK-17975 at 10/17/16 8:04 PM: -- Adding a link to another issue that seems to be related to EdgeRDD partition problems. was (Author: jvstein): Another issue that seems to be related to EdgeRDD partition problems. > EMLDAOptimizer fails with ClassCastException on YARN > > > Key: SPARK-17975 > URL: https://issues.apache.org/jira/browse/SPARK-17975 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.1 > Environment: Centos 6, CDH 5.7, Java 1.7u80 >Reporter: Jeff Stein > > I'm able to reproduce the error consistently with a 2000 record text file > with each record having 1-5 terms and checkpointing enabled. It looks like > the problem was introduced with the resolution for SPARK-13355. > The EdgeRDD class seems to be lying about it's type in a way that causes > RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an > RDD of Edge elements. > {code} > val spark = SparkSession.builder.appName("lda").getOrCreate() > spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints") > val data: RDD[(Long, Vector)] = // snip > data.setName("data").cache() > val lda = new LDA > val optimizer = new EMLDAOptimizer > lda.setOptimizer(optimizer) > .setK(10) > .setMaxIterations(400) > .setAlpha(-1) > .setBeta(-1) > .setCheckpointInterval(7) > val ldaModel = lda.run(data) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17971) Unix timestamp handling in Spark SQL not allowing calculations on UTC times
[ https://issues.apache.org/jira/browse/SPARK-17971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583274#comment-15583274 ] Gabriele Del Prete commented on SPARK-17971: Already tried, and I could not make it work. from_utc_timestamp can't accept a bigint column as input, only a timestamp column, and if I cast my bigint column to timestamp, the returned timestamp is shifted into the local node's timezone. unix time 1476354405 is ~ 2016-10-13 at *10*:26 UTC *select hour(from_utc_timestamp(cast(1476354405 as timestamp), "UTC"));* when run on our servers (set to UTC) returns *10*, when run on my personal dev machine (set to US/Eastern) returns *6*. > Unix timestamp handling in Spark SQL not allowing calculations on UTC times > --- > > Key: SPARK-17971 > URL: https://issues.apache.org/jira/browse/SPARK-17971 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.6.2 > Environment: MacOS X JDK 7 >Reporter: Gabriele Del Prete > > In our Spark data pipeline we store timed events using a bigint column called > 'timestamp', the values contained being Unix timestamp time points. > Our datacenter servers Java VMs are all set up to start with timezone set to > UTC, while developers' computers are all in the US Eastern timezone. > Given how Spark SQL datetime functions work, it's impossible to do > calculations (e.g. extract and compare hours, year-month-date triplets) using > UTC values: > - from_unixtime takes a bigint unix timestamp and forces it to the computer's > local timezone; > - casting the bigint column to timestamp does the same (it converts it to the > local timezone); > - from_utc_timestamp works in the same way, the only difference being that it > gets a string as input instead of a bigint. > The result of all of this is that it's impossible to extract individual > fields of a UTC timestamp, since all timestamps always get converted to the > local timezone. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
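One timezone-independent workaround (a sketch, assuming a DataFrame {{df}} with a bigint column {{ts}}) is to derive sub-day UTC fields with integer arithmetic on the epoch seconds, which never consults the JVM's default timezone:

{code}
import org.apache.spark.sql.functions._

// 1476354405 % 86400 = 37605 seconds into the UTC day -> hour 10, minute 26,
// regardless of the machine's timezone.
val withUtcFields = df
  .withColumn("utc_hour",   floor((col("ts") % 86400L) / 3600L))
  .withColumn("utc_minute", floor((col("ts") % 3600L) / 60L))
{code}

Calendar fields (year, month, day) still need a proper calendar-aware conversion; the arithmetic trick only covers fields below the day boundary.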
[jira] [Created] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN
Jeff Stein created SPARK-17975: -- Summary: EMLDAOptimizer fails with ClassCastException on YARN Key: SPARK-17975 URL: https://issues.apache.org/jira/browse/SPARK-17975 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 2.0.1 Environment: Centos 6, CDH 5.7, Java 1.7u80 Reporter: Jeff Stein I'm able to reproduce the error consistently with a 2000-record text file with each record having 1-5 terms and checkpointing enabled. It looks like the problem was introduced with the resolution for SPARK-13355. The EdgeRDD class seems to be lying about its type in a way that causes the RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an RDD of Edge elements. {code} val spark = SparkSession.builder.appName("lda").getOrCreate() spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints") val data: RDD[(Long, Vector)] = // snip data.setName("data").cache() val lda = new LDA val optimizer = new EMLDAOptimizer lda.setOptimizer(optimizer) .setK(10) .setMaxIterations(400) .setAlpha(-1) .setBeta(-1) .setCheckpointInterval(7) val ldaModel = lda.run(data) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python
[ https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583180#comment-15583180 ] Tobi Bosede commented on SPARK-10915: - Thanks Davies. Someone also mentioned collect on the mailing list. I think I will use pandas' pivot for now rather than collect and create a UDF. (Hopefully I have enough memory). So how are the current (built in) aggregate functions being implemented? They are batch right? > Add support for UDAFs in Python > --- > > Key: SPARK-10915 > URL: https://issues.apache.org/jira/browse/SPARK-10915 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Justin Uang > > This should support python defined lambdas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17971) Unix timestamp handling in Spark SQL not allowing calculations on UTC times
[ https://issues.apache.org/jira/browse/SPARK-17971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583176#comment-15583176 ] Sean Owen commented on SPARK-17971: --- Oops I copied the wrong link. I mean : https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#from_utc_timestamp(org.apache.spark.sql.Column,%20java.lang.String) A UNIX timestamp defines the same point in time and does not depend on a timezone to interpret it. I think we are clear on that and it isn't the point. You just need the methods that don't use system tz. > Unix timestamp handling in Spark SQL not allowing calculations on UTC times > --- > > Key: SPARK-17971 > URL: https://issues.apache.org/jira/browse/SPARK-17971 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.6.2 > Environment: MacOS X JDK 7 >Reporter: Gabriele Del Prete > > In our Spark data pipeline we store timed events using a bigint column called > 'timestamp', the values contained being Unix timestamp time points. > Our datacenter servers Java VMs are all set up to start with timezone set to > UTC, while developer's computers are all in the US Eastern timezone. > Given how Spark SQL datetime functions work, it's impossible to do > calculations (eg. extract and compare hours, year-month-date triplets) using > UTC values: > - from_unixtime takes a bigint unix timestamp and forces it to the computer's > local timezone; > - casting the bigint column to timestamp does the same (it converts it to the > local timezone) > - from_utc_timestamp works in the same way, the only difference being that it > gets a string as input instead of a bigint. > The result of all of this is that it's impossible to extract individual > fields of a UTC timestamp, since all timestamp always get converted to the > local timezone. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17974) Refactor FileCatalog classes to simplify the inheritance tree
[ https://issues.apache.org/jira/browse/SPARK-17974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583170#comment-15583170 ] Apache Spark commented on SPARK-17974: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/15518 > Refactor FileCatalog classes to simplify the inheritance tree > - > > Key: SPARK-17974 > URL: https://issues.apache.org/jira/browse/SPARK-17974 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Minor > > This is a follow-up item for https://github.com/apache/spark/pull/14690 which > adds support for metastore partition pruning of converted hive tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17974) Refactor FileCatalog classes to simplify the inheritance tree
[ https://issues.apache.org/jira/browse/SPARK-17974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17974: Assignee: Apache Spark > Refactor FileCatalog classes to simplify the inheritance tree > - > > Key: SPARK-17974 > URL: https://issues.apache.org/jira/browse/SPARK-17974 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Assignee: Apache Spark >Priority: Minor > > This is a follow-up item for https://github.com/apache/spark/pull/14690 which > adds support for metastore partition pruning of converted hive tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org