[jira] [Resolved] (SPARK-16559) Got java.lang.ArithmeticException when Num of Buckets is Set to Zero
[ https://issues.apache.org/jira/browse/SPARK-16559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-16559. - Resolution: Fixed > Got java.lang.ArithmeticException when Num of Buckets is Set to Zero > > > Key: SPARK-16559 > URL: https://issues.apache.org/jira/browse/SPARK-16559 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Minor > > Got a run-time java.lang.ArithmeticException when num of buckets is set to > zero. > For example, > {noformat} > CREATE TABLE t USING PARQUET > OPTIONS (PATH '${path.toString}') > CLUSTERED BY (a) SORTED BY (b) INTO 0 BUCKETS > AS SELECT 1 AS a, 2 AS b > {noformat} > The exception we got is > {noformat} > ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 1.0 > (TID 2) > java.lang.ArithmeticException: / by zero > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
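The failure mode above is easy to reproduce outside Spark: bucket assignment is, in essence, a hash of the clustering key modulo the bucket count, so a bucket count of zero forces a division by zero at run time. A minimal sketch in Python (the `bucket_id` helper is hypothetical, not Spark's actual bucketing expression):

```python
def bucket_id(key_hash: int, num_buckets: int) -> int:
    # Hypothetical, simplified bucket assignment: a row lands in the
    # bucket given by its clustering-key hash modulo the bucket count.
    return key_hash % num_buckets

print(bucket_id(7, 4))  # a normal bucket count works fine: prints 3

try:
    bucket_id(7, 0)  # INTO 0 BUCKETS: modulo by zero
except ZeroDivisionError:
    # Python's analogue of the reported java.lang.ArithmeticException: / by zero
    print("division by zero")
```

The fix resolved here is to reject a bucket count of zero at analysis time rather than letting executors hit the division.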
[jira] [Updated] (SPARK-18003) RDD zipWithIndex generate wrong result when one partition contains more than 2147483647 records.
[ https://issues.apache.org/jira/browse/SPARK-18003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-18003: --- Description: RDD zipWithIndex generates wrong results when one partition contains more than Int.MaxValue records. When an RDD contains a partition with more than 2147483647 records, an error occurs: for example, if partition-0 has more than 2147483647 records, the indices become 0, 1, ..., 2147483647, -2147483648, -2147483647, -2147483646. Operations such as repartition or coalesce can produce such large partitions, so this bug should be fixed. was: RDD zipWithIndex generate wrong result when one partition contains more than Int.MaxValue records. when RDD contains a partition with more than 2147483647 records, error occurs. for example, if partition-0 has more than 2147483647 records, the index became: 0,1, ..., 2147483647, -2147483648, -2147483647, -2147483646 > RDD zipWithIndex generate wrong result when one partition contains more than > 2147483647 records. > > > Key: SPARK-18003 > URL: https://issues.apache.org/jira/browse/SPARK-18003 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > RDD zipWithIndex generates wrong results when one partition contains more than > Int.MaxValue records. When an RDD contains a partition with more than 2147483647 records, an > error occurs: for example, if partition-0 has more than 2147483647 records, the indices > become 0, 1, ..., 2147483647, -2147483648, -2147483647, -2147483646. > Operations such as repartition or coalesce can produce such large partitions, so this bug > should be fixed.
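The wrap-around described in this report is 32-bit two's-complement overflow: the per-partition index counter is a Java Int, so after reaching Int.MaxValue it wraps to -2147483648. A sketch simulating the Java Int arithmetic in Python (the `to_int32` helper is illustrative, not Spark code):

```python
INT_MAX = 2**31 - 1

def to_int32(x: int) -> int:
    # Simulate Java's 32-bit signed Int overflow (two's complement):
    # mask to 32 bits, then reinterpret values above Int.MaxValue as negative.
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x > INT_MAX else x

# Indices near Int.MaxValue wrap to negative values, matching the
# sequence in the report: ..., 2147483647, -2147483648, -2147483647, ...
start = 2147483646
print([to_int32(start + k) for k in range(4)])
# [2147483646, 2147483647, -2147483648, -2147483647]
```

The straightforward fix is to carry the per-partition index in a Long, which zipWithIndex already exposes in its result type.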
[jira] [Commented] (SPARK-16559) Got java.lang.ArithmeticException when Num of Buckets is Set to Zero
[ https://issues.apache.org/jira/browse/SPARK-16559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587713#comment-15587713 ] Xiao Li commented on SPARK-16559: - Yeah, resolved. > Got java.lang.ArithmeticException when Num of Buckets is Set to Zero > > > Key: SPARK-16559 > URL: https://issues.apache.org/jira/browse/SPARK-16559 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Priority: Minor > > Got a run-time java.lang.ArithmeticException when num of buckets is set to > zero. > For example, > {noformat} > CREATE TABLE t USING PARQUET > OPTIONS (PATH '${path.toString}') > CLUSTERED BY (a) SORTED BY (b) INTO 0 BUCKETS > AS SELECT 1 AS a, 2 AS b > {noformat} > The exception we got is > {noformat} > ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 1.0 > (TID 2) > java.lang.ArithmeticException: / by zero > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16559) Got java.lang.ArithmeticException when Num of Buckets is Set to Zero
[ https://issues.apache.org/jira/browse/SPARK-16559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-16559: --- Assignee: Xiao Li > Got java.lang.ArithmeticException when Num of Buckets is Set to Zero > > > Key: SPARK-16559 > URL: https://issues.apache.org/jira/browse/SPARK-16559 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Minor > > Got a run-time java.lang.ArithmeticException when num of buckets is set to > zero. > For example, > {noformat} > CREATE TABLE t USING PARQUET > OPTIONS (PATH '${path.toString}') > CLUSTERED BY (a) SORTED BY (b) INTO 0 BUCKETS > AS SELECT 1 AS a, 2 AS b > {noformat} > The exception we got is > {noformat} > ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 1.0 > (TID 2) > java.lang.ArithmeticException: / by zero > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16559) Got java.lang.ArithmeticException when Num of Buckets is Set to Zero
[ https://issues.apache.org/jira/browse/SPARK-16559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587711#comment-15587711 ] Dongjoon Hyun commented on SPARK-16559: --- Hi, [~smilegator]. It seems to have been resolved by your PR, hasn't it? > Got java.lang.ArithmeticException when Num of Buckets is Set to Zero > > > Key: SPARK-16559 > URL: https://issues.apache.org/jira/browse/SPARK-16559 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Priority: Minor > > Got a run-time java.lang.ArithmeticException when num of buckets is set to > zero. > For example, > {noformat} > CREATE TABLE t USING PARQUET > OPTIONS (PATH '${path.toString}') > CLUSTERED BY (a) SORTED BY (b) INTO 0 BUCKETS > AS SELECT 1 AS a, 2 AS b > {noformat} > The exception we got is > {noformat} > ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 1.0 > (TID 2) > java.lang.ArithmeticException: / by zero > {noformat}
[jira] [Assigned] (SPARK-18003) RDD zipWithIndex generate wrong result when one partition contains more than 2147483647 records.
[ https://issues.apache.org/jira/browse/SPARK-18003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18003: Assignee: Apache Spark > RDD zipWithIndex generate wrong result when one partition contains more than > 2147483647 records. > > > Key: SPARK-18003 > URL: https://issues.apache.org/jira/browse/SPARK-18003 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Weichen Xu >Assignee: Apache Spark > Original Estimate: 24h > Remaining Estimate: 24h > > RDD zipWithIndex generate wrong result when one partition contains more than > Int.MaxValue records. > when RDD contains a partition with more than 2147483647 records, > error occurs. > for example, if partition-0 has more than 2147483647 records, the index > became: > 0,1, ..., 2147483647, -2147483648, -2147483647, -2147483646 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18003) RDD zipWithIndex generate wrong result when one partition contains more than 2147483647 records.
[ https://issues.apache.org/jira/browse/SPARK-18003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18003: Assignee: (was: Apache Spark) > RDD zipWithIndex generate wrong result when one partition contains more than > 2147483647 records. > > > Key: SPARK-18003 > URL: https://issues.apache.org/jira/browse/SPARK-18003 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > RDD zipWithIndex generate wrong result when one partition contains more than > Int.MaxValue records. > when RDD contains a partition with more than 2147483647 records, > error occurs. > for example, if partition-0 has more than 2147483647 records, the index > became: > 0,1, ..., 2147483647, -2147483648, -2147483647, -2147483646 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18003) RDD zipWithIndex generate wrong result when one partition contains more than 2147483647 records.
[ https://issues.apache.org/jira/browse/SPARK-18003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587706#comment-15587706 ] Apache Spark commented on SPARK-18003: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/15550 > RDD zipWithIndex generate wrong result when one partition contains more than > 2147483647 records. > > > Key: SPARK-18003 > URL: https://issues.apache.org/jira/browse/SPARK-18003 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > RDD zipWithIndex generate wrong result when one partition contains more than > Int.MaxValue records. > when RDD contains a partition with more than 2147483647 records, > error occurs. > for example, if partition-0 has more than 2147483647 records, the index > became: > 0,1, ..., 2147483647, -2147483648, -2147483647, -2147483646 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18003) RDD zipWithIndex generate wrong result when one partition contains more than 2147483647 records.
[ https://issues.apache.org/jira/browse/SPARK-18003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-18003: --- Component/s: Spark Core > RDD zipWithIndex generate wrong result when one partition contains more than > 2147483647 records. > > > Key: SPARK-18003 > URL: https://issues.apache.org/jira/browse/SPARK-18003 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > RDD zipWithIndex generate wrong result when one partition contains more than > Int.MaxValue records. > when RDD contains a partition with more than 2147483647 records, > error occurs. > for example, if partition-0 has more than 2147483647 records, the index > became: > 0,1, ..., 2147483647, -2147483648, -2147483647, -2147483646 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18003) RDD zipWithIndex generate wrong result when one partition contains more than 2147483647 records.
Weichen Xu created SPARK-18003: -- Summary: RDD zipWithIndex generate wrong result when one partition contains more than 2147483647 records. Key: SPARK-18003 URL: https://issues.apache.org/jira/browse/SPARK-18003 Project: Spark Issue Type: Bug Reporter: Weichen Xu RDD zipWithIndex generate wrong result when one partition contains more than Int.MaxValue records. when RDD contains a partition with more than 2147483647 records, error occurs. for example, if partition-0 has more than 2147483647 records, the index became: 0,1, ..., 2147483647, -2147483648, -2147483647, -2147483646 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17630) jvm-exit-on-fatal-error handler for spark.rpc.netty like there is available for akka
[ https://issues.apache.org/jira/browse/SPARK-17630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587664#comment-15587664 ] Mario Briggs commented on SPARK-17630: -- [~zsxwing] thanks very much. Any pointers on how/where to add the code, or something existing in the code base to look at? I can then try a PR. > jvm-exit-on-fatal-error handler for spark.rpc.netty like there is available > for akka > > > Key: SPARK-17630 > URL: https://issues.apache.org/jira/browse/SPARK-17630 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Mario Briggs > Attachments: SecondCodePath.txt, firstCodepath.txt > > > Hi, > I have 2 code-paths from my app that result in a JVM OOM. > In the first code path, 'akka.jvm-exit-on-fatal-error' kicks in and shuts > down the JVM, so that the caller (py4J) gets notified with a proper stack trace. > Attached stack-trace file (firstCodepath.txt) > In the 2nd code path (rpc.netty), no such handler kicks in to shut down the > JVM, so the caller does not get notified. > Attached stack-trace file (SecondCodepath.txt) > Is it possible to have a JVM exit handler for the rpc.netty path? >
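The behaviour being asked for is a process-wide uncaught-error hook that halts the process on unrecoverable errors, so the caller sees a clean failure instead of a hung JVM. A language-neutral sketch of the decision logic in Python (names are illustrative; this is not Spark's actual handler, and MemoryError stands in for java.lang.OutOfMemoryError):

```python
import os
import threading

def should_halt(exc_type: type) -> bool:
    # Halt only on unrecoverable errors; recoverable exceptions are left
    # to normal error handling. MemoryError is the Python stand-in for
    # java.lang.OutOfMemoryError in this sketch.
    return issubclass(exc_type, MemoryError)

def fatal_error_hook(args) -> None:
    # Installed as threading.excepthook: if any thread dies with a fatal
    # error, terminate the whole process so the caller is notified.
    if should_halt(args.exc_type):
        os._exit(1)

threading.excepthook = fatal_error_hook
```

In Spark the analogous place would be a JVM-level uncaught exception handler installed by the RPC layer; where exactly to hook it in is the question raised in this comment.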
[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587659#comment-15587659 ] Jiang Xingbo commented on SPARK-17982: -- You are right, just made a mistake. Sorry for that! > Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: > Failed to analyze the canonicalized SQL. It is possible there is a bug in > Spark. > > > Key: SPARK-17982 > URL: https://issues.apache.org/jira/browse/SPARK-17982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark 2.0.0 >Reporter: Franck Tago > > The following statement fails in the spark shell . > {noformat} > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT > `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT > `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT > `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS > `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS > gen_subquery_1 > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) > {noformat} > This appears to be a limitation of the create view statement.
[jira] [Comment Edited] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587627#comment-15587627 ] Saikat Kanjilal edited comment on SPARK-9487 at 10/19/16 4:24 AM: -- [~holdenk] finally getting time to look at this, so I am starting small: I made the change inside ContextCleanerSuite and HeartbeatReceiverSuite from local[2] to local[4], per the documentation here (http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version). I ran mvn -Phadoop2 -Dsuites=org.apache.spark.HeartbeatReceiverSuite test -- looks like everything worked. I then ran mvn -Phadoop2 -Dsuites=org.apache.spark.ContextCleanerSuite test -- looks like everything worked as well. See the attachments and let me know if this is the right process to run single unit tests; I'll then start making changes to the other suites. How would you like to see the output: should I just add attachments, or do a pull request from the new branch that I created? Thanks. PS Another question: running single unit tests like this takes forever; are there flags I can set to speed up the builds? Even on my 15-inch MacBook Pro with SSD the builds shouldn't take this long :(. Let me know next steps to get this into a PR.
was (Author: kanjilal): [~holdenk] finally getting time to look at this, so I am starting small, I made the change inside ContextCleanerSuite and HeartbeatReceiverSuite from local[2] tp local[4], per the documentation here (http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version) I ran mvn -P hadoop2 -Dsuites= org.apache.spark.HeartbeatReceiverSuite test--- looks like everything worked I then ran mvn -P hadoop2 -Dsuites= org.apache.spark.ContextCleanerSuite test-- looks like everything worked as well See the attachments and let me know if this is this is the right process to run single unit tests, if not I'll start making changes to the other Suites , how would you like to see the output, should I just have attachments or just do a pull request from the new branch that I created? Thanks PS Another question, running single unit tests like this takes forever, are there flags I can set to speed up the builds, even on my 15 inch macbook pro with SSD the builds shouldnt take this long :(. Let me know next steps to get this into a PR. > Use the same num. worker threads in Scala/Python unit tests > --- > > Key: SPARK-9487 > URL: https://issues.apache.org/jira/browse/SPARK-9487 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL, Tests >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng > Labels: starter > Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults > > > In Python we use `local[4]` for unit tests, while in Scala/Java we use > `local[2]` and `local` for some unit tests in SQL, MLLib, and other > components. If the operation depends on partition IDs, e.g., random number > generator, this will lead to different result in Python and Scala/Java. It > would be nice to use the same number in all unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587627#comment-15587627 ] Saikat Kanjilal edited comment on SPARK-9487 at 10/19/16 4:24 AM: -- [~holdenk] finally getting time to look at this, so I am starting small, I made the change inside ContextCleanerSuite and HeartbeatReceiverSuite from local[2] tp local[4], per the documentation here (http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version) I ran mvn -P hadoop2 -Dsuites= org.apache.spark.HeartbeatReceiverSuite test--- looks like everything worked I then ran mvn -P hadoop2 -Dsuites= org.apache.spark.ContextCleanerSuite test-- looks like everything worked as well See the attachments and let me know if this is this is the right process to run single unit tests, if not I'll start making changes to the other Suites , how would you like to see the output, should I just have attachments or just do a pull request from the new branch that I created? Thanks PS Another question, running single unit tests like this takes forever, are there flags I can set to speed up the builds, even on my 15 inch macbook pro with SSD the builds shouldnt take this long :(. Let me know next steps to get this into a PR. 
was (Author: kanjilal): [~holdenk] finally getting time to look at this, so I am starting small, I made the change inside ContextCleanerSuite and HeartbeatReceiverSuite from local[2] tp local[4], per the documentation here (http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version) I ran mvn -Phadoop2 -Dsuites=org.apache.spark.HeartbeatReceiverSuite test--- looks like everything worked I then ran mvn -Phadoop2 -Dsuites=org.apache.spark.ContextCleanerSuite test-- looks like everything worked as well See the attachments and let me know if this is not the right process to run single unit tests, if not I'll start making changes to the other Suites , how would you like to see the output, should I just have attachments or just do a pull request from the new branch that I created? Thanks PS Another question, running single unit tests like this takes forever, are there flags I can set to speed up the builds, even on my 15 inch macbook pro with SSD the builds shouldnt take this long :(. Let me know next steps to get this into a PR. > Use the same num. worker threads in Scala/Python unit tests > --- > > Key: SPARK-9487 > URL: https://issues.apache.org/jira/browse/SPARK-9487 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL, Tests >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng > Labels: starter > Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults > > > In Python we use `local[4]` for unit tests, while in Scala/Java we use > `local[2]` and `local` for some unit tests in SQL, MLLib, and other > components. If the operation depends on partition IDs, e.g., random number > generator, this will lead to different result in Python and Scala/Java. It > would be nice to use the same number in all unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587627#comment-15587627 ] Saikat Kanjilal commented on SPARK-9487: [~holdenk] finally getting time to look at this, so I am starting small: I made the change inside ContextCleanerSuite and HeartbeatReceiverSuite from local[2] to local[4], per the documentation here (http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version). I ran mvn -Phadoop2 -Dsuites=org.apache.spark.HeartbeatReceiverSuite test -- looks like everything worked. I then ran mvn -Phadoop2 -Dsuites=org.apache.spark.ContextCleanerSuite test -- looks like everything worked as well. See the attachments and let me know if this is not the right process to run single unit tests; if it is, I'll start making changes to the other suites. How would you like to see the output: should I just add attachments, or do a pull request from the new branch that I created? Thanks. PS Another question: running single unit tests like this takes forever; are there flags I can set to speed up the builds? Even on my 15-inch MacBook Pro with SSD the builds shouldn't take this long :(. Let me know next steps to get this into a PR. > Use the same num. worker threads in Scala/Python unit tests > --- > > Key: SPARK-9487 > URL: https://issues.apache.org/jira/browse/SPARK-9487 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL, Tests >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng > Labels: starter > Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults > > > In Python we use `local[4]` for unit tests, while in Scala/Java we use > `local[2]` and `local` for some unit tests in SQL, MLLib, and other > components. If the operation depends on partition IDs, e.g., random number > generator, this will lead to different result in Python and Scala/Java. It > would be nice to use the same number in all unit tests. 
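The motivation for SPARK-9487 above, that partition-ID-dependent operations give different results under local[2] versus local[4], can be sketched as follows (the partition-seeded RNG below is an illustration, not Spark's implementation):

```python
import random

def per_partition_noise(data, num_partitions):
    # Split data into contiguous partitions and seed an RNG with the
    # partition ID, mimicking seed-by-partition-id operations.
    size = -(-len(data) // num_partitions)  # ceiling division
    out = []
    for pid in range(num_partitions):
        rng = random.Random(pid)
        for x in data[pid * size:(pid + 1) * size]:
            out.append(x + rng.random())
    return out

data = list(range(8))
# Same data, same per-partition seeds, but splitting into 2 vs 4
# partitions distributes the random draws differently, so the
# overall results diverge.
print(per_partition_noise(data, 2) != per_partition_noise(data, 4))  # True
```

This is why the ticket asks for the same worker-thread count across the Scala/Java and Python test suites: with differing defaults, an operation that is deterministic per partition ID can still produce language-dependent results.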
[jira] [Commented] (SPARK-17985) Bump commons-lang3 version to 3.5.
[ https://issues.apache.org/jira/browse/SPARK-17985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587625#comment-15587625 ] Apache Spark commented on SPARK-17985: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/15548 > Bump commons-lang3 version to 3.5. > -- > > Key: SPARK-17985 > URL: https://issues.apache.org/jira/browse/SPARK-17985 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > Fix For: 2.1.0 > > > {{SerializationUtils.clone()}} of commons-lang3 (<3.5) has a bug that break > thread safety, which gets stack sometimes caused by race condition of > initializing hash map. > See https://issues.apache.org/jira/browse/LANG-1251. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
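The LANG-1251 bug referenced above is a classic race in lazily initializing a shared map. A generic sketch of the bug class and its standard fix, double-checked locking, in Python (the `Registry` class is illustrative, not commons-lang3 code):

```python
import threading

class Registry:
    # Lazily builds a shared cache. Without the lock, two threads could
    # both observe _cache as None and race to initialize it -- the class
    # of bug fixed by the commons-lang3 3.5 bump (see LANG-1251).
    def __init__(self):
        self._cache = None
        self._lock = threading.Lock()

    def get_cache(self):
        if self._cache is None:          # fast path, no lock taken
            with self._lock:
                if self._cache is None:  # re-check under the lock
                    self._cache = {}
        return self._cache
```

Upgrading the dependency, as this PR does, is the right fix; the sketch only shows why the pre-3.5 behaviour could make a clone call hang under concurrency.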
[jira] [Commented] (SPARK-18002) Prune unnecessary IsNotNull predicates from Filter
[ https://issues.apache.org/jira/browse/SPARK-18002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587624#comment-15587624 ] Apache Spark commented on SPARK-18002: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/15547 > Prune unnecessary IsNotNull predicates from Filter > -- > > Key: SPARK-18002 > URL: https://issues.apache.org/jira/browse/SPARK-18002 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > In PruneFilters rule, we can prune unnecessary IsNotNull predicates if the > predicate in IsNotNull is not nullable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
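The optimization proposed above can be sketched as a tiny rewrite over a list of predicates (the tuple representation below is hypothetical, not Catalyst's):

```python
def prune_is_not_null(predicates, nullable):
    # predicates: tuples like ("IsNotNull", col) or other predicate shapes;
    # nullable: map from column name to whether it may contain nulls.
    # An IsNotNull check on a non-nullable column is always true, so it
    # can be dropped from the Filter without changing results.
    kept = []
    for pred in predicates:
        if pred[0] == "IsNotNull" and not nullable.get(pred[1], True):
            continue  # provably true, prune it
        kept.append(pred)
    return kept

filters = [("IsNotNull", "a"), ("GreaterThan", "b", 0)]
print(prune_is_not_null(filters, {"a": False, "b": True}))
# [('GreaterThan', 'b', 0)]
```

Note the conservative default: a column with unknown nullability is treated as nullable, so its IsNotNull predicate is kept.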
[jira] [Updated] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saikat Kanjilal updated SPARK-9487: --- Attachment: ContextCleanerSuiteResults > Use the same num. worker threads in Scala/Python unit tests > --- > > Key: SPARK-9487 > URL: https://issues.apache.org/jira/browse/SPARK-9487 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL, Tests >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng > Labels: starter > Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults > > > In Python we use `local[4]` for unit tests, while in Scala/Java we use > `local[2]` and `local` for some unit tests in SQL, MLLib, and other > components. If the operation depends on partition IDs, e.g., random number > generator, this will lead to different result in Python and Scala/Java. It > would be nice to use the same number in all unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18002) Prune unnecessary IsNotNull predicates from Filter
[ https://issues.apache.org/jira/browse/SPARK-18002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18002: Assignee: Apache Spark > Prune unnecessary IsNotNull predicates from Filter > -- > > Key: SPARK-18002 > URL: https://issues.apache.org/jira/browse/SPARK-18002 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark > > In PruneFilters rule, we can prune unnecessary IsNotNull predicates if the > predicate in IsNotNull is not nullable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18002) Prune unnecessary IsNotNull predicates from Filter
[ https://issues.apache.org/jira/browse/SPARK-18002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18002: Assignee: (was: Apache Spark) > Prune unnecessary IsNotNull predicates from Filter > -- > > Key: SPARK-18002 > URL: https://issues.apache.org/jira/browse/SPARK-18002 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > In PruneFilters rule, we can prune unnecessary IsNotNull predicates if the > predicate in IsNotNull is not nullable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saikat Kanjilal updated SPARK-9487: --- Attachment: HeartbeatReceiverSuiteResults > Use the same num. worker threads in Scala/Python unit tests > --- > > Key: SPARK-9487 > URL: https://issues.apache.org/jira/browse/SPARK-9487 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL, Tests >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng > Labels: starter > Attachments: HeartbeatReceiverSuiteResults > > > In Python we use `local[4]` for unit tests, while in Scala/Java we use > `local[2]` and `local` for some unit tests in SQL, MLLib, and other > components. If the operation depends on partition IDs, e.g., random number > generator, this will lead to different result in Python and Scala/Java. It > would be nice to use the same number in all unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18001) Broke link to R DataFrame In sql-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-18001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18001. - Resolution: Fixed Fix Version/s: 2.1.0 2.0.2 > Broke link to R DataFrame In sql-programming-guide > --- > > Key: SPARK-18001 > URL: https://issues.apache.org/jira/browse/SPARK-18001 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.0.1 >Reporter: Tommy Yu >Priority: Trivial > Fix For: 2.0.2, 2.1.0 > > > In http://spark.apache.org/docs/latest/sql-programming-guide.html, Section > "Untyped Dataset Operations (aka DataFrame Operations)" > Link to R doesn't work that return > The requested URL /docs/latest/api/R/DataFrame.html was not found on this > server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18001) Broke link to R DataFrame In sql-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-18001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18001: Assignee: Tommy Yu > Broke link to R DataFrame In sql-programming-guide > --- > > Key: SPARK-18001 > URL: https://issues.apache.org/jira/browse/SPARK-18001 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.0.1 >Reporter: Tommy Yu >Assignee: Tommy Yu >Priority: Trivial > Fix For: 2.0.2, 2.1.0 > > > In http://spark.apache.org/docs/latest/sql-programming-guide.html, Section > "Untyped Dataset Operations (aka DataFrame Operations)" > Link to R doesn't work that return > The requested URL /docs/latest/api/R/DataFrame.html was not found on this > server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18002) Prune unnecessary IsNotNull predicates from Filter
Liang-Chi Hsieh created SPARK-18002: --- Summary: Prune unnecessary IsNotNull predicates from Filter Key: SPARK-18002 URL: https://issues.apache.org/jira/browse/SPARK-18002 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh In the PruneFilters rule, we can prune unnecessary IsNotNull predicates when the child expression of IsNotNull is not nullable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
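The pruning proposed in SPARK-18002 can be mimicked outside of Catalyst in a toy form (plain Python; `Attr`, `IsNotNull`, and `prune_filters` are stand-ins for the Catalyst classes and rule, not their real signatures): any IsNotNull conjunct whose child is known to be non-nullable is redundant and can be dropped from the Filter.

```python
from dataclasses import dataclass

@dataclass
class Attr:
    name: str
    nullable: bool

@dataclass
class IsNotNull:
    child: Attr

def prune_filters(conjuncts):
    """Drop IsNotNull predicates whose child can never be null;
    everything else in the Filter is kept as-is."""
    return [p for p in conjuncts
            if not (isinstance(p, IsNotNull) and not p.child.nullable)]

a = Attr("a", nullable=False)  # e.g. a column known to be NOT NULL
b = Attr("b", nullable=True)
pruned = prune_filters([IsNotNull(a), IsNotNull(b)])
assert pruned == [IsNotNull(b)]  # IsNotNull(a) is always true, so it is pruned
```

The real rule operates on Catalyst expression trees, but the decision criterion is the same: nullability of the IsNotNull child.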
[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587583#comment-15587583 ] Dongjoon Hyun commented on SPARK-17982: --- Hi, [~tafra...@gmail.com]. I made a PR for you. It handles your case, too. {code} scala> sql("create table `where`(id int, name int)") res1: org.apache.spark.sql.DataFrame = [] scala> spark.sql("CREATE VIEW DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") res2: org.apache.spark.sql.DataFrame = [] {code} I just simplify your case because it contains multiple test cases. > Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: > Failed to analyze the canonicalized SQL. It is possible there is a bug in > Spark. > > > Key: SPARK-17982 > URL: https://issues.apache.org/jira/browse/SPARK-17982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark 2.0.0 >Reporter: Franck Tago > > The following statement fails in the spark shell . 
> {noformat} > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT > `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT > `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT > `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS > `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS > gen_subquery_1 > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at org.apache.spark.sql.Dataset.(Dataset.scala:186) 
> at org.apache.spark.sql.Dataset.(Dataset.scala:167) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) > {noformat} > This appears to be a limitation of the create view statement . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587579#comment-15587579 ] Dongjoon Hyun commented on SPARK-17982: --- Hi, [~jiangxb]. I think Spark supports that like MySQL http://dev.mysql.com/doc/refman/5.7/en/create-view.html . {code} scala> sql("SELECT id2 FROM v1") res0: org.apache.spark.sql.DataFrame = [id2: int] {code} > Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: > Failed to analyze the canonicalized SQL. It is possible there is a bug in > Spark. > > > Key: SPARK-17982 > URL: https://issues.apache.org/jira/browse/SPARK-17982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark 2.0.0 >Reporter: Franck Tago > > The following statement fails in the spark shell . > {noformat} > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT > `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT > `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT > `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS > `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS > gen_subquery_1 > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at org.apache.spark.sql.Dataset.(Dataset.scala:186) > at org.apache.spark.sql.Dataset.(Dataset.scala:167) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) > {noformat} > This appears to be a limitation of the create view statement . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17982: Assignee: (was: Apache Spark) > Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: > Failed to analyze the canonicalized SQL. It is possible there is a bug in > Spark. > > > Key: SPARK-17982 > URL: https://issues.apache.org/jira/browse/SPARK-17982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark 2.0.0 >Reporter: Franck Tago > > The following statement fails in the spark shell . > {noformat} > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT > `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT > `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT > `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS > `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS > gen_subquery_1 > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at org.apache.spark.sql.Dataset.(Dataset.scala:186) > at org.apache.spark.sql.Dataset.(Dataset.scala:167) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) > {noformat} > This appears to be a limitation of the create view statement . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587573#comment-15587573 ] Apache Spark commented on SPARK-17982: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/15546 > Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: > Failed to analyze the canonicalized SQL. It is possible there is a bug in > Spark. > > > Key: SPARK-17982 > URL: https://issues.apache.org/jira/browse/SPARK-17982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark 2.0.0 >Reporter: Franck Tago > > The following statement fails in the spark shell . > {noformat} > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT > `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT > `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT > `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS > `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS > gen_subquery_1 > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at org.apache.spark.sql.Dataset.(Dataset.scala:186) > at org.apache.spark.sql.Dataset.(Dataset.scala:167) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) > {noformat} > This appears to be a limitation of the create view statement . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17892) Query in CTAS is Optimized Twice (branch-2.0)
[ https://issues.apache.org/jira/browse/SPARK-17892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587574#comment-15587574 ] Dongjoon Hyun commented on SPARK-17892: --- Sorry. It's my mistake. > Query in CTAS is Optimized Twice (branch-2.0) > - > > Key: SPARK-17892 > URL: https://issues.apache.org/jira/browse/SPARK-17892 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Yin Huai >Assignee: Xiao Li >Priority: Blocker > Fix For: 2.0.2 > > > This tracks the work that fixes the problem shown in > https://issues.apache.org/jira/browse/SPARK-17409 to branch 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17982: Assignee: Apache Spark > Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: > Failed to analyze the canonicalized SQL. It is possible there is a bug in > Spark. > > > Key: SPARK-17982 > URL: https://issues.apache.org/jira/browse/SPARK-17982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark 2.0.0 >Reporter: Franck Tago >Assignee: Apache Spark > > The following statement fails in the spark shell . > {noformat} > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT > `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT > `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT > `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS > `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS > gen_subquery_1 > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at org.apache.spark.sql.Dataset.(Dataset.scala:186) > at org.apache.spark.sql.Dataset.(Dataset.scala:167) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) > {noformat} > This appears to be a limitation of the create view statement . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17892) Query in CTAS is Optimized Twice (branch-2.0)
[ https://issues.apache.org/jira/browse/SPARK-17892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587567#comment-15587567 ] Apache Spark commented on SPARK-17892: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/15546 > Query in CTAS is Optimized Twice (branch-2.0) > - > > Key: SPARK-17892 > URL: https://issues.apache.org/jira/browse/SPARK-17892 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Yin Huai >Assignee: Xiao Li >Priority: Blocker > Fix For: 2.0.2 > > > This tracks the work that fixes the problem shown in > https://issues.apache.org/jira/browse/SPARK-17409 to branch 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17999) Add getPreferredLocations for KafkaSourceRDD
[ https://issues.apache.org/jira/browse/SPARK-17999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17999: Assignee: (was: Apache Spark) > Add getPreferredLocations for KafkaSourceRDD > > > Key: SPARK-17999 > URL: https://issues.apache.org/jira/browse/SPARK-17999 > Project: Spark > Issue Type: Improvement > Components: SQL, Streaming >Reporter: Saisai Shao >Priority: Minor > > The newly implemented Structured Streaming KafkaSource did calculate the > preferred locations for each topic partition, but didn't offer this > information through RDD's {{getPreferredLocations}} method. So here propose > to add this method in {{KafkaSourceRDD}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17999) Add getPreferredLocations for KafkaSourceRDD
[ https://issues.apache.org/jira/browse/SPARK-17999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17999: Assignee: Apache Spark > Add getPreferredLocations for KafkaSourceRDD > > > Key: SPARK-17999 > URL: https://issues.apache.org/jira/browse/SPARK-17999 > Project: Spark > Issue Type: Improvement > Components: SQL, Streaming >Reporter: Saisai Shao >Assignee: Apache Spark >Priority: Minor > > The newly implemented Structured Streaming KafkaSource did calculate the > preferred locations for each topic partition, but didn't offer this > information through RDD's {{getPreferredLocations}} method. So here propose > to add this method in {{KafkaSourceRDD}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
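The idea behind SPARK-17999 can be sketched abstractly (plain Python, not the actual KafkaSourceRDD API; the hash-based assignment below is an assumed strategy, not necessarily what the source uses): the value of exposing `getPreferredLocations` is that each topic partition maps to a stable executor host, so the scheduler keeps placing it where a Kafka consumer is already cached.

```python
def preferred_location(topic, partition, executors):
    """Deterministically map a Kafka topic partition to one executor host.
    Successive micro-batches then schedule the partition on the same host
    and can reuse its cached consumer. Returns None when no executors are
    known, meaning no locality preference."""
    if not executors:
        return None
    # Sort for a stable order, then hash the (topic, partition) pair.
    return sorted(executors)[hash((topic, partition)) % len(executors)]

hosts = ["host1", "host2", "host3"]
loc = preferred_location("events", 0, hosts)
assert loc in hosts
# Stable within a run: the same partition always maps to the same host.
assert loc == preferred_location("events", 0, hosts)
```

Without this hint the scheduler places Kafka partitions arbitrarily, recreating consumers and losing the locality the source already computed.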
[jira] [Assigned] (SPARK-17997) Aggregation function for counting distinct values for multiple intervals
[ https://issues.apache.org/jira/browse/SPARK-17997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17997: Assignee: (was: Apache Spark) > Aggregation function for counting distinct values for multiple intervals > > > Key: SPARK-17997 > URL: https://issues.apache.org/jira/browse/SPARK-17997 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhenhua Wang > > This is for computing ndv's for bins in equi-height histograms. A bin > consists of two endpoints which form an interval of values and the ndv in > that interval. For computing histogram statistics, after getting the > endpoints, we need an agg function to count distinct values in each interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17997) Aggregation function for counting distinct values for multiple intervals
[ https://issues.apache.org/jira/browse/SPARK-17997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17997: Assignee: Apache Spark > Aggregation function for counting distinct values for multiple intervals > > > Key: SPARK-17997 > URL: https://issues.apache.org/jira/browse/SPARK-17997 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhenhua Wang >Assignee: Apache Spark > > This is for computing ndv's for bins in equi-height histograms. A bin > consists of two endpoints which form an interval of values and the ndv in > that interval. For computing histogram statistics, after getting the > endpoints, we need an agg function to count distinct values in each interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17997) Aggregation function for counting distinct values for multiple intervals
[ https://issues.apache.org/jira/browse/SPARK-17997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587548#comment-15587548 ] Apache Spark commented on SPARK-17997: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/15544 > Aggregation function for counting distinct values for multiple intervals > > > Key: SPARK-17997 > URL: https://issues.apache.org/jira/browse/SPARK-17997 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhenhua Wang > > This is for computing ndv's for bins in equi-height histograms. A bin > consists of two endpoints which form an interval of values and the ndv in > that interval. For computing histogram statistics, after getting the > endpoints, we need an agg function to count distinct values in each interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
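The aggregation proposed in SPARK-17997 can be illustrated in a toy form (plain Python, not the Spark aggregate-function API; `ndv_per_interval` is a hypothetical helper): given the endpoints of an equi-height histogram, count the distinct values falling into each interval.

```python
from bisect import bisect_right

def ndv_per_interval(values, endpoints):
    """Count distinct values per interval. Intervals are
    (endpoints[i], endpoints[i+1]], with the first one also closed on
    the left, mirroring how histogram bins are usually defined."""
    distinct = [set() for _ in range(len(endpoints) - 1)]
    for v in values:
        if v < endpoints[0] or v > endpoints[-1]:
            continue  # outside the histogram's range
        i = bisect_right(endpoints, v) - 1
        i = min(i, len(distinct) - 1)  # the right edge lands in the last bin
        distinct[i].add(v)
    return [len(s) for s in distinct]

# Endpoints assumed to come from a prior equi-height computation.
assert ndv_per_interval([1, 1, 2, 5, 5, 9], [1, 4, 9]) == [2, 2]
```

A real implementation would use an approximate-distinct sketch (e.g. HyperLogLog) per bin rather than exact sets, but the per-interval bookkeeping is the same.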
[jira] [Assigned] (SPARK-18001) Broke link to R DataFrame In sql-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-18001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18001: Assignee: Apache Spark > Broke link to R DataFrame In sql-programming-guide > --- > > Key: SPARK-18001 > URL: https://issues.apache.org/jira/browse/SPARK-18001 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.0.1 >Reporter: Tommy Yu >Assignee: Apache Spark >Priority: Trivial > > In http://spark.apache.org/docs/latest/sql-programming-guide.html, Section > "Untyped Dataset Operations (aka DataFrame Operations)" > Link to R doesn't work that return > The requested URL /docs/latest/api/R/DataFrame.html was not found on this > server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18001) Broke link to R DataFrame In sql-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-18001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18001: Assignee: (was: Apache Spark) > Broke link to R DataFrame In sql-programming-guide > --- > > Key: SPARK-18001 > URL: https://issues.apache.org/jira/browse/SPARK-18001 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.0.1 >Reporter: Tommy Yu >Priority: Trivial > > In http://spark.apache.org/docs/latest/sql-programming-guide.html, Section > "Untyped Dataset Operations (aka DataFrame Operations)" > Link to R doesn't work that return > The requested URL /docs/latest/api/R/DataFrame.html was not found on this > server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18001) Broke link to R DataFrame In sql-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-18001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587539#comment-15587539 ] Apache Spark commented on SPARK-18001: -- User 'Wenpei' has created a pull request for this issue: https://github.com/apache/spark/pull/15543 > Broke link to R DataFrame In sql-programming-guide > --- > > Key: SPARK-18001 > URL: https://issues.apache.org/jira/browse/SPARK-18001 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.0.1 >Reporter: Tommy Yu >Priority: Trivial > > In http://spark.apache.org/docs/latest/sql-programming-guide.html, Section > "Untyped Dataset Operations (aka DataFrame Operations)" > Link to R doesn't work that return > The requested URL /docs/latest/api/R/DataFrame.html was not found on this > server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17873) ALTER TABLE ... RENAME TO ... should allow users to specify database in destination table name
[ https://issues.apache.org/jira/browse/SPARK-17873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-17873. -- Resolution: Fixed Fix Version/s: 2.1.0 This issue has been resolved by https://github.com/apache/spark/pull/15434. > ALTER TABLE ... RENAME TO ... should allow users to specify database in > destination table name > -- > > Key: SPARK-17873 > URL: https://issues.apache.org/jira/browse/SPARK-17873 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17987) ML Evaluator fails to handle null values in the dataset
[ https://issues.apache.org/jira/browse/SPARK-17987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587530#comment-15587530 ] bo song commented on SPARK-17987: - There is a common case in cross validation: suppose f1 is a categorical predictor whose categories are (0,1,2,3,4). Cross validation splits the data into training and testing sets randomly; suppose the training data contains only (0,1,2) for f1. When the testing data asks for forecasts on (3,4), most algorithms will produce null predictions in this case. I would like to introduce an option into Spark: its default behavior would still be to throw an exception for missing/null values, but the caller could change it to exclude missing values explicitly, knowing the risks and wanting a result instead of an exception. > ML Evaluator fails to handle null values in the dataset > --- > > Key: SPARK-17987 > URL: https://issues.apache.org/jira/browse/SPARK-17987 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.6.2, 2.0.1 >Reporter: bo song > > Take the RegressionEvaluator as an example: when the predictionCol is null in > a row, an exception "scala.MatchError" is thrown. A missing (null) > prediction is a common case; for example, when a predictor is missing or its > value is out of bounds, most machine learning models cannot produce > correct predictions, so null predictions are returned. Evaluators > should handle null values instead of throwing an exception; the common way > to handle missing values is to ignore them. Besides null values, > NaN values need to be handled correctly too. > The three evaluators RegressionEvaluator, BinaryClassificationEvaluator and > MulticlassClassificationEvaluator have the same problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
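The "ignore missing values" behavior requested in SPARK-17987 can be sketched for one regression metric (plain Python, not the Spark Evaluator API; `rmse_ignoring_missing` is a hypothetical helper): rows whose prediction is null or NaN are skipped rather than raising, and the metric is computed over what remains.

```python
import math

def rmse_ignoring_missing(labels, predictions):
    """RMSE over rows with a usable prediction; None and NaN predictions
    are silently skipped instead of raising, matching the proposed
    opt-in 'exclude missing values' mode."""
    pairs = [(y, p) for y, p in zip(labels, predictions)
             if p is not None
             and not (isinstance(p, float) and math.isnan(p))]
    if not pairs:
        raise ValueError("no non-missing predictions to evaluate")
    return math.sqrt(sum((y - p) ** 2 for y, p in pairs) / len(pairs))

labels = [1.0, 2.0, 3.0, 4.0]
preds = [1.0, None, float("nan"), 4.0]
# Only the two valid rows are scored, and they match exactly.
assert rmse_ignoring_missing(labels, preds) == 0.0
```

Keeping the exception as the default and making the skip behavior an explicit option, as the comment suggests, means silent data loss can never happen without the caller opting in.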
[jira] [Updated] (SPARK-17873) ALTER TABLE ... RENAME TO ... should allow users to specify database in destination table name
[ https://issues.apache.org/jira/browse/SPARK-17873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-17873: - Assignee: Wenchen Fan (was: Apache Spark) > ALTER TABLE ... RENAME TO ... should allow users to specify database in > destination table name > -- > > Key: SPARK-17873 > URL: https://issues.apache.org/jira/browse/SPARK-17873 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18001) Broken link to R DataFrame in sql-programming-guide
Tommy Yu created SPARK-18001: Summary: Broken link to R DataFrame in sql-programming-guide Key: SPARK-18001 URL: https://issues.apache.org/jira/browse/SPARK-18001 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 2.0.1 Reporter: Tommy Yu Priority: Trivial In http://spark.apache.org/docs/latest/sql-programming-guide.html, Section "Untyped Dataset Operations (aka DataFrame Operations)", the link to R does not work: it returns "The requested URL /docs/latest/api/R/DataFrame.html was not found on this server." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17074) generate histogram information for column
[ https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-17074: - Description: We support two kinds of histograms: - Equi-width histogram: We have a fixed width for each column interval in the histogram. The height of a histogram represents the frequency for those column values in a specific interval. For this kind of histogram, its height varies for different column intervals. We use the equi-width histogram when the number of distinct values is less than 254. - Equi-height histogram: For this histogram, the width of column interval varies. The heights of all column intervals are the same. The equi-height histogram is effective in handling skewed data distribution. We use the equi-height histogram when the number of distinct values is equal to or greater than 254. We first use [SPARK-18000] and [SPARK-17881] to compute equi-width histograms (for both numeric and string types) or endpoints of equi-height histograms (for numeric type only). Then, if we get endpoints of an equi-height histogram, we need to compute ndv's between those endpoints by [SPARK-17997] to form the equi-height histogram. This Jira incorporates the three Jiras mentioned above to support the needed aggregation functions. We need to resolve them before this one. was: We support two kinds of histograms: - Equi-width histogram: We have a fixed width for each column interval in the histogram. The height of a histogram represents the frequency for those column values in a specific interval. For this kind of histogram, its height varies for different column intervals. We use the equi-width histogram when the number of distinct values is less than 254. - Equi-height histogram: For this histogram, the width of column interval varies. The heights of all column intervals are the same. The equi-height histogram is effective in handling skewed data distribution. 
We use the equi-height histogram when the number of distinct values is equal to or greater than 254. > generate histogram information for column > - > > Key: SPARK-17074 > URL: https://issues.apache.org/jira/browse/SPARK-17074 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.0.0 >Reporter: Ron Hu > > We support two kinds of histograms: > - Equi-width histogram: We have a fixed width for each column interval in > the histogram. The height of a histogram represents the frequency for those > column values in a specific interval. For this kind of histogram, its height > varies for different column intervals. We use the equi-width histogram when > the number of distinct values is less than 254. > - Equi-height histogram: For this histogram, the width of column interval > varies. The heights of all column intervals are the same. The equi-height > histogram is effective in handling skewed data distribution. We use the > equi-height histogram when the number of distinct values is equal to or greater > than 254. > We first use [SPARK-18000] and [SPARK-17881] to compute equi-width histograms > (for both numeric and string types) or endpoints of equi-height histograms > (for numeric type only). Then, if we get endpoints of an equi-height > histogram, we need to compute ndv's between those endpoints by [SPARK-17997] > to form the equi-height histogram. > This Jira incorporates the three Jiras mentioned above to support the needed > aggregation functions. We need to resolve them before this one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
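The distinction between the two histogram kinds described above can be sketched in a few lines of plain Python. This is illustrative only, not Spark's implementation; the function names are hypothetical:

```python
from collections import Counter

def equi_width_histogram(values):
    """One bucket per distinct value: fixed-width intervals whose
    heights (frequencies) vary across intervals."""
    return dict(Counter(values))

def equi_height_endpoints(values, num_bins):
    """Endpoints chosen so every interval holds roughly the same number
    of rows: interval widths vary, heights are equal."""
    ordered = sorted(values)
    step = len(ordered) / num_bins
    return [ordered[min(int(round(i * step)), len(ordered) - 1)]
            for i in range(num_bins + 1)]

data = [1, 1, 2, 5, 5, 5, 9, 9, 40, 100]
print(equi_width_histogram(data))      # {1: 2, 2: 1, 5: 3, 9: 2, 40: 1, 100: 1}
print(equi_height_endpoints(data, 2))  # [1, 5, 100]
```

The skew in the data shows why equi-height helps: the endpoint 5 splits the dense low range from the sparse tail, whereas fixed-width buckets over [1, 100] would leave most buckets nearly empty.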
[jira] [Commented] (SPARK-17984) Add support for numa aware feature
[ https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587472#comment-15587472 ] quanfuwang commented on SPARK-17984: Thanks. Yes, I'm considering not bringing in a dependency on numactl, but finding a general way instead. > Add support for numa aware feature > -- > > Key: SPARK-17984 > URL: https://issues.apache.org/jira/browse/SPARK-17984 > Project: Spark > Issue Type: New Feature > Components: Deploy, Mesos, YARN >Affects Versions: 2.0.1 > Environment: Cluster Topo: 1 Master + 4 Slaves > CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores) > Memory: 128GB(2 NUMA Nodes) > SW Version: Hadoop-5.7.0 + Spark-2.0.0 >Reporter: quanfuwang > Original Estimate: 672h > Remaining Estimate: 672h > > This Jira targets adding a NUMA aware feature, which can help improve > performance by making cores access local memory rather than remote memory. > A patch is being developed, see https://github.com/apache/spark/pull/15524. > The whole task includes 3 subtasks and will be developed iteratively: > NUMA aware support for YARN based deployment mode > NUMA aware support for Mesos based deployment mode > NUMA aware support for Standalone based deployment mode -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18000) Aggregation function for computing endpoints for numeric histograms
Zhenhua Wang created SPARK-18000: Summary: Aggregation function for computing endpoints for numeric histograms Key: SPARK-18000 URL: https://issues.apache.org/jira/browse/SPARK-18000 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.1.0 Reporter: Zhenhua Wang For a column of numeric type (including date and timestamp), we will generate an equi-width or equi-height histogram, depending on whether its ndv is larger than the maximum number of bins allowed in one histogram (denoted as numBins). This agg function computes values and their frequencies using a small hashmap whose size is less than or equal to "numBins", and returns an equi-width histogram. When the size of the hashmap exceeds "numBins", it clears the hashmap and uses ApproximatePercentile to return the endpoints of an equi-height histogram. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
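The small-hashmap-with-fallback behavior described above can be sketched as follows. This is plain Python and illustrative only: the function name is hypothetical, and a simple exact-percentile computation stands in for ApproximatePercentile:

```python
def histogram_agg(values, num_bins):
    """Try to build an equi-width histogram in a small map of value -> count;
    if the number of distinct values exceeds num_bins, discard the map and
    fall back to equi-height endpoints (exact percentiles here, standing in
    for an approximate-percentile sketch)."""
    freqs = {}
    for v in values:
        freqs[v] = freqs.get(v, 0) + 1
        if len(freqs) > num_bins:
            # overflow: too many distinct values, switch strategies
            ordered = sorted(values)
            endpoints = [ordered[int(round(i * (len(ordered) - 1) / num_bins))]
                         for i in range(num_bins + 1)]
            return ("equi-height", endpoints)
    return ("equi-width", freqs)

print(histogram_agg([1, 1, 2, 2], num_bins=4))       # ('equi-width', {1: 2, 2: 2})
print(histogram_agg(list(range(100)), num_bins=4))   # equi-height endpoints
```

The real aggregation would merge partial hashmaps across partitions before deciding which result to emit; this sketch only shows the per-stream overflow logic.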
[jira] [Created] (SPARK-17999) Add getPreferredLocations for KafkaSourceRDD
Saisai Shao created SPARK-17999: --- Summary: Add getPreferredLocations for KafkaSourceRDD Key: SPARK-17999 URL: https://issues.apache.org/jira/browse/SPARK-17999 Project: Spark Issue Type: Improvement Components: SQL, Streaming Reporter: Saisai Shao Priority: Minor The newly implemented Structured Streaming KafkaSource does calculate the preferred locations for each topic partition, but doesn't expose this information through the RDD's {{getPreferredLocations}} method. So here we propose to add this method to {{KafkaSourceRDD}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10872) Derby error (XSDB6) when creating new HiveContext after restarting SparkContext
[ https://issues.apache.org/jira/browse/SPARK-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587461#comment-15587461 ] Angus Gerry commented on SPARK-10872: - Will do. Thanks mate :). > Derby error (XSDB6) when creating new HiveContext after restarting > SparkContext > --- > > Key: SPARK-10872 > URL: https://issues.apache.org/jira/browse/SPARK-10872 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.0, 1.4.1, 1.5.0 >Reporter: Dmytro Bielievtsov > > Starting from spark 1.4.0 (works well on 1.3.1), the following code fails > with "XSDB6: Another instance of Derby may have already booted the database > ~/metastore_db": > {code:python} > from pyspark import SparkContext, HiveContext > sc = SparkContext("local[*]", "app1") > sql = HiveContext(sc) > sql.createDataFrame([[1]]).collect() > sc.stop() > sc = SparkContext("local[*]", "app2") > sql = HiveContext(sc) > sql.createDataFrame([[1]]).collect() # Py4J error > {code} > This is related to [#SPARK-9539], and I intend to restart spark context > several times for isolated jobs to prevent cache cluttering and GC errors. > Here's a larger part of the full error trace: > {noformat} > Failed to start database 'metastore_db' with class loader > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see > the next exception for details. > org.datanucleus.exceptions.NucleusDataStoreException: Failed to start > database 'metastore_db' with class loader > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see > the next exception for details. 
> at > org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:298) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) > at > org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187) > at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965) > at java.security.AccessController.doPrivileged(Native Method) > at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960) > at > javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166) > at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808) > at 
javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701) > at > org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365) > at > org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394) > at > org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291) > at > org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258) > at > org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73) > at > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) > at > org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57) > at > org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571) > at >
[jira] [Commented] (SPARK-17984) Add support for numa aware feature
[ https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587459#comment-15587459 ] Saisai Shao commented on SPARK-17984: - NUMA should be supported by most commodity servers as well as HPC systems. But {{numactl}} may not be installed by default in most OSes. Also, other systems like Windows or Mac may not have equivalent tools; please take this into consideration. > Add support for numa aware feature > -- > > Key: SPARK-17984 > URL: https://issues.apache.org/jira/browse/SPARK-17984 > Project: Spark > Issue Type: New Feature > Components: Deploy, Mesos, YARN >Affects Versions: 2.0.1 > Environment: Cluster Topo: 1 Master + 4 Slaves > CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores) > Memory: 128GB(2 NUMA Nodes) > SW Version: Hadoop-5.7.0 + Spark-2.0.0 >Reporter: quanfuwang > Original Estimate: 672h > Remaining Estimate: 672h > > This Jira targets adding a NUMA aware feature, which can help improve > performance by making cores access local memory rather than remote memory. > A patch is being developed, see https://github.com/apache/spark/pull/15524. > The whole task includes 3 subtasks and will be developed iteratively: > NUMA aware support for YARN based deployment mode > NUMA aware support for Mesos based deployment mode > NUMA aware support for Standalone based deployment mode -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-17982: -- Description: The following statement fails in the spark shell . {noformat} scala> spark.sql("CREATE VIEW DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") scala> spark.sql("CREATE VIEW DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS gen_subquery_1 at org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) at org.apache.spark.sql.Dataset.(Dataset.scala:186) at org.apache.spark.sql.Dataset.(Dataset.scala:167) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) {noformat} This appears to be a limitation of the create view statement . was: The following statement fails in the spark shell . scala> spark.sql("CREATE VIEW DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") scala> spark.sql("CREATE VIEW DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS gen_subquery_1 at org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) at org.apache.spark.sql.Dataset.(Dataset.scala:186) at org.apache.spark.sql.Dataset.(Dataset.scala:167) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) This appears to be a limitation of the create view statement . > Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: > Failed to analyze the canonicalized SQL. It is possible there is a bug in > Spark. > > > Key: SPARK-17982 >
[jira] [Created] (SPARK-17998) Reading Parquet files coalesces parts into too few in-memory partitions
Shea Parkes created SPARK-17998: --- Summary: Reading Parquet files coalesces parts into too few in-memory partitions Key: SPARK-17998 URL: https://issues.apache.org/jira/browse/SPARK-17998 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 2.0.1, 2.0.0 Environment: Spark Standalone Cluster (not "local mode") Windows 10 and Windows 7 Python 3.x Reporter: Shea Parkes Reading a Parquet file into a DataFrame is resulting in far too few in-memory partitions. In prior versions of Spark, the resulting DataFrame would have a number of partitions often equal to the number of parts in the Parquet folder. Here's a minimal reproducible sample: {code:python} df_first = session.range(start=1, end=1, numPartitions=13) assert df_first.rdd.getNumPartitions() == 13 assert session._sc.defaultParallelism == 6 path_scrap = r"c:\scratch\scrap.parquet" df_first.write.parquet(path_scrap) df_second = session.read.parquet(path_scrap) print(df_second.rdd.getNumPartitions()) {code} The above shows only 7 partitions in the DataFrame that was created by reading the Parquet back into memory for me. Why is it no longer just the number of part files in the Parquet folder? (Which is 13 in the example above.) I'm filing this as a bug because it has gotten so bad that we can't work with the underlying RDD without first repartitioning the DataFrame, which is costly and wasteful. I really doubt this was the intended effect of moving to Spark 2.0. I've tried to research where the number of in-memory partitions is determined, but my Scala skills have proven inadequate. I'd be happy to dig further if someone could point me in the right direction... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
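One plausible explanation for the behavior reported above is that a file-based reader packs many small part files into each read partition up to a byte budget, rather than mapping one part file to one partition. A hedged sketch of such greedy packing follows; it is illustrative only, and the parameter names merely echo Spark's spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes settings rather than reproduce its actual logic:

```python
def pack_files_into_partitions(file_sizes, max_partition_bytes, open_cost):
    """Greedily pack files into read partitions: each file costs its size
    plus a fixed 'open cost'; a partition is closed once the budget is hit."""
    partitions, current, current_bytes = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        cost = size + open_cost
        if current and current_bytes + cost > max_partition_bytes:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += cost
    if current:
        partitions.append(current)
    return partitions

# 13 small part files collapse into far fewer read partitions:
parts = pack_files_into_partitions([1_000_000] * 13,
                                   max_partition_bytes=4_000_000,
                                   open_cost=0)
print(len(parts))  # -> 4, not 13
```

Under this model, raising the per-partition byte budget lowers the partition count and lowering it raises the count, which matches the 13-to-7 collapse observed in the report.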
[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587433#comment-15587433 ] Jiang Xingbo commented on SPARK-17982: -- Would you please give an example that works? Thanks! > Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: > Failed to analyze the canonicalized SQL. It is possible there is a bug in > Spark. > > > Key: SPARK-17982 > URL: https://issues.apache.org/jira/browse/SPARK-17982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark 2.0.0 >Reporter: Franck Tago > > The following statement fails in the spark shell . > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT > `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT > `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT > `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS > `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS > gen_subquery_1 > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at org.apache.spark.sql.Dataset.(Dataset.scala:186) > at org.apache.spark.sql.Dataset.(Dataset.scala:167) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) > This appears to be a limitation of the create view statement . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17997) Aggregation function for counting distinct values for multiple intervals
Zhenhua Wang created SPARK-17997: Summary: Aggregation function for counting distinct values for multiple intervals Key: SPARK-17997 URL: https://issues.apache.org/jira/browse/SPARK-17997 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.1.0 Reporter: Zhenhua Wang This is for computing ndv's for bins in equi-height histograms. A bin consists of two endpoints which form an interval of values and the ndv in that interval. For computing histogram statistics, after getting the endpoints, we need an agg function to count distinct values in each interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
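The per-interval distinct counting described above can be sketched in plain Python. This is illustrative only, not the proposed aggregate function; the function name and data are hypothetical:

```python
def ndv_per_interval(values, endpoints):
    """Count distinct values in each interval defined by consecutive
    endpoints: the first interval is closed on both ends, the rest are
    (lo, hi] so shared endpoints are not double-counted."""
    bins = [set() for _ in range(len(endpoints) - 1)]
    for v in values:
        for i, (lo, hi) in enumerate(zip(endpoints, endpoints[1:])):
            if (lo <= v if i == 0 else lo < v) and v <= hi:
                bins[i].add(v)
                break
    return [len(s) for s in bins]

values = [1, 1, 2, 3, 5, 5, 8, 9, 9, 10]
print(ndv_per_interval(values, endpoints=[1, 5, 10]))  # -> [4, 3]
```

Pairing each endpoint interval with its ndv count yields exactly the bins an equi-height histogram needs.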
[jira] [Assigned] (SPARK-17996) catalog.getFunction(name) returns wrong result for a permanent function
[ https://issues.apache.org/jira/browse/SPARK-17996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17996: Assignee: Herman van Hovell (was: Apache Spark) > catalog.getFunction(name) returns wrong result for a permanent function > --- > > Key: SPARK-17996 > URL: https://issues.apache.org/jira/browse/SPARK-17996 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Herman van Hovell >Assignee: Herman van Hovell >Priority: Minor > > The catalog returns a wrong result if we look up a permanent function without > specifying the database. For example: > {noformat} > scala> sql("create function fn1 as > 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs'") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.catalog.getFunction("fn1") > res1: org.apache.spark.sql.catalog.Function = Function[name='fn1', > className='org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs', > isTemporary='true'] > {noformat} > It should not report that this function is temporary, and it should specify a database. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17996) catalog.getFunction(name) returns wrong result for a permanent function
[ https://issues.apache.org/jira/browse/SPARK-17996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587425#comment-15587425 ] Apache Spark commented on SPARK-17996: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/15542 > catalog.getFunction(name) returns wrong result for a permanent function > --- > > Key: SPARK-17996 > URL: https://issues.apache.org/jira/browse/SPARK-17996 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Herman van Hovell >Assignee: Herman van Hovell >Priority: Minor > > The catalog returns a wrong result if we look up a permanent function without > specifying the database. For example: > {noformat} > scala> sql("create function fn1 as > 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs'") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.catalog.getFunction("fn1") > res1: org.apache.spark.sql.catalog.Function = Function[name='fn1', > className='org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs', > isTemporary='true'] > {noformat} > It should not report that this function is temporary, and it should specify a database. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17996) catalog.getFunction(name) returns wrong result for a permanent function
[ https://issues.apache.org/jira/browse/SPARK-17996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17996: Assignee: Apache Spark (was: Herman van Hovell) > catalog.getFunction(name) returns wrong result for a permanent function > --- > > Key: SPARK-17996 > URL: https://issues.apache.org/jira/browse/SPARK-17996 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Herman van Hovell >Assignee: Apache Spark >Priority: Minor > > The catalog returns a wrong result if we look up a permanent function without > specifying the database. For example: > {noformat} > scala> sql("create function fn1 as > 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs'") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.catalog.getFunction("fn1") > res1: org.apache.spark.sql.catalog.Function = Function[name='fn1', > className='org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs', > isTemporary='true'] > {noformat} > It should not report that this function is temporary, and it should specify a database. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587420#comment-15587420 ] Jiang Xingbo edited comment on SPARK-17982 at 10/19/16 2:22 AM: [~dongjoon] In your examples there is a misleading part: {code} scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl") res2: org.apache.spark.sql.DataFrame = [] {code} The above "(id2)" in "v1(id2)" is in fact an identifierCommentList instead of colTypeList, so it is not actually creating columns accord. Perhaps we should listen to [~hvanhovell] whether we should support specify columns in CreateView? was (Author: jiangxb1987): [~dongjoon] In your examples there is a misleading part: {code} scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl") res2: org.apache.spark.sql.DataFrame = [] {code} The above "(id2)" in "v1(id2)" is infact an identifierCommentList instead of colTypeList, so it is not actually creating columns accord. Perhaps we should listen to [~hvanhovell] whether we should support specify columns in CreateView? > Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: > Failed to analyze the canonicalized SQL. It is possible there is a bug in > Spark. > > > Key: SPARK-17982 > URL: https://issues.apache.org/jira/browse/SPARK-17982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark 2.0.0 >Reporter: Franck Tago > > The following statement fails in the spark shell . 
> scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT > `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT > `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT > `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS > `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS > gen_subquery_1 > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at org.apache.spark.sql.Dataset.(Dataset.scala:186) > at 
org.apache.spark.sql.Dataset.(Dataset.scala:167) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) > This appears to be a limitation of the create view statement . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17980) Fix refreshByPath for converted Hive tables
[ https://issues.apache.org/jira/browse/SPARK-17980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-17980: Assignee: Eric Liang > Fix refreshByPath for converted Hive tables > --- > > Key: SPARK-17980 > URL: https://issues.apache.org/jira/browse/SPARK-17980 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Minor > Fix For: 2.1.0 > > > There is a small bug introduced in https://github.com/apache/spark/pull/14690 > which broke refreshByPath with converted Hive tables (though, it turns out it > was very difficult to refresh converted Hive tables anyway, since you had to > specify the exact path of one of the partitions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17980) Fix refreshByPath for converted Hive tables
[ https://issues.apache.org/jira/browse/SPARK-17980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-17980. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15521 [https://github.com/apache/spark/pull/15521] > Fix refreshByPath for converted Hive tables > --- > > Key: SPARK-17980 > URL: https://issues.apache.org/jira/browse/SPARK-17980 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Minor > Fix For: 2.1.0 > > > There is a small bug introduced in https://github.com/apache/spark/pull/14690 > which broke refreshByPath with converted Hive tables (though, it turns out it > was very difficult to refresh converted Hive tables anyway, since you had to > specify the exact path of one of the partitions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587420#comment-15587420 ] Jiang Xingbo commented on SPARK-17982: -- [~dongjoon] In your examples there is a misleading part: {code} scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl") res2: org.apache.spark.sql.DataFrame = [] {code} The above "(id2)" in "v1(id2)" is in fact an identifierCommentList instead of a colTypeList, so it is not actually creating columns accordingly. Perhaps we should ask [~hvanhovell] whether we should support specifying columns in CreateView? > Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: > Failed to analyze the canonicalized SQL. It is possible there is a bug in > Spark. > > > Key: SPARK-17982 > URL: https://issues.apache.org/jira/browse/SPARK-17982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark 2.0.0 >Reporter: Franck Tago > > The following statement fails in the spark shell . 
> scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT > `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT > `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT > `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS > `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS > gen_subquery_1 > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at org.apache.spark.sql.Dataset.(Dataset.scala:186) > at 
org.apache.spark.sql.Dataset.(Dataset.scala:167) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) > This appears to be a limitation of the create view statement . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17984) Add support for numa aware feature
[ https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587374#comment-15587374 ] quanfuwang commented on SPARK-17984: Normal servers support NUMA, but I'm considering abstracting it away so that it does not introduce a dependency on numactl. Yes, one can't always find numactl. > Add support for numa aware feature > -- > > Key: SPARK-17984 > URL: https://issues.apache.org/jira/browse/SPARK-17984 > Project: Spark > Issue Type: New Feature > Components: Deploy, Mesos, YARN >Affects Versions: 2.0.1 > Environment: Cluster Topo: 1 Master + 4 Slaves > CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores) > Memory: 128GB(2 NUMA Nodes) > SW Version: Hadoop-5.7.0 + Spark-2.0.0 >Reporter: quanfuwang > Original Estimate: 672h > Remaining Estimate: 672h > > This Jira targets adding a NUMA-aware feature, which can help improve > performance by making cores access local memory rather than remote memory. > A patch is being developed, see https://github.com/apache/spark/pull/15524. > The whole task includes 3 subtasks and will be developed iteratively: > Numa aware support for Yarn based deployment mode > Numa aware support for Mesos based deployment mode > Numa aware support for Standalone based deployment mode -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
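The binding idea behind such a patch can be sketched without Spark: pick a NUMA node for each executor and launch the executor process under that node's CPU and memory binding (for example via a numactl prefix). The helper below is purely illustrative — the function name and the round-robin policy are assumptions, not taken from PR 15524:

```python
# Sketch: round-robin assignment of executor processes to NUMA nodes.
# On a real system the returned prefix would be prepended to the launch
# command, e.g. "numactl --cpunodebind=1 --membind=1 <executor command>".

def numa_binding_prefix(executor_id: int, num_numa_nodes: int) -> str:
    """Pick a NUMA node for this executor and build a numactl prefix.

    Binding both CPU and memory to the same node makes cores access
    local memory rather than remote memory, which is the performance
    motivation described in the issue.
    """
    node = executor_id % num_numa_nodes  # round-robin over the nodes
    return f"numactl --cpunodebind={node} --membind={node}"

# Two NUMA nodes, as in the reporter's 128 GB / 2-node test machines:
prefixes = [numa_binding_prefix(i, 2) for i in range(4)]
```

With two nodes, consecutive executors alternate between node 0 and node 1, spreading memory pressure evenly across the sockets.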
[jira] [Created] (SPARK-17996) catalog.getFunction(name) returns wrong result for a permanent function
Herman van Hovell created SPARK-17996: - Summary: catalog.getFunction(name) returns wrong result for a permanent function Key: SPARK-17996 URL: https://issues.apache.org/jira/browse/SPARK-17996 Project: Spark Issue Type: Bug Components: SQL Reporter: Herman van Hovell Assignee: Herman van Hovell Priority: Minor The catalog returns a wrong result if we look up a permanent function without specifying the database. For example: {noformat} scala> sql("create function fn1 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs'") res0: org.apache.spark.sql.DataFrame = [] scala> spark.catalog.getFunction("fn1") res1: org.apache.spark.sql.catalog.Function = Function[name='fn1', className='org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs', isTemporary='true'] {noformat} It should not report the function as temporary, and it should define a database. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
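The expected semantics can be modeled with a toy catalog (the classes below are hypothetical illustrations, not Spark's org.apache.spark.sql.catalog API): a permanent function looked up without a database should resolve against the current database and be reported as non-temporary.

```python
# Toy model of the intended getFunction behaviour: only functions in the
# temporary-function registry are temporary; anything else resolves
# against the current database and carries that database name.

class Function:
    def __init__(self, name, database, is_temporary):
        self.name, self.database, self.is_temporary = name, database, is_temporary

class Catalog:
    def __init__(self, current_db):
        self.current_db = current_db
        self.temp_funcs = {}            # name -> class name
        self.perm_funcs = {}            # (db, name) -> class name

    def get_function(self, name, db=None):
        if db is None and name in self.temp_funcs:
            return Function(name, None, True)      # genuinely temporary
        db = db or self.current_db                 # resolve missing database
        if (db, name) in self.perm_funcs:
            return Function(name, db, False)       # permanent: db set, not temp
        raise KeyError(name)

cat = Catalog("default")
cat.perm_funcs[("default", "fn1")] = "GenericUDFAbs"
fn = cat.get_function("fn1")   # database filled in, is_temporary False
```

Under this model the repro above would return Function[name='fn1', database='default', isTemporary='false'], which is the result the issue says the catalog should produce.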
[jira] [Comment Edited] (SPARK-17995) Use new attributes for columns from outer joins
[ https://issues.apache.org/jira/browse/SPARK-17995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587348#comment-15587348 ] Wenchen Fan edited comment on SPARK-17995 at 10/19/16 1:46 AM: --- Can we just add a `newOuterJoinAttrs: Seq[Attribute]` parameter in the `Join` class? e.g. {code} case class Join(..., newOuterJoinAttrs: Seq[Attribute]) { def output = joinType match { case LeftOuterJoin => left.output ++ newOuterJoinAttrs } } object Join { def apply(...) = { val newOuterJoinAttrs = joinType match { case LeftOuterJoin => right.output.map(_.newInstance) } Join(..., newOuterJoinAttrs) } } {code} was (Author: cloud_fan): Can we just add a `newOuterJoinAttrs: Seq[Attribute]` parameter in the `Join` class? e.g. {code} case class Join(..., newOuterJoinAttrs: Seq[Attribute]) object Join { def apply(...) = { val newOuterJoinAttrs = joinType match { case LeftOuterJoin => right.output.map(_.newInstance) } } } {code} > Use new attributes for columns from outer joins > --- > > Key: SPARK-17995 > URL: https://issues.apache.org/jira/browse/SPARK-17995 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 2.0.0, 2.1.0 >Reporter: Ryan Blue > > Plans involving outer joins use the same attribute reference (by exprId) to > reference columns above the join and below the join. This is a false > equivalence that leads to bugs like SPARK-16181, in which attributes were > incorrectly replaced by the optimizer. The column has a different schema > above the outer join because its values may be null. The fix for that issue, > [PR #13884](https://github.com/apache/spark/pull/13884), has a TODO comment > from [~cloud_fan] to fix this by using different attributes instead of > needing to special-case outer joins in rules, and this issue is to track that > improvement. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17995) Use new attributes for columns from outer joins
[ https://issues.apache.org/jira/browse/SPARK-17995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587348#comment-15587348 ] Wenchen Fan commented on SPARK-17995: - Can we just add a `newOuterJoinAttrs: Seq[Attribute]` parameter in the `Join` class? e.g. {code} case class Join(..., newOuterJoinAttrs: Seq[Attribute]) object Join { def apply(...) = { val newOuterJoinAttrs = joinType match { case LeftOuterJoin => right.output.map(_.newInstance) } } } {code} > Use new attributes for columns from outer joins > --- > > Key: SPARK-17995 > URL: https://issues.apache.org/jira/browse/SPARK-17995 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 2.0.0, 2.1.0 >Reporter: Ryan Blue > > Plans involving outer joins use the same attribute reference (by exprId) to > reference columns above the join and below the join. This is a false > equivalence that leads to bugs like SPARK-16181, in which attributes were > incorrectly replaced by the optimizer. The column has a different schema > above the outer join because its values may be null. The fix for that issue, > [PR #13884](https://github.com/apache/spark/pull/13884), has a TODO comment > from [~cloud_fan] to fix this by using different attributes instead of > needing to special-case outer joins in rules, and this issue is to track that > improvement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
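The problem can be shown with a toy attribute model (hypothetical classes, not Catalyst's): above a left outer join the right side's columns become nullable, so giving them fresh exprIds, as the `right.output.map(_.newInstance)` suggestion does, prevents rules from conflating the column below the join with the different-schema column above it.

```python
import itertools

_ids = itertools.count()  # global exprId generator, like Catalyst's NamedExpression.newExprId

class Attribute:
    def __init__(self, name, nullable=False, expr_id=None):
        self.name, self.nullable = name, nullable
        self.expr_id = next(_ids) if expr_id is None else expr_id

    def new_instance(self, nullable=None):
        # Same name, but a distinct identity (and possibly different
        # nullability), so optimizer rules keyed on exprId cannot mix up
        # the pre-join and post-join versions of the column.
        return Attribute(self.name,
                         self.nullable if nullable is None else nullable)

left = [Attribute("id"), Attribute("a")]
right = [Attribute("id"), Attribute("b")]

# Left outer join: right-side columns may be null above the join, so
# they get new, nullable attributes in the join's output.
new_right = [a.new_instance(nullable=True) for a in right]
output = left + new_right
```

The key invariant is that `new_right` shares names with `right` but no exprIds, which is exactly what makes the "false equivalence" from the issue description impossible.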
[jira] [Commented] (SPARK-17637) Packed scheduling for Spark tasks across executors
[ https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587307#comment-15587307 ] Apache Spark commented on SPARK-17637: -- User 'zhzhan' has created a pull request for this issue: https://github.com/apache/spark/pull/15541 > Packed scheduling for Spark tasks across executors > -- > > Key: SPARK-17637 > URL: https://issues.apache.org/jira/browse/SPARK-17637 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Zhan Zhang >Assignee: Zhan Zhang >Priority: Minor > > Currently the Spark scheduler implements round-robin scheduling of tasks to > executors, which is great as it distributes the load evenly across the > cluster, but this leads to significant resource waste in some cases, > especially when dynamic allocation is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
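The resource-waste argument can be made concrete with a toy scheduler (an illustration only, not Spark's actual TaskSchedulerImpl): with 5 executors of 4 cores each and 6 pending tasks, round-robin keeps every executor busy, while packed assignment fills executors one at a time and leaves three completely idle — and under dynamic allocation idle executors can be released.

```python
# Toy comparison of round-robin vs packed task assignment.

def assign(num_tasks, executors, cores_per_executor, packed):
    slots = {e: cores_per_executor for e in executors}   # free cores
    assignment = {e: 0 for e in executors}               # tasks per executor
    order = list(executors)
    i = 0
    for _ in range(num_tasks):
        if packed:
            # fill the first executor that still has a free core
            e = next(e for e in order if slots[e] > 0)
        else:
            # round-robin over executors that still have free cores
            while slots[order[i % len(order)]] == 0:
                i += 1
            e = order[i % len(order)]
            i += 1
        slots[e] -= 1
        assignment[e] += 1
    return assignment

execs = ["e1", "e2", "e3", "e4", "e5"]
rr = assign(6, execs, 4, packed=False)  # spreads load: every executor busy
pk = assign(6, execs, 4, packed=True)   # packs load: e1 and e2 busy, e3-e5 idle
```

The packed result shows the saving: the same 6 tasks fit on 2 executors, so the remaining 3 could be returned to the cluster.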
[jira] [Assigned] (SPARK-17985) Bump commons-lang3 version to 3.5.
[ https://issues.apache.org/jira/browse/SPARK-17985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17985: Assignee: Takuya Ueshin (was: Apache Spark) > Bump commons-lang3 version to 3.5. > -- > > Key: SPARK-17985 > URL: https://issues.apache.org/jira/browse/SPARK-17985 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > Fix For: 2.1.0 > > > {{SerializationUtils.clone()}} of commons-lang3 (<3.5) has a bug that breaks > thread safety: it sometimes gets stuck due to a race condition when > initializing a hash map. > See https://issues.apache.org/jira/browse/LANG-1251. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17985) Bump commons-lang3 version to 3.5.
[ https://issues.apache.org/jira/browse/SPARK-17985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17985: Assignee: Apache Spark (was: Takuya Ueshin) > Bump commons-lang3 version to 3.5. > -- > > Key: SPARK-17985 > URL: https://issues.apache.org/jira/browse/SPARK-17985 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Takuya Ueshin >Assignee: Apache Spark > Fix For: 2.1.0 > > > {{SerializationUtils.clone()}} of commons-lang3 (<3.5) has a bug that breaks > thread safety: it sometimes gets stuck due to a race condition when > initializing a hash map. > See https://issues.apache.org/jira/browse/LANG-1251. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
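The LANG-1251 bug class — an unsynchronized check-then-initialize on a shared map — and its standard fix can be sketched in a few lines. This is an illustration of the pattern only, not commons-lang3's actual code:

```python
# Sketch: guarding one-time initialization of a shared structure with a
# lock, so concurrent readers can never observe (or race on) a partially
# initialized map. Without the lock, two threads can both see _cache as
# None and initialize it concurrently, which is the class of race that
# LANG-1251 describes.

import threading

class LazyRegistry:
    def __init__(self):
        self._lock = threading.Lock()
        self._cache = None
        self.init_count = 0   # instrumentation for the illustration

    def get(self):
        with self._lock:      # serializes the one-time initialization
            if self._cache is None:
                self.init_count += 1
                self._cache = {"primitives": ["int", "long", "boolean"]}
            return self._cache

reg = LazyRegistry()
threads = [threading.Thread(target=reg.get) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
```

With the lock in place, eight concurrent callers still produce exactly one initialization, which is the property the upgraded commons-lang3 restores.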
[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587239#comment-15587239 ] Dongjoon Hyun commented on SPARK-17982: --- I'll make a PR for this today. > Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: > Failed to analyze the canonicalized SQL. It is possible there is a bug in > Spark. > > > Key: SPARK-17982 > URL: https://issues.apache.org/jira/browse/SPARK-17982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark 2.0.0 >Reporter: Franck Tago > > The following statement fails in the spark shell . > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT > `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT > `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT > `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS > `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS > gen_subquery_1 > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at org.apache.spark.sql.Dataset.(Dataset.scala:186) > at org.apache.spark.sql.Dataset.(Dataset.scala:167) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) > This appears to be a limitation of the create view statement . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587222#comment-15587222 ] Dongjoon Hyun edited comment on SPARK-17982 at 10/19/16 12:55 AM: -- Sorry, [~tafra...@gmail.com]. Now, I understand what you meant by `limit`. The following is the simplified version of your case, isn't it? {code} scala> spark.version res0: String = 2.1.0-SNAPSHOT scala> sql("CREATE TABLE tbl(id INT)") res1: org.apache.spark.sql.DataFrame = [] scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl") res2: org.apache.spark.sql.DataFrame = [] scala> sql("CREATE VIEW v2 AS SELECT id FROM tbl limit 2") res3: org.apache.spark.sql.DataFrame = [] scala> sql("CREATE VIEW v3(id2) AS SELECT id FROM tbl limit 2") java.lang.RuntimeException: Failed to analyze the canonicalized SQL: ... {code} was (Author: dongjoon): Sorry, [~tafra...@gmail.com]. Now, I understand what you meant by `limit`. The following is the simplified version of your case, isn't it? {code} scala> spark.version res0: String = 2.1.0-SNAPSHOT scala> sql("CREATE TABLE tbl(id INT)") res1: org.apache.spark.sql.DataFrame = [] scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl") res2: org.apache.spark.sql.DataFrame = [] scala> sql("CREATE VIEW v2 AS SELECT id FROM tbl limit 2") res3: org.apache.spark.sql.DataFrame = [] scala> sql("CREATE VIEW v2(id2) AS SELECT id FROM tbl limit 2") java.lang.RuntimeException: Failed to analyze the canonicalized SQL: ... {code} > Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: > Failed to analyze the canonicalized SQL. It is possible there is a bug in > Spark. > > > Key: SPARK-17982 > URL: https://issues.apache.org/jira/browse/SPARK-17982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark 2.0.0 >Reporter: Franck Tago > > The following statement fails in the spark shell . 
> scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT > `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT > `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT > `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS > `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS > gen_subquery_1 > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at org.apache.spark.sql.Dataset.(Dataset.scala:186) > at 
org.apache.spark.sql.Dataset.(Dataset.scala:167) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) > This appears to be a limitation of the create view statement . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587222#comment-15587222 ] Dongjoon Hyun commented on SPARK-17982: --- Sorry, [~tafra...@gmail.com]. Now, I understand what you meant by `limit`. The following is the simplified version of your case, isn't it? {code} scala> spark.version res0: String = 2.1.0-SNAPSHOT scala> sql("CREATE TABLE tbl(id INT)") res1: org.apache.spark.sql.DataFrame = [] scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl") res2: org.apache.spark.sql.DataFrame = [] scala> sql("CREATE VIEW v2 AS SELECT id FROM tbl limit 2") res3: org.apache.spark.sql.DataFrame = [] scala> sql("CREATE VIEW v2(id2) AS SELECT id FROM tbl limit 2") java.lang.RuntimeException: Failed to analyze the canonicalized SQL: ... {code} > Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: > Failed to analyze the canonicalized SQL. It is possible there is a bug in > Spark. > > > Key: SPARK-17982 > URL: https://issues.apache.org/jira/browse/SPARK-17982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark 2.0.0 >Reporter: Franck Tago > > The following statement fails in the spark shell . 
> scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT > `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT > `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT > `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS > `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS > gen_subquery_1 > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at org.apache.spark.sql.Dataset.(Dataset.scala:186) > at 
org.apache.spark.sql.Dataset.(Dataset.scala:167) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) > This appears to be a limitation of the create view statement . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce
[ https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-14393: - Affects Version/s: 2.0.0 2.0.1 > monotonicallyIncreasingId not monotonically increasing with downstream > coalesce > --- > > Key: SPARK-14393 > URL: https://issues.apache.org/jira/browse/SPARK-14393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0, 2.0.1 >Reporter: Jason Piper > Labels: correctness > > When utilising monotonicallyIncreasingId with a coalesce, it appears that > every partition uses the same offset (0) leading to non-monotonically > increasing IDs. > See examples below > {code} > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show() > +---+ > |monotonicallyincreasingid()| > +---+ > |25769803776| > |51539607552| > |77309411328| > | 103079215104| > | 128849018880| > | 163208757248| > | 188978561024| > | 214748364800| > | 240518168576| > | 266287972352| > +---+ > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > +---+ > >>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 1| > | 0| > | 0| > | 1| > | 2| > | 3| > | 0| > | 1| > | 2| > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce
[ https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14393: -- Labels: correctness (was: ) > monotonicallyIncreasingId not monotonically increasing with downstream > coalesce > --- > > Key: SPARK-14393 > URL: https://issues.apache.org/jira/browse/SPARK-14393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jason Piper > Labels: correctness > > When utilising monotonicallyIncreasingId with a coalesce, it appears that > every partition uses the same offset (0) leading to non-monotonically > increasing IDs. > See examples below > {code} > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show() > +---+ > |monotonicallyincreasingid()| > +---+ > |25769803776| > |51539607552| > |77309411328| > | 103079215104| > | 128849018880| > | 163208757248| > | 188978561024| > | 214748364800| > | 240518168576| > | 266287972352| > +---+ > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > +---+ > >>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 1| > | 0| > | 0| > | 1| > | 2| > | 3| > | 0| > | 1| > | 2| > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
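The reported values follow from how the id is constructed: the partition index goes into the upper bits and a per-partition row counter into the lower 33 bits. A sketch of the formula (the helper name is ours, not Spark's):

```python
# monotonically_increasing_id packs the partition index above a 33-bit
# per-partition row counter:

def mono_id(partition_index: int, row_in_partition: int) -> int:
    return (partition_index << 33) | row_in_partition

# First value in the reporter's output is row 0 of partition 3:
first = mono_id(3, 0)   # 25769803776, matching the output above

# The bug: after coalesce(1), every upstream partition evaluates the
# expression as if its partition index were 0, so the offsets collapse
# and the per-partition counters collide, e.g. 0, 1, 0, 0, 1, 2, 3, ...
```

Because the offsets are partition-index multiples of 2^33, ids are only guaranteed monotonically increasing within a partition, and any operation that confuses the partition index breaks even that guarantee.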
[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587124#comment-15587124 ] Dongjoon Hyun commented on SPARK-17982: --- I'll investigate it for you, [~tafra...@gmail.com]. > Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: > Failed to analyze the canonicalized SQL. It is possible there is a bug in > Spark. > > > Key: SPARK-17982 > URL: https://issues.apache.org/jira/browse/SPARK-17982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark 2.0.0 >Reporter: Franck Tago > > The following statement fails in the spark shell . > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT > `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT > `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT > `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS > `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS > gen_subquery_1 > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) > This appears to be a limitation of the CREATE VIEW statement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17994) Add back a file status cache for catalog tables
[ https://issues.apache.org/jira/browse/SPARK-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587106#comment-15587106 ] Apache Spark commented on SPARK-17994: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/15539 > Add back a file status cache for catalog tables > --- > > Key: SPARK-17994 > URL: https://issues.apache.org/jira/browse/SPARK-17994 > Project: Spark > Issue Type: Sub-task >Reporter: Eric Liang > > In SPARK-16980, we removed the full in-memory cache of table partitions in > favor of loading only needed partitions from the metastore. This greatly > improves the initial latency of queries that only read a small fraction of > table partitions. > However, since the metastore does not store file statistics, we need to > discover those from remote storage. With the loss of the in-memory file > status cache this has to happen on each query, increasing the latency of > repeated queries over the same partitions. > The proposal is to add back a per-table cache of partition contents, i.e. > Map[Path, Array[FileStatus]]. This cache would be retained per-table, and can > be invalidated through refreshTable() and refreshByPath(). Unlike the prior > cache, it can be incrementally updated as new partitions are read. > cc [~michael] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
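The proposed per-table Map[Path, Array[FileStatus]] cache can be sketched roughly as follows. This is a concept sketch in Python rather than Scala, for brevity; the class and method names (FileStatusCache, get_or_list, refresh_table) are illustrative assumptions, not Spark's actual API.

```python
# Hypothetical sketch of the proposed per-table file-status cache:
# entries are filled incrementally as partitions are read, and a whole
# table's listings can be invalidated (the refreshTable() semantics).

class FileStatusCache:
    def __init__(self):
        # table name -> {partition path -> list of file statuses}
        self._tables = {}

    def get_or_list(self, table, path, list_files):
        """Return cached file statuses for a partition path, calling the
        expensive remote listing only on a cache miss (incremental update)."""
        partitions = self._tables.setdefault(table, {})
        if path not in partitions:
            partitions[path] = list_files(path)  # remote storage call
        return partitions[path]

    def refresh_table(self, table):
        """Invalidate all cached partition listings for one table."""
        self._tables.pop(table, None)


calls = []
def fake_lister(path):
    calls.append(path)
    return [f"{path}/part-00000"]

cache = FileStatusCache()
cache.get_or_list("t", "/data/t/p=1", fake_lister)
cache.get_or_list("t", "/data/t/p=1", fake_lister)   # served from cache
assert calls == ["/data/t/p=1"]
cache.refresh_table("t")
cache.get_or_list("t", "/data/t/p=1", fake_lister)   # listed again
assert calls == ["/data/t/p=1", "/data/t/p=1"]
```

The key property, matching the proposal, is that repeated queries over the same partitions hit the cache, while refreshTable()/refreshByPath() can drop stale entries per table.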
[jira] [Assigned] (SPARK-17994) Add back a file status cache for catalog tables
[ https://issues.apache.org/jira/browse/SPARK-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17994: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-17994) Add back a file status cache for catalog tables
[ https://issues.apache.org/jira/browse/SPARK-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17994: Assignee: Apache Spark
[jira] [Assigned] (SPARK-17993) Spark spews a slew of harmless but annoying warning messages from Parquet when reading parquet files written by older versions of Parquet-mr
[ https://issues.apache.org/jira/browse/SPARK-17993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17993: Assignee: Apache Spark > Spark spews a slew of harmless but annoying warning messages from Parquet > when reading parquet files written by older versions of Parquet-mr > > > Key: SPARK-17993 > URL: https://issues.apache.org/jira/browse/SPARK-17993 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Allman >Assignee: Apache Spark > > It looks like https://github.com/apache/spark/pull/14690 broke parquet log > output redirection. After that patch, when querying parquet files written by > Parquet-mr 1.6.0 Spark prints a torrent of (harmless) warning messages from > the Parquet reader: > {code} > Oct 18, 2016 7:42:18 PM WARNING: org.apache.parquet.CorruptStatistics: > Ignoring statistics because created_by could not be parsed (see PARQUET-251): > parquet-mr version 1.6.0 > org.apache.parquet.VersionParser$VersionParseException: Could not parse > created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) > )?\(build ?(.*)\) > at org.apache.parquet.VersionParser.parse(VersionParser.java:112) > at > org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60) > at > org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263) > at > org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:583) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:513) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225) > at > 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:162) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 
{code} > This only happens during execution, not planning, and it doesn't matter what > log level the {{SparkContext}} is set to. > This is a regression I noted as something we needed to fix as a follow up to > PR 14690. I feel responsible, so I'm going to expedite a fix for it. I > suspect that PR broke Spark's Parquet log output redirection. That's the > premise I'm going by. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
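The regression is about where a noisy third-party logger's output ends up. As a rough analogy only (Spark's actual fix concerns redirecting Parquet's java.util.logging output, not Python), the general pattern of capturing and then silencing a harmless-but-chatty logger can be shown with Python's logging module; the logger name "parquet.CorruptStatistics" below is illustrative, not a real Python package.

```python
import io
import logging

# Analogy for log-output redirection: attach our own handler to a noisy
# third-party logger, stop it propagating to the root handlers, then raise
# its level so harmless warnings no longer reach the console.

noisy = logging.getLogger("parquet.CorruptStatistics")  # illustrative name
buf = io.StringIO()
noisy.addHandler(logging.StreamHandler(buf))
noisy.propagate = False          # keep it out of the root logger's output

noisy.warning("Ignoring statistics because created_by could not be parsed")
assert "created_by" in buf.getvalue()   # warning currently visible

noisy.setLevel(logging.ERROR)    # suppress WARNING and below
buf.truncate(0); buf.seek(0)
noisy.warning("Ignoring statistics ...")
assert buf.getvalue() == ""      # harmless warning silenced
```

In the JVM the equivalent knobs live in java.util.logging (Logger handlers and levels), which is what the redirection broken by PR 14690 manipulated.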
[jira] [Assigned] (SPARK-17993) Spark spews a slew of harmless but annoying warning messages from Parquet when reading parquet files written by older versions of Parquet-mr
[ https://issues.apache.org/jira/browse/SPARK-17993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17993: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-17993) Spark spews a slew of harmless but annoying warning messages from Parquet when reading parquet files written by older versions of Parquet-mr
[ https://issues.apache.org/jira/browse/SPARK-17993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586987#comment-15586987 ] Apache Spark commented on SPARK-17993: -- User 'mallman' has created a pull request for this issue: https://github.com/apache/spark/pull/15538
[jira] [Commented] (SPARK-17630) jvm-exit-on-fatal-error handler for spark.rpc.netty like there is available for akka
[ https://issues.apache.org/jira/browse/SPARK-17630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586979#comment-15586979 ] Shixiong Zhu commented on SPARK-17630: -- Yeah, I think we can set up SparkUncaughtExceptionHandler for R and Python users. > jvm-exit-on-fatal-error handler for spark.rpc.netty like there is available > for akka > > > Key: SPARK-17630 > URL: https://issues.apache.org/jira/browse/SPARK-17630 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Mario Briggs > Attachments: SecondCodePath.txt, firstCodepath.txt > > > Hi, > I have two code paths in my app that result in a JVM OOM. > In the first code path, 'akka.jvm-exit-on-fatal-error' kicks in and shuts > down the JVM, so the caller (Py4J) gets notified with a proper stack trace. > Attached stack-trace file (firstCodepath.txt) > In the second code path (rpc.netty), no such handler kicks in to shut down the > JVM, so the caller does not get notified. > Attached stack-trace file (SecondCodepath.txt) > Is it possible to have a JVM exit handler for the rpc.netty path?
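The behaviour being asked for, which SparkUncaughtExceptionHandler provides on the JVM, is: when an uncaught error is fatal (e.g. OutOfMemoryError), halt the whole process so external callers such as Py4J observe the failure instead of hanging. A minimal concept sketch in Python follows; the fatal-error classification and the exit code are illustrative assumptions, not Spark's actual values.

```python
import sys

# Sketch of a SparkUncaughtExceptionHandler-style hook. On a fatal error
# (the JVM analogue is OutOfMemoryError) it halts the process so the
# failure is observable from outside; otherwise it defers to the default
# handler. The error set and exit code below are hypothetical.

FATAL_ERRORS = (MemoryError, SystemError)
OOM_EXIT_CODE = 52   # hypothetical "died from fatal error" code

def is_fatal(exc: BaseException) -> bool:
    return isinstance(exc, FATAL_ERRORS)

def uncaught_exception_handler(exc_type, exc, tb):
    if is_fatal(exc):
        # In the JVM handler this is a System.exit(...) / Runtime.halt(...):
        # bring the process down rather than leaving it wedged.
        sys.stderr.write(f"fatal uncaught error: {exc!r}\n")
        raise SystemExit(OOM_EXIT_CODE)
    sys.__excepthook__(exc_type, exc, tb)   # non-fatal: default behaviour

# Installing it process-wide would be:
#   sys.excepthook = uncaught_exception_handler
assert is_fatal(MemoryError())
assert not is_fatal(ValueError("recoverable"))
```

The point of the issue is that the akka RPC path installed such a handler while the rpc.netty path did not, so one code path died visibly and the other hung silently.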
[jira] [Commented] (SPARK-17995) Use new attributes for columns from outer joins
[ https://issues.apache.org/jira/browse/SPARK-17995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586972#comment-15586972 ] Ryan Blue commented on SPARK-17995: --- [~cloud_fan] and [~yhuai], I'd like to help fix this, but I'm not sure of the best way. I started to write an analyzer rule that uses transformUp on the initial logical plan, before unresolved aliases are resolved. That rule would find outer joins and generate a map from each attribute to replace to the new attribute above the join, with the schema updated to be nullable and with a new exprId. The attributes to replace come from the output of the outer join. Where I ran into trouble was in replacing the attributes in the logical plan above the join. I don't think it is a good idea to have cases in the rule for every possible plan, so I think we need a method to substitute attributes that is implemented by nodes in the plan. That sounds like a larger patch than I originally thought, so I wanted to make sure I'm going down the right path before I put up a PR for it. What do you think? > Use new attributes for columns from outer joins > --- > > Key: SPARK-17995 > URL: https://issues.apache.org/jira/browse/SPARK-17995 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 2.0.0, 2.1.0 >Reporter: Ryan Blue > > Plans involving outer joins use the same attribute reference (by exprId) to > reference columns above the join and below the join. This is a false > equivalence that leads to bugs like SPARK-16181, in which attributes were > incorrectly replaced by the optimizer. The column has a different schema > above the outer join because its values may be null. The fix for that issue, > [PR #13884](https://github.com/apache/spark/pull/13884), has a TODO comment > from [~cloud_fan] to fix this by using different attributes instead of > needing to special-case outer joins in rules, and this issue is to track that > improvement.
[jira] [Created] (SPARK-17995) Use new attributes for columns from outer joins
Ryan Blue created SPARK-17995: - Summary: Use new attributes for columns from outer joins Key: SPARK-17995 URL: https://issues.apache.org/jira/browse/SPARK-17995 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0, 1.6.2, 2.1.0 Reporter: Ryan Blue Plans involving outer joins use the same attribute reference (by exprId) to reference columns above the join and below the join. This is a false equivalence that leads to bugs like SPARK-16181, in which attributes were incorrectly replaced by the optimizer. The column has a different schema above the outer join because its values may be null. The fix for that issue, [PR #13884](https://github.com/apache/spark/pull/13884), has a TODO comment from [~cloud_fan] to fix this by using different attributes instead of needing to special-case outer joins in rules, and this issue is to track that improvement.
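The proposed fix can be illustrated with a toy model. This is not Catalyst code; the dict-based "attributes" and helper names below are a deliberately simplified Python sketch of the idea: above an outer join, columns from the null-producing side get fresh attributes (new exprId, nullable schema), and references above the join are rewritten through a substitution map instead of reusing the same exprId with a silently different schema.

```python
import itertools

# Toy model of Catalyst attributes: (name, expr_id, nullable).
_ids = itertools.count(1)

def attr(name, nullable=False):
    return {"name": name, "expr_id": next(_ids), "nullable": nullable}

def new_nullable_copy(a):
    """Fresh attribute for use above the outer join: new exprId, nullable."""
    return {"name": a["name"], "expr_id": next(_ids), "nullable": True}

def remap_outer_join_output(null_side_attrs):
    """old exprId -> fresh nullable attribute, for the side of an outer
    join that may produce nulls."""
    return {a["expr_id"]: new_nullable_copy(a) for a in null_side_attrs}

def substitute(refs, mapping):
    """Rewrite attribute references above the join via the map; attributes
    not in the map pass through unchanged."""
    return [mapping.get(a["expr_id"], a) for a in refs]

b = attr("b")                        # from the right side of a LEFT OUTER JOIN
mapping = remap_outer_join_output([b])
above = substitute([b], mapping)[0]

assert above["nullable"] and not b["nullable"]  # schema differs above the join
assert above["expr_id"] != b["expr_id"]         # no false exprId equivalence
```

The part Ryan Blue identifies as hard, performing this substitution generically for every plan node rather than case-by-case, corresponds here to applying substitute() uniformly over whatever references each node holds.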
[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586918#comment-15586918 ] Franck Tago commented on SPARK-17982: - Updated the title of the Jira. I looked at the views.scala file, and I want to know whether setting the flag spark.sql.nativeView.canonical to false is an acceptable workaround. I tested it and it works, but is that an acceptable workaround?
[jira] [Updated] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-17982: Summary: Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark. (was: Spark 2.0.0 CREATE VIEW statement fails when select statement contains limit clause)
[jira] [Commented] (SPARK-17993) Spark spews a slew of harmless but annoying warning messages from Parquet when reading parquet files written by older versions of Parquet-mr
[ https://issues.apache.org/jira/browse/SPARK-17993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586897#comment-15586897 ] Michael Allman commented on SPARK-17993: cc [~ekhliang] I think I have a fix for this. I'm going to submit a PR shortly.
{code} > This only happens during execution, not planning, and it doesn't matter what > log level the {{SparkContext}} is set to. > This is a regression I noted as something we needed to fix as a follow up to > PR 14690. I feel responsible, so I'm going to expedite a fix for it. I > suspect that PR broke Spark's Parquet log output redirection. That's the > premise I'm going by. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail:
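Until the fix lands, one hedged user-side workaround (an assumption: it relies on the Parquet reader logging through java.util.logging, which the redirection code in question targets) is to raise the JUL log level for the Parquet packages:

```scala
import java.util.logging.{Level, Logger}

// Workaround sketch (assumption: Parquet logs via java.util.logging here).
// Raising the level to SEVERE drops the harmless CorruptStatistics warnings.
// Keep a reference to the logger: JUL holds loggers only weakly, so an
// un-referenced logger can be garbage-collected and its level reset.
val parquetLogger: Logger = Logger.getLogger("org.apache.parquet")
parquetLogger.setLevel(Level.SEVERE)
```

Note this silences all Parquet warnings below SEVERE on the JVM it runs in, so it would need to run on executors (not just the driver) to suppress the task-side output shown above.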
[jira] [Commented] (SPARK-17992) HiveClient.getPartitionsByFilter throws an exception for some unsupported filters when hive.metastore.try.direct.sql=false
[ https://issues.apache.org/jira/browse/SPARK-17992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586892#comment-15586892 ] Michael Allman commented on SPARK-17992: cc [~ekhliang] [~cloud_fan] > HiveClient.getPartitionsByFilter throws an exception for some unsupported > filters when hive.metastore.try.direct.sql=false > -- > > Key: SPARK-17992 > URL: https://issues.apache.org/jira/browse/SPARK-17992 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Allman > > We recently added (and enabled by default) table partition pruning for > partitioned Hive tables converted to using {{TableFileCatalog}}. When the > Hive configuration option {{hive.metastore.try.direct.sql}} is set to > {{false}}, Hive will throw an exception for unsupported filter expressions. > For example, attempting to filter on an integer partition column will throw a > {{org.apache.hadoop.hive.metastore.api.MetaException}}. > I discovered this behavior because VideoAmp uses the CDH version of Hive with > a PostgreSQL metastore DB. In this configuration, CDH sets > {{hive.metastore.try.direct.sql}} to {{false}} by default, and queries that > filter on a non-string partition column will fail. That would be a rather > rude surprise for these Spark 2.1 users... > I'm not sure exactly what behavior we should expect, but I suggest that > {{HiveClientImpl.getPartitionsByFilter}} catch this metastore exception and > return all partitions instead. This is what Spark does for users of Hive 0.12, > which does not support this feature at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
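The suggested fallback could look roughly like the following sketch. The function and parameter names are illustrative only, not the actual HiveClientImpl internals:

```scala
import scala.util.control.NonFatal

// Illustrative-only sketch of the proposed behavior: try the metastore-side
// filter first; if the metastore rejects it (e.g. a MetaException when
// hive.metastore.try.direct.sql=false), fall back to fetching all partitions
// and let Spark prune them on the driver instead.
def partitionsByFilterWithFallback[P](
    tryFiltered: () => Seq[P],
    allPartitions: () => Seq[P]): Seq[P] = {
  try tryFiltered()
  catch {
    case NonFatal(e) =>
      // A real implementation would log a warning with the failed filter here.
      allPartitions()
  }
}
```

The trade-off is performance, not correctness: fetching all partitions is slower for large tables, but it matches the behavior Spark already accepts for Hive 0.12.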
[jira] [Commented] (SPARK-17990) ALTER TABLE ... ADD PARTITION does not play nice with mixed-case partition column names
[ https://issues.apache.org/jira/browse/SPARK-17990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586888#comment-15586888 ] Michael Allman commented on SPARK-17990: CC [~ekhliang] [~cloud_fan] > ALTER TABLE ... ADD PARTITION does not play nice with mixed-case partition > column names > --- > > Key: SPARK-17990 > URL: https://issues.apache.org/jira/browse/SPARK-17990 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: Linux > Mac OS with a case-sensitive filesystem >Reporter: Michael Allman > > Writing partition data to an external table's file location and then adding > those as table partition metadata is a common use case. However, for tables > with partition column names with upper case letters, the SQL command {{ALTER > TABLE ... ADD PARTITION}} does not work, as illustrated in the following > example: > {code} > scala> sql("create external table mixed_case_partitioning (a bigint) > PARTITIONED BY (partCol bigint) STORED AS parquet LOCATION > '/tmp/mixed_case_partitioning'") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sqlContext.range(10).selectExpr("id as a", "id as > partCol").write.partitionBy("partCol").mode("overwrite").parquet("/tmp/mixed_case_partitioning") > {code} > At this point, doing a {{hadoop fs -ls /tmp/mixed_case_partitioning}} > produces the following: > {code} > [msa@jupyter ~]$ hadoop fs -ls /tmp/mixed_case_partitioning > Found 11 items > -rw-r--r-- 3 msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/_SUCCESS > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=0 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=1 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=2 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=3 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=4 > drwxr-xr-x - msa 
supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=5 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=6 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=7 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=8 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=9 > {code} > Returning to the Spark shell, we execute the following to add the partition > metadata: > {code} > scala> (0 to 9).foreach { p => sql(s"alter table mixed_case_partitioning add > partition(partCol=$p)") } > {code} > Examining the HDFS file listing again, we see: > {code} > [msa@jupyter ~]$ hadoop fs -ls /tmp/mixed_case_partitioning > Found 21 items > -rw-r--r-- 3 msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/_SUCCESS > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=0 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=1 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=2 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=3 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=4 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=5 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=6 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=7 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=8 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=9 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:53 > /tmp/mixed_case_partitioning/partcol=0 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:53 > /tmp/mixed_case_partitioning/partcol=1 > drwxr-xr-x - msa supergroup 0 
2016-10-18 17:53 > /tmp/mixed_case_partitioning/partcol=2 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:53 > /tmp/mixed_case_partitioning/partcol=3 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:53 > /tmp/mixed_case_partitioning/partcol=4 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:53 > /tmp/mixed_case_partitioning/partcol=5 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:53 > /tmp/mixed_case_partitioning/partcol=6 > drwxr-xr-x
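Until this is fixed, one workaround sketch (an inference from the behavior above, not a documented recommendation) is to use an all-lowercase partition column name, so the directories Spark writes match the ones the metastore records:

```scala
// Workaround sketch: with an all-lowercase partition column the paths written
// by partitionBy(...) and the ones ALTER TABLE ... ADD PARTITION creates
// (after the metastore lowercases the column name) coincide.
sql("""create external table lower_case_partitioning (a bigint)
       PARTITIONED BY (partcol bigint) STORED AS parquet
       LOCATION '/tmp/lower_case_partitioning'""")
spark.range(10).selectExpr("id as a", "id as partcol")
  .write.partitionBy("partcol").mode("overwrite")
  .parquet("/tmp/lower_case_partitioning")
(0 to 9).foreach { p =>
  sql(s"alter table lower_case_partitioning add partition(partcol=$p)")
}
```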
[jira] [Commented] (SPARK-17711) Compress rolled executor logs
[ https://issues.apache.org/jira/browse/SPARK-17711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586842#comment-15586842 ] Apache Spark commented on SPARK-17711: -- User 'loneknightpy' has created a pull request for this issue: https://github.com/apache/spark/pull/15537 > Compress rolled executor logs > - > > Key: SPARK-17711 > URL: https://issues.apache.org/jira/browse/SPARK-17711 > Project: Spark > Issue Type: New Feature >Reporter: Yu Peng > Fix For: 2.0.2, 2.2.0 > > > Currently, rolled executor logs are not compressed. If the executor produces > a lot of logs, it may consume all executor disk space and fail the task. With > this feature, the executor will compress rolled stderr/stdout logs, as log4j does, > to reduce disk usage. > [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
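For context, these are the existing executor log rolling settings that the new compression behavior would build on. This is a configuration sketch; the ticket does not name the new compression property, so none is shown:

```scala
import org.apache.spark.SparkConf

// Existing size-based rolling configuration for executor stderr/stdout;
// the feature in this ticket adds compression of the rolled files on top.
val conf = new SparkConf()
  .set("spark.executor.logs.rolling.strategy", "size")
  .set("spark.executor.logs.rolling.maxSize", "134217728")      // bytes per rolled file
  .set("spark.executor.logs.rolling.maxRetainedFiles", "8")     // older files are deleted
```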
[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails when select statement contains limit clause
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586829#comment-15586829 ] Dongjoon Hyun commented on SPARK-17982: --- Hi, [~tafra...@gmail.com]. Could you update your JIRA title? As you see in [~jiangxb]'s example, `limit` is not a problem. > Spark 2.0.0 CREATE VIEW statement fails when select statement contains limit > clause > > > Key: SPARK-17982 > URL: https://issues.apache.org/jira/browse/SPARK-17982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark 2.0.0 >Reporter: Franck Tago > > The following statement fails in the spark shell . > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT > `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT > `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT > `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS > `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS > gen_subquery_1 > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) > This appears to be a limitation of the CREATE VIEW statement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12149) Executor UI improvement suggestions - Color UI
[ https://issues.apache.org/jira/browse/SPARK-12149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586725#comment-15586725 ] Reynold Xin edited comment on SPARK-12149 at 10/18/16 9:31 PM: --- So I looked at the UI today and I have to say the color code was *extremely confusing*. On a normal cluster I saw a bunch of reds, which is usually reserved for errors. (I'm going to leave a comment on the JIRA ticket too) was (Author: rxin): So I looked at the UI today and I have to say the color code is extremely confusing. On a normal cluster I saw a bunch of reds, which is usually reserved for errors. (I'm going to leave a comment on the JIRA ticket too) > Executor UI improvement suggestions - Color UI > -- > > Key: SPARK-12149 > URL: https://issues.apache.org/jira/browse/SPARK-12149 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Alex Bozarth >Assignee: Alex Bozarth > Fix For: 2.0.0 > > > Splitting off the Color UI portion of the parent UI improvements task, > description copied below: > Fill some of the cells with color in order to make it easier to absorb the > info, e.g. > RED if Failed Tasks greater than 0 (maybe the more failed, the more intense > the red) > GREEN if Active Tasks greater than 0 (maybe more intense the larger the > number) > Possibly color code COMPLETE TASKS using various shades of blue (e.g., based > on the log(# completed) > if dark blue then write the value in white (same for the RED and GREEN above > Merging another idea from SPARK-2132: > Color GC time red when over a percentage of task time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12149) Executor UI improvement suggestions - Color UI
[ https://issues.apache.org/jira/browse/SPARK-12149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586725#comment-15586725 ] Reynold Xin commented on SPARK-12149: - So I looked at the UI today and I have to say the color code is extremely confusing. On a normal cluster I saw a bunch of reds, which is usually reserved for errors. (I'm going to leave a comment on the JIRA ticket too) > Executor UI improvement suggestions - Color UI > -- > > Key: SPARK-12149 > URL: https://issues.apache.org/jira/browse/SPARK-12149 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Alex Bozarth >Assignee: Alex Bozarth > Fix For: 2.0.0 > > > Splitting off the Color UI portion of the parent UI improvements task, > description copied below: > Fill some of the cells with color in order to make it easier to absorb the > info, e.g. > RED if Failed Tasks greater than 0 (maybe the more failed, the more intense > the red) > GREEN if Active Tasks greater than 0 (maybe more intense the larger the > number) > Possibly color code COMPLETE TASKS using various shades of blue (e.g., based > on the log(# completed) > if dark blue then write the value in white (same for the RED and GREEN above > Merging another idea from SPARK-2132: > Color GC time red when over a percentage of task time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15708) Tasks table in Detailed Stage page shows ip instead of hostname under Executor ID/Host
[ https://issues.apache.org/jira/browse/SPARK-15708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586703#comment-15586703 ] Alex Bozarth commented on SPARK-15708: -- I have not seen this recently, so it was possibly incidentally fixed. I had this on my back burner and did the above research before seeing it had been closed, so I wanted to share it for future reference. > Tasks table in Detailed Stage page shows ip instead of hostname under > Executor ID/Host > -- > > Key: SPARK-15708 > URL: https://issues.apache.org/jira/browse/SPARK-15708 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Thomas Graves >Priority: Minor > > If you go to the detailed stage page in Spark 2.0, the Executor ID/Host > column of the Tasks table shows the host as an IP address rather than a > fully qualified hostname. > The table above it (Aggregated Metrics by Executor) shows the "Address" as > the full hostname. > I'm running Spark on YARN on the latest branch-2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13275) With dynamic allocation, executors appear to be added before job starts
[ https://issues.apache.org/jira/browse/SPARK-13275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13275: Assignee: (was: Apache Spark) > With dynamic allocation, executors appear to be added before job starts > --- > > Key: SPARK-13275 > URL: https://issues.apache.org/jira/browse/SPARK-13275 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Stephanie Bodoff >Priority: Minor > Attachments: webui.png > > > When I look at the timeline in the Spark Web UI I see the job starting and > then executors being added. The blue lines and dots hitting the timeline show > that the executors were added after the job started. But the way the Executor > box is rendered it looks like the executors started before the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13275) With dynamic allocation, executors appear to be added before job starts
[ https://issues.apache.org/jira/browse/SPARK-13275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586680#comment-15586680 ] Apache Spark commented on SPARK-13275: -- User 'ajbozarth' has created a pull request for this issue: https://github.com/apache/spark/pull/15536 > With dynamic allocation, executors appear to be added before job starts > --- > > Key: SPARK-13275 > URL: https://issues.apache.org/jira/browse/SPARK-13275 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Stephanie Bodoff >Priority: Minor > Attachments: webui.png > > > When I look at the timeline in the Spark Web UI I see the job starting and > then executors being added. The blue lines and dots hitting the timeline show > that the executors were added after the job started. But the way the Executor > box is rendered it looks like the executors started before the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13275) With dynamic allocation, executors appear to be added before job starts
[ https://issues.apache.org/jira/browse/SPARK-13275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13275: Assignee: Apache Spark > With dynamic allocation, executors appear to be added before job starts > --- > > Key: SPARK-13275 > URL: https://issues.apache.org/jira/browse/SPARK-13275 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Stephanie Bodoff >Assignee: Apache Spark >Priority: Minor > Attachments: webui.png > > > When I look at the timeline in the Spark Web UI I see the job starting and > then executors being added. The blue lines and dots hitting the timeline show > that the executors were added after the job started. But the way the Executor > box is rendered it looks like the executors started before the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15769) Add Encoder for input type to Aggregator
[ https://issues.apache.org/jira/browse/SPARK-15769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586654#comment-15586654 ] Koert Kuipers commented on SPARK-15769: --- What I really want is to be able to use an Aggregator on a DataFrame without having to resort to taking apart Row. That way we can develop all general aggregation algorithms as Aggregators and use them across the board in spark-sql. To be specific: in RelationalGroupedDataset I want to be able to apply an Aggregator to input specified by column names, with the input converted to the Aggregator's input type, just like we do for UDFs and such. An example (taken from my pull request), where ComplexResultAgg is an Aggregator[(String, Long), (Long, Long), (Long, Long)]: val df3 = Seq(("a", "x", 1), ("a", "y", 3), ("b", "x", 3)).toDF("i", "j", "k") df3.groupBy("i").agg(ComplexResultAgg("i", "k")) This applies the Aggregator to columns "i" and "k". I found creating an input deserializer from the encoder to be the easiest way to make this all work; my pull request also removes a lot of ad-hoc machinery (all the withInputType stuff), which suggests to me it's cleaner. I also like how this catches mistakes earlier (because you need an implicit encoder) versus storing TypeTags etc. and creating deserializer/converter expressions at runtime. But maybe I am misunderstanding encoders, and the input data type is all we need. > Add Encoder for input type to Aggregator > > > Key: SPARK-15769 > URL: https://issues.apache.org/jira/browse/SPARK-15769 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: koert kuipers >Priority: Minor > > Currently org.apache.spark.sql.expressions.Aggregator has Encoders for its > buffer and output type, but not for its input type. The thought is that the > input type is known from the Dataset it operates on and hence can be inserted > later. 
> However I think there are compelling reasons to have Aggregator carry an > Encoder for its input type: > * Generally transformations on Dataset only require the Encoder for the > result type since the input type is exactly known and its Encoder is already > available within the Dataset. However this is not the case for an Aggregator: > an Aggregator is defined independently of a Dataset, and I think it should be > generally desirable that an Aggregator work on any type that can safely be > cast to the Aggregator's input type (for example an Aggregator that has Long > as input should work on a Dataset of Ints). > * Aggregators should also work on DataFrames, because it's a much nicer API to > use than UserDefinedAggregateFunction. And when operating on DataFrames you > should not have to use Row objects, which means your input type is not equal > to the type of the Dataset you operate on (so the Encoder of the Dataset that > is operated on should not be used as input Encoder for the Aggregator). > * Adding an input Encoder is not a big burden, since it can typically be > created implicitly > * It removes TypedColumn.withInputType and its usage in Dataset, > KeyValueGroupedDataset and RelationalGroupedDataset, which always felt > somewhat ad-hoc to me > * Once an Aggregator has an Encoder for its input type it is a small change > to make the Aggregator also work on a subset of columns in a DataFrame, which > facilitates Aggregator re-use since you don't have to write a custom > Aggregator to extract the columns from a specific DataFrame. This also > enables a usage that is more typical within a DataFrame context, very similar > to how a UserDefinedAggregateFunction is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
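A sketch of what the proposal might look like in user code. The `inputEncoder` member is hypothetical; today's `Aggregator` only declares `bufferEncoder` and `outputEncoder`:

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical shape: an Aggregator that also carries an Encoder for its
// input type, so it could be applied to named DataFrame columns directly
// instead of requiring the enclosing Dataset's encoder to be injected later.
object LongSum extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(b: Long, a: Long): Long = b + a
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(b: Long): Long = b
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
  // Proposed addition (does not exist in Spark 2.0's Aggregator):
  def inputEncoder: Encoder[Long] = Encoders.scalaLong
}
```

With an input encoder in hand, `df.groupBy("i").agg(LongSum("k"))`-style usage becomes possible because the framework knows how to deserialize column "k" into the aggregator's `Long` input without going through `Row`.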
[jira] [Resolved] (SPARK-17841) Kafka 0.10 commitQueue needs to be drained
[ https://issues.apache.org/jira/browse/SPARK-17841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-17841. - Resolution: Fixed Assignee: Cody Koeninger Fix Version/s: 2.1.0 2.0.2 > Kafka 0.10 commitQueue needs to be drained > -- > > Key: SPARK-17841 > URL: https://issues.apache.org/jira/browse/SPARK-17841 > Project: Spark > Issue Type: Bug >Reporter: Cody Koeninger >Assignee: Cody Koeninger > Fix For: 2.0.2, 2.1.0 > > > The current implementation just iterates over the queue instead of polling > and removing entries, so the queue is never actually drained. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
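The fix amounts to draining with `poll()` instead of iterating. A minimal sketch, assuming the queue is a `ConcurrentLinkedQueue`; the element type is illustrative, not the actual DStream internals:

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// poll() both returns and removes the head of the queue, so repeated polling
// actually drains it; merely iterating leaves every element in place, and the
// same offsets get committed again on the next batch.
val commitQueue = new ConcurrentLinkedQueue[(String, Long)]() // (partition, offset): illustrative

def drainCommitQueue(): Map[String, Long] = {
  var acc = Map.empty[String, Long]
  var next = commitQueue.poll()
  while (next != null) {
    acc += next // a later offset for the same partition overwrites an earlier one
    next = commitQueue.poll()
  }
  acc
}
```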
[jira] [Comment Edited] (SPARK-17985) Bump commons-lang3 version to 3.5.
[ https://issues.apache.org/jira/browse/SPARK-17985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586644#comment-15586644 ] Reynold Xin edited comment on SPARK-17985 at 10/18/16 8:59 PM: --- The patch was reverted due to build failures in Hadoop 2.2: Actually I'm seeing the following exceptions locally as well as on Jenkins for Hadoop 2.2: {noformat} [error] /scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1489: value read is not a member of object org.apache.commons.io.IOUtils [error] var numBytes = IOUtils.read(gzInputStream, buf) [error] ^ [error] /scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1492: value read is not a member of object org.apache.commons.io.IOUtils [error] numBytes = IOUtils.read(gzInputStream, buf) [error]^ {noformat} I'm going to revert the patch for now. was (Author: rxin): The patch was reverted due to build failures in Hadoop 2.2: Actually I'm seeing the following exceptions locally as well as on Jenkins for Hadoop 2.2: ``` [error] /scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1489: value read is not a member of object org.apache.commons.io.IOUtils [error] var numBytes = IOUtils.read(gzInputStream, buf) [error] ^ [error] /scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1492: value read is not a member of object org.apache.commons.io.IOUtils [error] numBytes = IOUtils.read(gzInputStream, buf) [error]^ ``` I'm going to revert the patch for now. > Bump commons-lang3 version to 3.5. > -- > > Key: SPARK-17985 > URL: https://issues.apache.org/jira/browse/SPARK-17985 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > Fix For: 2.1.0 > > > {{SerializationUtils.clone()}} of commons-lang3 (<3.5) has a bug that breaks > thread safety: it sometimes gets stuck due to a race condition when > initializing a hash map. 
> See https://issues.apache.org/jira/browse/LANG-1251. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-17985) Bump commons-lang3 version to 3.5.
[ https://issues.apache.org/jira/browse/SPARK-17985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reopened SPARK-17985: - The patch was reverted due to build failures in Hadoop 2.2: Actually I'm seeing the following exceptions locally as well as on Jenkins for Hadoop 2.2: ``` [error] /scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1489: value read is not a member of object org.apache.commons.io.IOUtils [error] var numBytes = IOUtils.read(gzInputStream, buf) [error] ^ [error] /scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1492: value read is not a member of object org.apache.commons.io.IOUtils [error] numBytes = IOUtils.read(gzInputStream, buf) [error]^ ``` I'm going to revert the patch for now. > Bump commons-lang3 version to 3.5. > -- > > Key: SPARK-17985 > URL: https://issues.apache.org/jira/browse/SPARK-17985 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > Fix For: 2.1.0 > > > {{SerializationUtils.clone()}} of commons-lang3 (<3.5) has a bug that breaks > thread safety: it sometimes gets stuck due to a race condition when > initializing a hash map. > See https://issues.apache.org/jira/browse/LANG-1251. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
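The LANG-1251 failure mode can be exercised with a sketch like this (an assumption: the hang needs many concurrent first-time clones racing on the library's internal cache, so it may not reproduce deterministically):

```scala
import org.apache.commons.lang3.SerializationUtils

// Hammer SerializationUtils.clone from several threads. With
// commons-lang3 < 3.5 this can hang due to a race while an internal
// class cache is being initialized (LANG-1251); 3.5 fixes it.
val payload = new java.util.HashMap[String, String]() // HashMap is Serializable
payload.put("k", "v")

val threads = (1 to 8).map { _ =>
  new Thread(new Runnable {
    override def run(): Unit =
      (1 to 1000).foreach(_ => SerializationUtils.clone(payload))
  })
}
threads.foreach(_.start())
threads.foreach(_.join()) // hangs here if the race is hit on a buggy version
```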
[jira] [Commented] (SPARK-17731) Metrics for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586621#comment-15586621 ] Apache Spark commented on SPARK-17731: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/15535 > Metrics for Structured Streaming > > > Key: SPARK-17731 > URL: https://issues.apache.org/jira/browse/SPARK-17731 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 2.0.2, 2.1.0 > > > Metrics are needed for monitoring structured streaming apps. Here is the > design doc for implementing the necessary metrics. > https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org