[jira] [Resolved] (SPARK-22418) Add test cases for NULL Handling
[ https://issues.apache.org/jira/browse/SPARK-22418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22418. - Resolution: Fixed Fix Version/s: 2.3.0 > Add test cases for NULL Handling > > > Key: SPARK-22418 > URL: https://issues.apache.org/jira/browse/SPARK-22418 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Normal > Fix For: 2.3.0 > > > Add test cases mentioned in the link https://sqlite.org/nulls.html -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22418) Add test cases for NULL Handling
[ https://issues.apache.org/jira/browse/SPARK-22418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-22418: --- Assignee: Marco Gaido > Add test cases for NULL Handling > > > Key: SPARK-22418 > URL: https://issues.apache.org/jira/browse/SPARK-22418 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Normal > Fix For: 2.3.0 > > > Add test cases mentioned in the link https://sqlite.org/nulls.html
[jira] [Commented] (SPARK-22436) New function strip() to remove all whitespace from string
[ https://issues.apache.org/jira/browse/SPARK-22436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238712#comment-16238712 ] Eric Maynard commented on SPARK-22436: -- [~asmaier] Wouldn't the right way to implement this be to use the existing function `regexp_replace`? I'm not sure why this needs to be a built-in function. Furthermore, if this were implemented, in my opinion it should exist in all APIs, not just PySpark. > New function strip() to remove all whitespace from string > - > > Key: SPARK-22436 > URL: https://issues.apache.org/jira/browse/SPARK-22436 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.2.0 >Reporter: Andreas Maier >Priority: Minor > Labels: features > > Since ticket SPARK-17299 the [trim() > function|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.trim] > no longer removes all whitespace characters from the beginning and end of a string, > but only spaces. This is correct with regard to the SQL standard, but it opens > a gap in functionality. > My suggestion is to add to the Spark API, in analogy to Python's standard > library, the functions lstrip(), rstrip(), and strip(), which should remove all whitespace > characters from the beginning and/or end of a string respectively.
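A sketch of the approach suggested in the comment above: Python-style lstrip()/rstrip()/strip() behaviour built on the existing regexp_replace function. The helper names are made up for illustration and are not part of any Spark API; this assumes a running SparkSession.

```scala
// Sketch only: emulating Python's lstrip()/rstrip()/strip() with
// org.apache.spark.sql.functions.regexp_replace. Helper names are
// illustrative, not real Spark functions.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.regexp_replace

def lstripCol(c: Column): Column = regexp_replace(c, "^\\s+", "")
def rstripCol(c: Column): Column = regexp_replace(c, "\\s+$", "")
def stripCol(c: Column): Column  = rstripCol(lstripCol(c))

// usage, e.g.: df.select(stripCol($"name"))
```

Because `\s` matches tabs, newlines, and other whitespace, this covers the gap the reporter describes without a new built-in.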
[jira] [Updated] (SPARK-22446) Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.
[ https://issues.apache.org/jira/browse/SPARK-22446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Bellchambers updated SPARK-22446: -- Summary: Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data. (was: Optimizer causing StringIndexer's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.)

> Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label"
> exception incorrectly for filtered data.
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-22446
>                 URL: https://issues.apache.org/jira/browse/SPARK-22446
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, Optimizer
>    Affects Versions: 2.0.0, 2.2.0
>         Environment: spark-shell, local mode, macOS Sierra 10.12.6
>            Reporter: Greg Bellchambers
>            Priority: Normal
>
> In the following, the `indexer` UDF defined inside the
> `org.apache.spark.ml.feature.StringIndexerModel.transform` method throws an
> "Unseen label" error, despite the label not being present in the transformed
> DataFrame.
> Here is the definition of the indexer UDF in the transform method:
> {code:java}
> val indexer = udf { label: String =>
>   if (labelToIndex.contains(label)) {
>     labelToIndex(label)
>   } else {
>     throw new SparkException(s"Unseen label: $label.")
>   }
> }
> {code}
> We can demonstrate the error with a very simple example DataFrame.
> {code:java}
> scala> import org.apache.spark.ml.feature.StringIndexer
> import org.apache.spark.ml.feature.StringIndexer
>
> scala> // first we create a DataFrame with three cities
> scala> val df = List(
>      |   ("A", "London", "StrA"),
>      |   ("B", "Bristol", null),
>      |   ("C", "New York", "StrC")
>      | ).toDF("ID", "CITY", "CONTENT")
> df: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 1 more field]
>
> scala> df.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  B| Bristol|   null|
> |  C|New York|   StrC|
> +---+--------+-------+
>
> scala> // then we remove the row with null in CONTENT column, which removes Bristol
> scala> val dfNoBristol = df.filter($"CONTENT".isNotNull)
> dfNoBristol: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: string, CITY: string ... 1 more field]
>
> scala> dfNoBristol.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  C|New York|   StrC|
> +---+--------+-------+
>
> scala> // now create a StringIndexer for the CITY column and fit to dfNoBristol
> scala> val model = {
>      |   new StringIndexer()
>      |     .setInputCol("CITY")
>      |     .setOutputCol("CITYIndexed")
>      |     .fit(dfNoBristol)
>      | }
> model: org.apache.spark.ml.feature.StringIndexerModel = strIdx_f5afa2fb
>
> scala> // the StringIndexerModel has only two labels: "London" and "New York"
> scala> model.labels foreach println
> London
> New York
>
> scala> // transform our DataFrame to add an index column
> scala> val dfWithIndex = model.transform(dfNoBristol)
> dfWithIndex: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 2 more fields]
>
> scala> dfWithIndex.show
> +---+--------+-------+-----------+
> | ID|    CITY|CONTENT|CITYIndexed|
> +---+--------+-------+-----------+
> |  A|  London|   StrA|        0.0|
> |  C|New York|   StrC|        1.0|
> +---+--------+-------+-----------+
> {code}
> The unexpected behaviour comes when we filter `dfWithIndex` for `CITYIndexed`
> equal to 1.0 and perform an action. The `indexer` UDF in the `transform` method
> throws an exception reporting the unseen label "Bristol". This is irrational
> behaviour as far as the user of the API is concerned, because there is no
> such value as "Bristol" when we show all rows of `dfWithIndex`:
> {code:java}
> scala> dfWithIndex.filter($"CITYIndexed" === 1.0).count
> 17/11/04 00:33:41 ERROR Executor: Exception in task 1.0 in stage 20.0 (TID 40)
> org.apache.spark.SparkException: Failed to execute user defined function($anonfun$5: (string) => double)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at
[jira] [Updated] (SPARK-22446) Optimizer causing StringIndexer's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.
[ https://issues.apache.org/jira/browse/SPARK-22446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Bellchambers updated SPARK-22446: -- Description:

In the following, the `indexer` UDF defined inside the `org.apache.spark.ml.feature.StringIndexerModel.transform` method throws an "Unseen label" error, despite the label not being present in the transformed DataFrame.

Here is the definition of the indexer UDF in the transform method:
{code:java}
val indexer = udf { label: String =>
  if (labelToIndex.contains(label)) {
    labelToIndex(label)
  } else {
    throw new SparkException(s"Unseen label: $label.")
  }
}
{code}
We can demonstrate the error with a very simple example DataFrame.
{code:java}
scala> import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.StringIndexer

scala> // first we create a DataFrame with three cities
scala> val df = List(
     |   ("A", "London", "StrA"),
     |   ("B", "Bristol", null),
     |   ("C", "New York", "StrC")
     | ).toDF("ID", "CITY", "CONTENT")
df: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 1 more field]

scala> df.show
+---+--------+-------+
| ID|    CITY|CONTENT|
+---+--------+-------+
|  A|  London|   StrA|
|  B| Bristol|   null|
|  C|New York|   StrC|
+---+--------+-------+

scala> // then we remove the row with null in CONTENT column, which removes Bristol
scala> val dfNoBristol = df.filter($"CONTENT".isNotNull)
dfNoBristol: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: string, CITY: string ... 1 more field]

scala> dfNoBristol.show
+---+--------+-------+
| ID|    CITY|CONTENT|
+---+--------+-------+
|  A|  London|   StrA|
|  C|New York|   StrC|
+---+--------+-------+

scala> // now create a StringIndexer for the CITY column and fit to dfNoBristol
scala> val model = {
     |   new StringIndexer()
     |     .setInputCol("CITY")
     |     .setOutputCol("CITYIndexed")
     |     .fit(dfNoBristol)
     | }
model: org.apache.spark.ml.feature.StringIndexerModel = strIdx_f5afa2fb

scala> // the StringIndexerModel has only two labels: "London" and "New York"
scala> model.labels foreach println
London
New York

scala> // transform our DataFrame to add an index column
scala> val dfWithIndex = model.transform(dfNoBristol)
dfWithIndex: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 2 more fields]

scala> dfWithIndex.show
+---+--------+-------+-----------+
| ID|    CITY|CONTENT|CITYIndexed|
+---+--------+-------+-----------+
|  A|  London|   StrA|        0.0|
|  C|New York|   StrC|        1.0|
+---+--------+-------+-----------+
{code}
The unexpected behaviour comes when we filter `dfWithIndex` for `CITYIndexed` equal to 1.0 and perform an action. The `indexer` UDF in the `transform` method throws an exception reporting the unseen label "Bristol". This is irrational behaviour as far as the user of the API is concerned, because there is no such value as "Bristol" when we show all rows of `dfWithIndex`:
{code:java}
scala> dfWithIndex.filter($"CITYIndexed" === 1.0).count
17/11/04 00:33:41 ERROR Executor: Exception in task 1.0 in stage 20.0 (TID 40)
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$5: (string) => double)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Unseen label: Bristol. To handle unseen labels, set Param handleInvalid to keep.
        at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$5.apply(StringIndexer.scala:222)
        at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$5.apply(StringIndexer.scala:208)
        ... 13 more
{code}
To understand what is happening here, note that an action is triggered when we call
[jira] [Created] (SPARK-22446) Optimizer causing StringIndexer's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.
Greg Bellchambers created SPARK-22446:
-------------------------------------

             Summary: Optimizer causing StringIndexer's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.
                 Key: SPARK-22446
                 URL: https://issues.apache.org/jira/browse/SPARK-22446
             Project: Spark
          Issue Type: Bug
          Components: ML, Optimizer
    Affects Versions: 2.2.0, 2.0.0
         Environment: spark-shell, local mode, macOS Sierra 10.12.6
            Reporter: Greg Bellchambers
            Priority: Normal

In the following, the `indexer` UDF defined inside the `org.apache.spark.ml.feature.StringIndexerModel.transform` method throws an "Unseen label" error, despite the label not being present in the transformed DataFrame.

Here is the definition of the indexer UDF in the transform method:
{code:scala}
val indexer = udf { label: String =>
  if (labelToIndex.contains(label)) {
    labelToIndex(label)
  } else {
    throw new SparkException(s"Unseen label: $label.")
  }
}
{code}
We can demonstrate the error with a very simple example DataFrame.
{code:scala}
scala> import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.StringIndexer

scala> // first we create a DataFrame with three cities
scala> val df = List(
     |   ("A", "London", "StrA"),
     |   ("B", "Bristol", null),
     |   ("C", "New York", "StrC")
     | ).toDF("ID", "CITY", "CONTENT")
df: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 1 more field]

scala> df.show
+---+--------+-------+
| ID|    CITY|CONTENT|
+---+--------+-------+
|  A|  London|   StrA|
|  B| Bristol|   null|
|  C|New York|   StrC|
+---+--------+-------+

scala> // then we remove the row with null in CONTENT column, which removes Bristol
scala> val dfNoBristol = df.filter($"CONTENT".isNotNull)
dfNoBristol: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: string, CITY: string ... 1 more field]

scala> dfNoBristol.show
+---+--------+-------+
| ID|    CITY|CONTENT|
+---+--------+-------+
|  A|  London|   StrA|
|  C|New York|   StrC|
+---+--------+-------+

scala> // now create a StringIndexer for the CITY column and fit to dfNoBristol
scala> val model = {
     |   new StringIndexer()
     |     .setInputCol("CITY")
     |     .setOutputCol("CITYIndexed")
     |     .fit(dfNoBristol)
     | }
model: org.apache.spark.ml.feature.StringIndexerModel = strIdx_f5afa2fb

scala> // the StringIndexerModel has only two labels: "London" and "New York"
scala> model.labels foreach println
London
New York

scala> // transform our DataFrame to add an index column
scala> val dfWithIndex = model.transform(dfNoBristol)
dfWithIndex: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 2 more fields]

scala> dfWithIndex.show
+---+--------+-------+-----------+
| ID|    CITY|CONTENT|CITYIndexed|
+---+--------+-------+-----------+
|  A|  London|   StrA|        0.0|
|  C|New York|   StrC|        1.0|
+---+--------+-------+-----------+
{code}
The unexpected behaviour comes when we filter `dfWithIndex` for `CITYIndexed` equal to 1.0 and perform an action. The `indexer` UDF in the `transform` method throws an exception reporting the unseen label "Bristol". This is irrational behaviour as far as the user of the API is concerned, because there is no such value as "Bristol" when we show all rows of `dfWithIndex`:
{code:scala}
scala> dfWithIndex.filter($"CITYIndexed" === 1.0).count
17/11/04 00:33:41 ERROR Executor: Exception in task 1.0 in stage 20.0 (TID 40)
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$5: (string) => double)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Unseen label: Bristol. To handle unseen
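As the exception message in the report itself suggests, the handleInvalid param offers a user-side workaround (it does not fix the optimizer behaviour). A sketch reusing the names from the example above; note "keep" is only available in Spark versions that support it, with older releases offering only "skip" and "error":

```scala
// Workaround sketch, not a fix for the underlying optimizer issue:
// with handleInvalid set to "keep" (or "skip"), the indexer UDF no longer
// throws when the optimizer evaluates it on rows the filter was meant
// to remove. Variable names are taken from the example above.
import org.apache.spark.ml.feature.StringIndexer

val tolerantModel = new StringIndexer()
  .setInputCol("CITY")
  .setOutputCol("CITYIndexed")
  .setHandleInvalid("keep")  // or "skip" to drop rows with unseen labels
  .fit(dfNoBristol)

// This count no longer fails even if the UDF sees the "Bristol" row.
tolerantModel.transform(dfNoBristol).filter($"CITYIndexed" === 1.0).count
```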
[jira] [Assigned] (SPARK-22445) move CodegenContext.copyResult to CodegenSupport
[ https://issues.apache.org/jira/browse/SPARK-22445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22445: Assignee: Wenchen Fan (was: Apache Spark) > move CodegenContext.copyResult to CodegenSupport > > > Key: SPARK-22445 > URL: https://issues.apache.org/jira/browse/SPARK-22445 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major >
[jira] [Assigned] (SPARK-22445) move CodegenContext.copyResult to CodegenSupport
[ https://issues.apache.org/jira/browse/SPARK-22445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22445: Assignee: Apache Spark (was: Wenchen Fan) > move CodegenContext.copyResult to CodegenSupport > > > Key: SPARK-22445 > URL: https://issues.apache.org/jira/browse/SPARK-22445 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major >
[jira] [Comment Edited] (SPARK-22441) JDBC REAL type is mapped to Double instead of Float
[ https://issues.apache.org/jira/browse/SPARK-22441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238628#comment-16238628 ] Tor Myklebust edited comment on SPARK-22441 at 11/4/17 12:28 AM:
-----------------------------------------------------------------

I wrote this code a while ago, but I don't think REAL -> double was an accident. The JDBC spec wasn't the reference for this implementation since a number of things it says about datatypes contradict the behaviour of actual databases. Which is to say that some database used REAL for 64-bit floats.

Postgres and H2 document that a REAL is a 32-bit float. MySQL docs say that "MySQL also treats REAL as a synonym for DOUBLE PRECISION." I haven't got a quick way to check, but I'd speculate that MySQL's JDBC connector advertises a MySQL REAL column as being of JDBC REAL type.

Re SQL FLOAT and DOUBLE types:
- H2 uses FLOAT and DOUBLE interchangeably according to its documentation.
- "PostgreSQL also supports the SQL-standard notations float and float(p) for specifying inexact numeric types. Here, p specifies the minimum acceptable precision in binary digits. PostgreSQL accepts float(1) to float(24) as selecting the real type, while float(25) to float(53) select double precision. Values of p outside the allowed range draw an error. float with no precision specified is taken to mean double precision."
- According to its docs, MySQL does something similar to Postgres for FLOAT.

Depending on how this is exposed via JDBC, perhaps JDBC FLOAT should map to double as well.

was (Author: tmyklebu): I wrote this code a while ago, but I don't think REAL -> double was an accident. The JDBC spec wasn't the reference for this implementation since a number of things it says about datatypes contradict the behaviour of actual databases. Which is to say that some database used REAL for 64-bit floats. Postgres and H2 document that a REAL is a 32-bit float. MySQL docs say that "MySQL also treats REAL as a synonym for DOUBLE PRECISION." I haven't got a quick way to check, but I'd speculate that MySQL's JDBC connector advertises a MySQL REAL column as being of JDBC REAL type.

> JDBC REAL type is mapped to Double instead of Float
> ---------------------------------------------------
>
>                 Key: SPARK-22441
>                 URL: https://issues.apache.org/jira/browse/SPARK-22441
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Hongbo
>            Priority: Minor
>
> In [JDBC Specification|http://download.oracle.com/otn-pub/jcp/jdbc-4_1-mrel-eval-spec/jdbc4.1-fr-spec.pdf],
> REAL should be mapped to Float.
> But now, it's mapped to Double:
> [https://github.com/apache/spark/blob/bc7ca9786e162e33f29d57c4aacb830761b97221/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L220]
> Should it be changed according to the specification?
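Whatever the default mapping ends up being, users who want REAL read back as 32-bit floats for a particular database can register a custom dialect today. A sketch only; the object name and the `jdbc:postgresql` prefix are illustrative assumptions, not anything from this issue:

```scala
// Illustrative sketch: a user-registered dialect that maps JDBC REAL to
// Spark's FloatType for one database, leaving the built-in Double mapping
// untouched for all other types and URLs.
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, FloatType, MetadataBuilder}

object RealAsFloatDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")
  override def getCatalystType(sqlType: Int, typeName: String, size: Int,
      md: MetadataBuilder): Option[DataType] =
    if (sqlType == Types.REAL) Some(FloatType)
    else None  // None means: fall back to Spark's default mapping
}

JdbcDialects.registerDialect(RealAsFloatDialect)
```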
[jira] [Commented] (SPARK-22445) move CodegenContext.copyResult to CodegenSupport
[ https://issues.apache.org/jira/browse/SPARK-22445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238635#comment-16238635 ] Apache Spark commented on SPARK-22445: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/19656 > move CodegenContext.copyResult to CodegenSupport > > > Key: SPARK-22445 > URL: https://issues.apache.org/jira/browse/SPARK-22445 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major >
[jira] [Commented] (SPARK-22441) JDBC REAL type is mapped to Double instead of Float
[ https://issues.apache.org/jira/browse/SPARK-22441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238628#comment-16238628 ] Tor Myklebust commented on SPARK-22441: --- I wrote this code a while ago, but I don't think REAL -> double was an accident. The JDBC spec wasn't the reference for this implementation since a number of things it says about datatypes contradict the behaviour of actual databases. Which is to say that some database used REAL for 64-bit floats. Postgres and H2 document that a REAL is a 32-bit float. MySQL docs say that "MySQL also treats REAL as a synonym for DOUBLE PRECISION." I haven't got a quick way to check, but I'd speculate that MySQL's JDBC connector advertises a MySQL REAL column as being of JDBC REAL type. > JDBC REAL type is mapped to Double instead of Float > --- > > Key: SPARK-22441 > URL: https://issues.apache.org/jira/browse/SPARK-22441 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hongbo >Priority: Minor > > In [JDBC > Specification|http://download.oracle.com/otn-pub/jcp/jdbc-4_1-mrel-eval-spec/jdbc4.1-fr-spec.pdf], > REAL should be mapped to Float. > But now, it's mapped to Double: > [https://github.com/apache/spark/blob/bc7ca9786e162e33f29d57c4aacb830761b97221/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L220] > Should it be changed according to the specification?
[jira] [Created] (SPARK-22445) move CodegenContext.copyResult to CodegenSupport
Wenchen Fan created SPARK-22445: --- Summary: move CodegenContext.copyResult to CodegenSupport Key: SPARK-22445 URL: https://issues.apache.org/jira/browse/SPARK-22445 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan Priority: Major
[jira] [Commented] (SPARK-22444) Spark History Server missing /environment endpoint/api
[ https://issues.apache.org/jira/browse/SPARK-22444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238615#comment-16238615 ] Apache Spark commented on SPARK-22444: -- User 'ambud' has created a pull request for this issue: https://github.com/apache/spark/pull/19655 > Spark History Server missing /environment endpoint/api > -- > > Key: SPARK-22444 > URL: https://issues.apache.org/jira/browse/SPARK-22444 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.6.3, 1.6.4 >Reporter: Ambud Sharma > > Spark History Server REST API is missing the /environment endpoint. This > endpoint is otherwise available in Spark 2.x however it's missing from 1.6.x. > Since the environment endpoint provides programmatic access to critical > launch parameters it would be great to have this back ported. > This feature was contributed under: > https://issues.apache.org/jira/browse/SPARK-16122
[jira] [Resolved] (SPARK-22444) Spark History Server missing /environment endpoint/api
[ https://issues.apache.org/jira/browse/SPARK-22444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-22444. Resolution: Won't Fix Fix Version/s: (was: 1.6.4) Features like this are not generally added to older releases. > Spark History Server missing /environment endpoint/api > -- > > Key: SPARK-22444 > URL: https://issues.apache.org/jira/browse/SPARK-22444 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.6.3, 1.6.4 >Reporter: Ambud Sharma > > Spark History Server REST API is missing the /environment endpoint. This > endpoint is otherwise available in Spark 2.x however it's missing from 1.6.x. > Since the environment endpoint provides programmatic access to critical > launch parameters it would be great to have this back ported. > This feature was contributed under: > https://issues.apache.org/jira/browse/SPARK-16122
[jira] [Created] (SPARK-22444) Spark History Server missing /environment endpoint/api
Ambud Sharma created SPARK-22444: Summary: Spark History Server missing /environment endpoint/api Key: SPARK-22444 URL: https://issues.apache.org/jira/browse/SPARK-22444 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.6.3, 1.6.4 Reporter: Ambud Sharma Fix For: 1.6.4 Spark History Server REST API is missing the /environment endpoint. This endpoint is otherwise available in Spark 2.x however it's missing from 1.6.x. Since the environment endpoint provides programmatic access to critical launch parameters it would be great to have this back ported. This feature was contributed under: https://issues.apache.org/jira/browse/SPARK-16122
[jira] [Commented] (SPARK-22443) AggregatedDialect doesn't override quoteIdentifier and other methods in JdbcDialects
[ https://issues.apache.org/jira/browse/SPARK-22443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238482#comment-16238482 ] Sean Owen commented on SPARK-22443: --- Good catch. I suppose that this and getTableExistsQuery and getSchemaQuery need to return the value from the first dialect?

> AggregatedDialect doesn't override quoteIdentifier and other methods in
> JdbcDialects
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-22443
>                 URL: https://issues.apache.org/jira/browse/SPARK-22443
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Hongbo
>            Priority: Normal
>
> The AggregatedDialect only implements canHandle, getCatalystType, and
> getJDBCType. It doesn't implement the other methods in JdbcDialect.
> So if multiple dialects are registered for the same driver, their
> implementations of these methods will not be used; the default
> implementations in JdbcDialect will be used instead.
> Example:
> {code:java}
> package example
>
> import java.util.Properties
>
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
> import org.apache.spark.sql.types.{DataType, MetadataBuilder}
>
> object AnotherMySQLDialect extends JdbcDialect {
>   override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")
>   override def getCatalystType(
>       sqlType: Int, typeName: String, size: Int,
>       md: MetadataBuilder): Option[DataType] = {
>     None
>   }
>   override def quoteIdentifier(colName: String): String = {
>     s"`$colName`"
>   }
> }
>
> object App {
>   def main(args: Array[String]) {
>     val spark = SparkSession.builder.master("local").appName("Simple Application").getOrCreate()
>     JdbcDialects.registerDialect(AnotherMySQLDialect)
>     val jdbcUrl = s"jdbc:mysql://host:port/db?user=user&password=password"
>     spark.read.jdbc(jdbcUrl, "badge", new Properties()).show()
>   }
> }
> {code}
> will throw an exception.
> {code:none}
> 17/11/03 17:08:39 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.sql.SQLDataException: Cannot determine value type from string 'id'
>         at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:530)
>         at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:513)
>         at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:505)
>         at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:479)
>         at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:489)
>         at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:89)
>         at com.mysql.cj.jdbc.result.ResultSetImpl.getLong(ResultSetImpl.java:853)
>         at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:409)
>         at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:408)
>         at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330)
>         at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312)
>         at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>         at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>         at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
>         at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>         at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>         at org.apache.spark.scheduler.Task.run(Task.scala:108)
>         at
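The fix direction Sean Owen's comment points at could look roughly like the following. This is a simplified, standalone stand-in, not the actual org.apache.spark.sql.jdbc.AggregatedDialect source; the class name is made up, and delegating to the first dialect is the comment's suggestion, not settled behaviour:

```scala
// Sketch of the suggested fix: delegate quoteIdentifier (and similarly
// getTableExistsQuery / getSchemaQuery) to the first matching dialect
// instead of inheriting the JdbcDialect defaults.
import org.apache.spark.sql.jdbc.JdbcDialect

class DelegatingAggregatedDialect(dialects: List[JdbcDialect]) extends JdbcDialect {
  require(dialects.nonEmpty)

  override def canHandle(url: String): Boolean = dialects.forall(_.canHandle(url))

  // Defer to the first dialect, as suggested in the comment above.
  override def quoteIdentifier(colName: String): String =
    dialects.head.quoteIdentifier(colName)
  override def getTableExistsQuery(table: String): String =
    dialects.head.getTableExistsQuery(table)
  override def getSchemaQuery(table: String): String =
    dialects.head.getSchemaQuery(table)
}
```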
[jira] [Updated] (SPARK-22443) AggregatedDialect doesn't override quoteIdentifier and other methods in JdbcDialects
[ https://issues.apache.org/jira/browse/SPARK-22443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongbo updated SPARK-22443: --- Summary: AggregatedDialect doesn't override quoteIdentifier and other methods in JdbcDialects (was: AggregatedDialect doesn't work for quoteIdentifier) > AggregatedDialect doesn't override quoteIdentifier and other methods in > JdbcDialects > > > Key: SPARK-22443 > URL: https://issues.apache.org/jira/browse/SPARK-22443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hongbo >Priority: Normal > > The AggregatedDialect only implements canHandle, getCatalystType, > getJDBCType. It doesn't implement other methods in JdbcDialect. > So if multiple Dialects are registered with the same driver, the > implementation of these methods will not be taken and the default > implementation in JdbcDialect will be used. > Example: > {code:java} > package example > import java.util.Properties > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects} > import org.apache.spark.sql.types.{DataType, MetadataBuilder} > object AnotherMySQLDialect extends JdbcDialect { > override def canHandle(url : String): Boolean = url.startsWith("jdbc:mysql") > override def getCatalystType( > sqlType: Int, typeName: String, size: Int, > md: MetadataBuilder): Option[DataType] = { > None > } > override def quoteIdentifier(colName: String): String = { > s"`$colName`" > } > } > object App { > def main(args: Array[String]) { > val spark = SparkSession.builder.master("local").appName("Simple > Application").getOrCreate() > JdbcDialects.registerDialect(AnotherMySQLDialect) > val jdbcUrl = s"jdbc:mysql://host:port/db?user=user=password" > spark.read.jdbc(jdbcUrl, "badge", new Properties()).show() > } > } > {code} > will throw an exception. 
> {code:none} > 17/11/03 17:08:39 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.sql.SQLDataException: Cannot determine value type from string 'id' > at > com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:530) > at > com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:513) > at > com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:505) > at > com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:479) > at > com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:489) > at > com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:89) > at > com.mysql.cj.jdbc.result.ResultSetImpl.getLong(ResultSetImpl.java:853) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:409) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:408) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at
[jira] [Created] (SPARK-22443) AggregatedDialect doesn't work for quoteIdentifier
Hongbo created SPARK-22443: -- Summary: AggregatedDialect doesn't work for quoteIdentifier Key: SPARK-22443 URL: https://issues.apache.org/jira/browse/SPARK-22443 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Hongbo Priority: Normal The AggregatedDialect only implements canHandle, getCatalystType and getJDBCType. It doesn't implement the other methods in JdbcDialect, so if multiple dialects are registered for the same driver, their implementations of those methods are not used and the default implementations in JdbcDialect are used instead. Example:
{code:java}
package example

import java.util.Properties

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, MetadataBuilder}

object AnotherMySQLDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int,
      md: MetadataBuilder): Option[DataType] = {
    None
  }

  override def quoteIdentifier(colName: String): String = {
    s"`$colName`"
  }
}

object App {
  def main(args: Array[String]) {
    val spark = SparkSession.builder.master("local").appName("Simple Application").getOrCreate()
    JdbcDialects.registerDialect(AnotherMySQLDialect)
    val jdbcUrl = s"jdbc:mysql://host:port/db?user=user&password=password"
    spark.read.jdbc(jdbcUrl, "badge", new Properties()).show()
  }
}
{code}
will throw an exception.
{code:none} 17/11/03 17:08:39 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.sql.SQLDataException: Cannot determine value type from string 'id' at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:530) at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:513) at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:505) at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:479) at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:489) at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:89) at com.mysql.cj.jdbc.result.ResultSetImpl.getLong(ResultSetImpl.java:853) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:409) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:408) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: com.mysql.cj.core.exceptions.DataConversionException: Cannot determine value type from string 'id' at com.mysql.cj.core.io.StringConverter.createFromBytes(StringConverter.java:121) at
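The delegation gap Hongbo describes can be modeled in a few lines. The sketch below is Python with invented class names, not Spark's actual JdbcDialect API; it shows why an aggregating dialect that forwards only some methods silently falls back to defaults for the rest:

```python
# Sketch of the delegation gap: an "aggregated" dialect must forward
# every overridable method to its member dialects, not just a subset.
# Class names here are illustrative, not Spark's actual API.

class Dialect:
    # Default behavior, analogous to JdbcDialect's defaults.
    def quote_identifier(self, col):
        return f'"{col}"'

class MySQLLikeDialect(Dialect):
    def quote_identifier(self, col):
        return f"`{col}`"

class BrokenAggregatedDialect(Dialect):
    """Forwards nothing extra: inherits the default quote_identifier."""
    def __init__(self, dialects):
        self.dialects = dialects

class FixedAggregatedDialect(Dialect):
    """Delegates to the first registered dialect for the remaining methods."""
    def __init__(self, dialects):
        self.dialects = dialects
    def quote_identifier(self, col):
        return self.dialects[0].quote_identifier(col)

broken = BrokenAggregatedDialect([MySQLLikeDialect()])
fixed = FixedAggregatedDialect([MySQLLikeDialect()])
print(broken.quote_identifier("id"))  # "id" in double quotes: the default, wrong for MySQL
print(fixed.quote_identifier("id"))   # `id` in backticks, as MySQL expects
```

With the broken aggregation, MySQL receives double-quoted identifiers it treats as string literals, which matches the "Cannot determine value type from string 'id'" error in the stack trace above.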
[jira] [Commented] (SPARK-14703) Spark uses SLF4J, but actually relies quite heavily on Log4J
[ https://issues.apache.org/jira/browse/SPARK-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238409#comment-16238409 ] Harry Weppner commented on SPARK-14703: --- [~happy15sheng] I've discovered that {{log4j-over-slf4j}} cannot be used as a drop-in replacement in all cases. The Hive server, for example, calls {{log4j}} components that are not part of the bridge. {code} Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hive.service.cli.operation.LogDivertAppender.setWriter(Ljava/io/Writer;)V at org.apache.hive.service.cli.operation.LogDivertAppender.<init>(LogDivertAppender.java:166) at org.apache.hive.service.cli.operation.OperationManager.initOperationLogCapture(OperationManager.java:85) at org.apache.hive.service.cli.operation.OperationManager.init(OperationManager.java:63) at org.apache.spark.sql.hive.thriftserver.ReflectedCompositeService$$anonfun$initCompositeService$1.apply(SparkSQLCLIService.scala:79) at org.apache.spark.sql.hive.thriftserver.ReflectedCompositeService$$anonfun$initCompositeService$1.apply(SparkSQLCLIService.scala:79) .. {code} In my case, I've worked around it by setting {{hive.server2.logging.operation.enabled}} to {{false}}, which avoids these code paths altogether, at the expense of having no logs for the Hive server, of course. > Spark uses SLF4J, but actually relies quite heavily on Log4J > > > Key: SPARK-14703 > URL: https://issues.apache.org/jira/browse/SPARK-14703 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 1.6.0 > Environment: 1.6.0-cdh5.7.0, logback 1.1.3, yarn >Reporter: Matthew Byng-Maddick >Priority: Minor > Labels: log4j, logback, logging, slf4j > Attachments: spark-logback.patch > > > We've built a version of Hadoop CDH-5.7.0 in house with logback as the SLF4J > provider, in order to send hadoop logs straight to logstash (to handle with > logstash/elasticsearch), on top of our existing use of the logback backend. 
> In trying to start spark-shell I discovered several points where the fact > that we weren't quite using a real L4J caused the sc not to be created or the > YARN module not to exist. There are many more places where we should probably > be wrapping the logging more sensibly, but I have a basic patch that fixes > some of the worst offenders (at least the ones that stop the sparkContext > being created properly). > I'm prepared to accept that this is not a good solution and there probably > needs to be some sort of better wrapper, perhaps in the Logging.scala class > which handles this properly. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22442) Schema generated by Product Encoder doesn't match case class field name when using non-standard characters
[ https://issues.apache.org/jira/browse/SPARK-22442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikel San Vicente updated SPARK-22442: -- Description: Product encoder encodes special characters wrongly when field name contains certain nonstandard characters. For example for: {quote} case class MyType(`field.1`: String, `field 2`: String) {quote} we will get the following schema {quote} root |-- field$u002E1: string (nullable = true) |-- field$u00202: string (nullable = true) {quote} was: Product encoder encodes special characters wrongly when field name contains certain nonstandard characters. For example for: {{ case class MyType(`field.1`: String, `field 2`: String) }} we will get the following schema {{ root |-- field$u002E1: string (nullable = true) |-- field$u00202: string (nullable = true) }} > Schema generated by Product Encoder doesn't match case class field name when > using non-standard characters > -- > > Key: SPARK-22442 > URL: https://issues.apache.org/jira/browse/SPARK-22442 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: Mikel San Vicente >Priority: Normal > > Product encoder encodes special characters wrongly when field name contains > certain nonstandard characters. > For example for: > {quote} > case class MyType(`field.1`: String, `field 2`: String) > {quote} > we will get the following schema > {quote} > root > |-- field$u002E1: string (nullable = true) > |-- field$u00202: string (nullable = true) > {quote} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22442) Schema generated by Product Encoder doesn't match case class field name when using non-standard characters
[ https://issues.apache.org/jira/browse/SPARK-22442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikel San Vicente updated SPARK-22442: -- Description: Product encoder encodes special characters wrongly when field name contains certain nonstandard characters. For example for: {{case class MyType(`field.1`: String, `field 2`: String) }} we will get the following schema {{root |-- field$u002E1: string (nullable = true) |-- field$u00202: string (nullable = true)}} was: Product encoder encodes special characters wrongly when field name contains certain nonstandard characters. For example for: ``` case class MyType(`field.1`: String, `field 2`: String) ``` we will get the following schema ``` root |-- field$u002E1: string (nullable = true) |-- field$u00202: string (nullable = true) ``` > Schema generated by Product Encoder doesn't match case class field name when > using non-standard characters > -- > > Key: SPARK-22442 > URL: https://issues.apache.org/jira/browse/SPARK-22442 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: Mikel San Vicente >Priority: Normal > > Product encoder encodes special characters wrongly when field name contains > certain nonstandard characters. > For example for: > {{case class MyType(`field.1`: String, `field 2`: String) > }} > we will get the following schema > {{root > |-- field$u002E1: string (nullable = true) > |-- field$u00202: string (nullable = true)}} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22442) Schema generated by Product Encoder doesn't match case class field name when using non-standard characters
[ https://issues.apache.org/jira/browse/SPARK-22442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikel San Vicente updated SPARK-22442: -- Description: Product encoder encodes special characters wrongly when field name contains certain nonstandard characters. For example for: {{ case class MyType(`field.1`: String, `field 2`: String) }} we will get the following schema {{ root |-- field$u002E1: string (nullable = true) |-- field$u00202: string (nullable = true) }} was: Product encoder encodes special characters wrongly when field name contains certain nonstandard characters. For example for: {{case class MyType(`field.1`: String, `field 2`: String) }} we will get the following schema {{root |-- field$u002E1: string (nullable = true) |-- field$u00202: string (nullable = true)}} > Schema generated by Product Encoder doesn't match case class field name when > using non-standard characters > -- > > Key: SPARK-22442 > URL: https://issues.apache.org/jira/browse/SPARK-22442 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: Mikel San Vicente >Priority: Normal > > Product encoder encodes special characters wrongly when field name contains > certain nonstandard characters. > For example for: > {{ > case class MyType(`field.1`: String, `field 2`: String) > }} > we will get the following schema > {{ > root > |-- field$u002E1: string (nullable = true) > |-- field$u00202: string (nullable = true) > }} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22442) Schema generated by Product Encoder doesn't match case class field name when using non-standard characters
Mikel San Vicente created SPARK-22442: - Summary: Schema generated by Product Encoder doesn't match case class field name when using non-standard characters Key: SPARK-22442 URL: https://issues.apache.org/jira/browse/SPARK-22442 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0, 2.1.2, 2.0.2 Reporter: Mikel San Vicente Priority: Normal Product encoder encodes special characters wrongly when field name contains certain nonstandard characters. For example for: ``` case class MyType(`field.1`: String, `field 2`: String) ``` we will get the following schema ``` root |-- field$u002E1: string (nullable = true) |-- field$u00202: string (nullable = true) ``` -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
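The schema names in this report ({{field$u002E1}}, {{field$u00202}}) come from an escaping scheme that replaces characters illegal in an identifier with {{$uXXXX}}, the character's hex Unicode code point: {{.}} is U+002E and the space is U+0020. The sketch below reproduces that escaping so the reported output is easy to decode; it is an illustration of the scheme visible in the schema, not Spark's actual encoder code:

```python
# Illustration of the escaping visible in the reported schema: characters
# that are not valid in a plain identifier are replaced by $uXXXX, their
# uppercase four-digit hex Unicode code point. Not Spark's actual code.

def escape_field_name(name: str) -> str:
    out = []
    for ch in name:
        if ch.isalnum() or ch == "_":
            out.append(ch)
        else:
            out.append("$u%04X" % ord(ch))
    return "".join(out)

print(escape_field_name("field.1"))  # field$u002E1
print(escape_field_name("field 2"))  # field$u00202
```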
[jira] [Commented] (SPARK-22441) JDBC REAL type is mapped to Double instead of Float
[ https://issues.apache.org/jira/browse/SPARK-22441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238313#comment-16238313 ] Sean Owen commented on SPARK-22441: --- You have a point. In the same file, the reverse mapping maps float / FloatType to REAL (but, also java.sql.Types.FLOAT)? The reference you give also says FLOAT and DOUBLE both map to double, not to float/double respectively. I'm concerned about changing behavior, but that does seem internally inconsistent at least. Does anyone have a view on whether this should change in 2.3? CC [~tmyklebu] who I believe wrote the code in question. > JDBC REAL type is mapped to Double instead of Float > --- > > Key: SPARK-22441 > URL: https://issues.apache.org/jira/browse/SPARK-22441 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hongbo >Priority: Minor > > In [JDBC > Specification|http://download.oracle.com/otn-pub/jcp/jdbc-4_1-mrel-eval-spec/jdbc4.1-fr-spec.pdf], > REAL should be mapped to Float. > But now, it's mapped to Double: > [https://github.com/apache/spark/blob/bc7ca9786e162e33f29d57c4aacb830761b97221/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L220] > Should it be changed according to the specification? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22441) JDBC REAL type is mapped to Double instead of Float
Hongbo created SPARK-22441: -- Summary: JDBC REAL type is mapped to Double instead of Float Key: SPARK-22441 URL: https://issues.apache.org/jira/browse/SPARK-22441 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Hongbo Priority: Minor In [JDBC Specification|http://download.oracle.com/otn-pub/jcp/jdbc-4_1-mrel-eval-spec/jdbc4.1-fr-spec.pdf], REAL should be mapped to Float. But now, it's mapped to Double: [https://github.com/apache/spark/blob/bc7ca9786e162e33f29d57c4aacb830761b97221/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L220] Should it be changed according to the specification? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
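For context on why the mapping matters: SQL REAL is a 4-byte single-precision IEEE 754 value, carrying roughly 7 significant decimal digits, which is exactly what Spark's FloatType models and what DoubleType overstates. A quick round-trip through 4 bytes with Python's struct module shows the precision involved (an illustration, not Spark code):

```python
import struct

# SQL REAL is a single-precision (4-byte) IEEE 754 value. Round-tripping a
# Python double through 4 bytes shows the ~7 significant decimal digits a
# REAL column can actually carry, which is why FloatType (not DoubleType)
# is the faithful Catalyst type for it.

def through_real(x: float) -> float:
    # Pack to 4 bytes (as a REAL column would store it) and unpack again.
    return struct.unpack("<f", struct.pack("<f", x))[0]

print(through_real(0.1))  # 0.10000000149011612: single precision is coarser
print(through_real(1.5))  # 1.5: exactly representable at both widths
```

Mapping REAL to Double is lossless in the read direction (widening), so the question raised here is one of spec conformance and round-trip symmetry rather than data corruption.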
[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
[ https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238281#comment-16238281 ] Henry Robinson commented on SPARK-22211: Sounds good, thanks both. > LimitPushDown optimization for FullOuterJoin generates wrong results > > > Key: SPARK-22211 > URL: https://issues.apache.org/jira/browse/SPARK-22211 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: on community.cloude.databrick.com > Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11) >Reporter: Benyi Wang >Priority: Major > > LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may > generate a wrong result: > Assume we use limit(1) and LocalLimit will be pushed to left side, and id=999 > is selected, but at right side we have 100K rows including 999, the result > will be > - one row is (999, 999) > - the rest rows are (null, xxx) > Once you call show(), the row (999,999) has only 1/10th chance to be > selected by CollectLimit. > The actual optimization might be, > - push down limit > - but convert the join to Broadcast LeftOuterJoin or RightOuterJoin. 
> Here is my notebook: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html > {code:java} > import scala.util.Random._ > val dl = shuffle(1 to 10).toDF("id") > val dr = shuffle(1 to 10).toDF("id") > println("data frame dl:") > dl.explain > println("data frame dr:") > dr.explain > val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1) > j.explain > j.show(false) > {code} > {code} > data frame dl: > == Physical Plan == > LocalTableScan [id#10] > data frame dr: > == Physical Plan == > LocalTableScan [id#16] > == Physical Plan == > CollectLimit 1 > +- SortMergeJoin [id#10], [id#16], FullOuter >:- *Sort [id#10 ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#10, 200) >: +- *LocalLimit 1 >:+- LocalTableScan [id#10] >+- *Sort [id#16 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#16, 200) > +- LocalTableScan [id#16] > import scala.util.Random._ > dl: org.apache.spark.sql.DataFrame = [id: int] > dr: org.apache.spark.sql.DataFrame = [id: int] > j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int] > ++---+ > |id |id | > ++---+ > |null|148| > ++---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
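The wrong result Benyi Wang reports can be modeled without Spark. The sketch below uses toy Python relations (not Spark's LimitPushDown rule) to show how pushing the limit into the left side changes the pool of rows the final limit draws from: after pushdown only one left row can ever match, so the joined relation is dominated by (null, right_id) rows and the matched row survives the final limit(1) only occasionally:

```python
# Toy model of the reported bug: pushing LocalLimit below the left side of
# a FULL OUTER JOIN leaves only one left row able to match, so the final
# limit(1) mostly picks a (None, right_id) row. Pure-Python sketch, not
# Spark's execution model.

def full_outer_join(left, right):
    rows = [(x, x) for x in left if x in set(right)]          # matched rows
    rows += [(x, None) for x in left if x not in set(right)]  # left-only rows
    rows += [(None, y) for y in right if y not in set(left)]  # right-only rows
    return rows

left = list(range(1, 11))
right = list(range(1, 11))

after_join = full_outer_join(left, right)   # limit applied after the join: safe
pushed = full_outer_join(left[:1], right)   # limit(1) pushed into the left side

matched = [r for r in pushed if r[0] is not None and r[1] is not None]
print(len(after_join), "candidate rows, all matched")
print(len(pushed), "candidate rows,", len(matched), "matched")
```

With 10 rows per side, only 1 of the 10 candidate rows after pushdown is fully matched, which is the "1/10th chance" the report describes for CollectLimit selecting the (999, 999)-style row.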
[jira] [Commented] (SPARK-22440) Add Calinski-Harabasz index to ClusteringEvaluator
[ https://issues.apache.org/jira/browse/SPARK-22440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238241#comment-16238241 ] Marco Gaido commented on SPARK-22440: - Honestly I don't know what people are using for clustering evaluation and I don't know either where to retrive such a statistic. My goal here was to make easier for people to migrate their existing workloads to Spark. Since sklearn is surely one of the most widespread libraries for machine learning, the existing workloads can evaluate an unsupervised clustering through Silhouette or Calinski-Harabasz. If we support both, I think the adoption of Spark would be easier for them. > Add Calinski-Harabasz index to ClusteringEvaluator > -- > > Key: SPARK-22440 > URL: https://issues.apache.org/jira/browse/SPARK-22440 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Minor > > In SPARK-14516 we introduced ClusteringEvaluator with an implementation of > Silhouette. > sklearn contains also another metric for the evaluation of unsupervised > clustering results. The metric is Calinski-Harabasz. This JIRA is to add it > to Spark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238202#comment-16238202 ] Sean Owen commented on SPARK-22424: --- Yes, but how is that related to the tasks you are highlighting? the tasks you also highlight are both successful. A task may succeed but that doesn't mean all have succeeded in the batch. > Task not finished for a long time in monitor UI, but I found it finished in > logs > > > Key: SPARK-22424 > URL: https://issues.apache.org/jira/browse/SPARK-22424 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: chengning >Priority: Blocking > Attachments: 1.jpg, 1.png, C33oL.jpg > > > Task not finished for a long time in monitor UI, but I found it finished in > logs > Thanks a lot. > !https://i.stack.imgur.com/C33oL.jpg! > !C33oL.jpg|thumbnail! > executor log: > 17/09/29 17:32:28 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 213492 > 17/09/29 17:32:28 INFO executor.Executor: Running task 52.0 in stage 2468.0 > (TID 213492) > 17/09/29 17:32:28 INFO storage.ShuffleBlockFetcherIterator: Getting 30 > non-empty blocks out of 30 blocks > 17/09/29 17:32:28 INFO storage.ShuffleBlockFetcherIterator: Started 29 remote > fetches in 1 ms > 17:32:28.447: tcPartition=7 ms > 17/09/29 17:32:28 INFO executor.Executor: Finished task 52.0 in stage 2468.0 > (TID 213492). 
2755 bytes result sent to driver > driver log:: > 17/09/29 17:32:28 INFO scheduler.TaskSetManager: Starting task 52.0 in stage > 2468.0 (TID 213492, HMGQXD2, executor 1, partition 52, PROCESS_LOCAL, 6386 > bytes) > 17/09/29 17:32:28 INFO scheduler.TaskSetManager: Finished task 52.0 in stage > 2468.0 (TID 213492) in 24 ms on HMGQXD2 (executor 1) (53/200) > 17/09/29 17:32:28 INFO cluster.YarnScheduler: Removed TaskSet 2468.0, whose > tasks have all completed, from pool > 17/09/29 17:32:28 INFO scheduler.DAGScheduler: ResultStage 2468 > (foreachPartition at Counter2.java:152) finished in 0.255 s > 17/09/29 17:32:28 INFO scheduler.DAGScheduler: Job 1647 finished: > foreachPartition at Counter2.java:152, took 0.415256 s -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22440) Add Calinski-Harabasz index to ClusteringEvaluator
[ https://issues.apache.org/jira/browse/SPARK-22440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238179#comment-16238179 ] Sean Owen commented on SPARK-22440: --- I had honestly never heard of this. Is this widely used at all? I don't think it's a goal to add every known metric here. > Add Calinski-Harabasz index to ClusteringEvaluator > -- > > Key: SPARK-22440 > URL: https://issues.apache.org/jira/browse/SPARK-22440 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Minor > > In SPARK-14516 we introduced ClusteringEvaluator with an implementation of > Silhouette. > sklearn contains also another metric for the evaluation of unsupervised > clustering results. The metric is Calinski-Harabasz. This JIRA is to add it > to Spark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22433) Linear regression R^2 train/test terminology related
[ https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238150#comment-16238150 ] Sean Owen commented on SPARK-22433: --- Yeah, R^2 _could_ be used as a metric but it belongs a bit more as metadata from the model fitting process. You're right, it's put in too general a place; R^2 really is highly specific to linear models. And yeah, probably can't go prohibiting that now (though if there's a nice place for a warning, maybe worthwhile.) I think it's OK to change the test, but am neutral on it. I would have just written a test case using RMSE. > Linear regression R^2 train/test terminology related > - > > Key: SPARK-22433 > URL: https://issues.apache.org/jira/browse/SPARK-22433 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Teng Peng >Priority: Minor > > Traditional statistics is traditional statistics. Their goal, framework, and > terminologies are not the same as ML. However, in linear regression related > components, this distinction is not clear, which is reflected: > 1. regressionMetric + regressionEvaluator : > * R2 shouldn't be there. > * A better name "regressionPredictionMetric". > 2. LinearRegressionSuite: > * Shouldn't test R2 and residuals on test data. > * There is no train set and test set in this setting. > 3. Terminology: there is no "linear regression with L1 regularization". > Linear regression is linear. Adding a penalty term, then it is no longer > linear. Just call it "LASSO", "ElasticNet". > There are more. I am working on correcting them. > They are not breaking anything, but it does not make one feel good to see the > basic distinction is blurred. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22433) Linear regression R^2 train/test terminology related
[ https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238137#comment-16238137 ] Seth Hendrickson commented on SPARK-22433: -- The main problem I see is that we put "r2" in the `RegressionEvaluator` class, which can be used for all types of regression - e.g. DecisionTreeRegressor, which is non-sensical. Removing it would break compatibility and is probably not worth it since the end user is responsible for using the tools appropriately anyway. I'm not sure there is much to do here. AFAIK using r2 on regularized models is a fuzzy area, but I don't think it's doing much harm to leave it and I don't think we'd be concerned about our test cases. Certainly unit tests don't imply an endorsement of the methodology anyway. > Linear regression R^2 train/test terminology related > - > > Key: SPARK-22433 > URL: https://issues.apache.org/jira/browse/SPARK-22433 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Teng Peng >Priority: Minor > > Traditional statistics is traditional statistics. Their goal, framework, and > terminologies are not the same as ML. However, in linear regression related > components, this distinction is not clear, which is reflected: > 1. regressionMetric + regressionEvaluator : > * R2 shouldn't be there. > * A better name "regressionPredictionMetric". > 2. LinearRegressionSuite: > * Shouldn't test R2 and residuals on test data. > * There is no train set and test set in this setting. > 3. Terminology: there is no "linear regression with L1 regularization". > Linear regression is linear. Adding a penalty term, then it is no longer > linear. Just call it "LASSO", "ElasticNet". > There are more. I am working on correcting them. > They are not breaking anything, but it does not make one feel good to see the > basic distinction is blurred. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22440) Add Calinski-Harabasz index to ClusteringEvaluator
[ https://issues.apache.org/jira/browse/SPARK-22440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238095#comment-16238095 ] Marco Gaido commented on SPARK-22440: - I am preparing an implementation for this. It will still take some time. As soon as it is ready, I'll submit a PR. > Add Calinski-Harabasz index to ClusteringEvaluator > -- > > Key: SPARK-22440 > URL: https://issues.apache.org/jira/browse/SPARK-22440 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Minor > > In SPARK-14516 we introduced ClusteringEvaluator with an implementation of > Silhouette. > sklearn also contains another metric for the evaluation of unsupervised > clustering results. The metric is Calinski-Harabasz. This JIRA is to add it > to Spark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22440) Add Calinski-Harabasz index to ClusteringEvaluator
Marco Gaido created SPARK-22440: --- Summary: Add Calinski-Harabasz index to ClusteringEvaluator Key: SPARK-22440 URL: https://issues.apache.org/jira/browse/SPARK-22440 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.3.0 Reporter: Marco Gaido Priority: Minor In SPARK-14516 we introduced ClusteringEvaluator with an implementation of Silhouette. sklearn also contains another metric for the evaluation of unsupervised clustering results. The metric is Calinski-Harabasz. This JIRA is to add it to Spark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22433) Linear regression R^2 train/test terminology related
[ https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237884#comment-16237884 ] Teng Peng commented on SPARK-22433: --- Thanks for the quick response, Sean. I am glad this issue is discussed in the Spark community. I understand how important coherence is, and it's the users' decision to do what they believe is appropriate. I just want to propose a one-line change: change eval.setMetricName("r2") to "mse" in test("cross validation with linear regression"). Then we would not leave the impression that "Wait, what? Spark officially cross-validates on R2?" > Linear regression R^2 train/test terminology related > - > > Key: SPARK-22433 > URL: https://issues.apache.org/jira/browse/SPARK-22433 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Teng Peng >Priority: Minor > > Traditional statistics is traditional statistics. Their goal, framework, and > terminologies are not the same as ML. However, in linear regression related > components, this distinction is not clear, which is reflected: > 1. regressionMetric + regressionEvaluator : > * R2 shouldn't be there. > * A better name "regressionPredictionMetric". > 2. LinearRegressionSuite: > * Shouldn't test R2 and residuals on test data. > * There is no train set and test set in this setting. > 3. Terminology: there is no "linear regression with L1 regularization". > Linear regression is linear. Adding a penalty term, then it is no longer > linear. Just call it "LASSO", "ElasticNet". > There are more. I am working on correcting them. > They are not breaking anything, but it does not make one feel good to see the > basic distinction is blurred. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22433) Linear regression R^2 train/test terminology related
[ https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237830#comment-16237830 ] Sean Owen commented on SPARK-22433: --- I think one misunderstanding here is that you need not apply an Evaluator to a held-out test set. You could apply it to the training set, and indeed, normally would. I'd further assert that while R^2 isn't that useful to measure generalization error on a held-out test set, it's not incoherent. So it's not even some practice we need to prohibit. But yeah, that's not what it's for, even in MLlib. > Linear regression R^2 train/test terminology related > - > > Key: SPARK-22433 > URL: https://issues.apache.org/jira/browse/SPARK-22433 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Teng Peng >Priority: Minor > > Traditional statistics is traditional statistics. Their goal, framework, and > terminologies are not the same as ML. However, in linear regression related > components, this distinction is not clear, which is reflected: > 1. regressionMetric + regressionEvaluator : > * R2 shouldn't be there. > * A better name "regressionPredictionMetric". > 2. LinearRegressionSuite: > * Shouldn't test R2 and residuals on test data. > * There is no train set and test set in this setting. > 3. Terminology: there is no "linear regression with L1 regularization". > Linear regression is linear. Adding a penalty term, then it is no longer > linear. Just call it "LASSO", "ElasticNet". > There are more. I am working on correcting them. > They are not breaking anything, but it does not make one feel good to see the > basic distinction is blurred. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
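The two metrics under discussion can be written down concretely. Below is a minimal pure-Python sketch (not Spark code; `rmse` and `r2` are hypothetical helper names) of RMSE and the coefficient of determination, illustrating why R^2 is tied to the variance of the sample it is evaluated on while RMSE is not:

```python
import math

def rmse(y_true, y_pred):
    # Root mean squared error: mean of squared residuals, then square root.
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    # SS_tot depends on the mean/variance of whichever sample is evaluated,
    # which is the crux of the train-vs-test debate above.
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
```

With these definitions, a perfect fit yields RMSE 0 and R^2 1 regardless of the sample, but the R^2 of an imperfect fit changes if the evaluated sample's spread changes, even when the residuals do not.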
[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
[ https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237825#comment-16237825 ] Xiao Li commented on SPARK-22211: - We should merge it to the master and the previous releases at first. > LimitPushDown optimization for FullOuterJoin generates wrong results > > > Key: SPARK-22211 > URL: https://issues.apache.org/jira/browse/SPARK-22211 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: on community.cloude.databrick.com > Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11) >Reporter: Benyi Wang >Priority: Major > > LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may > generate a wrong result: > Assume we use limit(1) and LocalLimit will be pushed to left side, and id=999 > is selected, but at right side we have 100K rows including 999, the result > will be > - one row is (999, 999) > - the rest rows are (null, xxx) > Once you call show(), the row (999,999) has only 1/10th chance to be > selected by CollectLimit. > The actual optimization might be, > - push down limit > - but convert the join to Broadcast LeftOuterJoin or RightOuterJoin. 
> Here is my notebook: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html > {code:java} > import scala.util.Random._ > val dl = shuffle(1 to 10).toDF("id") > val dr = shuffle(1 to 10).toDF("id") > println("data frame dl:") > dl.explain > println("data frame dr:") > dr.explain > val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1) > j.explain > j.show(false) > {code} > {code} > data frame dl: > == Physical Plan == > LocalTableScan [id#10] > data frame dr: > == Physical Plan == > LocalTableScan [id#16] > == Physical Plan == > CollectLimit 1 > +- SortMergeJoin [id#10], [id#16], FullOuter >:- *Sort [id#10 ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#10, 200) >: +- *LocalLimit 1 >:+- LocalTableScan [id#10] >+- *Sort [id#16 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#16, 200) > +- LocalTableScan [id#16] > import scala.util.Random._ > dl: org.apache.spark.sql.DataFrame = [id: int] > dr: org.apache.spark.sql.DataFrame = [id: int] > j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int] > ++---+ > |id |id | > ++---+ > |null|148| > ++---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
[ https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237822#comment-16237822 ] Sean Owen commented on SPARK-22211: --- In the name of correctness, still worth disabling for now, and then fixing later right? > LimitPushDown optimization for FullOuterJoin generates wrong results > > > Key: SPARK-22211 > URL: https://issues.apache.org/jira/browse/SPARK-22211 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: on community.cloude.databrick.com > Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11) >Reporter: Benyi Wang >Priority: Major > > LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may > generate a wrong result: > Assume we use limit(1) and LocalLimit will be pushed to left side, and id=999 > is selected, but at right side we have 100K rows including 999, the result > will be > - one row is (999, 999) > - the rest rows are (null, xxx) > Once you call show(), the row (999,999) has only 1/10th chance to be > selected by CollectLimit. > The actual optimization might be, > - push down limit > - but convert the join to Broadcast LeftOuterJoin or RightOuterJoin. 
> Here is my notebook: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html > {code:java} > import scala.util.Random._ > val dl = shuffle(1 to 10).toDF("id") > val dr = shuffle(1 to 10).toDF("id") > println("data frame dl:") > dl.explain > println("data frame dr:") > dr.explain > val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1) > j.explain > j.show(false) > {code} > {code} > data frame dl: > == Physical Plan == > LocalTableScan [id#10] > data frame dr: > == Physical Plan == > LocalTableScan [id#16] > == Physical Plan == > CollectLimit 1 > +- SortMergeJoin [id#10], [id#16], FullOuter >:- *Sort [id#10 ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#10, 200) >: +- *LocalLimit 1 >:+- LocalTableScan [id#10] >+- *Sort [id#16 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#16, 200) > +- LocalTableScan [id#16] > import scala.util.Random._ > dl: org.apache.spark.sql.DataFrame = [id: int] > dr: org.apache.spark.sql.DataFrame = [id: int] > j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int] > ++---+ > |id |id | > ++---+ > |null|148| > ++---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
[ https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237797#comment-16237797 ] Xiao Li commented on SPARK-22211: - The Join operator should be limit aware. Anyway, we can do it later. > LimitPushDown optimization for FullOuterJoin generates wrong results > > > Key: SPARK-22211 > URL: https://issues.apache.org/jira/browse/SPARK-22211 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: on community.cloude.databrick.com > Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11) >Reporter: Benyi Wang >Priority: Major > > LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may > generate a wrong result: > Assume we use limit(1) and LocalLimit will be pushed to left side, and id=999 > is selected, but at right side we have 100K rows including 999, the result > will be > - one row is (999, 999) > - the rest rows are (null, xxx) > Once you call show(), the row (999,999) has only 1/10th chance to be > selected by CollectLimit. > The actual optimization might be, > - push down limit > - but convert the join to Broadcast LeftOuterJoin or RightOuterJoin. 
> Here is my notebook: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html > {code:java} > import scala.util.Random._ > val dl = shuffle(1 to 10).toDF("id") > val dr = shuffle(1 to 10).toDF("id") > println("data frame dl:") > dl.explain > println("data frame dr:") > dr.explain > val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1) > j.explain > j.show(false) > {code} > {code} > data frame dl: > == Physical Plan == > LocalTableScan [id#10] > data frame dr: > == Physical Plan == > LocalTableScan [id#16] > == Physical Plan == > CollectLimit 1 > +- SortMergeJoin [id#10], [id#16], FullOuter >:- *Sort [id#10 ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#10, 200) >: +- *LocalLimit 1 >:+- LocalTableScan [id#10] >+- *Sort [id#16 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#16, 200) > +- LocalTableScan [id#16] > import scala.util.Random._ > dl: org.apache.spark.sql.DataFrame = [id: int] > dr: org.apache.spark.sql.DataFrame = [id: int] > j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int] > ++---+ > |id |id | > ++---+ > |null|148| > ++---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
[ https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237778#comment-16237778 ] Henry Robinson commented on SPARK-22211: [~smilegator] - sounds good! What will your approach be? I wasn't able to see a safe way to push the limit through the join without either a more invasive rewrite or restricting the set of join operators for FOJ. > LimitPushDown optimization for FullOuterJoin generates wrong results > > > Key: SPARK-22211 > URL: https://issues.apache.org/jira/browse/SPARK-22211 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: on community.cloude.databrick.com > Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11) >Reporter: Benyi Wang >Priority: Major > > LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may > generate a wrong result: > Assume we use limit(1) and LocalLimit will be pushed to left side, and id=999 > is selected, but at right side we have 100K rows including 999, the result > will be > - one row is (999, 999) > - the rest rows are (null, xxx) > Once you call show(), the row (999,999) has only 1/10th chance to be > selected by CollectLimit. > The actual optimization might be, > - push down limit > - but convert the join to Broadcast LeftOuterJoin or RightOuterJoin. 
> Here is my notebook: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html > {code:java} > import scala.util.Random._ > val dl = shuffle(1 to 10).toDF("id") > val dr = shuffle(1 to 10).toDF("id") > println("data frame dl:") > dl.explain > println("data frame dr:") > dr.explain > val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1) > j.explain > j.show(false) > {code} > {code} > data frame dl: > == Physical Plan == > LocalTableScan [id#10] > data frame dr: > == Physical Plan == > LocalTableScan [id#16] > == Physical Plan == > CollectLimit 1 > +- SortMergeJoin [id#10], [id#16], FullOuter >:- *Sort [id#10 ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#10, 200) >: +- *LocalLimit 1 >:+- LocalTableScan [id#10] >+- *Sort [id#16 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#16, 200) > +- LocalTableScan [id#16] > import scala.util.Random._ > dl: org.apache.spark.sql.DataFrame = [id: int] > dr: org.apache.spark.sql.DataFrame = [id: int] > j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int] > ++---+ > |id |id | > ++---+ > |null|148| > ++---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
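The incorrectness described in this thread can be reproduced without Spark. The following is a pure-Python sketch (hypothetical `full_outer_join` helper, not Spark's implementation) that simulates pushing a LocalLimit below one side of a full outer join, producing null-padded rows the unoptimized plan would never emit:

```python
def full_outer_join(left, right):
    # Toy full outer join on lists of ids: matched pairs first,
    # then unmatched rows padded with None on the missing side.
    left_set, right_set = set(left), set(right)
    rows = [(i, i) for i in left if i in right_set]
    rows += [(i, None) for i in left if i not in right_set]
    rows += [(None, i) for i in right if i not in left_set]
    return rows

left = list(range(1, 101))
right = list(range(1, 101))

# Correct plan: join first, then take 1 row -> always a matched (i, i) pair.
correct = full_outer_join(left, right)[:1]

# Buggy plan: push the limit into the left side before joining. The single
# surviving left row matches one right row, and the other 99 right rows now
# surface as (None, i) -- rows the un-pushed plan could never produce, so
# taking 1 row afterwards will usually return a null-padded row.
pushed = full_outer_join(left[:1], right)
```

This mirrors the JIRA's observation: after pushdown, the one genuinely matched row is a small minority among the null-padded rows competing for the final CollectLimit.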
[jira] [Assigned] (SPARK-22417) createDataFrame from a pandas.DataFrame reads datetime64 values as longs
[ https://issues.apache.org/jira/browse/SPARK-22417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22417: Assignee: (was: Apache Spark) > createDataFrame from a pandas.DataFrame reads datetime64 values as longs > > > Key: SPARK-22417 > URL: https://issues.apache.org/jira/browse/SPARK-22417 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Bryan Cutler >Priority: Normal > > When trying to create a Spark DataFrame from an existing Pandas DataFrame > using {{createDataFrame}}, columns with datetime64 values are converted as > long values. This is only when the schema is not specified. > {code} > In [2]: import pandas as pd >...: from datetime import datetime >...: > In [3]: pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)]}) > In [4]: df = spark.createDataFrame(pdf) > In [5]: df.show() > +---+ > | ts| > +---+ > |15094116610| > +---+ > In [6]: df.schema > Out[6]: StructType(List(StructField(ts,LongType,true))) > {code} > Spark should interpret a datetime64\[D\] value to DateType and other > datetime64 values to TImestampType. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22433) Linear regression R^2 train/test terminology related
[ https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237774#comment-16237774 ] Teng Peng commented on SPARK-22433: --- What I agree with: be coherent, and prefer the ML-oriented standard. What I want to add: prefer the ML-oriented standard only when we are talking about ML. If we are talking about traditional statistics, we should stick to its established standard. What I want to explain: 1. ML world: there is a training set and a test set. We use them to evaluate whether our models have good prediction performance; without them, overfitting is unavoidable. Traditional statistics world: there is no training set or test set, because the goal is interpretation of the model, not prediction performance. R^2 belongs to the framework of traditional statistics and has nothing to do with prediction-related goals. If we are using R^2, we are in the domain of traditional statistics; if our goal is interpretation, we look at R^2. 2. The regressionMetric and regressionEvaluator are designed for ML-related goals using a linear regression approach (which might be useful for a benchmark). So these two are actually in the domain of ML, not traditional statistics. However, R^2 is mixed into them, and this mixture appears everywhere. Look at test("cross validation with linear regression"): R^2 is evaluated by cross-validation, the larger the better. This is a misunderstanding of what R^2 is. The bottom line: there is a clear distinction between traditional statistics and ML. If something belongs to traditional statistics, we should not mix it with ML. > Linear regression R^2 train/test terminology related > - > > Key: SPARK-22433 > URL: https://issues.apache.org/jira/browse/SPARK-22433 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Teng Peng >Priority: Minor > > Traditional statistics is traditional statistics. 
Their goal, framework, and > terminologies are not the same as ML. However, in linear regression related > components, this distinction is not clear, which is reflected: > 1. regressionMetric + regressionEvaluator : > * R2 shouldn't be there. > * A better name "regressionPredictionMetric". > 2. LinearRegressionSuite: > * Shouldn't test R2 and residuals on test data. > * There is no train set and test set in this setting. > 3. Terminology: there is no "linear regression with L1 regularization". > Linear regression is linear. Adding a penalty term, then it is no longer > linear. Just call it "LASSO", "ElasticNet". > There are more. I am working on correcting them. > They are not breaking anything, but it does not make one feel good to see the > basic distinction is blurred. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22417) createDataFrame from a pandas.DataFrame reads datetime64 values as longs
[ https://issues.apache.org/jira/browse/SPARK-22417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22417: Assignee: Apache Spark > createDataFrame from a pandas.DataFrame reads datetime64 values as longs > > > Key: SPARK-22417 > URL: https://issues.apache.org/jira/browse/SPARK-22417 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Bryan Cutler >Assignee: Apache Spark >Priority: Normal > > When trying to create a Spark DataFrame from an existing Pandas DataFrame > using {{createDataFrame}}, columns with datetime64 values are converted as > long values. This is only when the schema is not specified. > {code} > In [2]: import pandas as pd >...: from datetime import datetime >...: > In [3]: pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)]}) > In [4]: df = spark.createDataFrame(pdf) > In [5]: df.show() > +---+ > | ts| > +---+ > |15094116610| > +---+ > In [6]: df.schema > Out[6]: StructType(List(StructField(ts,LongType,true))) > {code} > Spark should interpret a datetime64\[D\] value to DateType and other > datetime64 values to TImestampType. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22147) BlockId.hashCode allocates a StringBuilder/String on each call
[ https://issues.apache.org/jira/browse/SPARK-22147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237775#comment-16237775 ] Bryan Cutler commented on SPARK-22147: -- Sorry, I linked the above PR to this JIRA accidentally > BlockId.hashCode allocates a StringBuilder/String on each call > -- > > Key: SPARK-22147 > URL: https://issues.apache.org/jira/browse/SPARK-22147 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 2.2.0 >Reporter: Sergei Lebedev >Assignee: Sergei Lebedev >Priority: Minor > Fix For: 2.3.0 > > > The base class {{BlockId}} > [defines|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockId.scala#L44] > {{hashCode}} and {{equals}} for all its subclasses in terms of {{name}}. > This makes the definitions of different ID types [very > concise|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockId.scala#L52]. > The downside, however, is redundant allocations. While I don't think this > could be the major issue, it is still a bit disappointing to increase GC > pressure on the driver for nothing. For our machine learning workloads, we've > seen as much as 10% of all allocations on the driver coming from > {{BlockId.hashCode}} calls done for > [BlockManagerMasterEndpoint.blockLocations|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L54]. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
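The allocation pattern this issue describes — rebuilding a string name on every `hashCode` call — and its fix can be sketched in Python (a toy `BlockId` analogue, not Spark's actual Scala implementation). Caching the derived name and its hash once per instance removes the per-call allocation while preserving identity semantics:

```python
class BlockId:
    """Toy analogue of an id type whose equality and hash derive from a name."""

    def __init__(self, kind, idx):
        self.kind = kind
        self.idx = idx
        # Build the name and its hash exactly once at construction time,
        # mirroring the fix of not allocating a new String per hashCode call.
        self._name = f"{kind}_{idx}"
        self._hash = hash(self._name)

    def __eq__(self, other):
        return isinstance(other, BlockId) and self._name == other._name

    def __hash__(self):
        # Returns the precomputed value; no allocation on the hot path.
        return self._hash
```

In Scala the same effect is typically achieved with a `lazy val` or a constructor-time `val` for the hash, which matters when ids are used heavily as hash-map keys, as in the block-location map cited above.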
[jira] [Commented] (SPARK-22417) createDataFrame from a pandas.DataFrame reads datetime64 values as longs
[ https://issues.apache.org/jira/browse/SPARK-22417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237772#comment-16237772 ] Apache Spark commented on SPARK-22417: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/19646 > createDataFrame from a pandas.DataFrame reads datetime64 values as longs > > > Key: SPARK-22417 > URL: https://issues.apache.org/jira/browse/SPARK-22417 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Bryan Cutler >Priority: Normal > > When trying to create a Spark DataFrame from an existing Pandas DataFrame > using {{createDataFrame}}, columns with datetime64 values are converted to > long values. This happens only when the schema is not specified. > {code} > In [2]: import pandas as pd >...: from datetime import datetime >...: > In [3]: pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)]}) > In [4]: df = spark.createDataFrame(pdf) > In [5]: df.show() > +---+ > | ts| > +---+ > |15094116610| > +---+ > In [6]: df.schema > Out[6]: StructType(List(StructField(ts,LongType,true))) > {code} > Spark should interpret a datetime64\[D\] value as DateType and other > datetime64 values as TimestampType. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
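The long values in the issue are datetime64's nanoseconds-since-epoch representation surfacing unconverted. A stdlib-only sketch of the intended conversion (the `ns_to_timestamp` helper is hypothetical, and a UTC wall clock is assumed; this is not the actual PySpark fix):

```python
from datetime import datetime, timezone

def ns_to_timestamp(ns):
    # datetime64[ns] stores nanoseconds since the Unix epoch. Split into
    # whole seconds plus a sub-second remainder with integer arithmetic
    # (float division would lose precision at this magnitude), then build
    # a timestamp with microsecond resolution, as TimestampType uses.
    secs, rem_ns = divmod(ns, 1_000_000_000)
    return datetime.fromtimestamp(secs, tz=timezone.utc).replace(
        microsecond=rem_ns // 1000
    )

# 2017-10-31 01:01:01 UTC expressed as a datetime64[ns]-style long.
ns = 1509411661 * 1_000_000_000
```

This is why the unconverted column shows an 18-digit-scale long: the nanosecond factor of 10^9 is never divided out when the schema is inferred as LongType.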
[jira] [Commented] (SPARK-22426) Spark AM launching containers on node where External spark shuffle service failed to initialize
[ https://issues.apache.org/jira/browse/SPARK-22426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237769#comment-16237769 ] Prabhu Joseph commented on SPARK-22426: --- Thanks [~jerryshao], we can close this as a duplicate. > Spark AM launching containers on node where External spark shuffle service > failed to initialize > --- > > Key: SPARK-22426 > URL: https://issues.apache.org/jira/browse/SPARK-22426 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph >Priority: Major > > When Spark External Shuffle Service on a NodeManager fails, the remote > executors will fail while fetching the data from the executors launched on > this Node. Spark ApplicationMaster should not launch containers on this Node > or not use external shuffle service. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22437) jdbc write fails to set default mode
[ https://issues.apache.org/jira/browse/SPARK-22437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237733#comment-16237733 ] Adrian Bridgett commented on SPARK-22437: - good grief that was fast! > jdbc write fails to set default mode > > > Key: SPARK-22437 > URL: https://issues.apache.org/jira/browse/SPARK-22437 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.2.0 > Environment: python3 >Reporter: Adrian Bridgett > > With this bit of code: > {code} > df.write.jdbc(jdbc_url, table) > {code} > We see this error: > {code} > 09:54:22 2017-11-03 09:54:19,985 INFO File > "/opt/spark220/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 820, > in jdbc > 09:54:22 2017-11-03 09:54:19,985 INFO File > "/opt/spark220/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line > 1133, in __call__ > 09:54:22 2017-11-03 09:54:19,986 INFO File > "/opt/spark220/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco > 09:54:22 2017-11-03 09:54:19,986 INFO File > "/opt/spark220/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in > get_return_value > 09:54:22 2017-11-03 09:54:19,987 INFO py4j.protocol.Py4JJavaError: An > error occurred while calling o106.jdbc. > 09:54:22 2017-11-03 09:54:19,987 INFO : scala.MatchError: null > 09:54:22 2017-11-03 09:54:19,987 INFO at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:62) > 09:54:22 2017-11-03 09:54:19,987 INFO at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472) > 09:54:22 2017-11-03 09:54:19,987 INFO at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48) > ... 
> {code} > This seems to be that "mode" isn't correctly picking up the "error" default > as listed in > https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame > Note that if it was erroring because of existing data it'd say "SaveMode: > ErrorIfExists." as seen in > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala). > Changing our code to this worked: > {code} > df.write.jdbc(jdbc_url, table, mode='append') > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
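The `scala.MatchError: null` above is the signature of a pattern match that has no case for an unset value. A pure-Python analogue (the `save`/`save_fixed` helpers are hypothetical, not Spark's code) of the failure mode and the defaulting fix:

```python
def save(mode=None):
    # Buggy analogue: only the named modes are matched, so an unset mode
    # falls through every case -- the Python counterpart of MatchError.
    if mode == "append":
        return "appending"
    if mode == "overwrite":
        return "overwriting"
    if mode == "error":
        return "failing if data exists"
    raise ValueError(f"no case matched mode={mode!r}")

def save_fixed(mode=None):
    # Fix: resolve the documented default ("error") before matching,
    # so callers who omit the mode no longer hit the unmatched case.
    return save(mode if mode is not None else "error")
```

This mirrors the report: the documented default was never substituted before the match, so omitting `mode` crashed, while passing `mode='append'` explicitly sidestepped the bug.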
[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237731#comment-16237731 ] Timothy Hunter commented on SPARK-21866: Adding {{spark.read.image}} is going to create a (soft) dependency between the core and mllib, which hosts the implementation of the current reader methods. This is fine and can be dealt with using reflection, but since this would involve adding a core API to Spark, I suggest we do it as a follow-up task. > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter >Priority: Major > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark V1.1.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. 
Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. > This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. > The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Targets users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with image processing libraries such as libOpenCV, Google > TensorFlow, CNTK, and other C libraries. > * The reader should be able to read popular formats of images from > distributed sources. > h2. Non-Goals: > Images are a versatile medium and encompass a very wide range of formats and > representations. 
This SPIP explicitly aims at the most common use case in the > industry currently: multi-channel matrices of binary, int32, int64, float or > double data that can fit comfortably in the heap of the JVM: > * the total size of an image should be restricted to less than 2GB (roughly) > * the meaning of color channels is application-specific and is not mandated > by the standard (in line with the OpenCV standard) > * specialized formats used in meteorology, the medical field, etc. are not > supported > * this format is specialized to images and does not attempt to solve the more > general problem of representing n-dimensional tensors in Spark > h2. Proposed API changes > We propose to add a new package in the package structure, under the MLlib > project: > {{org.apache.spark.image}} > h3. Data format > We propose to add the following structure: > imageSchema = StructType([ > * StructField("mode", StringType(), False), > ** The exact representation of the data. > ** The values are described in the following OpenCV convention. Basically, > the type has both "depth" and "number of channels" info: in particular, type > "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 > (value 32 in the table) with the channel order specified
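The proposed schema can be sketched in plain Python (no Spark required). Only the "mode" field is spelled out verbatim in the proposal text above; the remaining fields are assumptions added for illustration and may differ from the final design:

```python
# Hypothetical sketch of the proposed image schema. Only "mode" appears
# verbatim in the SPIP text; the other fields are assumptions.
image_schema_fields = [
    # (field name, Spark SQL type name, nullable)
    ("mode", "StringType", False),       # OpenCV-style type tag, e.g. "CV_8UC3"
    ("height", "IntegerType", False),    # assumed: image height in pixels
    ("width", "IntegerType", False),     # assumed: image width in pixels
    ("nChannels", "IntegerType", False), # assumed: number of color channels
    ("data", "BinaryType", False),       # assumed: decompressed pixel bytes
]

def field_names(fields):
    """Return just the column names of the schema sketch."""
    return [name for name, _, _ in fields]

print(field_names(image_schema_fields))
```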
[jira] [Commented] (SPARK-22430) Unknown tag warnings when building R docs with Roxygen 6.0.1
[ https://issues.apache.org/jira/browse/SPARK-22430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237719#comment-16237719 ] Felix Cheung commented on SPARK-22430: -- I am seeing it too. I think we can just remove the tag but Jenkins is running an older version of roxygen2 and our earlier attempt to update it didn't go well (we have a JIRA on this). Will need to be very careful with this. > Unknown tag warnings when building R docs with Roxygen 6.0.1 > > > Key: SPARK-22430 > URL: https://issues.apache.org/jira/browse/SPARK-22430 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.3.0 > Environment: Roxygen 6.0.1 >Reporter: Joel Croteau > > When building R docs using create-rd.sh with Roxygen 6.0.1, a large number of > unknown tag warnings are generated: > {noformat} > Warning: @export [schema.R#33]: unknown tag > Warning: @export [schema.R#53]: unknown tag > Warning: @export [schema.R#63]: unknown tag > Warning: @export [schema.R#80]: unknown tag > Warning: @export [schema.R#123]: unknown tag > Warning: @export [schema.R#141]: unknown tag > Warning: @export [schema.R#216]: unknown tag > Warning: @export [generics.R#388]: unknown tag > Warning: @export [generics.R#403]: unknown tag > Warning: @export [generics.R#407]: unknown tag > Warning: @export [generics.R#414]: unknown tag > Warning: @export [generics.R#418]: unknown tag > Warning: @export [generics.R#422]: unknown tag > Warning: @export [generics.R#428]: unknown tag > Warning: @export [generics.R#432]: unknown tag > Warning: @export [generics.R#438]: unknown tag > Warning: @export [generics.R#442]: unknown tag > Warning: @export [generics.R#446]: unknown tag > Warning: @export [generics.R#450]: unknown tag > Warning: @export [generics.R#454]: unknown tag > Warning: @export [generics.R#459]: unknown tag > Warning: @export [generics.R#467]: unknown tag > Warning: @export [generics.R#475]: unknown tag > Warning: @export [generics.R#479]: unknown tag > 
Warning: @export [generics.R#483]: unknown tag > Warning: @export [generics.R#487]: unknown tag > Warning: @export [generics.R#498]: unknown tag > Warning: @export [generics.R#502]: unknown tag > Warning: @export [generics.R#506]: unknown tag > Warning: @export [generics.R#512]: unknown tag > Warning: @export [generics.R#518]: unknown tag > Warning: @export [generics.R#526]: unknown tag > Warning: @export [generics.R#530]: unknown tag > Warning: @export [generics.R#534]: unknown tag > Warning: @export [generics.R#538]: unknown tag > Warning: @export [generics.R#542]: unknown tag > Warning: @export [generics.R#549]: unknown tag > Warning: @export [generics.R#556]: unknown tag > Warning: @export [generics.R#560]: unknown tag > Warning: @export [generics.R#567]: unknown tag > Warning: @export [generics.R#571]: unknown tag > Warning: @export [generics.R#575]: unknown tag > Warning: @export [generics.R#579]: unknown tag > Warning: @export [generics.R#583]: unknown tag > Warning: @export [generics.R#587]: unknown tag > Warning: @export [generics.R#591]: unknown tag > Warning: @export [generics.R#595]: unknown tag > Warning: @export [generics.R#599]: unknown tag > Warning: @export [generics.R#603]: unknown tag > Warning: @export [generics.R#607]: unknown tag > Warning: @export [generics.R#611]: unknown tag > Warning: @export [generics.R#615]: unknown tag > Warning: @export [generics.R#619]: unknown tag > Warning: @export [generics.R#623]: unknown tag > Warning: @export [generics.R#627]: unknown tag > Warning: @export [generics.R#631]: unknown tag > Warning: @export [generics.R#635]: unknown tag > Warning: @export [generics.R#639]: unknown tag > Warning: @export [generics.R#643]: unknown tag > Warning: @export [generics.R#647]: unknown tag > Warning: @export [generics.R#654]: unknown tag > Warning: @export [generics.R#658]: unknown tag > Warning: @export [generics.R#663]: unknown tag > Warning: @export [generics.R#667]: unknown tag > Warning: @export [generics.R#672]: 
unknown tag > Warning: @export [generics.R#676]: unknown tag > Warning: @export [generics.R#680]: unknown tag > Warning: @export [generics.R#684]: unknown tag > Warning: @export [generics.R#690]: unknown tag > Warning: @export [generics.R#696]: unknown tag > Warning: @export [generics.R#702]: unknown tag > Warning: @export [generics.R#706]: unknown tag > Warning: @export [generics.R#710]: unknown tag > Warning: @export [generics.R#716]: unknown tag > Warning: @export [generics.R#720]: unknown tag > Warning: @export [generics.R#726]: unknown tag > Warning: @export [generics.R#730]: unknown tag > Warning: @export [generics.R#734]: unknown tag > Warning: @export [generics.R#738]: unknown tag > Warning: @export [generics.R#742]: unknown tag > Warning: @export [generics.R#750]: unknown tag > Warning:
[jira] [Assigned] (SPARK-22437) jdbc write fails to set default mode
[ https://issues.apache.org/jira/browse/SPARK-22437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22437: Assignee: (was: Apache Spark) > jdbc write fails to set default mode > > > Key: SPARK-22437 > URL: https://issues.apache.org/jira/browse/SPARK-22437 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.2.0 > Environment: python3 >Reporter: Adrian Bridgett > > With this bit of code: > {code} > df.write.jdbc(jdbc_url, table) > {code} > We see this error: > {code} > 09:54:22 2017-11-03 09:54:19,985 INFO File > "/opt/spark220/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 820, > in jdbc > 09:54:22 2017-11-03 09:54:19,985 INFO File > "/opt/spark220/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line > 1133, in __call__ > 09:54:22 2017-11-03 09:54:19,986 INFO File > "/opt/spark220/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco > 09:54:22 2017-11-03 09:54:19,986 INFO File > "/opt/spark220/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in > get_return_value > 09:54:22 2017-11-03 09:54:19,987 INFO py4j.protocol.Py4JJavaError: An > error occurred while calling o106.jdbc. > 09:54:22 2017-11-03 09:54:19,987 INFO : scala.MatchError: null > 09:54:22 2017-11-03 09:54:19,987 INFO at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:62) > 09:54:22 2017-11-03 09:54:19,987 INFO at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472) > 09:54:22 2017-11-03 09:54:19,987 INFO at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48) > ... 
> {code} > This seems to be that "mode" isn't correctly picking up the "error" default > as listed in > https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame > Note that if it was erroring because of existing data it'd say "SaveMode: > ErrorIfExists." as seen in > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala). > Changing our code to this worked: > {code} > df.write.jdbc(jdbc_url, table, mode='append') > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
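A minimal Python sketch of the reported failure mechanism, not Spark's actual implementation: when the default save mode is never filled in, the null value falls through every case of the dispatch, which is what a {{scala.MatchError: null}} at JdbcRelationProvider.createRelation signals. Names here are made up for illustration.

```python
def save_with_mode(mode):
    """Toy dispatch mirroring a pattern match over save modes.
    A missing default (None) matches no case and raises,
    analogous to scala.MatchError: null in the report."""
    if mode == "error":
        return "fail if table exists"
    elif mode == "append":
        return "insert rows into existing table"
    elif mode == "overwrite":
        return "replace table contents"
    elif mode == "ignore":
        return "no-op if table exists"
    # None falls through every case, like an unhandled null in Scala
    raise ValueError(f"no case matched save mode: {mode!r}")

print(save_with_mode("append"))
```

The workaround in the report, passing {{mode='append'}} explicitly, simply ensures a value reaches the dispatch.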
[jira] [Commented] (SPARK-22437) jdbc write fails to set default mode
[ https://issues.apache.org/jira/browse/SPARK-22437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237716#comment-16237716 ] Apache Spark commented on SPARK-22437: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/19654 > jdbc write fails to set default mode > > > Key: SPARK-22437 > URL: https://issues.apache.org/jira/browse/SPARK-22437 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.2.0 > Environment: python3 >Reporter: Adrian Bridgett > > With this bit of code: > {code} > df.write.jdbc(jdbc_url, table) > {code} > We see this error: > {code} > 09:54:22 2017-11-03 09:54:19,985 INFO File > "/opt/spark220/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 820, > in jdbc > 09:54:22 2017-11-03 09:54:19,985 INFO File > "/opt/spark220/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line > 1133, in __call__ > 09:54:22 2017-11-03 09:54:19,986 INFO File > "/opt/spark220/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco > 09:54:22 2017-11-03 09:54:19,986 INFO File > "/opt/spark220/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in > get_return_value > 09:54:22 2017-11-03 09:54:19,987 INFO py4j.protocol.Py4JJavaError: An > error occurred while calling o106.jdbc. > 09:54:22 2017-11-03 09:54:19,987 INFO : scala.MatchError: null > 09:54:22 2017-11-03 09:54:19,987 INFO at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:62) > 09:54:22 2017-11-03 09:54:19,987 INFO at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472) > 09:54:22 2017-11-03 09:54:19,987 INFO at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48) > ... 
> {code} > This seems to be that "mode" isn't correctly picking up the "error" default > as listed in > https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame > Note that if it was erroring because of existing data it'd say "SaveMode: > ErrorIfExists." as seen in > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala). > Changing our code to this worked: > {code} > df.write.jdbc(jdbc_url, table, mode='append') > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22437) jdbc write fails to set default mode
[ https://issues.apache.org/jira/browse/SPARK-22437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22437: Assignee: Apache Spark > jdbc write fails to set default mode > > > Key: SPARK-22437 > URL: https://issues.apache.org/jira/browse/SPARK-22437 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.2.0 > Environment: python3 >Reporter: Adrian Bridgett >Assignee: Apache Spark > > With this bit of code: > {code} > df.write.jdbc(jdbc_url, table) > {code} > We see this error: > {code} > 09:54:22 2017-11-03 09:54:19,985 INFO File > "/opt/spark220/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 820, > in jdbc > 09:54:22 2017-11-03 09:54:19,985 INFO File > "/opt/spark220/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line > 1133, in __call__ > 09:54:22 2017-11-03 09:54:19,986 INFO File > "/opt/spark220/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco > 09:54:22 2017-11-03 09:54:19,986 INFO File > "/opt/spark220/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in > get_return_value > 09:54:22 2017-11-03 09:54:19,987 INFO py4j.protocol.Py4JJavaError: An > error occurred while calling o106.jdbc. > 09:54:22 2017-11-03 09:54:19,987 INFO : scala.MatchError: null > 09:54:22 2017-11-03 09:54:19,987 INFO at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:62) > 09:54:22 2017-11-03 09:54:19,987 INFO at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472) > 09:54:22 2017-11-03 09:54:19,987 INFO at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48) > ... 
> {code} > This seems to be that "mode" isn't correctly picking up the "error" default > as listed in > https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame > Note that if it was erroring because of existing data it'd say "SaveMode: > ErrorIfExists." as seen in > ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala). > Changing our code to this worked: > {code} > df.write.jdbc(jdbc_url, table, mode='append') > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22433) Linear regression R^2 train/test terminology related
[ https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237676#comment-16237676 ] Sean Owen commented on SPARK-22433: --- Likewise, the goal here is not to adopt statistics terminology. As the name implies, MLlib is coming more from ML terminology and practices. We should stick to standard language and approaches where there is a clear standard, completely agree; where there is an ML-oriented standard, we should probably prefer that one for consistency. I'm not sure I agree with these yet. What is a regression prediction metric, and why can't R^2 be an evaluation metric? I am not sure I'd use it that way, but it's coherent. There's no issue with computing R^2 on a test set vs. the train set it was trained on -- could be negative, sure. I understand your distinction, but "linear regression with L1 reg" still produces a linear model. The cost function is not the same as in linear regression, but the name is also different. LASSO is, I think, less well known to the people who would use this than L1. While I wouldn't mind LASSO, I don't see a clear motive to change this. > Linear regression R^2 train/test terminology related > - > > Key: SPARK-22433 > URL: https://issues.apache.org/jira/browse/SPARK-22433 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Teng Peng >Priority: Minor > > Traditional statistics is traditional statistics. Their goal, framework, and > terminologies are not the same as ML. However, in linear regression related > components, this distinction is not clear, which is reflected in: > 1. regressionMetric + regressionEvaluator : > * R2 shouldn't be there. > * A better name would be "regressionPredictionMetric". > 2. LinearRegressionSuite: > * Shouldn't test R2 and residuals on test data. > * There is no train set and test set in this setting. > 3. Terminology: there is no "linear regression with L1 regularization". > Linear regression is linear. 
Once a penalty term is added, it is no longer > linear. Just call it "LASSO" or "ElasticNet". > There are more. I am working on correcting them. > They are not breaking anything, but it does not make one feel good to see the > basic distinction blurred. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
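The point in the comment above that R^2 on a held-out test set "could be negative, sure" is easy to demonstrate with made-up numbers. A self-contained plain-Python sketch (not MLlib code):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

# On held-out data a poor model can score below zero, i.e. worse than
# always predicting the test-set mean. The numbers are made up.
y_test = [1.0, 2.0, 3.0]
bad_predictions = [3.0, 3.0, 3.0]  # hypothetical model output
print(r_squared(y_test, bad_predictions))  # -1.5
```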
[jira] [Assigned] (SPARK-22418) Add test cases for NULL Handling
[ https://issues.apache.org/jira/browse/SPARK-22418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22418: Assignee: (was: Apache Spark) > Add test cases for NULL Handling > > > Key: SPARK-22418 > URL: https://issues.apache.org/jira/browse/SPARK-22418 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Normal > > Add test cases mentioned in the link https://sqlite.org/nulls.html -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22418) Add test cases for NULL Handling
[ https://issues.apache.org/jira/browse/SPARK-22418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22418: Assignee: Apache Spark > Add test cases for NULL Handling > > > Key: SPARK-22418 > URL: https://issues.apache.org/jira/browse/SPARK-22418 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Normal > > Add test cases mentioned in the link https://sqlite.org/nulls.html -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22418) Add test cases for NULL Handling
[ https://issues.apache.org/jira/browse/SPARK-22418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237647#comment-16237647 ] Apache Spark commented on SPARK-22418: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/19653 > Add test cases for NULL Handling > > > Key: SPARK-22418 > URL: https://issues.apache.org/jira/browse/SPARK-22418 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Normal > > Add test cases mentioned in the link https://sqlite.org/nulls.html -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22439) Not able to get numeric columns for the file having decimal values
[ https://issues.apache.org/jira/browse/SPARK-22439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237626#comment-16237626 ] Sean Owen commented on SPARK-22439: --- Your code doesn't work as is, but I adapted it to Scala to try it. I realize you're calling a private method, which happens to be visible in the Java bytecode but isn't meant to be called. I don't think you're intended to call this directly. > Not able to get numeric columns for the file having decimal values > -- > > Key: SPARK-22439 > URL: https://issues.apache.org/jira/browse/SPARK-22439 > Project: Spark > Issue Type: Bug > Components: Java API, SQL >Affects Versions: 2.2.0 >Reporter: Navya Krishnappa >Priority: Major > > When reading the below-mentioned decimal value by specifying header as true. > SourceFile: > 8.95977565356765764E+20 > 8.95977565356765764E+20 > 8.95977565356765764E+20 > Source code1: > Dataset dataset = getSqlContext().read() > .option(PARSER_LIB, "commons") > .option(INFER_SCHEMA, "true") > .option(HEADER, "true") > .option(DELIMITER, ",") > .option(QUOTE, "\"") > .option(ESCAPE, " > ") > .option(MODE, Mode.PERMISSIVE) > .csv(sourceFile); > dataset.numericColumns() > Result: > Caused by: java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.Dataset$$anonfun$numericColumns$2.apply(Dataset.scala:223) > at > org.apache.spark.sql.Dataset$$anonfun$numericColumns$2.apply(Dataset.scala:222) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) -- 
This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
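Since {{numericColumns}} is private and not meant to be called, one supported alternative is to inspect the column types through the public schema/dtypes API and select the numeric ones. The sketch below is plain Python over a df.dtypes-style list of (name, type) pairs; the column names, types, and the set of "numeric" type strings are assumptions for illustration:

```python
# Hypothetical filter over df.dtypes-style pairs; the type-string set is
# an assumption, not an exhaustive list of Spark SQL numeric types.
NUMERIC_TYPES = {"int", "bigint", "smallint", "tinyint", "float", "double"}

def numeric_column_names(dtypes):
    """Return names of columns whose type string looks numeric."""
    return [name for name, t in dtypes
            if t in NUMERIC_TYPES or t.startswith("decimal")]

dtypes = [("id", "bigint"), ("value", "decimal(21,0)"), ("label", "string")]
print(numeric_column_names(dtypes))  # ['id', 'value']
```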
[jira] [Commented] (SPARK-22438) OutOfMemoryError on very small data sets
[ https://issues.apache.org/jira/browse/SPARK-22438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237553#comment-16237553 ] Morten Hornbech commented on SPARK-22438: - I honestly can't see whether those are duplicates. I have seen these OOM errors occur in a lot of very different situations, and I am not an expert on Spark Core. Since the linked issue is resolved you could run my 10 lines of code on your 2.3.0 development build to know for sure. > OutOfMemoryError on very small data sets > > > Key: SPARK-22438 > URL: https://issues.apache.org/jira/browse/SPARK-22438 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Morten Hornbech >Priority: Critical > > We have a customer that uses Spark as an engine for running SQL on a > collection of small datasets, typically no greater than a few thousand rows. > Recently we started observing out-of-memory errors on some new workloads. > Even though the datasets were only a few kilobytes, the job would almost > immediately spike to > 10GB of memory usage, producing an out-of-memory error > on the modest hardware (2 CPUs, 16 RAM) that is used. Using larger hardware > and allocating more memory to Spark (4 CPUs, 32 RAM) made the job complete, > but still with an unreasonably high memory usage. > The query involved was a left join on two datasets. In some, but not all, > cases we were able to remove or reduce the problem by rewriting the query to > use an exists sub-select instead. 
After a lot of debugging we were able to > reproduce the problem locally with the following test: > {code:java} > case class Data(value: String) > val session = SparkSession.builder.master("local[1]").getOrCreate() > import session.implicits._ > val foo = session.createDataset((1 to 500).map(i => Data(i.toString))) > val bar = session.createDataset((1 to 1).map(i => Data(i.toString))) > foo.persist(StorageLevel.MEMORY_ONLY) > foo.createTempView("foo") > bar.createTempView("bar") > val result = session.sql("select * from bar left join foo on bar.value = > foo.value") > result.coalesce(2).collect() > {code} > Running this produces the error below: > {code:java} > java.lang.OutOfMemoryError: Unable to acquire 28 bytes of memory, got 0 >at > org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:127) >at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:372) >at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396) >at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109) >at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) >at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) >at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) >at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) >at > org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) >at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoinExec.scala:774) >at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.(SortMergeJoinExec.scala:649) >at > 
org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:198) >at > org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:136) >at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) >at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) >at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) >at > org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:100) >at > org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:99) >at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) >at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) >at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) >at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) >at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) >at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) >at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) >at
[jira] [Assigned] (SPARK-22407) Add rdd id column on storage page to speed up navigating
[ https://issues.apache.org/jira/browse/SPARK-22407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-22407: - Assignee: zhoukang > Add rdd id column on storage page to speed up navigating > > > Key: SPARK-22407 > URL: https://issues.apache.org/jira/browse/SPARK-22407 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.0 >Reporter: zhoukang >Assignee: zhoukang > Fix For: 2.3.0 > > Attachments: add-rddid.png, rdd-cache.png > > > We can add rdd id column on storage page to speed up navigating when many > rdds are cached. > Example after adding as below: > !add-rddid.png! > !rdd-cache.png! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22407) Add rdd id column on storage page to speed up navigating
[ https://issues.apache.org/jira/browse/SPARK-22407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22407. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19625 [https://github.com/apache/spark/pull/19625] > Add rdd id column on storage page to speed up navigating > > > Key: SPARK-22407 > URL: https://issues.apache.org/jira/browse/SPARK-22407 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.0 >Reporter: zhoukang > Fix For: 2.3.0 > > Attachments: add-rddid.png, rdd-cache.png > > > We can add rdd id column on storage page to speed up navigating when many > rdds are cached. > Example after adding as below: > !add-rddid.png! > !rdd-cache.png! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22436) New function strip() to remove all whitespace from string
[ https://issues.apache.org/jira/browse/SPARK-22436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237530#comment-16237530 ] Sean Owen commented on SPARK-22436: --- Yes, but that's true of any Python UDF, and not every one can be built into Spark. These functions you cite are there for SQL support. I don't think the idea is to expand into all variants. > New function strip() to remove all whitespace from string > - > > Key: SPARK-22436 > URL: https://issues.apache.org/jira/browse/SPARK-22436 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.2.0 >Reporter: Andreas Maier >Priority: Minor > Labels: features > > Since ticket SPARK-17299 the [trim() > function|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.trim] > will not remove any whitespace characters from beginning and end of a string > but only spaces. This is correct in regard to the SQL standard, but it opens > a gap in functionality. > My suggestion is to add to the Spark API in analogy to Python's standard > library the functions l/r/strip(), which should remove all whitespace > characters from a string from beginning and/or end of a string respectively. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
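For reference, the regexp_replace-based approach suggested in the earlier comment can be sketched with Python's re module; in Spark SQL the same patterns would be passed to regexp_replace(col, pattern, ''). The patterns below are an assumption about what "all whitespace" should mean:

```python
import re

# Whitespace-stripping via anchored regex substitution, mirroring what a
# regexp_replace-based l/r/strip would do. \s covers tabs, newlines, etc.
def lstrip_ws(s):
    return re.sub(r"^\s+", "", s)   # leading whitespace only

def rstrip_ws(s):
    return re.sub(r"\s+$", "", s)   # trailing whitespace only

def strip_ws(s):
    return re.sub(r"^\s+|\s+$", "", s)  # both ends, inner spaces kept

print(strip_ws("\t hello world \n"))  # 'hello world'
```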
[jira] [Resolved] (SPARK-22438) OutOfMemoryError on very small data sets
[ https://issues.apache.org/jira/browse/SPARK-22438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22438. --- Resolution: Duplicate Have a look through JIRA first. This looks like https://issues.apache.org/jira/browse/SPARK-21033 > OutOfMemoryError on very small data sets > > > Key: SPARK-22438 > URL: https://issues.apache.org/jira/browse/SPARK-22438 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Morten Hornbech >Priority: Critical > > We have a customer that uses Spark as an engine for running SQL on a > collection of small datasets, typically no greater than a few thousand rows. > Recently we started observing out-of-memory errors on some new workloads. > Even though the datasets were only a few kilobytes, the job would almost > immediately spike to > 10GB of memory usage, producing an out-of-memory error > on the modest hardware (2 CPUs, 16 RAM) that is used. Using larger hardware > and allocating more memory to Spark (4 CPUs, 32 RAM) made the job complete, > but still with an unreasonably high memory usage. > The query involved was a left join on two datasets. In some, but not all, > cases we were able to remove or reduce the problem by rewriting the query to > use an exists sub-select instead. 
After a lot of debugging we were able to > reproduce the problem locally with the following test: > {code:java} > case class Data(value: String) > val session = SparkSession.builder.master("local[1]").getOrCreate() > import session.implicits._ > val foo = session.createDataset((1 to 500).map(i => Data(i.toString))) > val bar = session.createDataset((1 to 1).map(i => Data(i.toString))) > foo.persist(StorageLevel.MEMORY_ONLY) > foo.createTempView("foo") > bar.createTempView("bar") > val result = session.sql("select * from bar left join foo on bar.value = > foo.value") > result.coalesce(2).collect() > {code} > Running this produces the error below: > {code:java} > java.lang.OutOfMemoryError: Unable to acquire 28 bytes of memory, got 0 >at > org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:127) >at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:372) >at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396) >at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109) >at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) >at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) >at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) >at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) >at > org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) >at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoinExec.scala:774) >at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.(SortMergeJoinExec.scala:649) >at > 
org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:198) >at > org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:136) >at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) >at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) >at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) >at > org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:100) >at > org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:99) >at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) >at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) >at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) >at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) >at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) >at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) >at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) >at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) >at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) >at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) >at
[jira] [Updated] (SPARK-22439) Not able to get numeric columns for the file having decimal values
[ https://issues.apache.org/jira/browse/SPARK-22439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Navya Krishnappa updated SPARK-22439: - Summary: Not able to get numeric columns for the file having decimal values (was: Not able to get numeric columns for the attached file) > Not able to get numeric columns for the file having decimal values > -- > > Key: SPARK-22439 > URL: https://issues.apache.org/jira/browse/SPARK-22439 > Project: Spark > Issue Type: Bug > Components: Java API, SQL >Affects Versions: 2.2.0 >Reporter: Navya Krishnappa >Priority: Major > > When reading the below-mentioned decimal values with the header option set to > true, the following error occurs. > SourceFile: > 8.95977565356765764E+20 > 8.95977565356765764E+20 > 8.95977565356765764E+20 > Source code1: > Dataset dataset = getSqlContext().read() > .option(PARSER_LIB, "commons") > .option(INFER_SCHEMA, "true") > .option(HEADER, "true") > .option(DELIMITER, ",") > .option(QUOTE, "\"") > .option(ESCAPE, " > ") > .option(MODE, Mode.PERMISSIVE) > .csv(sourceFile); > dataset.numericColumns() > Result: > Caused by: java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.Dataset$$anonfun$numericColumns$2.apply(Dataset.scala:223) > at > org.apache.spark.sql.Dataset$$anonfun$numericColumns$2.apply(Dataset.scala:222) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional 
commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22439) Not able to get numeric columns for the attached file
[ https://issues.apache.org/jira/browse/SPARK-22439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Navya Krishnappa updated SPARK-22439: - Summary: Not able to get numeric columns for the attached file (was: Not able to get numeric column for the attached file) > Not able to get numeric columns for the attached file > - > > Key: SPARK-22439 > URL: https://issues.apache.org/jira/browse/SPARK-22439 > Project: Spark > Issue Type: Bug > Components: Java API, SQL >Affects Versions: 2.2.0 >Reporter: Navya Krishnappa >Priority: Major > > When reading the below-mentioned decimal value by specifying header as true. > SourceFile: > 8.95977565356765764E+20 > 8.95977565356765764E+20 > 8.95977565356765764E+20 > Source code1: > Dataset dataset = getSqlContext().read() > .option(PARSER_LIB, "commons") > .option(INFER_SCHEMA, "true") > .option(HEADER, "true") > .option(DELIMITER, ",") > .option(QUOTE, "\"") > .option(ESCAPE, " > ") > .option(MODE, Mode.PERMISSIVE) > .csv(sourceFile); > dataset.numericColumns() > Result: > Caused by: java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.Dataset$$anonfun$numericColumns$2.apply(Dataset.scala:223) > at > org.apache.spark.sql.Dataset$$anonfun$numericColumns$2.apply(Dataset.scala:222) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
[jira] [Created] (SPARK-22439) Not able to get numeric column for the attached file
Navya Krishnappa created SPARK-22439: Summary: Not able to get numeric column for the attached file Key: SPARK-22439 URL: https://issues.apache.org/jira/browse/SPARK-22439 Project: Spark Issue Type: Bug Components: Java API, SQL Affects Versions: 2.2.0 Reporter: Navya Krishnappa Priority: Major When reading the below-mentioned decimal values with the header option set to true, the following error occurs. SourceFile: 8.95977565356765764E+20 8.95977565356765764E+20 8.95977565356765764E+20 Source code1: Dataset dataset = getSqlContext().read() .option(PARSER_LIB, "commons") .option(INFER_SCHEMA, "true") .option(HEADER, "true") .option(DELIMITER, ",") .option(QUOTE, "\"") .option(ESCAPE, " ") .option(MODE, Mode.PERMISSIVE) .csv(sourceFile); dataset.numericColumns() Result: Caused by: java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:347) at scala.None$.get(Option.scala:345) at org.apache.spark.sql.Dataset$$anonfun$numericColumns$2.apply(Dataset.scala:223) at org.apache.spark.sql.Dataset$$anonfun$numericColumns$2.apply(Dataset.scala:222) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
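For context on why this particular literal is awkward for schema inference, a quick check with Python's decimal module (illustrative only, not Spark code) shows it has 18 significant digits and a magnitude near 9e20, well outside the range of a 64-bit integer column, so it must be inferred as a decimal or floating-point type:

```python
# Illustrative sketch: inspect the problematic CSV literal with Python's
# decimal module. This is not Spark code; it only shows the value's shape.
from decimal import Decimal

value = Decimal("8.95977565356765764E+20")

significant_digits = len(value.as_tuple().digits)
print(significant_digits)          # 18
print(value > Decimal(2**63 - 1))  # True: exceeds the signed 64-bit range
```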
[jira] [Updated] (SPARK-22438) OutOfMemoryError on very small data sets
[ https://issues.apache.org/jira/browse/SPARK-22438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Morten Hornbech updated SPARK-22438: Description: We have a customer that uses Spark as an engine for running SQL on a collection of small datasets, typically no greater than a few thousand rows. Recently we started observing out-of-memory errors on some new workloads. Even though the datasets were only a few kilobytes, the job would almost immediately spike to > 10GB of memory usage, producing an out-of-memory error on the modest hardware (2 CPUs, 16 GB RAM) that is used. Using larger hardware and allocating more memory to Spark (4 CPUs, 32 GB RAM) made the job complete, but still with unreasonably high memory usage. The query involved was a left join on two datasets. In some, but not all, cases we were able to remove or reduce the problem by rewriting the query to use an exists sub-select instead. After a lot of debugging we were able to reproduce the problem locally with the following test: {code:java} case class Data(value: String) val session = SparkSession.builder.master("local[1]").getOrCreate() import session.implicits._ val foo = session.createDataset((1 to 500).map(i => Data(i.toString))) val bar = session.createDataset((1 to 1).map(i => Data(i.toString))) foo.persist(StorageLevel.MEMORY_ONLY) foo.createTempView("foo") bar.createTempView("bar") val result = session.sql("select * from bar left join foo on bar.value = foo.value") result.coalesce(2).collect() {code} Running this produces the error below: {code:java} java.lang.OutOfMemoryError: Unable to acquire 28 bytes of memory, got 0 at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:127) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:372) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396) at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoinExec.scala:774) at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.(SortMergeJoinExec.scala:649) at org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:198) at org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:136) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:100) at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:99) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} The exact failure point varies with the number of threads given to Spark, the "coalesce" value and the number of rows in "foo". Using an inner join, removing the call to persist, removing the call to coalesce (or using repartition) will all independently make the error go away. The reason persist and
[jira] [Created] (SPARK-22438) OutOfMemoryError on very small data sets
Morten Hornbech created SPARK-22438: --- Summary: OutOfMemoryError on very small data sets Key: SPARK-22438 URL: https://issues.apache.org/jira/browse/SPARK-22438 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Morten Hornbech Priority: Critical We have a customer that uses Spark as an engine for running SQL on a collection of small datasets, typically no greater than a few thousand rows. Recently we started observing out-of-memory errors on some new workloads. Even though the datasets were only a few kilobytes, the job would almost immediately spike to > 10GB of memory usage, producing an out-of-memory error on the modest hardware (2 CPUs, 16 GB RAM) that is used. Using larger hardware and allocating more memory to Spark (4 CPUs, 32 GB RAM) made the job complete, but still with unreasonably high memory usage. The query involved was a left join on two datasets. In some, but not all, cases we were able to remove or reduce the problem by rewriting the query to use an exists sub-select instead. 
After a lot of debugging we were able to reproduce the problem locally with the following test: {{ case class Data(value: String) val session = SparkSession.builder.master("local[1]").getOrCreate() import session.implicits._ val foo = session.createDataset((1 to 500).map(i => Data(i.toString))) val bar = session.createDataset((1 to 1).map(i => Data(i.toString))) foo.persist(StorageLevel.MEMORY_ONLY) foo.createTempView("foo") bar.createTempView("bar") val result = session.sql("select * from bar left join foo on bar.value = foo.value") result.coalesce(2).collect() }} Running this produces the error below: {{ java.lang.OutOfMemoryError: Unable to acquire 28 bytes of memory, got 0 at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:127) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:372) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoinExec.scala:774) at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.(SortMergeJoinExec.scala:649) at org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:198) at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:136) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:100) at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:99) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) }} The exact failure point varies with the number of threads given to spark, the "coalesce" value and the number of rows
[jira] [Created] (SPARK-22437) jdbc write fails to set default mode
Adrian Bridgett created SPARK-22437: --- Summary: jdbc write fails to set default mode Key: SPARK-22437 URL: https://issues.apache.org/jira/browse/SPARK-22437 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 2.2.0 Environment: python3 Reporter: Adrian Bridgett With this bit of code: {code} df.write.jdbc(jdbc_url, table) {code} We see this error: {code} 09:54:22 2017-11-03 09:54:19,985 INFO File "/opt/spark220/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 820, in jdbc 09:54:22 2017-11-03 09:54:19,985 INFO File "/opt/spark220/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ 09:54:22 2017-11-03 09:54:19,986 INFO File "/opt/spark220/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco 09:54:22 2017-11-03 09:54:19,986 INFO File "/opt/spark220/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value 09:54:22 2017-11-03 09:54:19,987 INFO py4j.protocol.Py4JJavaError: An error occurred while calling o106.jdbc. 09:54:22 2017-11-03 09:54:19,987 INFO : scala.MatchError: null 09:54:22 2017-11-03 09:54:19,987 INFO at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:62) 09:54:22 2017-11-03 09:54:19,987 INFO at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472) 09:54:22 2017-11-03 09:54:19,987 INFO at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48) ... {code} This seems to be because "mode" isn't correctly picking up the "error" default listed in https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame Note that if it were failing because of existing data, it would say "SaveMode: ErrorIfExists." (as seen in ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala). 
Changing our code to this worked: {code} df.write.jdbc(jdbc_url, table, mode='append') {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
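The failure pattern here is a mode dispatch with no branch for a null/unset value, which is what the scala.MatchError suggests. A defensive version would normalize a missing mode to the documented "error" default. The sketch below is purely illustrative Python (the function name and mode set are assumptions for the example, not Spark source):

```python
# Hypothetical sketch, not Spark source: a save-mode lookup that defaults
# an unset mode to "error" instead of failing on None (the MatchError analog).
VALID_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}

def resolve_save_mode(mode=None):
    """Return a normalized save mode, defaulting to 'error' when unset."""
    if mode is None:
        return "error"  # documented default: fail if the table already exists
    normalized = mode.lower()
    if normalized not in VALID_MODES:
        raise ValueError(f"unknown save mode: {mode!r}")
    return normalized

print(resolve_save_mode())            # error
print(resolve_save_mode("Append"))    # append
```

Passing mode='append' explicitly, as in the workaround above, sidesteps the unset-mode path entirely.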
[jira] [Commented] (SPARK-22436) New function strip() to remove all whitespace from string
[ https://issues.apache.org/jira/browse/SPARK-22436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237446#comment-16237446 ] Andreas Maier commented on SPARK-22436: --- Python UDFs are very slow, aren't they? I believe a Spark native function would be much faster. And in fact it was already available with trim() before SPARK-17299. > New function strip() to remove all whitespace from string > - > > Key: SPARK-22436 > URL: https://issues.apache.org/jira/browse/SPARK-22436 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.2.0 >Reporter: Andreas Maier >Priority: Minor > Labels: features > > Since ticket SPARK-17299 the [trim() > function|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.trim] > will not remove any whitespace characters from beginning and end of a string > but only spaces. This is correct in regard to the SQL standard, but it opens > a gap in functionality. > My suggestion is to add to the Spark API in analogy to pythons standard > library the functions l/r/strip(), which should remove all whitespace > characters from a string from beginning and/or end of a string respectively. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22418) Add test cases for NULL Handling
[ https://issues.apache.org/jira/browse/SPARK-22418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237442#comment-16237442 ] Marco Gaido commented on SPARK-22418: - can I work on this? > Add test cases for NULL Handling > > > Key: SPARK-22418 > URL: https://issues.apache.org/jira/browse/SPARK-22418 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Normal > > Add test cases mentioned in the link https://sqlite.org/nulls.html -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22436) New function strip() to remove all whitespace from string
[ https://issues.apache.org/jira/browse/SPARK-22436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237420#comment-16237420 ] Sean Owen commented on SPARK-22436: --- Why does it need to be in Spark as opposed to a simple UDF? > New function strip() to remove all whitespace from string > - > > Key: SPARK-22436 > URL: https://issues.apache.org/jira/browse/SPARK-22436 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.2.0 >Reporter: Andreas Maier >Priority: Minor > Labels: features > > Since ticket SPARK-17299 the [trim() > function|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.trim] > will not remove any whitespace characters from beginning and end of a string > but only spaces. This is correct in regard to the SQL standard, but it opens > a gap in functionality. > My suggestion is to add to the Spark API in analogy to pythons standard > library the functions l/r/strip(), which should remove all whitespace > characters from a string from beginning and/or end of a string respectively. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22435) Support processing array and map type using script
[ https://issues.apache.org/jira/browse/SPARK-22435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jin xing updated SPARK-22435: - Priority: Major (was: Critical) > Support processing array and map type using script > -- > > Key: SPARK-22435 > URL: https://issues.apache.org/jira/browse/SPARK-22435 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: jin xing >Priority: Major > > Currently, it is not supported to use a script (e.g. Python) to process array > or map types; it will complain with the messages below: > {{org.apache.spark.sql.catalyst.expressions.UnsafeArrayData cannot be cast to > [Ljava.lang.Object}} > {{org.apache.spark.sql.catalyst.expressions.UnsafeMapData cannot be cast to > java.util.Map}} > Could we support it? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22436) New function strip() to remove all whitespace from string
Andreas Maier created SPARK-22436: - Summary: New function strip() to remove all whitespace from string Key: SPARK-22436 URL: https://issues.apache.org/jira/browse/SPARK-22436 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 2.2.0 Reporter: Andreas Maier Priority: Minor Since ticket SPARK-17299 the [trim() function|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.trim] will not remove any whitespace characters from the beginning and end of a string, but only spaces. This is correct with regard to the SQL standard, but it opens a gap in functionality. My suggestion is to add to the Spark API, in analogy to Python's standard library, the functions l/r/strip(), which should remove all whitespace characters from the beginning and/or end of a string, respectively. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
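As suggested in the comments, the same trimming can be expressed today as a regular-expression replacement. The pattern below is demonstrated with Python's re module as a stand-in; note that Java's \s (which Spark's regexp_replace uses) matches ASCII whitespace only, while Python's \s also matches Unicode whitespace, so results may differ for exotic characters:

```python
import re

# Anchored pattern removing leading and trailing whitespace runs only,
# the equivalent of Python's str.strip() for regexp_replace-style APIs.
STRIP_PATTERN = r"^\s+|\s+$"

def strip_whitespace(s: str) -> str:
    return re.sub(STRIP_PATTERN, "", s)

print(strip_whitespace("\t  hello world \n"))  # hello world
```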
[jira] [Assigned] (SPARK-22435) Support processing array and map type using script
[ https://issues.apache.org/jira/browse/SPARK-22435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22435: Assignee: Apache Spark > Support processing array and map type using script > -- > > Key: SPARK-22435 > URL: https://issues.apache.org/jira/browse/SPARK-22435 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: jin xing >Assignee: Apache Spark >Priority: Critical > > Currently, It is not supported to use script(e.g. python) to process array > type or map type, it will complain with below message: > {{org.apache.spark.sql.catalyst.expressions.UnsafeArrayData cannot be cast to > [Ljava.lang.Object}} > {{org.apache.spark.sql.catalyst.expressions.UnsafeMapData cannot be cast to > java.util.Map}} > Could we support it? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22435) Support processing array and map type using script
[ https://issues.apache.org/jira/browse/SPARK-22435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237407#comment-16237407 ] Apache Spark commented on SPARK-22435: -- User 'jinxing64' has created a pull request for this issue: https://github.com/apache/spark/pull/19652 > Support processing array and map type using script > -- > > Key: SPARK-22435 > URL: https://issues.apache.org/jira/browse/SPARK-22435 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: jin xing >Priority: Critical > > Currently, It is not supported to use script(e.g. python) to process array > type or map type, it will complain with below message: > {{org.apache.spark.sql.catalyst.expressions.UnsafeArrayData cannot be cast to > [Ljava.lang.Object}} > {{org.apache.spark.sql.catalyst.expressions.UnsafeMapData cannot be cast to > java.util.Map}} > Could we support it? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22435) Support processing array and map type using script
[ https://issues.apache.org/jira/browse/SPARK-22435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22435: Assignee: (was: Apache Spark) > Support processing array and map type using script > -- > > Key: SPARK-22435 > URL: https://issues.apache.org/jira/browse/SPARK-22435 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: jin xing >Priority: Critical > > Currently, It is not supported to use script(e.g. python) to process array > type or map type, it will complain with below message: > {{org.apache.spark.sql.catalyst.expressions.UnsafeArrayData cannot be cast to > [Ljava.lang.Object}} > {{org.apache.spark.sql.catalyst.expressions.UnsafeMapData cannot be cast to > java.util.Map}} > Could we support it? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22435) Support processing array and map type using script
jin xing created SPARK-22435: Summary: Support processing array and map type using script Key: SPARK-22435 URL: https://issues.apache.org/jira/browse/SPARK-22435 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: jin xing Priority: Critical Currently, it is not supported to use a script (e.g. Python) to process array or map types; it will complain with the messages below: {{org.apache.spark.sql.catalyst.expressions.UnsafeArrayData cannot be cast to [Ljava.lang.Object}} {{org.apache.spark.sql.catalyst.expressions.UnsafeMapData cannot be cast to java.util.Map}} Could we support it? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22434) Spark structured streaming with HBase
[ https://issues.apache.org/jira/browse/SPARK-22434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22434. --- Resolution: Invalid Fix Version/s: (was: 2.0.3) Questions are for the mailing list > Spark structured streaming with HBase > - > > Key: SPARK-22434 > URL: https://issues.apache.org/jira/browse/SPARK-22434 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 2.1.2 >Reporter: Harendra Singh >Priority: Blocker > Original Estimate: 2m > Remaining Estimate: 2m > > Hi Team, > We are doing streaming on Kafka data which is being collected from MySQL. > Once all the analytics has been done, I want to save my data directly to > HBase. > I have gone through the Spark Structured Streaming documentation but couldn't find any > sink for HBase. > > The code snippet I used to read the data from Kafka is below. > == > val records = spark.readStream.format("kafka").option("subscribe", > "kaapociot").option("kafka.bootstrap.servers", > "XX.XX.XX.XX:6667").option("startingOffsets", "earliest").load > val jsonschema = StructType(Seq(StructField("header", StringType, > true),StructField("event", StringType, true))) > val uschema = StructType(Seq( > StructField("MeterNumber", StringType, true), > StructField("Utility", StringType, true), > StructField("VendorServiceNumber", StringType, true), > StructField("VendorName", StringType, true), > StructField("SiteNumber", StringType, true), > StructField("SiteName", StringType, true), > StructField("Location", StringType, true), > StructField("timestamp", LongType, true), > StructField("power", DoubleType, true) > )) > val DF_Hbase = records.selectExpr("cast (value as string) as > json").select(from_json($"json",schema=jsonschema).as("data")).select("data.event").select(from_json($"event", > uschema).as("mykafkadata")).select("mykafkadata.*") > === > Finally, I want to save the DF_Hbase dataframe in HBase. 
> Please help me out > Thanks > Harendra Singh -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22434) Spark structured streaming with HBase
Harendra Singh created SPARK-22434: -- Summary: Spark structured streaming with HBase Key: SPARK-22434 URL: https://issues.apache.org/jira/browse/SPARK-22434 Project: Spark Issue Type: Task Components: Structured Streaming Affects Versions: 2.1.2 Reporter: Harendra Singh Priority: Blocker Fix For: 2.0.3 Hi Team, We are doing streaming on Kafka data which is being collected from MySQL. Once all the analytics has been done, I want to save my data directly to HBase. I have gone through the Spark Structured Streaming documentation but couldn't find any sink for HBase. The code snippet I used to read the data from Kafka is below. == val records = spark.readStream.format("kafka").option("subscribe", "kaapociot").option("kafka.bootstrap.servers", "XX.XX.XX.XX:6667").option("startingOffsets", "earliest").load val jsonschema = StructType(Seq(StructField("header", StringType, true),StructField("event", StringType, true))) val uschema = StructType(Seq( StructField("MeterNumber", StringType, true), StructField("Utility", StringType, true), StructField("VendorServiceNumber", StringType, true), StructField("VendorName", StringType, true), StructField("SiteNumber", StringType, true), StructField("SiteName", StringType, true), StructField("Location", StringType, true), StructField("timestamp", LongType, true), StructField("power", DoubleType, true) )) val DF_Hbase = records.selectExpr("cast (value as string) as json").select(from_json($"json",schema=jsonschema).as("data")).select("data.event").select(from_json($"event", uschema).as("mykafkadata")).select("mykafkadata.*") === Finally, I want to save the DF_Hbase dataframe in HBase. Please help me out Thanks Harendra Singh -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
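Structured Streaming has no built-in HBase sink; the usual route is a custom writer following the ForeachWriter open/process/close contract. The sketch below illustrates that contract in plain Python with an in-memory store standing in for HBase (all class and field names here are illustrative assumptions, not a real HBase client):

```python
# Illustrative sketch of the ForeachWriter-style contract, with an
# in-memory list standing in for an HBase table.
class InMemorySink:
    """Stands in for an HBase table in this illustration."""
    def __init__(self):
        self.rows = []

class MeterReadingWriter:
    """Mimics open/process/close: called once per partition and epoch."""
    def __init__(self, sink):
        self.sink = sink
        self.buffer = []

    def open(self, partition_id, epoch_id):
        self.buffer = []
        return True  # True = accept writes for this partition

    def process(self, row):
        self.buffer.append(row)  # a real writer would build an HBase Put here

    def close(self, error):
        if error is None:
            self.sink.rows.extend(self.buffer)  # flush the batch on success

sink = InMemorySink()
writer = MeterReadingWriter(sink)
writer.open(partition_id=0, epoch_id=0)
writer.process({"MeterNumber": "M1", "power": 1.5})
writer.close(None)
print(len(sink.rows))  # 1
```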
[jira] [Commented] (SPARK-22420) Spark SQL return invalid json string for struct with date/datetime field
[ https://issues.apache.org/jira/browse/SPARK-22420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237368#comment-16237368 ] Marco Gaido commented on SPARK-22420: - I think this is related and will be resolved by SPARK-20202 > Spark SQL return invalid json string for struct with date/datetime field > > > Key: SPARK-22420 > URL: https://issues.apache.org/jira/browse/SPARK-22420 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: pin_zhang >Priority: Normal > > Run SQL with a JDBC client in Spark HiveServer2 > select named_struct ( 'b',current_timestamp) from test; > +---+--+ > | named_struct(b, current_timestamp()) | > +---+--+ > | {"b":2017-11-01 23:18:40.988} | > The JSON string is invalid; the datetime value should be quoted. > If the SQL is run in Apache HiveServer2, the expected JSON string is returned > select named_struct ( 'b',current_timestamp) from dummy_table ; > +--+--+ > | _c0| > +--+--+ > | {"b":"2017-11-01 23:21:24.168"} | > +--+--+ -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21668) Ability to run driver programs within a container
[ https://issues.apache.org/jira/browse/SPARK-21668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237329#comment-16237329 ] Apache Spark commented on SPARK-21668: -- User 'tashoyan' has created a pull request for this issue: https://github.com/apache/spark/pull/18885 > Ability to run driver programs within a container > - > > Key: SPARK-21668 > URL: https://issues.apache.org/jira/browse/SPARK-21668 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1, 2.2.0 >Reporter: Arseniy Tashoyan >Priority: Minor > Labels: containers, docker, driver, spark-submit, standalone > Original Estimate: 96h > Remaining Estimate: 96h > > When a driver program in Client mode runs in a Docker container, it binds to > the IP address of the container, not the host machine. This container IP > address is accessible only within the host machine; it is inaccessible to > master and worker nodes. > For example, the host machine has IP address 192.168.216.10. When Docker > machine starts a container, it places it in a special bridged network and > assigns it an IP address like 172.17.0.2. All Spark nodes belonging to the > 192.168.216.0 network cannot access the bridged network with the container. > Therefore, the driver program is not able to communicate with the Spark > cluster. > Spark already provides the SPARK_PUBLIC_DNS environment variable for this > purpose. However, in this scenario setting SPARK_PUBLIC_DNS to the host > machine IP address does not work. > Topic on StackOverflow: > [https://stackoverflow.com/questions/45489248/running-spark-driver-program-in-docker-container-no-connection-back-from-execu] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
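A commonly used workaround for the scenario described above (an assumption about the reporter's setup, not the fix proposed in the linked pull request) is to run the driver container on the host network stack, so the driver binds to the host-reachable address instead of the bridge address. The image name, master URL, and jar below are placeholders.

```shell
# Sketch: host networking avoids the 172.17.0.x bridge address entirely, so the
# driver binds to the host's IP (e.g. 192.168.216.10) and executors can reach it.
# "my-driver-image", the master host, and "app.jar" are placeholders.
docker run --network host my-driver-image \
  spark-submit --master spark://192.168.216.1:7077 --deploy-mode client app.jar
```

Host networking trades isolation for reachability; the pull request above aims to make the bridged case work properly instead.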
[jira] [Commented] (SPARK-22429) Streaming checkpointing code does not retry after failure due to NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-22429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237282#comment-16237282 ] Sean Owen commented on SPARK-22429: --- [~tmgstev] it will have to be vs. master. Master and all branches seem fine though: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/ What do you see? > Streaming checkpointing code does not retry after failure due to > NullPointerException > - > > Key: SPARK-22429 > URL: https://issues.apache.org/jira/browse/SPARK-22429 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.3, 2.2.0 >Reporter: Tristan Stevens > > CheckpointWriteHandler has a built in retry mechanism. However > SPARK-14930/SPARK-13693 put in a fix to de-allocate the fs object, yet > initialises it in the wrong place for the while loop, and therefore on > attempt 2 it fails with a NPE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22430) Unknown tag warnings when building R docs with Roxygen 6.0.1
[ https://issues.apache.org/jira/browse/SPARK-22430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237268#comment-16237268 ] Sean Owen commented on SPARK-22430: --- I see it too, because I have Roxygen 6.0.1 locally. [~felixcheung] have you seen this? I wonder if we could all use 6.0.1, but that's separate. Do you know of a fix? > Unknown tag warnings when building R docs with Roxygen 6.0.1 > > > Key: SPARK-22430 > URL: https://issues.apache.org/jira/browse/SPARK-22430 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.3.0 > Environment: Roxygen 6.0.1 >Reporter: Joel Croteau > > When building R docs using create-rd.sh with Roxygen 6.0.1, a large number of > unknown tag warnings are generated: > {noformat} > Warning: @export [schema.R#33]: unknown tag > Warning: @export [schema.R#53]: unknown tag > Warning: @export [schema.R#63]: unknown tag > Warning: @export [schema.R#80]: unknown tag > Warning: @export [schema.R#123]: unknown tag > Warning: @export [schema.R#141]: unknown tag > Warning: @export [schema.R#216]: unknown tag > Warning: @export [generics.R#388]: unknown tag > Warning: @export [generics.R#403]: unknown tag > Warning: @export [generics.R#407]: unknown tag > Warning: @export [generics.R#414]: unknown tag > Warning: @export [generics.R#418]: unknown tag > Warning: @export [generics.R#422]: unknown tag > Warning: @export [generics.R#428]: unknown tag > Warning: @export [generics.R#432]: unknown tag > Warning: @export [generics.R#438]: unknown tag > Warning: @export [generics.R#442]: unknown tag > Warning: @export [generics.R#446]: unknown tag > Warning: @export [generics.R#450]: unknown tag > Warning: @export [generics.R#454]: unknown tag > Warning: @export [generics.R#459]: unknown tag > Warning: @export [generics.R#467]: unknown tag > Warning: @export [generics.R#475]: unknown tag > Warning: @export [generics.R#479]: unknown tag > Warning: @export [generics.R#483]: unknown tag > Warning: @export 
[generics.R#487]: unknown tag > Warning: @export [generics.R#498]: unknown tag > Warning: @export [generics.R#502]: unknown tag > Warning: @export [generics.R#506]: unknown tag > Warning: @export [generics.R#512]: unknown tag > Warning: @export [generics.R#518]: unknown tag > Warning: @export [generics.R#526]: unknown tag > Warning: @export [generics.R#530]: unknown tag > Warning: @export [generics.R#534]: unknown tag > Warning: @export [generics.R#538]: unknown tag > Warning: @export [generics.R#542]: unknown tag > Warning: @export [generics.R#549]: unknown tag > Warning: @export [generics.R#556]: unknown tag > Warning: @export [generics.R#560]: unknown tag > Warning: @export [generics.R#567]: unknown tag > Warning: @export [generics.R#571]: unknown tag > Warning: @export [generics.R#575]: unknown tag > Warning: @export [generics.R#579]: unknown tag > Warning: @export [generics.R#583]: unknown tag > Warning: @export [generics.R#587]: unknown tag > Warning: @export [generics.R#591]: unknown tag > Warning: @export [generics.R#595]: unknown tag > Warning: @export [generics.R#599]: unknown tag > Warning: @export [generics.R#603]: unknown tag > Warning: @export [generics.R#607]: unknown tag > Warning: @export [generics.R#611]: unknown tag > Warning: @export [generics.R#615]: unknown tag > Warning: @export [generics.R#619]: unknown tag > Warning: @export [generics.R#623]: unknown tag > Warning: @export [generics.R#627]: unknown tag > Warning: @export [generics.R#631]: unknown tag > Warning: @export [generics.R#635]: unknown tag > Warning: @export [generics.R#639]: unknown tag > Warning: @export [generics.R#643]: unknown tag > Warning: @export [generics.R#647]: unknown tag > Warning: @export [generics.R#654]: unknown tag > Warning: @export [generics.R#658]: unknown tag > Warning: @export [generics.R#663]: unknown tag > Warning: @export [generics.R#667]: unknown tag > Warning: @export [generics.R#672]: unknown tag > Warning: @export [generics.R#676]: unknown tag > 
Warning: @export [generics.R#680]: unknown tag > Warning: @export [generics.R#684]: unknown tag > Warning: @export [generics.R#690]: unknown tag > Warning: @export [generics.R#696]: unknown tag > Warning: @export [generics.R#702]: unknown tag > Warning: @export [generics.R#706]: unknown tag > Warning: @export [generics.R#710]: unknown tag > Warning: @export [generics.R#716]: unknown tag > Warning: @export [generics.R#720]: unknown tag > Warning: @export [generics.R#726]: unknown tag > Warning: @export [generics.R#730]: unknown tag > Warning: @export [generics.R#734]: unknown tag > Warning: @export [generics.R#738]: unknown tag > Warning: @export [generics.R#742]: unknown tag > Warning: @export [generics.R#750]: unknown tag > Warning: @export [generics.R#754]: unknown tag > Warning: @export
[jira] [Updated] (SPARK-22427) StackOverFlowError when using FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-22427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-22427: -- Description: code part:
val path = jobConfig.getString("hdfspath")
val vectordata = sc.sparkContext.textFile(path)
val finaldata = sc.createDataset(vectordata.map(obj => {
  obj.split(" ")
}).filter(arr => arr.length > 0)).toDF("items")
val fpg = new FPGrowth()
fpg.setMinSupport(minSupport).setItemsCol("items").setMinConfidence(minConfidence)
val train = fpg.fit(finaldata)
print(train.freqItemsets.count())
print(train.associationRules.count())
train.save("/tmp/FPGModel")
And encountered the following exception: Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.RDD.collect(RDD.scala:935) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278) at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2430) at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2429) at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2837) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2836) at org.apache.spark.sql.Dataset.count(Dataset.scala:2429) at DataMining.FPGrowth$.runJob(FPGrowth.scala:116) at DataMining.testFPG$.main(FPGrowth.scala:36) at DataMining.testFPG.main(FPGrowth.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: 
java.lang.StackOverflowError at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:616) at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:36) at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:33) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:29) at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:27) at
[jira] [Commented] (SPARK-22423) Scala test source files like TestHiveSingleton.scala should be in scala source root
[ https://issues.apache.org/jira/browse/SPARK-22423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237176#comment-16237176 ] xubo245 commented on SPARK-22423: - OK, I will fix it. > Scala test source files like TestHiveSingleton.scala should be in scala > source root > --- > > Key: SPARK-22423 > URL: https://issues.apache.org/jira/browse/SPARK-22423 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.2.0 >Reporter: xubo245 >Priority: Minor > > The TestHiveSingleton.scala file should be in scala directory, not in java > directory -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22427) StackOverFlowError when using FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-22427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237174#comment-16237174 ] yuhao yang commented on SPARK-22427: Could you please try increasing the stack size, e.g. with -Xss10m? > StackOverFlowError when using FPGrowth > -- > > Key: SPARK-22427 > URL: https://issues.apache.org/jira/browse/SPARK-22427 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.2.0 > Environment: Centos Linux 3.10.0-327.el7.x86_64 > java 1.8.0.111 > spark 2.2.0 >Reporter: lyt >Priority: Normal > > code part: > val path = jobConfig.getString("hdfspath") > val vectordata = sc.sparkContext.textFile(path) > val finaldata = sc.createDataset(vectordata.map(obj => { > obj.split(" ") > }).filter(arr => arr.length > 0)).toDF("items") > val fpg = new FPGrowth() > > fpg.setMinSupport(minSupport).setItemsCol("items").setMinConfidence(minConfidence) > val train = fpg.fit(finaldata) > print(train.freqItemsets.count()) > print(train.associationRules.count()) > train.save("/tmp/FPGModel") > And encountered following exception: > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at scala.Option.foreach(Option.scala:257) > at > 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.collect(RDD.scala:935) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278) > at > org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2430) > at > org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2429) > at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2837) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2836) > at org.apache.spark.sql.Dataset.count(Dataset.scala:2429) > at DataMining.FPGrowth$.runJob(FPGrowth.scala:116) > at DataMining.testFPG$.main(FPGrowth.scala:36) > at DataMining.testFPG.main(FPGrowth.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119) > at
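The stack-size suggestion in the comment above can be applied through standard Spark JVM options. This is a sketch: 10m is simply the figure mentioned in the comment, not a tuned recommendation, and the application jar name is a placeholder (the class name is taken from the reporter's stack trace).

```shell
# Sketch: raise the JVM thread stack size for executors and (in cluster mode)
# the driver via standard Spark conf keys.
spark-submit \
  --conf "spark.executor.extraJavaOptions=-Xss10m" \
  --conf "spark.driver.extraJavaOptions=-Xss10m" \
  --class DataMining.testFPG application.jar
```

In client deploy mode the driver JVM is already running by the time spark.driver.extraJavaOptions is read, so pass `--driver-java-options "-Xss10m"` to reach the driver instead.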
[jira] [Updated] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
[ https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22211: Target Version/s: 2.2.1, 2.3.0 > LimitPushDown optimization for FullOuterJoin generates wrong results > > > Key: SPARK-22211 > URL: https://issues.apache.org/jira/browse/SPARK-22211 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: on community.cloude.databrick.com > Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11) >Reporter: Benyi Wang >Priority: Major > > LimitPushDown pushes LocalLimit to one side of a FullOuterJoin, but this may > generate a wrong result: > Assume we use limit(1) and LocalLimit is pushed to the left side, where id=999 > is selected, while the right side has 100K rows including 999. The result > will be: > - one row of (999, 999) > - the remaining rows of (null, xxx) > Once you call show(), the row (999, 999) has only a 1/10th chance of being > selected by CollectLimit. > The correct optimization might be to: > - push down the limit > - but also convert the join to a Broadcast LeftOuterJoin or RightOuterJoin. 
> Here is my notebook: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html > {code:java} > import scala.util.Random._ > val dl = shuffle(1 to 10).toDF("id") > val dr = shuffle(1 to 10).toDF("id") > println("data frame dl:") > dl.explain > println("data frame dr:") > dr.explain > val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1) > j.explain > j.show(false) > {code} > {code} > data frame dl: > == Physical Plan == > LocalTableScan [id#10] > data frame dr: > == Physical Plan == > LocalTableScan [id#16] > == Physical Plan == > CollectLimit 1 > +- SortMergeJoin [id#10], [id#16], FullOuter >:- *Sort [id#10 ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#10, 200) >: +- *LocalLimit 1 >:+- LocalTableScan [id#10] >+- *Sort [id#16 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#16, 200) > +- LocalTableScan [id#16] > import scala.util.Random._ > dl: org.apache.spark.sql.DataFrame = [id: int] > dr: org.apache.spark.sql.DataFrame = [id: int] > j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int] > ++---+ > |id |id | > ++---+ > |null|148| > ++---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
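To make the bug concrete outside Spark, here is a small self-contained Scala sketch (plain collections, not Catalyst) of why a limit cannot be pushed below one side of a full outer join: limiting the left input before joining manufactures (null, x) rows that do not exist in the true join, and a later CollectLimit can return one of them.

```scala
// Toy full outer join over Int sequences (illustrative only; not Spark's implementation).
def fullOuterJoin(left: Seq[Int], right: Seq[Int]): Seq[(Option[Int], Option[Int])] = {
  val l = left.toSet
  val r = right.toSet
  left.filter(r).map(x => (Option(x), Option(x))) ++              // matched rows
    left.filterNot(r).map(x => (Option(x), Option.empty[Int])) ++ // left-only rows
    right.filterNot(l).map(x => (Option.empty[Int], Option(x)))   // right-only rows
}

val left  = Seq(1, 2, 3)
val right = Seq(3, 4, 5)

// True join: 3 is matched on both sides, so (null, 3) never appears.
val full = fullOuterJoin(left, right)
assert(full.contains((Option(3), Option(3))))
assert(!full.contains((Option.empty[Int], Option(3))))

// "Pushed-down limit": keep only one left row before joining. Row 3 no longer
// finds its match, so the bogus row (null, 3) appears in the join output,
// and a subsequent limit(1) may pick it.
val pushed = fullOuterJoin(left.take(1), right)
assert(pushed.contains((Option.empty[Int], Option(3))))
```

This mirrors the notebook output above, where limit(1) over the full outer join returned a (null, 148) row instead of a row from the true join.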
[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
[ https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237133#comment-16237133 ] Xiao Li commented on SPARK-22211: - Will submit a PR based on my previous PR https://github.com/apache/spark/pull/10454 > LimitPushDown optimization for FullOuterJoin generates wrong results > > > Key: SPARK-22211 > URL: https://issues.apache.org/jira/browse/SPARK-22211 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: on community.cloude.databrick.com > Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11) >Reporter: Benyi Wang >Priority: Major > > LimitPushDown pushes LocalLimit to one side of a FullOuterJoin, but this may > generate a wrong result: > Assume we use limit(1) and LocalLimit is pushed to the left side, where id=999 > is selected, while the right side has 100K rows including 999. The result > will be: > - one row of (999, 999) > - the remaining rows of (null, xxx) > Once you call show(), the row (999, 999) has only a 1/10th chance of being > selected by CollectLimit. > The correct optimization might be to: > - push down the limit > - but also convert the join to a Broadcast LeftOuterJoin or RightOuterJoin. 
> Here is my notebook: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html > {code:java} > import scala.util.Random._ > val dl = shuffle(1 to 10).toDF("id") > val dr = shuffle(1 to 10).toDF("id") > println("data frame dl:") > dl.explain > println("data frame dr:") > dr.explain > val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1) > j.explain > j.show(false) > {code} > {code} > data frame dl: > == Physical Plan == > LocalTableScan [id#10] > data frame dr: > == Physical Plan == > LocalTableScan [id#16] > == Physical Plan == > CollectLimit 1 > +- SortMergeJoin [id#10], [id#16], FullOuter >:- *Sort [id#10 ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#10, 200) >: +- *LocalLimit 1 >:+- LocalTableScan [id#10] >+- *Sort [id#16 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#16, 200) > +- LocalTableScan [id#16] > import scala.util.Random._ > dl: org.apache.spark.sql.DataFrame = [id: int] > dr: org.apache.spark.sql.DataFrame = [id: int] > j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int] > ++---+ > |id |id | > ++---+ > |null|148| > ++---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org