[jira] [Resolved] (SPARK-22418) Add test cases for NULL Handling

2017-11-03 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22418.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Add test cases for NULL Handling
> 
>
> Key: SPARK-22418
> URL: https://issues.apache.org/jira/browse/SPARK-22418
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Normal
> Fix For: 2.3.0
>
>
> Add test cases mentioned in the link https://sqlite.org/nulls.html
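For illustration only (not the test cases ultimately added for this ticket), here is a minimal sketch of the kind of NULL-handling checks described at https://sqlite.org/nulls.html, written against the Spark SQL Scala API; the table name and sample values are made up:

{code:scala}
// Sketch only: illustrative NULL-handling checks in the spirit of
// https://sqlite.org/nulls.html; not the actual test suite added for this ticket.
import org.apache.spark.sql.SparkSession

object NullHandlingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local").appName("null-handling-sketch").getOrCreate()
    import spark.implicits._

    Seq((1, Option(10)), (2, None), (3, Option(30))).toDF("id", "v").createOrReplaceTempView("t")

    // Ordinary equality never matches NULL, so the NULL row joins to nothing here.
    spark.sql("SELECT count(*) FROM t a JOIN t b ON a.v = b.v").show()

    // The null-safe operator <=> treats two NULLs as equal, so the NULL row matches itself.
    spark.sql("SELECT count(*) FROM t a JOIN t b ON a.v <=> b.v").show()

    // Aggregates such as count(col) and sum(col) ignore NULL values.
    spark.sql("SELECT count(v), sum(v) FROM t").show()

    spark.stop()
  }
}
{code}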






[jira] [Assigned] (SPARK-22418) Add test cases for NULL Handling

2017-11-03 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-22418:
---

Assignee: Marco Gaido

> Add test cases for NULL Handling
> 
>
> Key: SPARK-22418
> URL: https://issues.apache.org/jira/browse/SPARK-22418
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Normal
> Fix For: 2.3.0
>
>
> Add test cases mentioned in the link https://sqlite.org/nulls.html






[jira] [Commented] (SPARK-22436) New function strip() to remove all whitespace from string

2017-11-03 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238712#comment-16238712
 ] 

Eric Maynard commented on SPARK-22436:
--

[~asmaier] Wouldn't the right way to implement this be to use the existing 
function `regexp_replace`?
I'm not sure why this needs to be a built-in function.
Furthermore, if this were implemented, in my opinion it should exist in all APIs, 
not just PySpark.
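For reference, a minimal sketch of the regexp_replace-based approach suggested above, using the Scala API; the column name and sample data are made up for illustration:

{code:scala}
// Sketch of the regexp_replace-based approach suggested above; sample data and
// column names are illustrative only.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.regexp_replace

object StripSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local").appName("strip-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq("  hello\t", "\nworld  ").toDF("s")

    df.select(
      regexp_replace($"s", "^\\s+", "").as("lstrip"),      // leading whitespace removed
      regexp_replace($"s", "\\s+$", "").as("rstrip"),      // trailing whitespace removed
      regexp_replace($"s", "^\\s+|\\s+$", "").as("strip")  // both ends
    ).show(false)

    spark.stop()
  }
}
{code}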

> New function strip() to remove all whitespace from string
> -
>
> Key: SPARK-22436
> URL: https://issues.apache.org/jira/browse/SPARK-22436
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.2.0
>Reporter: Andreas Maier
>Priority: Minor
>  Labels: features
>
> Since SPARK-17299, the [trim() 
> function|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.trim]
>  removes only spaces from the beginning and end of a string, not all whitespace 
> characters. This is correct with regard to the SQL standard, but it leaves a gap 
> in functionality. 
> My suggestion is to add the functions lstrip(), rstrip(), and strip() to the Spark 
> API, in analogy to Python's standard library; they should remove all whitespace 
> characters from the beginning and/or end of a string, respectively. 






[jira] [Updated] (SPARK-22446) Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.

2017-11-03 Thread Greg Bellchambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Bellchambers updated SPARK-22446:
--
Summary: Optimizer causing StringIndexerModel's indexer UDF to throw 
"Unseen label" exception incorrectly for filtered data.  (was: Optimizer 
causing StringIndexer's indexer UDF to throw "Unseen label" exception 
incorrectly for filtered data.)

> Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" 
> exception incorrectly for filtered data.
> ---
>
> Key: SPARK-22446
> URL: https://issues.apache.org/jira/browse/SPARK-22446
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Optimizer
>Affects Versions: 2.0.0, 2.2.0
> Environment: spark-shell, local mode, macOS Sierra 10.12.6
>Reporter: Greg Bellchambers
>Priority: Normal
>
> In the following, the `indexer` UDF defined inside the 
> `org.apache.spark.ml.feature.StringIndexerModel.transform` method throws an 
> "Unseen label" error, despite the label not being present in the transformed 
> DataFrame.
> Here is the definition of the indexer UDF in the transform method:
> {code:java}
> val indexer = udf { label: String =>
>   if (labelToIndex.contains(label)) {
> labelToIndex(label)
>   } else {
> throw new SparkException(s"Unseen label: $label.")
>   }
> }
> {code}
> We can demonstrate the error with a very simple example DataFrame.
> {code:java}
> scala> import org.apache.spark.ml.feature.StringIndexer
> import org.apache.spark.ml.feature.StringIndexer
> scala> // first we create a DataFrame with three cities
> scala> val df = List(
>  | ("A", "London", "StrA"),
>  | ("B", "Bristol", null),
>  | ("C", "New York", "StrC")
>  | ).toDF("ID", "CITY", "CONTENT")
> df: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 1 more 
> field]
> scala> df.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  B| Bristol|   null|
> |  C|New York|   StrC|
> +---+--------+-------+
> scala> // then we remove the row with null in CONTENT column, which removes 
> Bristol
> scala> val dfNoBristol = df.filter($"CONTENT".isNotNull)
> dfNoBristol: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: 
> string, CITY: string ... 1 more field]
> scala> dfNoBristol.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  C|New York|   StrC|
> +---+--------+-------+
> scala> // now create a StringIndexer for the CITY column and fit to 
> dfNoBristol
> scala> val model = {
>  | new StringIndexer()
>  | .setInputCol("CITY")
>  | .setOutputCol("CITYIndexed")
>  | .fit(dfNoBristol)
>  | }
> model: org.apache.spark.ml.feature.StringIndexerModel = strIdx_f5afa2fb
> scala> // the StringIndexerModel has only two labels: "London" and "New York"
> scala> model.labels foreach println
> London
> New York
> scala> // transform our DataFrame to add an index column
> scala> val dfWithIndex = model.transform(dfNoBristol)
> dfWithIndex: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 2 
> more fields]
> scala> dfWithIndex.show
> +---+--------+-------+-----------+
> | ID|    CITY|CONTENT|CITYIndexed|
> +---+--------+-------+-----------+
> |  A|  London|   StrA|        0.0|
> |  C|New York|   StrC|        1.0|
> +---+--------+-------+-----------+
> {code}
> The unexpected behaviour comes when we filter `dfWithIndex` for `CITYIndexed` 
> equal to 1.0 and perform an action. The `indexer` UDF in the `transform` method 
> throws an exception reporting the unseen label "Bristol". This is counterintuitive 
> behaviour as far as the user of the API is concerned, because there is no 
> such value as "Bristol" when we show all rows of `dfWithIndex`:
> {code:java}
> scala> dfWithIndex.filter($"CITYIndexed" === 1.0).count
> 17/11/04 00:33:41 ERROR Executor: Exception in task 1.0 in stage 20.0 (TID 40)
> org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$5: (string) => double)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> 

[jira] [Updated] (SPARK-22446) Optimizer causing StringIndexer's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.

2017-11-03 Thread Greg Bellchambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Bellchambers updated SPARK-22446:
--
Description: 
In the following, the `indexer` UDF defined inside the 
`org.apache.spark.ml.feature.StringIndexerModel.transform` method throws an 
"Unseen label" error, despite the label not being present in the transformed 
DataFrame.

Here is the definition of the indexer UDF in the transform method:

{code:java}
val indexer = udf { label: String =>
  if (labelToIndex.contains(label)) {
labelToIndex(label)
  } else {
throw new SparkException(s"Unseen label: $label.")
  }
}
{code}

We can demonstrate the error with a very simple example DataFrame.

{code:java}
scala> import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.StringIndexer

scala> // first we create a DataFrame with three cities

scala> val df = List(
 | ("A", "London", "StrA"),
 | ("B", "Bristol", null),
 | ("C", "New York", "StrC")
 | ).toDF("ID", "CITY", "CONTENT")
df: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 1 more field]

scala> df.show
+---+--------+-------+
| ID|    CITY|CONTENT|
+---+--------+-------+
|  A|  London|   StrA|
|  B| Bristol|   null|
|  C|New York|   StrC|
+---+--------+-------+


scala> // then we remove the row with null in CONTENT column, which removes 
Bristol

scala> val dfNoBristol = df.filter($"CONTENT".isNotNull)
dfNoBristol: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: 
string, CITY: string ... 1 more field]

scala> dfNoBristol.show
+---+--------+-------+
| ID|    CITY|CONTENT|
+---+--------+-------+
|  A|  London|   StrA|
|  C|New York|   StrC|
+---+--------+-------+


scala> // now create a StringIndexer for the CITY column and fit to dfNoBristol

scala> val model = {
 | new StringIndexer()
 | .setInputCol("CITY")
 | .setOutputCol("CITYIndexed")
 | .fit(dfNoBristol)
 | }
model: org.apache.spark.ml.feature.StringIndexerModel = strIdx_f5afa2fb

scala> // the StringIndexerModel has only two labels: "London" and "New York"

scala> model.labels foreach println
London
New York

scala> // transform our DataFrame to add an index column

scala> val dfWithIndex = model.transform(dfNoBristol)
dfWithIndex: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 2 
more fields]

scala> dfWithIndex.show
+---+--------+-------+-----------+
| ID|    CITY|CONTENT|CITYIndexed|
+---+--------+-------+-----------+
|  A|  London|   StrA|        0.0|
|  C|New York|   StrC|        1.0|
+---+--------+-------+-----------+
{code}

The unexpected behaviour comes when we filter `dfWithIndex` for `CITYIndexed` 
equal to 1.0 and perform an action. The `indexer` UDF in the `transform` method 
throws an exception reporting the unseen label "Bristol". This is counterintuitive 
behaviour as far as the user of the API is concerned, because there is no such 
value as "Bristol" when we show all rows of `dfWithIndex`:

{code:java}
scala> dfWithIndex.filter($"CITYIndexed" === 1.0).count
17/11/04 00:33:41 ERROR Executor: Exception in task 1.0 in stage 20.0 (TID 40)
org.apache.spark.SparkException: Failed to execute user defined 
function($anonfun$5: (string) => double)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Unseen label: Bristol.  To handle 
unseen labels, set Param handleInvalid to keep.
at 
org.apache.spark.ml.feature.StringIndexerModel$$anonfun$5.apply(StringIndexer.scala:222)
at 
org.apache.spark.ml.feature.StringIndexerModel$$anonfun$5.apply(StringIndexer.scala:208)
... 13 more
{code}

To understand what is happening here, note that an action is triggered when we 
call 
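Not a fix for the optimizer behaviour itself, but a minimal sketch of the workaround hinted at in the exception message, continuing the spark-shell session above and assuming a Spark version whose StringIndexer supports handleInvalid values other than "error":

{code:scala}
// Sketch of the workaround hinted at in the exception message; assumes a Spark
// version where StringIndexer supports handleInvalid = "skip" (or "keep").
val tolerantModel = new StringIndexer()
  .setInputCol("CITY")
  .setOutputCol("CITYIndexed")
  .setHandleInvalid("skip")   // or "keep"; avoids the "Unseen label" exception
  .fit(dfNoBristol)

tolerantModel.transform(dfNoBristol).filter($"CITYIndexed" === 1.0).count
{code}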

[jira] [Created] (SPARK-22446) Optimizer causing StringIndexer's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.

2017-11-03 Thread Greg Bellchambers (JIRA)
Greg Bellchambers created SPARK-22446:
-

 Summary: Optimizer causing StringIndexer's indexer UDF to throw 
"Unseen label" exception incorrectly for filtered data.
 Key: SPARK-22446
 URL: https://issues.apache.org/jira/browse/SPARK-22446
 Project: Spark
  Issue Type: Bug
  Components: ML, Optimizer
Affects Versions: 2.2.0, 2.0.0
 Environment: spark-shell, local mode, macOS Sierra 10.12.6
Reporter: Greg Bellchambers
Priority: Normal


In the following, the `indexer` UDF defined inside the 
`org.apache.spark.ml.feature.StringIndexerModel.transform` method throws an 
"Unseen label" error, despite the label not being present in the transformed 
DataFrame.

Here is the definition of the indexer UDF in the transform method:

{code:scala}
val indexer = udf { label: String =>
  if (labelToIndex.contains(label)) {
labelToIndex(label)
  } else {
throw new SparkException(s"Unseen label: $label.")
  }
}
{code}

We can demonstrate the error with a very simple example DataFrame.

{code:scala}
scala> import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.StringIndexer

scala> // first we create a DataFrame with three cities

scala> val df = List(
 | ("A", "London", "StrA"),
 | ("B", "Bristol", null),
 | ("C", "New York", "StrC")
 | ).toDF("ID", "CITY", "CONTENT")
df: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 1 more field]

scala> df.show
+---+--------+-------+
| ID|    CITY|CONTENT|
+---+--------+-------+
|  A|  London|   StrA|
|  B| Bristol|   null|
|  C|New York|   StrC|
+---+--------+-------+


scala> // then we remove the row with null in CONTENT column, which removes 
Bristol

scala> val dfNoBristol = df.filter($"CONTENT".isNotNull)
dfNoBristol: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: 
string, CITY: string ... 1 more field]

scala> dfNoBristol.show
+---+--------+-------+
| ID|    CITY|CONTENT|
+---+--------+-------+
|  A|  London|   StrA|
|  C|New York|   StrC|
+---+--------+-------+


scala> // now create a StringIndexer for the CITY column and fit to dfNoBristol

scala> val model = {
 | new StringIndexer()
 | .setInputCol("CITY")
 | .setOutputCol("CITYIndexed")
 | .fit(dfNoBristol)
 | }
model: org.apache.spark.ml.feature.StringIndexerModel = strIdx_f5afa2fb

scala> // the StringIndexerModel has only two labels: "London" and "New York"

scala> model.labels foreach println
London
New York

scala> // transform our DataFrame to add an index column

scala> val dfWithIndex = model.transform(dfNoBristol)
dfWithIndex: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 2 
more fields]

scala> dfWithIndex.show
+---+--------+-------+-----------+
| ID|    CITY|CONTENT|CITYIndexed|
+---+--------+-------+-----------+
|  A|  London|   StrA|        0.0|
|  C|New York|   StrC|        1.0|
+---+--------+-------+-----------+
{code}

The unexpected behaviour comes when we filter `dfWithIndex` for `CITYIndexed` 
equal to 1.0 and perform an action. The `indexer` UDF in the `transform` method 
throws an exception reporting the unseen label "Bristol". This is counterintuitive 
behaviour as far as the user of the API is concerned, because there is no such 
value as "Bristol" when we show all rows of `dfWithIndex`:

{code:scala}
scala> dfWithIndex.filter($"CITYIndexed" === 1.0).count
17/11/04 00:33:41 ERROR Executor: Exception in task 1.0 in stage 20.0 (TID 40)
org.apache.spark.SparkException: Failed to execute user defined 
function($anonfun$5: (string) => double)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Unseen label: Bristol.  To handle 
unseen 

[jira] [Assigned] (SPARK-22445) move CodegenContext.copyResult to CodegenSupport

2017-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22445:


Assignee: Wenchen Fan  (was: Apache Spark)

> move CodegenContext.copyResult to CodegenSupport
> 
>
> Key: SPARK-22445
> URL: https://issues.apache.org/jira/browse/SPARK-22445
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Assigned] (SPARK-22445) move CodegenContext.copyResult to CodegenSupport

2017-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22445:


Assignee: Apache Spark  (was: Wenchen Fan)

> move CodegenContext.copyResult to CodegenSupport
> 
>
> Key: SPARK-22445
> URL: https://issues.apache.org/jira/browse/SPARK-22445
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Comment Edited] (SPARK-22441) JDBC REAL type is mapped to Double instead of Float

2017-11-03 Thread Tor Myklebust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238628#comment-16238628
 ] 

Tor Myklebust edited comment on SPARK-22441 at 11/4/17 12:28 AM:
-

I wrote this code a while ago, but I don't think REAL -> double was an accident.

The JDBC spec wasn't the reference for this implementation since a number of 
things it says about datatypes contradict the behaviour of actual databases.  
Which is to say that some database used REAL for 64-bit floats.

Postgres and H2 document that a REAL is a 32-bit float.  MySQL docs say that 
"MySQL also treats REAL as a synonym for DOUBLE PRECISION."  I haven't got a 
quick way to check, but I'd speculate that MySQL's JDBC connector advertises a 
MySQL REAL column as being of JDBC REAL type.

Re SQL FLOAT and DOUBLE types:

- H2 uses FLOAT and DOUBLE interchangeably according to its documentation.
- "PostgreSQL also supports the SQL-standard notations float and float(p) for 
specifying inexact numeric types. Here, p specifies the minimum acceptable 
precision in binary digits. PostgreSQL accepts float(1) to float(24) as 
selecting the real type, while float(25) to float(53) select double precision. 
Values of p outside the allowed range draw an error. float with no precision 
specified is taken to mean double precision."
- According to its docs, MySQL does something similar to Postgres for FLOAT.

Depending on how this is exposed via JDBC, perhaps JDBC FLOAT should map to 
double as well.


was (Author: tmyklebu):
I wrote this code a while ago, but I don't think REAL -> double was an accident.

The JDBC spec wasn't the reference for this implementation since a number of 
things it says about datatypes contradict the behaviour of actual databases.  
Which is to say that some database used REAL for 64-bit floats.

Postgres and H2 document that a REAL is a 32-bit float.  MySQL docs say that 
"MySQL also treats REAL as a synonym for DOUBLE PRECISION."  I haven't got a 
quick way to check, but I'd speculate that MySQL's JDBC connector advertises a 
MySQL REAL column as being of JDBC REAL type.

> JDBC REAL type is mapped to Double instead of Float
> ---
>
> Key: SPARK-22441
> URL: https://issues.apache.org/jira/browse/SPARK-22441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hongbo
>Priority: Minor
>
> In [JDBC 
> Specification|http://download.oracle.com/otn-pub/jcp/jdbc-4_1-mrel-eval-spec/jdbc4.1-fr-spec.pdf],
>  REAL should be mapped to Float. 
> But now, it's mapped to Double:
> [https://github.com/apache/spark/blob/bc7ca9786e162e33f29d57c4aacb830761b97221/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L220]
> Should it be changed according to the specification?






[jira] [Commented] (SPARK-22445) move CodegenContext.copyResult to CodegenSupport

2017-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238635#comment-16238635
 ] 

Apache Spark commented on SPARK-22445:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/19656

> move CodegenContext.copyResult to CodegenSupport
> 
>
> Key: SPARK-22445
> URL: https://issues.apache.org/jira/browse/SPARK-22445
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Commented] (SPARK-22441) JDBC REAL type is mapped to Double instead of Float

2017-11-03 Thread Tor Myklebust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238628#comment-16238628
 ] 

Tor Myklebust commented on SPARK-22441:
---

I wrote this code a while ago, but I don't think REAL -> double was an accident.

The JDBC spec wasn't the reference for this implementation since a number of 
things it says about datatypes contradict the behaviour of actual databases.  
Which is to say that some database used REAL for 64-bit floats.

Postgres and H2 document that a REAL is a 32-bit float.  MySQL docs say that 
"MySQL also treats REAL as a synonym for DOUBLE PRECISION."  I haven't got a 
quick way to check, but I'd speculate that MySQL's JDBC connector advertises a 
MySQL REAL column as being of JDBC REAL type.

> JDBC REAL type is mapped to Double instead of Float
> ---
>
> Key: SPARK-22441
> URL: https://issues.apache.org/jira/browse/SPARK-22441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hongbo
>Priority: Minor
>
> In [JDBC 
> Specification|http://download.oracle.com/otn-pub/jcp/jdbc-4_1-mrel-eval-spec/jdbc4.1-fr-spec.pdf],
>  REAL should be mapped to Float. 
> But now, it's mapped to Double:
> [https://github.com/apache/spark/blob/bc7ca9786e162e33f29d57c4aacb830761b97221/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L220]
> Should it be changed according to the specification?






[jira] [Created] (SPARK-22445) move CodegenContext.copyResult to CodegenSupport

2017-11-03 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-22445:
---

 Summary: move CodegenContext.copyResult to CodegenSupport
 Key: SPARK-22445
 URL: https://issues.apache.org/jira/browse/SPARK-22445
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan
Priority: Major









[jira] [Commented] (SPARK-22444) Spark History Server missing /environment endpoint/api

2017-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238615#comment-16238615
 ] 

Apache Spark commented on SPARK-22444:
--

User 'ambud' has created a pull request for this issue:
https://github.com/apache/spark/pull/19655

> Spark History Server missing /environment endpoint/api
> --
>
> Key: SPARK-22444
> URL: https://issues.apache.org/jira/browse/SPARK-22444
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.6.3, 1.6.4
>Reporter: Ambud Sharma
>
> The Spark History Server REST API is missing the /environment endpoint. This 
> endpoint is available in Spark 2.x; however, it's missing from 1.6.x.
> Since the environment endpoint provides programmatic access to critical 
> launch parameters, it would be great to have this backported.
> This feature was contributed under: 
> https://issues.apache.org/jira/browse/SPARK-16122






[jira] [Resolved] (SPARK-22444) Spark History Server missing /environment endpoint/api

2017-11-03 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-22444.

   Resolution: Won't Fix
Fix Version/s: (was: 1.6.4)

Features like this are not generally added to older releases.

> Spark History Server missing /environment endpoint/api
> --
>
> Key: SPARK-22444
> URL: https://issues.apache.org/jira/browse/SPARK-22444
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.6.3, 1.6.4
>Reporter: Ambud Sharma
>
> The Spark History Server REST API is missing the /environment endpoint. This 
> endpoint is available in Spark 2.x; however, it's missing from 1.6.x.
> Since the environment endpoint provides programmatic access to critical 
> launch parameters, it would be great to have this backported.
> This feature was contributed under: 
> https://issues.apache.org/jira/browse/SPARK-16122






[jira] [Created] (SPARK-22444) Spark History Server missing /environment endpoint/api

2017-11-03 Thread Ambud Sharma (JIRA)
Ambud Sharma created SPARK-22444:


 Summary: Spark History Server missing /environment endpoint/api
 Key: SPARK-22444
 URL: https://issues.apache.org/jira/browse/SPARK-22444
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.6.3, 1.6.4
Reporter: Ambud Sharma
 Fix For: 1.6.4


The Spark History Server REST API is missing the /environment endpoint. This 
endpoint is available in Spark 2.x; however, it's missing from 1.6.x.

Since the environment endpoint provides programmatic access to critical launch 
parameters, it would be great to have this backported.

This feature was contributed under: 
https://issues.apache.org/jira/browse/SPARK-16122






[jira] [Commented] (SPARK-22443) AggregatedDialect doesn't override quoteIdentifier and other methods in JdbcDialects

2017-11-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238482#comment-16238482
 ] 

Sean Owen commented on SPARK-22443:
---

Good catch. I suppose that this, getTableExistsQuery, and getSchemaQuery need 
to return the value from the first dialect?
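A rough sketch (not the actual patch) of what that delegation could look like, assuming the aggregated dialect keeps its list of underlying dialects:

{code:scala}
// Rough sketch only, not the actual fix: delegate the methods discussed above
// to the first underlying dialect.
import org.apache.spark.sql.jdbc.JdbcDialect

class DelegatingAggregatedDialect(dialects: List[JdbcDialect]) extends JdbcDialect {
  require(dialects.nonEmpty, "at least one dialect is required")

  override def canHandle(url: String): Boolean = dialects.forall(_.canHandle(url))

  override def quoteIdentifier(colName: String): String =
    dialects.head.quoteIdentifier(colName)

  override def getTableExistsQuery(table: String): String =
    dialects.head.getTableExistsQuery(table)

  override def getSchemaQuery(table: String): String =
    dialects.head.getSchemaQuery(table)
}
{code}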

> AggregatedDialect doesn't override quoteIdentifier and other methods in 
> JdbcDialects
> 
>
> Key: SPARK-22443
> URL: https://issues.apache.org/jira/browse/SPARK-22443
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hongbo
>Priority: Normal
>
> The AggregatedDialect only implements canHandle, getCatalystType, 
> getJDBCType. It doesn't implement other methods in JdbcDialect. 
> So if multiple Dialects are registered with the same driver, the 
> implementation of these methods will not be taken and the default 
> implementation in JdbcDialect will be used.
> Example:
> {code:java}
> package example
> import java.util.Properties
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
> import org.apache.spark.sql.types.{DataType, MetadataBuilder}
> object AnotherMySQLDialect extends JdbcDialect {
>   override def canHandle(url : String): Boolean = url.startsWith("jdbc:mysql")
>   override def getCatalystType(
> sqlType: Int, typeName: String, size: Int, 
> md: MetadataBuilder): Option[DataType] = {
> None
>   }
>   override def quoteIdentifier(colName: String): String = {
> s"`$colName`"
>   }
> }
> object App {
>   def main(args: Array[String]) {
> val spark = SparkSession.builder.master("local").appName("Simple 
> Application").getOrCreate()
> JdbcDialects.registerDialect(AnotherMySQLDialect)
> val jdbcUrl = s"jdbc:mysql://host:port/db?user=user&password=password"
> spark.read.jdbc(jdbcUrl, "badge", new Properties()).show()
>   }
> }
> {code}
> will throw an exception. 
> {code:none}
> 17/11/03 17:08:39 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.sql.SQLDataException: Cannot determine value type from string 'id'
>   at 
> com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:530)
>   at 
> com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:513)
>   at 
> com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:505)
>   at 
> com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:479)
>   at 
> com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:489)
>   at 
> com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:89)
>   at 
> com.mysql.cj.jdbc.result.ResultSetImpl.getLong(ResultSetImpl.java:853)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:409)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at 

[jira] [Updated] (SPARK-22443) AggregatedDialect doesn't override quoteIdentifier and other methods in JdbcDialects

2017-11-03 Thread Hongbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hongbo updated SPARK-22443:
---
Summary: AggregatedDialect doesn't override quoteIdentifier and other 
methods in JdbcDialects  (was: AggregatedDialect doesn't work for 
quoteIdentifier)

> AggregatedDialect doesn't override quoteIdentifier and other methods in 
> JdbcDialects
> 
>
> Key: SPARK-22443
> URL: https://issues.apache.org/jira/browse/SPARK-22443
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hongbo
>Priority: Normal
>
> The AggregatedDialect only implements canHandle, getCatalystType, 
> getJDBCType. It doesn't implement other methods in JdbcDialect. 
> So if multiple Dialects are registered with the same driver, the 
> implementation of these methods will not be taken and the default 
> implementation in JdbcDialect will be used.
> Example:
> {code:java}
> package example
> import java.util.Properties
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
> import org.apache.spark.sql.types.{DataType, MetadataBuilder}
> object AnotherMySQLDialect extends JdbcDialect {
>   override def canHandle(url : String): Boolean = url.startsWith("jdbc:mysql")
>   override def getCatalystType(
> sqlType: Int, typeName: String, size: Int, 
> md: MetadataBuilder): Option[DataType] = {
> None
>   }
>   override def quoteIdentifier(colName: String): String = {
> s"`$colName`"
>   }
> }
> object App {
>   def main(args: Array[String]) {
> val spark = SparkSession.builder.master("local").appName("Simple 
> Application").getOrCreate()
> JdbcDialects.registerDialect(AnotherMySQLDialect)
> val jdbcUrl = s"jdbc:mysql://host:port/db?user=user&password=password"
> spark.read.jdbc(jdbcUrl, "badge", new Properties()).show()
>   }
> }
> {code}
> will throw an exception. 
> {code:none}
> 17/11/03 17:08:39 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.sql.SQLDataException: Cannot determine value type from string 'id'
>   at 
> com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:530)
>   at 
> com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:513)
>   at 
> com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:505)
>   at 
> com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:479)
>   at 
> com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:489)
>   at 
> com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:89)
>   at 
> com.mysql.cj.jdbc.result.ResultSetImpl.getLong(ResultSetImpl.java:853)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:409)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at 

[jira] [Created] (SPARK-22443) AggregatedDialect doesn't work for quoteIdentifier

2017-11-03 Thread Hongbo (JIRA)
Hongbo created SPARK-22443:
--

 Summary: AggregatedDialect doesn't work for quoteIdentifier
 Key: SPARK-22443
 URL: https://issues.apache.org/jira/browse/SPARK-22443
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Hongbo
Priority: Normal


AggregatedDialect only implements canHandle, getCatalystType, and getJDBCType; 
it doesn't override the other methods in JdbcDialect. 
So if multiple dialects are registered for the same driver, their implementations 
of these methods will not be picked up and the default implementations in 
JdbcDialect will be used instead.

Example:

{code:java}
package example

import java.util.Properties

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, MetadataBuilder}

object AnotherMySQLDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    None
  }

  override def quoteIdentifier(colName: String): String = {
    s"`$colName`"
  }
}

object App {
  def main(args: Array[String]) {
    val spark = SparkSession.builder.master("local")
      .appName("Simple Application").getOrCreate()
    JdbcDialects.registerDialect(AnotherMySQLDialect)
    val jdbcUrl = s"jdbc:mysql://host:port/db?user=user&password=password"
    spark.read.jdbc(jdbcUrl, "badge", new Properties()).show()
  }
}
{code}

will throw an exception. 

{code:none}
17/11/03 17:08:39 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.sql.SQLDataException: Cannot determine value type from string 'id'
at 
com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:530)
at 
com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:513)
at 
com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:505)
at 
com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:479)
at 
com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:489)
at 
com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:89)
at 
com.mysql.cj.jdbc.result.ResultSetImpl.getLong(ResultSetImpl.java:853)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:409)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:408)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.mysql.cj.core.exceptions.DataConversionException: Cannot 
determine value type from string 'id'
at 
com.mysql.cj.core.io.StringConverter.createFromBytes(StringConverter.java:121)
at 

[jira] [Commented] (SPARK-14703) Spark uses SLF4J, but actually relies quite heavily on Log4J

2017-11-03 Thread Harry Weppner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238409#comment-16238409
 ] 

Harry Weppner commented on SPARK-14703:
---

[~happy15sheng] I've discovered that {{log4j-over-slf4j}} cannot be used as a 
drop-in replacement in all cases. The Hive server, for example, calls {{log4j}} 
components that are not part of the bridge.


{code}
Exception in thread "main" java.lang.NoSuchMethodError: 
org.apache.hive.service.cli.operation.LogDivertAppender.setWriter(Ljava/io/Writer;)V
 
at 
org.apache.hive.service.cli.operation.LogDivertAppender.(LogDivertAppender.java:166)
 
at 
org.apache.hive.service.cli.operation.OperationManager.initOperationLogCapture(OperationManager.java:85)
at 
org.apache.hive.service.cli.operation.OperationManager.init(OperationManager.java:63)
 
at 
org.apache.spark.sql.hive.thriftserver.ReflectedCompositeService$$anonfun$initCompositeService$1.apply(SparkSQLCLIService.scala:79)
at 
org.apache.spark.sql.hive.thriftserver.ReflectedCompositeService$$anonfun$initCompositeService$1.apply(SparkSQLCLIService.scala:79)
..
{code}

In my case, I've worked around it by setting 
{{hive.server2.logging.operation.enabled}} to {{false}}, which avoids these 
code paths altogether, at the expense of having no operation logs for the Hive 
server, of course.

> Spark uses SLF4J, but actually relies quite heavily on Log4J
> 
>
> Key: SPARK-14703
> URL: https://issues.apache.org/jira/browse/SPARK-14703
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0
> Environment: 1.6.0-cdh5.7.0, logback 1.1.3, yarn
>Reporter: Matthew Byng-Maddick
>Priority: Minor
>  Labels: log4j, logback, logging, slf4j
> Attachments: spark-logback.patch
>
>
> We've built a version of Hadoop CDH-5.7.0 in house with logback as the SLF4J 
> provider, in order to send hadoop logs straight to logstash (to handle with 
> logstash/elasticsearch), on top of our existing use of the logback backend.
> In trying to start spark-shell I discovered several points where the fact 
> that we weren't quite using a real L4J caused the sc not to be created or the 
> YARN module not to exist. There are many more places where we should probably 
> be wrapping the logging more sensibly, but I have a basic patch that fixes 
> some of the worst offenders (at least the ones that stop the sparkContext 
> being created properly).
> I'm prepared to accept that this is not a good solution and there probably 
> needs to be some sort of better wrapper, perhaps in the Logging.scala class 
> which handles this properly.






[jira] [Updated] (SPARK-22442) Schema generated by Product Encoder doesn't match case class field name when using non-standard characters

2017-11-03 Thread Mikel San Vicente (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikel San Vicente updated SPARK-22442:
--
Description: 
Product encoder encodes special characters wrongly when field name contains 
certain nonstandard characters.

For example for:
{quote}
case class MyType(`field.1`: String, `field 2`: String)
{quote}
we will get the following schema

{quote}
root
 |-- field$u002E1: string (nullable = true)
 |-- field$u00202: string (nullable = true)
{quote}



  was:
Product encoder encodes special characters wrongly when field name contains 
certain nonstandard characters.

For example for:


{{
case class MyType(`field.1`: String, `field 2`: String)
}}
we will get the following schema

{{
root
 |-- field$u002E1: string (nullable = true)
 |-- field$u00202: string (nullable = true)
}}




> Schema generated by Product Encoder doesn't match case class field name when 
> using non-standard characters
> --
>
> Key: SPARK-22442
> URL: https://issues.apache.org/jira/browse/SPARK-22442
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.0
>Reporter: Mikel San Vicente
>Priority: Normal
>
> Product encoder encodes special characters wrongly when field name contains 
> certain nonstandard characters.
> For example for:
> {quote}
> case class MyType(`field.1`: String, `field 2`: String)
> {quote}
> we will get the following schema
> {quote}
> root
>  |-- field$u002E1: string (nullable = true)
>  |-- field$u00202: string (nullable = true)
> {quote}






[jira] [Updated] (SPARK-22442) Schema generated by Product Encoder doesn't match case class field name when using non-standard characters

2017-11-03 Thread Mikel San Vicente (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikel San Vicente updated SPARK-22442:
--
Description: 
Product encoder encodes special characters wrongly when field name contains 
certain nonstandard characters.

For example for:


{{case class MyType(`field.1`: String, `field 2`: String)
}}
we will get the following schema

{{root
 |-- field$u002E1: string (nullable = true)
 |-- field$u00202: string (nullable = true)}}



  was:
Product encoder encodes special characters wrongly when field name contains 
certain nonstandard characters.

For example for:

```
case class MyType(`field.1`: String, `field 2`: String)
```
we will get the following schema
```
root
 |-- field$u002E1: string (nullable = true)
 |-- field$u00202: string (nullable = true)
```



> Schema generated by Product Encoder doesn't match case class field name when 
> using non-standard characters
> --
>
> Key: SPARK-22442
> URL: https://issues.apache.org/jira/browse/SPARK-22442
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.0
>Reporter: Mikel San Vicente
>Priority: Normal
>
> Product encoder encodes special characters wrongly when field name contains 
> certain nonstandard characters.
> For example for:
> {{case class MyType(`field.1`: String, `field 2`: String)
> }}
> we will get the following schema
> {{root
>  |-- field$u002E1: string (nullable = true)
>  |-- field$u00202: string (nullable = true)}}






[jira] [Updated] (SPARK-22442) Schema generated by Product Encoder doesn't match case class field name when using non-standard characters

2017-11-03 Thread Mikel San Vicente (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikel San Vicente updated SPARK-22442:
--
Description: 
Product encoder encodes special characters wrongly when field name contains 
certain nonstandard characters.

For example for:


{{
case class MyType(`field.1`: String, `field 2`: String)
}}
we will get the following schema

{{
root
 |-- field$u002E1: string (nullable = true)
 |-- field$u00202: string (nullable = true)
}}



  was:
Product encoder encodes special characters wrongly when field name contains 
certain nonstandard characters.

For example for:


{{case class MyType(`field.1`: String, `field 2`: String)
}}
we will get the following schema

{{root
 |-- field$u002E1: string (nullable = true)
 |-- field$u00202: string (nullable = true)}}




> Schema generated by Product Encoder doesn't match case class field name when 
> using non-standard characters
> --
>
> Key: SPARK-22442
> URL: https://issues.apache.org/jira/browse/SPARK-22442
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.0
>Reporter: Mikel San Vicente
>Priority: Normal
>
> Product encoder encodes special characters wrongly when field name contains 
> certain nonstandard characters.
> For example for:
> {{
> case class MyType(`field.1`: String, `field 2`: String)
> }}
> we will get the following schema
> {{
> root
>  |-- field$u002E1: string (nullable = true)
>  |-- field$u00202: string (nullable = true)
> }}






[jira] [Created] (SPARK-22442) Schema generated by Product Encoder doesn't match case class field name when using non-standard characters

2017-11-03 Thread Mikel San Vicente (JIRA)
Mikel San Vicente created SPARK-22442:
-

 Summary: Schema generated by Product Encoder doesn't match case 
class field name when using non-standard characters
 Key: SPARK-22442
 URL: https://issues.apache.org/jira/browse/SPARK-22442
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0, 2.1.2, 2.0.2
Reporter: Mikel San Vicente
Priority: Normal


Product encoder encodes special characters wrongly when field name contains 
certain nonstandard characters.

For example for:

```
case class MyType(`field.1`: String, `field 2`: String)
```
we will get the following schema
```
root
 |-- field$u002E1: string (nullable = true)
 |-- field$u00202: string (nullable = true)
```
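A hypothetical user-side workaround sketch (not a fix for the encoder itself): rename the encoded columns back to the intended field names after the Dataset is created. This assumes a spark-shell session where `spark.implicits._` is in scope:

{code:scala}
// Hypothetical workaround sketch, not a fix for the encoder: rename the encoded
// columns back to the intended field names (assumes spark-shell implicits).
case class MyType(`field.1`: String, `field 2`: String)

val ds = Seq(MyType("a", "b")).toDS()
ds.printSchema()                        // shows field$u002E1 / field$u00202
val renamed = ds.toDF("field.1", "field 2")
renamed.printSchema()                   // shows field.1 / field 2
{code}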







[jira] [Commented] (SPARK-22441) JDBC REAL type is mapped to Double instead of Float

2017-11-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238313#comment-16238313
 ] 

Sean Owen commented on SPARK-22441:
---

You have a point. In the same file, the reverse mapping maps float / FloatType 
to REAL (but also to java.sql.Types.FLOAT)?
The reference you give also says FLOAT and DOUBLE both map to double, not to 
float/double respectively.

I'm concerned about changing behavior, but that does seem internally 
inconsistent at least.

Does anyone have a view on whether this should change in 2.3?

CC [~tmyklebu] who I believe wrote the code in question.
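As a possible user-side workaround in the meantime (a sketch, not a proposed change to the built-in mapping), a custom JdbcDialect can map java.sql.Types.REAL to FloatType for a given JDBC URL prefix; the "jdbc:somedb" prefix below is hypothetical:

{code:scala}
// Sketch of a user-side workaround, not a proposed patch: map JDBC REAL to
// FloatType via a custom dialect. The URL prefix below is hypothetical.
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, FloatType, MetadataBuilder}

object RealAsFloatDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:somedb")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    if (sqlType == Types.REAL) Some(FloatType) else None
}

// JdbcDialects.registerDialect(RealAsFloatDialect)
{code}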

> JDBC REAL type is mapped to Double instead of Float
> ---
>
> Key: SPARK-22441
> URL: https://issues.apache.org/jira/browse/SPARK-22441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hongbo
>Priority: Minor
>
> In [JDBC 
> Specification|http://download.oracle.com/otn-pub/jcp/jdbc-4_1-mrel-eval-spec/jdbc4.1-fr-spec.pdf],
>  REAL should be mapped to Float. 
> But now, it's mapped to Double:
> [https://github.com/apache/spark/blob/bc7ca9786e162e33f29d57c4aacb830761b97221/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L220]
> Should it be changed according to the specification?






[jira] [Created] (SPARK-22441) JDBC REAL type is mapped to Double instead of Float

2017-11-03 Thread Hongbo (JIRA)
Hongbo created SPARK-22441:
--

 Summary: JDBC REAL type is mapped to Double instead of Float
 Key: SPARK-22441
 URL: https://issues.apache.org/jira/browse/SPARK-22441
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Hongbo
Priority: Minor


In [JDBC 
Specification|http://download.oracle.com/otn-pub/jcp/jdbc-4_1-mrel-eval-spec/jdbc4.1-fr-spec.pdf],
 REAL should be mapped to Float. 
But now, it's mapped to Double:
[https://github.com/apache/spark/blob/bc7ca9786e162e33f29d57c4aacb830761b97221/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L220]

Should it be changed according to the specification?






[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results

2017-11-03 Thread Henry Robinson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238281#comment-16238281
 ] 

Henry Robinson commented on SPARK-22211:


Sounds good, thanks both.

> LimitPushDown optimization for FullOuterJoin generates wrong results
> 
>
> Key: SPARK-22211
> URL: https://issues.apache.org/jira/browse/SPARK-22211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: on community.cloude.databrick.com 
> Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11)
>Reporter: Benyi Wang
>Priority: Major
>
> LimitPushDown pushes LocalLimit to one side for a FullOuterJoin, but this may 
> generate a wrong result:
> Assume we use limit(1) and the LocalLimit is pushed to the left side, where id=999 
> is selected, while the right side has 100K rows including 999. The result 
> will be
> - one row of (999, 999)
> - the remaining rows of (null, xxx)
> Once you call show(), the row (999, 999) has only a 1/10th chance of being 
> selected by CollectLimit.
> A correct optimization might be to 
> - push down the limit
> - but convert the join to a Broadcast LeftOuterJoin or RightOuterJoin.
> Here is my notebook:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html
> {code:java}
> import scala.util.Random._
> val dl = shuffle(1 to 10).toDF("id")
> val dr = shuffle(1 to 10).toDF("id")
> println("data frame dl:")
> dl.explain
> println("data frame dr:")
> dr.explain
> val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1)
> j.explain
> j.show(false)
> {code}
> {code}
> data frame dl:
> == Physical Plan ==
> LocalTableScan [id#10]
> data frame dr:
> == Physical Plan ==
> LocalTableScan [id#16]
> == Physical Plan ==
> CollectLimit 1
> +- SortMergeJoin [id#10], [id#16], FullOuter
>:- *Sort [id#10 ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(id#10, 200)
>: +- *LocalLimit 1
>:+- LocalTableScan [id#10]
>+- *Sort [id#16 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(id#16, 200)
>  +- LocalTableScan [id#16]
> import scala.util.Random._
> dl: org.apache.spark.sql.DataFrame = [id: int]
> dr: org.apache.spark.sql.DataFrame = [id: int]
> j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int]
> ++---+
> |id  |id |
> ++---+
> |null|148|
> ++---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22440) Add Calinski-Harabasz index to ClusteringEvaluator

2017-11-03 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238241#comment-16238241
 ] 

Marco Gaido commented on SPARK-22440:
-

Honestly, I don't know what people are using for clustering evaluation, nor 
where to retrieve such a statistic. My goal here was to make it easier for people 
to migrate their existing workloads to Spark. Since sklearn is surely one of the 
most widespread libraries for machine learning, existing workloads may evaluate 
an unsupervised clustering through Silhouette or Calinski-Harabasz. If we support 
both, I think the adoption of Spark would be easier for them.
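
For context, the Calinski-Harabasz index is the ratio of between-cluster to within-cluster dispersion, CH = (B / (k - 1)) / (W / (n - k)). A rough local sketch of the computation (illustrative only, not the proposed ClusteringEvaluator API):

{code}
// Back-of-the-envelope Calinski-Harabasz on local data. Assumes 1 < k < n,
// i.e. at least two clusters and more points than clusters.
def calinskiHarabasz(points: Seq[(Seq[Double], Int)]): Double = {
  val n = points.size
  val dims = points.head._1.size
  val k = points.map(_._2).distinct.size

  def mean(xs: Seq[Seq[Double]]): Seq[Double] =
    (0 until dims).map(i => xs.map(_(i)).sum / xs.size)
  def sqDist(a: Seq[Double], b: Seq[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  val globalCentroid = mean(points.map(_._1))
  val byCluster = points.groupBy(_._2).mapValues(_.map(_._1))
  val centroids = byCluster.map { case (c, xs) => c -> mean(xs) }

  // B: between-cluster dispersion, W: within-cluster dispersion
  val between = byCluster.map { case (c, xs) => xs.size * sqDist(centroids(c), globalCentroid) }.sum
  val within  = byCluster.map { case (c, xs) => xs.map(sqDist(_, centroids(c))).sum }.sum
  (between / (k - 1)) / (within / (n - k))
}
{code}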

> Add Calinski-Harabasz index to ClusteringEvaluator
> --
>
> Key: SPARK-22440
> URL: https://issues.apache.org/jira/browse/SPARK-22440
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Minor
>
> In SPARK-14516 we introduced ClusteringEvaluator with an implementation of 
> Silhouette.
> sklearn contains also another metric for the evaluation of unsupervised 
> clustering results. The metric is Calinski-Harabasz. This JIRA is to add it 
> to Spark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs

2017-11-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238202#comment-16238202
 ] 

Sean Owen commented on SPARK-22424:
---

Yes, but how is that related to the tasks you are highlighting? The tasks you 
highlight are both successful. A task may succeed, but that doesn't mean all 
tasks in the batch have succeeded.

> Task not finished for a long time in monitor UI, but I found it finished in 
> logs
> 
>
> Key: SPARK-22424
> URL: https://issues.apache.org/jira/browse/SPARK-22424
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: chengning
>Priority: Blocking
> Attachments: 1.jpg, 1.png, C33oL.jpg
>
>
> Task not finished for a long time in monitor UI, but I found it finished in 
> logs
> Thanks a lot.
> !https://i.stack.imgur.com/C33oL.jpg!
> !C33oL.jpg|thumbnail!
> executor log:
> 17/09/29 17:32:28 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 213492
> 17/09/29 17:32:28 INFO executor.Executor: Running task 52.0 in stage 2468.0 
> (TID 213492)
> 17/09/29 17:32:28 INFO storage.ShuffleBlockFetcherIterator: Getting 30 
> non-empty blocks out of 30 blocks
> 17/09/29 17:32:28 INFO storage.ShuffleBlockFetcherIterator: Started 29 remote 
> fetches in 1 ms
> 17:32:28.447: tcPartition=7 ms
> 17/09/29 17:32:28 INFO executor.Executor: Finished task 52.0 in stage 2468.0 
> (TID 213492). 2755 bytes result sent to driver
> driver log::
> 17/09/29 17:32:28 INFO scheduler.TaskSetManager: Starting task 52.0 in stage 
> 2468.0 (TID 213492, HMGQXD2, executor 1, partition 52, PROCESS_LOCAL, 6386 
> bytes)
> 17/09/29 17:32:28 INFO scheduler.TaskSetManager: Finished task 52.0 in stage 
> 2468.0 (TID 213492) in 24 ms on HMGQXD2 (executor 1) (53/200)
> 17/09/29 17:32:28 INFO cluster.YarnScheduler: Removed TaskSet 2468.0, whose 
> tasks have all completed, from pool 
> 17/09/29 17:32:28 INFO scheduler.DAGScheduler: ResultStage 2468 
> (foreachPartition at Counter2.java:152) finished in 0.255 s
> 17/09/29 17:32:28 INFO scheduler.DAGScheduler: Job 1647 finished: 
> foreachPartition at Counter2.java:152, took 0.415256 s



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22440) Add Calinski-Harabasz index to ClusteringEvaluator

2017-11-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238179#comment-16238179
 ] 

Sean Owen commented on SPARK-22440:
---

I had honestly never heard of this. Is this widely used at all? I don't think 
it's a goal to add every known metric here.

> Add Calinski-Harabasz index to ClusteringEvaluator
> --
>
> Key: SPARK-22440
> URL: https://issues.apache.org/jira/browse/SPARK-22440
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Minor
>
> In SPARK-14516 we introduced ClusteringEvaluator with an implementation of 
> Silhouette.
> sklearn contains also another metric for the evaluation of unsupervised 
> clustering results. The metric is Calinski-Harabasz. This JIRA is to add it 
> to Spark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22433) Linear regression R^2 train/test terminology related

2017-11-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238150#comment-16238150
 ] 

Sean Owen commented on SPARK-22433:
---

Yeah, R^2 _could_ be used as a metric, but it fits more naturally as metadata 
from the model-fitting process. You're right, it's put in too general a place; 
R^2 really is highly specific to linear models. And yeah, we probably can't go 
prohibiting that now (though if there's a nice place for a warning, it might be 
worthwhile).

I think it's OK to change the test, but am neutral on it. I would have just 
written a test case using RMSE.

> Linear regression R^2 train/test terminology related 
> -
>
> Key: SPARK-22433
> URL: https://issues.apache.org/jira/browse/SPARK-22433
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Teng Peng
>Priority: Minor
>
> Traditional statistics is traditional statistics. Their goal, framework, and 
> terminologies are not the same as ML. However, in linear regression related 
> components, this distinction is not clear, which is reflected:
> 1. regressionMetric + regressionEvaluator : 
> * R2 shouldn't be there. 
> * A better name "regressionPredictionMetric".
> 2. LinearRegressionSuite: 
> * Shouldn't test R2 and residuals on test data. 
> * There is no train set and test set in this setting.
> 3. Terminology: there is no "linear regression with L1 regularization". 
> Linear regression is linear. Adding a penalty term, then it is no longer 
> linear. Just call it "LASSO", "ElasticNet".
> There are more. I am working on correcting them.
> They are not breaking anything, but it does not make one feel good to see the 
> basic distinction is blurred.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22433) Linear regression R^2 train/test terminology related

2017-11-03 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238137#comment-16238137
 ] 

Seth Hendrickson commented on SPARK-22433:
--

The main problem I see is that we put "r2" in the `RegressionEvaluator` class, 
which can be used for all types of regression - e.g. DecisionTreeRegressor, for 
which it is nonsensical. Removing it would break compatibility and is probably 
not worth it, since the end user is responsible for using the tools 
appropriately anyway. I'm not sure there is much to do here.

AFAIK using r2 on regularized models is a fuzzy area, but I don't think it's 
doing much harm to leave it and I don't think we'd be concerned about our test 
cases. Certainly unit tests don't imply an endorsement of the methodology 
anyway.

> Linear regression R^2 train/test terminology related 
> -
>
> Key: SPARK-22433
> URL: https://issues.apache.org/jira/browse/SPARK-22433
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Teng Peng
>Priority: Minor
>
> Traditional statistics is traditional statistics. Their goal, framework, and 
> terminologies are not the same as ML. However, in linear regression related 
> components, this distinction is not clear, which is reflected:
> 1. regressionMetric + regressionEvaluator : 
> * R2 shouldn't be there. 
> * A better name "regressionPredictionMetric".
> 2. LinearRegressionSuite: 
> * Shouldn't test R2 and residuals on test data. 
> * There is no train set and test set in this setting.
> 3. Terminology: there is no "linear regression with L1 regularization". 
> Linear regression is linear. Adding a penalty term, then it is no longer 
> linear. Just call it "LASSO", "ElasticNet".
> There are more. I am working on correcting them.
> They are not breaking anything, but it does not make one feel good to see the 
> basic distinction is blurred.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22440) Add Calinski-Harabasz index to ClusteringEvaluator

2017-11-03 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238095#comment-16238095
 ] 

Marco Gaido commented on SPARK-22440:
-

I am preparing an implementation for this. It will still take some time. As soon 
as it is ready, I'll submit a PR.

> Add Calinski-Harabasz index to ClusteringEvaluator
> --
>
> Key: SPARK-22440
> URL: https://issues.apache.org/jira/browse/SPARK-22440
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Minor
>
> In SPARK-14516 we introduced ClusteringEvaluator with an implementation of 
> Silhouette.
> sklearn contains also another metric for the evaluation of unsupervised 
> clustering results. The metric is Calinski-Harabasz. This JIRA is to add it 
> to Spark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22440) Add Calinski-Harabasz index to ClusteringEvaluator

2017-11-03 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22440:
---

 Summary: Add Calinski-Harabasz index to ClusteringEvaluator
 Key: SPARK-22440
 URL: https://issues.apache.org/jira/browse/SPARK-22440
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.3.0
Reporter: Marco Gaido
Priority: Minor


In SPARK-14516 we introduced ClusteringEvaluator with an implementation of 
Silhouette.
sklearn contains also another metric for the evaluation of unsupervised 
clustering results. The metric is Calinski-Harabasz. This JIRA is to add it to 
Spark.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22433) Linear regression R^2 train/test terminology related

2017-11-03 Thread Teng Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237884#comment-16237884
 ] 

Teng Peng commented on SPARK-22433:
---

Thanks for the quick response, Sean. I am glad this issue is being discussed in 
the Spark community.

I understand how important coherence is, and it's the users' decision to do what 
they believe is appropriate.

I just want to propose a one-line change: change eval.setMetricName("r2") to 
"mse" in test("cross validation with linear regression"). Then we would not 
leave the impression that "Wait, what? Spark officially cross-validates on R^2?"
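
Concretely, the proposed one-line change would amount to something like the following sketch (RegressionEvaluator already accepts "mse" alongside "rmse", "mae" and "r2"; label and prediction column names below are the defaults and only shown for clarity):

{code}
import org.apache.spark.ml.evaluation.RegressionEvaluator

// Sketch of the suggested change: cross-validate on MSE rather than R^2.
// CrossValidator picks the best model via Evaluator.isLargerBetter, which is
// false for "mse", so no other change should be needed.
val eval = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("mse")   // previously "r2"
{code}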

> Linear regression R^2 train/test terminology related 
> -
>
> Key: SPARK-22433
> URL: https://issues.apache.org/jira/browse/SPARK-22433
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Teng Peng
>Priority: Minor
>
> Traditional statistics is traditional statistics. Their goal, framework, and 
> terminologies are not the same as ML. However, in linear regression related 
> components, this distinction is not clear, which is reflected:
> 1. regressionMetric + regressionEvaluator : 
> * R2 shouldn't be there. 
> * A better name "regressionPredictionMetric".
> 2. LinearRegressionSuite: 
> * Shouldn't test R2 and residuals on test data. 
> * There is no train set and test set in this setting.
> 3. Terminology: there is no "linear regression with L1 regularization". 
> Linear regression is linear. Adding a penalty term, then it is no longer 
> linear. Just call it "LASSO", "ElasticNet".
> There are more. I am working on correcting them.
> They are not breaking anything, but it does not make one feel good to see the 
> basic distinction is blurred.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22433) Linear regression R^2 train/test terminology related

2017-11-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237830#comment-16237830
 ] 

Sean Owen commented on SPARK-22433:
---

I think one misunderstanding here is that you need not apply an Evaluator to a 
held-out test set. You could apply it to the training set, and indeed, normally 
would. I'd further assert that while R^2 isn't that useful to measure 
generalization error on a held-out test set, it's not incoherent. So it's not 
even some practice we need to prohibit. But yeah, that's not what it's for, 
even in MLlib.

> Linear regression R^2 train/test terminology related 
> -
>
> Key: SPARK-22433
> URL: https://issues.apache.org/jira/browse/SPARK-22433
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Teng Peng
>Priority: Minor
>
> Traditional statistics is traditional statistics. Their goal, framework, and 
> terminologies are not the same as ML. However, in linear regression related 
> components, this distinction is not clear, which is reflected:
> 1. regressionMetric + regressionEvaluator : 
> * R2 shouldn't be there. 
> * A better name "regressionPredictionMetric".
> 2. LinearRegressionSuite: 
> * Shouldn't test R2 and residuals on test data. 
> * There is no train set and test set in this setting.
> 3. Terminology: there is no "linear regression with L1 regularization". 
> Linear regression is linear. Adding a penalty term, then it is no longer 
> linear. Just call it "LASSO", "ElasticNet".
> There are more. I am working on correcting them.
> They are not breaking anything, but it does not make one feel good to see the 
> basic distinction is blurred.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results

2017-11-03 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237825#comment-16237825
 ] 

Xiao Li commented on SPARK-22211:
-

We should merge it to master and the previous release branches first.

> LimitPushDown optimization for FullOuterJoin generates wrong results
> 
>
> Key: SPARK-22211
> URL: https://issues.apache.org/jira/browse/SPARK-22211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: on community.cloude.databrick.com 
> Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11)
>Reporter: Benyi Wang
>Priority: Major
>
> LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may 
> generate a wrong result:
> Assume we use limit(1) and LocalLimit will be pushed to left side, and id=999 
> is selected, but at right side we have 100K rows including 999, the result 
> will be
> - one row is (999, 999)
> - the rest rows are (null, xxx)
> Once you call show(), the row (999,999) has only 1/10th chance to be 
> selected by CollectLimit.
> The actual optimization might be, 
> - push down limit
> - but convert the join to Broadcast LeftOuterJoin or RightOuterJoin.
> Here is my notebook:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html
> {code:java}
> import scala.util.Random._
> val dl = shuffle(1 to 10).toDF("id")
> val dr = shuffle(1 to 10).toDF("id")
> println("data frame dl:")
> dl.explain
> println("data frame dr:")
> dr.explain
> val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1)
> j.explain
> j.show(false)
> {code}
> {code}
> data frame dl:
> == Physical Plan ==
> LocalTableScan [id#10]
> data frame dr:
> == Physical Plan ==
> LocalTableScan [id#16]
> == Physical Plan ==
> CollectLimit 1
> +- SortMergeJoin [id#10], [id#16], FullOuter
>:- *Sort [id#10 ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(id#10, 200)
>: +- *LocalLimit 1
>:+- LocalTableScan [id#10]
>+- *Sort [id#16 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(id#16, 200)
>  +- LocalTableScan [id#16]
> import scala.util.Random._
> dl: org.apache.spark.sql.DataFrame = [id: int]
> dr: org.apache.spark.sql.DataFrame = [id: int]
> j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int]
> ++---+
> |id  |id |
> ++---+
> |null|148|
> ++---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results

2017-11-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237822#comment-16237822
 ] 

Sean Owen commented on SPARK-22211:
---

In the name of correctness, it's still worth disabling this for now and then 
fixing it properly later, right?

> LimitPushDown optimization for FullOuterJoin generates wrong results
> 
>
> Key: SPARK-22211
> URL: https://issues.apache.org/jira/browse/SPARK-22211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: on community.cloude.databrick.com 
> Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11)
>Reporter: Benyi Wang
>Priority: Major
>
> LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may 
> generate a wrong result:
> Assume we use limit(1) and LocalLimit will be pushed to left side, and id=999 
> is selected, but at right side we have 100K rows including 999, the result 
> will be
> - one row is (999, 999)
> - the rest rows are (null, xxx)
> Once you call show(), the row (999,999) has only 1/10th chance to be 
> selected by CollectLimit.
> The actual optimization might be, 
> - push down limit
> - but convert the join to Broadcast LeftOuterJoin or RightOuterJoin.
> Here is my notebook:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html
> {code:java}
> import scala.util.Random._
> val dl = shuffle(1 to 10).toDF("id")
> val dr = shuffle(1 to 10).toDF("id")
> println("data frame dl:")
> dl.explain
> println("data frame dr:")
> dr.explain
> val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1)
> j.explain
> j.show(false)
> {code}
> {code}
> data frame dl:
> == Physical Plan ==
> LocalTableScan [id#10]
> data frame dr:
> == Physical Plan ==
> LocalTableScan [id#16]
> == Physical Plan ==
> CollectLimit 1
> +- SortMergeJoin [id#10], [id#16], FullOuter
>:- *Sort [id#10 ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(id#10, 200)
>: +- *LocalLimit 1
>:+- LocalTableScan [id#10]
>+- *Sort [id#16 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(id#16, 200)
>  +- LocalTableScan [id#16]
> import scala.util.Random._
> dl: org.apache.spark.sql.DataFrame = [id: int]
> dr: org.apache.spark.sql.DataFrame = [id: int]
> j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int]
> ++---+
> |id  |id |
> ++---+
> |null|148|
> ++---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results

2017-11-03 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237797#comment-16237797
 ] 

Xiao Li commented on SPARK-22211:
-

The Join operator should be limit-aware. Anyway, we can do that later.

> LimitPushDown optimization for FullOuterJoin generates wrong results
> 
>
> Key: SPARK-22211
> URL: https://issues.apache.org/jira/browse/SPARK-22211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: on community.cloude.databrick.com 
> Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11)
>Reporter: Benyi Wang
>Priority: Major
>
> LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may 
> generate a wrong result:
> Assume we use limit(1) and LocalLimit will be pushed to left side, and id=999 
> is selected, but at right side we have 100K rows including 999, the result 
> will be
> - one row is (999, 999)
> - the rest rows are (null, xxx)
> Once you call show(), the row (999,999) has only 1/10th chance to be 
> selected by CollectLimit.
> The actual optimization might be, 
> - push down limit
> - but convert the join to Broadcast LeftOuterJoin or RightOuterJoin.
> Here is my notebook:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html
> {code:java}
> import scala.util.Random._
> val dl = shuffle(1 to 10).toDF("id")
> val dr = shuffle(1 to 10).toDF("id")
> println("data frame dl:")
> dl.explain
> println("data frame dr:")
> dr.explain
> val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1)
> j.explain
> j.show(false)
> {code}
> {code}
> data frame dl:
> == Physical Plan ==
> LocalTableScan [id#10]
> data frame dr:
> == Physical Plan ==
> LocalTableScan [id#16]
> == Physical Plan ==
> CollectLimit 1
> +- SortMergeJoin [id#10], [id#16], FullOuter
>:- *Sort [id#10 ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(id#10, 200)
>: +- *LocalLimit 1
>:+- LocalTableScan [id#10]
>+- *Sort [id#16 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(id#16, 200)
>  +- LocalTableScan [id#16]
> import scala.util.Random._
> dl: org.apache.spark.sql.DataFrame = [id: int]
> dr: org.apache.spark.sql.DataFrame = [id: int]
> j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int]
> ++---+
> |id  |id |
> ++---+
> |null|148|
> ++---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results

2017-11-03 Thread Henry Robinson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237778#comment-16237778
 ] 

Henry Robinson commented on SPARK-22211:


[~smilegator] - sounds good! What will your approach be? I wasn't able to see a 
safe way to push the limit through the join without either a more invasive 
rewrite or restricting the set of join operators for FOJ. 
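
For anyone wanting to check whether a given build is affected, inspecting the optimized plan shows whether the LocalLimit has been pushed below the FULL OUTER join. A quick check along the lines of the reproduction in the description (assuming a spark-shell session where {{spark}} is in scope):

{code}
import scala.util.Random.shuffle
import spark.implicits._

val dl = shuffle(1 to 100000).toDF("id")
val dr = shuffle(1 to 100000).toDF("id")
val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1)

// If a LocalLimit node appears underneath the FullOuter Join in this plan,
// the pushdown has been applied and j.show() can return a wrong row.
println(j.queryExecution.optimizedPlan.numberedTreeString)
{code}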

> LimitPushDown optimization for FullOuterJoin generates wrong results
> 
>
> Key: SPARK-22211
> URL: https://issues.apache.org/jira/browse/SPARK-22211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: on community.cloude.databrick.com 
> Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11)
>Reporter: Benyi Wang
>Priority: Major
>
> LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may 
> generate a wrong result:
> Assume we use limit(1) and LocalLimit will be pushed to left side, and id=999 
> is selected, but at right side we have 100K rows including 999, the result 
> will be
> - one row is (999, 999)
> - the rest rows are (null, xxx)
> Once you call show(), the row (999,999) has only 1/10th chance to be 
> selected by CollectLimit.
> The actual optimization might be, 
> - push down limit
> - but convert the join to Broadcast LeftOuterJoin or RightOuterJoin.
> Here is my notebook:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html
> {code:java}
> import scala.util.Random._
> val dl = shuffle(1 to 10).toDF("id")
> val dr = shuffle(1 to 10).toDF("id")
> println("data frame dl:")
> dl.explain
> println("data frame dr:")
> dr.explain
> val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1)
> j.explain
> j.show(false)
> {code}
> {code}
> data frame dl:
> == Physical Plan ==
> LocalTableScan [id#10]
> data frame dr:
> == Physical Plan ==
> LocalTableScan [id#16]
> == Physical Plan ==
> CollectLimit 1
> +- SortMergeJoin [id#10], [id#16], FullOuter
>:- *Sort [id#10 ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(id#10, 200)
>: +- *LocalLimit 1
>:+- LocalTableScan [id#10]
>+- *Sort [id#16 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(id#16, 200)
>  +- LocalTableScan [id#16]
> import scala.util.Random._
> dl: org.apache.spark.sql.DataFrame = [id: int]
> dr: org.apache.spark.sql.DataFrame = [id: int]
> j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int]
> ++---+
> |id  |id |
> ++---+
> |null|148|
> ++---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22417) createDataFrame from a pandas.DataFrame reads datetime64 values as longs

2017-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22417:


Assignee: (was: Apache Spark)

> createDataFrame from a pandas.DataFrame reads datetime64 values as longs
> 
>
> Key: SPARK-22417
> URL: https://issues.apache.org/jira/browse/SPARK-22417
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Bryan Cutler
>Priority: Normal
>
> When trying to create a Spark DataFrame from an existing Pandas DataFrame 
> using {{createDataFrame}}, columns with datetime64 values are converted as 
> long values.  This is only when the schema is not specified.  
> {code}
> In [2]: import pandas as pd
>...: from datetime import datetime
>...: 
> In [3]: pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)]})
> In [4]: df = spark.createDataFrame(pdf)
> In [5]: df.show()
> +---+
> | ts|
> +---+
> |15094116610|
> +---+
> In [6]: df.schema
> Out[6]: StructType(List(StructField(ts,LongType,true)))
> {code}
> Spark should interpret a datetime64\[D\] value as DateType and other 
> datetime64 values as TimestampType.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22433) Linear regression R^2 train/test terminology related

2017-11-03 Thread Teng Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237774#comment-16237774
 ] 

Teng Peng commented on SPARK-22433:
---

What I agree with you on: be coherent, and prefer an ML-oriented standard.

What I want to add: be coherent, and prefer an ML-oriented standard only if we 
are talking about ML. If we are talking about traditional statistics, we should 
stick to the established standards of traditional statistics.

What I want to explain:
1.
ML world: there is a training set and a test set. We have them to evaluate 
whether our models have good prediction performance. Without them, we 
unavoidably overfit.
Traditional statistics world: there is no training set and test set, because 
the goal is interpretation of models, not prediction performance.
R^2 is in the framework of traditional statistics, and it has nothing to do 
with prediction-related goals. If we are using R^2, we are in the domain of 
traditional statistics. If our goal is interpretation, then we look at R^2.

2.
regressionMetric and regressionEvaluator are designed for ML-related goals 
using a linear regression approach (which might be useful as a benchmark). So 
these two are actually in the domain of ML, not traditional statistics. 
However, R^2 is mixed into them, and this mixture appears everywhere. Look at 
test("cross validation with linear regression"): R^2 is evaluated by cross 
validation, and the larger the better. This is a misunderstanding of what R^2 is.

The bottom line: there is a clear distinction between traditional statistics 
and ML. If something belongs to traditional statistics, we should not mix it 
into ML.

> Linear regression R^2 train/test terminology related 
> -
>
> Key: SPARK-22433
> URL: https://issues.apache.org/jira/browse/SPARK-22433
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Teng Peng
>Priority: Minor
>
> Traditional statistics is traditional statistics. Their goal, framework, and 
> terminologies are not the same as ML. However, in linear regression related 
> components, this distinction is not clear, which is reflected:
> 1. regressionMetric + regressionEvaluator : 
> * R2 shouldn't be there. 
> * A better name "regressionPredictionMetric".
> 2. LinearRegressionSuite: 
> * Shouldn't test R2 and residuals on test data. 
> * There is no train set and test set in this setting.
> 3. Terminology: there is no "linear regression with L1 regularization". 
> Linear regression is linear. Adding a penalty term, then it is no longer 
> linear. Just call it "LASSO", "ElasticNet".
> There are more. I am working on correcting them.
> They are not breaking anything, but it does not make one feel good to see the 
> basic distinction is blurred.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22417) createDataFrame from a pandas.DataFrame reads datetime64 values as longs

2017-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22417:


Assignee: Apache Spark

> createDataFrame from a pandas.DataFrame reads datetime64 values as longs
> 
>
> Key: SPARK-22417
> URL: https://issues.apache.org/jira/browse/SPARK-22417
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Normal
>
> When trying to create a Spark DataFrame from an existing Pandas DataFrame 
> using {{createDataFrame}}, columns with datetime64 values are converted as 
> long values.  This is only when the schema is not specified.  
> {code}
> In [2]: import pandas as pd
>...: from datetime import datetime
>...: 
> In [3]: pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)]})
> In [4]: df = spark.createDataFrame(pdf)
> In [5]: df.show()
> +---+
> | ts|
> +---+
> |15094116610|
> +---+
> In [6]: df.schema
> Out[6]: StructType(List(StructField(ts,LongType,true)))
> {code}
> Spark should interpret a datetime64\[D\] value as DateType and other 
> datetime64 values as TimestampType.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22147) BlockId.hashCode allocates a StringBuilder/String on each call

2017-11-03 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237775#comment-16237775
 ] 

Bryan Cutler commented on SPARK-22147:
--

Sorry, I linked the above PR to this JIRA accidentally.

> BlockId.hashCode allocates a StringBuilder/String on each call
> --
>
> Key: SPARK-22147
> URL: https://issues.apache.org/jira/browse/SPARK-22147
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.2.0
>Reporter: Sergei Lebedev
>Assignee: Sergei Lebedev
>Priority: Minor
> Fix For: 2.3.0
>
>
> The base class {{BlockId}} 
> [defines|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockId.scala#L44]
>  {{hashCode}} and {{equals}} for all its subclasses in terms of {{name}}. 
> This makes the definitions of different ID types [very 
> concise|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockId.scala#L52].
>  The downside, however, is redundant allocations. While I don't think this 
> could be the major issue, it is still a bit disappointing to increase GC 
> pressure on the driver for nothing. For our machine learning workloads, we've 
> seen as much as 10% of all allocations on the driver coming from 
> {{BlockId.hashCode}} calls done for 
> [BlockManagerMasterEndpoint.blockLocations|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L54].
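
One possible shape of a fix, sketched below for illustration (not necessarily the change that was merged): compute the name-derived hash once per instance instead of rebuilding the name string on every hashCode call.

{code}
// Illustrative sketch only: cache the hash derived from `name` so repeated
// hashCode calls do not allocate a new String each time. Subclasses still
// only define `name`, as in the current code.
sealed abstract class BlockId {
  def name: String
  @transient private lazy val cachedHashCode: Int = name.hashCode
  override def hashCode: Int = cachedHashCode
  override def equals(other: Any): Boolean = other match {
    case that: BlockId => getClass == that.getClass && name == that.name
    case _ => false
  }
}

case class RDDBlockId(rddId: Int, splitIndex: Int) extends BlockId {
  override def name: String = "rdd_" + rddId + "_" + splitIndex
}
{code}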



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22417) createDataFrame from a pandas.DataFrame reads datetime64 values as longs

2017-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237772#comment-16237772
 ] 

Apache Spark commented on SPARK-22417:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/19646

> createDataFrame from a pandas.DataFrame reads datetime64 values as longs
> 
>
> Key: SPARK-22417
> URL: https://issues.apache.org/jira/browse/SPARK-22417
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Bryan Cutler
>Priority: Normal
>
> When trying to create a Spark DataFrame from an existing Pandas DataFrame 
> using {{createDataFrame}}, columns with datetime64 values are converted as 
> long values.  This is only when the schema is not specified.  
> {code}
> In [2]: import pandas as pd
>...: from datetime import datetime
>...: 
> In [3]: pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)]})
> In [4]: df = spark.createDataFrame(pdf)
> In [5]: df.show()
> +---+
> | ts|
> +---+
> |15094116610|
> +---+
> In [6]: df.schema
> Out[6]: StructType(List(StructField(ts,LongType,true)))
> {code}
> Spark should interpret a datetime64\[D\] value as DateType and other 
> datetime64 values as TimestampType.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22426) Spark AM launching containers on node where External spark shuffle service failed to initialize

2017-11-03 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237769#comment-16237769
 ] 

Prabhu Joseph commented on SPARK-22426:
---

Thanks [~jerryshao], we can close this as a duplicate.

> Spark AM launching containers on node where External spark shuffle service 
> failed to initialize
> ---
>
> Key: SPARK-22426
> URL: https://issues.apache.org/jira/browse/SPARK-22426
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.3
>Reporter: Prabhu Joseph
>Priority: Major
>
> When the Spark External Shuffle Service on a NodeManager fails, remote 
> executors will fail while fetching data from the executors launched on 
> that node. The Spark ApplicationMaster should either not launch containers on 
> that node or not use the external shuffle service.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22437) jdbc write fails to set default mode

2017-11-03 Thread Adrian Bridgett (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237733#comment-16237733
 ] 

Adrian Bridgett commented on SPARK-22437:
-

Good grief, that was fast!

> jdbc write fails to set default mode
> 
>
> Key: SPARK-22437
> URL: https://issues.apache.org/jira/browse/SPARK-22437
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
> Environment: python3
>Reporter: Adrian Bridgett
>
> With this bit of code:
> {code}
> df.write.jdbc(jdbc_url, table)
> {code}
> We see this error:
> {code}
> 09:54:22 2017-11-03 09:54:19,985 INFO   File 
> "/opt/spark220/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 820, 
> in jdbc
> 09:54:22 2017-11-03 09:54:19,985 INFO   File 
> "/opt/spark220/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 1133, in __call__
> 09:54:22 2017-11-03 09:54:19,986 INFO   File 
> "/opt/spark220/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
> 09:54:22 2017-11-03 09:54:19,986 INFO   File 
> "/opt/spark220/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in 
> get_return_value
> 09:54:22 2017-11-03 09:54:19,987 INFO py4j.protocol.Py4JJavaError: An 
> error occurred while calling o106.jdbc.
> 09:54:22 2017-11-03 09:54:19,987 INFO : scala.MatchError: null
> 09:54:22 2017-11-03 09:54:19,987 INFO at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:62)
> 09:54:22 2017-11-03 09:54:19,987 INFO at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472)
> 09:54:22 2017-11-03 09:54:19,987 INFO at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
> ...
> {code}
> This seems to be that "mode" isn't correctly picking up the "error" default 
> as listed in 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
> Note that if it was erroring because of existing data it'd say "SaveMode: 
> ErrorIfExists." as seen in 
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala).
> Changing our code to this worked:
> {code}
> df.write.jdbc(jdbc_url, table, mode='append')
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-11-03 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237731#comment-16237731
 ] 

Timothy Hunter commented on SPARK-21866:


Adding {{spark.read.image}} is going to create a (soft) dependency between the 
core and mllib, which hosts the implementation of the current reader methods. 
This is fine and can be dealt with using reflection, but since this would involve 
adding a core API to Spark, I suggest we do it as a follow-up task.
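
For illustration, the kind of reflective lookup being described might look like this (all class and method names below are hypothetical placeholders, not an actual API):

{code}
// Hypothetical sketch of a "soft" core -> mllib dependency: core resolves an
// MLlib-hosted image reader at runtime via reflection, so there is no
// compile-time dependency on the mllib module. Names are placeholders.
val cls = Class.forName("org.apache.spark.ml.image.SomeImageReader$")
val readerObject = cls.getField("MODULE$").get(null)      // the Scala object instance
val readImages = cls.getMethod("readImages", classOf[String])
val images = readImages.invoke(readerObject, "hdfs:///path/to/images")
{code}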

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified 

[jira] [Commented] (SPARK-22430) Unknown tag warnings when building R docs with Roxygen 6.0.1

2017-11-03 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237719#comment-16237719
 ] 

Felix Cheung commented on SPARK-22430:
--

I am seeing it too. I think we can just remove the tag, but Jenkins is running 
an older version of roxygen2 and our earlier attempt to update it didn't go 
well (we have a JIRA on this). We will need to be very careful with this.



> Unknown tag warnings when building R docs with Roxygen 6.0.1
> 
>
> Key: SPARK-22430
> URL: https://issues.apache.org/jira/browse/SPARK-22430
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
> Environment: Roxygen 6.0.1
>Reporter: Joel Croteau
>
> When building R docs using create-rd.sh with Roxygen 6.0.1, a large number of 
> unknown tag warnings are generated:
> {noformat}
> Warning: @export [schema.R#33]: unknown tag
> Warning: @export [schema.R#53]: unknown tag
> Warning: @export [schema.R#63]: unknown tag
> Warning: @export [schema.R#80]: unknown tag
> Warning: @export [schema.R#123]: unknown tag
> Warning: @export [schema.R#141]: unknown tag
> Warning: @export [schema.R#216]: unknown tag
> Warning: @export [generics.R#388]: unknown tag
> Warning: @export [generics.R#403]: unknown tag
> Warning: @export [generics.R#407]: unknown tag
> Warning: @export [generics.R#414]: unknown tag
> Warning: @export [generics.R#418]: unknown tag
> Warning: @export [generics.R#422]: unknown tag
> Warning: @export [generics.R#428]: unknown tag
> Warning: @export [generics.R#432]: unknown tag
> Warning: @export [generics.R#438]: unknown tag
> Warning: @export [generics.R#442]: unknown tag
> Warning: @export [generics.R#446]: unknown tag
> Warning: @export [generics.R#450]: unknown tag
> Warning: @export [generics.R#454]: unknown tag
> Warning: @export [generics.R#459]: unknown tag
> Warning: @export [generics.R#467]: unknown tag
> Warning: @export [generics.R#475]: unknown tag
> Warning: @export [generics.R#479]: unknown tag
> Warning: @export [generics.R#483]: unknown tag
> Warning: @export [generics.R#487]: unknown tag
> Warning: @export [generics.R#498]: unknown tag
> Warning: @export [generics.R#502]: unknown tag
> Warning: @export [generics.R#506]: unknown tag
> Warning: @export [generics.R#512]: unknown tag
> Warning: @export [generics.R#518]: unknown tag
> Warning: @export [generics.R#526]: unknown tag
> Warning: @export [generics.R#530]: unknown tag
> Warning: @export [generics.R#534]: unknown tag
> Warning: @export [generics.R#538]: unknown tag
> Warning: @export [generics.R#542]: unknown tag
> Warning: @export [generics.R#549]: unknown tag
> Warning: @export [generics.R#556]: unknown tag
> Warning: @export [generics.R#560]: unknown tag
> Warning: @export [generics.R#567]: unknown tag
> Warning: @export [generics.R#571]: unknown tag
> Warning: @export [generics.R#575]: unknown tag
> Warning: @export [generics.R#579]: unknown tag
> Warning: @export [generics.R#583]: unknown tag
> Warning: @export [generics.R#587]: unknown tag
> Warning: @export [generics.R#591]: unknown tag
> Warning: @export [generics.R#595]: unknown tag
> Warning: @export [generics.R#599]: unknown tag
> Warning: @export [generics.R#603]: unknown tag
> Warning: @export [generics.R#607]: unknown tag
> Warning: @export [generics.R#611]: unknown tag
> Warning: @export [generics.R#615]: unknown tag
> Warning: @export [generics.R#619]: unknown tag
> Warning: @export [generics.R#623]: unknown tag
> Warning: @export [generics.R#627]: unknown tag
> Warning: @export [generics.R#631]: unknown tag
> Warning: @export [generics.R#635]: unknown tag
> Warning: @export [generics.R#639]: unknown tag
> Warning: @export [generics.R#643]: unknown tag
> Warning: @export [generics.R#647]: unknown tag
> Warning: @export [generics.R#654]: unknown tag
> Warning: @export [generics.R#658]: unknown tag
> Warning: @export [generics.R#663]: unknown tag
> Warning: @export [generics.R#667]: unknown tag
> Warning: @export [generics.R#672]: unknown tag
> Warning: @export [generics.R#676]: unknown tag
> Warning: @export [generics.R#680]: unknown tag
> Warning: @export [generics.R#684]: unknown tag
> Warning: @export [generics.R#690]: unknown tag
> Warning: @export [generics.R#696]: unknown tag
> Warning: @export [generics.R#702]: unknown tag
> Warning: @export [generics.R#706]: unknown tag
> Warning: @export [generics.R#710]: unknown tag
> Warning: @export [generics.R#716]: unknown tag
> Warning: @export [generics.R#720]: unknown tag
> Warning: @export [generics.R#726]: unknown tag
> Warning: @export [generics.R#730]: unknown tag
> Warning: @export [generics.R#734]: unknown tag
> Warning: @export [generics.R#738]: unknown tag
> Warning: @export [generics.R#742]: unknown tag
> Warning: @export [generics.R#750]: unknown tag
> Warning: 

[jira] [Assigned] (SPARK-22437) jdbc write fails to set default mode

2017-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22437:


Assignee: (was: Apache Spark)

> jdbc write fails to set default mode
> 
>
> Key: SPARK-22437
> URL: https://issues.apache.org/jira/browse/SPARK-22437
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
> Environment: python3
>Reporter: Adrian Bridgett
>
> With this bit of code:
> {code}
> df.write.jdbc(jdbc_url, table)
> {code}
> We see this error:
> {code}
> 09:54:22 2017-11-03 09:54:19,985 INFO   File 
> "/opt/spark220/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 820, 
> in jdbc
> 09:54:22 2017-11-03 09:54:19,985 INFO   File 
> "/opt/spark220/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 1133, in __call__
> 09:54:22 2017-11-03 09:54:19,986 INFO   File 
> "/opt/spark220/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
> 09:54:22 2017-11-03 09:54:19,986 INFO   File 
> "/opt/spark220/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in 
> get_return_value
> 09:54:22 2017-11-03 09:54:19,987 INFO py4j.protocol.Py4JJavaError: An 
> error occurred while calling o106.jdbc.
> 09:54:22 2017-11-03 09:54:19,987 INFO : scala.MatchError: null
> 09:54:22 2017-11-03 09:54:19,987 INFO at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:62)
> 09:54:22 2017-11-03 09:54:19,987 INFO at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472)
> 09:54:22 2017-11-03 09:54:19,987 INFO at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
> ...
> {code}
> This seems to be that "mode" isn't correctly picking up the "error" default 
> as listed in 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
> Note that if it was erroring because of existing data it'd say "SaveMode: 
> ErrorIfExists." as seen in 
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala).
> Changing our code to this worked:
> {code}
> df.write.jdbc(jdbc_url, table, mode='append')
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22437) jdbc write fails to set default mode

2017-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237716#comment-16237716
 ] 

Apache Spark commented on SPARK-22437:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/19654

> jdbc write fails to set default mode
> 
>
> Key: SPARK-22437
> URL: https://issues.apache.org/jira/browse/SPARK-22437
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
> Environment: python3
>Reporter: Adrian Bridgett
>
> With this bit of code:
> {code}
> df.write.jdbc(jdbc_url, table)
> {code}
> We see this error:
> {code}
> 09:54:22 2017-11-03 09:54:19,985 INFO   File 
> "/opt/spark220/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 820, 
> in jdbc
> 09:54:22 2017-11-03 09:54:19,985 INFO   File 
> "/opt/spark220/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 1133, in __call__
> 09:54:22 2017-11-03 09:54:19,986 INFO   File 
> "/opt/spark220/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
> 09:54:22 2017-11-03 09:54:19,986 INFO   File 
> "/opt/spark220/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in 
> get_return_value
> 09:54:22 2017-11-03 09:54:19,987 INFO py4j.protocol.Py4JJavaError: An 
> error occurred while calling o106.jdbc.
> 09:54:22 2017-11-03 09:54:19,987 INFO : scala.MatchError: null
> 09:54:22 2017-11-03 09:54:19,987 INFO at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:62)
> 09:54:22 2017-11-03 09:54:19,987 INFO at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472)
> 09:54:22 2017-11-03 09:54:19,987 INFO at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
> ...
> {code}
> This seems to be that "mode" isn't correctly picking up the "error" default 
> as listed in 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
> Note that if it was erroring because of existing data it'd say "SaveMode: 
> ErrorIfExists." as seen in 
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala).
> Changing our code to this worked:
> {code}
> df.write.jdbc(jdbc_url, table, mode='append')
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22437) jdbc write fails to set default mode

2017-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22437:


Assignee: Apache Spark

> jdbc write fails to set default mode
> 
>
> Key: SPARK-22437
> URL: https://issues.apache.org/jira/browse/SPARK-22437
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
> Environment: python3
>Reporter: Adrian Bridgett
>Assignee: Apache Spark
>
> With this bit of code:
> {code}
> df.write.jdbc(jdbc_url, table)
> {code}
> We see this error:
> {code}
> 09:54:22 2017-11-03 09:54:19,985 INFO   File 
> "/opt/spark220/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 820, 
> in jdbc
> 09:54:22 2017-11-03 09:54:19,985 INFO   File 
> "/opt/spark220/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 1133, in __call__
> 09:54:22 2017-11-03 09:54:19,986 INFO   File 
> "/opt/spark220/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
> 09:54:22 2017-11-03 09:54:19,986 INFO   File 
> "/opt/spark220/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in 
> get_return_value
> 09:54:22 2017-11-03 09:54:19,987 INFO py4j.protocol.Py4JJavaError: An 
> error occurred while calling o106.jdbc.
> 09:54:22 2017-11-03 09:54:19,987 INFO : scala.MatchError: null
> 09:54:22 2017-11-03 09:54:19,987 INFO at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:62)
> 09:54:22 2017-11-03 09:54:19,987 INFO at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472)
> 09:54:22 2017-11-03 09:54:19,987 INFO at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
> ...
> {code}
> This seems to be because "mode" isn't correctly picking up the "error" default 
> as listed in 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
> Note that if it was erroring because of existing data it'd say "SaveMode: 
> ErrorIfExists." as seen in 
> ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala).
> Changing our code to this worked:
> {code}
> df.write.jdbc(jdbc_url, table, mode='append')
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22433) Linear regression R^2 train/test terminology related

2017-11-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237676#comment-16237676
 ] 

Sean Owen commented on SPARK-22433:
---

Likewise, the goal here is not to adopt statistics terminology. As the name 
implies, MLlib comes more from ML terminology and practices. We should stick to 
standard language and approaches where there is a clear standard, completely 
agree; where there is an ML-oriented standard, we should probably prefer that 
one for consistency.

I'm not sure I agree with these yet. What is a "regression prediction metric"? 
And why can't R^2 be an evaluation metric? I am not sure I'd use it that way, 
but it's coherent. There's no issue with computing R^2 on a test set vs. the 
train set it was trained on -- it could be negative, sure.

I understand your distinction, but "linear regression with L1 reg" still 
produces a linear model. The cost function is not the same as in plain linear 
regression, but then the name is also different. LASSO is, I think, less well 
known than L1 to the people who would use this. While I wouldn't mind LASSO, I 
don't see a clear motive to change this.
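
For what it's worth, R^2 on a held-out set is already computable with the 
existing evaluator; a minimal sketch against the current ML API (the 
`predictions` DataFrame and its column names are assumed, not taken from this 
ticket):

{code:java}
import org.apache.spark.ml.evaluation.RegressionEvaluator

// `predictions` is assumed to be the output of a fitted regression model's
// transform() on a held-out test set, with "label" and "prediction" columns.
val r2 = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("r2")   // other supported metrics: "rmse", "mse", "mae"
  .evaluate(predictions)

println(s"R^2 on the held-out set: $r2")   // can legitimately be negative
{code}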


> Linear regression R^2 train/test terminology related 
> -
>
> Key: SPARK-22433
> URL: https://issues.apache.org/jira/browse/SPARK-22433
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Teng Peng
>Priority: Minor
>
> Traditional statistics is traditional statistics. Its goals, framework, and 
> terminology are not the same as ML's. However, in the linear-regression-related 
> components this distinction is not clear, which is reflected in:
> 1. regressionMetric + regressionEvaluator: 
> * R2 shouldn't be there. 
> * A better name would be "regressionPredictionMetric".
> 2. LinearRegressionSuite: 
> * It shouldn't test R2 and residuals on test data. 
> * There is no train set and test set in this setting.
> 3. Terminology: there is no "linear regression with L1 regularization". 
> Linear regression is linear; once a penalty term is added, it is no longer 
> linear. Just call it "LASSO" or "ElasticNet".
> There are more. I am working on correcting them.
> They are not breaking anything, but it does not make one feel good to see the 
> basic distinction blurred.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22418) Add test cases for NULL Handling

2017-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22418:


Assignee: (was: Apache Spark)

> Add test cases for NULL Handling
> 
>
> Key: SPARK-22418
> URL: https://issues.apache.org/jira/browse/SPARK-22418
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Normal
>
> Add test cases mentioned in the link https://sqlite.org/nulls.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22418) Add test cases for NULL Handling

2017-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22418:


Assignee: Apache Spark

> Add test cases for NULL Handling
> 
>
> Key: SPARK-22418
> URL: https://issues.apache.org/jira/browse/SPARK-22418
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Normal
>
> Add test cases mentioned in the link https://sqlite.org/nulls.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22418) Add test cases for NULL Handling

2017-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237647#comment-16237647
 ] 

Apache Spark commented on SPARK-22418:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/19653

> Add test cases for NULL Handling
> 
>
> Key: SPARK-22418
> URL: https://issues.apache.org/jira/browse/SPARK-22418
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Normal
>
> Add test cases mentioned in the link https://sqlite.org/nulls.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22439) Not able to get numeric columns for the file having decimal values

2017-11-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237626#comment-16237626
 ] 

Sean Owen commented on SPARK-22439:
---

Your code doesn't work as is, but I adapted it to Scala to try it. I realize 
you're calling a private method, which happens to be visible in the Java 
bytecode but isn't meant to be called. I don't think you're intended to call 
this directly.
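
For reference, the supported way to get at the same information is through the 
public schema API rather than the private numericColumns method; a rough sketch 
(reusing the `dataset` read above, other details assumed):

{code:java}
import org.apache.spark.sql.types.NumericType

// Select only the columns whose inferred type is numeric, using the public
// schema API instead of the private Dataset.numericColumns method.
val numericCols = dataset.schema.fields
  .filter(_.dataType.isInstanceOf[NumericType])
  .map(_.name)

val numericOnly = dataset.select(numericCols.map(dataset.col): _*)
numericOnly.printSchema()
{code}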

> Not able to get numeric columns for the file having decimal values
> --
>
> Key: SPARK-22439
> URL: https://issues.apache.org/jira/browse/SPARK-22439
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.2.0
>Reporter: Navya Krishnappa
>Priority: Major
>
> When reading the below-mentioned decimal value by specifying header as true.
> SourceFile: 
> 8.95977565356765764E+20
> 8.95977565356765764E+20
> 8.95977565356765764E+20
> Source code1:
> Dataset dataset = getSqlContext().read()
> .option(PARSER_LIB, "commons")
> .option(INFER_SCHEMA, "true")
> .option(HEADER, "true")
> .option(DELIMITER, ",")
> .option(QUOTE, "\"")
> .option(ESCAPE, "
> ")
> .option(MODE, Mode.PERMISSIVE)
> .csv(sourceFile);
> dataset.numericColumns()
> Result: 
> Caused by: java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$numericColumns$2.apply(Dataset.scala:223)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$numericColumns$2.apply(Dataset.scala:222)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22438) OutOfMemoryError on very small data sets

2017-11-03 Thread Morten Hornbech (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237553#comment-16237553
 ] 

Morten Hornbech commented on SPARK-22438:
-

I honestly can't see whether those are duplicates. I have seen these OOM errors 
occur in a lot of very different situations, and I am not an expert on Spark 
Core. Since the linked issue is resolved you could run my 10 lines of code on 
your 2.3.0 development build to know for sure.

> OutOfMemoryError on very small data sets
> 
>
> Key: SPARK-22438
> URL: https://issues.apache.org/jira/browse/SPARK-22438
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Morten Hornbech
>Priority: Critical
>
> We have a customer that uses Spark as an engine for running SQL on a 
> collection of small datasets, typically no greater than a few thousand rows. 
> Recently we started observing out-of-memory errors on some new workloads. 
> Even though the datasets were only a few kilobytes, the job would almost 
> immediately spike to > 10GB of memory usage, producing an out-of-memory error 
> on the modest hardware (2 CPUs, 16 RAM) that is used. Using larger hardware 
> and allocating more memory to Spark (4 CPUs, 32 RAM) made the job complete, 
> but still with unreasonably high memory usage.
> The query involved was a left join on two datasets. In some, but not all, 
> cases we were able to remove or reduce the problem by rewriting the query to 
> use an exists sub-select instead. After a lot of debugging we were able to 
> reproduce the problem locally with the following test:
> {code:java}
> case class Data(value: String)
> val session = SparkSession.builder.master("local[1]").getOrCreate()
> import session.implicits._
> val foo = session.createDataset((1 to 500).map(i => Data(i.toString)))
> val bar = session.createDataset((1 to 1).map(i => Data(i.toString)))
> foo.persist(StorageLevel.MEMORY_ONLY)
> foo.createTempView("foo")
> bar.createTempView("bar")
> val result = session.sql("select * from bar left join foo on bar.value = 
> foo.value")
> result.coalesce(2).collect()
> {code}
> Running this produces the error below:
> {code:java}
> java.lang.OutOfMemoryError: Unable to acquire 28 bytes of memory, got 0
>at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:127)
>at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:372)
>at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
>at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
>at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>at 
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
>at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoinExec.scala:774)
>at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.(SortMergeJoinExec.scala:649)
>at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:198)
>at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:136)
>at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
>at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>at 
> org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:100)
>at 
> org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:99)
>at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
>at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
>at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>at 

[jira] [Assigned] (SPARK-22407) Add rdd id column on storage page to speed up navigating

2017-11-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-22407:
-

Assignee: zhoukang

> Add rdd id column on storage page to speed up navigating
> 
>
> Key: SPARK-22407
> URL: https://issues.apache.org/jira/browse/SPARK-22407
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: zhoukang
>Assignee: zhoukang
> Fix For: 2.3.0
>
> Attachments: add-rddid.png, rdd-cache.png
>
>
> We can add rdd id column on storage page to speed up navigating when many 
> rdds are cached.
> Example after adding as below:
> !add-rddid.png!
> !rdd-cache.png!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22407) Add rdd id column on storage page to speed up navigating

2017-11-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22407.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19625
[https://github.com/apache/spark/pull/19625]

> Add rdd id column on storage page to speed up navigating
> 
>
> Key: SPARK-22407
> URL: https://issues.apache.org/jira/browse/SPARK-22407
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: zhoukang
> Fix For: 2.3.0
>
> Attachments: add-rddid.png, rdd-cache.png
>
>
> We can add rdd id column on storage page to speed up navigating when many 
> rdds are cached.
> Example after adding as below:
> !add-rddid.png!
> !rdd-cache.png!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22436) New function strip() to remove all whitespace from string

2017-11-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237530#comment-16237530
 ] 

Sean Owen commented on SPARK-22436:
---

Yes, but that's true of any Python UDF, and not every one can be built into 
Spark. These functions you cite are there for SQL support. I don't think the 
idea is to expand into all variants.

> New function strip() to remove all whitespace from string
> -
>
> Key: SPARK-22436
> URL: https://issues.apache.org/jira/browse/SPARK-22436
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.2.0
>Reporter: Andreas Maier
>Priority: Minor
>  Labels: features
>
> Since ticket SPARK-17299 the [trim() 
> function|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.trim]
>  will not remove any whitespace characters from beginning and end of a string 
> but only spaces. This is correct in regard to the SQL standard, but it opens 
> a gap in functionality. 
> My suggestion is to add to the Spark API, in analogy to Python's standard 
> library, the functions l/r/strip(), which should remove all whitespace 
> characters from the beginning and/or end of a string respectively. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22438) OutOfMemoryError on very small data sets

2017-11-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22438.
---
Resolution: Duplicate

Have a look through JIRA first. This looks like 
https://issues.apache.org/jira/browse/SPARK-21033

> OutOfMemoryError on very small data sets
> 
>
> Key: SPARK-22438
> URL: https://issues.apache.org/jira/browse/SPARK-22438
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Morten Hornbech
>Priority: Critical
>
> We have a customer that uses Spark as an engine for running SQL on a 
> collection of small datasets, typically no greater than a few thousand rows. 
> Recently we started observing out-of-memory errors on some new workloads. 
> Even though the datasets were only a few kilobytes, the job would almost 
> immediately spike to > 10GB of memory usage, producing an out-of-memory error 
> on the modest hardware (2 CPUs, 16 RAM) that is used. Using larger hardware 
> and allocating more memory to Spark (4 CPUs, 32 RAM) made the job complete, 
> but still with unreasonably high memory usage.
> The query involved was a left join on two datasets. In some, but not all, 
> cases we were able to remove or reduce the problem by rewriting the query to 
> use an exists sub-select instead. After a lot of debugging we were able to 
> reproduce the problem locally with the following test:
> {code:java}
> case class Data(value: String)
> val session = SparkSession.builder.master("local[1]").getOrCreate()
> import session.implicits._
> val foo = session.createDataset((1 to 500).map(i => Data(i.toString)))
> val bar = session.createDataset((1 to 1).map(i => Data(i.toString)))
> foo.persist(StorageLevel.MEMORY_ONLY)
> foo.createTempView("foo")
> bar.createTempView("bar")
> val result = session.sql("select * from bar left join foo on bar.value = 
> foo.value")
> result.coalesce(2).collect()
> {code}
> Running this produces the error below:
> {code:java}
> java.lang.OutOfMemoryError: Unable to acquire 28 bytes of memory, got 0
>at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:127)
>at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:372)
>at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
>at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
>at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>at 
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
>at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoinExec.scala:774)
>at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.(SortMergeJoinExec.scala:649)
>at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:198)
>at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:136)
>at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
>at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>at 
> org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:100)
>at 
> org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:99)
>at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
>at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
>at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>at 

[jira] [Updated] (SPARK-22439) Not able to get numeric columns for the file having decimal values

2017-11-03 Thread Navya Krishnappa (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Navya Krishnappa updated SPARK-22439:
-
Summary: Not able to get numeric columns for the file having decimal values 
 (was: Not able to get numeric columns for the attached file)

> Not able to get numeric columns for the file having decimal values
> --
>
> Key: SPARK-22439
> URL: https://issues.apache.org/jira/browse/SPARK-22439
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.2.0
>Reporter: Navya Krishnappa
>Priority: Major
>
> When reading the below-mentioned decimal value by specifying header as true.
> SourceFile: 
> 8.95977565356765764E+20
> 8.95977565356765764E+20
> 8.95977565356765764E+20
> Source code1:
> Dataset dataset = getSqlContext().read()
> .option(PARSER_LIB, "commons")
> .option(INFER_SCHEMA, "true")
> .option(HEADER, "true")
> .option(DELIMITER, ",")
> .option(QUOTE, "\"")
> .option(ESCAPE, "
> ")
> .option(MODE, Mode.PERMISSIVE)
> .csv(sourceFile);
> dataset.numericColumns()
> Result: 
> Caused by: java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$numericColumns$2.apply(Dataset.scala:223)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$numericColumns$2.apply(Dataset.scala:222)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22439) Not able to get numeric columns for the attached file

2017-11-03 Thread Navya Krishnappa (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Navya Krishnappa updated SPARK-22439:
-
Summary: Not able to get numeric columns for the attached file  (was: Not 
able to get numeric column for the attached file)

> Not able to get numeric columns for the attached file
> -
>
> Key: SPARK-22439
> URL: https://issues.apache.org/jira/browse/SPARK-22439
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.2.0
>Reporter: Navya Krishnappa
>Priority: Major
>
> When reading the below-mentioned decimal value by specifying header as true.
> SourceFile: 
> 8.95977565356765764E+20
> 8.95977565356765764E+20
> 8.95977565356765764E+20
> Source code1:
> Dataset dataset = getSqlContext().read()
> .option(PARSER_LIB, "commons")
> .option(INFER_SCHEMA, "true")
> .option(HEADER, "true")
> .option(DELIMITER, ",")
> .option(QUOTE, "\"")
> .option(ESCAPE, "
> ")
> .option(MODE, Mode.PERMISSIVE)
> .csv(sourceFile);
> dataset.numericColumns()
> Result: 
> Caused by: java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$numericColumns$2.apply(Dataset.scala:223)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$numericColumns$2.apply(Dataset.scala:222)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22439) Not able to get numeric column for the attached file

2017-11-03 Thread Navya Krishnappa (JIRA)
Navya Krishnappa created SPARK-22439:


 Summary: Not able to get numeric column for the attached file
 Key: SPARK-22439
 URL: https://issues.apache.org/jira/browse/SPARK-22439
 Project: Spark
  Issue Type: Bug
  Components: Java API, SQL
Affects Versions: 2.2.0
Reporter: Navya Krishnappa
Priority: Major


When reading the below-mentioned decimal value by specifying header as true.

SourceFile: 
8.95977565356765764E+20
8.95977565356765764E+20
8.95977565356765764E+20

Source code1:
Dataset dataset = getSqlContext().read()
.option(PARSER_LIB, "commons")
.option(INFER_SCHEMA, "true")
.option(HEADER, "true")
.option(DELIMITER, ",")
.option(QUOTE, "\"")
.option(ESCAPE, "
")
.option(MODE, Mode.PERMISSIVE)
.csv(sourceFile);

dataset.numericColumns()

Result: 
Caused by: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at 
org.apache.spark.sql.Dataset$$anonfun$numericColumns$2.apply(Dataset.scala:223)
at 
org.apache.spark.sql.Dataset$$anonfun$numericColumns$2.apply(Dataset.scala:222)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22438) OutOfMemoryError on very small data sets

2017-11-03 Thread Morten Hornbech (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Morten Hornbech updated SPARK-22438:

Description: 
We have a customer that uses Spark as an engine for running SQL on a collection 
of small datasets, typically no greater than a few thousand rows. Recently we 
started observing out-of-memory errors on some new workloads. Even though the 
datasets were only a few kilobytes, the job would almost immediately spike to > 
10GB of memory usage, producing an out-of-memory error on the modest hardware 
(2 CPUs, 16 RAM) that is used. Using larger hardware and allocating more memory 
to Spark (4 CPUs, 32 RAM) made the job complete, but still with unreasonably 
high memory usage.

The query involved was a left join on two datasets. In some, but not all, cases 
we were able to remove or reduce the problem by rewriting the query to use an 
exists sub-select instead. After a lot of debugging we were able to reproduce 
the problem locally with the following test:

{code:java}
case class Data(value: String)

val session = SparkSession.builder.master("local[1]").getOrCreate()
import session.implicits._

val foo = session.createDataset((1 to 500).map(i => Data(i.toString)))
val bar = session.createDataset((1 to 1).map(i => Data(i.toString)))

foo.persist(StorageLevel.MEMORY_ONLY)

foo.createTempView("foo")
bar.createTempView("bar")

val result = session.sql("select * from bar left join foo on bar.value = 
foo.value")
result.coalesce(2).collect()
{code}

Running this produces the error below:

{code:java}
java.lang.OutOfMemoryError: Unable to acquire 28 bytes of memory, got 0
   at 
org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:127)
   at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:372)
   at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
   at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
 Source)
   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
   at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
   at 
org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
   at 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoinExec.scala:774)
   at 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.(SortMergeJoinExec.scala:649)
   at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:198)
   at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:136)
   at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
   at 
org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:100)
   at 
org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:99)
   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
   at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
   at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
   at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
   at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
   at org.apache.spark.scheduler.Task.run(Task.scala:108)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
{code}

The exact failure point varies with the number of threads given to Spark, the 
"coalesce" value and the number of rows in "foo". Using an inner join, removing 
the call to persist, or removing the call to coalesce (or using repartition 
instead) will each independently make the error go away.
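
For example, a variant of the repro above that swaps coalesce for repartition 
(one of the workarounds listed) completes on the same data -- a sketch of the 
workaround only, not a fix for the underlying issue:

{code:java}
// Same datasets and temp views as in the repro above; only the last step changes.
val result = session.sql(
  "select * from bar left join foo on bar.value = foo.value")

// repartition shuffles into 2 partitions instead of narrowing with coalesce,
// which avoids the memory spike observed above.
result.repartition(2).collect()
{code}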

The reason persist and 

[jira] [Created] (SPARK-22438) OutOfMemoryError on very small data sets

2017-11-03 Thread Morten Hornbech (JIRA)
Morten Hornbech created SPARK-22438:
---

 Summary: OutOfMemoryError on very small data sets
 Key: SPARK-22438
 URL: https://issues.apache.org/jira/browse/SPARK-22438
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Morten Hornbech
Priority: Critical


We have a customer that uses Spark as an engine for running SQL on a collection 
of small datasets, typically no greater than a few thousand rows. Recently we 
started observing out-of-memory errors on some new workloads. Even though the 
datasets were only a few kilobytes, the job would almost immediately spike to > 
10GB of memory usage, producing an out-of-memory error on the modest hardware 
(2 CPUs, 16 RAM) that is used. Using larger hardware and allocating more memory 
to Spark (4 CPUs, 32 RAM) made the job complete, but still with unreasonably 
high memory usage.

The query involved was a left join on two datasets. In some, but not all, cases 
we were able to remove or reduce the problem by rewriting the query to use an 
exists sub-select instead. After a lot of debugging we were able to reproduce 
the problem locally with the following test:

{code:java}
case class Data(value: String)

val session = SparkSession.builder.master("local[1]").getOrCreate()
import session.implicits._

val foo = session.createDataset((1 to 500).map(i => Data(i.toString)))
val bar = session.createDataset((1 to 1).map(i => Data(i.toString)))

foo.persist(StorageLevel.MEMORY_ONLY)

foo.createTempView("foo")
bar.createTempView("bar")

val result = session.sql("select * from bar left join foo on bar.value = 
foo.value")
result.coalesce(2).collect()
{code}

Running this produces the error below:

{code:java}
java.lang.OutOfMemoryError: Unable to acquire 28 bytes of memory, got 0
   at 
org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:127)
   at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:372)
   at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
   at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
 Source)
   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
   at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
   at 
org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
   at 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoinExec.scala:774)
   at 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.(SortMergeJoinExec.scala:649)
   at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:198)
   at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:136)
   at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
   at 
org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:100)
   at 
org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:99)
   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
   at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
   at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
   at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
   at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
   at org.apache.spark.scheduler.Task.run(Task.scala:108)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
{code}

The exact failure point varies with the number of threads given to spark, the 
"coalesce" value and the number of rows 

[jira] [Created] (SPARK-22437) jdbc write fails to set default mode

2017-11-03 Thread Adrian Bridgett (JIRA)
Adrian Bridgett created SPARK-22437:
---

 Summary: jdbc write fails to set default mode
 Key: SPARK-22437
 URL: https://issues.apache.org/jira/browse/SPARK-22437
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.2.0
 Environment: python3
Reporter: Adrian Bridgett


With this bit of code:
{code}
df.write.jdbc(jdbc_url, table)
{code}

We see this error:
{code}
09:54:22 2017-11-03 09:54:19,985 INFO   File 
"/opt/spark220/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 820, in 
jdbc
09:54:22 2017-11-03 09:54:19,985 INFO   File 
"/opt/spark220/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, 
in __call__
09:54:22 2017-11-03 09:54:19,986 INFO   File 
"/opt/spark220/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
09:54:22 2017-11-03 09:54:19,986 INFO   File 
"/opt/spark220/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in 
get_return_value
09:54:22 2017-11-03 09:54:19,987 INFO py4j.protocol.Py4JJavaError: An error 
occurred while calling o106.jdbc.
09:54:22 2017-11-03 09:54:19,987 INFO : scala.MatchError: null
09:54:22 2017-11-03 09:54:19,987 INFO   at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:62)
09:54:22 2017-11-03 09:54:19,987 INFO   at 
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472)
09:54:22 2017-11-03 09:54:19,987 INFO   at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
...
{code}

This seems to be because "mode" isn't correctly picking up the "error" default as 
listed in 
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame

Note that if it was erroring because of existing data it'd say "SaveMode: 
ErrorIfExists." as seen in 
./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala).

Changing our code to this worked:
{code}
df.write.jdbc(jdbc_url, table, mode='append')
{code}
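
For reference, the same workaround expressed through the Scala writer API is 
just an explicit SaveMode; a minimal sketch (the url, table name and connection 
properties below are placeholders):

{code:java}
import java.util.Properties
import org.apache.spark.sql.SaveMode

val props = new Properties()   // add user/password/driver as needed

// Setting the mode explicitly avoids relying on the unset default that
// triggers the scala.MatchError above; SaveMode.ErrorIfExists is the
// documented default behaviour.
df.write.mode(SaveMode.Append).jdbc(jdbcUrl, "target_table", props)
{code}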



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22436) New function strip() to remove all whitespace from string

2017-11-03 Thread Andreas Maier (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237446#comment-16237446
 ] 

Andreas Maier commented on SPARK-22436:
---

Python UDFs are very slow, aren't they? I believe a Spark native function would 
be much faster. And in fact it was already available with trim() before 
SPARK-17299.
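
A JVM-side alternative that avoids Python UDF overhead is the existing 
regexp_replace function; a rough sketch (the column name is made up, and 
treating Java's \s class as "whitespace" is an assumption):

{code:java}
import org.apache.spark.sql.functions.{col, regexp_replace}

// Strip leading and trailing whitespace (spaces, tabs, newlines, ...)
// without a Python UDF, entirely on the JVM side.
val stripped = df.withColumn(
  "value_stripped",
  regexp_replace(col("value"), "^\\s+|\\s+$", ""))
{code}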

> New function strip() to remove all whitespace from string
> -
>
> Key: SPARK-22436
> URL: https://issues.apache.org/jira/browse/SPARK-22436
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.2.0
>Reporter: Andreas Maier
>Priority: Minor
>  Labels: features
>
> Since ticket SPARK-17299 the [trim() 
> function|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.trim]
>  will not remove any whitespace characters from beginning and end of a string 
> but only spaces. This is correct in regard to the SQL standard, but it opens 
> a gap in functionality. 
> My suggestion is to add to the Spark API, in analogy to Python's standard 
> library, the functions l/r/strip(), which should remove all whitespace 
> characters from the beginning and/or end of a string respectively. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22418) Add test cases for NULL Handling

2017-11-03 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237442#comment-16237442
 ] 

Marco Gaido commented on SPARK-22418:
-

can I work on this?

> Add test cases for NULL Handling
> 
>
> Key: SPARK-22418
> URL: https://issues.apache.org/jira/browse/SPARK-22418
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Normal
>
> Add test cases mentioned in the link https://sqlite.org/nulls.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22436) New function strip() to remove all whitespace from string

2017-11-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237420#comment-16237420
 ] 

Sean Owen commented on SPARK-22436:
---

Why does it need to be in Spark as opposed to a simple UDF?

> New function strip() to remove all whitespace from string
> -
>
> Key: SPARK-22436
> URL: https://issues.apache.org/jira/browse/SPARK-22436
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.2.0
>Reporter: Andreas Maier
>Priority: Minor
>  Labels: features
>
> Since ticket SPARK-17299 the [trim() 
> function|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.trim]
>  will not remove any whitespace characters from beginning and end of a string 
> but only spaces. This is correct in regard to the SQL standard, but it opens 
> a gap in functionality. 
> My suggestion is to add to the Spark API, in analogy to Python's standard 
> library, the functions l/r/strip(), which should remove all whitespace 
> characters from the beginning and/or end of a string respectively. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22435) Support processing array and map type using script

2017-11-03 Thread jin xing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jin xing updated SPARK-22435:
-
Priority: Major  (was: Critical)

> Support processing array and map type using script
> --
>
> Key: SPARK-22435
> URL: https://issues.apache.org/jira/browse/SPARK-22435
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: jin xing
>Priority: Major
>
> Currently, it is not supported to use a script (e.g. Python) to process array 
> type or map type; it will complain with the message below:
> {{org.apache.spark.sql.catalyst.expressions.UnsafeArrayData cannot be cast to 
> [Ljava.lang.Object}}
> {{org.apache.spark.sql.catalyst.expressions.UnsafeMapData cannot be cast to 
> java.util.Map}}
> Could we support it?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22436) New function strip() to remove all whitespace from string

2017-11-03 Thread Andreas Maier (JIRA)
Andreas Maier created SPARK-22436:
-

 Summary: New function strip() to remove all whitespace from string
 Key: SPARK-22436
 URL: https://issues.apache.org/jira/browse/SPARK-22436
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Affects Versions: 2.2.0
Reporter: Andreas Maier
Priority: Minor


Since ticket SPARK-17299 the [trim() 
function|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.trim]
 will not remove any whitespace characters from the beginning and end of a string 
but only spaces. This is correct with regard to the SQL standard, but it opens a 
gap in functionality. 

My suggestion is to add to the Spark API, in analogy to Python's standard library, 
the functions l/r/strip(), which should remove all whitespace characters from the 
beginning and/or end of a string respectively. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22435) Support processing array and map type using script

2017-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22435:


Assignee: Apache Spark

> Support processing array and map type using script
> --
>
> Key: SPARK-22435
> URL: https://issues.apache.org/jira/browse/SPARK-22435
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: jin xing
>Assignee: Apache Spark
>Priority: Critical
>
> Currently, it is not supported to use a script (e.g. Python) to process array 
> type or map type; it will complain with the message below:
> {{org.apache.spark.sql.catalyst.expressions.UnsafeArrayData cannot be cast to 
> [Ljava.lang.Object}}
> {{org.apache.spark.sql.catalyst.expressions.UnsafeMapData cannot be cast to 
> java.util.Map}}
> Could we support it?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22435) Support processing array and map type using script

2017-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237407#comment-16237407
 ] 

Apache Spark commented on SPARK-22435:
--

User 'jinxing64' has created a pull request for this issue:
https://github.com/apache/spark/pull/19652

> Support processing array and map type using script
> --
>
> Key: SPARK-22435
> URL: https://issues.apache.org/jira/browse/SPARK-22435
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: jin xing
>Priority: Critical
>
> Currently, it is not supported to use a script (e.g. Python) to process array 
> type or map type; it will complain with the message below:
> {{org.apache.spark.sql.catalyst.expressions.UnsafeArrayData cannot be cast to 
> [Ljava.lang.Object}}
> {{org.apache.spark.sql.catalyst.expressions.UnsafeMapData cannot be cast to 
> java.util.Map}}
> Could we support it?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22435) Support processing array and map type using script

2017-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22435:


Assignee: (was: Apache Spark)

> Support processing array and map type using script
> --
>
> Key: SPARK-22435
> URL: https://issues.apache.org/jira/browse/SPARK-22435
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: jin xing
>Priority: Critical
>
> Currently, it is not supported to use a script (e.g. Python) to process array 
> type or map type; it will complain with the message below:
> {{org.apache.spark.sql.catalyst.expressions.UnsafeArrayData cannot be cast to 
> [Ljava.lang.Object}}
> {{org.apache.spark.sql.catalyst.expressions.UnsafeMapData cannot be cast to 
> java.util.Map}}
> Could we support it?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22435) Support processing array and map type using script

2017-11-03 Thread jin xing (JIRA)
jin xing created SPARK-22435:


 Summary: Support processing array and map type using script
 Key: SPARK-22435
 URL: https://issues.apache.org/jira/browse/SPARK-22435
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: jin xing
Priority: Critical


Currently, it is not supported to use a script (e.g. Python) to process array type 
or map type; it will complain with the message below:
{{org.apache.spark.sql.catalyst.expressions.UnsafeArrayData cannot be cast to 
[Ljava.lang.Object}}
{{org.apache.spark.sql.catalyst.expressions.UnsafeMapData cannot be cast to 
java.util.Map}}
Could we support it?
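
For context, a minimal sketch of the kind of query involved (table, column and 
script names are made up; this requires Hive support and currently fails with 
the cast errors above when the transformed column is an array or map):

{code:java}
// Hive-style script transform over an array column; today this fails with
// "UnsafeArrayData cannot be cast to [Ljava.lang.Object" as quoted above.
spark.sql(
  """SELECT TRANSFORM(arr_col)
    |USING 'python my_script.py'
    |AS (out STRING)
    |FROM some_table""".stripMargin).show()
{code}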



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22434) Spark structured streaming with HBase

2017-11-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22434.
---
   Resolution: Invalid
Fix Version/s: (was: 2.0.3)

Questions are for the mailing list

> Spark structured streaming with HBase
> -
>
> Key: SPARK-22434
> URL: https://issues.apache.org/jira/browse/SPARK-22434
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 2.1.2
>Reporter: Harendra Singh
>Priority: Blocker
>   Original Estimate: 2m
>  Remaining Estimate: 2m
>
> Hi Team,
> We are doing streaming on Kafka data which is being collected from MySQL.
> Now once all the analytics has been done I want to save my data directly to 
> HBase.
> I have gone through the Spark Structured Streaming documentation but couldn't 
> find any sink for HBase.
> 
> Code Snip which i used to read the data from Kafka is below.
> ==
> val records = spark.readStream.format("kafka").option("subscribe", 
> "kaapociot").option("kafka.bootstrap.servers", 
> "XX.XX.XX.XX:6667").option("startingOffsets", "earliest").load
> val jsonschema = StructType(Seq(StructField("header", StringType, 
> true),StructField("event", StringType, true)))
> val uschema = StructType(Seq(
>  StructField("MeterNumber", StringType, true),
> StructField("Utility", StringType, true),
> StructField("VendorServiceNumber", StringType, true),
> StructField("VendorName", StringType, true),
> StructField("SiteNumber",  StringType, true),
> StructField("SiteName", StringType, true),
> StructField("Location", StringType, true),
> StructField("timestamp", LongType, true),
> StructField("power", DoubleType, true)
> ))
> val DF_Hbase = records.selectExpr("cast (value as string) as 
> json").select(from_json($"json",schema=jsonschema).as("data")).select("data.event").select(from_json($"event",
>  uschema).as("mykafkadata")).select("mykafkadata.*")
> ===
> Now finally, I want to save the DF_Hbase dataframe in HBase.
> Please help me out
> Thanks 
> Harendra Singh



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22434) Spark structured streaming with HBase

2017-11-03 Thread Harendra Singh (JIRA)
Harendra Singh created SPARK-22434:
--

 Summary: Spark structured streaming with HBase
 Key: SPARK-22434
 URL: https://issues.apache.org/jira/browse/SPARK-22434
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 2.1.2
Reporter: Harendra Singh
Priority: Blocker
 Fix For: 2.0.3


Hi Team,

We are doing streaming on Kafka data which is being collected from MySQL.

Now once all the analytics has been done I want to save my data directly to 
HBase.

I have gone through the Spark Structured Streaming documentation but couldn't 
find any sink for HBase.

Code Snip which i used to read the data from Kafka is below.
==

val records = spark.readStream.format("kafka").option("subscribe", 
"kaapociot").option("kafka.bootstrap.servers", 
"XX.XX.XX.XX:6667").option("startingOffsets", "earliest").load

val jsonschema = StructType(Seq(StructField("header", StringType, 
true),StructField("event", StringType, true)))

val uschema = StructType(Seq(
 StructField("MeterNumber", StringType, true),
StructField("Utility", StringType, true),
StructField("VendorServiceNumber", StringType, true),
StructField("VendorName", StringType, true),
StructField("SiteNumber",  StringType, true),
StructField("SiteName", StringType, true),
StructField("Location", StringType, true),
StructField("timestamp", LongType, true),
StructField("power", DoubleType, true)
))
val DF_Hbase = records.selectExpr("cast (value as string) as 
json").select(from_json($"json",schema=jsonschema).as("data")).select("data.event").select(from_json($"event",
 uschema).as("mykafkadata")).select("mykafkadata.*")
===
Now finally, I want to save the DF_Hbase dataframe in HBase.

Please help me out

Thanks 
Harendra Singh
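
One generic way to wire a custom sink in Spark 2.x is the ForeachWriter API; a 
bare skeleton is sketched below (the HBase client calls are only indicated in 
comments because they depend on whichever connector is used, and the checkpoint 
path is a placeholder):

{code:java}
import org.apache.spark.sql.{ForeachWriter, Row}

val hbaseWriter = new ForeachWriter[Row] {
  def open(partitionId: Long, version: Long): Boolean = {
    // e.g. obtain an HBase connection / BufferedMutator here
    true
  }
  def process(value: Row): Unit = {
    // e.g. convert `value` to a Put and write it to the target table
  }
  def close(errorOrNull: Throwable): Unit = {
    // e.g. flush and release the connection
  }
}

val query = DF_Hbase.writeStream
  .foreach(hbaseWriter)
  .option("checkpointLocation", "/tmp/hbase-sink-checkpoint")   // placeholder path
  .start()
{code}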



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22420) Spark SQL return invalid json string for struct with date/datetime field

2017-11-03 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237368#comment-16237368
 ] 

Marco Gaido commented on SPARK-22420:
-

I think this is related and will be resolved by SPARK-20202
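
Until that lands, one possible workaround sketch is to serialize the struct 
with to_json, which goes through the JSON writer and quotes timestamp values 
(the exact output format below is an assumption, not verified against 
HiveServer2):

{code:java}
import org.apache.spark.sql.functions.{current_timestamp, struct, to_json}

// to_json emits the timestamp as a quoted string, unlike the bare struct
// rendering shown in the report.
spark.range(1)
  .select(to_json(struct(current_timestamp().as("b"))).as("json"))
  .show(false)
{code}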

> Spark SQL return invalid json string for struct with date/datetime field
> 
>
> Key: SPARK-22420
> URL: https://issues.apache.org/jira/browse/SPARK-22420
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: pin_zhang
>Priority: Normal
>
> Run SQL with a JDBC client against Spark HiveServer2:
> select  named_struct ( 'b',current_timestamp) from test;
> +---+--+
> | named_struct(b, current_timestamp())  |
> +---+--+
> | {"b":2017-11-01 23:18:40.988} |
> The JSON string is invalid; the date/time value should be quoted.
> If the same SQL is run in Apache HiveServer2, the expected JSON string is returned:
> select  named_struct ( 'b',current_timestamp) from dummy_table ;
> +--+--+
> |   _c0|
> +--+--+
> | {"b":"2017-11-01 23:21:24.168"}  |
> +--+--+
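
Until that lands, a possible workaround is to serialize the struct explicitly with 
to_json, which quotes timestamp values itself. A minimal sketch in the Scala Dataset 
API (the session and column names are assumptions, and the exact timestamp format in 
the output may differ):

{code}
import org.apache.spark.sql.functions.{current_timestamp, struct, to_json}

// Build the struct column and let to_json handle quoting of the timestamp value.
val df = spark.range(1)
  .select(to_json(struct(current_timestamp().as("b"))).as("json"))

df.show(false)  // e.g. {"b":"2017-11-01T23:18:40.988-07:00"} -- value is quoted
{code}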



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21668) Ability to run driver programs within a container

2017-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237329#comment-16237329
 ] 

Apache Spark commented on SPARK-21668:
--

User 'tashoyan' has created a pull request for this issue:
https://github.com/apache/spark/pull/18885

> Ability to run driver programs within a container
> -
>
> Key: SPARK-21668
> URL: https://issues.apache.org/jira/browse/SPARK-21668
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Arseniy Tashoyan
>Priority: Minor
>  Labels: containers, docker, driver, spark-submit, standalone
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> When a driver program in client mode runs in a Docker container, it binds to 
> the IP address of the container, not the host machine. This container IP 
> address is accessible only within the host machine; it is inaccessible to the 
> master and worker nodes.
> For example, the host machine has IP address 192.168.216.10. When Docker 
> starts a container, it places it in a special bridged network and 
> assigns it an IP address like 172.17.0.2. None of the Spark nodes on the 
> 192.168.216.0 network can access the bridged network that contains the container. 
> Therefore, the driver program is not able to communicate with the Spark 
> cluster.
> Spark already provides the SPARK_PUBLIC_DNS environment variable for this 
> purpose. However, in this scenario setting SPARK_PUBLIC_DNS to the host 
> machine IP address does not work.
> Topic on StackOverflow: 
> [https://stackoverflow.com/questions/45489248/running-spark-driver-program-in-docker-container-no-connection-back-from-execu]
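
A configuration-level workaround that is sometimes used for this topology is to bind 
inside the container but advertise the host's address, with the driver ports fixed so 
Docker can publish them. The sketch below is only an illustration; the master URL, 
addresses, and port numbers are assumptions based on the example above.

{code}
import org.apache.spark.sql.SparkSession

// Listen on all interfaces inside the container, advertise the host IP to the
// cluster, and pin the driver ports so Docker can publish them
// (e.g. docker run -p 35000:35000 -p 35001:35001 ...).
val spark = SparkSession.builder()
  .appName("driver-in-container")
  .master("spark://192.168.216.5:7077")           // assumed master URL
  .config("spark.driver.bindAddress", "0.0.0.0")  // bind address inside the container
  .config("spark.driver.host", "192.168.216.10")  // host machine IP, advertised to executors
  .config("spark.driver.port", "35000")
  .config("spark.blockManager.port", "35001")
  .getOrCreate()
{code}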



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22429) Streaming checkpointing code does not retry after failure due to NullPointerException

2017-11-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237282#comment-16237282
 ] 

Sean Owen commented on SPARK-22429:
---

[~tmgstev] it will have to be vs. master. Master and all branches seem fine 
though: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/ What 
do you see?

> Streaming checkpointing code does not retry after failure due to 
> NullPointerException
> -
>
> Key: SPARK-22429
> URL: https://issues.apache.org/jira/browse/SPARK-22429
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.3, 2.2.0
>Reporter: Tristan Stevens
>
> CheckpointWriteHandler has a built-in retry mechanism. However, 
> SPARK-14930/SPARK-13693 put in a fix to de-allocate the fs object, yet 
> initialises it in the wrong place relative to the while loop, and therefore on 
> attempt 2 it fails with an NPE.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22430) Unknown tag warnings when building R docs with Roxygen 6.0.1

2017-11-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237268#comment-16237268
 ] 

Sean Owen commented on SPARK-22430:
---

I see it too, because I have Roxygen 6.0.1 locally. [~felixcheung] have you 
seen this? I wonder if we could all use 6.0.1, but that's separate.
Do you know of a fix?

> Unknown tag warnings when building R docs with Roxygen 6.0.1
> 
>
> Key: SPARK-22430
> URL: https://issues.apache.org/jira/browse/SPARK-22430
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
> Environment: Roxygen 6.0.1
>Reporter: Joel Croteau
>
> When building R docs using create-rd.sh with Roxygen 6.0.1, a large number of 
> unknown tag warnings are generated:
> {noformat}
> Warning: @export [schema.R#33]: unknown tag
> Warning: @export [schema.R#53]: unknown tag
> Warning: @export [schema.R#63]: unknown tag
> Warning: @export [schema.R#80]: unknown tag
> Warning: @export [schema.R#123]: unknown tag
> Warning: @export [schema.R#141]: unknown tag
> Warning: @export [schema.R#216]: unknown tag
> Warning: @export [generics.R#388]: unknown tag
> Warning: @export [generics.R#403]: unknown tag
> Warning: @export [generics.R#407]: unknown tag
> Warning: @export [generics.R#414]: unknown tag
> Warning: @export [generics.R#418]: unknown tag
> Warning: @export [generics.R#422]: unknown tag
> Warning: @export [generics.R#428]: unknown tag
> Warning: @export [generics.R#432]: unknown tag
> Warning: @export [generics.R#438]: unknown tag
> Warning: @export [generics.R#442]: unknown tag
> Warning: @export [generics.R#446]: unknown tag
> Warning: @export [generics.R#450]: unknown tag
> Warning: @export [generics.R#454]: unknown tag
> Warning: @export [generics.R#459]: unknown tag
> Warning: @export [generics.R#467]: unknown tag
> Warning: @export [generics.R#475]: unknown tag
> Warning: @export [generics.R#479]: unknown tag
> Warning: @export [generics.R#483]: unknown tag
> Warning: @export [generics.R#487]: unknown tag
> Warning: @export [generics.R#498]: unknown tag
> Warning: @export [generics.R#502]: unknown tag
> Warning: @export [generics.R#506]: unknown tag
> Warning: @export [generics.R#512]: unknown tag
> Warning: @export [generics.R#518]: unknown tag
> Warning: @export [generics.R#526]: unknown tag
> Warning: @export [generics.R#530]: unknown tag
> Warning: @export [generics.R#534]: unknown tag
> Warning: @export [generics.R#538]: unknown tag
> Warning: @export [generics.R#542]: unknown tag
> Warning: @export [generics.R#549]: unknown tag
> Warning: @export [generics.R#556]: unknown tag
> Warning: @export [generics.R#560]: unknown tag
> Warning: @export [generics.R#567]: unknown tag
> Warning: @export [generics.R#571]: unknown tag
> Warning: @export [generics.R#575]: unknown tag
> Warning: @export [generics.R#579]: unknown tag
> Warning: @export [generics.R#583]: unknown tag
> Warning: @export [generics.R#587]: unknown tag
> Warning: @export [generics.R#591]: unknown tag
> Warning: @export [generics.R#595]: unknown tag
> Warning: @export [generics.R#599]: unknown tag
> Warning: @export [generics.R#603]: unknown tag
> Warning: @export [generics.R#607]: unknown tag
> Warning: @export [generics.R#611]: unknown tag
> Warning: @export [generics.R#615]: unknown tag
> Warning: @export [generics.R#619]: unknown tag
> Warning: @export [generics.R#623]: unknown tag
> Warning: @export [generics.R#627]: unknown tag
> Warning: @export [generics.R#631]: unknown tag
> Warning: @export [generics.R#635]: unknown tag
> Warning: @export [generics.R#639]: unknown tag
> Warning: @export [generics.R#643]: unknown tag
> Warning: @export [generics.R#647]: unknown tag
> Warning: @export [generics.R#654]: unknown tag
> Warning: @export [generics.R#658]: unknown tag
> Warning: @export [generics.R#663]: unknown tag
> Warning: @export [generics.R#667]: unknown tag
> Warning: @export [generics.R#672]: unknown tag
> Warning: @export [generics.R#676]: unknown tag
> Warning: @export [generics.R#680]: unknown tag
> Warning: @export [generics.R#684]: unknown tag
> Warning: @export [generics.R#690]: unknown tag
> Warning: @export [generics.R#696]: unknown tag
> Warning: @export [generics.R#702]: unknown tag
> Warning: @export [generics.R#706]: unknown tag
> Warning: @export [generics.R#710]: unknown tag
> Warning: @export [generics.R#716]: unknown tag
> Warning: @export [generics.R#720]: unknown tag
> Warning: @export [generics.R#726]: unknown tag
> Warning: @export [generics.R#730]: unknown tag
> Warning: @export [generics.R#734]: unknown tag
> Warning: @export [generics.R#738]: unknown tag
> Warning: @export [generics.R#742]: unknown tag
> Warning: @export [generics.R#750]: unknown tag
> Warning: @export [generics.R#754]: unknown tag
> Warning: @export 

[jira] [Updated] (SPARK-22427) StackOverFlowError when using FPGrowth

2017-11-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22427:
--
Description: 
code part:
val path = jobConfig.getString("hdfspath")
val vectordata = sc.sparkContext.textFile(path)
val finaldata = sc.createDataset(vectordata.map(obj => {
  obj.split(" ")
}).filter(arr => arr.length > 0)).toDF("items")
val fpg = new FPGrowth()

fpg.setMinSupport(minSupport).setItemsCol("items").setMinConfidence(minConfidence)
val train = fpg.fit(finaldata)
print(train.freqItemsets.count())
print(train.associationRules.count())
train.save("/tmp/FPGModel")

And encountered the following exception:
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278)
at 
org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2430)
at 
org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2429)
at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2837)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2836)
at org.apache.spark.sql.Dataset.count(Dataset.scala:2429)
at DataMining.FPGrowth$.runJob(FPGrowth.scala:116)
at DataMining.testFPG$.main(FPGrowth.scala:36)
at DataMining.testFPG.main(FPGrowth.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.StackOverflowError
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:616)
at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:36)
at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:33)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
at 
com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:29)
at 
com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:27)
at 

[jira] [Commented] (SPARK-22423) Scala test source files like TestHiveSingleton.scala should be in scala source root

2017-11-03 Thread xubo245 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237176#comment-16237176
 ] 

xubo245 commented on SPARK-22423:
-

OK, I will fix it.

> Scala test source files like TestHiveSingleton.scala should be in scala 
> source root
> ---
>
> Key: SPARK-22423
> URL: https://issues.apache.org/jira/browse/SPARK-22423
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: xubo245
>Priority: Minor
>
> The TestHiveSingleton.scala file should be in the scala source directory, not in 
> the java directory.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22427) StackOverFlowError when using FPGrowth

2017-11-03 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237174#comment-16237174
 ] 

yuhao yang commented on SPARK-22427:


Could you please try to increase the stack size, e.g. with -Xss10m?
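
For reference, a minimal sketch of where such a setting could go (the -Xss10m value 
is only an example): spark.executor.extraJavaOptions can be set before the session is 
created, while for the driver in client mode the JVM is already running, so the 
equivalent option has to be passed to spark-submit itself 
(--driver-java-options "-Xss10m") or via spark-defaults.conf.

{code}
import org.apache.spark.sql.SparkSession

// Raise the thread stack size on the executors; the driver-side -Xss must be
// supplied on the spark-submit command line in client mode.
val spark = SparkSession.builder()
  .appName("FPGrowthLargerStack")
  .config("spark.executor.extraJavaOptions", "-Xss10m")
  .getOrCreate()
{code}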

> StackOverFlowError when using FPGrowth
> --
>
> Key: SPARK-22427
> URL: https://issues.apache.org/jira/browse/SPARK-22427
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.2.0
> Environment: Centos Linux 3.10.0-327.el7.x86_64
> java 1.8.0.111
> spark 2.2.0
>Reporter: lyt
>Priority: Normal
>
> code part:
> val path = jobConfig.getString("hdfspath")
> val vectordata = sc.sparkContext.textFile(path)
> val finaldata = sc.createDataset(vectordata.map(obj => {
>   obj.split(" ")
> }).filter(arr => arr.length > 0)).toDF("items")
> val fpg = new FPGrowth()
> 
> fpg.setMinSupport(minSupport).setItemsCol("items").setMinConfidence(minConfidence)
> val train = fpg.fit(finaldata)
> print(train.freqItemsets.count())
> print(train.associationRules.count())
> train.save("/tmp/FPGModel")
> And encountered the following exception:
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2430)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2429)
>   at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2837)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2836)
>   at org.apache.spark.sql.Dataset.count(Dataset.scala:2429)
>   at DataMining.FPGrowth$.runJob(FPGrowth.scala:116)
>   at DataMining.testFPG$.main(FPGrowth.scala:36)
>   at DataMining.testFPG.main(FPGrowth.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
>   at 

[jira] [Updated] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results

2017-11-03 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22211:

Target Version/s: 2.2.1, 2.3.0

> LimitPushDown optimization for FullOuterJoin generates wrong results
> 
>
> Key: SPARK-22211
> URL: https://issues.apache.org/jira/browse/SPARK-22211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: on community.cloude.databrick.com 
> Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11)
>Reporter: Benyi Wang
>Priority: Major
>
> LimitPushDown pushes LocalLimit to one side of a FullOuterJoin, but this may 
> generate a wrong result:
> Assume we use limit(1) and the LocalLimit is pushed to the left side, and id=999 
> is selected, but the right side has 100K rows including 999; the result 
> will be
> - one row is (999, 999)
> - the rest of the rows are (null, xxx)
> Once you call show(), the row (999, 999) has only a 1/10th chance of being 
> selected by CollectLimit.
> The actual optimization might be:
> - push down the limit
> - but convert the join to a Broadcast LeftOuterJoin or RightOuterJoin.
> Here is my notebook:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html
> {code:java}
> import scala.util.Random._
> val dl = shuffle(1 to 10).toDF("id")
> val dr = shuffle(1 to 10).toDF("id")
> println("data frame dl:")
> dl.explain
> println("data frame dr:")
> dr.explain
> val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1)
> j.explain
> j.show(false)
> {code}
> {code}
> data frame dl:
> == Physical Plan ==
> LocalTableScan [id#10]
> data frame dr:
> == Physical Plan ==
> LocalTableScan [id#16]
> == Physical Plan ==
> CollectLimit 1
> +- SortMergeJoin [id#10], [id#16], FullOuter
>:- *Sort [id#10 ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(id#10, 200)
>: +- *LocalLimit 1
>:+- LocalTableScan [id#10]
>+- *Sort [id#16 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(id#16, 200)
>  +- LocalTableScan [id#16]
> import scala.util.Random._
> dl: org.apache.spark.sql.DataFrame = [id: int]
> dr: org.apache.spark.sql.DataFrame = [id: int]
> j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int]
> ++---+
> |id  |id |
> ++---+
> |null|148|
> ++---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results

2017-11-03 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237133#comment-16237133
 ] 

Xiao Li commented on SPARK-22211:
-

Will submit a PR based on my previous PR 
https://github.com/apache/spark/pull/10454

> LimitPushDown optimization for FullOuterJoin generates wrong results
> 
>
> Key: SPARK-22211
> URL: https://issues.apache.org/jira/browse/SPARK-22211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: on community.cloude.databrick.com 
> Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11)
>Reporter: Benyi Wang
>Priority: Major
>
> LimitPushDown pushes LocalLimit to one side of a FullOuterJoin, but this may 
> generate a wrong result:
> Assume we use limit(1) and the LocalLimit is pushed to the left side, and id=999 
> is selected, but the right side has 100K rows including 999; the result 
> will be
> - one row is (999, 999)
> - the rest of the rows are (null, xxx)
> Once you call show(), the row (999, 999) has only a 1/10th chance of being 
> selected by CollectLimit.
> The actual optimization might be:
> - push down the limit
> - but convert the join to a Broadcast LeftOuterJoin or RightOuterJoin.
> Here is my notebook:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html
> {code:java}
> import scala.util.Random._
> val dl = shuffle(1 to 10).toDF("id")
> val dr = shuffle(1 to 10).toDF("id")
> println("data frame dl:")
> dl.explain
> println("data frame dr:")
> dr.explain
> val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1)
> j.explain
> j.show(false)
> {code}
> {code}
> data frame dl:
> == Physical Plan ==
> LocalTableScan [id#10]
> data frame dr:
> == Physical Plan ==
> LocalTableScan [id#16]
> == Physical Plan ==
> CollectLimit 1
> +- SortMergeJoin [id#10], [id#16], FullOuter
>:- *Sort [id#10 ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(id#10, 200)
>: +- *LocalLimit 1
>:+- LocalTableScan [id#10]
>+- *Sort [id#16 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(id#16, 200)
>  +- LocalTableScan [id#16]
> import scala.util.Random._
> dl: org.apache.spark.sql.DataFrame = [id: int]
> dr: org.apache.spark.sql.DataFrame = [id: int]
> j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int]
> ++---+
> |id  |id |
> ++---+
> |null|148|
> ++---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org