[jira] [Resolved] (SPARK-28047) [UI] Little bug in executorspage.js

2019-06-13 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang resolved SPARK-28047.
-
Resolution: Not A Problem

> [UI]  Little bug in executorspage.js
> 
>
> Key: SPARK-28047
> URL: https://issues.apache.org/jira/browse/SPARK-28047
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.3
>Reporter: feiwang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28048) pyspark.sql.functions.explode will abandon the row which has an empty list column when applied to the column

2019-06-13 Thread Ma Xinmin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ma Xinmin updated SPARK-28048:
--
Shepherd: Dongjoon Hyun

> pyspark.sql.functions.explode will abandon the row which has an empty list 
> column when applied to the column
> ---
>
> Key: SPARK-28048
> URL: https://issues.apache.org/jira/browse/SPARK-28048
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
>Reporter: Ma Xinmin
>Priority: Major
>
> {code}
> from pyspark.sql import Row
> from pyspark.sql.functions import explode
> eDF = spark.createDataFrame([Row(a=1, intlist=[1,2], mapfield={"a": "b"}), 
> Row(a=2, intlist=[], mapfield={"a": "b"})])
> eDF = eDF.withColumn('another', explode(eDF.intlist)).collect()
> eDF
> {code}
> The `a=2` row is missing in the output



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28048) pyspark.sql.functions.explode will abandon the row which has an empty list column when applied to the column

2019-06-13 Thread Ma Xinmin (JIRA)
Ma Xinmin created SPARK-28048:
-

 Summary: pyspark.sql.functions.explode will abandon the row which 
has an empty list column when applied to the column
 Key: SPARK-28048
 URL: https://issues.apache.org/jira/browse/SPARK-28048
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.1
Reporter: Ma Xinmin


{code}
from pyspark.sql import Row
from pyspark.sql.functions import explode
eDF = spark.createDataFrame([Row(a=1, intlist=[1,2], mapfield={"a": "b"}), 
Row(a=2, intlist=[], mapfield={"a": "b"})])
eDF = eDF.withColumn('another', explode(eDF.intlist)).collect()
eDF
{code}

The `a=2` row is missing in the output
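
For reference, this is the expected behavior of explode (an empty array produces no output rows). A minimal workaround sketch, assuming a Spark version with explode_outer available (2.3+), which keeps the a=2 row and emits null for the exploded column:

{code:python}
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import explode_outer

spark = SparkSession.builder.getOrCreate()

eDF = spark.createDataFrame([Row(a=1, intlist=[1, 2], mapfield={"a": "b"}),
                             Row(a=2, intlist=[], mapfield={"a": "b"})])

# explode_outer keeps rows whose array is empty or null; 'another' is null there.
eDF.withColumn('another', explode_outer(eDF.intlist)).show()
{code}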



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28043) Reading json with duplicate columns drops the first column value

2019-06-13 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863680#comment-16863680
 ] 

Liang-Chi Hsieh commented on SPARK-28043:
-

I looked around a bit, e.g. 
https://stackoverflow.com/questions/21832701/does-json-syntax-allow-duplicate-keys-in-an-object.

So JSON doesn't disallow duplicate keys, and Spark SQL doesn't disallow duplicate 
field names either, although duplicate field names can make a DataFrame awkward to 
work with. To be clear, just because Spark SQL allows duplicate field names doesn't 
mean we should rely on that feature. Still, I think that, to some extent, the 
current behavior isn't consistent.

{code}
scala> val jsonRDD = spark.sparkContext.parallelize(Seq("[{ \"a\": \"blah\", 
\"a\": \"blah2\"} ]"))
jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at 
parallelize at <console>:23
scala> val df = spark.read.json(jsonRDD)
df: org.apache.spark.sql.DataFrame = [a: string, a: string]
scala> df.show
+----+-----+
|   a|    a|
+----+-----+
|null|blah2|
+----+-----+
{code}

> Reading json with duplicate columns drops the first column value
> 
>
> Key: SPARK-28043
> URL: https://issues.apache.org/jira/browse/SPARK-28043
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Priority: Major
>
> When reading a JSON blob with duplicate fields, Spark appears to ignore the 
> value of the first one. JSON recommends unique names but does not require it; 
> since JSON and Spark SQL both allow duplicate field names, we should fix the 
> bug where the first column value is getting dropped.
>  
> I'm guessing somewhere when parsing JSON, we're turning it into a Map which 
> is causing the first value to be overridden.
>  
> Repro (Python, 2.4):
> >>> jsonRDD = spark.sparkContext.parallelize(["\\{ \"a\": \"blah\", \"a\": 
> >>> \"blah2\"}"])
>  >>> df = spark.read.json(jsonRDD)
>  >>> df.show()
> +----+-----+
> |   a|    a|
> +----+-----+
> |null|blah2|
> +----+-----+
>  
> The expected response would be:
> +----+-----+
> |   a|    a|
> +----+-----+
> |blah|blah2|
> +----+-----+



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28047) [UI] Little bug in executorspage.js

2019-06-13 Thread feiwang (JIRA)
feiwang created SPARK-28047:
---

 Summary: [UI]  Little bug in executorspage.js
 Key: SPARK-28047
 URL: https://issues.apache.org/jira/browse/SPARK-28047
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.4.3
Reporter: feiwang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27018) Checkpointed RDD deleted prematurely when using GBTClassifier

2019-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27018:


Assignee: (was: Apache Spark)

> Checkpointed RDD deleted prematurely when using GBTClassifier
> -
>
> Key: SPARK-27018
> URL: https://issues.apache.org/jira/browse/SPARK-27018
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core
>Affects Versions: 2.2.2, 2.2.3, 2.3.3, 2.4.0
> Environment: OS: Ubuntu Linux 18.10
> Java: java version "1.8.0_201"
> Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
> Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
> Reproducible with a single-node Spark in standalone mode.
> Reproducible with Zeppelin or Spark shell.
>  
>Reporter: Piotr Kołaczkowski
>Priority: Major
> Attachments: 
> Fix_check_if_the_next_checkpoint_exists_before_deleting_the_old_one.patch
>
>
> Steps to reproduce:
> {noformat}
> import org.apache.spark.ml.linalg.Vectors
> import org.apache.spark.ml.classification.GBTClassifier
> case class Row(features: org.apache.spark.ml.linalg.Vector, label: Int)
> sc.setCheckpointDir("/checkpoints")
> val trainingData = sc.parallelize(1 to 2426874, 256).map(x => 
> Row(Vectors.dense(x, x + 1, x * 2 % 10), if (x % 5 == 0) 1 else 0)).toDF
> val classifier = new GBTClassifier()
>   .setLabelCol("label")
>   .setFeaturesCol("features")
>   .setProbabilityCol("probability")
>   .setMaxIter(100)
>   .setMaxDepth(10)
>   .setCheckpointInterval(2)
> classifier.fit(trainingData){noformat}
>  
> The last line fails with:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 56.0 failed 10 times, most recent failure: Lost task 0.9 in stage 56.0 
> (TID 12058, 127.0.0.1, executor 0): java.io.FileNotFoundException: 
> /checkpoints/191c9209-0955-440f-8c11-f042bdf7f804/rdd-51
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:63)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:61)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem.com$datastax$bdp$fs$hadoop$DseFileSystem$$translateToHadoopExceptions(DseFileSystem.scala:70)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264)
> at 
> com.datastax.bdp.fs.hadoop.DseFsInputStream.input(DseFsInputStream.scala:31)
> at 
> com.datastax.bdp.fs.hadoop.DseFsInputStream.openUnderlyingDataSource(DseFsInputStream.scala:39)
> at com.datastax.bdp.fs.hadoop.DseFileSystem.open(DseFileSystem.scala:269)
> at 
> org.apache.spark.rdd.ReliableCheckpointRDD$.readCheckpointFile(ReliableCheckpointRDD.scala:292)
> at 
> org.apache.spark.rdd.ReliableCheckpointRDD.compute(ReliableCheckpointRDD.scala:100)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:322)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
> at 

[jira] [Assigned] (SPARK-27018) Checkpointed RDD deleted prematurely when using GBTClassifier

2019-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27018:


Assignee: Apache Spark

> Checkpointed RDD deleted prematurely when using GBTClassifier
> -
>
> Key: SPARK-27018
> URL: https://issues.apache.org/jira/browse/SPARK-27018
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core
>Affects Versions: 2.2.2, 2.2.3, 2.3.3, 2.4.0
> Environment: OS: Ubuntu Linux 18.10
> Java: java version "1.8.0_201"
> Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
> Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
> Reproducible with a single-node Spark in standalone mode.
> Reproducible with Zeppelin or Spark shell.
>  
>Reporter: Piotr Kołaczkowski
>Assignee: Apache Spark
>Priority: Major
> Attachments: 
> Fix_check_if_the_next_checkpoint_exists_before_deleting_the_old_one.patch
>
>
> Steps to reproduce:
> {noformat}
> import org.apache.spark.ml.linalg.Vectors
> import org.apache.spark.ml.classification.GBTClassifier
> case class Row(features: org.apache.spark.ml.linalg.Vector, label: Int)
> sc.setCheckpointDir("/checkpoints")
> val trainingData = sc.parallelize(1 to 2426874, 256).map(x => 
> Row(Vectors.dense(x, x + 1, x * 2 % 10), if (x % 5 == 0) 1 else 0)).toDF
> val classifier = new GBTClassifier()
>   .setLabelCol("label")
>   .setFeaturesCol("features")
>   .setProbabilityCol("probability")
>   .setMaxIter(100)
>   .setMaxDepth(10)
>   .setCheckpointInterval(2)
> classifier.fit(trainingData){noformat}
>  
> The last line fails with:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 56.0 failed 10 times, most recent failure: Lost task 0.9 in stage 56.0 
> (TID 12058, 127.0.0.1, executor 0): java.io.FileNotFoundException: 
> /checkpoints/191c9209-0955-440f-8c11-f042bdf7f804/rdd-51
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:63)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:61)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem.com$datastax$bdp$fs$hadoop$DseFileSystem$$translateToHadoopExceptions(DseFileSystem.scala:70)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264)
> at 
> com.datastax.bdp.fs.hadoop.DseFsInputStream.input(DseFsInputStream.scala:31)
> at 
> com.datastax.bdp.fs.hadoop.DseFsInputStream.openUnderlyingDataSource(DseFsInputStream.scala:39)
> at com.datastax.bdp.fs.hadoop.DseFileSystem.open(DseFileSystem.scala:269)
> at 
> org.apache.spark.rdd.ReliableCheckpointRDD$.readCheckpointFile(ReliableCheckpointRDD.scala:292)
> at 
> org.apache.spark.rdd.ReliableCheckpointRDD.compute(ReliableCheckpointRDD.scala:100)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:322)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
> at 

[jira] [Updated] (SPARK-27018) Checkpointed RDD deleted prematurely when using GBTClassifier

2019-06-13 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-27018:
-
Component/s: Spark Core

> Checkpointed RDD deleted prematurely when using GBTClassifier
> -
>
> Key: SPARK-27018
> URL: https://issues.apache.org/jira/browse/SPARK-27018
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core
>Affects Versions: 2.2.2, 2.2.3, 2.3.3, 2.4.0
> Environment: OS: Ubuntu Linux 18.10
> Java: java version "1.8.0_201"
> Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
> Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
> Reproducible with a single-node Spark in standalone mode.
> Reproducible with Zeppelin or Spark shell.
>  
>Reporter: Piotr Kołaczkowski
>Priority: Major
> Attachments: 
> Fix_check_if_the_next_checkpoint_exists_before_deleting_the_old_one.patch
>
>
> Steps to reproduce:
> {noformat}
> import org.apache.spark.ml.linalg.Vectors
> import org.apache.spark.ml.classification.GBTClassifier
> case class Row(features: org.apache.spark.ml.linalg.Vector, label: Int)
> sc.setCheckpointDir("/checkpoints")
> val trainingData = sc.parallelize(1 to 2426874, 256).map(x => 
> Row(Vectors.dense(x, x + 1, x * 2 % 10), if (x % 5 == 0) 1 else 0)).toDF
> val classifier = new GBTClassifier()
>   .setLabelCol("label")
>   .setFeaturesCol("features")
>   .setProbabilityCol("probability")
>   .setMaxIter(100)
>   .setMaxDepth(10)
>   .setCheckpointInterval(2)
> classifier.fit(trainingData){noformat}
>  
> The last line fails with:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 56.0 failed 10 times, most recent failure: Lost task 0.9 in stage 56.0 
> (TID 12058, 127.0.0.1, executor 0): java.io.FileNotFoundException: 
> /checkpoints/191c9209-0955-440f-8c11-f042bdf7f804/rdd-51
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:63)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:61)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem.com$datastax$bdp$fs$hadoop$DseFileSystem$$translateToHadoopExceptions(DseFileSystem.scala:70)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264)
> at 
> com.datastax.bdp.fs.hadoop.DseFsInputStream.input(DseFsInputStream.scala:31)
> at 
> com.datastax.bdp.fs.hadoop.DseFsInputStream.openUnderlyingDataSource(DseFsInputStream.scala:39)
> at com.datastax.bdp.fs.hadoop.DseFileSystem.open(DseFileSystem.scala:269)
> at 
> org.apache.spark.rdd.ReliableCheckpointRDD$.readCheckpointFile(ReliableCheckpointRDD.scala:292)
> at 
> org.apache.spark.rdd.ReliableCheckpointRDD.compute(ReliableCheckpointRDD.scala:100)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:322)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
> 

[jira] [Updated] (SPARK-28022) k8s pod affinity to achieve cloud native friendly autoscaling

2019-06-13 Thread Henry Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henry Yu updated SPARK-28022:
-
Summary: k8s pod affinity to achieve cloud native friendly autoscaling   
(was: k8s pod affinity achieve cloud native friendly autoscaling )

> k8s pod affinity to achieve cloud native friendly autoscaling 
> --
>
> Key: SPARK-28022
> URL: https://issues.apache.org/jira/browse/SPARK-28022
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Henry Yu
>Priority: Major
>
> Hi, in order to achieve cloud-native friendly autoscaling, I propose adding a 
> pod affinity feature.
> Traditionally, when we run Spark on a fixed-size YARN cluster, it makes sense 
> to spread containers across every node.
> With cloud-native resource management, we want to release a node when we no 
> longer need it.
> The pod affinity feature aims to place all pods of a given application on a 
> subset of nodes instead of across all nodes.
> By the way, using a pod template is not a good choice here; adding the 
> application id to the pod affinity term at submit time is more robust.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-28021) An inappropriate exception in StaticMemoryManager.getMaxExecutionMemory

2019-06-13 Thread child2d (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

child2d closed SPARK-28021.
---

> An inappropriate exception in StaticMemoryManager.getMaxExecutionMemory
> --
>
> Key: SPARK-28021
> URL: https://issues.apache.org/jira/browse/SPARK-28021
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: child2d
>Priority: Minor
>
> While reviewing StaticMemoryManager.scala, a question came to me.
> {code:java}
> private def getMaxExecutionMemory(conf: SparkConf): Long = {
>   val systemMaxMemory = conf.getLong("spark.testing.memory", 
> Runtime.getRuntime.maxMemory)
>   if (systemMaxMemory < MIN_MEMORY_BYTES) {
> throw new IllegalArgumentException(s"System memory $systemMaxMemory must 
> " +
>   s"be at least $MIN_MEMORY_BYTES. Please increase heap size using the 
> --driver-memory " +
>   s"option or spark.driver.memory in Spark configuration.")
>   }
>   if (conf.contains("spark.executor.memory")) {
> val executorMemory = conf.getSizeAsBytes("spark.executor.memory")
> if (executorMemory < MIN_MEMORY_BYTES) {
>   throw new IllegalArgumentException(s"Executor memory $executorMemory 
> must be at least " +
> s"$MIN_MEMORY_BYTES. Please increase executor memory using the " +
> s"--executor-memory option or spark.executor.memory in Spark 
> configuration.")
> }
>   }
>   val memoryFraction = conf.getDouble("spark.shuffle.memoryFraction", 0.2)
>   val safetyFraction = conf.getDouble("spark.shuffle.safetyFraction", 0.8)
>   (systemMaxMemory * memoryFraction * safetyFraction).toLong
> }
> {code}
> When an executor calls getMaxExecutionMemory, it first sets systemMaxMemory 
> from Runtime.getRuntime.maxMemory and then compares systemMaxMemory against 
> MIN_MEMORY_BYTES.
> If systemMaxMemory is too small, the program throws an exception telling the 
> user to increase the heap size with --driver-memory.
> I wonder if that message is wrong, because the heap size of executors is set 
> with --executor-memory.
> Although there is another exception below about adjusting executor memory, I 
> just think the first exception's message may not be appropriate.
> Thanks for answering my question!
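
As a rough worked example of the arithmetic in the quoted getMaxExecutionMemory (the heap size here is hypothetical; the fractions are the defaults shown in the snippet):

{code:python}
# Hypothetical 4 GB executor heap, with the legacy defaults from the snippet above.
system_max_memory = 4 * 1024 * 1024 * 1024   # Runtime.getRuntime.maxMemory
memory_fraction = 0.2                        # spark.shuffle.memoryFraction
safety_fraction = 0.8                        # spark.shuffle.safetyFraction

# (systemMaxMemory * memoryFraction * safetyFraction).toLong
max_execution_memory = int(system_max_memory * memory_fraction * safety_fraction)
print(max_execution_memory)  # 687194767 bytes, i.e. ~0.64 GB of the 4 GB heap
{code}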



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28021) An inappropriate exception in StaticMemoryManager.getMaxExecutionMemory

2019-06-13 Thread child2d (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863648#comment-16863648
 ] 

child2d commented on SPARK-28021:
-

Thanks for the reminder. I will close the issue.

> An inappropriate exception in StaticMemoryManager.getMaxExecutionMemory
> --
>
> Key: SPARK-28021
> URL: https://issues.apache.org/jira/browse/SPARK-28021
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: child2d
>Priority: Minor
>
> While reviewing StaticMemoryManager.scala, a question came to me.
> {code:java}
> private def getMaxExecutionMemory(conf: SparkConf): Long = {
>   val systemMaxMemory = conf.getLong("spark.testing.memory", 
> Runtime.getRuntime.maxMemory)
>   if (systemMaxMemory < MIN_MEMORY_BYTES) {
> throw new IllegalArgumentException(s"System memory $systemMaxMemory must 
> " +
>   s"be at least $MIN_MEMORY_BYTES. Please increase heap size using the 
> --driver-memory " +
>   s"option or spark.driver.memory in Spark configuration.")
>   }
>   if (conf.contains("spark.executor.memory")) {
> val executorMemory = conf.getSizeAsBytes("spark.executor.memory")
> if (executorMemory < MIN_MEMORY_BYTES) {
>   throw new IllegalArgumentException(s"Executor memory $executorMemory 
> must be at least " +
> s"$MIN_MEMORY_BYTES. Please increase executor memory using the " +
> s"--executor-memory option or spark.executor.memory in Spark 
> configuration.")
> }
>   }
>   val memoryFraction = conf.getDouble("spark.shuffle.memoryFraction", 0.2)
>   val safetyFraction = conf.getDouble("spark.shuffle.safetyFraction", 0.8)
>   (systemMaxMemory * memoryFraction * safetyFraction).toLong
> }
> {code}
> When an executor calls getMaxExecutionMemory, it first sets systemMaxMemory 
> from Runtime.getRuntime.maxMemory and then compares systemMaxMemory against 
> MIN_MEMORY_BYTES.
> If systemMaxMemory is too small, the program throws an exception telling the 
> user to increase the heap size with --driver-memory.
> I wonder if that message is wrong, because the heap size of executors is set 
> with --executor-memory.
> Although there is another exception below about adjusting executor memory, I 
> just think the first exception's message may not be appropriate.
> Thanks for answering my question!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28023) Trim the string when cast string type to other types

2019-06-13 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863643#comment-16863643
 ] 

Yuming Wang commented on SPARK-28023:
-

I'm working on it.

> Trim the string when cast string type to other types
> 
>
> Key: SPARK-28023
> URL: https://issues.apache.org/jira/browse/SPARK-28023
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> For example:
> {code:sql}
> SELECT bool '   f   ';
> select int2 '  21234 ';
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27925) Better control numBins of curves in BinaryClassificationMetrics

2019-06-13 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-27925.
--
Resolution: Not A Problem

> Better control numBins of curves in BinaryClassificationMetrics
> ---
>
> Key: SPARK-27925
> URL: https://issues.apache.org/jira/browse/SPARK-27925
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>
> For large datasets with tens of thousands of partitions, the current curve 
> down-sampling method tends to generate many more bins than the value set by 
> #numBins.
> In the current implementation, grouping is done within partitions, which means 
> each partition contains at least one bin.
> A more reasonable way is to move the grouping forward into the sort step; then 
> we can directly set #bins to #partitions and regard one partition as one bin.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28045) add missing RankingEvaluator

2019-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28045:


Assignee: (was: Apache Spark)

> add missing RankingEvaluator
> 
>
> Key: SPARK-28045
> URL: https://issues.apache.org/jira/browse/SPARK-28045
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>
> expose RankingEvaluator



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28045) add missing RankingEvaluator

2019-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28045:


Assignee: Apache Spark

> add missing RankingEvaluator
> 
>
> Key: SPARK-28045
> URL: https://issues.apache.org/jira/browse/SPARK-28045
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
>
> expose RankingEvaluator



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28046) OOM caused by building hash table when the compressed ratio of small table is normal

2019-06-13 Thread Ke Jia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-28046:
---
Attachment: image-2019-06-14-10-34-53-379.png

> OOM caused by building hash table when the compressed ratio of small table is 
> normal
> 
>
> Key: SPARK-28046
> URL: https://issues.apache.org/jira/browse/SPARK-28046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Ke Jia
>Priority: Major
> Attachments: image-2019-06-14-10-34-53-379.png
>
>
> Currently, Spark converts a sort merge join (SMJ) to a broadcast hash join 
> (BHJ) when the small table's compressed size is <= the broadcast threshold. 
> Like Spark, AE (adaptive execution) also converts the SMJ to a BHJ based on 
> the compressed size at runtime. In our test, with AE enabled and a 32M 
> broadcast threshold, one SMJ whose small table had a 16M compressed size was 
> converted to a BHJ. However, when building the hash table, the 16M small 
> table decompressed to about 2GB and had 134485048 rows, containing a large 
> amount of contiguous and repeated values. Therefore, the following OOM 
> exception occurs when building the hash table:
> !image-2019-06-14-10-29-00-499.png!
> Based on this finding, it may not be reasonable to decide whether an SMJ is 
> converted to a BHJ solely by the compressed size. In AE, we added a condition 
> on the estimated decompressed size based on the row count. Spark may also 
> need a decompressed-size or row-count condition, not only the compressed 
> size, when converting an SMJ to a BHJ.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28046) OOM caused by building hash table when the compressed ratio of small table is normal

2019-06-13 Thread Ke Jia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-28046:
---
Description: 
Currently, Spark converts a sort merge join (SMJ) to a broadcast hash join (BHJ) 
when the small table's compressed size is <= the broadcast threshold. Like Spark, 
AE (adaptive execution) also converts the SMJ to a BHJ based on the compressed 
size at runtime. In our test, with AE enabled and a 32M broadcast threshold, one 
SMJ whose small table had a 16M compressed size was converted to a BHJ. However, 
when building the hash table, the 16M small table decompressed to about 2GB and 
had 134485048 rows, containing a large amount of contiguous and repeated values. 
Therefore, the following OOM exception occurs when building the hash table:

!image-2019-06-14-10-34-53-379.png!

Based on this finding, it may not be reasonable to decide whether an SMJ is 
converted to a BHJ solely by the compressed size. In AE, we added a condition on 
the estimated decompressed size based on the row count. Spark may also need a 
decompressed-size or row-count condition, not only the compressed size, when 
converting an SMJ to a BHJ.

  was:
Currently, Spark converts a sort merge join (SMJ) to a broadcast hash join (BHJ) 
when the small table's compressed size is <= the broadcast threshold. Like Spark, 
AE (adaptive execution) also converts the SMJ to a BHJ based on the compressed 
size at runtime. In our test, with AE enabled and a 32M broadcast threshold, one 
SMJ whose small table had a 16M compressed size was converted to a BHJ. However, 
when building the hash table, the 16M small table decompressed to about 2GB and 
had 134485048 rows, containing a large amount of contiguous and repeated values. 
Therefore, the following OOM exception occurs when building the hash table:

!image-2019-06-14-10-29-00-499.png!

Based on this finding, it may not be reasonable to decide whether an SMJ is 
converted to a BHJ solely by the compressed size. In AE, we added a condition on 
the estimated decompressed size based on the row count. Spark may also need a 
decompressed-size or row-count condition, not only the compressed size, when 
converting an SMJ to a BHJ.


> OOM caused by building hash table when the compressed ratio of small table is 
> normal
> 
>
> Key: SPARK-28046
> URL: https://issues.apache.org/jira/browse/SPARK-28046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Ke Jia
>Priority: Major
> Attachments: image-2019-06-14-10-34-53-379.png
>
>
> Currently, Spark converts a sort merge join (SMJ) to a broadcast hash join 
> (BHJ) when the small table's compressed size is <= the broadcast threshold. 
> Like Spark, AE (adaptive execution) also converts the SMJ to a BHJ based on 
> the compressed size at runtime. In our test, with AE enabled and a 32M 
> broadcast threshold, one SMJ whose small table had a 16M compressed size was 
> converted to a BHJ. However, when building the hash table, the 16M small 
> table decompressed to about 2GB and had 134485048 rows, containing a large 
> amount of contiguous and repeated values. Therefore, the following OOM 
> exception occurs when building the hash table:
> !image-2019-06-14-10-34-53-379.png!
> Based on this finding, it may not be reasonable to decide whether an SMJ is 
> converted to a BHJ solely by the compressed size. In AE, we added a condition 
> on the estimated decompressed size based on the row count. Spark may also 
> need a decompressed-size or row-count condition, not only the compressed 
> size, when converting an SMJ to a BHJ.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28046) OOM caused by building hash table when the compressed ratio of small table is normal

2019-06-13 Thread Ke Jia (JIRA)
Ke Jia created SPARK-28046:
--

 Summary: OOM caused by building hash table when the compressed 
ratio of small table is normal
 Key: SPARK-28046
 URL: https://issues.apache.org/jira/browse/SPARK-28046
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.1
Reporter: Ke Jia


Currently, Spark converts a sort merge join (SMJ) to a broadcast hash join (BHJ) 
when the small table's compressed size is <= the broadcast threshold. Like Spark, 
AE (adaptive execution) also converts the SMJ to a BHJ based on the compressed 
size at runtime. In our test, with AE enabled and a 32M broadcast threshold, one 
SMJ whose small table had a 16M compressed size was converted to a BHJ. However, 
when building the hash table, the 16M small table decompressed to about 2GB and 
had 134485048 rows, containing a large amount of contiguous and repeated values. 
Therefore, the following OOM exception occurs when building the hash table:

!image-2019-06-14-10-29-00-499.png!

Based on this finding, it may not be reasonable to decide whether an SMJ is 
converted to a BHJ solely by the compressed size. In AE, we added a condition on 
the estimated decompressed size based on the row count. Spark may also need a 
decompressed-size or row-count condition, not only the compressed size, when 
converting an SMJ to a BHJ.
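
Not a fix for the heuristic itself, but as a practical mitigation the threshold that drives the SMJ-to-BHJ conversion can be tightened or disabled; a minimal sketch, assuming the standard spark.sql.autoBroadcastJoinThreshold setting (the 8 MB value below is arbitrary):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Lower the threshold so tables whose compressed size is small but which blow up
# on decompression are less likely to be broadcast...
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 8 * 1024 * 1024)

# ...or disable automatic broadcast conversion entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
{code}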



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28045) add missing RankingEvaluator

2019-06-13 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-28045:


 Summary: add missing RankingEvaluator
 Key: SPARK-28045
 URL: https://issues.apache.org/jira/browse/SPARK-28045
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


expose RankingEvaluator



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28044) MulticlassClassificationEvaluator support more metrics

2019-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28044:


Assignee: Apache Spark

> MulticlassClassificationEvaluator support more metrics
> --
>
> Key: SPARK-28044
> URL: https://issues.apache.org/jira/browse/SPARK-28044
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> expose more metrics in evaluator:
> weightedTruePositiveRate
> weightedFalsePositiveRate
> weightedFMeasure
> truePositiveRateByLabel
> falsePositiveRateByLabel
> precisionByLabel
> recallByLabel
> fMeasureByLabel



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28044) MulticlassClassificationEvaluator support more metrics

2019-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28044:


Assignee: (was: Apache Spark)

> MulticlassClassificationEvaluator support more metrics
> --
>
> Key: SPARK-28044
> URL: https://issues.apache.org/jira/browse/SPARK-28044
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> expose more metrics in evaluator:
> weightedTruePositiveRate
> weightedFalsePositiveRate
> weightedFMeasure
> truePositiveRateByLabel
> falsePositiveRateByLabel
> precisionByLabel
> recallByLabel
> fMeasureByLabel



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28044) MulticlassClassificationEvaluator support more metrics

2019-06-13 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-28044:


 Summary: MulticlassClassificationEvaluator support more metrics
 Key: SPARK-28044
 URL: https://issues.apache.org/jira/browse/SPARK-28044
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


expose more metrics in evaluator:

weightedTruePositiveRate

weightedFalsePositiveRate

weightedFMeasure

truePositiveRateByLabel

falsePositiveRateByLabel

precisionByLabel

recallByLabel

fMeasureByLabel
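
For context, a minimal sketch of how the evaluator is driven by metricName today; the per-label and weighted-rate names listed above would presumably become additional accepted values (the data below is a made-up toy example):

{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.getOrCreate()

# Toy (prediction, label) pairs, purely for illustration.
preds = spark.createDataFrame(
    [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (2.0, 2.0)],
    ["prediction", "label"])

evaluator = MulticlassClassificationEvaluator(
    predictionCol="prediction", labelCol="label", metricName="f1")
print(evaluator.evaluate(preds))

# The proposal would allow e.g. metricName="truePositiveRateByLabel" here
# (for by-label metrics, presumably together with a parameter selecting the label).
{code}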



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18112) Spark 2.x does not support reading data from a Hive 2.x metastore

2019-06-13 Thread HonglunChen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863598#comment-16863598
 ] 

HonglunChen commented on SPARK-18112:
-

[~dongjoon] Thank you, I get it.

> Spark 2.x does not support reading data from a Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have 
> been out for a long time since then, but Spark still only supports reading 
> Hive metastore data from Hive 1.2.1 and older versions. Since Hive 2.x has 
> many bug fixes and performance improvements, it is better and urgent to 
> upgrade to support Hive 2.x.
> Failing to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28041) Increase the minimum pandas version to 0.23.2

2019-06-13 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863570#comment-16863570
 ] 

Bryan Cutler commented on SPARK-28041:
--

Yes, definitely. I made a quick PR, but we should run it through the mailing 
list. I'll send it now. 

> Increase the minimum pandas version to 0.23.2
> -
>
> Key: SPARK-28041
> URL: https://issues.apache.org/jira/browse/SPARK-28041
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Currently, the minimum supported Pandas version is 0.19.2. We bumped up 
> testing in the Jenkins env to 0.23.2 and since 0.19.2 was released nearly 3 
> years ago, it is not always compatible with other Python libraries. 
> Increasing the version to 0.23.2 will also allow some workarounds to be 
> removed and make maintenance easier.
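
A minimal sketch of the kind of import-time guard such a minimum version implies (the helper name here is hypothetical, not necessarily Spark's actual code):

{code:python}
from distutils.version import LooseVersion

MINIMUM_PANDAS_VERSION = "0.23.2"  # the proposed new minimum

def require_minimum_pandas_version():
    # Hypothetical guard: fail fast if the installed pandas is too old.
    import pandas
    if LooseVersion(pandas.__version__) < LooseVersion(MINIMUM_PANDAS_VERSION):
        raise ImportError("pandas >= %s must be installed; found %s"
                          % (MINIMUM_PANDAS_VERSION, pandas.__version__))

require_minimum_pandas_version()
{code}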



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27463) Support Dataframe Cogroup via Pandas UDFs

2019-06-13 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863567#comment-16863567
 ] 

Hyukjin Kwon commented on SPARK-27463:
--

Yeah, I think it'd be easier to discuss this with a PR.

> Support Dataframe Cogroup via Pandas UDFs 
> --
>
> Key: SPARK-27463
> URL: https://issues.apache.org/jira/browse/SPARK-27463
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Chris Martin
>Priority: Major
>
> Recent work on Pandas UDFs in Spark has allowed for improved interoperability 
> between Pandas and Spark. This proposal aims to extend this by introducing a 
> new Pandas UDF type which would allow a cogroup operation to be applied to two 
> PySpark DataFrames.
> Full details are in the google document linked below.
>  
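
Purely as an illustrative sketch of what such a cogrouped Pandas UDF could look like (assuming the cogrouped applyInPandas shape that later appeared in Spark 3.0; the actual interface was still under discussion in the linked document):

{code:python}
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 1, 3.0)],
    ("time", "id", "v1"))
df2 = spark.createDataFrame(
    [(20000101, 1, "x"), (20000101, 2, "y")],
    ("time", "id", "v2"))

def asof_join(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    # Arbitrary pandas logic applied to each cogrouped pair of DataFrames.
    return pd.merge_asof(left, right, on="time", by="id")

df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
    asof_join, schema="time int, id int, v1 double, v2 string").show()
{code}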



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28041) Increase the minimum pandas version to 0.23.2

2019-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28041:


Assignee: Apache Spark

> Increase the minimum pandas version to 0.23.2
> -
>
> Key: SPARK-28041
> URL: https://issues.apache.org/jira/browse/SPARK-28041
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Major
>
> Currently, the minimum supported Pandas version is 0.19.2. We bumped up 
> testing in the Jenkins env to 0.23.2 and since 0.19.2 was released nearly 3 
> years ago, it is not always compatible with other Python libraries. 
> Increasing the version to 0.23.2 will also allow some workarounds to be 
> removed and make maintenance easier.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28041) Increase the minimum pandas version to 0.23.2

2019-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28041:


Assignee: (was: Apache Spark)

> Increase the minimum pandas version to 0.23.2
> -
>
> Key: SPARK-28041
> URL: https://issues.apache.org/jira/browse/SPARK-28041
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Currently, the minimum supported Pandas version is 0.19.2. We bumped up 
> testing in the Jenkins env to 0.23.2 and since 0.19.2 was released nearly 3 
> years ago, it is not always compatible with other Python libraries. 
> Increasing the version to 0.23.2 will also allow some workarounds to be 
> removed and make maintenance easier.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28041) Increase the minimum pandas version to 0.23.2

2019-06-13 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863565#comment-16863565
 ] 

Hyukjin Kwon commented on SPARK-28041:
--

[~bryanc], BTW, can we quickly discuss this one on the dev mailing list?

> Increase the minimum pandas version to 0.23.2
> -
>
> Key: SPARK-28041
> URL: https://issues.apache.org/jira/browse/SPARK-28041
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Currently, the minimum supported Pandas version is 0.19.2. We bumped up 
> testing in the Jenkins env to 0.23.2 and since 0.19.2 was released nearly 3 
> years ago, it is not always compatible with other Python libraries. 
> Increasing the version to 0.23.2 will also allow some workarounds to be 
> removed and make maintenance easier.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28043) Reading json with duplicate columns drops the first column value

2019-06-13 Thread Mukul Murthy (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mukul Murthy updated SPARK-28043:
-
Description: 
When reading a JSON blob with duplicate fields, Spark appears to ignore the 
value of the first one. JSON recommends unique names but does not require it; 
since JSON and Spark SQL both allow duplicate field names, we should fix the 
bug where the first column value is getting dropped.

 

I'm guessing somewhere when parsing JSON, we're turning it into a Map which is 
causing the first value to be overridden.

 

Repro (Python, 2.4):

>>> jsonRDD = spark.sparkContext.parallelize(["\\{ \"a\": \"blah\", \"a\": 
>>> \"blah2\"}"])
 >>> df = spark.read.json(jsonRDD)
 >>> df.show()
+----+-----+
|   a|    a|
+----+-----+
|null|blah2|
+----+-----+

The expected response would be:

+----+-----+
|   a|    a|
+----+-----+
|blah|blah2|
+----+-----+

  was:
When reading a JSON blob with duplicate fields, Spark appears to ignore the 
value of the first one. JSON recommends unique names but does not require it; 
since JSON and Spark SQL both allow duplicate field names, we should fix the 
bug where the first column value is getting dropped.

 

Repro (Python, 2.4):

>>> jsonRDD = spark.sparkContext.parallelize(["\{ \"a\": \"blah\", \"a\": 
>>> \"blah2\"}"])
>>> df = spark.read.json(jsonRDD)
>>> df.show()
+----+-----+
|   a|    a|
+----+-----+
|null|blah2|
+----+-----+

The expected response would be:

+----+-----+
|   a|    a|
+----+-----+
|blah|blah2|
+----+-----+


> Reading json with duplicate columns drops the first column value
> 
>
> Key: SPARK-28043
> URL: https://issues.apache.org/jira/browse/SPARK-28043
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Priority: Major
>
> When reading a JSON blob with duplicate fields, Spark appears to ignore the 
> value of the first one. JSON recommends unique names but does not require it; 
> since JSON and Spark SQL both allow duplicate field names, we should fix the 
> bug where the first column value is getting dropped.
>  
> I'm guessing somewhere when parsing JSON, we're turning it into a Map which 
> is causing the first value to be overridden.
>  
> Repro (Python, 2.4):
> >>> jsonRDD = spark.sparkContext.parallelize(["\\{ \"a\": \"blah\", \"a\": 
> >>> \"blah2\"}"])
>  >>> df = spark.read.json(jsonRDD)
>  >>> df.show()
> +----+-----+
> |   a|    a|
> +----+-----+
> |null|blah2|
> +----+-----+
>  
> The expected response would be:
> +----+-----+
> |   a|    a|
> +----+-----+
> |blah|blah2|
> +----+-----+



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28043) Reading json with duplicate columns drops the first column value

2019-06-13 Thread Mukul Murthy (JIRA)
Mukul Murthy created SPARK-28043:


 Summary: Reading json with duplicate columns drops the first 
column value
 Key: SPARK-28043
 URL: https://issues.apache.org/jira/browse/SPARK-28043
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Mukul Murthy


When reading a JSON blob with duplicate fields, Spark appears to ignore the 
value of the first one. JSON recommends unique names but does not require it; 
since JSON and Spark SQL both allow duplicate field names, we should fix the 
bug where the first column value is getting dropped.

 

Repro (Python, 2.4):

>>> jsonRDD = spark.sparkContext.parallelize(["\{ \"a\": \"blah\", \"a\": 
>>> \"blah2\"}"])
>>> df = spark.read.json(jsonRDD)
>>> df.show()
+----+-----+
|   a|    a|
+----+-----+
|null|blah2|
+----+-----+

The expected response would be:

+----+-----+
|   a|    a|
+----+-----+
|blah|blah2|
+----+-----+



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28042) Support mapping spark.local.dir to hostPath volume

2019-06-13 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-28042:
-

 Summary: Support mapping spark.local.dir to hostPath volume
 Key: SPARK-28042
 URL: https://issues.apache.org/jira/browse/SPARK-28042
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.0.0
Reporter: Junjie Chen


Currently, the k8s executor builder mounts spark.local.dir as an emptyDir or in 
memory. That is sufficient for small workloads, but for heavier workloads like 
TPC-DS both options can cause problems: pods are evicted due to disk pressure 
when using emptyDir, and executors hit OOM when using tmpfs.

In particular, in a cloud environment users may allocate a cluster with a 
minimal configuration and attach cloud storage when running a workload. In this 
case, we could specify multiple elastic storage volumes as spark.local.dir to 
accelerate spilling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27100) dag-scheduler-event-loop" java.lang.StackOverflowError

2019-06-13 Thread Parth Chandra (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863381#comment-16863381
 ] 

Parth Chandra commented on SPARK-27100:
---

Opened a PR with a fix and a test to reproduce the issue. 
https://github.com/apache/spark/pull/24865.
Thanks to [~dbtsai] [~dongjoon] for offline help with this one. 

> dag-scheduler-event-loop" java.lang.StackOverflowError
> --
>
> Key: SPARK-27100
> URL: https://issues.apache.org/jira/browse/SPARK-27100
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.3, 2.3.3
>Reporter: KaiXu
>Priority: Major
> Attachments: SPARK-27100-Overflow.txt, stderr
>
>
> ALS in Spark MLlib causes StackOverflow:
>  /opt/sparkml/spark213/bin/spark-submit  --properties-file 
> /opt/HiBench/report/als/spark/conf/sparkbench/spark.conf --class 
> com.intel.hibench.sparkbench.ml.ALSExample --master yarn-client 
> --num-executors 3 --executor-memory 322g 
> /opt/HiBench/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar
>  --numUsers 4 --numProducts 6 --rank 100 --numRecommends 20 
> --numIterations 100 --kryo false --implicitPrefs true --numProductBlocks -1 
> --numUserBlocks -1 --lambda 1.0 hdfs://bdw-slave20:8020/HiBench/ALS/Input
>  
> Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1534)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>  at 
> scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
>  at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>  at 
> scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
>  at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27100) dag-scheduler-event-loop" java.lang.StackOverflowError

2019-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27100:


Assignee: Apache Spark

> dag-scheduler-event-loop" java.lang.StackOverflowError
> --
>
> Key: SPARK-27100
> URL: https://issues.apache.org/jira/browse/SPARK-27100
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.3, 2.3.3
>Reporter: KaiXu
>Assignee: Apache Spark
>Priority: Major
> Attachments: SPARK-27100-Overflow.txt, stderr
>
>
> ALS in Spark MLlib causes StackOverflow:
>  /opt/sparkml/spark213/bin/spark-submit  --properties-file 
> /opt/HiBench/report/als/spark/conf/sparkbench/spark.conf --class 
> com.intel.hibench.sparkbench.ml.ALSExample --master yarn-client 
> --num-executors 3 --executor-memory 322g 
> /opt/HiBench/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar
>  --numUsers 4 --numProducts 6 --rank 100 --numRecommends 20 
> --numIterations 100 --kryo false --implicitPrefs true --numProductBlocks -1 
> --numUserBlocks -1 --lambda 1.0 hdfs://bdw-slave20:8020/HiBench/ALS/Input
>  
> Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1534)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>  at 
> scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
>  at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>  at 
> scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
>  at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27100) dag-scheduler-event-loop" java.lang.StackOverflowError

2019-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27100:


Assignee: (was: Apache Spark)

> dag-scheduler-event-loop" java.lang.StackOverflowError
> --
>
> Key: SPARK-27100
> URL: https://issues.apache.org/jira/browse/SPARK-27100
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.3, 2.3.3
>Reporter: KaiXu
>Priority: Major
> Attachments: SPARK-27100-Overflow.txt, stderr
>
>
> ALS in Spark MLlib causes StackOverflow:
>  /opt/sparkml/spark213/bin/spark-submit  --properties-file 
> /opt/HiBench/report/als/spark/conf/sparkbench/spark.conf --class 
> com.intel.hibench.sparkbench.ml.ALSExample --master yarn-client 
> --num-executors 3 --executor-memory 322g 
> /opt/HiBench/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar
>  --numUsers 4 --numProducts 6 --rank 100 --numRecommends 20 
> --numIterations 100 --kryo false --implicitPrefs true --numProductBlocks -1 
> --numUserBlocks -1 --lambda 1.0 hdfs://bdw-slave20:8020/HiBench/ALS/Input
>  
> Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1534)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>  at 
> scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
>  at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>  at 
> scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
>  at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27100) dag-scheduler-event-loop" java.lang.StackOverflowError

2019-06-13 Thread Parth Chandra (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Parth Chandra updated SPARK-27100:
--
Attachment: SPARK-27100-Overflow.txt

> dag-scheduler-event-loop" java.lang.StackOverflowError
> --
>
> Key: SPARK-27100
> URL: https://issues.apache.org/jira/browse/SPARK-27100
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.3, 2.3.3
>Reporter: KaiXu
>Priority: Major
> Attachments: SPARK-27100-Overflow.txt, stderr
>
>
> ALS in Spark MLlib causes StackOverflow:
>  /opt/sparkml/spark213/bin/spark-submit  --properties-file 
> /opt/HiBench/report/als/spark/conf/sparkbench/spark.conf --class 
> com.intel.hibench.sparkbench.ml.ALSExample --master yarn-client 
> --num-executors 3 --executor-memory 322g 
> /opt/HiBench/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar
>  --numUsers 4 --numProducts 6 --rank 100 --numRecommends 20 
> --numIterations 100 --kryo false --implicitPrefs true --numProductBlocks -1 
> --numUserBlocks -1 --lambda 1.0 hdfs://bdw-slave20:8020/HiBench/ALS/Input
>  
> Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1534)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>  at 
> scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
>  at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>  at 
> scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
>  at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27100) dag-scheduler-event-loop" java.lang.StackOverflowError

2019-06-13 Thread Parth Chandra (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863375#comment-16863375
 ] 

Parth Chandra commented on SPARK-27100:
---

The stack overflow is due to serialization of a ShuffleMapTask (see the attached file with the complete stack) [^SPARK-27100-Overflow.txt].

ShuffleMapTask.partition is a FilePartition, and FilePartition.files is a Stream, which is essentially a linked list, so it is serialized recursively.
If a partition contains a very large number of files, recursing into a linked list of that length causes a stack overflow. This is a general problem with serialization of Scala streams (and other lazily initialized collections) that is fixed in Scala 2.13 (https://github.com/scala/scala/pull/6676).

The problem only occurs with bucketed partitions; the corresponding implementation for non-bucketed partitions uses a StreamBuffer.
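
As a rough analogy outside Spark (a minimal Python sketch, not the actual code path), recursively serializing any long linked structure exhausts the stack in the same way:

{code:python}
import pickle

class Node(object):
    """A tiny cons cell standing in for the Stream's head/tail structure."""
    def __init__(self, value, tail=None):
        self.value = value
        self.tail = tail

# Build a linked list far deeper than the default recursion limit.
head = None
for i in range(100000):
    head = Node(i, head)

try:
    pickle.dumps(head)  # pickling walks the tail chain one frame per element
except RecursionError as err:
    print("stack exhausted while serializing:", err)
{code}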
 
Partial expansion of ShuffleMapTask just before the stack overflow -

{code:java}
obj = \{ShuffleMapTask@16639} Method threw 'java.lang.StackOverflowError' 
exception. Cannot evaluate org.apache.spark.scheduler.ShuffleMapTask.toString()
 taskBinary = \{TorrentBroadcast@17216} Method threw 
'java.lang.StackOverflowError' exception. Cannot evaluate 
org.apache.spark.broadcast.TorrentBroadcast.toString()
 partition = \{FilePartition@17217} Method threw 'java.lang.StackOverflowError' 
exception. Cannot evaluate 
org.apache.spark.sql.execution.datasources.FilePartition.toString()
 index = 0
 files = \{Stream$Cons@17244} Method threw 'java.lang.StackOverflowError' 
exception. Cannot evaluate scala.collection.immutable.Stream$Cons.toString()
 hd = \{PartitionedFile@17246} Method threw 'java.lang.StackOverflowError' 
exception. Cannot evaluate 
org.apache.spark.sql.execution.datasources.PartitionedFile.toString()
 partitionValues = \{GenericInternalRow@17259} Method threw 
'java.lang.StackOverflowError' exception. Cannot evaluate 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.toString()
 filePath = 
"hdfs://path/a.db/master/version=1/rangeid=000/part-00039-295e3ac1-760c-482e-8640-5e5d1539c2c9_0.c000.gz.parquet"
 start = 0
 length = 225781388
 locations = \{String[3]@17261} 
 tlVal = \{Stream$Cons@16687} Method threw 'java.lang.StackOverflowError' 
exception. Cannot evaluate scala.collection.immutable.Stream$Cons.toString()
 hd = \{PartitionedFile@17249} Method threw 'java.lang.StackOverflowError' 
exception. Cannot evaluate 
org.apache.spark.sql.execution.datasources.PartitionedFile.toString()
 partitionValues = \{GenericInternalRow@17255} Method threw 
'java.lang.StackOverflowError' exception. Cannot evaluate 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.toString()
 filePath = 
"hdfs://path/a.db/master/version=1/rangeid=001/part-00061-0346437e-7f8f-44ac-8739-94d1ee285c0b_0.c000.gz.parquet"
 start = 0
 length = 431239612
 locations = \{String[3]@17257} 
 tlVal = \{Stream$Cons@16812} Method threw 'java.lang.StackOverflowError' 
exception. Cannot evaluate scala.collection.immutable.Stream$Cons.toString()
 hd = \{PartitionedFile@17264} Method threw 'java.lang.StackOverflowError' 
exception. Cannot evaluate 
org.apache.spark.sql.execution.datasources.PartitionedFile.toString()
 partitionValues = \{GenericInternalRow@17268} Method threw 
'java.lang.StackOverflowError' exception. Cannot evaluate 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.toString()
 filePath = 
"hdfs://path/a.db/master/version=1/rangeid=002/part-00058-b3a99b18-140e-43ed-838e-276eaa45a5f3_0.c000.gz.parquet"
 start = 0
 length = 219930113
 locations = \{String[3]@17270} 
 tlVal = \{Stream$Cons@17265} Method threw 'java.lang.StackOverflowError' 
exception. Cannot evaluate scala.collection.immutable.Stream$Cons.toString()
 hd = \{PartitionedFile@17273} Method threw 'java.lang.StackOverflowError' 
exception. Cannot evaluate 
org.apache.spark.sql.execution.datasources.PartitionedFile.toString()
 partitionValues = \{GenericInternalRow@17277} Method threw 
'java.lang.StackOverflowError' exception. Cannot evaluate 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.toString()
 filePath = 
"hdfs://path/a.db/master/version=1/rangeid=003/part-00051-58be3faa-0611-49de-8546-1656b6086934_0.c000.gz.parquet"
 start = 0
 length = 219503110
 locations = \{String[3]@17279} 
 tlVal = \{Stream$Cons@17274} Method threw 'java.lang.StackOverflowError' 
exception. Cannot evaluate scala.collection.immutable.Stream$Cons.toString()
 tlGen = null
 tlGen = null
 tlGen = null
 tlGen = null
{code}


> dag-scheduler-event-loop" java.lang.StackOverflowError
> --
>
> Key: SPARK-27100
> URL: https://issues.apache.org/jira/browse/SPARK-27100
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.3, 2.3.3
> 

[jira] [Created] (SPARK-28041) Increase the minimum pandas version to 0.23.2

2019-06-13 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-28041:


 Summary: Increase the minimum pandas version to 0.23.2
 Key: SPARK-28041
 URL: https://issues.apache.org/jira/browse/SPARK-28041
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Bryan Cutler


Currently, the minimum supported Pandas version is 0.19.2. We bumped up testing 
in the Jenkins env to 0.23.2 and since 0.19.2 was released nearly 3 years ago, 
it is not always compatible with other Python libraries. Increasing the version 
to 0.23.2 will also allow some workarounds to be removed and make maintenance 
easier.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27578) Support INTERVAL ... HOUR TO SECOND syntax

2019-06-13 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27578.
---
   Resolution: Fixed
 Assignee: Zhu, Lipeng
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/24472

> Support INTERVAL ... HOUR TO SECOND syntax
> --
>
> Key: SPARK-27578
> URL: https://issues.apache.org/jira/browse/SPARK-27578
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Assignee: Zhu, Lipeng
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, Spark SQL supports an interval format like this:
>  
> {code:java}
> select interval '5 23:59:59.155' day to second{code}
>  
> Can Spark SQL also support the grammar below, as Presto/Teradata already do?
> {code:java}
> select interval '23:59:59.155' hour to second
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14864) [MLLIB] Implement Doc2Vec

2019-06-13 Thread Ayush Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863246#comment-16863246
 ] 

Ayush Singh commented on SPARK-14864:
-

[~michelle] any reason why this issue has been marked resolved?

 I can work on it.

> [MLLIB] Implement Doc2Vec
> -
>
> Key: SPARK-14864
> URL: https://issues.apache.org/jira/browse/SPARK-14864
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Peter Mountanos
>Priority: Minor
>  Labels: bulk-closed
>
> It would be useful to implement Doc2Vec, as described in the paper 
> [Distributed Representations of Sentences and 
> Documents|https://cs.stanford.edu/~quocle/paragraph_vector.pdf]. Gensim has 
> an implementation [Deep learning with 
> paragraph2vec|https://radimrehurek.com/gensim/models/doc2vec.html]. 
> Le & Mikolov show that when aggregating Word2Vec vector representations for a 
> paragraph/document, it does not perform well for prediction tasks. Instead, 
> they propose the Paragraph Vector implementation, which provides 
> state-of-the-art results on several text classification and sentiment 
> analysis tasks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28040) sql() fails to process output of glue::glue_data()

2019-06-13 Thread Michael Chirico (JIRA)
Michael Chirico created SPARK-28040:
---

 Summary: sql() fails to process output of glue::glue_data()
 Key: SPARK-28040
 URL: https://issues.apache.org/jira/browse/SPARK-28040
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 2.4.3
Reporter: Michael Chirico


The {{glue}} package is a natural way to send parameterized queries to Spark from R, very similar to Python's {{format}} for strings. The error is as simple as
{code:java}
library(glue)
library(sparkR)
sparkR.session()

query = glue_data(list(val = 4), 'select {val}')
sql(query){code}
Error in writeType(con, serdeType) : 
  Unsupported type for serialization glue
{{sql(as.character(query))}} works as expected, but needing that extra conversion is a bit awkward / post hoc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28006) User-defined grouped transform pandas_udf for window operations

2019-06-13 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863186#comment-16863186
 ] 

Li Jin edited comment on SPARK-28006 at 6/13/19 3:36 PM:
-

Hi [~viirya] good questions!

>> Can we use pandas agg udfs as window function?

pandas agg udfs as window function is supported. With both unbounded and 
bounded window.

>> Because the proposed GROUPED_XFORM udf calculates output values for all rows 
>> in the group, looks like the proposed GROUPED_XFORM udf can only use window 
>> frame (UnboundedPreceding, UnboundedFollowing)

This is correct. It is really using unbounded window as groups here (because 
there is no groupby transform API in Spark sql).


was (Author: icexelloss):
Hi [~viirya] good questions:

>> Can we use pandas agg udfs as window function?

pandas agg udfs as window function is supported. With both unbounded and 
bounded window.

>> Because the proposed GROUPED_XFORM udf calculates output values for all rows 
>> in the group, looks like the proposed GROUPED_XFORM udf can only use window 
>> frame (UnboundedPreceding, UnboundedFollowing)

This is correct. It is really using unbounded window as groups here (because 
there is no groupby transform API in Spark sql).

> User-defined grouped transform pandas_udf for window operations
> ---
>
> Key: SPARK-28006
> URL: https://issues.apache.org/jira/browse/SPARK-28006
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Li Jin
>Priority: Major
>
> Currently, pandas_udf supports a "grouped aggregate" type that can be used with 
> unbounded and bounded windows. There is another set of use cases that can 
> benefit from a "grouped transform" type pandas_udf.
> Grouped transform is defined as a N -> N mapping over a group. For example, 
> "compute zscore for values in the group using the grouped mean and grouped 
> stdev", or "rank the values in the group".
> Currently, in order to do this, user needs to use "grouped apply", for 
> example:
> {code:java}
> @pandas_udf(schema, GROUPED_MAP)
> def subtract_mean(pdf)
> v = pdf['v']
> pdf['v'] = v - v.mean()
> return pdf
> df.groupby('id').apply(subtract_mean)
> # +---++
> # | id|   v|
> # +---++
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---++{code}
> This approach has a few downside:
>  * Specifying the full return schema is complicated for the user although the 
> function only changes one column.
>  * The column name 'v' inside as part of the udf, makes the udf less reusable.
>  * The entire dataframe is serialized to pass to Python although only one 
> column is needed.
> Here we propose a new type of pandas_udf to work with these types of use 
> cases:
> {code:java}
> df = spark.createDataFrame(
> [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
> ("id", "v"))
> @pandas_udf('double', GROUPED_XFORM)
> def subtract_mean(v):
> return v - v.mean()
> w = Window.partitionBy('id')
> df = df.withColumn('v', subtract_mean(df['v']).over(w))
> # +---++
> # | id|   v|
> # +---++
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---++{code}
> Which addresses the above downsides.
>  * The user only needs to specify the output type of a single column.
>  * The column being zscored is decoupled from the udf implementation
>  * We only need to send one column to Python worker and concat the result 
> with the original dataframe (this is what grouped aggregate is doing already)
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28006) User-defined grouped transform pandas_udf for window operations

2019-06-13 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863186#comment-16863186
 ] 

Li Jin commented on SPARK-28006:


Hi [~viirya] good questions:

>> Can we use pandas agg udfs as window function?

pandas agg udfs as window function is supported. With both unbounded and 
bounded window.

>> Because the proposed GROUPED_XFORM udf calculates output values for all rows 
>> in the group, looks like the proposed GROUPED_XFORM udf can only use window 
>> frame (UnboundedPreceding, UnboundedFollowing)

This is correct. It is really using unbounded window as groups here (because 
there is no groupby transform API in Spark sql).
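
For reference, a minimal sketch of a pandas agg udf used as a window function with the 2.4 API (the data and column names below are illustrative only):

{code:python}
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def mean_udf(v):
    # Receives the values of one window partition as a pandas Series.
    return v.mean()

w = Window.partitionBy("id")
df.withColumn("mean_v", mean_udf(df["v"]).over(w)).show()
{code}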

> User-defined grouped transform pandas_udf for window operations
> ---
>
> Key: SPARK-28006
> URL: https://issues.apache.org/jira/browse/SPARK-28006
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Li Jin
>Priority: Major
>
> Currently, pandas_udf supports a "grouped aggregate" type that can be used with 
> unbounded and bounded windows. There is another set of use cases that can 
> benefit from a "grouped transform" type pandas_udf.
> Grouped transform is defined as a N -> N mapping over a group. For example, 
> "compute zscore for values in the group using the grouped mean and grouped 
> stdev", or "rank the values in the group".
> Currently, in order to do this, user needs to use "grouped apply", for 
> example:
> {code:java}
> @pandas_udf(schema, GROUPED_MAP)
> def subtract_mean(pdf)
> v = pdf['v']
> pdf['v'] = v - v.mean()
> return pdf
> df.groupby('id').apply(subtract_mean)
> # +---++
> # | id|   v|
> # +---++
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---++{code}
> This approach has a few downside:
>  * Specifying the full return schema is complicated for the user although the 
> function only changes one column.
>  * The column name 'v' inside as part of the udf, makes the udf less reusable.
>  * The entire dataframe is serialized to pass to Python although only one 
> column is needed.
> Here we propose a new type of pandas_udf to work with these types of use 
> cases:
> {code:java}
> df = spark.createDataFrame(
> [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
> ("id", "v"))
> @pandas_udf('double', GROUPED_XFORM)
> def subtract_mean(v):
> return v - v.mean()
> w = Window.partitionBy('id')
> df = df.withColumn('v', subtract_mean(df['v']).over(w))
> # +---++
> # | id|   v|
> # +---++
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---++{code}
> Which addresses the above downsides.
>  * The user only needs to specify the output type of a single column.
>  * The column being zscored is decoupled from the udf implementation
>  * We only need to send one column to Python worker and concat the result 
> with the original dataframe (this is what grouped aggregate is doing already)
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28006) User-defined grouped transform pandas_udf for window operations

2019-06-13 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863180#comment-16863180
 ] 

Liang-Chi Hsieh commented on SPARK-28006:
-

I'm curious about two questions:

Can we use pandas agg udfs as window function?

Because the proposed GROUPED_XFORM udf calculates output values for all rows in 
the group, looks like the proposed GROUPED_XFORM udf can only use window frame 
(UnboundedPreceding, UnboundedFollowing)?

> User-defined grouped transform pandas_udf for window operations
> ---
>
> Key: SPARK-28006
> URL: https://issues.apache.org/jira/browse/SPARK-28006
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Li Jin
>Priority: Major
>
> Currently, pandas_udf supports a "grouped aggregate" type that can be used with 
> unbounded and bounded windows. There is another set of use cases that can 
> benefit from a "grouped transform" type pandas_udf.
> Grouped transform is defined as a N -> N mapping over a group. For example, 
> "compute zscore for values in the group using the grouped mean and grouped 
> stdev", or "rank the values in the group".
> Currently, in order to do this, user needs to use "grouped apply", for 
> example:
> {code:java}
> @pandas_udf(schema, GROUPED_MAP)
> def subtract_mean(pdf)
> v = pdf['v']
> pdf['v'] = v - v.mean()
> return pdf
> df.groupby('id').apply(subtract_mean)
> # +---++
> # | id|   v|
> # +---++
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---++{code}
> This approach has a few downside:
>  * Specifying the full return schema is complicated for the user although the 
> function only changes one column.
>  * The column name 'v' inside as part of the udf, makes the udf less reusable.
>  * The entire dataframe is serialized to pass to Python although only one 
> column is needed.
> Here we propose a new type of pandas_udf to work with these types of use 
> cases:
> {code:java}
> df = spark.createDataFrame(
> [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
> ("id", "v"))
> @pandas_udf('double', GROUPED_XFORM)
> def subtract_mean(v):
> return v - v.mean()
> w = Window.partitionBy('id')
> df = df.withColumn('v', subtract_mean(df['v']).over(w))
> # +---++
> # | id|   v|
> # +---++
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---++{code}
> Which addresses the above downsides.
>  * The user only needs to specify the output type of a single column.
>  * The column being zscored is decoupled from the udf implementation
>  * We only need to send one column to Python worker and concat the result 
> with the original dataframe (this is what grouped aggregate is doing already)
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27463) Support Dataframe Cogroup via Pandas UDFs

2019-06-13 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863177#comment-16863177
 ] 

Li Jin commented on SPARK-27463:


Yeah, I think the exact spelling of the API can go either way. The current options are close enough that we don't need to commit to either one at this moment, and the choice shouldn't affect the implementation too much.

[~d80tb7] [~hyukjin.kwon] How about we start working towards a PR that 
implements one of the proposed APIs and go from there?

> Support Dataframe Cogroup via Pandas UDFs 
> --
>
> Key: SPARK-27463
> URL: https://issues.apache.org/jira/browse/SPARK-27463
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Chris Martin
>Priority: Major
>
> Recent work on Pandas UDFs in Spark, has allowed for improved 
> interoperability between Pandas and Spark.  This proposal aims to extend this 
> by introducing a new Pandas UDF type which would allow for a cogroup 
> operation to be applied to two PySpark DataFrames.
> Full details are in the google document linked below.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28039) Add float4.sql

2019-06-13 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28039:
---

 Summary: Add float4.sql
 Key: SPARK-28039
 URL: https://issues.apache.org/jira/browse/SPARK-28039
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


In this ticket, we plan to add the regression test cases of 
https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/float4.sql.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16692) multilabel classification to DataFrame, ML

2019-06-13 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-16692:
-

Assignee: zhengruifeng

>  multilabel classification to DataFrame, ML
> ---
>
> Key: SPARK-16692
> URL: https://issues.apache.org/jira/browse/SPARK-16692
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Weizhi Li
>Assignee: zhengruifeng
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> For multi-label evaluation, there is a method in MLlib named 
> MultilabelMetrics. A multilabel classification problem involves mapping each 
> sample in a dataset to a set of class labels. In this type of classification 
> problem, the labels are not mutually exclusive. For example, when classifying 
> a set of news articles into topics, a single article might be both science 
> and politics.
> This ticket adds that method's functionality for DataFrames in ML.
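
For context, a small sketch of the existing RDD-based MultilabelMetrics API (the label sets below are made up):

{code:python}
from pyspark import SparkContext
from pyspark.mllib.evaluation import MultilabelMetrics

sc = SparkContext.getOrCreate()

# Each record pairs the predicted label set with the true label set.
prediction_and_labels = sc.parallelize([
    ([0.0, 1.0], [0.0, 2.0]),
    ([0.0, 2.0], [0.0, 1.0]),
    ([0.0],      [0.0]),
])

metrics = MultilabelMetrics(prediction_and_labels)
print(metrics.precision(), metrics.recall(), metrics.f1Measure())
{code}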



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16692) multilabel classification to DataFrame, ML

2019-06-13 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16692.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24777
[https://github.com/apache/spark/pull/24777]

>  multilabel classification to DataFrame, ML
> ---
>
> Key: SPARK-16692
> URL: https://issues.apache.org/jira/browse/SPARK-16692
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Weizhi Li
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> For multi-label evaluation, there is a method in MLlib named 
> MultilabelMetrics. A multilabel classification problem involves mapping each 
> sample in a dataset to a set of class labels. In this type of classification 
> problem, the labels are not mutually exclusive. For example, when classifying 
> a set of news articles into topics, a single article might be both science 
> and politics.
> This ticket adds that method's functionality for DataFrames in ML.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28033) String concatenation should low priority than other operators

2019-06-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28033:

Summary: String concatenation should low priority than other operators  
(was: String concatenation low priority than other operators)

> String concatenation should low priority than other operators
> -
>
> Key: SPARK-28033
> URL: https://issues.apache.org/jira/browse/SPARK-28033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Yuming Wang
>Priority: Major
>
> Spark SQL:
> {code:sql}
> spark-sql> explain select 'four: ' || 2 + 2;
> == Physical Plan ==
> *(1) Project [null AS (CAST(concat(four: , CAST(2 AS STRING)) AS DOUBLE) + 
> CAST(2 AS DOUBLE))#2]
> +- Scan OneRowRelation[]
> spark-sql> select 'four: ' || 2 + 2;
> NULL
> {code}
> Hive:
> {code:sql}
> hive> select 'four: ' || 2 + 2;
> OK
> four: 4
> {code}
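
Until the precedence is changed, one workaround sketch is to group the arithmetic explicitly so the addition is evaluated before the concatenation; this should then match Hive's result:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With explicit parentheses the addition happens first, so the result
# should be 'four: 4' instead of NULL.
spark.sql("select 'four: ' || (2 + 2) as s").show()
{code}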



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28038) Add text.sql

2019-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28038:


Assignee: Apache Spark

> Add text.sql
> 
>
> Key: SPARK-28038
> URL: https://issues.apache.org/jira/browse/SPARK-28038
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> In this ticket, we plan to add the regression test cases of 
> [https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/text.sql].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28038) Add text.sql

2019-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28038:


Assignee: (was: Apache Spark)

> Add text.sql
> 
>
> Key: SPARK-28038
> URL: https://issues.apache.org/jira/browse/SPARK-28038
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> In this ticket, we plan to add the regression test cases of 
> [https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/text.sql].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27966) input_file_name empty when listing files in parallel

2019-06-13 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863030#comment-16863030
 ] 

Liang-Chi Hsieh commented on SPARK-27966:
-

I can't see where input_file_name is used from the truncated output. If it is just in the Project, I can't tell why it doesn't work. If there is no good reproducer, I agree with [~hyukjin.kwon] that we may resolve this JIRA.



> input_file_name empty when listing files in parallel
> 
>
> Key: SPARK-27966
> URL: https://issues.apache.org/jira/browse/SPARK-27966
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
> Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
>  
> Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
> Workers: 3
> Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
>Reporter: Christian Homberg
>Priority: Minor
> Attachments: input_file_name_bug
>
>
> I ran into an issue similar to, and probably related to, SPARK-26128. The 
> _org.apache.spark.sql.functions.input_file_name_ result is sometimes empty.
>  
> {code:java}
> df.select(input_file_name()).show(5,false)
> {code}
>  
> {code:java}
> +-----------------+
> |input_file_name()|
> +-----------------+
> |                 |
> |                 |
> |                 |
> |                 |
> |                 |
> +-----------------+
> {code}
> My environment is databricks and debugging the Log4j output showed me that 
> the issue occurred when the files are being listed in parallel, e.g. when 
> {code:java}
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 127; threshold: 32
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under:{code}
>  
> Everything's fine as long as
> {code:java}
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 6; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> {code}
>  
> Setting spark.sql.sources.parallelPartitionDiscovery.threshold to  
> resolves the issue for me.
>  
> *edit: the problem is not exclusively linked to listing files in parallel. 
> I've set up a larger cluster for which, after parallel file listing, 
> input_file_name did return the correct filename. After inspecting the log4j 
> output again, I assume it's linked to some kind of MetaStore being full. I've 
> attached a section of the log4j output that I think should indicate why it's 
> failing. If you need more, please let me know.*
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28038) Add text.sql

2019-06-13 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28038:
---

 Summary: Add text.sql
 Key: SPARK-28038
 URL: https://issues.apache.org/jira/browse/SPARK-28038
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


In this ticket, we plan to add the regression test cases of 
[https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/text.sql].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28037) Add built-in String Functions: quote_literal

2019-06-13 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28037:
---

 Summary: Add built-in String Functions: quote_literal
 Key: SPARK-28037
 URL: https://issues.apache.org/jira/browse/SPARK-28037
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


||Function||Return Type||Description||Example||Result||
|{{quote_literal(string text)}}|{{text}}|Return the given string suitably quoted to be used as a string literal in an SQL statement string. Embedded single-quotes and backslashes are properly doubled. Note that {{quote_literal}} returns null on null input; if the argument might be null, {{quote_nullable}} is often more suitable. See also [Example 43.1|https://www.postgresql.org/docs/11/plpgsql-statements.html#PLPGSQL-QUOTE-LITERAL-EXAMPLE].|{{quote_literal(E'O\'Reilly')}}|{{'O''Reilly'}}|
|{{quote_literal(value anyelement)}}|{{text}}|Coerce the given value to text and then quote it as a literal. Embedded single-quotes and backslashes are properly doubled.|{{quote_literal(42.5)}}|{{'42.5'}}|

https://www.postgresql.org/docs/11/functions-string.html
https://docs.aws.amazon.com/redshift/latest/dg/r_QUOTE_LITERAL.html
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/String/QUOTE_LITERAL.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CString%20Functions%7C_38



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-28033) String concatenation low priority than other operators

2019-06-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28033:

Comment: was deleted

(was: I'm working on.)

> String concatenation low priority than other operators
> --
>
> Key: SPARK-28033
> URL: https://issues.apache.org/jira/browse/SPARK-28033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Yuming Wang
>Priority: Major
>
> Spark SQL:
> {code:sql}
> spark-sql> explain select 'four: ' || 2 + 2;
> == Physical Plan ==
> *(1) Project [null AS (CAST(concat(four: , CAST(2 AS STRING)) AS DOUBLE) + 
> CAST(2 AS DOUBLE))#2]
> +- Scan OneRowRelation[]
> spark-sql> select 'four: ' || 2 + 2;
> NULL
> {code}
> Hive:
> {code:sql}
> hive> select 'four: ' || 2 + 2;
> OK
> four: 4
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28033) String concatenation low priority than other operators

2019-06-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28033:

Summary: String concatenation low priority than other operators  (was: 
String concatenation low priority than other arithmeticBinary)

> String concatenation low priority than other operators
> --
>
> Key: SPARK-28033
> URL: https://issues.apache.org/jira/browse/SPARK-28033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Yuming Wang
>Priority: Major
>
> Spark SQL:
> {code:sql}
> spark-sql> explain select 'four: ' || 2 + 2;
> == Physical Plan ==
> *(1) Project [null AS (CAST(concat(four: , CAST(2 AS STRING)) AS DOUBLE) + 
> CAST(2 AS DOUBLE))#2]
> +- Scan OneRowRelation[]
> spark-sql> select 'four: ' || 2 + 2;
> NULL
> {code}
> Hive:
> {code:sql}
> hive> select 'four: ' || 2 + 2;
> OK
> four: 4
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files

2019-06-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862908#comment-16862908
 ] 

Stavros Kontopoulos edited comment on SPARK-28025 at 6/13/19 10:53 AM:
---

I just found out that the following config options would suffice to avoid 
creating crcs for the given case:

--conf 
spark.hadoop.spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager
 --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem

So there is no need for a PR to disable .crc emission for this particular case. 
I could make one to cover all cases, so that .crc emission can be enabled or 
disabled as needed. However, the filesystem hierarchy in Hadoop is a bit 
inconsistent: LocalFileSystem uses a FilterFileSystem that has the 
setWriteChecksum flag, while DistributedFileSystem does not have it and is 
instead controlled by the property 
[https://github.com/apache/hadoop/blob/533138718cc05b78e0afe583d7a9bd30e8a48fdc/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/client/impl/TestBlockReaderLocal.java#L620]

`dfs.checksum.type`

Btw, there is a nice summary of the topic in chapter 5 of the book Hadoop: The 
Definitive Guide. 

Looking at the old PR, I am not sure it would work with the local FS, since the 
.crc file will be renamed by default, as the underlying filesystem supports 
checksums by default.


was (Author: skonto):
I just found out that the following config options would suffice to avoid 
creating crcs:

--conf 
spark.hadoop.spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager
 --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem

So need to make a PR to disable crc emission if user wants to. Unfortunately 
the filesystem hierarchy in hadoop is a bit inconsistent. So the 
LocalFileSystem will use a FilterFileSystem that has the flag 
setWriteChecksum but the DistributedFileSystem does not have it and it is 
controlled by property 
[https://github.com/apache/hadoop/blob/533138718cc05b78e0afe583d7a9bd30e8a48fdc/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/client/impl/TestBlockReaderLocal.java#L620]

`dfs.checksum.type`.

Btw there is a nice summary on the topic in chapter 5 of the book: hadoop the 
definitive guide. 

So should we clean up things like in the old PR? I am not sure if the old PR 
would work with the LocalFs as it seems that the crc file will be renamed by 
default as the underlying system supports checksums by default.  

> HDFSBackedStateStoreProvider should not leak .crc files 
> 
>
> Key: SPARK-28025
> URL: https://issues.apache.org/jira/browse/SPARK-28025
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3
> Kubernetes 1.11(?) (OpenShift)
> StateStore storage on a mounted PVC. Viewed as a local filesystem by the 
> `FileContextBasedCheckpointFileManager` : 
> {noformat}
> scala> glusterfm.isLocal
> res17: Boolean = true{noformat}
>Reporter: Gerard Maas
>Priority: Major
>
> The HDFSBackedStateStoreProvider when using the default CheckpointFileManager 
> is leaving '.crc' files behind. There's a .crc file created for each 
> `atomicFile` operation of the CheckpointFileManager.
> Over time, the number of files becomes very large. It makes the state store 
> file system constantly increase in size and, in our case, deteriorates the 
> file system performance.
> Here's a sample of one of our spark storage volumes after 2 days of execution 
> (4 stateful streaming jobs, each on a different sub-dir):
>  # 
> {noformat}
> Total files in PVC (used for checkpoints and state store)
> $find . | wc -l
> 431796
> # .crc files
> $find . -name "*.crc" | wc -l
> 418053{noformat}
> With each .crc file taking one storage block, the used storage runs into the 
> GBs of data.
> These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, 
> shows serious performance deterioration with this large number of files:
> {noformat}
> DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files

2019-06-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862908#comment-16862908
 ] 

Stavros Kontopoulos edited comment on SPARK-28025 at 6/13/19 10:51 AM:
---

I just found out that the following config options would suffice to avoid 
creating crcs:

--conf 
spark.hadoop.spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager
 --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem

So I need to make a PR to disable crc emission if the user wants to. 
Unfortunately the filesystem hierarchy in Hadoop is a bit inconsistent: the 
LocalFileSystem goes through a FilterFileSystem that exposes the 
setWriteChecksum flag, but the DistributedFileSystem does not have it and is 
instead controlled by the property `dfs.checksum.type`, see 
[https://github.com/apache/hadoop/blob/533138718cc05b78e0afe583d7a9bd30e8a48fdc/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/client/impl/TestBlockReaderLocal.java#L620]

Btw there is a nice summary of the topic in chapter 5 of the book Hadoop: The 
Definitive Guide.

So should we clean things up as in the old PR? I am not sure if the old PR 
would work with the LocalFs, as it seems the crc file will be renamed by 
default because the underlying filesystem supports checksums by default.
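
For illustration, a minimal sketch of the same workaround applied when building 
the session, using the config keys exactly as given above (the app name is made 
up):

{code:scala}
import org.apache.spark.sql.SparkSession

// Pass the workaround configs programmatically instead of via --conf.
val spark = SparkSession.builder()
  .appName("stateful-streaming-job")   // illustrative name
  .config("spark.hadoop.spark.sql.streaming.checkpointFileManagerClass",
    "org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager")
  .config("spark.hadoop.fs.file.impl", "org.apache.hadoop.fs.RawLocalFileSystem")
  .getOrCreate()
{code}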


was (Author: skonto):
I just found out that the following config options would suffice to avoid 
creating crcs:

--conf 
spark.hadoop.spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager
 --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem

So I need to make a PR to disable crc emission if the user wants to in this 
case, but I could make one for the generic case if the filesystem supports the 
flag mentioned above. Will do that. Btw there is a nice summary of the topic in 
chapter 5 of the book Hadoop: The Definitive Guide.

So should we clean things up as in the old PR? I am not sure if the old PR 
would work with the LocalFs, as it seems the crc file will be renamed by 
default because the underlying filesystem supports checksums by default.

> HDFSBackedStateStoreProvider should not leak .crc files 
> 
>
> Key: SPARK-28025
> URL: https://issues.apache.org/jira/browse/SPARK-28025
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3
> Kubernetes 1.11(?) (OpenShift)
> StateStore storage on a mounted PVC. Viewed as a local filesystem by the 
> `FileContextBasedCheckpointFileManager` : 
> {noformat}
> scala> glusterfm.isLocal
> res17: Boolean = true{noformat}
>Reporter: Gerard Maas
>Priority: Major
>
> The HDFSBackedStateStoreProvider when using the default CheckpointFileManager 
> is leaving '.crc' files behind. There's a .crc file created for each 
> `atomicFile` operation of the CheckpointFileManager.
> Over time, the number of files becomes very large. It makes the state store 
> file system constantly increase in size and, in our case, deteriorates the 
> file system performance.
> Here's a sample of one of our spark storage volumes after 2 days of execution 
> (4 stateful streaming jobs, each on a different sub-dir):
>  # 
> {noformat}
> Total files in PVC (used for checkpoints and state store)
> $find . | wc -l
> 431796
> # .crc files
> $find . -name "*.crc" | wc -l
> 418053{noformat}
> With each .crc file taking one storage block, the used storage runs into the 
> GBs of data.
> These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, 
> shows serious performance deterioration with this large number of files:
> {noformat}
> DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28036) Built-in udf left/right has inconsistent behavior

2019-06-13 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28036:
---

 Summary: Built-in udf left/right has inconsistent behavior
 Key: SPARK-28036
 URL: https://issues.apache.org/jira/browse/SPARK-28036
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


PostgreSQL:
{code:sql}
postgres=# select left('ahoj', -2), right('ahoj', -2);
 left | right 
--+---
 ah   | oj
(1 row)
{code}

Spark SQL:

{code:sql}
spark-sql> select left('ahoj', -2), right('ahoj', -2);

spark-sql>
{code}
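
For comparison, the PostgreSQL semantics shown above (a negative n drops n 
characters from the opposite end) can be sketched in Spark with substring; this 
is only an illustration of the expected behaviour, assuming an active `spark` 
session, not a proposed fix:

{code:scala}
import spark.implicits._

val df = Seq("ahoj").toDF("s")
df.selectExpr(
  "substring(s, 1, length(s) - 2) AS pg_left",   // 'ah': all but the last 2 chars
  "substring(s, 3) AS pg_right"                  // 'oj': all but the first 2 chars
).show()
{code}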




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28035) Test JoinSuite."equi-join is hash-join" is incompatible with its title.

2019-06-13 Thread Jiatao Tao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862928#comment-16862928
 ] 

Jiatao Tao commented on SPARK-28035:


this test "multiple-key equi-join is hash-join" also the same situation.

> Test JoinSuite."equi-join is hash-join" is incompatible with its title.
> ---
>
> Key: SPARK-28035
> URL: https://issues.apache.org/jira/browse/SPARK-28035
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.4.3
>Reporter: Jiatao Tao
>Priority: Trivial
> Attachments: image-2019-06-13-10-32-06-759.png
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files

2019-06-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862908#comment-16862908
 ] 

Stavros Kontopoulos edited comment on SPARK-28025 at 6/13/19 10:38 AM:
---

I just found out that the following config options would suffice to avoid 
creating crcs:

--conf 
spark.hadoop.spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager
 --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem

So I need to make a PR to disable crc emission if the user wants to in this 
case, but I could make one for the generic case if the filesystem supports the 
flag mentioned above. Will do that. Btw there is a nice summary of the topic in 
chapter 5 of the book Hadoop: The Definitive Guide.

So should we clean things up as in the old PR? I am not sure if the old PR 
would work with the LocalFs, as it seems the crc file will be renamed by 
default because the underlying filesystem supports checksums by default.


was (Author: skonto):
I just found out that the following config options would suffice to avoid 
creating crcs:

--conf 
spark.hadoop.spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager
 --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem

So I need to make a PR to disable crc emission if the user wants to in this 
case, but I could make one for the generic case if the filesystem supports the 
flag mentioned above. Btw there is a nice summary of the topic in chapter 5 of 
the book Hadoop: The Definitive Guide.

So should we clean things up as in the old PR? I am not sure if the old PR 
would work with the LocalFs, as it seems the crc file will be renamed by 
default.

> HDFSBackedStateStoreProvider should not leak .crc files 
> 
>
> Key: SPARK-28025
> URL: https://issues.apache.org/jira/browse/SPARK-28025
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3
> Kubernetes 1.11(?) (OpenShift)
> StateStore storage on a mounted PVC. Viewed as a local filesystem by the 
> `FileContextBasedCheckpointFileManager` : 
> {noformat}
> scala> glusterfm.isLocal
> res17: Boolean = true{noformat}
>Reporter: Gerard Maas
>Priority: Major
>
> The HDFSBackedStateStoreProvider when using the default CheckpointFileManager 
> is leaving '.crc' files behind. There's a .crc file created for each 
> `atomicFile` operation of the CheckpointFileManager.
> Over time, the number of files becomes very large. It makes the state store 
> file system constantly increase in size and, in our case, deteriorates the 
> file system performance.
> Here's a sample of one of our spark storage volumes after 2 days of execution 
> (4 stateful streaming jobs, each on a different sub-dir):
>  # 
> {noformat}
> Total files in PVC (used for checkpoints and state store)
> $find . | wc -l
> 431796
> # .crc files
> $find . -name "*.crc" | wc -l
> 418053{noformat}
> With each .crc file taking one storage block, the used storage runs into the 
> GBs of data.
> These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, 
> shows serious performance deterioration with this large number of files:
> {noformat}
> DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28035) Test JoinSuite."equi-join is hash-join" is incompatible with its title.

2019-06-13 Thread Jiatao Tao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862927#comment-16862927
 ] 

Jiatao Tao commented on SPARK-28035:


The title says hash-join, but when I debugged it I found it is actually a merge 
join.

I also cannot see what this test is meant to verify: why does it only assert on 
the size?

If this is a problem, I would like to modify this test; I hope to hear your 
opinion, thanks.

!image-2019-06-13-10-32-06-759.png!
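
For illustration, a plan-level assertion along these lines could pin the 
operator down (a sketch only, not the actual suite code; `df` stands for the 
joined DataFrame under test):

{code:scala}
import org.apache.spark.sql.execution.joins.{BroadcastHashJoinExec, ShuffledHashJoinExec, SortMergeJoinExec}

// Instead of only checking the result size, assert on the physical join node.
val plan = df.queryExecution.executedPlan
val hashJoins  = plan.collect { case j @ (_: BroadcastHashJoinExec | _: ShuffledHashJoinExec) => j }
val mergeJoins = plan.collect { case j: SortMergeJoinExec => j }
assert(hashJoins.nonEmpty, s"expected a hash join but got:\n$plan")
assert(mergeJoins.isEmpty, "did not expect a sort-merge join here")
{code}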

> Test JoinSuite."equi-join is hash-join" is incompatible with its title.
> ---
>
> Key: SPARK-28035
> URL: https://issues.apache.org/jira/browse/SPARK-28035
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.4.3
>Reporter: Jiatao Tao
>Priority: Trivial
> Attachments: image-2019-06-13-10-32-06-759.png
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files

2019-06-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862908#comment-16862908
 ] 

Stavros Kontopoulos edited comment on SPARK-28025 at 6/13/19 10:36 AM:
---

I just found out that the following config options would suffice to avoid 
creating crcs:

--conf 
spark.hadoop.spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager
 --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem

So I need to make a PR to disable crc emission if the user wants to in this 
case, but I could make one for the generic case if the filesystem supports the 
flag mentioned above. Btw there is a nice summary of the topic in chapter 5 of 
the book Hadoop: The Definitive Guide.

So should we clean things up as in the old PR? I am not sure if the old PR 
would work with the LocalFs, as it seems the crc file will be renamed by 
default.


was (Author: skonto):
I just found out that the following config options would suffice to avoid 
creating crcs:

--conf 
spark.hadoop.spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager
 --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem

So I need to make a PR to disable crc emission if the user wants to in this 
case, but I could make one for the generic case. Btw there is a nice summary of 
the topic in chapter 5 of the book Hadoop: The Definitive Guide.

So should we clean things up as in the old PR? I am not sure if the old PR 
would work with the LocalFs, as it seems the crc file will be renamed by 
default.

> HDFSBackedStateStoreProvider should not leak .crc files 
> 
>
> Key: SPARK-28025
> URL: https://issues.apache.org/jira/browse/SPARK-28025
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3
> Kubernetes 1.11(?) (OpenShift)
> StateStore storage on a mounted PVC. Viewed as a local filesystem by the 
> `FileContextBasedCheckpointFileManager` : 
> {noformat}
> scala> glusterfm.isLocal
> res17: Boolean = true{noformat}
>Reporter: Gerard Maas
>Priority: Major
>
> The HDFSBackedStateStoreProvider when using the default CheckpointFileManager 
> is leaving '.crc' files behind. There's a .crc file created for each 
> `atomicFile` operation of the CheckpointFileManager.
> Over time, the number of files becomes very large. It makes the state store 
> file system constantly increase in size and, in our case, deteriorates the 
> file system performance.
> Here's a sample of one of our spark storage volumes after 2 days of execution 
> (4 stateful streaming jobs, each on a different sub-dir):
>  # 
> {noformat}
> Total files in PVC (used for checkpoints and state store)
> $find . | wc -l
> 431796
> # .crc files
> $find . -name "*.crc" | wc -l
> 418053{noformat}
> With each .crc file taking one storage block, the used storage runs into the 
> GBs of data.
> These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, 
> shows serious performance deterioration with this large number of files:
> {noformat}
> DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files

2019-06-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862908#comment-16862908
 ] 

Stavros Kontopoulos edited comment on SPARK-28025 at 6/13/19 10:35 AM:
---

I just found out that the following config options would suffice to avoid 
creating crcs:

--conf 
spark.hadoop.spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager
 --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem

So I need to make a PR to disable crc emission if the user wants to in this 
case, but I could make one for the generic case. Btw there is a nice summary of 
the topic in chapter 5 of the book Hadoop: The Definitive Guide.

So should we clean things up as in the old PR? I am not sure if the old PR 
would work with the LocalFs, as it seems the crc file will be renamed by 
default.


was (Author: skonto):
I just found out that the following config options would suffice to avoid 
creating crcs:

--conf 
spark.hadoop.spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager
 --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem 

So should we clean things up as in the old PR? I am not sure if the old PR 
would work with the LocalFs, as it seems the crc file will be renamed by 
default.

> HDFSBackedStateStoreProvider should not leak .crc files 
> 
>
> Key: SPARK-28025
> URL: https://issues.apache.org/jira/browse/SPARK-28025
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3
> Kubernetes 1.11(?) (OpenShift)
> StateStore storage on a mounted PVC. Viewed as a local filesystem by the 
> `FileContextBasedCheckpointFileManager` : 
> {noformat}
> scala> glusterfm.isLocal
> res17: Boolean = true{noformat}
>Reporter: Gerard Maas
>Priority: Major
>
> The HDFSBackedStateStoreProvider when using the default CheckpointFileManager 
> is leaving '.crc' files behind. There's a .crc file created for each 
> `atomicFile` operation of the CheckpointFileManager.
> Over time, the number of files becomes very large. It makes the state store 
> file system constantly increase in size and, in our case, deteriorates the 
> file system performance.
> Here's a sample of one of our spark storage volumes after 2 days of execution 
> (4 stateful streaming jobs, each on a different sub-dir):
>  # 
> {noformat}
> Total files in PVC (used for checkpoints and state store)
> $find . | wc -l
> 431796
> # .crc files
> $find . -name "*.crc" | wc -l
> 418053{noformat}
> With each .crc file taking one storage block, the used storage runs into the 
> GBs of data.
> These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, 
> shows serious performance deterioration with this large number of files:
> {noformat}
> DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28035) Test JoinSuite."equi-join is hash-join" is incompatible with its title.

2019-06-13 Thread Jiatao Tao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiatao Tao updated SPARK-28035:
---
Attachment: image-2019-06-13-10-32-06-759.png

> Test JoinSuite."equi-join is hash-join" is incompatible with its title.
> ---
>
> Key: SPARK-28035
> URL: https://issues.apache.org/jira/browse/SPARK-28035
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.4.3
>Reporter: Jiatao Tao
>Priority: Trivial
> Attachments: image-2019-06-13-10-32-06-759.png
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28035) Test JoinSuite."equi-join is hash-join" is incompatible with its title.

2019-06-13 Thread Jiatao Tao (JIRA)
Jiatao Tao created SPARK-28035:
--

 Summary: Test JoinSuite."equi-join is hash-join" is incompatible 
with its title.
 Key: SPARK-28035
 URL: https://issues.apache.org/jira/browse/SPARK-28035
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 2.4.3
Reporter: Jiatao Tao






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-27546) Should replace DateTimeUtils#defaultTimeZone use with sessionLocalTimeZone

2019-06-13 Thread Jiatao Tao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiatao Tao reopened SPARK-27546:


Hi, I updated the comment, could someone take a look?

> Should replace DateTimeUtils#defaultTimeZone use with sessionLocalTimeZone
> -
>
> Key: SPARK-27546
> URL: https://issues.apache.org/jira/browse/SPARK-27546
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jiatao Tao
>Priority: Minor
> Attachments: image-2019-04-23-08-10-00-475.png, 
> image-2019-04-23-08-10-50-247.png
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files

2019-06-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862908#comment-16862908
 ] 

Stavros Kontopoulos commented on SPARK-28025:
-

I just found out that the following config options would suffice to avoid 
creating crcs:

--conf 
spark.hadoop.spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager
 --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem 

So should we clean things up as in the old PR? I am not sure if the old PR 
would work with the LocalFs, as it seems the crc file will be renamed by 
default.

> HDFSBackedStateStoreProvider should not leak .crc files 
> 
>
> Key: SPARK-28025
> URL: https://issues.apache.org/jira/browse/SPARK-28025
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3
> Kubernetes 1.11(?) (OpenShift)
> StateStore storage on a mounted PVC. Viewed as a local filesystem by the 
> `FileContextBasedCheckpointFileManager` : 
> {noformat}
> scala> glusterfm.isLocal
> res17: Boolean = true{noformat}
>Reporter: Gerard Maas
>Priority: Major
>
> The HDFSBackedStateStoreProvider when using the default CheckpointFileManager 
> is leaving '.crc' files behind. There's a .crc file created for each 
> `atomicFile` operation of the CheckpointFileManager.
> Over time, the number of files becomes very large. It makes the state store 
> file system constantly increase in size and, in our case, deteriorates the 
> file system performance.
> Here's a sample of one of our spark storage volumes after 2 days of execution 
> (4 stateful streaming jobs, each on a different sub-dir):
>  # 
> {noformat}
> Total files in PVC (used for checkpoints and state store)
> $find . | wc -l
> 431796
> # .crc files
> $find . -name "*.crc" | wc -l
> 418053{noformat}
> With each .crc file taking one storage block, the used storage runs into the 
> GBs of data.
> These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, 
> shows serious performance deterioration with this large number of files:
> {noformat}
> DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27930) List all built-in UDFs have different names

2019-06-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27930:

Description: 
This ticket lists all built-in UDFs that have different names: 
|PostgreSQL|Spark SQL|Note|
|random|rand| |
|format|format_string|Spark's {{format_string}} is based on the implementation 
of {{java.util.Formatter}}, which means some PostgreSQL formats are not 
supported, such as: {{format_string('>>%-s<<', 'Hello')}}|

  was:
This ticket lists all built-in UDFs that have different names: 
|PostgreSQL|Spark SQL|Note|
|random|rand| |
|format|format_string|Spark's {{format_string}} is based on the implementation 
of {{java.util.Formatter}}, which makes some PostgreSQL formats unsupported, 
such as: {{format_string('>>%-s<<', 'Hello')}}|


> List all built-in UDFs have different names
> ---
>
> Key: SPARK-27930
> URL: https://issues.apache.org/jira/browse/SPARK-27930
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> This ticket lists all built-in UDFs that have different names: 
> |PostgreSQL|Spark SQL|Note|
> |random|rand| |
> |format|format_string|Spark's {{format_string}} is based on the 
> implementation of {{java.util.Formatter}}, which means some PostgreSQL 
> formats are not supported, such as: {{format_string('>>%-s<<', 'Hello')}}|
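
As an illustration of the {{java.util.Formatter}} constraint mentioned above (a 
sketch only, assuming an active `spark` session): the '-' flag requires an 
explicit width in Java, so the PostgreSQL-style pattern fails while a 
width-qualified one works.

{code:scala}
import spark.implicits._
import org.apache.spark.sql.functions.{col, format_string}

val df = Seq("Hello").toDF("s")

// df.select(format_string(">>%-s<<", col("s"))).show()
//   would be expected to fail with java.util.MissingFormatWidthException

df.select(format_string(">>%-10s<<", col("s"))).show(false)   // >>Hello     <<
{code}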



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27930) List all built-in UDFs have different names

2019-06-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27930:

Description: 
This ticket lists all built-in UDFs that have different names: 
|PostgreSQL|Spark SQL|Note|
|random|rand| |
|format|format_string|Spark's {{format_string}} is based on the implementation 
of {{java.util.Formatter}}, which makes some PostgreSQL formats unsupported, 
such as: {{format_string('>>%-s<<', 'Hello')}}|

  was:
This ticket lists all built-in UDFs that have different names:
|PostgreSQL|Spark SQL|
|random|rand|
|format|format_string|


> List all built-in UDFs have different names
> ---
>
> Key: SPARK-27930
> URL: https://issues.apache.org/jira/browse/SPARK-27930
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> This ticket lists all built-in UDFs that have different names: 
> |PostgreSQL|Spark SQL|Note|
> |random|rand| |
> |format|format_string|Spark's {{format_string}} is based on the 
> implementation of {{java.util.Formatter}}, which makes some PostgreSQL 
> formats unsupported, such as: {{format_string('>>%-s<<', 'Hello')}}|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28016) Spark hangs when an execution plan has many projections on nested structs

2019-06-13 Thread Ruslan Yushchenko (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862898#comment-16862898
 ] 

Ruslan Yushchenko commented on SPARK-28016:
---

Attached a self-contained example ([^SparkApp1IssueSelfContained.scala]). I 
simplified the example slightly, so only numeric fields are used now. Also, the 
number of transformations can now be changed easily by adjusting the upper 
bound of the `Range()`.

This example also reproduces the issue on the current master. On master it does 
not produce a timeout error, it just freezes.

> Spark hangs when an execution plan has many projections on nested structs
> -
>
> Key: SPARK-28016
> URL: https://issues.apache.org/jira/browse/SPARK-28016
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.4.3
> Environment: Tried in
>  * Spark 2.2.1, Spark 2.4.3 in local mode on Linux, macOS and Windows
>  * Spark 2.4.3 / Yarn on a Linux cluster
>Reporter: Ruslan Yushchenko
>Priority: Major
> Attachments: NestedOps.scala, SparkApp1Issue.scala, 
> SparkApp1IssueSelfContained.scala, SparkApp2Workaround.scala, 
> spark-app-nested.tgz
>
>
> Spark applications freeze on execution plan optimization stage (Catalyst) 
> when a logical execution plan contains a lot of projections that operate on 
> nested struct fields.
> 2 Spark Applications are attached. One demonstrates the issue, the other 
> demonstrates a workaround. Also, an archive is attached where these jobs are 
> packaged as a Maven Project.
> To reproduce, the attached Spark App does the following:
>  * A small dataframe is created from a JSON example.
>  * A nested withColumn map transformation is used to apply a transformation 
> on a struct field and create a new struct field. The code for this 
> transformation is also attached. 
>  * Once more than 11 such transformations are applied the Catalyst optimizer 
> freezes on optimizing the execution plan
> {code:scala}
> package za.co.absa.spark.app
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> object SparkApp1Issue {
>   // A sample data for a dataframe with nested structs
>   val sample =
> """
>   |{
>   |  "strings": {
>   |"simple": "Culpa repellat nesciunt accusantium",
>   |"all_random": "DESebo8d%fL9sX@AzVin",
>   |"whitespaces": "qbbl"
>   |  },
>   |  "numerics": {
>   |"small_positive": 722,
>   |"small_negative": -660,
>   |"big_positive": 669223368251997,
>   |"big_negative": -161176863305841,
>   |"zero": 0
>   |  }
>   |}
> """.stripMargin ::
>   """{
> |  "strings": {
> |"simple": "Accusamus quia vel deleniti",
> |"all_random": "rY*KS]jPBpa[",
> |"whitespaces": "  t e   t   rp   z p"
> |  },
> |  "numerics": {
> |"small_positive": 268,
> |"small_negative": -134,
> |"big_positive": 768990048149640,
> |"big_negative": -684718954884696,
> |"zero": 0
> |  }
> |}
> |""".stripMargin ::
>   """{
> |  "strings": {
> |"simple": "Quia numquam deserunt delectus rem est",
> |"all_random": "GmRdQlE4Avn1hSlVPAH",
> |"whitespaces": "   c   sayv   drf "
> |  },
> |  "numerics": {
> |"small_positive": 909,
> |"small_negative": -363,
> |"big_positive": 592517494751902,
> |"big_negative": -703224505589638,
> |"zero": 0
> |  }
> |}
> |""".stripMargin :: Nil
>   /**
> * This Spark Job demonstrates an issue of execution plan freezing when 
> there are a lot of projections
> * involving nested structs in an execution plan.
> *
> * The example works as follows:
> * - A small dataframe is created from a JSON example above
> * - A nested withColumn map transformation is used to apply a 
> transformation on a struct field and create
> *   a new struct field.
> * - Once more than 11 such transformations are applied the Catalyst 
> optimizer freezes on optimizing
> *   the execution plan
> */
>   def main(args: Array[String]): Unit = {
> val sparkBuilder = SparkSession.builder().appName("Nested Projections 
> Issue")
> val spark = sparkBuilder
>   .master("local[4]")
>   .getOrCreate()
> import spark.implicits._
> import za.co.absa.spark.utils.NestedOps.DataSetWrapper
> val df = spark.read.json(sample.toDS)
> // Apply several uppercase and negation transformations
> val dfOutput = df
>   .nestedWithColumnMap("strings.simple", "strings.uppercase1", c => 
> upper(c))
>   

[jira] [Assigned] (SPARK-28034) Add with.sql

2019-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28034:


Assignee: (was: Apache Spark)

> Add with.sql
> 
>
> Key: SPARK-28034
> URL: https://issues.apache.org/jira/browse/SPARK-28034
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Peter Toth
>Priority: Major
>
> In this ticket, we plan to add the regression test cases of 
> [https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/with.sql]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28034) Add with.sql

2019-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28034:


Assignee: Apache Spark

> Add with.sql
> 
>
> Key: SPARK-28034
> URL: https://issues.apache.org/jira/browse/SPARK-28034
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Peter Toth
>Assignee: Apache Spark
>Priority: Major
>
> In this ticket, we plan to add the regression test cases of 
> [https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/with.sql]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28034) Add with.sql

2019-06-13 Thread Peter Toth (JIRA)
Peter Toth created SPARK-28034:
--

 Summary: Add with.sql
 Key: SPARK-28034
 URL: https://issues.apache.org/jira/browse/SPARK-28034
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Peter Toth


In this ticket, we plan to add the regression test cases of 
[https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/with.sql]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28033) String concatenation low priority than other arithmeticBinary

2019-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28033:


Assignee: (was: Apache Spark)

> String concatenation low priority than other arithmeticBinary
> -
>
> Key: SPARK-28033
> URL: https://issues.apache.org/jira/browse/SPARK-28033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Yuming Wang
>Priority: Major
>
> Spark SQL:
> {code:sql}
> spark-sql> explain select 'four: ' || 2 + 2;
> == Physical Plan ==
> *(1) Project [null AS (CAST(concat(four: , CAST(2 AS STRING)) AS DOUBLE) + 
> CAST(2 AS DOUBLE))#2]
> +- Scan OneRowRelation[]
> spark-sql> select 'four: ' || 2 + 2;
> NULL
> {code}
> Hive:
> {code:sql}
> hive> select 'four: ' || 2 + 2;
> OK
> four: 4
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28033) String concatenation low priority than other arithmeticBinary

2019-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28033:


Assignee: Apache Spark

> String concatenation low priority than other arithmeticBinary
> -
>
> Key: SPARK-28033
> URL: https://issues.apache.org/jira/browse/SPARK-28033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> Spark SQL:
> {code:sql}
> spark-sql> explain select 'four: ' || 2 + 2;
> == Physical Plan ==
> *(1) Project [null AS (CAST(concat(four: , CAST(2 AS STRING)) AS DOUBLE) + 
> CAST(2 AS DOUBLE))#2]
> +- Scan OneRowRelation[]
> spark-sql> select 'four: ' || 2 + 2;
> NULL
> {code}
> Hive:
> {code:sql}
> hive> select 'four: ' || 2 + 2;
> OK
> four: 4
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28033) String concatenation low priority than other arithmeticBinary

2019-06-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28033:

Affects Version/s: (was: 3.0.0)
   2.3.3

> String concatenation low priority than other arithmeticBinary
> -
>
> Key: SPARK-28033
> URL: https://issues.apache.org/jira/browse/SPARK-28033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Yuming Wang
>Priority: Major
>
> Spark SQL:
> {code:sql}
> spark-sql> explain select 'four: ' || 2 + 2;
> == Physical Plan ==
> *(1) Project [null AS (CAST(concat(four: , CAST(2 AS STRING)) AS DOUBLE) + 
> CAST(2 AS DOUBLE))#2]
> +- Scan OneRowRelation[]
> spark-sql> select 'four: ' || 2 + 2;
> NULL
> {code}
> Hive:
> {code:sql}
> hive> select 'four: ' || 2 + 2;
> OK
> four: 4
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28033) String concatenation low priority than other arithmeticBinary

2019-06-13 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862844#comment-16862844
 ] 

Yuming Wang commented on SPARK-28033:
-

I'm working on it.

> String concatenation low priority than other arithmeticBinary
> -
>
> Key: SPARK-28033
> URL: https://issues.apache.org/jira/browse/SPARK-28033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Spark SQL:
> {code:sql}
> spark-sql> explain select 'four: ' || 2 + 2;
> == Physical Plan ==
> *(1) Project [null AS (CAST(concat(four: , CAST(2 AS STRING)) AS DOUBLE) + 
> CAST(2 AS DOUBLE))#2]
> +- Scan OneRowRelation[]
> spark-sql> select 'four: ' || 2 + 2;
> NULL
> {code}
> Hive:
> {code:sql}
> hive> select 'four: ' || 2 + 2;
> OK
> four: 4
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28033) String concatenation low priority than other arithmeticBinary

2019-06-13 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28033:
---

 Summary: String concatenation low priority than other 
arithmeticBinary
 Key: SPARK-28033
 URL: https://issues.apache.org/jira/browse/SPARK-28033
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


Spark SQL:
{code:sql}
spark-sql> explain select 'four: ' || 2 + 2;
== Physical Plan ==
*(1) Project [null AS (CAST(concat(four: , CAST(2 AS STRING)) AS DOUBLE) + 
CAST(2 AS DOUBLE))#2]
+- Scan OneRowRelation[]
spark-sql> select 'four: ' || 2 + 2;
NULL
{code}

Hive:

{code:sql}
hive> select 'four: ' || 2 + 2;
OK
four: 4
{code}
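
For reference, making the intended precedence explicit sidesteps the difference 
(a sketch, assuming an active `spark` session; the expected output is noted in 
the comment):

{code:scala}
// With explicit parentheses both engines agree on the grouping.
spark.sql("select 'four: ' || (2 + 2)").show()   // expected: four: 4
{code}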




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27463) Support Dataframe Cogroup via Pandas UDFs

2019-06-13 Thread Chris Martin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862828#comment-16862828
 ] 

Chris Martin commented on SPARK-27463:
--

Hi [~hyukjin.kwon]

Ah, I see your concern now.  I think it’s fair to say that the proposed 
cogrouping functionality has no analogous API in Pandas.  In my opinion that’s 
understandable, as Pandas is fundamentally a library for manipulating local 
data, so the problems of colocating multiple DataFrames don’t apply as they do 
in Spark.  That said, the inspiration behind the proposed API is clearly that 
of the Pandas groupby().apply(), so I’d argue it is not without precedent.

I think the more direct comparison here is with the existing Dataset cogroup, 
where the high-level functionality is almost exactly the same (partition two 
distinct DataFrames such that their partitions are cogrouped, and apply a 
flatmap operation over them), with the differences being in the cogroup key 
definition (typed for Datasets, untyped for pandas-udf), input (Iterables for 
Datasets, Pandas DataFrames for pandas-udf) and output (Iterable for Datasets, 
pandas DataFrame for pandas-udf). Now at this point one might observe that we 
have two different language-specific implementations of the same high-level 
functionality.  This is true; however, it has been the case since the 
introduction of Pandas UDFs (see groupBy().apply() vs 
groupByKey().flatMapGroups()) and is imho a good thing: it allows us to provide 
functionality that plays to the strengths of each individual language, given 
that what is simple and idiomatic in Python is not in Scala, and vice versa.
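
For reference, a minimal sketch of the existing typed Dataset cogroup being 
compared to here (the Trade/Quote types and datasets are made up):

{code:scala}
import org.apache.spark.sql.{Dataset, SparkSession}

case class Trade(id: Int, qty: Long)
case class Quote(id: Int, bid: Double)

// `trades` and `quotes` are hypothetical Datasets built elsewhere.
def cogroupCounts(spark: SparkSession, trades: Dataset[Trade], quotes: Dataset[Quote]) = {
  import spark.implicits._
  trades.groupByKey(_.id).cogroup(quotes.groupByKey(_.id)) {
    (id, tradeIter, quoteIter) => Iterator((id, tradeIter.size, quoteIter.size))
  }
}
{code}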

If, considering this, we agree that this cogroup functionality is both useful 
and suitable to expose via a Pandas UDF (and I hope we do, but please say if 
you disagree), the question becomes what we would like the API to be. At this 
point let’s consider the API as currently proposed in the design doc.

 
{code:java}
result = df1.cogroup(df2, on='id').apply(my_pandas_udf)
{code}

This API is concise and consistent with existing groupby.apply().  The 
disadvantage is that it isn’t consistent with Dataset’s cogroup and, as this 
API doesn’t exist in Pandas, it can’t be consistent with that (although I would 
argue that if Pandas did introduce such an API it would look a lot like this).

The alternative would be to implement something on RelationalGroupedDataset, as 
described by Li in the post above (I think we can discount anything based on 
KeyValueGroupedDataset since, if my reading of the code is correct, that would 
only apply to typed APIs, which this isn’t).  The big advantage here is that 
this is much more consistent with the existing Dataset cogroup.  On the flip 
side it comes at the cost of a little more verbosity and IMHO is a little less 
pythonic/in the style of Pandas.  That being the case, I’m slightly in favour 
of the API as currently proposed in the design doc, but am happy to be swayed 
to something else if the majority have a different opinion.

> Support Dataframe Cogroup via Pandas UDFs 
> --
>
> Key: SPARK-27463
> URL: https://issues.apache.org/jira/browse/SPARK-27463
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Chris Martin
>Priority: Major
>
> Recent work on Pandas UDFs in Spark has allowed for improved 
> interoperability between Pandas and Spark.  This proposal aims to extend this 
> by introducing a new Pandas UDF type which would allow for a cogroup 
> operation to be applied to two PySpark DataFrames.
> Full details are in the google document linked below.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27966) input_file_name empty when listing files in parallel

2019-06-13 Thread Christian Homberg (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862822#comment-16862822
 ] 

Christian Homberg commented on SPARK-27966:
---

This is the truncated output I get from df.explain.

 
{code:java}
== Physical Plan ==
CollectLimit 21
+- *(1) Project [{COLUMNDEFINITIONS} ... 21 more fields]
+- *(1) FileScan csv [{COLUMNDEFINITIONS}... 21 more fields] Batched: false, 
DataFilters: [], Format: CSV, Location: 
InMemoryFileIndex[dbfs:/mnt{LOCATIONTRUNCATED}, dbfs:/m..., PartitionFilters: 
[], PushedFilters: [], ReadSchema: struct{COLUMNDEFINITIONS}...
{code}
 

Is there anything else I can provide?

 

> input_file_name empty when listing files in parallel
> 
>
> Key: SPARK-27966
> URL: https://issues.apache.org/jira/browse/SPARK-27966
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
> Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
>  
> Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
> Workers: 3
> Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
>Reporter: Christian Homberg
>Priority: Minor
> Attachments: input_file_name_bug
>
>
> I ran into an issue similar to, and probably related to, SPARK-26128. The 
> value of _org.apache.spark.sql.functions.input_file_name_ is sometimes empty.
>  
> {code:java}
> df.select(input_file_name()).show(5,false)
> {code}
>  
> {code:java}
> +-+
> |input_file_name()|
> +-+
> | |
> | |
> | |
> | |
> | |
> +-+
> {code}
> My environment is Databricks, and debugging the Log4j output showed me that 
> the issue occurred when the files were being listed in parallel, e.g. when 
> {code:java}
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 127; threshold: 32
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under:{code}
>  
> Everything's fine as long as
> {code:java}
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 6; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> {code}
>  
> Setting spark.sql.sources.parallelPartitionDiscovery.threshold to  
> resolves the issue for me.
>  
> *edit: the problem is not exclusively linked to listing files in parallel. 
> I've setup a larger cluster for which after parallel file listing the 
> input_file_name did return the correct filename. After inspecting the log4j 
> again, I assume that it's linked to some kind of MetaStore being full. I've 
> attached a section of the log4j output that I think should indicate why it's 
> failing. If you need more, please let me know.*
>  ** 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13882) Remove org.apache.spark.sql.execution.local

2019-06-13 Thread Lai Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862641#comment-16862641
 ] 

Lai Zhou edited comment on SPARK-13882 at 6/13/19 8:07 AM:
---

Hi [~rxin], will this iterator-based local mode be re-introduced in the future?

I think a direct iterator-based local mode would be highly efficient, and it 
could help people run real-time queries.


was (Author: hhlai1990):
Hi [~rxin], will this iterator-based local mode be re-introduced in the future?

I think a direct iterator-based local mode would be highly efficient, which 
could help people run real-time queries.

> Remove org.apache.spark.sql.execution.local
> ---
>
> Key: SPARK-13882
> URL: https://issues.apache.org/jira/browse/SPARK-13882
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
> Fix For: 2.0.0
>
>
> We introduced some local operators in org.apache.spark.sql.execution.local 
> package but never fully wired the engine to actually use these. We still plan 
> to implement a full local mode, but it's probably going to be fairly 
> different from what the current iterator-based local mode would look like.
> Let's just remove them for now, and we can always re-introduced them in the 
> future by looking at branch-1.6.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range

2019-06-13 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28024:

Priority: Critical  (was: Major)

> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Critical
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> For example:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
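
For reference, these results line up with plain JVM two's-complement arithmetic 
after the narrowing casts; a small sketch (illustration only, not a statement 
about where in Spark the wrap-around happens):

{code:scala}
// REPL-style sketch of the wrap-around behind the results above.
val tiny: Byte = 128.toByte                                 // -128: 128 does not fit in a byte
val r1: Byte   = (tiny * 2).toByte                          // 0
val r2: Short  = (2147483647.toShort * 2).toShort           // -2
val r3: Int    = 2147483647 * 2                             // -2
val r4: Short  = ((-32768).toShort * (-1).toShort).toShort  // -32768
{code}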



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range

2019-06-13 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28024:

Labels: correctness  (was: )

> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> For example:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range

2019-06-13 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28024:

Target Version/s: 3.0.0

> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Critical
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> For example:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28016) Spark hangs when an execution plan has many projections on nested structs

2019-06-13 Thread Ruslan Yushchenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Yushchenko updated SPARK-28016:
--
Attachment: SparkApp1IssueSelfContained.scala

> Spark hangs when an execution plan has many projections on nested structs
> -
>
> Key: SPARK-28016
> URL: https://issues.apache.org/jira/browse/SPARK-28016
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.4.3
> Environment: Tried in
>  * Spark 2.2.1, Spark 2.4.3 in local mode on Linux, macOS and Windows
>  * Spark 2.4.3 / Yarn on a Linux cluster
>Reporter: Ruslan Yushchenko
>Priority: Major
> Attachments: NestedOps.scala, SparkApp1Issue.scala, 
> SparkApp1IssueSelfContained.scala, SparkApp2Workaround.scala, 
> spark-app-nested.tgz
>
>
> Spark applications freeze on execution plan optimization stage (Catalyst) 
> when a logical execution plan contains a lot of projections that operate on 
> nested struct fields.
> 2 Spark Applications are attached. One demonstrates the issue, the other 
> demonstrates a workaround. Also, an archive is attached where these jobs are 
> packaged as a Maven Project.
> To reproduce, the attached Spark App does the following:
>  * A small dataframe is created from a JSON example.
>  * A nested withColumn map transformation is used to apply a transformation 
> on a struct field and create a new struct field. The code for this 
> transformation is also attached. 
>  * Once more than 11 such transformations are applied the Catalyst optimizer 
> freezes on optimizing the execution plan
> {code:scala}
> package za.co.absa.spark.app
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> object SparkApp1Issue {
>   // A sample data for a dataframe with nested structs
>   val sample =
> """
>   |{
>   |  "strings": {
>   |"simple": "Culpa repellat nesciunt accusantium",
>   |"all_random": "DESebo8d%fL9sX@AzVin",
>   |"whitespaces": "qbbl"
>   |  },
>   |  "numerics": {
>   |"small_positive": 722,
>   |"small_negative": -660,
>   |"big_positive": 669223368251997,
>   |"big_negative": -161176863305841,
>   |"zero": 0
>   |  }
>   |}
> """.stripMargin ::
>   """{
> |  "strings": {
> |"simple": "Accusamus quia vel deleniti",
> |"all_random": "rY*KS]jPBpa[",
> |"whitespaces": "  t e   t   rp   z p"
> |  },
> |  "numerics": {
> |"small_positive": 268,
> |"small_negative": -134,
> |"big_positive": 768990048149640,
> |"big_negative": -684718954884696,
> |"zero": 0
> |  }
> |}
> |""".stripMargin ::
>   """{
> |  "strings": {
> |"simple": "Quia numquam deserunt delectus rem est",
> |"all_random": "GmRdQlE4Avn1hSlVPAH",
> |"whitespaces": "   c   sayv   drf "
> |  },
> |  "numerics": {
> |"small_positive": 909,
> |"small_negative": -363,
> |"big_positive": 592517494751902,
> |"big_negative": -703224505589638,
> |"zero": 0
> |  }
> |}
> |""".stripMargin :: Nil
>   /**
> * This Spark Job demonstrates an issue of execution plan freezing when 
> there are a lot of projections
> * involving nested structs in an execution plan.
> *
> * The example works as follows:
> * - A small dataframe is created from a JSON example above
> * - A nested withColumn map transformation is used to apply a 
> transformation on a struct field and create
> *   a new struct field.
> * - Once more than 11 such transformations are applied the Catalyst 
> optimizer freezes on optimizing
> *   the execution plan
> */
>   def main(args: Array[String]): Unit = {
> val sparkBuilder = SparkSession.builder().appName("Nested Projections 
> Issue")
> val spark = sparkBuilder
>   .master("local[4]")
>   .getOrCreate()
> import spark.implicits._
> import za.co.absa.spark.utils.NestedOps.DataSetWrapper
> val df = spark.read.json(sample.toDS)
> // Apply several uppercase and negation transformations
> val dfOutput = df
>   .nestedWithColumnMap("strings.simple", "strings.uppercase1", c => 
> upper(c))
>   .nestedWithColumnMap("strings.all_random", "strings.uppercase2", c => 
> upper(c))
>   .nestedWithColumnMap("strings.whitespaces", "strings.uppercase3", c => 
> upper(c))
>   .nestedWithColumnMap("numerics.small_positive", "numerics.num1", c => 
> -c)
>   .nestedWithColumnMap("numerics.small_negative", "numerics.num2", c => 
> -c)
>   

[jira] [Resolved] (SPARK-27322) DataSourceV2 table relation

2019-06-13 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27322.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24741
[https://github.com/apache/spark/pull/24741]

> DataSourceV2 table relation
> ---
>
> Key: SPARK-27322
> URL: https://issues.apache.org/jira/browse/SPARK-27322
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Assignee: John Zhuge
>Priority: Major
> Fix For: 3.0.0
>
>
> Support multi-catalog in the following SELECT code paths:
>  * SELECT * FROM catalog.db.tbl
>  * TABLE catalog.db.tbl
>  * JOIN or UNION tables from different catalogs
>  * SparkSession.table("catalog.db.tbl")
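
For illustration, a minimal sketch of the first and last paths listed above, 
assuming a catalog named `testcat` has been registered via 
`spark.sql.catalog.testcat` and contains `db.tbl` (all names are made up):

{code:scala}
// SELECT through a named catalog, and the equivalent programmatic lookup.
val viaSql   = spark.sql("SELECT * FROM testcat.db.tbl")
val viaTable = spark.table("testcat.db.tbl")
{code}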



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25299) Use remote storage for persisting shuffle data

2019-06-13 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-25299:

Labels: SPIP  (was: )

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>  Labels: SPIP
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.
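For context, a sketch of the local-disk / external-shuffle-service setup described 
above, which is what this proposal aims to generalize. The values are illustrative 
assumptions, not recommendations.

{code:scala}
// Illustrative only: the existing shuffle knobs this SPIP wants to move beyond.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ShuffleServiceSketch")
  // Shuffle files are written to local disk on each worker node (assumed path).
  .set("spark.local.dir", "/mnt/spark-local")
  // Executors register shuffle output with the node-local service, so blocks
  // remain servable after an executor exits.
  .set("spark.shuffle.service.enabled", "true")
  // Dynamic allocation is the usual reason the service must stay up.
  .set("spark.dynamicAllocation.enabled", "true")
{code}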



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28016) Spark hangs when an execution plan has many projections on nested structs

2019-06-13 Thread Ruslan Yushchenko (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862752#comment-16862752
 ] 

Ruslan Yushchenko commented on SPARK-28016:
---

Thank you for your quick response.
 * Will make the example self-contained; a rough outline of what that might look like is sketched below.
 * Will run on the current master and let you know whether I can reproduce it.
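A rough, untested outline of what the self-contained version might look like, 
replacing the custom NestedOps helper with a small inline function that only 
handles top-level structs (the helper name addField is mine, not the attached 
code, and I have not yet verified that this exact sketch reproduces the freeze):

{code:scala}
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder().appName("NestedProjections").master("local[4]").getOrCreate()
import spark.implicits._

val sample = Seq("""{"strings":{"simple":"abc","all_random":"xyz"},"numerics":{"small_positive":1}}""")
val df = spark.read.json(sample.toDS)

// Rebuild a top-level struct column with one extra field appended,
// mimicking nestedWithColumnMap for the simple (non-array) case.
def addField(d: DataFrame, structCol: String, newField: String, value: Column): DataFrame = {
  val kept = d.schema(structCol).dataType.asInstanceOf[StructType]
    .fieldNames.map(f => col(s"$structCol.$f").as(f))
  d.withColumn(structCol, struct(kept :+ value.as(newField): _*))
}

// Stack 12 such projections; optimization time should grow sharply with each step.
val dfOutput = (1 to 12).foldLeft(df) { (d, i) =>
  addField(d, "strings", s"uppercase$i", upper(col("strings.simple")))
}
dfOutput.explain(true)
{code}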

> Spark hangs when an execution plan has many projections on nested structs
> -
>
> Key: SPARK-28016
> URL: https://issues.apache.org/jira/browse/SPARK-28016
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.4.3
> Environment: Tried in
>  * Spark 2.2.1, Spark 2.4.3 in local mode on Linux, MasOS and Windows
>  * Spark 2.4.3 / Yarn on a Linux cluster
>Reporter: Ruslan Yushchenko
>Priority: Major
> Attachments: NestedOps.scala, SparkApp1Issue.scala, 
> SparkApp2Workaround.scala, spark-app-nested.tgz
>
>
> Spark applications freeze at the execution plan optimization stage (Catalyst) 
> when a logical execution plan contains a lot of projections that operate on 
> nested struct fields.
> Two Spark applications are attached: one demonstrates the issue, the other 
> demonstrates a workaround. An archive is also attached in which these jobs are 
> packaged as a Maven project.
> To reproduce, the attached Spark app does the following:
>  * A small dataframe is created from a JSON example.
>  * A nested withColumn map transformation is used to apply a transformation 
> on a struct field and create a new struct field. The code for this 
> transformation is also attached. 
>  * Once more than 11 such transformations are applied, the Catalyst optimizer 
> freezes while optimizing the execution plan.
> {code:scala}
> package za.co.absa.spark.app
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> object SparkApp1Issue {
>   // A sample data for a dataframe with nested structs
>   val sample =
> """
>   |{
>   |  "strings": {
>   |"simple": "Culpa repellat nesciunt accusantium",
>   |"all_random": "DESebo8d%fL9sX@AzVin",
>   |"whitespaces": "qbbl"
>   |  },
>   |  "numerics": {
>   |"small_positive": 722,
>   |"small_negative": -660,
>   |"big_positive": 669223368251997,
>   |"big_negative": -161176863305841,
>   |"zero": 0
>   |  }
>   |}
> """.stripMargin ::
>   """{
> |  "strings": {
> |"simple": "Accusamus quia vel deleniti",
> |"all_random": "rY*KS]jPBpa[",
> |"whitespaces": "  t e   t   rp   z p"
> |  },
> |  "numerics": {
> |"small_positive": 268,
> |"small_negative": -134,
> |"big_positive": 768990048149640,
> |"big_negative": -684718954884696,
> |"zero": 0
> |  }
> |}
> |""".stripMargin ::
>   """{
> |  "strings": {
> |"simple": "Quia numquam deserunt delectus rem est",
> |"all_random": "GmRdQlE4Avn1hSlVPAH",
> |"whitespaces": "   c   sayv   drf "
> |  },
> |  "numerics": {
> |"small_positive": 909,
> |"small_negative": -363,
> |"big_positive": 592517494751902,
> |"big_negative": -703224505589638,
> |"zero": 0
> |  }
> |}
> |""".stripMargin :: Nil
>   /**
> * This Spark Job demonstrates an issue of execution plan freezing when 
> there are a lot of projections
> * involving nested structs in an execution plan.
> *
> * The example works as follows:
> * - A small dataframe is created from a JSON example above
> * - A nested withColumn map transformation is used to apply a 
> transformation on a struct field and create
> *   a new struct field.
> * - Once more than 11 such transformations are applied the Catalyst 
> optimizer freezes on optimizing
> *   the execution plan
> */
>   def main(args: Array[String]): Unit = {
> val sparkBuilder = SparkSession.builder().appName("Nested Projections 
> Issue")
> val spark = sparkBuilder
>   .master("local[4]")
>   .getOrCreate()
> import spark.implicits._
> import za.co.absa.spark.utils.NestedOps.DataSetWrapper
> val df = spark.read.json(sample.toDS)
> // Apply several uppercase and negation transformations
> val dfOutput = df
>   .nestedWithColumnMap("strings.simple", "strings.uppercase1", c => 
> upper(c))
>   .nestedWithColumnMap("strings.all_random", "strings.uppercase2", c => 
> upper(c))
>   .nestedWithColumnMap("strings.whitespaces", "strings.uppercase3", c => 
> upper(c))
>   .nestedWithColumnMap("numerics.small_positive", "numerics.num1", c => 
> -c)
>   

[jira] [Commented] (SPARK-28024) Incorrect numeric values when out of range

2019-06-13 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862747#comment-16862747
 ] 

Wenchen Fan commented on SPARK-28024:
-

AFAIK other databases would throw an overflow exception. We may need a config to 
change this behavior.
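For reference, the reported values line up with plain two's-complement wraparound 
on the JVM; a quick illustration (this is ordinary Scala arithmetic, not the Spark 
code path itself):

{code:scala}
// Byte/Short arithmetic promotes to Int; casting back truncates (wraps around).
((128.toByte) * (2.toByte)).toByte            // 128.toByte is -128, so -256 wraps to 0
((2147483647.toShort) * (2.toShort)).toShort  // 2147483647.toShort is -1, so result is -2
2147483647 * 2                                // Int overflow wraps to -2
((-32768).toShort * (-1).toShort).toShort     // 32768 wraps back to -32768
{code}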

> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: SPARK-28024.png
>
>
> For example:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27322) DataSourceV2 table relation

2019-06-13 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27322:
---

Assignee: John Zhuge

> DataSourceV2 table relation
> ---
>
> Key: SPARK-27322
> URL: https://issues.apache.org/jira/browse/SPARK-27322
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Assignee: John Zhuge
>Priority: Major
>
> Support multi-catalog in the following SELECT code paths:
>  * SELECT * FROM catalog.db.tbl
>  * TABLE catalog.db.tbl
>  * JOIN or UNION tables from different catalogs
>  * SparkSession.table("catalog.db.tbl")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27842) Inconsistent results of Statistics.corr() and PearsonCorrelation.computeCorrelationMatrix()

2019-06-13 Thread Peter Nijem (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862739#comment-16862739
 ] 

Peter Nijem commented on SPARK-27842:
-

[~hyukjin.kwon] do you need any further input from my side?

If so, please let me know and I will be glad to help.

 

Peter

> Inconsistent results of Statistics.corr() and 
> PearsonCorrelation.computeCorrelationMatrix()
> ---
>
> Key: SPARK-27842
> URL: https://issues.apache.org/jira/browse/SPARK-27842
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core, Windows
>Affects Versions: 2.3.1
>Reporter: Peter Nijem
>Priority: Major
> Attachments: vectorList.txt
>
>
> Hi,
> I am working with Spark Java API in local mode (1 node, 8 cores). Spark 
> version as follows in my pom.xml:
> *MLLib*: spark-mllib_2.11, version 2.3.1
> *Core*: spark-core_2.11, version 2.3.1
> I am experiencing inconsistent results of correlation when starting my Spark 
> application with 8 cores vs 1/2/3 cores.
> I've created a Main class which reads a list of Vectors from a file: 240 
> Vectors, each of length 226.
> As you can see, I am firstly initializing Spark with local[*], running 
> Correlation, saving the Matrix and stopping Spark. Then, I do the same but 
> for local[3].
> I am expecting to get the same matrices on both runs. But this is not the 
> case. The input file is attached.
> I tried to compute the correlation using 
> PearsonCorrelation.computeCorrelationMatrix(), but I faced the same issue there 
> as well.
>  
> In my work, I depend on the resulting correlation matrix. Thus, I am getting 
> bad results in my application due to this inconsistency. As a workaround, I am 
> now working with local[1].
>  
>  
> {code:java}
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.apache.spark.mllib.linalg.DenseVector;
> import org.apache.spark.mllib.linalg.Matrix;
> import org.apache.spark.mllib.linalg.Vector;
> import org.apache.spark.mllib.stat.Statistics;
> import org.apache.spark.rdd.RDD;
> import org.junit.Assert;
> import java.io.BufferedReader;
> import java.io.FileReader;
> import java.io.IOException;
> import java.math.RoundingMode;
> import java.text.DecimalFormat;
> import java.util.ArrayList;
> import java.util.Arrays;
> import java.util.List;
> import java.util.stream.Collectors;
> public class TestSparkCorr {
>     private static JavaSparkContext ctx;
>     public static void main(String[] args) {
>         List<List<Double>> doublesLists = readInputFile();
>         List<Vector> resultVectors = getVectorsList(doublesLists);
>         //===
>         initSpark("*");
>         RDD<Vector> RDD_vector = ctx.parallelize(resultVectors).rdd();
>         Matrix m = Statistics.corr(RDD_vector, "pearson");
>         stopSpark();
>         //===
>         initSpark("3");
>         RDD<Vector> RDD_vector_3 = ctx.parallelize(resultVectors).rdd();
>         Matrix m3 = Statistics.corr(RDD_vector_3, "pearson");
>         stopSpark();
>         //===
>         Assert.assertEquals(m3, m);
>     }
>     private static List<Vector> getVectorsList(List<List<Double>> doublesLists) {
>         List<Vector> resultVectors = new ArrayList<>(doublesLists.size());
>         for (List<Double> vector : doublesLists) {
>             double[] x = new double[vector.size()];
>             for (int i = 0; i < x.length; i++) {
>                 x[i] = vector.get(i);
>             }
>             resultVectors.add(new DenseVector(x));
>         }
>         return resultVectors;
>     }
>     private static List<List<Double>> readInputFile() {
>         List<List<Double>> doublesLists = new ArrayList<>();
>         try (BufferedReader reader = new BufferedReader(new FileReader(
>                 ".//output//vectorList.txt"))) {
>             String line = reader.readLine();
>             while (line != null) {
>                 String[] splitLine = line.substring(1, line.length() - 2).split(",");
>                 List<Double> doubles = Arrays.stream(splitLine)
>                         .map(x -> Double.valueOf(x.trim()))
>                         .collect(Collectors.toList());
>                 doublesLists.add(doubles);
>                 // read next line
>                 line = reader.readLine();
>             }
>         } catch (IOException e) {
>             e.printStackTrace();
>         }
>         return doublesLists;
>     }
>     private static void initSpark(String coresNum) {
>         final SparkConf sparkConf = new SparkConf().setAppName("Span");
>         sparkConf.setMaster(String.format("local[%s]", coresNum));
>         sparkConf.set("spark.ui.enabled", "false");
>         ctx = new JavaSparkContext(sparkConf);
>     }
>     private static void stopSpark() {
>         ctx.stop();
>         if (ctx.sc().isStopped()) {
>             ctx = null;
>         }
>     }
> }
> {code}
>  
>  
>  
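One hypothesis worth checking (not a confirmed diagnosis): Statistics.corr builds 
its covariance sums per partition and then merges them, and floating-point addition 
is not associative, so local[*] (8 partitions by default on an 8-core machine) and 
local[3] can differ in the last bits of each matrix entry. A tiny sketch of the 
effect, with an assumed input:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("FpOrder").setMaster("local[*]"))
val xs = (1 to 1000000).map(i => 1.0 / i)

val sum8 = sc.parallelize(xs, 8).sum()  // partial sums over 8 partitions
val sum3 = sc.parallelize(xs, 3).sum()  // partial sums over 3 partitions

// The two sums may differ by a few ULPs; the same effect propagates into the
// entries of the correlation matrix, so exact Matrix equality (assertEquals)
// is too strict. Comparing entries with a small tolerance is the usual fix.
println(sum8 - sum3)
sc.stop()
{code}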



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org