[jira] [Commented] (SPARK-12216) Spark failed to delete temp directory

2018-01-22 Thread Igor Babalich (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334061#comment-16334061
 ] 

Igor Babalich commented on SPARK-12216:
---

Same issue under Windows 10 with Java 1.8, Spark 2.1.1 and Hadoop 2.7.

The problem is that the JVM classloader on Windows does not close the JAR files and 
keeps them open, so these JAR files cannot be deleted.

The workaround is to write (or fix) a custom classloader for these JARs and 
close it properly before ShutdownHookManager deletes them.

See more description at: 
[http://management-platform.blogspot.ca/2009/01/classloaders-keeping-jar-files-open.html]

http://loracular.blogspot.ca/2009/12/dynamic-class-loader-with.html
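For reference, a minimal sketch of that workaround idea, assuming the extra JARs are 
loaded through a dedicated URLClassLoader rather than Spark's own loaders; the path and 
names below are illustrative only, not Spark internals:
{code:scala}
import java.net.{URL, URLClassLoader}

// Illustrative only: load downloaded JARs through a dedicated, closeable classloader.
val jarUrls: Array[URL] = Array(new URL("file:///C:/Temp/spark-app/some.jar")) // hypothetical path
val jarLoader = new URLClassLoader(jarUrls, getClass.getClassLoader)

// ... run the code that needs classes from these JARs ...

// Close the loader *before* the temp directory is deleted; on Windows the open
// JAR handles are what make the recursive delete fail.
jarLoader.close() // URLClassLoader.close() is available since Java 7
{code}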

> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.5.2
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark 
> temp dir: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> java.io.IOException: Failed to delete: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60)
> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60)
> at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at scala.util.Try$.apply(Try.scala:161)
> at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
> at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19217) Offer easy cast from vector to array

2018-01-22 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333987#comment-16333987
 ] 

Takeshi Yamamuro commented on SPARK-19217:
--

ok, I'll reconsider this.

> Offer easy cast from vector to array
> 
>
> Key: SPARK-19217
> URL: https://issues.apache.org/jira/browse/SPARK-19217
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Working with ML often means working with DataFrames with vector columns. You 
> can't save these DataFrames to storage (edit: at least as ORC) without 
> converting the vector columns to array columns, and there doesn't appear to be 
> an easy way to make that conversion.
> This is a common enough problem that it is [documented on Stack 
> Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions 
> to making the conversion from a vector column to an array column are:
> # Convert the DataFrame to an RDD and back
> # Use a UDF
> Both approaches work fine, but it really seems like you should be able to do 
> something like this instead:
> {code}
> (le_data
> .select(
> col('features').cast('array').alias('features')
> ))
> {code}
> We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears 
> that {{cast()}} doesn't support this conversion.
> Would this be an appropriate thing to add?
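For illustration, a minimal sketch of workaround #2 (the UDF approach), shown here in 
Scala; {{assembledDF}} and the output path are assumptions, and column names will differ 
per pipeline:
{code:scala}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Expose the vector's values as a plain array column so the DataFrame can be
// written to formats (e.g. ORC) that do not accept VectorUDT columns.
val vectorToArray = udf((v: Vector) => v.toArray)

val withArrays = assembledDF // assumed: any DataFrame with a "features" vector column
  .withColumn("features_arr", vectorToArray(col("features")))

withArrays.write.format("orc").save("/tmp/features_as_arrays") // hypothetical output path
{code}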



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23177) PySpark parameter-less UDFs raise exception if applied after distinct

2018-01-22 Thread Jakub (JIRA)
Jakub created SPARK-23177:
-

 Summary: PySpark parameter-less UDFs raise exception if applied 
after distinct
 Key: SPARK-23177
 URL: https://issues.apache.org/jira/browse/SPARK-23177
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.1, 2.2.0, 2.1.2
Reporter: Jakub


It seems there is an issue with UDFs that take no arguments, but only if the UDF is 
applied after a {{distinct()}} operation.

Here is a short example that reproduces the issue in the PySpark shell:
{code:java}
import pyspark.sql.functions as f
import uuid

df = spark.createDataFrame([(1,2), (3,4)])
f_udf = f.udf(lambda: str(uuid.uuid4()))
df.distinct().withColumn("a", f_udf()).show()
{code}
and it raises the following exception:
{noformat}
Traceback (most recent call last):
  File "", line 1, in 
  File "/opt/spark/python/pyspark/sql/dataframe.py", line 336, in show
print(self._jdf.showString(n, 20))
  File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
1133, in __call__
  File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
  File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, 
in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o54.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: pythonUDF0#16
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$33.apply(HashAggregateExec.scala:475)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$33.apply(HashAggregateExec.scala:474)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultCode(HashAggregateExec.scala:474)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:612)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:148)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at 
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:80)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.produce(HashAggregateExec.scala:38)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:331)
at 

[jira] [Comment Edited] (SPARK-23178) Kryo Unsafe problems with count distinct from cache

2018-01-22 Thread KIryl Sultanau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334206#comment-16334206
 ] 

KIryl Sultanau edited comment on SPARK-23178 at 1/22/18 12:38 PM:
--

With unsafe switched off, this example works fine:
{quote}.config("spark.kryo.unsafe", "false")
{quote}
No null or incorrect IDs in both data sets.


was (Author: kirills2006):
With unsafe switch off this example works fine:  
{quote}.config("spark.kryo.unsafe", "false")
{quote}

> Kryo Unsafe problems with count distinct from cache
> ---
>
> Key: SPARK-23178
> URL: https://issues.apache.org/jira/browse/SPARK-23178
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
>Reporter: KIryl Sultanau
>Priority: Major
> Attachments: Unsafe-issue.png
>
>
> val spark = SparkSession
>     .builder
>     .appName("unsafe-issue")
>     .master("local[*]")
>     .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
>     .config("spark.kryo.unsafe", "true")
>     .config("spark.kryo.registrationRequired", "false")
>     .getOrCreate()
>     val devicesDF = spark.read.format("csv")
>     .option("header", "true")
>     .option("delimiter", "\t")
>     .load("/data/Devices.tsv").cache()
>     val gatewaysDF = spark.read.format("csv")
>     .option("header", "true")
>     .option("delimiter", "\t")
>     .load("/data/Gateways.tsv").cache()
>     val devJoinedDF = devicesDF.join(gatewaysDF, Seq("GatewayId"), 
> "inner").cache()
>     devJoinedDF.printSchema()
>     println(devJoinedDF.count())
>     println(devJoinedDF.select("DeviceId").distinct().count())
>     println(devJoinedDF.groupBy("DeviceId").count().filter("count>1").count())
>     println(devJoinedDF.groupBy("DeviceId").count().filter("count=1").count())



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23178) Kryo Unsafe problems with count distinct from cache

2018-01-22 Thread KIryl Sultanau (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KIryl Sultanau updated SPARK-23178:
---
Attachment: Unsafe-off.png

> Kryo Unsafe problems with count distinct from cache
> ---
>
> Key: SPARK-23178
> URL: https://issues.apache.org/jira/browse/SPARK-23178
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
>Reporter: KIryl Sultanau
>Priority: Minor
> Attachments: Unsafe-issue.png, Unsafe-off.png
>
>
> val spark = SparkSession
>     .builder
>     .appName("unsafe-issue")
>     .master("local[*]")
>     .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
>     .config("spark.kryo.unsafe", "true")
>     .config("spark.kryo.registrationRequired", "false")
>     .getOrCreate()
>     val devicesDF = spark.read.format("csv")
>     .option("header", "true")
>     .option("delimiter", "\t")
>     .load("/data/Devices.tsv").cache()
>     val gatewaysDF = spark.read.format("csv")
>     .option("header", "true")
>     .option("delimiter", "\t")
>     .load("/data/Gateways.tsv").cache()
>     val devJoinedDF = devicesDF.join(gatewaysDF, Seq("GatewayId"), 
> "inner").cache()
>     devJoinedDF.printSchema()
>     println(devJoinedDF.count())
>     println(devJoinedDF.select("DeviceId").distinct().count())
>     println(devJoinedDF.groupBy("DeviceId").count().filter("count>1").count())
>     println(devJoinedDF.groupBy("DeviceId").count().filter("count=1").count())



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23178) Kryo Unsafe problems with count distinct from cache

2018-01-22 Thread KIryl Sultanau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334206#comment-16334206
 ] 

KIryl Sultanau edited comment on SPARK-23178 at 1/22/18 12:48 PM:
--

With unsafe switched off this example works fine:  
{quote}.config("spark.kryo.unsafe", "false")
{quote}
No null or incorrect IDs in both data sets.


was (Author: kirills2006):
With unsafe switch off this example works fine:  
{quote}.config("spark.kryo.unsafe", "false")
{quote}
No null or incorrect IDs in both data sets.

> Kryo Unsafe problems with count distinct from cache
> ---
>
> Key: SPARK-23178
> URL: https://issues.apache.org/jira/browse/SPARK-23178
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
>Reporter: KIryl Sultanau
>Priority: Minor
> Attachments: Unsafe-issue.png, Unsafe-off.png
>
>
> Spark incorrectly processes cached data with the Kryo & Unsafe options.
> Distinct count from cache doesn't work correctly. An example is available below:
> {quote}val spark = SparkSession
>      .builder
>      .appName("unsafe-issue")
>      .master("local[*]")
>      .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
>      .config("spark.kryo.unsafe", "true")
>      .config("spark.kryo.registrationRequired", "false")
>      .getOrCreate()
>     val devicesDF = spark.read.format("csv")
>      .option("header", "true")
>      .option("delimiter", "\t")
>      .load("/data/Devices.tsv").cache()
>     val gatewaysDF = spark.read.format("csv")
>      .option("header", "true")
>      .option("delimiter", "\t")
>      .load("/data/Gateways.tsv").cache()
>     val devJoinedDF = devicesDF.join(gatewaysDF, Seq("GatewayId"), 
> "inner").cache()
>      devJoinedDF.printSchema()
>     println(devJoinedDF.count())
>     println(devJoinedDF.select("DeviceId").distinct().count())
>     println(devJoinedDF.groupBy("DeviceId").count().filter("count>1").count())
>     println(devJoinedDF.groupBy("DeviceId").count().filter("count=1").count())
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23178) Kryo Unsafe problems with count distinct from cache

2018-01-22 Thread KIryl Sultanau (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KIryl Sultanau updated SPARK-23178:
---
Attachment: Unsafe-issue.png

> Kryo Unsafe problems with count distinct from cache
> ---
>
> Key: SPARK-23178
> URL: https://issues.apache.org/jira/browse/SPARK-23178
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
>Reporter: KIryl Sultanau
>Priority: Major
> Attachments: Unsafe-issue.png
>
>
> val spark = SparkSession
>     .builder
>     .appName("unsafe-issue")
>     .master("local[*]")
>     .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
>     .config("spark.kryo.unsafe", "true")
>     .config("spark.kryo.registrationRequired", "false")
>     .getOrCreate()
>     val devicesDF = spark.read.format("csv")
>     .option("header", "true")
>     .option("delimiter", "\t")
>     .load("/data/Devices.tsv").cache()
>     val gatewaysDF = spark.read.format("csv")
>     .option("header", "true")
>     .option("delimiter", "\t")
>     .load("/data/Gateways.tsv").cache()
>     val devJoinedDF = devicesDF.join(gatewaysDF, Seq("GatewayId"), 
> "inner").cache()
>     devJoinedDF.printSchema()
>     println(devJoinedDF.count())
>     println(devJoinedDF.select("DeviceId").distinct().count())
>     println(devJoinedDF.groupBy("DeviceId").count().filter("count>1").count())
>     println(devJoinedDF.groupBy("DeviceId").count().filter("count=1").count())



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23178) Kryo Unsafe problems with count distinct from cache

2018-01-22 Thread KIryl Sultanau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334206#comment-16334206
 ] 

KIryl Sultanau commented on SPARK-23178:


With unsafe switched off, this works fine:
{quote}.config("spark.kryo.unsafe", "false")
{quote}

> Kryo Unsafe problems with count distinct from cache
> ---
>
> Key: SPARK-23178
> URL: https://issues.apache.org/jira/browse/SPARK-23178
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
>Reporter: KIryl Sultanau
>Priority: Major
> Attachments: Unsafe-issue.png
>
>
> val spark = SparkSession
>     .builder
>     .appName("unsafe-issue")
>     .master("local[*]")
>     .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
>     .config("spark.kryo.unsafe", "true")
>     .config("spark.kryo.registrationRequired", "false")
>     .getOrCreate()
>     val devicesDF = spark.read.format("csv")
>     .option("header", "true")
>     .option("delimiter", "\t")
>     .load("/data/Devices.tsv").cache()
>     val gatewaysDF = spark.read.format("csv")
>     .option("header", "true")
>     .option("delimiter", "\t")
>     .load("/data/Gateways.tsv").cache()
>     val devJoinedDF = devicesDF.join(gatewaysDF, Seq("GatewayId"), 
> "inner").cache()
>     devJoinedDF.printSchema()
>     println(devJoinedDF.count())
>     println(devJoinedDF.select("DeviceId").distinct().count())
>     println(devJoinedDF.groupBy("DeviceId").count().filter("count>1").count())
>     println(devJoinedDF.groupBy("DeviceId").count().filter("count=1").count())



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23178) Kryo Unsafe problems with count distinct from cache

2018-01-22 Thread KIryl Sultanau (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KIryl Sultanau updated SPARK-23178:
---
Priority: Minor  (was: Major)

> Kryo Unsafe problems with count distinct from cache
> ---
>
> Key: SPARK-23178
> URL: https://issues.apache.org/jira/browse/SPARK-23178
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
>Reporter: KIryl Sultanau
>Priority: Minor
> Attachments: Unsafe-issue.png
>
>
> val spark = SparkSession
>     .builder
>     .appName("unsafe-issue")
>     .master("local[*]")
>     .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
>     .config("spark.kryo.unsafe", "true")
>     .config("spark.kryo.registrationRequired", "false")
>     .getOrCreate()
>     val devicesDF = spark.read.format("csv")
>     .option("header", "true")
>     .option("delimiter", "\t")
>     .load("/data/Devices.tsv").cache()
>     val gatewaysDF = spark.read.format("csv")
>     .option("header", "true")
>     .option("delimiter", "\t")
>     .load("/data/Gateways.tsv").cache()
>     val devJoinedDF = devicesDF.join(gatewaysDF, Seq("GatewayId"), 
> "inner").cache()
>     devJoinedDF.printSchema()
>     println(devJoinedDF.count())
>     println(devJoinedDF.select("DeviceId").distinct().count())
>     println(devJoinedDF.groupBy("DeviceId").count().filter("count>1").count())
>     println(devJoinedDF.groupBy("DeviceId").count().filter("count=1").count())



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23178) Kryo Unsafe problems with count distinct from cache

2018-01-22 Thread KIryl Sultanau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334206#comment-16334206
 ] 

KIryl Sultanau edited comment on SPARK-23178 at 1/22/18 12:35 PM:
--

With unsafe switched off, this example works fine:
{quote}.config("spark.kryo.unsafe", "false")
{quote}


was (Author: kirills2006):
With unsafe switch off this works fine:  
{quote}.config("spark.kryo.unsafe", "false")
{quote}

> Kryo Unsafe problems with count distinct from cache
> ---
>
> Key: SPARK-23178
> URL: https://issues.apache.org/jira/browse/SPARK-23178
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
>Reporter: KIryl Sultanau
>Priority: Major
> Attachments: Unsafe-issue.png
>
>
> val spark = SparkSession
>     .builder
>     .appName("unsafe-issue")
>     .master("local[*]")
>     .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
>     .config("spark.kryo.unsafe", "true")
>     .config("spark.kryo.registrationRequired", "false")
>     .getOrCreate()
>     val devicesDF = spark.read.format("csv")
>     .option("header", "true")
>     .option("delimiter", "\t")
>     .load("/data/Devices.tsv").cache()
>     val gatewaysDF = spark.read.format("csv")
>     .option("header", "true")
>     .option("delimiter", "\t")
>     .load("/data/Gateways.tsv").cache()
>     val devJoinedDF = devicesDF.join(gatewaysDF, Seq("GatewayId"), 
> "inner").cache()
>     devJoinedDF.printSchema()
>     println(devJoinedDF.count())
>     println(devJoinedDF.select("DeviceId").distinct().count())
>     println(devJoinedDF.groupBy("DeviceId").count().filter("count>1").count())
>     println(devJoinedDF.groupBy("DeviceId").count().filter("count=1").count())



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23170) Dump the statistics of effective runs of analyzer and optimizer rules

2018-01-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23170.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Dump the statistics of effective runs of analyzer and optimizer rules
> -
>
> Key: SPARK-23170
> URL: https://issues.apache.org/jira/browse/SPARK-23170
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.3.0
>
>
> Dump the statistics of effective runs of analyzer and optimizer rules.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23178) Kryo Unsafe problems with count distinct from cache

2018-01-22 Thread KIryl Sultanau (JIRA)
KIryl Sultanau created SPARK-23178:
--

 Summary: Kryo Unsafe problems with count distinct from cache
 Key: SPARK-23178
 URL: https://issues.apache.org/jira/browse/SPARK-23178
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.1, 2.2.0
Reporter: KIryl Sultanau


val spark = SparkSession
    .builder
    .appName("unsafe-issue")
    .master("local[*]")
    .config("spark.serializer", 
"org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.unsafe", "true")
    .config("spark.kryo.registrationRequired", "false")
    .getOrCreate()

    val devicesDF = spark.read.format("csv")
    .option("header", "true")
    .option("delimiter", "\t")
    .load("/data/Devices.tsv").cache()

    val gatewaysDF = spark.read.format("csv")
    .option("header", "true")
    .option("delimiter", "\t")
    .load("/data/Gateways.tsv").cache()

    val devJoinedDF = devicesDF.join(gatewaysDF, Seq("GatewayId"), 
"inner").cache()
    devJoinedDF.printSchema()

    println(devJoinedDF.count())

    println(devJoinedDF.select("DeviceId").distinct().count())

    println(devJoinedDF.groupBy("DeviceId").count().filter("count>1").count())

    println(devJoinedDF.groupBy("DeviceId").count().filter("count=1").count())



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23178) Kryo Unsafe problems with count distinct from cache

2018-01-22 Thread KIryl Sultanau (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KIryl Sultanau updated SPARK-23178:
---
Description: 
Spark incorrectly processes cached data with the Kryo & Unsafe options.

Distinct count from cache doesn't work correctly. An example is available below:
{quote}val spark = SparkSession
     .builder
     .appName("unsafe-issue")
     .master("local[*]")
     .config("spark.serializer", 
"org.apache.spark.serializer.KryoSerializer")
     .config("spark.kryo.unsafe", "true")
     .config("spark.kryo.registrationRequired", "false")
     .getOrCreate()

    val devicesDF = spark.read.format("csv")
     .option("header", "true")
     .option("delimiter", "\t")
     .load("/data/Devices.tsv").cache()

    val gatewaysDF = spark.read.format("csv")
     .option("header", "true")
     .option("delimiter", "\t")
     .load("/data/Gateways.tsv").cache()

    val devJoinedDF = devicesDF.join(gatewaysDF, Seq("GatewayId"), 
"inner").cache()
     devJoinedDF.printSchema()

    println(devJoinedDF.count())

    println(devJoinedDF.select("DeviceId").distinct().count())

    println(devJoinedDF.groupBy("DeviceId").count().filter("count>1").count())

    println(devJoinedDF.groupBy("DeviceId").count().filter("count=1").count())
{quote}

  was:
val spark = SparkSession
    .builder
    .appName("unsafe-issue")
    .master("local[*]")
    .config("spark.serializer", 
"org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.unsafe", "true")
    .config("spark.kryo.registrationRequired", "false")
    .getOrCreate()

    val devicesDF = spark.read.format("csv")
    .option("header", "true")
    .option("delimiter", "\t")
    .load("/data/Devices.tsv").cache()

    val gatewaysDF = spark.read.format("csv")
    .option("header", "true")
    .option("delimiter", "\t")
    .load("/data/Gateways.tsv").cache()

    val devJoinedDF = devicesDF.join(gatewaysDF, Seq("GatewayId"), 
"inner").cache()
    devJoinedDF.printSchema()

    println(devJoinedDF.count())

    println(devJoinedDF.select("DeviceId").distinct().count())

    println(devJoinedDF.groupBy("DeviceId").count().filter("count>1").count())

    println(devJoinedDF.groupBy("DeviceId").count().filter("count=1").count())


> Kryo Unsafe problems with count distinct from cache
> ---
>
> Key: SPARK-23178
> URL: https://issues.apache.org/jira/browse/SPARK-23178
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
>Reporter: KIryl Sultanau
>Priority: Minor
> Attachments: Unsafe-issue.png, Unsafe-off.png
>
>
> Spark incorrectly processes cached data with the Kryo & Unsafe options.
> Distinct count from cache doesn't work correctly. An example is available below:
> {quote}val spark = SparkSession
>      .builder
>      .appName("unsafe-issue")
>      .master("local[*]")
>      .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
>      .config("spark.kryo.unsafe", "true")
>      .config("spark.kryo.registrationRequired", "false")
>      .getOrCreate()
>     val devicesDF = spark.read.format("csv")
>      .option("header", "true")
>      .option("delimiter", "\t")
>      .load("/data/Devices.tsv").cache()
>     val gatewaysDF = spark.read.format("csv")
>      .option("header", "true")
>      .option("delimiter", "\t")
>      .load("/data/Gateways.tsv").cache()
>     val devJoinedDF = devicesDF.join(gatewaysDF, Seq("GatewayId"), 
> "inner").cache()
>      devJoinedDF.printSchema()
>     println(devJoinedDF.count())
>     println(devJoinedDF.select("DeviceId").distinct().count())
>     println(devJoinedDF.groupBy("DeviceId").count().filter("count>1").count())
>     println(devJoinedDF.groupBy("DeviceId").count().filter("count=1").count())
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23154) Document backwards compatibility guarantees for ML persistence

2018-01-22 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334757#comment-16334757
 ] 

Yanbo Liang commented on SPARK-23154:
-

Sounds good! It should be helpful to document backwards compatibility. Furthermore, 
I think we can write some tools to test backwards compatibility for ML persistence 
during release QA, similar to the performance regression tests. Thanks.
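
As a rough sketch of what such a QA check could look like (the model path, test dataset 
and expected output column below are assumptions, not an existing test):
{code:scala}
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

// Minimal cross-version persistence check: load a PipelineModel saved by an
// earlier Spark release and verify that it still loads and transforms.
val spark = SparkSession.builder().appName("ml-persistence-compat-check").getOrCreate()

val model = PipelineModel.load("/qa/models/saved-with-older-spark/lr-pipeline") // assumed location
val testData = spark.read.parquet("/qa/data/compat-test-input")                 // assumed test dataset

assert(model.transform(testData).columns.contains("prediction"))
{code}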

> Document backwards compatibility guarantees for ML persistence
> --
>
> Key: SPARK-23154
> URL: https://issues.apache.org/jira/browse/SPARK-23154
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> We have (as far as I know) maintained backwards compatibility for ML 
> persistence, but this is not documented anywhere.  I'd like us to document it 
> (for spark.ml, not for spark.mllib).
> I'd recommend something like:
> {quote}
> In general, MLlib maintains backwards compatibility for ML persistence.  
> I.e., if you save an ML model or Pipeline in one version of Spark, then you 
> should be able to load it back and use it in a future version of Spark.  
> However, there are rare exceptions, described below.
> Model persistence: Is a model or Pipeline saved using Apache Spark ML 
> persistence in Spark version X loadable by Spark version Y?
> * Major versions: No guarantees, but best-effort.
> * Minor and patch versions: Yes; these are backwards compatible.
> * Note about the format: There are no guarantees for a stable persistence 
> format, but model loading itself is designed to be backwards compatible.
> Model behavior: Does a model or Pipeline in Spark version X behave 
> identically in Spark version Y?
> * Major versions: No guarantees, but best-effort.
> * Minor and patch versions: Identical behavior, except for bug fixes.
> For both model persistence and model behavior, any breaking changes across a 
> minor version or patch version are reported in the Spark version release 
> notes. If a breakage is not reported in release notes, then it should be 
> treated as a bug to be fixed.
> {quote}
> How does this sound?
> Note: We unfortunately don't have tests for backwards compatibility (which 
> has technical hurdles and can be discussed in [SPARK-15573]).  However, we 
> have made efforts to maintain it during PR review and Spark release QA, and 
> most users expect it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23121) When the Spark Streaming app is running for a period of time, the page is incorrectly reported when accessing '/jobs/' or '/jobs/job/?id=13' and ui can not be

2018-01-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23121:
--

Assignee: Sandor Murakozi

> When the Spark Streaming app is running for a period of time, the page is 
> incorrectly reported when accessing '/jobs/' or '/jobs/job/?id=13' 
> and ui can not be accessed.
> -
>
> Key: SPARK-23121
> URL: https://issues.apache.org/jira/browse/SPARK-23121
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Assignee: Sandor Murakozi
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: 1.png, 2.png
>
>
> When the Spark Streaming app has been running for a period of time, an error page is 
> reported when accessing '/jobs/' or '/jobs/job/?id=13', 
> and the UI cannot be accessed.
>  
> Test command:
> ./bin/spark-submit --class org.apache.spark.examples.streaming.HdfsWordCount 
> ./examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar /spark
>  
> After the app has been running for a period of time, the UI cannot be accessed; please see 
> the attachments.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23121) When the Spark Streaming app is running for a period of time, the page is incorrectly reported when accessing '/jobs/' or '/jobs/job/?id=13' and ui can not be

2018-01-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23121.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20330
[https://github.com/apache/spark/pull/20330]

> When the Spark Streaming app is running for a period of time, the page is 
> incorrectly reported when accessing '/jobs/' or '/jobs/job/?id=13' 
> and ui can not be accessed.
> -
>
> Key: SPARK-23121
> URL: https://issues.apache.org/jira/browse/SPARK-23121
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Assignee: Sandor Murakozi
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: 1.png, 2.png
>
>
> When the Spark Streaming app has been running for a period of time, an error page is 
> reported when accessing '/jobs/' or '/jobs/job/?id=13', 
> and the UI cannot be accessed.
>  
> Test command:
> ./bin/spark-submit --class org.apache.spark.examples.streaming.HdfsWordCount 
> ./examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar /spark
>  
> After the app has been running for a period of time, the UI cannot be accessed; please see 
> the attachments.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23014) Migrate MemorySink fully to v2

2018-01-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23014:


Assignee: Apache Spark

> Migrate MemorySink fully to v2
> --
>
> Key: SPARK-23014
> URL: https://issues.apache.org/jira/browse/SPARK-23014
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>Priority: Major
>
> There's already a MemorySinkV2, but its use is controlled by a flag. We need 
> to remove the V1 sink and always use it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23014) Migrate MemorySink fully to v2

2018-01-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334747#comment-16334747
 ] 

Apache Spark commented on SPARK-23014:
--

User 'jose-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/20351

> Migrate MemorySink fully to v2
> --
>
> Key: SPARK-23014
> URL: https://issues.apache.org/jira/browse/SPARK-23014
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> There's already a MemorySinkV2, but its use is controlled by a flag. We need 
> to remove the V1 sink and always use it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23014) Migrate MemorySink fully to v2

2018-01-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23014:


Assignee: (was: Apache Spark)

> Migrate MemorySink fully to v2
> --
>
> Key: SPARK-23014
> URL: https://issues.apache.org/jira/browse/SPARK-23014
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> There's already a MemorySinkV2, but its use is controlled by a flag. We need 
> to remove the V1 sink and always use it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23186) Initialize DriverManager first before loading Drivers

2018-01-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335382#comment-16335382
 ] 

Apache Spark commented on SPARK-23186:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20359

> Initialize DriverManager first before loading Drivers
> -
>
> Key: SPARK-23186
> URL: https://issues.apache.org/jira/browse/SPARK-23186
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Since some JDBC Drivers have class initialization code to call 
> `DriverManager`, we need to initialize DriverManager first in order to avoid 
> potential deadlock situation like STORM-2527.
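
For illustration only, one way to enforce that ordering is sketched below; this is a 
sketch of the idea, not necessarily the change made in the linked pull requests, and the 
object name is hypothetical:
{code:scala}
import java.sql.DriverManager

// Sketch: force java.sql.DriverManager's static initializer (which performs
// ServiceLoader-based driver discovery) to finish before any JDBC driver class
// is loaded reflectively, so the two class initializations cannot deadlock.
object DriverRegistrySketch { // hypothetical name
  DriverManager.getDrivers()  // touching DriverManager triggers its <clinit> exactly once

  def register(className: String): Unit = {
    // Safe now: drivers such as org.apache.phoenix.jdbc.PhoenixDriver, whose own
    // static initializer calls back into DriverManager, can be loaded.
    Class.forName(className)
  }
}
{code}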



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23186) Initialize DriverManager first before loading Drivers

2018-01-22 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23186:
--
Description: 
Since some JDBC Drivers have class initialization code to call `DriverManager`, 
we need to initialize DriverManager first in order to avoid potential deadlock 
situation like STORM-2527 and like the following in Apache Spark.

{code}
Thread 9587: (state = BLOCKED)
 - 
sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor,
 java.lang.Object[]) @bci=0 (Compiled frame; information may be imprecise)
 - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) 
@bci=85, line=62 (Compiled frame)
 - 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) 
@bci=5, line=45 (Compiled frame)
 - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) @bci=79, 
line=423 (Compiled frame)
 - java.lang.Class.newInstance() @bci=138, line=442 (Compiled frame)
 - java.util.ServiceLoader$LazyIterator.nextService() @bci=119, line=380 
(Interpreted frame)
 - java.util.ServiceLoader$LazyIterator.next() @bci=11, line=404 (Interpreted 
frame)
 - java.util.ServiceLoader$1.next() @bci=37, line=480 (Interpreted frame)
 - java.sql.DriverManager$2.run() @bci=21, line=603 (Interpreted frame)
 - java.sql.DriverManager$2.run() @bci=1, line=583 (Interpreted frame)
 - java.security.AccessController.doPrivileged(java.security.PrivilegedAction) 
@bci=0 (Compiled frame)
 - java.sql.DriverManager.loadInitialDrivers() @bci=27, line=583 (Interpreted 
frame)
 - java.sql.DriverManager.<clinit>() @bci=32, line=101 (Interpreted frame)
 - 
org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(java.lang.String,
 java.lang.Integer, java.lang.String, java.util.Properties) @bci=12, line=98 
(Interpreted frame)
 - 
org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(org.apache.hadoop.conf.Configuration,
 java.util.Properties) @bci=22, line=57 (Interpreted frame)
 - 
org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(org.apache.hadoop.mapreduce.JobContext,
 org.apache.hadoop.conf.Configuration) @bci=61, line=116 (Interpreted frame)
 - 
org.apache.phoenix.mapreduce.PhoenixInputFormat.createRecordReader(org.apache.hadoop.mapreduce.InputSplit,
 org.apache.hadoop.mapreduce.TaskAttemptContext) @bci=10, line=71 (Interpreted 
frame)
 - 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(org.apache.spark.rdd.NewHadoopRDD,
 org.apache.spark.Partition, org.apache.spark.TaskContext) @bci=233, line=156 
(Interpreted frame)

Thread 9170: (state = BLOCKED)
 - org.apache.phoenix.jdbc.PhoenixDriver.<clinit>() @bci=35, line=125 
(Interpreted frame)
 - 
sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor,
 java.lang.Object[]) @bci=0 (Compiled frame)
 - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) 
@bci=85, line=62 (Compiled frame)
 - 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) 
@bci=5, line=45 (Compiled frame)
 - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) @bci=79, 
line=423 (Compiled frame)
 - java.lang.Class.newInstance() @bci=138, line=442 (Compiled frame)
 - 
org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(java.lang.String)
 @bci=89, line=46 (Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply()
 @bci=7, line=53 (Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply()
 @bci=1, line=52 (Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD,
 org.apache.spark.Partition, org.apache.spark.TaskContext) @bci=81, line=347 
(Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(org.apache.spark.Partition,
 org.apache.spark.TaskContext) @bci=7, line=339 (Interpreted frame)
{code}

  was:
Since some JDBC Drivers have class initialization code to call `DriverManager`, 
we need to initialize DriverManager first in order to avoid potential deadlock 
situation like STORM-2527.

{code}
Thread 9587: (state = BLOCKED)
 - 
sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor,
 java.lang.Object[]) @bci=0 (Compiled frame; information may be imprecise)
 - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) 
@bci=85, line=62 (Compiled frame)
 - 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) 
@bci=5, line=45 (Compiled frame)
 - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) @bci=79, 
line=423 (Compiled frame)
 - java.lang.Class.newInstance() @bci=138, line=442 (Compiled frame)
 - java.util.ServiceLoader$LazyIterator.nextService() @bci=119, line=380 
(Interpreted frame)
 - 

[jira] [Commented] (SPARK-23187) Accumulator object can not be sent from Executor to Driver

2018-01-22 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335428#comment-16335428
 ] 

Saisai Shao commented on SPARK-23187:
-

I will try to investigate on it.

> Accumulator object can not be sent from Executor to Driver
> --
>
> Key: SPARK-23187
> URL: https://issues.apache.org/jira/browse/SPARK-23187
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Lantao Jin
>Priority: Major
>
> In Executor.scala's reportHeartBeat(), task metrics values cannot be sent 
> to the Driver (on the receiving side all values are zero).
> I wrote a UT to illustrate this.
> {code}
> diff --git 
> a/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala 
> b/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> index f9481f8..57fb096 100644
> --- a/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> +++ b/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> @@ -17,11 +17,16 @@
>  package org.apache.spark.rpc.netty
> +import scala.collection.mutable.ArrayBuffer
> +
>  import org.scalatest.mockito.MockitoSugar
>  import org.apache.spark._
>  import org.apache.spark.network.client.TransportClient
>  import org.apache.spark.rpc._
> +import org.apache.spark.util.AccumulatorContext
> +import org.apache.spark.util.AccumulatorV2
> +import org.apache.spark.util.LongAccumulator
>  class NettyRpcEnvSuite extends RpcEnvSuite with MockitoSugar {
> @@ -83,5 +88,21 @@ class NettyRpcEnvSuite extends RpcEnvSuite with 
> MockitoSugar {
>  assertRequestMessageEquals(
>msg3,
>RequestMessage(nettyEnv, client, msg3.serialize(nettyEnv)))
> +
> +val acc = new LongAccumulator
> +val sc = SparkContext.getOrCreate(new 
> SparkConf().setMaster("local").setAppName("testAcc"));
> +sc.register(acc, "testAcc")
> +acc.setValue(1)
> +//val msg4 = new RequestMessage(senderAddress, receiver, acc)
> +//assertRequestMessageEquals(
> +//  msg4,
> +//  RequestMessage(nettyEnv, client, msg4.serialize(nettyEnv)))
> +
> +val accbuf = new ArrayBuffer[AccumulatorV2[_, _]]()
> +accbuf += acc
> +val msg5 = new RequestMessage(senderAddress, receiver, accbuf)
> +assertRequestMessageEquals(
> +  msg5,
> +  RequestMessage(nettyEnv, client, msg5.serialize(nettyEnv)))
>}
>  }
> {code}
> msg4 and msg5 both fail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23186) Initialize DriverManager first before loading Drivers

2018-01-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335451#comment-16335451
 ] 

Apache Spark commented on SPARK-23186:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20357

> Initialize DriverManager first before loading Drivers
> -
>
> Key: SPARK-23186
> URL: https://issues.apache.org/jira/browse/SPARK-23186
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Since some JDBC Drivers have class initialization code to call 
> `DriverManager`, we need to initialize DriverManager first in order to avoid 
> potential deadlock situation like the following or STORM-2527.
> {code}
> Thread 9587: (state = BLOCKED)
>  - 
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor,
>  java.lang.Object[]) @bci=0 (Compiled frame; information may be imprecise)
>  - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) 
> @bci=85, line=62 (Compiled frame)
>  - 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) 
> @bci=5, line=45 (Compiled frame)
>  - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) @bci=79, 
> line=423 (Compiled frame)
>  - java.lang.Class.newInstance() @bci=138, line=442 (Compiled frame)
>  - java.util.ServiceLoader$LazyIterator.nextService() @bci=119, line=380 
> (Interpreted frame)
>  - java.util.ServiceLoader$LazyIterator.next() @bci=11, line=404 (Interpreted 
> frame)
>  - java.util.ServiceLoader$1.next() @bci=37, line=480 (Interpreted frame)
>  - java.sql.DriverManager$2.run() @bci=21, line=603 (Interpreted frame)
>  - java.sql.DriverManager$2.run() @bci=1, line=583 (Interpreted frame)
>  - 
> java.security.AccessController.doPrivileged(java.security.PrivilegedAction) 
> @bci=0 (Compiled frame)
>  - java.sql.DriverManager.loadInitialDrivers() @bci=27, line=583 (Interpreted 
> frame)
>  - java.sql.DriverManager.<clinit>() @bci=32, line=101 (Interpreted frame)
>  - 
> org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(java.lang.String,
>  java.lang.Integer, java.lang.String, java.util.Properties) @bci=12, line=98 
> (Interpreted frame)
>  - 
> org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(org.apache.hadoop.conf.Configuration,
>  java.util.Properties) @bci=22, line=57 (Interpreted frame)
>  - 
> org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(org.apache.hadoop.mapreduce.JobContext,
>  org.apache.hadoop.conf.Configuration) @bci=61, line=116 (Interpreted frame)
>  - 
> org.apache.phoenix.mapreduce.PhoenixInputFormat.createRecordReader(org.apache.hadoop.mapreduce.InputSplit,
>  org.apache.hadoop.mapreduce.TaskAttemptContext) @bci=10, line=71 
> (Interpreted frame)
>  - 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(org.apache.spark.rdd.NewHadoopRDD,
>  org.apache.spark.Partition, org.apache.spark.TaskContext) @bci=233, line=156 
> (Interpreted frame)
> Thread 9170: (state = BLOCKED)
>  - org.apache.phoenix.jdbc.PhoenixDriver.<clinit>() @bci=35, line=125 
> (Interpreted frame)
>  - 
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor,
>  java.lang.Object[]) @bci=0 (Compiled frame)
>  - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) 
> @bci=85, line=62 (Compiled frame)
>  - 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) 
> @bci=5, line=45 (Compiled frame)
>  - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) @bci=79, 
> line=423 (Compiled frame)
>  - java.lang.Class.newInstance() @bci=138, line=442 (Compiled frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(java.lang.String)
>  @bci=89, line=46 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply()
>  @bci=7, line=53 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply()
>  @bci=1, line=52 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD,
>  org.apache.spark.Partition, org.apache.spark.TaskContext) @bci=81, line=347 
> (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(org.apache.spark.Partition,
>  org.apache.spark.TaskContext) @bci=7, line=339 (Interpreted frame)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23186) Initialize DriverManager first before loading Drivers

2018-01-22 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23186:
--
Summary: Initialize DriverManager first before loading Drivers  (was: 
Loading JDBC Drivers should be syncronized)

> Initialize DriverManager first before loading Drivers
> -
>
> Key: SPARK-23186
> URL: https://issues.apache.org/jira/browse/SPARK-23186
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Since some JDBC Drivers have class initialization code to call 
> `DriverManager`, we need to synchronize them to avoid potential deadlock 
> situation like STORM-2527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23186) Initialize DriverManager first before loading Drivers

2018-01-22 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23186:
--
Description: Since some JDBC Drivers have class initialization code to call 
`DriverManager`, we need to initialize DriverManager first in order to avoid 
potential deadlock situation like STORM-2527.  (was: Since some JDBC Drivers 
have class initialization code to call `DriverManager`, we need to synchronize 
them to avoid potential deadlock situation like STORM-2527.)

> Initialize DriverManager first before loading Drivers
> -
>
> Key: SPARK-23186
> URL: https://issues.apache.org/jira/browse/SPARK-23186
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Since some JDBC Drivers have class initialization code to call 
> `DriverManager`, we need to initialize DriverManager first in order to avoid 
> potential deadlock situation like STORM-2527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22274) User-defined aggregation functions with pandas udf

2018-01-22 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-22274.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 19872
[https://github.com/apache/spark/pull/19872]

> User-defined aggregation functions with pandas udf
> --
>
> Key: SPARK-22274
> URL: https://issues.apache.org/jira/browse/SPARK-22274
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
> Fix For: 2.4.0
>
>
> This function doesn't implement partial aggregation and shuffles all data. A 
> UDAF that supports partial aggregation is not covered by this Jira.
> Example:
> {code:java}
> @pandas_udf(DoubleType())
> def mean(v):
>   return v.mean()
> df.groupby('id').apply(mean(df.v1), mean(df.v2))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23187) Accumulator object can not be sent from Executor to Driver

2018-01-22 Thread Lantao Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335424#comment-16335424
 ] 

Lantao Jin commented on SPARK-23187:


Hi [~jerryshao], do you have time to look at it?

> Accumulator object can not be sent from Executor to Driver
> --
>
> Key: SPARK-23187
> URL: https://issues.apache.org/jira/browse/SPARK-23187
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Lantao Jin
>Priority: Major
>
> In Executor.scala's reportHeartBeat(), task metrics values cannot be sent to 
> the Driver (on the receiving side all values are zero).
> I wrote a UT to demonstrate this.
> {code}
> diff --git 
> a/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala 
> b/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> index f9481f8..57fb096 100644
> --- a/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> +++ b/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> @@ -17,11 +17,16 @@
>  package org.apache.spark.rpc.netty
> +import scala.collection.mutable.ArrayBuffer
> +
>  import org.scalatest.mockito.MockitoSugar
>  import org.apache.spark._
>  import org.apache.spark.network.client.TransportClient
>  import org.apache.spark.rpc._
> +import org.apache.spark.util.AccumulatorContext
> +import org.apache.spark.util.AccumulatorV2
> +import org.apache.spark.util.LongAccumulator
>  class NettyRpcEnvSuite extends RpcEnvSuite with MockitoSugar {
> @@ -83,5 +88,21 @@ class NettyRpcEnvSuite extends RpcEnvSuite with 
> MockitoSugar {
>  assertRequestMessageEquals(
>msg3,
>RequestMessage(nettyEnv, client, msg3.serialize(nettyEnv)))
> +
> +val acc = new LongAccumulator
> +val sc = SparkContext.getOrCreate(new 
> SparkConf().setMaster("local").setAppName("testAcc"));
> +sc.register(acc, "testAcc")
> +acc.setValue(1)
> +//val msg4 = new RequestMessage(senderAddress, receiver, acc)
> +//assertRequestMessageEquals(
> +//  msg4,
> +//  RequestMessage(nettyEnv, client, msg4.serialize(nettyEnv)))
> +
> +val accbuf = new ArrayBuffer[AccumulatorV2[_, _]]()
> +accbuf += acc
> +val msg5 = new RequestMessage(senderAddress, receiver, accbuf)
> +assertRequestMessageEquals(
> +  msg5,
> +  RequestMessage(nettyEnv, client, msg5.serialize(nettyEnv)))
>}
>  }
> {code}
> msg4 and msg5 both fail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23177) PySpark parameter-less UDFs raise exception if applied after distinct

2018-01-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23177:


Assignee: (was: Apache Spark)

> PySpark parameter-less UDFs raise exception if applied after distinct
> -
>
> Key: SPARK-23177
> URL: https://issues.apache.org/jira/browse/SPARK-23177
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.2, 2.2.0, 2.2.1
>Reporter: Jakub Wasikowski
>Priority: Major
>
> It seems there is an issue with UDFs that take no arguments, but only if the 
> UDF is applied after a {{distinct()}} operation.
> Here is a short example that reproduces the issue in the PySpark shell:
> {code:java}
> import pyspark.sql.functions as f
> import uuid
> df = spark.createDataFrame([(1,2), (3,4)])
> f_udf = f.udf(lambda: str(uuid.uuid4()))
> df.distinct().withColumn("a", f_udf()).show()
> {code}
> and it raises the following exception:
> {noformat}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/opt/spark/python/pyspark/sql/dataframe.py", line 336, in show
> print(self._jdf.showString(n, 20))
>   File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 1133, in __call__
>   File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
> 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o54.showString.
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: pythonUDF0#16
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$33.apply(HashAggregateExec.scala:475)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$33.apply(HashAggregateExec.scala:474)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultCode(HashAggregateExec.scala:474)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:612)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:148)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  

[jira] [Assigned] (SPARK-23177) PySpark parameter-less UDFs raise exception if applied after distinct

2018-01-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23177:


Assignee: Apache Spark

> PySpark parameter-less UDFs raise exception if applied after distinct
> -
>
> Key: SPARK-23177
> URL: https://issues.apache.org/jira/browse/SPARK-23177
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.2, 2.2.0, 2.2.1
>Reporter: Jakub Wasikowski
>Assignee: Apache Spark
>Priority: Major
>
> It seems there is an issue with UDFs that take no arguments, but only if the 
> UDF is applied after a {{distinct()}} operation.
> Here is a short example that reproduces the issue in the PySpark shell:
> {code:java}
> import pyspark.sql.functions as f
> import uuid
> df = spark.createDataFrame([(1,2), (3,4)])
> f_udf = f.udf(lambda: str(uuid.uuid4()))
> df.distinct().withColumn("a", f_udf()).show()
> {code}
> and it raises the following exception:
> {noformat}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/opt/spark/python/pyspark/sql/dataframe.py", line 336, in show
> print(self._jdf.showString(n, 20))
>   File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 1133, in __call__
>   File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
> 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o54.showString.
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: pythonUDF0#16
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$33.apply(HashAggregateExec.scala:475)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$33.apply(HashAggregateExec.scala:474)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultCode(HashAggregateExec.scala:474)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:612)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:148)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> 

[jira] [Commented] (SPARK-23177) PySpark parameter-less UDFs raise exception if applied after distinct

2018-01-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335433#comment-16335433
 ] 

Apache Spark commented on SPARK-23177:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/20360

> PySpark parameter-less UDFs raise exception if applied after distinct
> -
>
> Key: SPARK-23177
> URL: https://issues.apache.org/jira/browse/SPARK-23177
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.2, 2.2.0, 2.2.1
>Reporter: Jakub Wasikowski
>Priority: Major
>
> It seems there is an issue with UDFs that take no arguments, but only if the 
> UDF is applied after a {{distinct()}} operation.
> Here is a short example that reproduces the issue in the PySpark shell:
> {code:java}
> import pyspark.sql.functions as f
> import uuid
> df = spark.createDataFrame([(1,2), (3,4)])
> f_udf = f.udf(lambda: str(uuid.uuid4()))
> df.distinct().withColumn("a", f_udf()).show()
> {code}
> and it raises the following exception:
> {noformat}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/opt/spark/python/pyspark/sql/dataframe.py", line 336, in show
> print(self._jdf.showString(n, 20))
>   File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 1133, in __call__
>   File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
> 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o54.showString.
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: pythonUDF0#16
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$33.apply(HashAggregateExec.scala:475)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$33.apply(HashAggregateExec.scala:474)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultCode(HashAggregateExec.scala:474)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:612)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:148)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

[jira] [Created] (SPARK-23187) Accumulator object can not be sent from Executor to Driver

2018-01-22 Thread Lantao Jin (JIRA)
Lantao Jin created SPARK-23187:
--

 Summary: Accumulator object can not be sent from Executor to Driver
 Key: SPARK-23187
 URL: https://issues.apache.org/jira/browse/SPARK-23187
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: Lantao Jin


In Executor.scala's reportHeartBeat(), task metrics values cannot be sent to the 
Driver (on the receiving side all values are zero).

I wrote a UT to demonstrate this.
{code}
diff --git 
a/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala 
b/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
index f9481f8..57fb096 100644
--- a/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
+++ b/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
@@ -17,11 +17,16 @@

 package org.apache.spark.rpc.netty

+import scala.collection.mutable.ArrayBuffer
+
 import org.scalatest.mockito.MockitoSugar

 import org.apache.spark._
 import org.apache.spark.network.client.TransportClient
 import org.apache.spark.rpc._
+import org.apache.spark.util.AccumulatorContext
+import org.apache.spark.util.AccumulatorV2
+import org.apache.spark.util.LongAccumulator

 class NettyRpcEnvSuite extends RpcEnvSuite with MockitoSugar {

@@ -83,5 +88,21 @@ class NettyRpcEnvSuite extends RpcEnvSuite with MockitoSugar 
{
 assertRequestMessageEquals(
   msg3,
   RequestMessage(nettyEnv, client, msg3.serialize(nettyEnv)))
+
+val acc = new LongAccumulator
+val sc = SparkContext.getOrCreate(new 
SparkConf().setMaster("local").setAppName("testAcc"));
+sc.register(acc, "testAcc")
+acc.setValue(1)
+//val msg4 = new RequestMessage(senderAddress, receiver, acc)
+//assertRequestMessageEquals(
+//  msg4,
+//  RequestMessage(nettyEnv, client, msg4.serialize(nettyEnv)))
+
+val accbuf = new ArrayBuffer[AccumulatorV2[_, _]]()
+accbuf += acc
+val msg5 = new RequestMessage(senderAddress, receiver, accbuf)
+assertRequestMessageEquals(
+  msg5,
+  RequestMessage(nettyEnv, client, msg5.serialize(nettyEnv)))
   }
 }
{code}

msg4 and msg5 both fail.
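A self-contained sketch of the round trip the UT above is trying to assert (the 
app name and the choice of JavaSerializer are arbitrary here): serialize a 
registered LongAccumulator and compare values. Whether the value survives depends 
on AccumulatorV2's serialization hooks, which is exactly what is in question.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.JavaSerializer
import org.apache.spark.util.LongAccumulator

object AccumulatorRoundTrip {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("accRoundTrip")
    val sc = SparkContext.getOrCreate(conf)
    val acc = new LongAccumulator
    sc.register(acc, "testAcc")
    acc.add(1L)

    // Serialize and deserialize the accumulator the way an RPC payload would be.
    val ser = new JavaSerializer(conf).newInstance()
    val copy = ser.deserialize[LongAccumulator](ser.serialize(acc))

    // If the copy reports 0, the value was dropped in transit, matching the
    // zeros observed on the driver side.
    println(s"original=${acc.value}, after round trip=${copy.value}")
    sc.stop()
  }
}
{code}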



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23187) Accumulator object can not be sent from Executor to Driver

2018-01-22 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-23187:
---
Affects Version/s: 2.3.1

> Accumulator object can not be sent from Executor to Driver
> --
>
> Key: SPARK-23187
> URL: https://issues.apache.org/jira/browse/SPARK-23187
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.3.1
>Reporter: Lantao Jin
>Priority: Major
>
> In Executor.scala's reportHeartBeat(), task metrics values cannot be sent to 
> the Driver (on the receiving side all values are zero).
> I wrote a UT to demonstrate this.
> {code}
> diff --git 
> a/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala 
> b/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> index f9481f8..57fb096 100644
> --- a/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> +++ b/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> @@ -17,11 +17,16 @@
>  package org.apache.spark.rpc.netty
> +import scala.collection.mutable.ArrayBuffer
> +
>  import org.scalatest.mockito.MockitoSugar
>  import org.apache.spark._
>  import org.apache.spark.network.client.TransportClient
>  import org.apache.spark.rpc._
> +import org.apache.spark.util.AccumulatorContext
> +import org.apache.spark.util.AccumulatorV2
> +import org.apache.spark.util.LongAccumulator
>  class NettyRpcEnvSuite extends RpcEnvSuite with MockitoSugar {
> @@ -83,5 +88,21 @@ class NettyRpcEnvSuite extends RpcEnvSuite with 
> MockitoSugar {
>  assertRequestMessageEquals(
>msg3,
>RequestMessage(nettyEnv, client, msg3.serialize(nettyEnv)))
> +
> +val acc = new LongAccumulator
> +val sc = SparkContext.getOrCreate(new 
> SparkConf().setMaster("local").setAppName("testAcc"));
> +sc.register(acc, "testAcc")
> +acc.setValue(1)
> +//val msg4 = new RequestMessage(senderAddress, receiver, acc)
> +//assertRequestMessageEquals(
> +//  msg4,
> +//  RequestMessage(nettyEnv, client, msg4.serialize(nettyEnv)))
> +
> +val accbuf = new ArrayBuffer[AccumulatorV2[_, _]]()
> +accbuf += acc
> +val msg5 = new RequestMessage(senderAddress, receiver, accbuf)
> +assertRequestMessageEquals(
> +  msg5,
> +  RequestMessage(nettyEnv, client, msg5.serialize(nettyEnv)))
>}
>  }
> {code}
> msg4 and msg5 both fail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23187) Accumulator object can not be sent from Executor to Driver

2018-01-22 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-23187:

Affects Version/s: (was: 2.3.1)
   2.3.0

> Accumulator object can not be sent from Executor to Driver
> --
>
> Key: SPARK-23187
> URL: https://issues.apache.org/jira/browse/SPARK-23187
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Lantao Jin
>Priority: Major
>
> In Executor.scala's reportHeartBeat(), task metrics values cannot be sent to 
> the Driver (on the receiving side all values are zero).
> I wrote a UT to demonstrate this.
> {code}
> diff --git 
> a/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala 
> b/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> index f9481f8..57fb096 100644
> --- a/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> +++ b/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> @@ -17,11 +17,16 @@
>  package org.apache.spark.rpc.netty
> +import scala.collection.mutable.ArrayBuffer
> +
>  import org.scalatest.mockito.MockitoSugar
>  import org.apache.spark._
>  import org.apache.spark.network.client.TransportClient
>  import org.apache.spark.rpc._
> +import org.apache.spark.util.AccumulatorContext
> +import org.apache.spark.util.AccumulatorV2
> +import org.apache.spark.util.LongAccumulator
>  class NettyRpcEnvSuite extends RpcEnvSuite with MockitoSugar {
> @@ -83,5 +88,21 @@ class NettyRpcEnvSuite extends RpcEnvSuite with 
> MockitoSugar {
>  assertRequestMessageEquals(
>msg3,
>RequestMessage(nettyEnv, client, msg3.serialize(nettyEnv)))
> +
> +val acc = new LongAccumulator
> +val sc = SparkContext.getOrCreate(new 
> SparkConf().setMaster("local").setAppName("testAcc"));
> +sc.register(acc, "testAcc")
> +acc.setValue(1)
> +//val msg4 = new RequestMessage(senderAddress, receiver, acc)
> +//assertRequestMessageEquals(
> +//  msg4,
> +//  RequestMessage(nettyEnv, client, msg4.serialize(nettyEnv)))
> +
> +val accbuf = new ArrayBuffer[AccumulatorV2[_, _]]()
> +accbuf += acc
> +val msg5 = new RequestMessage(senderAddress, receiver, accbuf)
> +assertRequestMessageEquals(
> +  msg5,
> +  RequestMessage(nettyEnv, client, msg5.serialize(nettyEnv)))
>}
>  }
> {code}
> msg4 and msg5 both fail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23187) Accumulator object can not be sent from Executor to Driver

2018-01-22 Thread Lantao Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335431#comment-16335431
 ] 

Lantao Jin commented on SPARK-23187:


The simplest way to verify this is to set a fixed value in reportHeartBeat(), for example:
{code}
diff --git a/core/src/main/scala/org/apache/spark/executor/Executor.scala 
b/core/src/main/scala/org/apache/spark/executor/Executor.scala
index 2c3a8ef..cdb9730 100644
--- a/core/src/main/scala/org/apache/spark/executor/Executor.scala
+++ b/core/src/main/scala/org/apache/spark/executor/Executor.scala
@@ -775,7 +775,7 @@ private[spark] class Executor(
 for (taskRunner <- runningTasks.values().asScala) {
   if (taskRunner.task != null) {
 taskRunner.task.metrics.mergeShuffleReadMetrics()
-taskRunner.task.metrics.setJvmGCTime(curGCTime - 
taskRunner.startGCTime)
+taskRunner.task.metrics.setJvmGCTime(54321)
 accumUpdates += ((taskRunner.taskId, 
taskRunner.task.metrics.accumulators()))
   }
 }
{code}

Then print the value on the driver side: it comes out as 0.

> Accumulator object can not be sent from Executor to Driver
> --
>
> Key: SPARK-23187
> URL: https://issues.apache.org/jira/browse/SPARK-23187
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Lantao Jin
>Priority: Major
>
> In Executor.scala's reportHeartBeat(), task metrics values cannot be sent to 
> the Driver (on the receiving side all values are zero).
> I wrote a UT to demonstrate this.
> {code}
> diff --git 
> a/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala 
> b/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> index f9481f8..57fb096 100644
> --- a/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> +++ b/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> @@ -17,11 +17,16 @@
>  package org.apache.spark.rpc.netty
> +import scala.collection.mutable.ArrayBuffer
> +
>  import org.scalatest.mockito.MockitoSugar
>  import org.apache.spark._
>  import org.apache.spark.network.client.TransportClient
>  import org.apache.spark.rpc._
> +import org.apache.spark.util.AccumulatorContext
> +import org.apache.spark.util.AccumulatorV2
> +import org.apache.spark.util.LongAccumulator
>  class NettyRpcEnvSuite extends RpcEnvSuite with MockitoSugar {
> @@ -83,5 +88,21 @@ class NettyRpcEnvSuite extends RpcEnvSuite with 
> MockitoSugar {
>  assertRequestMessageEquals(
>msg3,
>RequestMessage(nettyEnv, client, msg3.serialize(nettyEnv)))
> +
> +val acc = new LongAccumulator
> +val sc = SparkContext.getOrCreate(new 
> SparkConf().setMaster("local").setAppName("testAcc"));
> +sc.register(acc, "testAcc")
> +acc.setValue(1)
> +//val msg4 = new RequestMessage(senderAddress, receiver, acc)
> +//assertRequestMessageEquals(
> +//  msg4,
> +//  RequestMessage(nettyEnv, client, msg4.serialize(nettyEnv)))
> +
> +val accbuf = new ArrayBuffer[AccumulatorV2[_, _]]()
> +accbuf += acc
> +val msg5 = new RequestMessage(senderAddress, receiver, accbuf)
> +assertRequestMessageEquals(
> +  msg5,
> +  RequestMessage(nettyEnv, client, msg5.serialize(nettyEnv)))
>}
>  }
> {code}
> msg4 and msg5 both fail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22274) User-defined aggregation functions with pandas udf

2018-01-22 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-22274:
-

Assignee: Li Jin

> User-defined aggregation functions with pandas udf
> --
>
> Key: SPARK-22274
> URL: https://issues.apache.org/jira/browse/SPARK-22274
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
>
> This function doesn't implement partial aggregation and shuffles all data. A 
> UDAF that supports partial aggregation is not covered by this JIRA.
> Example:
> {code:java}
> @pandas_udf(DoubleType())
> def mean(v):
>   return v.mean()
> df.groupby('id').apply(mean(df.v1), mean(df.v2))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23186) Initialize DriverManager first before loading Drivers

2018-01-22 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23186:
--
Description: 
Since some JDBC Drivers have class initialization code to call `DriverManager`, 
we need to initialize DriverManager first in order to avoid potential deadlock 
situation like STORM-2527.

{code}
Thread 9587: (state = BLOCKED)
 - 
sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor,
 java.lang.Object[]) @bci=0 (Compiled frame; information may be imprecise)
 - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) 
@bci=85, line=62 (Compiled frame)
 - 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) 
@bci=5, line=45 (Compiled frame)
 - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) @bci=79, 
line=423 (Compiled frame)
 - java.lang.Class.newInstance() @bci=138, line=442 (Compiled frame)
 - java.util.ServiceLoader$LazyIterator.nextService() @bci=119, line=380 
(Interpreted frame)
 - java.util.ServiceLoader$LazyIterator.next() @bci=11, line=404 (Interpreted 
frame)
 - java.util.ServiceLoader$1.next() @bci=37, line=480 (Interpreted frame)
 - java.sql.DriverManager$2.run() @bci=21, line=603 (Interpreted frame)
 - java.sql.DriverManager$2.run() @bci=1, line=583 (Interpreted frame)
 - java.security.AccessController.doPrivileged(java.security.PrivilegedAction) 
@bci=0 (Compiled frame)
 - java.sql.DriverManager.loadInitialDrivers() @bci=27, line=583 (Interpreted 
frame)
 - java.sql.DriverManager.<clinit>() @bci=32, line=101 (Interpreted frame)
 - 
org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(java.lang.String,
 java.lang.Integer, java.lang.String, java.util.Properties) @bci=12, line=98 
(Interpreted frame)
 - 
org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(org.apache.hadoop.conf.Configuration,
 java.util.Properties) @bci=22, line=57 (Interpreted frame)
 - 
org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(org.apache.hadoop.mapreduce.JobContext,
 org.apache.hadoop.conf.Configuration) @bci=61, line=116 (Interpreted frame)
 - 
org.apache.phoenix.mapreduce.PhoenixInputFormat.createRecordReader(org.apache.hadoop.mapreduce.InputSplit,
 org.apache.hadoop.mapreduce.TaskAttemptContext) @bci=10, line=71 (Interpreted 
frame)
 - 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(org.apache.spark.rdd.NewHadoopRDD,
 org.apache.spark.Partition, org.apache.spark.TaskContext) @bci=233, line=156 
(Interpreted frame)

Thread 9170: (state = BLOCKED)
 - org.apache.phoenix.jdbc.PhoenixDriver.<clinit>() @bci=35, line=125 
(Interpreted frame)
 - 
sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor,
 java.lang.Object[]) @bci=0 (Compiled frame)
 - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) 
@bci=85, line=62 (Compiled frame)
 - 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) 
@bci=5, line=45 (Compiled frame)
 - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) @bci=79, 
line=423 (Compiled frame)
 - java.lang.Class.newInstance() @bci=138, line=442 (Compiled frame)
 - 
org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(java.lang.String)
 @bci=89, line=46 (Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply()
 @bci=7, line=53 (Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply()
 @bci=1, line=52 (Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD,
 org.apache.spark.Partition, org.apache.spark.TaskContext) @bci=81, line=347 
(Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(org.apache.spark.Partition,
 org.apache.spark.TaskContext) @bci=7, line=339 (Interpreted frame)
{code}

  was:Since some JDBC Drivers have class initialization code to call 
`DriverManager`, we need to initialize DriverManager first in order to avoid 
potential deadlock situation like STORM-2527.


> Initialize DriverManager first before loading Drivers
> -
>
> Key: SPARK-23186
> URL: https://issues.apache.org/jira/browse/SPARK-23186
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Since some JDBC Drivers have class initialization code to call 
> `DriverManager`, we need to initialize DriverManager first in order to avoid 
> potential deadlock situation like STORM-2527.
> {code}
> Thread 9587: (state = BLOCKED)
>  - 
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor,
>  java.lang.Object[]) @bci=0 

[jira] [Updated] (SPARK-23186) Initialize DriverManager first before loading Drivers

2018-01-22 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23186:
--
Description: 
Since some JDBC Drivers have class initialization code to call `DriverManager`, 
we need to initialize DriverManager first in order to avoid potential deadlock 
situation like the following or STORM-2527.

{code}
Thread 9587: (state = BLOCKED)
 - 
sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor,
 java.lang.Object[]) @bci=0 (Compiled frame; information may be imprecise)
 - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) 
@bci=85, line=62 (Compiled frame)
 - 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) 
@bci=5, line=45 (Compiled frame)
 - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) @bci=79, 
line=423 (Compiled frame)
 - java.lang.Class.newInstance() @bci=138, line=442 (Compiled frame)
 - java.util.ServiceLoader$LazyIterator.nextService() @bci=119, line=380 
(Interpreted frame)
 - java.util.ServiceLoader$LazyIterator.next() @bci=11, line=404 (Interpreted 
frame)
 - java.util.ServiceLoader$1.next() @bci=37, line=480 (Interpreted frame)
 - java.sql.DriverManager$2.run() @bci=21, line=603 (Interpreted frame)
 - java.sql.DriverManager$2.run() @bci=1, line=583 (Interpreted frame)
 - java.security.AccessController.doPrivileged(java.security.PrivilegedAction) 
@bci=0 (Compiled frame)
 - java.sql.DriverManager.loadInitialDrivers() @bci=27, line=583 (Interpreted 
frame)
 - java.sql.DriverManager.<clinit>() @bci=32, line=101 (Interpreted frame)
 - 
org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(java.lang.String,
 java.lang.Integer, java.lang.String, java.util.Properties) @bci=12, line=98 
(Interpreted frame)
 - 
org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(org.apache.hadoop.conf.Configuration,
 java.util.Properties) @bci=22, line=57 (Interpreted frame)
 - 
org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(org.apache.hadoop.mapreduce.JobContext,
 org.apache.hadoop.conf.Configuration) @bci=61, line=116 (Interpreted frame)
 - 
org.apache.phoenix.mapreduce.PhoenixInputFormat.createRecordReader(org.apache.hadoop.mapreduce.InputSplit,
 org.apache.hadoop.mapreduce.TaskAttemptContext) @bci=10, line=71 (Interpreted 
frame)
 - 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(org.apache.spark.rdd.NewHadoopRDD,
 org.apache.spark.Partition, org.apache.spark.TaskContext) @bci=233, line=156 
(Interpreted frame)

Thread 9170: (state = BLOCKED)
 - org.apache.phoenix.jdbc.PhoenixDriver.<clinit>() @bci=35, line=125 
(Interpreted frame)
 - 
sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor,
 java.lang.Object[]) @bci=0 (Compiled frame)
 - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) 
@bci=85, line=62 (Compiled frame)
 - 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) 
@bci=5, line=45 (Compiled frame)
 - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) @bci=79, 
line=423 (Compiled frame)
 - java.lang.Class.newInstance() @bci=138, line=442 (Compiled frame)
 - 
org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(java.lang.String)
 @bci=89, line=46 (Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply()
 @bci=7, line=53 (Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply()
 @bci=1, line=52 (Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD,
 org.apache.spark.Partition, org.apache.spark.TaskContext) @bci=81, line=347 
(Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(org.apache.spark.Partition,
 org.apache.spark.TaskContext) @bci=7, line=339 (Interpreted frame)
{code}

  was:
Since some JDBC Drivers have class initialization code to call `DriverManager`, 
we need to initialize DriverManager first in order to avoid potential deadlock 
situation like STORM-2527 and like the following in Apache Spark.

{code}
Thread 9587: (state = BLOCKED)
 - 
sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor,
 java.lang.Object[]) @bci=0 (Compiled frame; information may be imprecise)
 - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) 
@bci=85, line=62 (Compiled frame)
 - 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) 
@bci=5, line=45 (Compiled frame)
 - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) @bci=79, 
line=423 (Compiled frame)
 - java.lang.Class.newInstance() @bci=138, line=442 (Compiled frame)
 - java.util.ServiceLoader$LazyIterator.nextService() @bci=119, line=380 
(Interpreted frame)
 - 

[jira] [Commented] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py

2018-01-22 Thread seemab (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335463#comment-16335463
 ] 

seemab commented on SPARK-22711:


Kindly share how you updated the code.


> _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from 
> cloudpickle.py
> ---
>
> Key: SPARK-22711
> URL: https://issues.apache.org/jira/browse/SPARK-22711
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.2.0, 2.2.1
> Environment: Ubuntu pseudo distributed installation of Spark 2.2.0
>Reporter: Prateek
>Priority: Major
> Attachments: Jira_Spark_minimized_code.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> When I submit a PySpark program with the spark-submit command, this error is 
> thrown.
> It happens for code like the examples below:
> RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or 
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v)
> Traceback (most recent call last):
>   File "/home/prateek/Project/textrank.py", line 299, in <module>
> summaryRDD = sentenceTokensReduceRDD.map(lambda m: 
> get_summary(m)).reduceByKey(lambda c,v :c+v)
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, 
> in reduceByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, 
> in combineByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, 
> in partitionBy
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, 
> in _jrdd
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, 
> in _wrap_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, 
> in _prepare_for_python_RDD
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 
> 460, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 704, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 148, in dump
>   File "/usr/lib/python3.5/pickle.py", line 408, in dump
> self.save(obj)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit 

[jira] [Commented] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable

2018-01-22 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334226#comment-16334226
 ] 

Burak Yavuz commented on SPARK-23173:
-

In terms of usability, I prefer 1. From a data engineer's viewpoint, I would like 
2 as well, if that's not too hard.

Basically, if I expect my data not to have nulls but it suddenly contains them, I 
would rather have it fail up front (or have the record written out to the 
\_corrupt\_record column).

In an ideal world, I should be able to either permit nullable fields (option 1) 
or have the record written out as corrupt.
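For what it's worth, a caller-side sketch of option 1 (the column names, session 
settings and sample JSON below are made up): declare every field of the schema 
handed to {{from_json}} as nullable, so a missing field can never violate the 
declared schema.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

object NullableFromJson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("nullableFromJson").getOrCreate()
    import spark.implicits._

    // Every field is declared nullable, so absent fields simply become null.
    val schema = StructType(Seq(
      StructField("id", LongType, nullable = true),
      StructField("name", StringType, nullable = true)))

    val df = Seq("""{"id": 1}""").toDF("json") // "name" is deliberately missing
    df.select(from_json(col("json"), schema).as("parsed")).show(false)
    spark.stop()
  }
}
{code}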

> from_json can produce nulls for fields which are marked as non-nullable
> ---
>
> Key: SPARK-23173
> URL: https://issues.apache.org/jira/browse/SPARK-23173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Herman van Hovell
>Priority: Major
>
> The {{from_json}} function uses a schema to convert a string into a Spark SQL 
> struct. This schema can contain non-nullable fields. The underlying 
> {{JsonToStructs}} expression does not check if a resulting struct respects 
> the nullability of the schema. This leads to very weird problems in consuming 
> expressions. In our case parquet writing would produce an illegal parquet 
> file.
> There are roughly two solutions here:
>  # Assume that each field in schema passed to {{from_json}} is nullable, and 
> ignore the nullability information set in the passed schema.
>  # Validate the object during runtime, and fail execution if the data is null 
> where we are not expecting this.
> I currently am slightly in favor of option 1, since this is the more 
> performant option and a lot easier to do.
> WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] [~brkyvz]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23179) Support option to throw exception if overflow occurs

2018-01-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334321#comment-16334321
 ] 

Apache Spark commented on SPARK-23179:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/20350

> Support option to throw exception if overflow occurs
> 
>
> Key: SPARK-23179
> URL: https://issues.apache.org/jira/browse/SPARK-23179
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Major
>
> SQL ANSI 2011 states that in case of overflow during arithmetic operations, 
> an exception should be thrown. This is what most SQL DBs do (e.g. SQL Server, 
> DB2). Hive currently returns NULL (as Spark does), but HIVE-18291 is open to 
> become SQL compliant.
> I propose to have a config option which allows deciding whether Spark should 
> behave according to the SQL standard or in the current way (i.e. returning NULL).
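As a plain-JVM illustration of the two behaviours the proposed option would 
switch between (this is not Spark's implementation), compare silent wrap-around 
with the checked arithmetic of {{java.lang.Math.addExact}}:

{code:scala}
object OverflowModes {
  def main(args: Array[String]): Unit = {
    val a = Long.MaxValue
    val b = 1L

    println(a + b)                   // current-style: wraps around silently

    try {
      println(Math.addExact(a, b))   // ANSI-style: fail loudly on overflow
    } catch {
      case e: ArithmeticException => println(s"overflow detected: ${e.getMessage}")
    }
  }
}
{code}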



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23179) Support option to throw exception if overflow occurs

2018-01-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23179:


Assignee: Apache Spark

> Support option to throw exception if overflow occurs
> 
>
> Key: SPARK-23179
> URL: https://issues.apache.org/jira/browse/SPARK-23179
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Apache Spark
>Priority: Major
>
> SQL ANSI 2011 states that in case of overflow during arithmetic operations, 
> an exception should be thrown. This is what most SQL DBs do (e.g. SQL Server, 
> DB2). Hive currently returns NULL (as Spark does), but HIVE-18291 is open to 
> become SQL compliant.
> I propose to have a config option which allows deciding whether Spark should 
> behave according to the SQL standard or in the current way (i.e. returning NULL).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23179) Support option to throw exception if overflow occurs

2018-01-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23179:


Assignee: (was: Apache Spark)

> Support option to throw exception if overflow occurs
> 
>
> Key: SPARK-23179
> URL: https://issues.apache.org/jira/browse/SPARK-23179
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Major
>
> SQL ANSI 2011 states that in case of overflow during arithmetic operations, 
> an exception should be thrown. This is what most SQL DBs do (e.g. SQL Server, 
> DB2). Hive currently returns NULL (as Spark does), but HIVE-18291 is open to 
> become SQL compliant.
> I propose to have a config option which allows deciding whether Spark should 
> behave according to the SQL standard or in the current way (i.e. returning NULL).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13964) Feature hashing improvements

2018-01-22 Thread Artem Kalchenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334350#comment-16334350
 ] 

Artem Kalchenko commented on SPARK-13964:
-

How about adding hashing for quadratic features like in Vowpal Wabbit:

[https://gist.github.com/luoq/b4c374b5cbabe3ae76ffacdac22750af]
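A rough sketch of the idea (the bucket count, separator and feature names below 
are arbitrary): hash the concatenation of the two interacting feature strings 
into a fixed number of buckets, in the spirit of Vowpal Wabbit's {{-q}} option.

{code:scala}
import scala.util.hashing.MurmurHash3

object QuadraticHashing {
  // Map a pairwise (quadratic) interaction feature to a bucket index.
  def bucket(featureA: String, featureB: String, numBuckets: Int): Int = {
    val h = MurmurHash3.stringHash(featureA + "\u0001" + featureB)
    ((h % numBuckets) + numBuckets) % numBuckets // keep the index non-negative
  }

  def main(args: Array[String]): Unit = {
    println(bucket("country=US", "device=mobile", 1 << 18))
  }
}
{code}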

> Feature hashing improvements
> 
>
> Key: SPARK-13964
> URL: https://issues.apache.org/jira/browse/SPARK-13964
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Nick Pentreath
>Priority: Minor
>
> Investigate improvements to Spark ML feature hashing (see e.g. 
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable

2018-01-22 Thread Michał Świtakowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334388#comment-16334388
 ] 

Michał Świtakowski commented on SPARK-23173:


I think starting with option 1 is a good remedy for the corruption issue.

Verifying the data would certainly be good, and it can be done in at least two 
ways:

(1) detect incorrect data and fail

(2) write rejected data to a separate file/column, as Burak suggests

(1) can even be orthogonal if the verification is done at the level of Parquet 
encoding; that would help avoid the corruption with all sources, not just JSON.

> from_json can produce nulls for fields which are marked as non-nullable
> ---
>
> Key: SPARK-23173
> URL: https://issues.apache.org/jira/browse/SPARK-23173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Herman van Hovell
>Priority: Major
>
> The {{from_json}} function uses a schema to convert a string into a Spark SQL 
> struct. This schema can contain non-nullable fields. The underlying 
> {{JsonToStructs}} expression does not check if a resulting struct respects 
> the nullability of the schema. This leads to very weird problems in consuming 
> expressions. In our case parquet writing would produce an illegal parquet 
> file.
> There are roughly two solutions here:
>  # Assume that each field in schema passed to {{from_json}} is nullable, and 
> ignore the nullability information set in the passed schema.
>  # Validate the object during runtime, and fail execution if the data is null 
> where we are not expecting this.
> I currently am slightly in favor of option 1, since this is the more 
> performant option and a lot easier to do.
> WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] [~brkyvz]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23183) Failure caused by TaskContext is missing in the thread spawned by Custom RDD

2018-01-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23183:
--
Shepherd:   (was: Jiang Xingbo)
   Flags:   (was: Important)
Target Version/s:   (was: 2.2.1)
  Labels:   (was: easyfix)
Priority: Major  (was: Critical)

> Failure caused by TaskContext is missing in the thread spawned by Custom RDD 
> -
>
> Key: SPARK-23183
> URL: https://issues.apache.org/jira/browse/SPARK-23183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.1
> Environment: This issue is only reproducible when user code spawns a 
> new thread where the RDD iterator is used. It can be reproduced in any spark 
> version.
>Reporter: Yongqin Xiao
>Priority: Major
>
> This is related to the already resolved issue 
> https://issues.apache.org/jira/browse/SPARK-18406, which was resolved by 
> "explicitly pass the TID value to the {{unlock}} method".
> The fix resolved the use cases known at that time.
> However, a new use case, which has a custom RDD followed by an aggregator, now 
> fails. The stack trace is:
> {noformat}
> 18/01/22 13:19:51 WARN TaskSetManager: Lost task 75.0 in stage 2.0 (TID 124, 
> psrhcdh58c2c001.informatica.com, executor 3): java.lang.RuntimeException: DTM 
> execution failed with error: 
> com.informatica.powercenter.sdk.dtm.DTMException: 
> java.lang.NullPointerException
> [com.informatica.powercenter.sdk.dtm.DTMException: 
> java.lang.NullPointerException
>  at 
> com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.pushInputDataToDTM(DataExchangeRunnable.java:554)
>  at 
> com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.run(DataExchangeRunnable.java:424)
>  at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>  at  
> <---org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:419)
>  at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:79)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
>  at 
> com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.pushInputDataToDTM(DataExchangeRunnable.java:486)
>  ... 2 more
> ]
>  at 
> com.informatica.products.infatransform.spark.edtm.DTMIteratorForBulkOutput.finishUp(DTMIteratorForBulkOutput.java:94)
>  at 
> com.informatica.products.infatransform.spark.edtm.DTMIteratorForBulkOutput.hasNext(DTMIteratorForBulkOutput.java:53)
>  at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>  at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>  at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:968)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:959)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:899)
>  at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:959)
>  at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:705)
>  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:99)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

[jira] [Commented] (SPARK-23183) Failure caused by TaskContext is missing in the thread spawned by Custom RDD

2018-01-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335168#comment-16335168
 ] 

Sean Owen commented on SPARK-23183:
---

OK; fix the title, then. But the same point stands, I think: spawning your own threads 
isn't guaranteed to work.

> Failure caused by TaskContext is missing in the thread spawned by Custom RDD 
> -
>
> Key: SPARK-23183
> URL: https://issues.apache.org/jira/browse/SPARK-23183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.1
> Environment: This issue is only reproducible when user code spawns a 
> new thread where the RDD iterator is used. It can be reproduced in any spark 
> version.
>Reporter: Yongqin Xiao
>Priority: Major
>
> This is related to the already resolved issue 
> https://issues.apache.org/jira/browse/SPARK-18406., which was resolved by 
> "explicitly pass the TID value to the {{unlock}} method".
> The fix resolved the use cases at that time.
> However, a new use case failed, which has a custom RDD followed by an 
> aggregator. The stack trace is:
> {noformat}
> 18/01/22 13:19:51 WARN TaskSetManager: Lost task 75.0 in stage 2.0 (TID 124, 
> psrhcdh58c2c001.informatica.com, executor 3): java.lang.RuntimeException: DTM 
> execution failed with error: 
> com.informatica.powercenter.sdk.dtm.DTMException: 
> java.lang.NullPointerException
> [com.informatica.powercenter.sdk.dtm.DTMException: 
> java.lang.NullPointerException
>  at 
> com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.pushInputDataToDTM(DataExchangeRunnable.java:554)
>  at 
> com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.run(DataExchangeRunnable.java:424)
>  at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>  at  
> <---org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:419)
>  at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:79)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
>  at 
> com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.pushInputDataToDTM(DataExchangeRunnable.java:486)
>  ... 2 more
> ]
>  at 
> com.informatica.products.infatransform.spark.edtm.DTMIteratorForBulkOutput.finishUp(DTMIteratorForBulkOutput.java:94)
>  at 
> com.informatica.products.infatransform.spark.edtm.DTMIteratorForBulkOutput.hasNext(DTMIteratorForBulkOutput.java:53)
>  at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>  at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>  at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:968)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:959)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:899)
>  at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:959)
>  at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:705)
>  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:99)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> 

[jira] [Updated] (SPARK-23183) Failure caused by TaskContext is missing in the thread spawned by user code

2018-01-22 Thread Yongqin Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongqin Xiao updated SPARK-23183:
-
Summary: Failure caused by TaskContext is missing in the thread spawned by 
user code  (was: Failure caused by TaskContext is missing in the thread spawned 
by Custom RDD )

> Failure caused by TaskContext is missing in the thread spawned by user code
> ---
>
> Key: SPARK-23183
> URL: https://issues.apache.org/jira/browse/SPARK-23183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.1
> Environment: This issue is only reproducible when user code spawns a 
> new thread where the RDD iterator is used. It can be reproduced in any spark 
> version.
>Reporter: Yongqin Xiao
>Priority: Major
>
> This is related to the already resolved issue 
> https://issues.apache.org/jira/browse/SPARK-18406., which was resolved by 
> "explicitly pass the TID value to the {{unlock}} method".
> The fix resolved the use cases at that time.
> However, a new use case failed, which has a custom RDD followed by an 
> aggregator. The stack trace is:
> {noformat}
> 18/01/22 13:19:51 WARN TaskSetManager: Lost task 75.0 in stage 2.0 (TID 124, 
> psrhcdh58c2c001.informatica.com, executor 3): java.lang.RuntimeException: DTM 
> execution failed with error: 
> com.informatica.powercenter.sdk.dtm.DTMException: 
> java.lang.NullPointerException
> [com.informatica.powercenter.sdk.dtm.DTMException: 
> java.lang.NullPointerException
>  at 
> com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.pushInputDataToDTM(DataExchangeRunnable.java:554)
>  at 
> com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.run(DataExchangeRunnable.java:424)
>  at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>  at  
> <---org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:419)
>  at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:79)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
>  at 
> com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.pushInputDataToDTM(DataExchangeRunnable.java:486)
>  ... 2 more
> ]
>  at 
> com.informatica.products.infatransform.spark.edtm.DTMIteratorForBulkOutput.finishUp(DTMIteratorForBulkOutput.java:94)
>  at 
> com.informatica.products.infatransform.spark.edtm.DTMIteratorForBulkOutput.hasNext(DTMIteratorForBulkOutput.java:53)
>  at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>  at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>  at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:968)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:959)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:899)
>  at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:959)
>  at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:705)
>  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:99)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> 

[jira] [Created] (SPARK-23181) Add compatibility tests for SHS serialized data / disk format

2018-01-22 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-23181:
--

 Summary: Add compatibility tests for SHS serialized data / disk 
format
 Key: SPARK-23181
 URL: https://issues.apache.org/jira/browse/SPARK-23181
 Project: Spark
  Issue Type: Task
  Components: Tests
Affects Versions: 2.3.0
Reporter: Marcelo Vanzin


The SHS in 2.3.0 has the ability to serialize history data to disk (see 
SPARK-18085 and its sub-tasks). This means that if either the serialized data 
or the disk format changes, the code needs to be modified to either support the 
old formats, or discard the old data (and re-create it from logs).

We should add integration tests that help us detect when one of these 
changes has occurred. They should check data generated by old versions of Spark 
and fail if that data cannot be read back.

The Hive suites recently added the ability to download old Spark versions and 
generate data with those old versions so that new code can be tested against it; we 
could use something similar here (starting once 2.3.0 is released).
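
As a hedged illustration of what such a compatibility test could look like (the 
suite name, resource path, and assertions below are assumptions, not the actual 
test code):

{code}
import java.io.File

import org.scalatest.FunSuite

import org.apache.spark.util.kvstore.{KVStoreSerializer, LevelDB}

// Sketch only: a disk store generated by an older Spark release is assumed to
// be checked in as a test resource so the current code can try to open it.
class ShsDiskFormatCompatSuite extends FunSuite {

  private val oldStoreDir = new File("src/test/resources/shs-store-generated-by-older-release")

  test("disk store written by an older release can still be opened") {
    assume(oldStoreDir.isDirectory, "pre-generated store not available")
    // Opening the store exercises the on-disk layout; a real test would also
    // read back the serialized wrapper classes, e.g.
    //   store.read(classOf[SomeSerializedWrapper], someKey)   // hypothetical
    val store = new LevelDB(oldStoreDir, new KVStoreSerializer())
    try {
      assert(store != null)
    } finally {
      store.close()
    }
  }
}
{code}

Generating the old-version data itself would follow the same download-and-run 
approach the Hive suites use.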



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11315) Add YARN extension service to publish Spark events to YARN timeline service

2018-01-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-11315:
--

Assignee: Marcelo Vanzin

> Add YARN extension service to publish Spark events to YARN timeline service
> ---
>
> Key: SPARK-11315
> URL: https://issues.apache.org/jira/browse/SPARK-11315
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Hadoop 2.6+
>Reporter: Steve Loughran
>Assignee: Marcelo Vanzin
>Priority: Major
>
> Add an extension service (using SPARK-11314) to subscribe to Spark lifecycle 
> events, batch them and forward them to the YARN Application Timeline Service. 
> This data can then be retrieved by a new back end for the Spark History 
> Service, and by other analytics tools.
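
As a rough sketch of the batching idea (the class below is an illustrative 
assumption, not the SPARK-11314/SPARK-11315 code; postBatchToTimeline stands in 
for the real ATS client call):

{code}
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd, SparkListenerEvent, SparkListenerJobEnd, SparkListenerJobStart}

// Buffer selected lifecycle events and flush them to the timeline service in
// batches instead of one call per event.
class BatchingTimelineListener(batchSize: Int = 100) extends SparkListener {

  private val buffer = ArrayBuffer.empty[SparkListenerEvent]

  private def enqueue(event: SparkListenerEvent): Unit = synchronized {
    buffer += event
    if (buffer.size >= batchSize) flush()
  }

  // Caller must hold the monitor (both call sites are synchronized).
  private def flush(): Unit = {
    if (buffer.nonEmpty) {
      postBatchToTimeline(buffer.toList) // hypothetical ATS client call, no-op here
      buffer.clear()
    }
  }

  private def postBatchToTimeline(events: Seq[SparkListenerEvent]): Unit = ()

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = enqueue(jobStart)
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = enqueue(jobEnd)

  override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = synchronized {
    buffer += end
    flush() // final flush so the tail of the run is not lost
  }
}
{code}

A listener like this could be wired in via the spark.extraListeners configuration; 
the extension-service plumbing that SPARK-11314 adds is what turns it into a 
pluggable YARN-side service.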



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23052) Migrate Microbatch ConsoleSink to v2

2018-01-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-23052:
--
Summary: Migrate Microbatch ConsoleSink to v2  (was: Migrate 
MicrConsoleSink to v2)

> Migrate Microbatch ConsoleSink to v2
> 
>
> Key: SPARK-23052
> URL: https://issues.apache.org/jira/browse/SPARK-23052
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
> Fix For: 2.3.0, 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23103) LevelDB store not iterating correctly when indexed value has negative value

2018-01-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334972#comment-16334972
 ] 

Apache Spark commented on SPARK-23103:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20353

> LevelDB store not iterating correctly when indexed value has negative value
> ---
>
> Key: SPARK-23103
> URL: https://issues.apache.org/jira/browse/SPARK-23103
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.3.0
>
>
> Marking as minor since I don't believe we currently have anything that needs 
> to store negative values in indexed fields. But I wrote a unit test and got:
>  
> {noformat}
> [error] Test 
> org.apache.spark.util.kvstore.LevelDBSuite.testNegativeIndexValues failed: 
> java.lang.AssertionError: expected:<[-50, 0, 50]> but was:<[[0, -50, 50]]>, 
> took 0.025 sec
> {noformat}
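
For context, a self-contained sketch of the class of problem involved (this is 
not the kvstore code, whose key encoding differs, but the principle is the same): 
under the unsigned lexicographic byte comparison LevelDB uses, a naively encoded 
signed value sorts negatives after positives unless the sign is handled 
explicitly, for example by flipping the sign bit.

{code}
import java.nio.ByteBuffer

object SignedIndexKeyOrdering {

  // Naive big-endian encoding of a signed long.
  private def naiveKey(v: Long): Array[Byte] =
    ByteBuffer.allocate(java.lang.Long.BYTES).putLong(v).array()

  // Flipping the sign bit makes signed order agree with unsigned byte order.
  private def signFlippedKey(v: Long): Array[Byte] =
    ByteBuffer.allocate(java.lang.Long.BYTES).putLong(v ^ Long.MinValue).array()

  // Unsigned lexicographic comparison, like LevelDB's default key comparator.
  private def unsignedCompare(a: Array[Byte], b: Array[Byte]): Int =
    a.zip(b).map { case (x, y) => (x & 0xff) - (y & 0xff) }
      .find(_ != 0)
      .getOrElse(a.length - b.length)

  def main(args: Array[String]): Unit = {
    val values = Seq(-50L, 0L, 50L)
    val naive = values.sortWith((a, b) => unsignedCompare(naiveKey(a), naiveKey(b)) < 0)
    val fixed = values.sortWith((a, b) => unsignedCompare(signFlippedKey(a), signFlippedKey(b)) < 0)
    println(s"naive key order: $naive") // negatives are out of place, as in the failing assertion
    println(s"fixed key order: $fixed") // List(-50, 0, 50)
  }
}
{code}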



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23184) All jobs page is broken when some stage is missing

2018-01-22 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335155#comment-16335155
 ] 

Shixiong Zhu commented on SPARK-23184:
--

Not yet. But after reading the patch, I see it's a duplicate.

> All jobs page is broken when some stage is missing
> --
>
> Key: SPARK-23184
> URL: https://issues.apache.org/jira/browse/SPARK-23184
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Fix For: 2.3.0
>
>
> {code}
> h2. HTTP ERROR 500
> Problem accessing /jobs/. Reason:
> Server Error
>  
> h3. Caused by:
> java.util.NoSuchElementException: No stage with id 44959 at 
> org.apache.spark.status.AppStatusStore.lastStageAttempt(AppStatusStore.scala:104)
>  at 
> org.apache.spark.ui.jobs.JobDataSource.org$apache$spark$ui$jobs$JobDataSource$$jobRow(AllJobsPage.scala:430)
>  at 
> org.apache.spark.ui.jobs.JobDataSource$$anonfun$26.apply(AllJobsPage.scala:408)
>  at 
> org.apache.spark.ui.jobs.JobDataSource$$anonfun$26.apply(AllJobsPage.scala:408)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
>  at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.AbstractTraversable.map(Traversable.scala:104) at 
> org.apache.spark.ui.jobs.JobDataSource.(AllJobsPage.scala:408) at 
> org.apache.spark.ui.jobs.JobPagedTable.(AllJobsPage.scala:502) at 
> org.apache.spark.ui.jobs.AllJobsPage.jobsTable(AllJobsPage.scala:244) at 
> org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:293) at 
> org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98) at 
> org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at 
> javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at 
> javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at 
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584) at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) 
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) 
> at 
> org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>  at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>  at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>  at org.eclipse.jetty.server.Server.handle(Server.java:534) at 
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) 
> at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>  at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>  at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>  at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>  at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>  at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>  at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>  at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> This is the index page. It should not crash even if a stage is missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23184) All jobs page is broken when some stage is missing

2018-01-22 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-23184.
--
   Resolution: Duplicate
Fix Version/s: 2.3.0

> All jobs page is broken when some stage is missing
> --
>
> Key: SPARK-23184
> URL: https://issues.apache.org/jira/browse/SPARK-23184
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Fix For: 2.3.0
>
>
> {code}
> h2. HTTP ERROR 500
> Problem accessing /jobs/. Reason:
> Server Error
>  
> h3. Caused by:
> java.util.NoSuchElementException: No stage with id 44959 at 
> org.apache.spark.status.AppStatusStore.lastStageAttempt(AppStatusStore.scala:104)
>  at 
> org.apache.spark.ui.jobs.JobDataSource.org$apache$spark$ui$jobs$JobDataSource$$jobRow(AllJobsPage.scala:430)
>  at 
> org.apache.spark.ui.jobs.JobDataSource$$anonfun$26.apply(AllJobsPage.scala:408)
>  at 
> org.apache.spark.ui.jobs.JobDataSource$$anonfun$26.apply(AllJobsPage.scala:408)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
>  at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.AbstractTraversable.map(Traversable.scala:104) at 
> org.apache.spark.ui.jobs.JobDataSource.(AllJobsPage.scala:408) at 
> org.apache.spark.ui.jobs.JobPagedTable.(AllJobsPage.scala:502) at 
> org.apache.spark.ui.jobs.AllJobsPage.jobsTable(AllJobsPage.scala:244) at 
> org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:293) at 
> org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98) at 
> org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at 
> javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at 
> javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at 
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584) at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) 
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) 
> at 
> org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>  at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>  at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>  at org.eclipse.jetty.server.Server.handle(Server.java:534) at 
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) 
> at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>  at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>  at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>  at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>  at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>  at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>  at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>  at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> This is the index page. It should not crash even if a stage is missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23183) Failure caused by TaskContext is missing in the thread spawned by Custom RDD

2018-01-22 Thread Yongqin Xiao (JIRA)
Yongqin Xiao created SPARK-23183:


 Summary: Failure caused by TaskContext is missing in the thread 
spawned by Custom RDD 
 Key: SPARK-23183
 URL: https://issues.apache.org/jira/browse/SPARK-23183
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.1, 2.1.0
 Environment: This issue is only reproducible when user code spawns a 
new thread in which the RDD iterator is used. It can be reproduced in any Spark 
version.
Reporter: Yongqin Xiao


This is related to the already resolved issue 
https://issues.apache.org/jira/browse/SPARK-18406, which was fixed by 
"explicitly pass the TID value to the {{unlock}} method".

The fix resolved the use cases at that time.

However, a new use case failed, which has a custom RDD followed by an 
aggregator. The stack trace is:


{noformat}
18/01/22 13:19:51 WARN TaskSetManager: Lost task 75.0 in stage 2.0 (TID 124, 
psrhcdh58c2c001.informatica.com, executor 3): java.lang.RuntimeException: DTM 
execution failed with error: com.informatica.powercenter.sdk.dtm.DTMException: 
java.lang.NullPointerException
[com.informatica.powercenter.sdk.dtm.DTMException: 
java.lang.NullPointerException
 at 
com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.pushInputDataToDTM(DataExchangeRunnable.java:554)
 at 
com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.run(DataExchangeRunnable.java:424)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
 at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:419)
 at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:79)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
 at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
 at 
com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.pushInputDataToDTM(DataExchangeRunnable.java:486)
 ... 2 more
]
 at 
com.informatica.products.infatransform.spark.edtm.DTMIteratorForBulkOutput.finishUp(DTMIteratorForBulkOutput.java:94)
 at 
com.informatica.products.infatransform.spark.edtm.DTMIteratorForBulkOutput.hasNext(DTMIteratorForBulkOutput.java:53)
 at 
scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
 at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:968)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:959)
 at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:899)
 at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:959)
 at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:705)
 at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
 at org.apache.spark.scheduler.Task.run(Task.scala:99)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:748)
Caused by: com.informatica.products.infatransform.spark.boot.SparkDTMException: 
com.informatica.powercenter.sdk.dtm.DTMException: java.lang.NullPointerException
[com.informatica.powercenter.sdk.dtm.DTMException: 
java.lang.NullPointerException
 at 
com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.pushInputDataToDTM(DataExchangeRunnable.java:554)
 at 

[jira] [Commented] (SPARK-23183) Failure caused by TaskContext is missing in the thread spawned by Custom RDD

2018-01-22 Thread Yongqin Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335059#comment-16335059
 ] 

Yongqin Xiao commented on SPARK-23183:
--

The exact same issue was reported by another user at 
[https://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition],
 where the reproduction code is very simple.
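
A sketch in the spirit of that reproduction (not a verbatim copy of the linked 
code; for a plain range the job only prints a null TaskContext, but with an 
aggregation upstream the iteration from the spawned thread hits the same kind of 
failure shown in the stack trace below):

{code}
import org.apache.spark.TaskContext
import org.apache.spark.sql.{Row, SparkSession}

// User code hands the partition iterator to a thread it spawns itself.
object ThreadInsideForeachPartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("repro-sketch").getOrCreate()

    spark.range(0, 1000).toDF("id").foreachPartition { rows: Iterator[Row] =>
      val worker = new Thread(new Runnable {
        override def run(): Unit = {
          // TaskContext lives in a ThreadLocal of the task's own thread, so it
          // is missing here; operators that call TaskContext.get() while the
          // iterator is consumed (e.g. the Tungsten aggregation iterator) NPE.
          println(s"TaskContext in spawned thread: ${TaskContext.get()}") // null
          rows.foreach(_ => ())
        }
      })
      worker.start()
      worker.join()
    }
    spark.stop()
  }
}
{code}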

 

> Failure caused by TaskContext is missing in the thread spawned by Custom RDD 
> -
>
> Key: SPARK-23183
> URL: https://issues.apache.org/jira/browse/SPARK-23183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.1
> Environment: This issue is only reproducible when user code spawns a 
> new thread where the RDD iterator is used. It can be reproduced in any spark 
> version.
>Reporter: Yongqin Xiao
>Priority: Critical
>  Labels: easyfix
>
> This is related to the already resolved issue 
> https://issues.apache.org/jira/browse/SPARK-18406., which was resolved by 
> "explicitly pass the TID value to the {{unlock}} method".
> The fix resolved the use cases at that time.
> However, a new use case failed, which has a custom RDD followed by an 
> aggregator. The stack trace is:
> {noformat}
> 18/01/22 13:19:51 WARN TaskSetManager: Lost task 75.0 in stage 2.0 (TID 124, 
> psrhcdh58c2c001.informatica.com, executor 3): java.lang.RuntimeException: DTM 
> execution failed with error: 
> com.informatica.powercenter.sdk.dtm.DTMException: 
> java.lang.NullPointerException
> [com.informatica.powercenter.sdk.dtm.DTMException: 
> java.lang.NullPointerException
>  at 
> com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.pushInputDataToDTM(DataExchangeRunnable.java:554)
>  at 
> com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.run(DataExchangeRunnable.java:424)
>  at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>  at  
> <---org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:419)
>  at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:79)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
>  at 
> com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.pushInputDataToDTM(DataExchangeRunnable.java:486)
>  ... 2 more
> ]
>  at 
> com.informatica.products.infatransform.spark.edtm.DTMIteratorForBulkOutput.finishUp(DTMIteratorForBulkOutput.java:94)
>  at 
> com.informatica.products.infatransform.spark.edtm.DTMIteratorForBulkOutput.hasNext(DTMIteratorForBulkOutput.java:53)
>  at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>  at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>  at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:968)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:959)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:899)
>  at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:959)
>  at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:705)
>  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:99)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>  at 
> 

[jira] [Created] (SPARK-23184) All jobs page is broken when some stage is missing

2018-01-22 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-23184:


 Summary: All jobs page is broken when some stage is missing
 Key: SPARK-23184
 URL: https://issues.apache.org/jira/browse/SPARK-23184
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.0
Reporter: Shixiong Zhu


{code}
h2. HTTP ERROR 500

Problem accessing /jobs/. Reason:

Server Error

 
h3. Caused by:

java.util.NoSuchElementException: No stage with id 44959 at 
org.apache.spark.status.AppStatusStore.lastStageAttempt(AppStatusStore.scala:104)
 at 
org.apache.spark.ui.jobs.JobDataSource.org$apache$spark$ui$jobs$JobDataSource$$jobRow(AllJobsPage.scala:430)
 at 
org.apache.spark.ui.jobs.JobDataSource$$anonfun$26.apply(AllJobsPage.scala:408) 
at 
org.apache.spark.ui.jobs.JobDataSource$$anonfun$26.apply(AllJobsPage.scala:408) 
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at scala.collection.immutable.List.foreach(List.scala:381) at 
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
 at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45) at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
scala.collection.AbstractTraversable.map(Traversable.scala:104) at 
org.apache.spark.ui.jobs.JobDataSource.(AllJobsPage.scala:408) at 
org.apache.spark.ui.jobs.JobPagedTable.(AllJobsPage.scala:502) at 
org.apache.spark.ui.jobs.AllJobsPage.jobsTable(AllJobsPage.scala:244) at 
org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:293) at 
org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98) at 
org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98) at 
org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at 
javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at 
javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at 
org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584) at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
 at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) 
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
 at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) 
at 
org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) 
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
 at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) 
at org.eclipse.jetty.server.Server.handle(Server.java:534) at 
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
 at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) at 
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) 
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
 at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
 at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
 at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
 at 
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) 
at java.lang.Thread.run(Thread.java:748)

{code}

 

This is the index page. It should not crash even if a stage is missing.
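
A minimal sketch of the defensive pattern that last sentence calls for (an 
illustration, not the actual patch): wrap the store lookup so a missing stage is 
rendered as unknown instead of letting NoSuchElementException escape and become 
an HTTP 500.

{code}
import scala.util.Try

object MissingStageHandling {

  // Turn a lookup that throws NoSuchElementException into an Option so the
  // page can render a placeholder row instead of crashing.
  def ifKnown[T](lookup: => T): Option[T] = Try(lookup).toOption

  // Usage inside the jobs page (names taken from the stack trace above;
  // renderRow / missingStageRow are hypothetical helpers):
  //   ifKnown(store.lastStageAttempt(stageId)) match {
  //     case Some(stage) => renderRow(stage)
  //     case None        => missingStageRow()
  //   }
}
{code}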



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23183) Failure caused by TaskContext is missing in the thread spawned by Custom RDD

2018-01-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335153#comment-16335153
 ] 

Sean Owen commented on SPARK-23183:
---

I don't think a custom RDD that spawns threads is guaranteed to work; things 
like ThreadLocals aren't set as expected.
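
For anyone who still needs to do this, one commonly cited workaround (an 
unsupported hack that relies on Spark internals, consistent with the caveat 
above, so it may break between versions) is to capture the task's TaskContext on 
the task thread and re-install it in the spawned thread before the iterator is 
touched. TaskContext.setTaskContext is protected[spark] in the versions cited 
here, so the shim has to be compiled into the org.apache.spark package:

{code}
// Shim placed in the org.apache.spark package so it can reach the
// protected[spark] setter. Relies on internals; consuming the iterator on the
// task's own thread (or buffering rows locally first) remains the safe option.
package org.apache.spark {
  object TaskContextShim {
    def install(tc: TaskContext): Unit = TaskContext.setTaskContext(tc)
  }
}

// Sketch of use inside mapPartitions / a custom RDD's compute:
//
//   val tc = org.apache.spark.TaskContext.get()       // captured on the task thread
//   val worker = new Thread(new Runnable {
//     override def run(): Unit = {
//       org.apache.spark.TaskContextShim.install(tc)  // propagate to this thread
//       // ... consume the parent iterator here ...
//     }
//   })
//   worker.start(); worker.join()
{code}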

> Failure caused by TaskContext is missing in the thread spawned by Custom RDD 
> -
>
> Key: SPARK-23183
> URL: https://issues.apache.org/jira/browse/SPARK-23183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.1
> Environment: This issue is only reproducible when user code spawns a 
> new thread where the RDD iterator is used. It can be reproduced in any spark 
> version.
>Reporter: Yongqin Xiao
>Priority: Critical
>  Labels: easyfix
>
> This is related to the already resolved issue 
> https://issues.apache.org/jira/browse/SPARK-18406., which was resolved by 
> "explicitly pass the TID value to the {{unlock}} method".
> The fix resolved the use cases at that time.
> However, a new use case failed, which has a custom RDD followed by an 
> aggregator. The stack trace is:
> {noformat}
> 18/01/22 13:19:51 WARN TaskSetManager: Lost task 75.0 in stage 2.0 (TID 124, 
> psrhcdh58c2c001.informatica.com, executor 3): java.lang.RuntimeException: DTM 
> execution failed with error: 
> com.informatica.powercenter.sdk.dtm.DTMException: 
> java.lang.NullPointerException
> [com.informatica.powercenter.sdk.dtm.DTMException: 
> java.lang.NullPointerException
>  at 
> com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.pushInputDataToDTM(DataExchangeRunnable.java:554)
>  at 
> com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.run(DataExchangeRunnable.java:424)
>  at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>  at  
> <---org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:419)
>  at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:79)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
>  at 
> com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.pushInputDataToDTM(DataExchangeRunnable.java:486)
>  ... 2 more
> ]
>  at 
> com.informatica.products.infatransform.spark.edtm.DTMIteratorForBulkOutput.finishUp(DTMIteratorForBulkOutput.java:94)
>  at 
> com.informatica.products.infatransform.spark.edtm.DTMIteratorForBulkOutput.hasNext(DTMIteratorForBulkOutput.java:53)
>  at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>  at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>  at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:968)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:959)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:899)
>  at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:959)
>  at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:705)
>  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:99)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> 

[jira] [Updated] (SPARK-23183) Failure caused by TaskContext is missing in the thread spawned by Custom RDD

2018-01-22 Thread Yongqin Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongqin Xiao updated SPARK-23183:
-
Description: 
This is related to the already resolved issue 
https://issues.apache.org/jira/browse/SPARK-18406, which was fixed by 
"explicitly pass the TID value to the {{unlock}} method".

The fix resolved the use cases at that time.

However, a new use case failed, which has a custom RDD followed by an 
aggregator. The stack trace is:
{noformat}
18/01/22 13:19:51 WARN TaskSetManager: Lost task 75.0 in stage 2.0 (TID 124, 
psrhcdh58c2c001.informatica.com, executor 3): java.lang.RuntimeException: DTM 
execution failed with error: com.informatica.powercenter.sdk.dtm.DTMException: 
java.lang.NullPointerException
[com.informatica.powercenter.sdk.dtm.DTMException: 
java.lang.NullPointerException
 at 
com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.pushInputDataToDTM(DataExchangeRunnable.java:554)
 at 
com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.run(DataExchangeRunnable.java:424)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
 at  
<---org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:419)
 at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:79)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
 at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
 at 
com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.pushInputDataToDTM(DataExchangeRunnable.java:486)
 ... 2 more
]
 at 
com.informatica.products.infatransform.spark.edtm.DTMIteratorForBulkOutput.finishUp(DTMIteratorForBulkOutput.java:94)
 at 
com.informatica.products.infatransform.spark.edtm.DTMIteratorForBulkOutput.hasNext(DTMIteratorForBulkOutput.java:53)
 at 
scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
 at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:968)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:959)
 at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:899)
 at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:959)
 at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:705)
 at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
 at org.apache.spark.scheduler.Task.run(Task.scala:99)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:748)
Caused by: com.informatica.products.infatransform.spark.boot.SparkDTMException: 
com.informatica.powercenter.sdk.dtm.DTMException: java.lang.NullPointerException
[com.informatica.powercenter.sdk.dtm.DTMException: 
java.lang.NullPointerException
 at 
com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.pushInputDataToDTM(DataExchangeRunnable.java:554)
 at 
com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.run(DataExchangeRunnable.java:424)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
 at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:419)
 at 

[jira] [Assigned] (SPARK-23163) Sync Python ML API docs with Scala

2018-01-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23163:


Assignee: (was: Apache Spark)

> Sync Python ML API docs with Scala
> --
>
> Key: SPARK-23163
> URL: https://issues.apache.org/jira/browse/SPARK-23163
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Trivial
>
> Fix a few doc issues as reported in 2.3 ML QA SPARK-23109



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23163) Sync Python ML API docs with Scala

2018-01-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335040#comment-16335040
 ] 

Apache Spark commented on SPARK-23163:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/20354

> Sync Python ML API docs with Scala
> --
>
> Key: SPARK-23163
> URL: https://issues.apache.org/jira/browse/SPARK-23163
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Trivial
>
> Fix a few doc issues as reported in 2.3 ML QA SPARK-23109



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23163) Sync Python ML API docs with Scala

2018-01-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23163:


Assignee: Apache Spark

> Sync Python ML API docs with Scala
> --
>
> Key: SPARK-23163
> URL: https://issues.apache.org/jira/browse/SPARK-23163
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Trivial
>
> Fix a few doc issues as reported in 2.3 ML QA SPARK-23109



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23184) All jobs page is broken when some stage is missing

2018-01-22 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335128#comment-16335128
 ] 

Shixiong Zhu commented on SPARK-23184:
--

This seems to be caused by the fix for SPARK-23051.

> All jobs page is broken when some stage is missing
> --
>
> Key: SPARK-23184
> URL: https://issues.apache.org/jira/browse/SPARK-23184
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Priority: Blocker
>
> {code}
> h2. HTTP ERROR 500
> Problem accessing /jobs/. Reason:
> Server Error
>  
> h3. Caused by:
> java.util.NoSuchElementException: No stage with id 44959 at 
> org.apache.spark.status.AppStatusStore.lastStageAttempt(AppStatusStore.scala:104)
>  at 
> org.apache.spark.ui.jobs.JobDataSource.org$apache$spark$ui$jobs$JobDataSource$$jobRow(AllJobsPage.scala:430)
>  at 
> org.apache.spark.ui.jobs.JobDataSource$$anonfun$26.apply(AllJobsPage.scala:408)
>  at 
> org.apache.spark.ui.jobs.JobDataSource$$anonfun$26.apply(AllJobsPage.scala:408)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
>  at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.AbstractTraversable.map(Traversable.scala:104) at 
> org.apache.spark.ui.jobs.JobDataSource.(AllJobsPage.scala:408) at 
> org.apache.spark.ui.jobs.JobPagedTable.(AllJobsPage.scala:502) at 
> org.apache.spark.ui.jobs.AllJobsPage.jobsTable(AllJobsPage.scala:244) at 
> org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:293) at 
> org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98) at 
> org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at 
> javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at 
> javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at 
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584) at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) 
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) 
> at 
> org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>  at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>  at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>  at org.eclipse.jetty.server.Server.handle(Server.java:534) at 
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) 
> at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>  at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>  at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>  at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>  at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>  at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>  at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>  at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> This is the index page. It should not crash even if a stage is missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23184) All jobs page is broken when some stage is missing

2018-01-22 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335131#comment-16335131
 ] 

Shixiong Zhu commented on SPARK-23184:
--

cc [~vanzin] [~smurakozi] [~cloud_fan]

> All jobs page is broken when some stage is missing
> --
>
> Key: SPARK-23184
> URL: https://issues.apache.org/jira/browse/SPARK-23184
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Priority: Blocker
>
> {code}
> h2. HTTP ERROR 500
> Problem accessing /jobs/. Reason:
> Server Error
>  
> h3. Caused by:
> java.util.NoSuchElementException: No stage with id 44959 at 
> org.apache.spark.status.AppStatusStore.lastStageAttempt(AppStatusStore.scala:104)
>  at 
> org.apache.spark.ui.jobs.JobDataSource.org$apache$spark$ui$jobs$JobDataSource$$jobRow(AllJobsPage.scala:430)
>  at 
> org.apache.spark.ui.jobs.JobDataSource$$anonfun$26.apply(AllJobsPage.scala:408)
>  at 
> org.apache.spark.ui.jobs.JobDataSource$$anonfun$26.apply(AllJobsPage.scala:408)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
>  at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.AbstractTraversable.map(Traversable.scala:104) at 
> org.apache.spark.ui.jobs.JobDataSource.(AllJobsPage.scala:408) at 
> org.apache.spark.ui.jobs.JobPagedTable.(AllJobsPage.scala:502) at 
> org.apache.spark.ui.jobs.AllJobsPage.jobsTable(AllJobsPage.scala:244) at 
> org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:293) at 
> org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98) at 
> org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at 
> javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at 
> javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at 
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584) at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) 
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) 
> at 
> org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>  at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>  at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>  at org.eclipse.jetty.server.Server.handle(Server.java:534) at 
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) 
> at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>  at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>  at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>  at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>  at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>  at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>  at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>  at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> This is the index page. It should not crash even if a stage is missing.
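
A minimal sketch of one way a page can tolerate a missing stage (not the actual fix; StageStore and StageInfo below are hypothetical stand-ins for the real status store): treat the per-stage lookup as optional so an evicted stage only drops its row instead of failing the whole page with a NoSuchElementException.

{code:scala}
import scala.util.Try

// Hypothetical stand-ins for the UI's status store; names are illustrative only.
final case class StageInfo(stageId: Int, name: String)

class StageStore(stages: Map[Int, StageInfo]) {
  // Throws NoSuchElementException when the stage has been evicted, as in the report above.
  def lastStageAttempt(stageId: Int): StageInfo = stages(stageId)
}

object RenderSketch {
  // Skip rows whose stage data is gone instead of failing the whole page.
  def jobRows(store: StageStore, stageIds: Seq[Int]): Seq[String] =
    stageIds.flatMap { id =>
      Try(store.lastStageAttempt(id)).toOption.map(s => s"${s.stageId}: ${s.name}")
    }

  def main(args: Array[String]): Unit = {
    val store = new StageStore(Map(1 -> StageInfo(1, "collect")))
    // Stage 44959 is missing, but the remaining rows still render.
    println(jobRows(store, Seq(1, 44959)))
  }
}
{code}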



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23184) All jobs page is broken when some stage is missing

2018-01-22 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335146#comment-16335146
 ] 

Marcelo Vanzin commented on SPARK-23184:


Did you test with the fix for SPARK-23121?

> All jobs page is broken when some stage is missing
> --
>
> Key: SPARK-23184
> URL: https://issues.apache.org/jira/browse/SPARK-23184
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Priority: Blocker
>
> {code}
> h2. HTTP ERROR 500
> Problem accessing /jobs/. Reason:
> Server Error
>  
> h3. Caused by:
> java.util.NoSuchElementException: No stage with id 44959 at 
> org.apache.spark.status.AppStatusStore.lastStageAttempt(AppStatusStore.scala:104)
>  at 
> org.apache.spark.ui.jobs.JobDataSource.org$apache$spark$ui$jobs$JobDataSource$$jobRow(AllJobsPage.scala:430)
>  at 
> org.apache.spark.ui.jobs.JobDataSource$$anonfun$26.apply(AllJobsPage.scala:408)
>  at 
> org.apache.spark.ui.jobs.JobDataSource$$anonfun$26.apply(AllJobsPage.scala:408)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
>  at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.AbstractTraversable.map(Traversable.scala:104) at 
> org.apache.spark.ui.jobs.JobDataSource.(AllJobsPage.scala:408) at 
> org.apache.spark.ui.jobs.JobPagedTable.(AllJobsPage.scala:502) at 
> org.apache.spark.ui.jobs.AllJobsPage.jobsTable(AllJobsPage.scala:244) at 
> org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:293) at 
> org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98) at 
> org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at 
> javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at 
> javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at 
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584) at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) 
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) 
> at 
> org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>  at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>  at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>  at org.eclipse.jetty.server.Server.handle(Server.java:534) at 
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) 
> at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>  at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>  at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>  at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>  at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>  at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>  at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>  at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> This is the index page. It should not crash even if a stage is missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22389) partitioning reporting

2018-01-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22389.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> partitioning reporting
> --
>
> Key: SPARK-22389
> URL: https://issues.apache.org/jira/browse/SPARK-22389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.0
>
>
> We should allow a data source to report its partitioning and avoid a shuffle on the 
> Spark side.
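
To make the idea concrete, a hedged sketch with hypothetical trait names (this is not the actual Data Source V2 reader API): the source declares how its output is clustered, and the planner only inserts an exchange when the required distribution is not already satisfied.

{code:scala}
// Hypothetical interfaces, illustrative only; not the real v2 reader API.
trait Distribution { def clusteringKeys: Seq[String] }

trait ReportsPartitioning {
  def numPartitions: Int
  // True if the data is already clustered the way the query needs it.
  def satisfies(required: Distribution): Boolean
}

final case class ClusteredSource(keys: Seq[String], numPartitions: Int) extends ReportsPartitioning {
  override def satisfies(required: Distribution): Boolean =
    required.clusteringKeys.forall(keys.contains)
}

object PlannerSketch {
  // Planner-side decision: only shuffle when the source cannot satisfy the distribution.
  def needsShuffle(source: ReportsPartitioning, required: Distribution): Boolean =
    !source.satisfies(required)

  def main(args: Array[String]): Unit = {
    val source = ClusteredSource(Seq("user_id"), numPartitions = 8)
    val required = new Distribution { val clusteringKeys = Seq("user_id") }
    println(s"shuffle needed: ${needsShuffle(source, required)}") // false
  }
}
{code}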



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23183) Failure caused by TaskContext is missing in the thread spawned by Custom RDD

2018-01-22 Thread Yongqin Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335160#comment-16335160
 ] 

Yongqin Xiao commented on SPARK-23183:
--

Please look at the reproduction described at 
[https://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition];
there is no custom RDD involved there at all.
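
As a generic illustration of the failure pattern and one possible workaround (a sketch under the assumption that the per-row processing itself can run off the task thread), the snippet below consumes the partition iterator only on the task thread, where the TaskContext is set, and hands rows to a helper thread through a bounded queue:

{code:scala}
import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}

import org.apache.spark.sql.SparkSession

object TaskThreadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("task-thread-sketch").getOrCreate()

    spark.sparkContext.parallelize(1 to 100, numSlices = 4).foreachPartition { iter =>
      // Bounded hand-off queue; None is the end-of-partition sentinel.
      val queue = new ArrayBlockingQueue[Option[Int]](16)
      val worker = new Thread(new Runnable {
        override def run(): Unit = {
          var done = false
          while (!done) {
            val item = queue.poll(1, TimeUnit.SECONDS)
            if (item == null) {
              // timed out, keep waiting
            } else if (item.isEmpty) {
              done = true
            } else {
              // process item.get here, off the task thread; no iterator access in this thread
            }
          }
        }
      })
      worker.start()
      // The iterator is consumed only on the task thread, where TaskContext is available.
      iter.foreach(v => queue.put(Some(v)))
      queue.put(None)
      worker.join()
    }
    spark.stop()
  }
}
{code}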

> Failure caused by TaskContext is missing in the thread spawned by Custom RDD 
> -
>
> Key: SPARK-23183
> URL: https://issues.apache.org/jira/browse/SPARK-23183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.1
> Environment: This issue is only reproducible when user code spawns a 
> new thread where the RDD iterator is used. It can be reproduced in any spark 
> version.
>Reporter: Yongqin Xiao
>Priority: Critical
>  Labels: easyfix
>
> This is related to the already resolved issue 
> https://issues.apache.org/jira/browse/SPARK-18406, which was resolved by 
> "explicitly pass[ing] the TID value to the {{unlock}} method".
> That fix covered the use cases known at the time.
> However, a new use case fails: a custom RDD followed by an 
> aggregator. The stack trace is:
> {noformat}
> 18/01/22 13:19:51 WARN TaskSetManager: Lost task 75.0 in stage 2.0 (TID 124, 
> psrhcdh58c2c001.informatica.com, executor 3): java.lang.RuntimeException: DTM 
> execution failed with error: 
> com.informatica.powercenter.sdk.dtm.DTMException: 
> java.lang.NullPointerException
> [com.informatica.powercenter.sdk.dtm.DTMException: 
> java.lang.NullPointerException
>  at 
> com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.pushInputDataToDTM(DataExchangeRunnable.java:554)
>  at 
> com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.run(DataExchangeRunnable.java:424)
>  at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>  at  
> <---org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:419)
>  at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:79)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
>  at 
> com.informatica.products.infatransform.spark.edtm.DataExchangeRunnable$DataChannelRunnable.pushInputDataToDTM(DataExchangeRunnable.java:486)
>  ... 2 more
> ]
>  at 
> com.informatica.products.infatransform.spark.edtm.DTMIteratorForBulkOutput.finishUp(DTMIteratorForBulkOutput.java:94)
>  at 
> com.informatica.products.infatransform.spark.edtm.DTMIteratorForBulkOutput.hasNext(DTMIteratorForBulkOutput.java:53)
>  at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>  at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>  at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:968)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:959)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:899)
>  at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:959)
>  at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:705)
>  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:99)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>  at 
> 

[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2018-01-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334892#comment-16334892
 ] 

Apache Spark commented on SPARK-21727:
--

User 'neilalex' has created a pull request for this issue:
https://github.com/apache/spark/pull/20352

> Operating on an ArrayType in a SparkR DataFrame throws error
> 
>
> Key: SPARK-21727
> URL: https://issues.apache.org/jira/browse/SPARK-21727
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Neil Alexander McQuarrie
>Assignee: Neil Alexander McQuarrie
>Priority: Major
>
> Previously 
> [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
>  this as a Stack Overflow question, but it seems to be a bug.
> If I have an R data.frame where one of the column data types is an integer 
> *list* -- i.e., each of the elements in the column embeds an entire R list of 
> integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
> just fine... SparkR treats the column as ArrayType(Double). 
> However, any subsequent operation on this SparkR DataFrame appears to throw 
> an error.
> Create an example R data.frame:
> {code}
> indices <- 1:4
> myDf <- data.frame(indices)
> myDf$data <- list(rep(0, 20))
> {code}
> Examine it to make sure it looks okay:
> {code}
> > str(myDf) 
> 'data.frame':   4 obs. of  2 variables:  
>  $ indices: int  1 2 3 4  
>  $ data   :List of 4
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
> > head(myDf)   
>   indices   data 
> 1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> {code}
> Convert it to a SparkR DataFrame:
> {code}
> library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
> sparkR.session(master = "local[*]")
> mySparkDf <- as.DataFrame(myDf)
> {code}
> Examine the SparkR DataFrame schema; notice that the list column was 
> successfully converted to ArrayType:
> {code}
> > schema(mySparkDf)
> StructType
> |-name = "indices", type = "IntegerType", nullable = TRUE
> |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
> {code}
> However, operating on the SparkR DataFrame throws an error:
> {code}
> > collect(mySparkDf)
> 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Double is not a valid external type for schema of array
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
> ... long stack trace ...
> {code}
> Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11930) StreamInterceptor causes channel to be closed if user code throws exceptions

2018-01-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-11930.

Resolution: Won't Fix

No point in looking at this until there's a use case. Current code works fine 
for its purpose.

> StreamInterceptor causes channel to be closed if user code throws exceptions
> 
>
> Key: SPARK-11930
> URL: https://issues.apache.org/jira/browse/SPARK-11930
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> The current implementation of {{StreamInterceptor}} in the network library 
> allows user exceptions to bubble up to the network layer, which would cause 
> the channel to be closed.
> That's fine for the current use cases, but in some cases it might be better 
> to swallow the exceptions and ignore any incoming data that is part of the 
> stream, keeping the channel open. We should add that option.
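
A minimal sketch of the "swallow and keep the channel open" option, assuming a callback shape similar to the network library's stream callbacks (the trait below is a hypothetical stand-in, not the real interface): after the user callback throws, remaining chunks are drained silently and the failure is reported once at the end of the stream.

{code:scala}
import java.nio.ByteBuffer

// Hypothetical callback shape; illustrative only.
trait StreamCallbackLike {
  def onData(streamId: String, buf: ByteBuffer): Unit
  def onComplete(streamId: String): Unit
  def onFailure(streamId: String, cause: Throwable): Unit
}

// Wraps a user callback so its exceptions never reach the channel.
final class SwallowingCallback(delegate: StreamCallbackLike) extends StreamCallbackLike {
  @volatile private var failure: Option[Throwable] = None

  override def onData(streamId: String, buf: ByteBuffer): Unit =
    if (failure.isEmpty) {
      try delegate.onData(streamId, buf)
      catch { case t: Throwable => failure = Some(t) } // swallow; keep consuming the stream
    } // else: drain this chunk without touching user code

  override def onComplete(streamId: String): Unit =
    failure match {
      case Some(t) => delegate.onFailure(streamId, t) // report once, at the end
      case None    => delegate.onComplete(streamId)
    }

  override def onFailure(streamId: String, cause: Throwable): Unit =
    delegate.onFailure(streamId, cause)
}
{code}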



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23052) Migrate MicrConsoleSink to v2

2018-01-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-23052:
--
Summary: Migrate MicrConsoleSink to v2  (was: Migrate ConsoleSink to v2)

> Migrate MicrConsoleSink to v2
> -
>
> Key: SPARK-23052
> URL: https://issues.apache.org/jira/browse/SPARK-23052
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
> Fix For: 2.3.0, 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23052) Migrate ConsoleSink to v2

2018-01-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-23052:
-

Assignee: Jose Torres

> Migrate ConsoleSink to v2
> -
>
> Key: SPARK-23052
> URL: https://issues.apache.org/jira/browse/SPARK-23052
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
> Fix For: 2.3.0, 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11315) Add YARN extension service to publish Spark events to YARN timeline service

2018-01-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-11315:
--

Assignee: (was: Marcelo Vanzin)

> Add YARN extension service to publish Spark events to YARN timeline service
> ---
>
> Key: SPARK-11315
> URL: https://issues.apache.org/jira/browse/SPARK-11315
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Hadoop 2.6+
>Reporter: Steve Loughran
>Priority: Major
>
> Add an extension service (using SPARK-11314) to subscribe to Spark lifecycle 
> events, batch them and forward them to the YARN Application Timeline Service. 
> This data can then be retrieved by a new back end for the Spark History 
> Service, and by other analytics tools.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10439) Catalyst should check for overflow / underflow of date and timestamp values

2018-01-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-10439.

Resolution: Won't Fix

Doesn't look like there's any interest in fixing this.

> Catalyst should check for overflow / underflow of date and timestamp values
> ---
>
> Key: SPARK-10439
> URL: https://issues.apache.org/jira/browse/SPARK-10439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> While testing some code, I noticed that a few methods in {{DateTimeUtils}} 
> are prone to overflow and underflow.
> For example, {{millisToDays}} can overflow the return type ({{Int}}) if a 
> large enough input value is provided.
> Similarly, {{fromJavaTimestamp}} converts milliseconds to microseconds, which 
> can overflow if the input is {{> Long.MAX_VALUE / 1000}} (or underflow in the 
> negative case).
> There might be others but these were the ones that caught my eye.
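
For concreteness, a sketch (not the actual DateTimeUtils code) of how the two conversions mentioned above could fail fast instead of silently wrapping, using the exact-arithmetic helpers in java.lang.Math:

{code:scala}
import java.sql.Timestamp
import java.util.concurrent.TimeUnit

object OverflowChecksSketch {
  // millisToDays-style conversion: the day count must fit in an Int.
  // (The real method also applies a timezone offset; omitted here.)
  def millisToDaysChecked(millis: Long): Int =
    Math.toIntExact(TimeUnit.MILLISECONDS.toDays(millis)) // ArithmeticException instead of wrap-around

  // fromJavaTimestamp-style conversion: milliseconds to microseconds can overflow a Long.
  def millisToMicrosChecked(ts: Timestamp): Long =
    Math.multiplyExact(ts.getTime, 1000L) // ArithmeticException if |millis| > Long.MaxValue / 1000

  def main(args: Array[String]): Unit = {
    println(millisToDaysChecked(System.currentTimeMillis()))
    println(millisToMicrosChecked(new Timestamp(System.currentTimeMillis())))
  }
}
{code}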



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20664) Remove stale applications from SHS listing

2018-01-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334971#comment-16334971
 ] 

Apache Spark commented on SPARK-20664:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20353

> Remove stale applications from SHS listing
> --
>
> Key: SPARK-20664
> URL: https://issues.apache.org/jira/browse/SPARK-20664
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 2.3.0
>
>
> See spec in parent issue (SPARK-18085) for more details.
> This task is actually not explicit in the spec, and it's also an issue with 
> the current SHS. But having the SHS persist listing data makes it worse.
> Basically, the SHS currently does not detect when files are deleted from the 
> event log directory manually; so those applications are still listed, and 
> trying to see their UI will either show the UI (if it's loaded) or an error 
> (if it's not).
> With the new SHS, that also means that data is leaked in the disk stores used 
> to persist listing and UI data, making the problem worse.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-22 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334885#comment-16334885
 ] 

Hossein Falaki commented on SPARK-23114:


[~felixcheung] I don't have any datasets to share. I have not seen a failure 
that was for no good reason. I have been filing individual tickets for each 
failure mode I have seen. I suggest we address them individually.

> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12977) Factoring out StreamingListener and UI to support history UI

2018-01-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-12977:
--

Assignee: (was: Marcelo Vanzin)

> Factoring out StreamingListener and UI to support history UI
> 
>
> Key: SPARK-12977
> URL: https://issues.apache.org/jira/browse/SPARK-12977
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
>Priority: Major
> Attachments: screenshot-1.png
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12977) Factoring out StreamingListener and UI to support history UI

2018-01-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-12977:
--

Assignee: Marcelo Vanzin

> Factoring out StreamingListener and UI to support history UI
> 
>
> Key: SPARK-12977
> URL: https://issues.apache.org/jira/browse/SPARK-12977
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
>Assignee: Marcelo Vanzin
>Priority: Major
> Attachments: screenshot-1.png
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20841) Support table column aliases in FROM clause

2018-01-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20841.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Support table column aliases in FROM clause
> ---
>
> Key: SPARK-20841
> URL: https://issues.apache.org/jira/browse/SPARK-20841
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.3.0
>
>
> Some SQL dialects support a relatively obscure "table column aliases" feature 
> where you can rename columns when aliasing a relation in a {{FROM}} clause. 
> For example:
> {code}
> SELECT * FROM onecolumn AS a(x) JOIN onecolumn AS b(y) ON a.x = b.y
> {code}
> Spark does not currently support this. I would like to add support for this 
> in order to allow me to run a corpus of existing queries which depend on this 
> syntax.
> There's a good writeup on this at 
> http://modern-sql.com/feature/table-column-aliases, which has additional 
> examples and describes other databases' degrees of support for this feature.
> One tricky thing to figure out will be whether FROM clause column aliases 
> take precedence over aliases in the SELECT clause. When adding support for 
> this, we should make sure to add sufficient testing of several corner-cases, 
> including:
> * Aliasing in both the SELECT and FROM clause
> * Aliasing columns in the FROM clause both with and without an explicit AS.
> * Aliasing the wrong number of columns in the FROM clause, both greater and 
> fewer columns than were selected in the SELECT clause.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23182) Allow enabling of TCP keep alive for master RPC connections

2018-01-22 Thread Petar Petrov (JIRA)
Petar Petrov created SPARK-23182:


 Summary: Allow enabling of TCP keep alive for master RPC 
connections
 Key: SPARK-23182
 URL: https://issues.apache.org/jira/browse/SPARK-23182
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Petar Petrov


We rely heavily on preemptible worker machines in GCP/GCE. These machines 
disappear without closing their TCP connections to the master, which increases the 
number of established connections until new workers cannot connect because of 
"Too many open files" on the master.

To solve the problem we need to enable TCP keep-alive for the RPC connections 
to the master, but it is currently not possible to do so via configuration.
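
For reference, a sketch of what enabling keep-alive looks like on a plain Netty server bootstrap (illustrative only, not the Spark RPC code; the port and handler are placeholders): SO_KEEPALIVE makes the OS probe idle connections, so dead peers such as preempted workers are eventually detected and their sockets closed.

{code:scala}
import io.netty.bootstrap.ServerBootstrap
import io.netty.channel.{ChannelInitializer, ChannelOption}
import io.netty.channel.nio.NioEventLoopGroup
import io.netty.channel.socket.SocketChannel
import io.netty.channel.socket.nio.NioServerSocketChannel

object KeepAliveServerSketch {
  def main(args: Array[String]): Unit = {
    val boss = new NioEventLoopGroup(1)
    val workers = new NioEventLoopGroup()
    try {
      val bootstrap = new ServerBootstrap()
        .group(boss, workers)
        .channel(classOf[NioServerSocketChannel])
        // Enable TCP keep-alive on every accepted child connection.
        .childOption(ChannelOption.SO_KEEPALIVE, java.lang.Boolean.TRUE)
        .childHandler(new ChannelInitializer[SocketChannel] {
          override def initChannel(ch: SocketChannel): Unit = () // real handlers would go here
        })
      val channel = bootstrap.bind(7077).sync().channel() // placeholder port
      channel.closeFuture().sync()
    } finally {
      boss.shutdownGracefully()
      workers.shutdownGracefully()
    }
  }
}
{code}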



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23172) Expand the ReorderJoin rule to handle Project nodes

2018-01-22 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-23172:
-
Summary: Expand the ReorderJoin rule to handle Project nodes  (was: Respect 
Project nodes in ReorderJoin)

> Expand the ReorderJoin rule to handle Project nodes
> ---
>
> Key: SPARK-23172
> URL: https://issues.apache.org/jira/browse/SPARK-23172
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> The current `ReorderJoin` optimizer rule cannot flatten a `Join -> 
> Project -> Join` pattern because `ExtractFiltersAndInnerJoins`
> doesn't handle `Project` nodes. As a result, the current master cannot reorder the joins 
> in the query below:
> {code}
> val df1 = spark.range(100).selectExpr("id % 10 AS k0", s"id % 10 AS k1", s"id 
> % 10 AS k2", "id AS v1")
> val df2 = spark.range(10).selectExpr("id AS k0", "id AS v2")
> val df3 = spark.range(10).selectExpr("id AS k1", "id AS v3")
> val df4 = spark.range(10).selectExpr("id AS k2", "id AS v4")
> df1.join(df2, "k0").join(df3, "k1").join(df4, "k2").explain(true)
> == Analyzed Logical Plan ==
> k2: bigint, k1: bigint, k0: bigint, v1: bigint, v2: bigint, v3: bigint, v4: 
> bigint
> Project [k2#5L, k1#4L, k0#3L, v1#6L, v2#16L, v3#24L, v4#32L]
> +- Join Inner, (k2#5L = k2#31L)
>:- Project [k1#4L, k0#3L, k2#5L, v1#6L, v2#16L, v3#24L]
>:  +- Join Inner, (k1#4L = k1#23L)
>: :- Project [k0#3L, k1#4L, k2#5L, v1#6L, v2#16L]
>: :  +- Join Inner, (k0#3L = k0#15L)
>: : :- Project [(id#0L % cast(10 as bigint)) AS k0#3L, (id#0L % 
> cast(10 as bigint)) AS k1#4L, (id#0L % cast(10 as bigint)) AS k2#5L, id#0
> L AS v1#6L]
>: : :  +- Range (0, 100, step=1, splits=Some(4))
>: : +- Project [id#12L AS k0#15L, id#12L AS v2#16L]
>: :+- Range (0, 10, step=1, splits=Some(4))
>: +- Project [id#20L AS k1#23L, id#20L AS v3#24L]
>:+- Range (0, 10, step=1, splits=Some(4))
>+- Project [id#28L AS k2#31L, id#28L AS v4#32L]
>   +- Range (0, 10, step=1, splits=Some(4))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-21646) Add new type coercion rules to compatible with Hive

2018-01-22 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-21646:
---
Comment: was deleted

(was: Marking it as a duplicate of SPARK-22722)

> Add new type coercion rules to compatible with Hive
> ---
>
> Key: SPARK-21646
> URL: https://issues.apache.org/jira/browse/SPARK-21646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: Type_coercion_rules_to_compatible_with_Hive.pdf
>
>
> How to reproduce:
> hive:
> {code:sql}
> $ hive -S
> hive> create table spark_21646(c1 string, c2 string);
> hive> insert into spark_21646 values('92233720368547758071', 'a');
> hive> insert into spark_21646 values('21474836471', 'b');
> hive> insert into spark_21646 values('10', 'c');
> hive> select * from spark_21646 where c1 > 0;
> 92233720368547758071  a
> 10c
> 21474836471   b
> hive>
> {code}
> spark-sql:
> {code:sql}
> $ spark-sql -S
> spark-sql> select * from spark_21646 where c1 > 0;
> 10  c 
>   
> spark-sql> select * from spark_21646 where c1 > 0L;
> 21474836471   b
> 10c
> spark-sql> explain select * from spark_21646 where c1 > 0;
> == Physical Plan ==
> *Project [c1#14, c2#15]
> +- *Filter (isnotnull(c1#14) && (cast(c1#14 as int) > 0))
>+- *FileScan parquet spark_21646[c1#14,c2#15] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[viewfs://cluster4/user/hive/warehouse/spark_21646], 
> PartitionFilters: [], PushedFilters: [IsNotNull(c1)], ReadSchema: 
> struct
> spark-sql> 
> {code}
> As you can see, Spark automatically casts c1 to int; if the value is outside the integer 
> range, the result is different from Hive's.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces

2018-01-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23148:


Assignee: (was: Apache Spark)

> spark.read.csv with multiline=true gives FileNotFoundException if path 
> contains spaces
> --
>
> Key: SPARK-23148
> URL: https://issues.apache.org/jira/browse/SPARK-23148
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bogdan Raducanu
>Priority: Major
>
> Repro code:
> {code:java}
> spark.range(10).write.csv("/tmp/a b c/a.csv")
> spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count
> 10
> spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count
> java.io.FileNotFoundException: File 
> file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv
>  does not exist
> {code}
> Trying to manually escape fails in a different place:
> {code}
> spark.read.option("multiLine", true).csv("/tmp/a%20b%20c/a.csv").count
> org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/tmp/a%20b%20c/a.csv;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:683)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces

2018-01-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335260#comment-16335260
 ] 

Apache Spark commented on SPARK-23148:
--

User 'henryr' has created a pull request for this issue:
https://github.com/apache/spark/pull/20355

> spark.read.csv with multiline=true gives FileNotFoundException if path 
> contains spaces
> --
>
> Key: SPARK-23148
> URL: https://issues.apache.org/jira/browse/SPARK-23148
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bogdan Raducanu
>Priority: Major
>
> Repro code:
> {code:java}
> spark.range(10).write.csv("/tmp/a b c/a.csv")
> spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count
> 10
> spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count
> java.io.FileNotFoundException: File 
> file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv
>  does not exist
> {code}
> Trying to manually escape fails in a different place:
> {code}
> spark.read.option("multiLine", true).csv("/tmp/a%20b%20c/a.csv").count
> org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/tmp/a%20b%20c/a.csv;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:683)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces

2018-01-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23148:


Assignee: Apache Spark

> spark.read.csv with multiline=true gives FileNotFoundException if path 
> contains spaces
> --
>
> Key: SPARK-23148
> URL: https://issues.apache.org/jira/browse/SPARK-23148
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bogdan Raducanu
>Assignee: Apache Spark
>Priority: Major
>
> Repro code:
> {code:java}
> spark.range(10).write.csv("/tmp/a b c/a.csv")
> spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count
> 10
> spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count
> java.io.FileNotFoundException: File 
> file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv
>  does not exist
> {code}
> Trying to manually escape fails in a different place:
> {code}
> spark.read.option("multiLine", true).csv("/tmp/a%20b%20c/a.csv").count
> org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/tmp/a%20b%20c/a.csv;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:683)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23185) Make the configuration "spark.default.parallelism" can be changed on each SQL session to decrease empty files

2018-01-22 Thread LvDongrong (JIRA)
LvDongrong created SPARK-23185:
--

 Summary: Make the configuration "spark.default.parallelism" can be 
changed on each SQL session to decrease empty files
 Key: SPARK-23185
 URL: https://issues.apache.org/jira/browse/SPARK-23185
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
Reporter: LvDongrong


When execute "insert into ... values ...", many empty files will be 
generated.We can change the configuration "spark.default.parallelism" to 
decrease the number of empty files.But there are many occasions that we want to 
chang the configuration during each session so as not to influence other sql 
sentences, like we may use thrift server to excute many sql sentences on a SQL 
session.
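
As a hedged illustration of the session-scoped behaviour being asked for: for SQL workloads one can already limit empty output files per session by tuning the session-local shuffle partition count or by coalescing before the write (a sketch under those assumptions; it does not make spark.default.parallelism itself session-scoped):

{code:scala}
import org.apache.spark.sql.SparkSession

object EmptyFilesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[4]").appName("empty-files-sketch").getOrCreate()
    import spark.implicits._

    // Session-scoped setting: affects only this SparkSession, not other sessions
    // sharing the same Thrift server.
    spark.conf.set("spark.sql.shuffle.partitions", "2")

    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")

    // Alternatively, collapse to a small, explicit number of output files.
    df.coalesce(1).write.mode("overwrite").parquet("/tmp/sketch-output")

    spark.stop()
  }
}
{code}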



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23185) Make the configuration "spark.default.parallelism" can be changed on each SQL session to decrease empty files

2018-01-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23185:


Assignee: Apache Spark

> Make the configuration "spark.default.parallelism" can be changed on each SQL 
> session to decrease empty files
> -
>
> Key: SPARK-23185
> URL: https://issues.apache.org/jira/browse/SPARK-23185
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: LvDongrong
>Assignee: Apache Spark
>Priority: Minor
>
> When execute "insert into ... values ...", many empty files will be 
> generated.We can change the configuration "spark.default.parallelism" to 
> decrease the number of empty files.But there are many occasions that we want 
> to chang the configuration during each session so as not to influence other 
> sql sentences, like we may use thrift server to excute many sql sentences on 
> a SQL session.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23185) Make the configuration "spark.default.parallelism" can be changed on each SQL session to decrease empty files

2018-01-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23185:


Assignee: (was: Apache Spark)

> Make the configuration "spark.default.parallelism" can be changed on each SQL 
> session to decrease empty files
> -
>
> Key: SPARK-23185
> URL: https://issues.apache.org/jira/browse/SPARK-23185
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: LvDongrong
>Priority: Minor
>
> When execute "insert into ... values ...", many empty files will be 
> generated.We can change the configuration "spark.default.parallelism" to 
> decrease the number of empty files.But there are many occasions that we want 
> to chang the configuration during each session so as not to influence other 
> sql sentences, like we may use thrift server to excute many sql sentences on 
> a SQL session.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23185) Make the configuration "spark.default.parallelism" can be changed on each SQL session to decrease empty files

2018-01-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335346#comment-16335346
 ] 

Apache Spark commented on SPARK-23185:
--

User 'lvdongr' has created a pull request for this issue:
https://github.com/apache/spark/pull/20356

> Make the configuration "spark.default.parallelism" can be changed on each SQL 
> session to decrease empty files
> -
>
> Key: SPARK-23185
> URL: https://issues.apache.org/jira/browse/SPARK-23185
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: LvDongrong
>Priority: Minor
>
> When execute "insert into ... values ...", many empty files will be 
> generated.We can change the configuration "spark.default.parallelism" to 
> decrease the number of empty files.But there are many occasions that we want 
> to chang the configuration during each session so as not to influence other 
> sql sentences, like we may use thrift server to excute many sql sentences on 
> a SQL session.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23186) Loading JDBC Drivers should be synchronized

2018-01-22 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-23186:
-

 Summary: Loading JDBC Drivers should be synchronized
 Key: SPARK-23186
 URL: https://issues.apache.org/jira/browse/SPARK-23186
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
Reporter: Dongjoon Hyun


Since some JDBC Drivers have class initialization code to call `DriverManager`, 
we need to synchronize them to avoid potential deadlock situation like 
STORM-2527.
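
To make the idea concrete, a minimal sketch (not the actual Spark code) of funnelling driver class loading through a single lock, so a driver whose static initializer calls back into DriverManager cannot interleave with another registration in progress:

{code:scala}
import java.sql.DriverManager

object DriverLoaderSketch {
  private val lock = new Object

  // All driver class loading goes through one lock; a driver whose static
  // initializer touches DriverManager cannot race a concurrent load here.
  def loadDriver(className: String): Unit = lock.synchronized {
    Class.forName(className)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical driver class; replace with a driver that is on your classpath.
    loadDriver("org.h2.Driver")

    val drivers = DriverManager.getDrivers
    while (drivers.hasMoreElements) println(drivers.nextElement().getClass.getName)
  }
}
{code}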



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23186) Loading JDBC Drivers should be synchronized

2018-01-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23186:


Assignee: Apache Spark

> Loading JDBC Drivers should be synchronized
> --
>
> Key: SPARK-23186
> URL: https://issues.apache.org/jira/browse/SPARK-23186
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> Since some JDBC Drivers have class initialization code to call 
> `DriverManager`, we need to synchronize them to avoid potential deadlock 
> situation like STORM-2527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23186) Loading JDBC Drivers should be synchronized

2018-01-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335369#comment-16335369
 ] 

Apache Spark commented on SPARK-23186:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20357

> Loading JDBC Drivers should be synchronized
> --
>
> Key: SPARK-23186
> URL: https://issues.apache.org/jira/browse/SPARK-23186
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Since some JDBC Drivers have class initialization code to call 
> `DriverManager`, we need to synchronize them to avoid potential deadlock 
> situation like STORM-2527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23186) Loading JDBC Drivers should be synchronized

2018-01-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23186:


Assignee: (was: Apache Spark)

> Loading JDBC Drivers should be synchronized
> --
>
> Key: SPARK-23186
> URL: https://issues.apache.org/jira/browse/SPARK-23186
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Since some JDBC Drivers have class initialization code to call 
> `DriverManager`, we need to synchronize them to avoid potential deadlock 
> situation like STORM-2527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20749) Built-in SQL Function Support - all variants of LEN[GTH]

2018-01-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335375#comment-16335375
 ] 

Apache Spark commented on SPARK-20749:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/20358

> Built-in SQL Function Support - all variants of LEN[GTH]
> 
>
> Key: SPARK-20749
> URL: https://issues.apache.org/jira/browse/SPARK-20749
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Kazuaki Ishizaki
>Priority: Major
>  Labels: starter
> Fix For: 2.3.0
>
>
> {noformat}
> LEN[GTH]()
> {noformat}
> The SQL 99 standard includes BIT_LENGTH(), CHAR_LENGTH(), and OCTET_LENGTH() 
> functions.
> We need to support all of them.
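
For reference, a usage sketch once these functions are available (the issue above is marked as fixed for 2.3.0); the expected values assume plain ASCII input, where characters, octets, and bits line up as 1:1:8.

{code:scala}
import org.apache.spark.sql.SparkSession

object LengthVariantsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("length-variants").getOrCreate()

    // CHAR_LENGTH counts characters, OCTET_LENGTH counts bytes, BIT_LENGTH counts bits.
    spark.sql(
      "SELECT length('Spark') AS len, char_length('Spark') AS chars, " +
        "octet_length('Spark') AS octets, bit_length('Spark') AS bits"
    ).show()
    // Expected for ASCII input: len=5, chars=5, octets=5, bits=40

    spark.stop()
  }
}
{code}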



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20927) Add cache operator to Unsupported Operations in Structured Streaming

2018-01-22 Thread Mathieu DESPRIEE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334363#comment-16334363
 ] 

Mathieu DESPRIEE edited comment on SPARK-20927 at 1/22/18 4:19 PM:
---

@[~zsxwing] actually it does make sense to cache mini-batches that have been 
fetched from the source.

In our case, we use FileStreamSource to obtain some data in a streaming way 
from a remote source (s3). It appears that new files get downloaded over and 
over and over for each streaming query we do on the stream. cache() would 
improve a lot our processing time. 

We also occasionally reprocess some historical data (in a streaming way, using 
the same queries). The penalty of fetching the data multiple times is enormous.

 


was (Author: mathieude):
@[~zsxwing] actually it does make sense to cache mini-batches that have been 
fetched from the source.

In our case, we use FileStreamSource to obtain some data in a streaming way 
from a remote source (s3). It appears that new files get downloaded over and 
over and over for each streaming query we do on the stream. cache() would 
improve a lot our processing time. 

We also occasionally reprocess some historical data (in a streaming way, using 
the same queries). The penalty of fetching the data multiple times is 
off-the-roof ...

 

> Add cache operator to Unsupported Operations in Structured Streaming 
> -
>
> Key: SPARK-20927
> URL: https://issues.apache.org/jira/browse/SPARK-20927
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>  Labels: starter
>
> Just [found 
> out|https://stackoverflow.com/questions/42062092/why-does-using-cache-on-streaming-datasets-fail-with-analysisexception-queries]
>  that {{cache}} is not allowed on streaming datasets.
> {{cache}} on streaming datasets leads to the following exception:
> {code}
> scala> spark.readStream.text("files").cache
> org.apache.spark.sql.AnalysisException: Queries with streaming sources must 
> be executed with writeStream.start();;
> FileSource[files]
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:34)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:63)
>   at 
> org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:74)
>   at 
> org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72)
>   at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:104)
>   at 
> org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:68)
>   at 
> org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:92)
>   at org.apache.spark.sql.Dataset.persist(Dataset.scala:2603)
>   at org.apache.spark.sql.Dataset.cache(Dataset.scala:2613)
>   ... 48 elided
> {code}
> It should be included in Structured Streaming's [Unsupported 
> Operations|http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23180) RFormulaModel should have labels member

2018-01-22 Thread Kevin Kuo (JIRA)
Kevin Kuo created SPARK-23180:
-

 Summary: RFormulaModel should have labels member
 Key: SPARK-23180
 URL: https://issues.apache.org/jira/browse/SPARK-23180
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.1
Reporter: Kevin Kuo


Like {{StringIndexerModel}}, {{RFormulaModel}} should have a {{labels}} member 
to facilitate constructing the appropriate {{IndexToString}} transformer to get 
the string labels back. The current workaround is to perform a transform and then 
parse the schema, which is tedious.
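
A sketch of the workaround described above (transform, then read the labels back out of the label column's metadata); it assumes the label column carries a NominalAttribute, which is the case for string labels:

{code:scala}
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.SparkSession

object RFormulaLabelsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("rformula-labels").getOrCreate()
    import spark.implicits._

    val df = Seq(("yes", 1.0), ("no", 2.0), ("yes", 3.0)).toDF("outcome", "x")
    val model = new RFormula().setFormula("outcome ~ x").fit(df)
    val transformed = model.transform(df)

    // Recover the string labels from the metadata of the label column.
    val labels = Attribute.fromStructField(transformed.schema(model.getLabelCol)) match {
      case nominal: NominalAttribute => nominal.values.getOrElse(Array.empty[String])
      case _                         => Array.empty[String]
    }
    println(labels.mkString(", "))

    spark.stop()
  }
}
{code}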



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable

2018-01-22 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334005#comment-16334005
 ] 

Liang-Chi Hsieh commented on SPARK-23173:
-

+1 for 1 too.

> from_json can produce nulls for fields which are marked as non-nullable
> ---
>
> Key: SPARK-23173
> URL: https://issues.apache.org/jira/browse/SPARK-23173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Herman van Hovell
>Priority: Major
>
> The {{from_json}} function uses a schema to convert a string into a Spark SQL 
> struct. This schema can contain non-nullable fields. The underlying 
> {{JsonToStructs}} expression does not check if a resulting struct respects 
> the nullability of the schema. This leads to very weird problems in consuming 
> expressions. In our case parquet writing would produce an illegal parquet 
> file.
> There are roughly two solutions here:
>  # Assume that each field in the schema passed to {{from_json}} is nullable, and 
> ignore the nullability information set in the passed schema.
>  # Validate the object at runtime, and fail execution if the data is null 
> where we are not expecting it.
> I am currently slightly in favor of option 1, since it is the more 
> performant option and a lot easier to do.
> WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] [~brkyvz]
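
For context, a small sketch that reproduces the mismatch described above (a field declared non-nullable for which {{from_json}} still yields a null); illustrative only:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructType}

object FromJsonNullabilitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("from-json-nullability").getOrCreate()
    import spark.implicits._

    // "a" is declared non-nullable, but the input JSON does not contain it.
    val schema = new StructType().add("a", IntegerType, nullable = false)
    val df = Seq("""{"b": 1}""").toDF("json")
      .select(from_json($"json", schema).as("parsed"))

    df.printSchema() // reports parsed.a as not nullable
    df.show()        // yet the row comes back with a null where the schema says none can be

    spark.stop()
  }
}
{code}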



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23176) REPL project build failing in Spark v2.2.0

2018-01-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23176.
---
Resolution: Not A Problem

The 2.2.0 release built and builds correctly. You don't show a problem here.

> REPL project build failing in Spark v2.2.0
> ---
>
> Key: SPARK-23176
> URL: https://issues.apache.org/jira/browse/SPARK-23176
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: shekhar reddy
>Priority: Blocker
>
> I tried building Spark v2.2.0 and got a compilation error in *Main.scala* under 
> https://github.com/apache/spark/tree/v2.2.0/repl/scala-2.11/src/main/scala/org/apache/spark/repl
> at line 59: val jars = Utils.getUserJars(conf, isShell = 
> true).mkString(File.pathSeparator)
> I switched to Spark v2.2.1, compiled, and it worked fine.
> Could you please fix this build error so that it helps future users?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20927) Add cache operator to Unsupported Operations in Structured Streaming

2018-01-22 Thread Mathieu DESPRIEE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334363#comment-16334363
 ] 

Mathieu DESPRIEE edited comment on SPARK-20927 at 1/22/18 2:59 PM:
---

@[~zsxwing] actually it does make sense to cache mini-batches that have been 
fetched from the source.

In our case, we use FileStreamSource to obtain data in a streaming way from a 
remote source (S3). It appears that new files get downloaded over and over 
again for each streaming query we run on the stream. cache() would improve our 
processing time a lot.

We also occasionally reprocess some historical data (in a streaming way, using 
the same queries). The penalty of fetching the data multiple times is through 
the roof.

 


was (Author: mathieude):
@[~zsxwing] actually it does make sens to cache mini-batches that have been 
fetched from the source.

In our case, we use FileStreamSource to obtain some data in a streaming way 
from a remote source (s3). It appears that new files get downloaded over and 
over and over for each streaming query we do on the stream. cache() would 
improve a lot our processing time. 

We also occasionally reprocess some historical data (in a streaming way, using 
the same queries). The penalty of fetching the data multiple times is 
off-the-roof ...

 

> Add cache operator to Unsupported Operations in Structured Streaming 
> -
>
> Key: SPARK-20927
> URL: https://issues.apache.org/jira/browse/SPARK-20927
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>  Labels: starter
>
> Just [found 
> out|https://stackoverflow.com/questions/42062092/why-does-using-cache-on-streaming-datasets-fail-with-analysisexception-queries]
>  that {{cache}} is not allowed on streaming datasets.
> {{cache}} on streaming datasets leads to the following exception:
> {code}
> scala> spark.readStream.text("files").cache
> org.apache.spark.sql.AnalysisException: Queries with streaming sources must 
> be executed with writeStream.start();;
> FileSource[files]
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:34)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:63)
>   at 
> org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:74)
>   at 
> org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72)
>   at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:104)
>   at 
> org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:68)
>   at 
> org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:92)
>   at org.apache.spark.sql.Dataset.persist(Dataset.scala:2603)
>   at org.apache.spark.sql.Dataset.cache(Dataset.scala:2613)
>   ... 48 elided
> {code}
> It should be included in Structured Streaming's [Unsupported 
> Operations|http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations].
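
Not part of the original ticket, but as a hedged aside for the micro-batch-reuse use case above: starting with Spark 2.4, {{foreachBatch}} exposes each micro-batch as a plain (non-streaming) Dataset, which can be persisted and reused by several batch operations within one trigger. A rough sketch, assuming Spark 2.4+ and purely illustrative paths:

{code}
// Sketch only; "files", "out" and the checkpoint path are placeholders.
val query = spark.readStream
  .text("files")
  .writeStream
  .option("checkpointLocation", "/tmp/checkpoint")
  .foreachBatch { (batch: org.apache.spark.sql.DataFrame, batchId: Long) =>
    batch.persist()                             // cache this micro-batch once
    println(s"batch $batchId has ${batch.count()} rows")    // first use
    batch.write.mode("append").parquet("out")   // second use, no re-download
    batch.unpersist()
  }
  .start()
{code}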



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20927) Add cache operator to Unsupported Operations in Structured Streaming

2018-01-22 Thread Mathieu DESPRIEE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334363#comment-16334363
 ] 

Mathieu DESPRIEE commented on SPARK-20927:
--

@[~zsxwing] actually it does make sense to cache mini-batches that have been 
fetched from the source.

In our case, we use FileStreamSource to obtain data in a streaming way from a 
remote source (S3). It appears that new files get downloaded over and over 
again for each streaming query we run on the stream. cache() would improve our 
processing time a lot.

We also occasionally reprocess some historical data (in a streaming way, using 
the same queries). The penalty of fetching the data multiple times is through 
the roof.

 

> Add cache operator to Unsupported Operations in Structured Streaming 
> -
>
> Key: SPARK-20927
> URL: https://issues.apache.org/jira/browse/SPARK-20927
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>  Labels: starter
>
> Just [found 
> out|https://stackoverflow.com/questions/42062092/why-does-using-cache-on-streaming-datasets-fail-with-analysisexception-queries]
>  that {{cache}} is not allowed on streaming datasets.
> {{cache}} on streaming datasets leads to the following exception:
> {code}
> scala> spark.readStream.text("files").cache
> org.apache.spark.sql.AnalysisException: Queries with streaming sources must 
> be executed with writeStream.start();;
> FileSource[files]
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:34)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:63)
>   at 
> org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:74)
>   at 
> org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72)
>   at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:104)
>   at 
> org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:68)
>   at 
> org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:92)
>   at org.apache.spark.sql.Dataset.persist(Dataset.scala:2603)
>   at org.apache.spark.sql.Dataset.cache(Dataset.scala:2613)
>   ... 48 elided
> {code}
> It should be included in Structured Streaming's [Unsupported 
> Operations|http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23179) Support option to throw exception if overflow occurs

2018-01-22 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-23179:
---

 Summary: Support option to throw exception if overflow occurs
 Key: SPARK-23179
 URL: https://issues.apache.org/jira/browse/SPARK-23179
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Marco Gaido


ANSI SQL 2011 states that when an overflow occurs during an arithmetic 
operation, an exception should be thrown. This is what most SQL databases do 
(e.g. SQL Server, DB2). Hive currently returns NULL (as Spark does), but 
HIVE-18291 is open to make it SQL compliant.

I propose adding a config option that lets users decide whether Spark should 
behave according to the SQL standard or in the current way (i.e. returning NULL).
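
A quick spark-shell illustration of the current behaviour the proposal wants to make configurable (my example; the values are purely illustrative): an arithmetic overflow on a max-precision decimal silently produces NULL instead of raising an error.

{code}
// DECIMAL(38,0) cannot hold the 39-digit product, so the multiplication
// overflows; today Spark returns NULL for the result rather than throwing.
spark.sql(
  "SELECT CAST('99999999999999999999999999999999999999' AS DECIMAL(38,0)) * 10 AS result"
).show()
{code}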



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11630) ClosureCleaner incorrectly warns for class based closures

2018-01-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11630.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20337
[https://github.com/apache/spark/pull/20337]

> ClosureCleaner incorrectly warns for class based closures
> -
>
> Key: SPARK-11630
> URL: https://issues.apache.org/jira/browse/SPARK-11630
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Frens Jan Rumph
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.4.0
>
>
> Spark's `ClosureCleaner` utility seems to check whether a function is an 
> anonymous function: [ClosureCleaner.scala on line 
> 49|https://github.com/apache/spark/blob/f85aa06464a10f5d1563302fd76465dded475a12/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala#L49]
>  If not, it warns the user.
> However, I'm using some class-based functions, something along the lines of:
> {code}
> trait FromUnreadRow[T] extends (UnreadRow => T) with Serializable
> object ToPlainRow extends FromUnreadRow[PlainRow] {
>   override def apply(row: UnreadRow): PlainRow = ???
> }
> {code}
> This works just fine. I can't really see that the warning is actually useful 
> in this case. I appreciate checking for common mistakes, but here a user might 
> be alarmed unnecessarily.
> Is there anything that can be done about this? Anything I can do?
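
To make the scenario concrete, a small self-contained sketch (the {{UnreadRow}}/{{PlainRow}} stand-ins below are mine; only the trait/object shape comes from the report): the argument passed to {{map}} is a top-level serializable object rather than an anonymous closure, and that is exactly the case that triggers ClosureCleaner's "expected a closure" warning.

{code}
// Minimal stand-in types so the example compiles on its own.
case class UnreadRow(raw: String)
case class PlainRow(value: String)

trait FromUnreadRow[T] extends (UnreadRow => T) with Serializable

object ToPlainRow extends FromUnreadRow[PlainRow] {
  override def apply(row: UnreadRow): PlainRow = PlainRow(row.raw)
}

// ToPlainRow is serializable and works fine as the map function,
// but ClosureCleaner still logs a warning because it is not a closure class.
val rows = sc.parallelize(Seq(UnreadRow("a"), UnreadRow("b")))
rows.map(ToPlainRow).collect()
{code}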



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


