[jira] [Commented] (SPARK-25464) Dropping database can remove the hive warehouse directory contents

2018-09-21 Thread sandeep katta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624467#comment-16624467
 ] 

sandeep katta commented on SPARK-25464:
---

Yes, I agree that two databases should not point to the same path; currently this is a 
loophole in Spark that needs to be fixed. If this solution is not okay, then we can 
append dbname.db to the location given by the user.

For example:

create database db1 location /user/hive/warehouse

then the location of the DB should be /user/hive/warehouse/db1.db

 

> Dropping database can remove the hive warehouse directory contents
> --
>
> Key: SPARK-25464
> URL: https://issues.apache.org/jira/browse/SPARK-25464
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Sushanta Sen
>Priority: Major
>
> Create Database.
> CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] db_name [COMMENT comment_text] 
> [LOCATION path] [WITH DBPROPERTIES (key1=val1, key2=val2, ...)]
> {{LOCATION}}: If the specified path does not already exist in the underlying 
> file system, this command tries to create a directory with the path. When the 
> database is dropped later, this directory should not be deleted, but currently 
> Spark deletes the directory as well.
> Please refer to the [Databricks documentation|https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-database.html].
> If I create the database as below:
> create database db1 location '/user/hive/warehouse'; // this is the Hive 
> warehouse directory
> then on dropping this DB, Spark will also delete the warehouse directory, 
> which contains the other databases' data.
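The quoted scenario can be made concrete with a short spark-shell sketch (the 
/tmp/warehouse-test path and the database names are illustrative assumptions, and 
spark.sql.warehouse.dir is assumed to point at that directory):
{code:java}
// Hedged sketch of the reported hazard; run only against a throwaway directory.
// Assumes spark.sql.warehouse.dir = /tmp/warehouse-test.
spark.sql("CREATE DATABASE db2")                                 // stored at /tmp/warehouse-test/db2.db
spark.sql("CREATE DATABASE db1 LOCATION '/tmp/warehouse-test'")  // same path as the warehouse root
spark.sql("DROP DATABASE db1")                                   // per the report, this also removes the
                                                                 // warehouse root, taking db2.db with it
{code}
Appending the database name to a user-supplied location, as proposed above, would avoid 
this overlap because each database would get its own subdirectory.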






[jira] [Updated] (SPARK-25511) Map with "null" key not working in spark 2.3

2018-09-21 Thread Ravi Shankar (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Shankar updated SPARK-25511:
-
Description: 
I had a use case where I was creating a histogram of column values through a UDAF, 
returned as a Map data type. It is basically just a group-by count on a column's values, 
returned as a Map. I needed to plug in all invalid values for the column as a 
"null -> count" entry in the map that was returned. In 2.1.x this was working fine and 
I could create a Map with "null" as a key. This is not working in 2.3, and I am wondering 
whether this is expected and whether I have to change my application code: 

 
{code:java}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{MapType, StringType, StructField, StructType}

// Build a map whose keys include null, and wrap it in a single row.
val myList = List(("a", "1"), ("b", "2"), ("a", "3"), (null, "4"))
val map = myList.toMap
val data = List(List("sublime", map))
val rdd = sc.parallelize(data).map(l => Row.fromSeq(l.toSeq))

// Both columns are nullable; MapType's third argument is valueContainsNull.
val datasetSchema = StructType(List(
  StructField("name", StringType, true),
  StructField("songs", MapType(StringType, StringType, true), true)))

val df = spark.createDataFrame(rdd, datasetSchema)
df.take(5).foreach(println)
{code}
Output in spark 2.1.x:
{code:java}
scala> df.take(5).foreach(println)
[sublime,Map(a -> 3, b -> 2, null -> 4)]
{code}
Output in spark 2.3.x:
{code:java}
2018-09-21 15:35:25 ERROR Executor:91 - Exception in task 2.0 in stage 14.0 
(TID 39)
java.lang.RuntimeException: Error while encoding: 
java.lang.NullPointerException: Null value appeared in non-nullable field:

If the schema is inferred from a Scala tuple/case class, or a Java bean, please 
try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
instead of int/scala.Int).
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else 
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
org.apache.spark.sql.Row, true]), 0, name), StringType), true, false) AS name#38
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else 
newInstance(class org.apache.spark.sql.catalyst.util.ArrayBasedMapData) AS 
songs#39
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:291)
at 
org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:589)
at 
org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:589)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException: Null value appeared in non-nullable 
field:

If the schema is inferred from a Scala tuple/case class, or a Java bean, please 
try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
instead of int/scala.Int).
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:288)
... 23 more
{code}
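A possible interim workaround, sketched below under the assumption that the application 
can tolerate a sentinel string in place of the real null key (it reuses {{map}} and 
{{datasetSchema}} from the snippet above):
{code:java}
// Hedged workaround sketch, not a fix: Spark 2.3 rejects null map keys during
// encoding, so rewrite the null key to a sentinel string before building the Row.
val sanitized = map.map { case (k, v) => (if (k == null) "null" else k) -> v }
val rdd2 = sc.parallelize(List(List("sublime", sanitized))).map(l => Row.fromSeq(l.toSeq))
val df2 = spark.createDataFrame(rdd2, datasetSchema)
df2.take(5).foreach(println)  // prints [sublime,Map(a -> 3, b -> 2, null -> 4)]; entry order may vary
{code}
Downstream code would then need to treat the literal string "null" as the bucket for 
invalid values, which may or may not be acceptable for the application.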

[jira] [Commented] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow

2018-09-21 Thread Pooja Gadige (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624323#comment-16624323
 ] 

Pooja Gadige commented on SPARK-25351:
--

Hello! I'm working on this issue.

Should we handle the category type the same way Spark handles the elements without 
Arrow, i.e. as `string`?

If that is not the case, I would appreciate it if you could guide me in the right 
direction.

Thank you. 

> Handle Pandas category type when converting from Python with Arrow
> --
>
> Key: SPARK-25351
> URL: https://issues.apache.org/jira/browse/SPARK-25351
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Bryan Cutler
>Priority: Major
>
> There needs to be some handling of category types done when calling 
> {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}.  
> Without Arrow, Spark casts each element to the category. For example 
> {noformat}
> In [1]: import pandas as pd
> In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})
> In [3]: pdf["B"] = pdf["A"].astype('category')
> In [4]: pdf
> Out[4]: 
>A  B
> 0  a  a
> 1  b  b
> 2  c  c
> 3  a  a
> In [5]: pdf.dtypes
> Out[5]: 
> A  object
> Bcategory
> dtype: object
> In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)
> In [8]: df = spark.createDataFrame(pdf)
> In [9]: df.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  a|  a|
> |  b|  b|
> |  c|  c|
> |  a|  a|
> +---+---+
> In [10]: df.printSchema()
> root
>  |-- A: string (nullable = true)
>  |-- B: string (nullable = true)
> In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
> In [19]: df = spark.createDataFrame(pdf)   
>1667 spark_type = ArrayType(from_arrow_type(at.value_type))
>1668 else:
> -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " 
> + str(at))
>1670 return spark_type
>1671 
> TypeError: Unsupported type in conversion from Arrow: 
> dictionary
> {noformat}






[jira] [Created] (SPARK-25511) Map with "null" key not working in spark 2.3

2018-09-21 Thread Ravi Shankar (JIRA)
Ravi Shankar created SPARK-25511:


 Summary: Map with "null" key not working in spark 2.3
 Key: SPARK-25511
 URL: https://issues.apache.org/jira/browse/SPARK-25511
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 2.3.1
Reporter: Ravi Shankar


I had a use case where I was creating a histogram of a column's values in a Map 
data type. It is basically just a group-by count on a column's values, returned as a 
Map. I needed to plug in all invalid values for the column as a "null -> count" entry 
in the map that was returned. In 2.1.x this was working fine and I could create a Map 
with "null" as a key. This is not working in 2.3, and I am wondering whether this is 
expected and whether I have to change my application code: 

 
{code:java}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{MapType, StringType, StructField, StructType}

// Build a map whose keys include null, and wrap it in a single row.
val myList = List(("a", "1"), ("b", "2"), ("a", "3"), (null, "4"))
val map = myList.toMap
val data = List(List("sublime", map))
val rdd = sc.parallelize(data).map(l => Row.fromSeq(l.toSeq))

// Both columns are nullable; MapType's third argument is valueContainsNull.
val datasetSchema = StructType(List(
  StructField("name", StringType, true),
  StructField("songs", MapType(StringType, StringType, true), true)))

val df = spark.createDataFrame(rdd, datasetSchema)
df.take(5).foreach(println)
{code}

Output in spark 2.1.x:

{code}
scala> df.take(5).foreach(println)
[sublime,Map(a -> 3, b -> 2, null -> 4)]
{code}

Output in spark 2.3.x:
{code}
2018-09-21 15:35:25 ERROR Executor:91 - Exception in task 2.0 in stage 14.0 
(TID 39)
java.lang.RuntimeException: Error while encoding: 
java.lang.NullPointerException: Null value appeared in non-nullable field:

If the schema is inferred from a Scala tuple/case class, or a Java bean, please 
try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
instead of int/scala.Int).
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else 
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
org.apache.spark.sql.Row, true]), 0, name), StringType), true, false) AS name#38
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else 
newInstance(class org.apache.spark.sql.catalyst.util.ArrayBasedMapData) AS 
songs#39
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:291)
at 
org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:589)
at 
org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:589)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException: Null value appeared in non-nullable 
field:

If the schema is inferred from a Scala tuple/case class, or a Java bean, please 
try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
instead of int/scala.Int).
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If$(Unknown
 Source)
at 

[jira] [Commented] (SPARK-24523) InterruptedException when closing SparkContext

2018-09-21 Thread Ankur Gupta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624231#comment-16624231
 ] 

Ankur Gupta commented on SPARK-24523:
-

That is concerning, actually. You should also see the following message near the start 
of your application logs: "Dropping event from queue appStatus. This likely means one of 
the listeners is too slow and cannot keep up with the rate at which tasks are being 
started by the scheduler."

I am not sure why this occurs, though. It seems that your application is creating a lot 
of events. Based on the code, all of the dropped events (7937) were created in the 
previous 60 seconds, and possibly even more events than that were created, as some 
events may have been processed as well.

You can increase the capacity of the queue via 
spark.scheduler.listenerbus.eventqueue.capacity (default is 10000) to ensure that 
events are not dropped. But I am not sure how to speed up the listener that processes 
these events. The only way that I know of is to increase the number of driver cores, 
which allows the thread processing these events to run more often.
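
As an illustration only (the capacity value 20000 below is an assumed example, not a 
recommendation), the setting can be raised when the session is built:
{code:java}
import org.apache.spark.sql.SparkSession

// Sketch: raise the listener-bus event queue capacity above the default so that
// bursts of task events are less likely to overflow the appStatus queue.
val spark = SparkSession.builder()
  .appName("listener-queue-capacity-example")
  .config("spark.scheduler.listenerbus.eventqueue.capacity", "20000") // assumed value; tune for your workload
  .getOrCreate()
{code}
The same key can also be passed via --conf on spark-submit, as in the configuration 
quoted below.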

> InterruptedException when closing SparkContext
> --
>
> Key: SPARK-24523
> URL: https://issues.apache.org/jira/browse/SPARK-24523
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0, 2.3.1
> Environment: EMR 5.14.0, S3/HDFS inputs and outputs; EMR 5.17
>  
>  
>  
>Reporter: Umayr Hassan
>Priority: Major
> Attachments: spark-stop-jstack.log.1, spark-stop-jstack.log.2, 
> spark-stop-jstack.log.3
>
>
> I'm running a Scala application in EMR with the following properties:
> {{--master yarn --deploy-mode cluster --driver-memory 13g --executor-memory 
> 30g --executor-cores 5 --conf spark.default.parallelism=400 --conf 
> spark.dynamicAllocation.enabled=true --conf 
> spark.dynamicAllocation.maxExecutors=20 --conf 
> spark.eventLog.dir=hdfs:///var/log/spark/apps --conf 
> spark.eventLog.enabled=true --conf 
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 --conf 
> spark.scheduler.listenerbus.eventqueue.capacity=2 --conf 
> spark.shuffle.service.enabled=true --conf spark.sql.shuffle.partitions=400 
> --conf spark.yarn.maxAppAttempts=1}}
> The application runs fine till SparkContext is (automatically) closed, at 
> which point the SparkContext object throws. 
> {{18/06/10 10:44:43 ERROR Utils: Uncaught exception in thread pool-4-thread-1 
> java.lang.InterruptedException at java.lang.Object.wait(Native Method) at 
> java.lang.Thread.join(Thread.java:1252) at 
> java.lang.Thread.join(Thread.java:1326) at 
> org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:133) at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>  at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at 
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:54) at 
> org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219) at 
> org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1915)
>  at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357) at 
> org.apache.spark.SparkContext.stop(SparkContext.scala:1914) at 
> org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572) 
> at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) 
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988) at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>  at scala.util.Try$.apply(Try.scala:192) at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) 

[jira] [Comment Edited] (SPARK-24523) InterruptedException when closing SparkContext

2018-09-21 Thread Umayr Hassan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624202#comment-16624202
 ] 

Umayr Hassan edited comment on SPARK-24523 at 9/21/18 9:30 PM:
---

[~irashid] The event logs are written to HDFS. 

[~ankur.gupta] Would the event logs have this information? Increasing driver 
cores to 4 didn't seem to help. BTW, I also see a warning message like:

{{18/09/21 20:15:47 WARN AsyncEventQueue: Dropped 7937 events from appStatus 
since Fri Sep 21 20:14:47 UTC 2018. }}

The one thing that, for some jobs, seems to help somewhat is reducing the number 
of partitions. E.g. one job (that also uses LinearRegression) runs in 30min 
instead of 2hrs when the number of partitions is reduced from 10 to 1 (the job 
took 15min in Spark 2.0.2).


was (Author: umayr_nuna):
[~irashid] The event logs are written to HDFS. 

[~ankur.gupta] Would the event logs have this information. Also, increasing 
driver cores to 4 didn't seem to help. BTW, I also see a warning message like:

{{18/09/21 20:15:47 WARN AsyncEventQueue: Dropped 7937 events from appStatus 
since Fri Sep 21 20:14:47 UTC 2018. }}

The one thing that, for some jobs seem to help somewhat is reducing the number 
of partitions. E.g. one job (that also uses LinearRegression) runs in 30min 
instead of 2hrs when the number of partitions is reduced from 10 to 1 (the job 
took 15min in Spark 2.0.2).

> InterruptedException when closing SparkContext
> --
>
> Key: SPARK-24523
> URL: https://issues.apache.org/jira/browse/SPARK-24523
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0, 2.3.1
> Environment: EMR 5.14.0, S3/HDFS inputs and outputs; EMR 5.17
>  
>  
>  
>Reporter: Umayr Hassan
>Priority: Major
> Attachments: spark-stop-jstack.log.1, spark-stop-jstack.log.2, 
> spark-stop-jstack.log.3
>
>
> I'm running a Scala application in EMR with the following properties:
> {{--master yarn --deploy-mode cluster --driver-memory 13g --executor-memory 
> 30g --executor-cores 5 --conf spark.default.parallelism=400 --conf 
> spark.dynamicAllocation.enabled=true --conf 
> spark.dynamicAllocation.maxExecutors=20 --conf 
> spark.eventLog.dir=hdfs:///var/log/spark/apps --conf 
> spark.eventLog.enabled=true --conf 
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 --conf 
> spark.scheduler.listenerbus.eventqueue.capacity=2 --conf 
> spark.shuffle.service.enabled=true --conf spark.sql.shuffle.partitions=400 
> --conf spark.yarn.maxAppAttempts=1}}
> The application runs fine till SparkContext is (automatically) closed, at 
> which point the SparkContext object throws. 
> {{18/06/10 10:44:43 ERROR Utils: Uncaught exception in thread pool-4-thread-1 
> java.lang.InterruptedException at java.lang.Object.wait(Native Method) at 
> java.lang.Thread.join(Thread.java:1252) at 
> java.lang.Thread.join(Thread.java:1326) at 
> org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:133) at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>  at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at 
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:54) at 
> org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219) at 
> org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1915)
>  at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357) at 
> org.apache.spark.SparkContext.stop(SparkContext.scala:1914) at 
> org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572) 
> at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) 
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988) at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>  at 

[jira] [Commented] (SPARK-24523) InterruptedException when closing SparkContext

2018-09-21 Thread Umayr Hassan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624202#comment-16624202
 ] 

Umayr Hassan commented on SPARK-24523:
--

[~irashid] The event logs are written to HDFS. 

[~ankur.gupta] Would the event logs have this information? Also, increasing 
driver cores to 4 didn't seem to help. BTW, I also see a warning message like:

{{18/09/21 20:15:47 WARN AsyncEventQueue: Dropped 7937 events from appStatus 
since Fri Sep 21 20:14:47 UTC 2018. }}

The one thing that, for some jobs, seems to help somewhat is reducing the number 
of partitions. E.g. one job (that also uses LinearRegression) runs in 30min 
instead of 2hrs when the number of partitions is reduced from 10 to 1 (the job 
took 15min in Spark 2.0.2).

> InterruptedException when closing SparkContext
> --
>
> Key: SPARK-24523
> URL: https://issues.apache.org/jira/browse/SPARK-24523
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0, 2.3.1
> Environment: EMR 5.14.0, S3/HDFS inputs and outputs; EMR 5.17
>  
>  
>  
>Reporter: Umayr Hassan
>Priority: Major
> Attachments: spark-stop-jstack.log.1, spark-stop-jstack.log.2, 
> spark-stop-jstack.log.3
>
>
> I'm running a Scala application in EMR with the following properties:
> {{--master yarn --deploy-mode cluster --driver-memory 13g --executor-memory 
> 30g --executor-cores 5 --conf spark.default.parallelism=400 --conf 
> spark.dynamicAllocation.enabled=true --conf 
> spark.dynamicAllocation.maxExecutors=20 --conf 
> spark.eventLog.dir=hdfs:///var/log/spark/apps --conf 
> spark.eventLog.enabled=true --conf 
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 --conf 
> spark.scheduler.listenerbus.eventqueue.capacity=2 --conf 
> spark.shuffle.service.enabled=true --conf spark.sql.shuffle.partitions=400 
> --conf spark.yarn.maxAppAttempts=1}}
> The application runs fine till SparkContext is (automatically) closed, at 
> which point the SparkContext object throws. 
> {{18/06/10 10:44:43 ERROR Utils: Uncaught exception in thread pool-4-thread-1 
> java.lang.InterruptedException at java.lang.Object.wait(Native Method) at 
> java.lang.Thread.join(Thread.java:1252) at 
> java.lang.Thread.join(Thread.java:1326) at 
> org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:133) at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>  at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at 
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:54) at 
> org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219) at 
> org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1915)
>  at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357) at 
> org.apache.spark.SparkContext.stop(SparkContext.scala:1914) at 
> org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572) 
> at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) 
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988) at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>  at scala.util.Try$.apply(Try.scala:192) at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)}}
>  
> I've not seen this behavior in Spark 2.0.2 and Spark 2.2.0 (for the same 
> application), so 

[jira] [Updated] (SPARK-25322) ML, Graph 2.4 QA: API: Experimental, DeveloperApi, final, sealed audit

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25322:
--
Priority: Critical  (was: Blocker)

> ML, Graph 2.4 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-25322
> URL: https://issues.apache.org/jira/browse/SPARK-25322
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Critical
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.






[jira] [Commented] (SPARK-25319) Spark MLlib, GraphX 2.4 QA umbrella

2018-09-21 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624124#comment-16624124
 ] 

Xiangrui Meng commented on SPARK-25319:
---

[~WeichenXu123] Could you check the OPEN subtasks and see if there are still 
TODOs left?

> Spark MLlib, GraphX 2.4 QA umbrella
> ---
>
> Key: SPARK-25319
> URL: https://issues.apache.org/jira/browse/SPARK-25319
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Weichen Xu
>Assignee: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.4.0
>
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX. *SparkR is separate.*
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
>  * Check binary API compatibility for Scala/Java
>  * Audit new public APIs (from the generated html doc)
>  ** Scala
>  ** Java compatibility
>  ** Python coverage
>  * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
>  * Performance tests
> h2. Documentation and example code
>  * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
>  * Update Programming Guide
>  * Update website






[jira] [Assigned] (SPARK-25326) ML, Graph 2.4 QA: Programming guide update and migration guide

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-25326:
-

Assignee: (was: Nick Pentreath)

> ML, Graph 2.4 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-25326
> URL: https://issues.apache.org/jira/browse/SPARK-25326
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Critical
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")






[jira] [Assigned] (SPARK-25325) ML, Graph 2.4 QA: Update user guide for new features & APIs

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-25325:
-

Assignee: (was: Nick Pentreath)

> ML, Graph 2.4 QA: Update user guide for new features & APIs
> ---
>
> Key: SPARK-25325
> URL: https://issues.apache.org/jira/browse/SPARK-25325
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Critical
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> For MLlib:
> * This task does not include major reorganizations for the programming guide.
> * We should now begin copying algorithm details from the spark.mllib guide to 
> spark.ml as needed, rather than just linking back to the corresponding 
> algorithms in the spark.mllib user guide.
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.






[jira] [Assigned] (SPARK-25327) Update MLlib, GraphX websites for 2.4

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-25327:
-

Assignee: (was: Nick Pentreath)

> Update MLlib, GraphX websites for 2.4
> -
>
> Key: SPARK-25327
> URL: https://issues.apache.org/jira/browse/SPARK-25327
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Critical
>
> Update the sub-projects' websites to include new features in this release.






[jira] [Resolved] (SPARK-25327) Update MLlib, GraphX websites for 2.4

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25327.
---
Resolution: Won't Do

> Update MLlib, GraphX websites for 2.4
> -
>
> Key: SPARK-25327
> URL: https://issues.apache.org/jira/browse/SPARK-25327
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Nick Pentreath
>Priority: Critical
>
> Update the sub-projects' websites to include new features in this release.






[jira] [Assigned] (SPARK-25322) ML, Graph 2.4 QA: API: Experimental, DeveloperApi, final, sealed audit

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-25322:
-

Assignee: (was: Nick Pentreath)

> ML, Graph 2.4 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-25322
> URL: https://issues.apache.org/jira/browse/SPARK-25322
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Blocker
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.






[jira] [Commented] (SPARK-25323) ML 2.4 QA: API: Python API coverage

2018-09-21 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624123#comment-16624123
 ] 

Xiangrui Meng commented on SPARK-25323:
---

[~WeichenXu123] Is anyone working on this? I reduced the priority from Blocker to 
Critical because it doesn't block the release.

> ML 2.4 QA: API: Python API coverage
> ---
>
> Key: SPARK-25323
> URL: https://issues.apache.org/jira/browse/SPARK-25323
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Critical
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*






[jira] [Updated] (SPARK-25323) ML 2.4 QA: API: Python API coverage

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25323:
--
Priority: Critical  (was: Blocker)

> ML 2.4 QA: API: Python API coverage
> ---
>
> Key: SPARK-25323
> URL: https://issues.apache.org/jira/browse/SPARK-25323
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Bryan Cutler
>Priority: Critical
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*






[jira] [Assigned] (SPARK-25323) ML 2.4 QA: API: Python API coverage

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-25323:
-

Assignee: (was: Bryan Cutler)

> ML 2.4 QA: API: Python API coverage
> ---
>
> Key: SPARK-25323
> URL: https://issues.apache.org/jira/browse/SPARK-25323
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Critical
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*






[jira] [Resolved] (SPARK-25324) ML 2.4 QA: API: Java compatibility, docs

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25324.
---
Resolution: Done

> ML 2.4 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-25324
> URL: https://issues.apache.org/jira/browse/SPARK-25324
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Blocker
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so we can make this 
> task easier in the future!






[jira] [Commented] (SPARK-25324) ML 2.4 QA: API: Java compatibility, docs

2018-09-21 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624122#comment-16624122
 ] 

Xiangrui Meng commented on SPARK-25324:
---

See discussion in https://issues.apache.org/jira/browse/SPARK-25321.

> ML 2.4 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-25324
> URL: https://issues.apache.org/jira/browse/SPARK-25324
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Blocker
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so we can make this 
> task easier in the future!






[jira] [Assigned] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-25321:
-

Assignee: Weichen Xu  (was: Yanbo Liang)

> ML, Graph 2.4 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-25321
> URL: https://issues.apache.org/jira/browse/SPARK-25321
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA issue






[jira] [Assigned] (SPARK-25320) ML, Graph 2.4 QA: API: Binary incompatible changes

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-25320:
-

Assignee: Weichen Xu  (was: Bago Amirbekian)

> ML, Graph 2.4 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-25320
> URL: https://issues.apache.org/jira/browse/SPARK-25320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.






[jira] [Commented] (SPARK-25320) ML, Graph 2.4 QA: API: Binary incompatible changes

2018-09-21 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624116#comment-16624116
 ] 

Xiangrui Meng commented on SPARK-25320:
---

We will keep "predict()" change because it only breaks source compatibility.

> ML, Graph 2.4 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-25320
> URL: https://issues.apache.org/jira/browse/SPARK-25320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Bago Amirbekian
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.






[jira] [Resolved] (SPARK-25320) ML, Graph 2.4 QA: API: Binary incompatible changes

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25320.
---
Resolution: Fixed

> ML, Graph 2.4 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-25320
> URL: https://issues.apache.org/jira/browse/SPARK-25320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.






[jira] [Commented] (SPARK-25320) ML, Graph 2.4 QA: API: Binary incompatible changes

2018-09-21 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624115#comment-16624115
 ] 

Xiangrui Meng commented on SPARK-25320:
---

We reverted the tree Node incompatible change and one LDA private[ml] constructor 
change, because MLeap already uses these and it is not worth the cost to make the 
change.

> ML, Graph 2.4 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-25320
> URL: https://issues.apache.org/jira/browse/SPARK-25320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Bago Amirbekian
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.






[jira] [Resolved] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25321.
---
Resolution: Fixed

Issue resolved by pull request 22510
[https://github.com/apache/spark/pull/22510]

> ML, Graph 2.4 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-25321
> URL: https://issues.apache.org/jira/browse/SPARK-25321
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Yanbo Liang
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA issue






[jira] [Assigned] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25321:


Assignee: Apache Spark  (was: Yanbo Liang)

> ML, Graph 2.4 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-25321
> URL: https://issues.apache.org/jira/browse/SPARK-25321
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA issue






[jira] [Assigned] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25321:


Assignee: Yanbo Liang  (was: Apache Spark)

> ML, Graph 2.4 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-25321
> URL: https://issues.apache.org/jira/browse/SPARK-25321
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Yanbo Liang
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA issue



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reopened SPARK-25321:
---

> ML, Graph 2.4 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-25321
> URL: https://issues.apache.org/jira/browse/SPARK-25321
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Yanbo Liang
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA issue



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25321.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22492
[https://github.com/apache/spark/pull/22492]

> ML, Graph 2.4 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-25321
> URL: https://issues.apache.org/jira/browse/SPARK-25321
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Yanbo Liang
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA issue



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24523) InterruptedException when closing SparkContext

2018-09-21 Thread Ankur Gupta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624046#comment-16624046
 ] 

Ankur Gupta commented on SPARK-24523:
-

You are correct [~irashid], it seems the "spark-listener-group-appStatus" thread 
is still processing appStatus events while the main thread is waiting for it to 
finish when this exception occurs. This may not be related to S3 at all.

It would be good to know how long this thread had been processing events, so 
any logs that provide this information will be useful. Also, it may be helpful 
to try increasing the number of driver cores in your application.
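
For example, a minimal sketch of the driver-cores suggestion, together with the 
listener-bus queue capacity already being set in the submit command quoted 
below; the values are illustrative, not a tested recommendation.

{code:scala}
// Sketch only: give the driver more cores and a larger listener-bus queue so
// the appStatus listener can drain events faster.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.cores", "4")                                   // cluster-mode driver cores
  .set("spark.scheduler.listenerbus.eventqueue.capacity", "20000")  // default is 10000
{code}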

> InterruptedException when closing SparkContext
> --
>
> Key: SPARK-24523
> URL: https://issues.apache.org/jira/browse/SPARK-24523
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0, 2.3.1
> Environment: EMR 5.14.0, S3/HDFS inputs and outputs; EMR 5.17
>  
>  
>  
>Reporter: Umayr Hassan
>Priority: Major
> Attachments: spark-stop-jstack.log.1, spark-stop-jstack.log.2, 
> spark-stop-jstack.log.3
>
>
> I'm running a Scala application in EMR with the following properties:
> {{--master yarn --deploy-mode cluster --driver-memory 13g --executor-memory 
> 30g --executor-cores 5 --conf spark.default.parallelism=400 --conf 
> spark.dynamicAllocation.enabled=true --conf 
> spark.dynamicAllocation.maxExecutors=20 --conf 
> spark.eventLog.dir=hdfs:///var/log/spark/apps --conf 
> spark.eventLog.enabled=true --conf 
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 --conf 
> spark.scheduler.listenerbus.eventqueue.capacity=2 --conf 
> spark.shuffle.service.enabled=true --conf spark.sql.shuffle.partitions=400 
> --conf spark.yarn.maxAppAttempts=1}}
> The application runs fine till SparkContext is (automatically) closed, at 
> which point the SparkContext object throws. 
> {{18/06/10 10:44:43 ERROR Utils: Uncaught exception in thread pool-4-thread-1 
> java.lang.InterruptedException at java.lang.Object.wait(Native Method) at 
> java.lang.Thread.join(Thread.java:1252) at 
> java.lang.Thread.join(Thread.java:1326) at 
> org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:133) at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>  at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at 
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:54) at 
> org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219) at 
> org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1915)
>  at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357) at 
> org.apache.spark.SparkContext.stop(SparkContext.scala:1914) at 
> org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572) 
> at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) 
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988) at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>  at scala.util.Try$.apply(Try.scala:192) at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)}}
>  
> I've not seen this behavior in Spark 2.0.2 and Spark 2.2.0 (for the same 
> application), so I'm not sure which change is causing Spark 2.3 to throw. Any 
> ideas?
> best,
> Umayr



--
This message was sent by 

[jira] [Assigned] (SPARK-25510) Create new trait replace BenchmarkWithCodegen

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25510:


Assignee: Apache Spark

>  Create new trait replace BenchmarkWithCodegen
> --
>
> Key: SPARK-25510
> URL: https://issues.apache.org/jira/browse/SPARK-25510
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25510) Create new trait replace BenchmarkWithCodegen

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25510:


Assignee: (was: Apache Spark)

>  Create new trait replace BenchmarkWithCodegen
> --
>
> Key: SPARK-25510
> URL: https://issues.apache.org/jira/browse/SPARK-25510
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25510) Create new trait replace BenchmarkWithCodegen

2018-09-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623962#comment-16623962
 ] 

Apache Spark commented on SPARK-25510:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/22522

>  Create new trait replace BenchmarkWithCodegen
> --
>
> Key: SPARK-25510
> URL: https://issues.apache.org/jira/browse/SPARK-25510
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25510) Create new trait replace BenchmarkWithCodegen

2018-09-21 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-25510:

Summary:  Create new trait replace BenchmarkWithCodegen  (was: Create new 
BenchmarkWithCodegen trait doesn't extends SparkFunSuite)

>  Create new trait replace BenchmarkWithCodegen
> --
>
> Key: SPARK-25510
> URL: https://issues.apache.org/jira/browse/SPARK-25510
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25510) Create new BenchmarkWithCodegen trait doesn't extends SparkFunSuite

2018-09-21 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-25510:
---

 Summary: Create new BenchmarkWithCodegen trait doesn't extends 
SparkFunSuite
 Key: SPARK-25510
 URL: https://issues.apache.org/jira/browse/SPARK-25510
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 2.5.0
Reporter: Yuming Wang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24519) MapStatus has 2000 hardcoded

2018-09-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623924#comment-16623924
 ] 

Apache Spark commented on SPARK-24519:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22521

> MapStatus has 2000 hardcoded
> 
>
> Key: SPARK-24519
> URL: https://issues.apache.org/jira/browse/SPARK-24519
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Hieu Tri Huynh
>Assignee: Hieu Tri Huynh
>Priority: Minor
> Fix For: 2.4.0
>
>
> MapStatus uses a hardcoded value of 2000 partitions to determine whether it 
> should use the highly compressed map status. We should make this configurable 
> so that users can more easily tune their jobs in this respect without having 
> to modify their code to change the number of partitions. Note we can leave 
> this as an internal/undocumented config for now until we have more advice for 
> users on how to set it.
> Some of my reasoning:
> A config gives you a way to change this easily without having to change code, 
> redeploy the jar, and run again; you can simply change the config and rerun. 
> It also allows for easier experimentation. Changing the number of partitions 
> has other side effects, whether good or bad is situation dependent: it can 
> increase the number of output files when you don't want it to, and it affects 
> the number of tasks needed and thus the executors running in parallel, etc.
> There have been various talks about this number at Spark summits where people 
> have told customers to increase it to 2001 partitions. A search for "spark 
> 2000 partitions" will find various articles all discussing this number, which 
> shows that people are modifying their code to take it into account, so it 
> seems to me that having this configurable would be better.
> Once we have more advice for users, we could expose this and document 
> information on it.
>  
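
A minimal sketch of what such an internal config entry could look like inside 
Spark's org.apache.spark.internal.config package object; the key name is 
illustrative and may differ from whatever the pull request finally adds.

{code:scala}
// Sketch only -- lives in org.apache.spark.internal.config; key name illustrative.
private[spark] val SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_COMPRESS =
  ConfigBuilder("spark.shuffle.minNumPartitionsToHighlyCompress")
    .internal()
    .doc("Number of partitions above which MapStatus switches to the highly " +
      "compressed representation instead of tracking each block size exactly.")
    .intConf
    .checkValue(_ > 0, "The value should be a positive integer.")
    .createWithDefault(2000)
{code}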



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25409) Speed up Spark History at start if there are tens of thousands of applications.

2018-09-21 Thread Rong Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rong Tang resolved SPARK-25409.
---
Resolution: Won't Fix

> Speed up Spark History at start if there are tens of thousands of 
> applications.
> ---
>
> Key: SPARK-25409
> URL: https://issues.apache.org/jira/browse/SPARK-25409
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Rong Tang
>Priority: Major
> Attachments: SPARK-25409.0001.patch
>
>
> We have a Spark history server storing 7 days' worth of applications; it 
> usually has 10K to 20K attempts.
> We found that it can take hours at startup to load/replay the logs in the 
> event-log folder, so newly finished applications have to wait several hours 
> to be seen. I therefore made 2 improvements:
>  # Since we run Spark on YARN, information about on-going applications is 
> also available from the resource manager, so I introduced a flag 
> spark.history.fs.load.incomplete to control whether logs for incomplete 
> attempts are loaded at all.
>  # Incremental loading of applications. As noted, we have more than 10K 
> applications stored and it can take hours to load all of them the first time, 
> so I introduced a config spark.history.fs.appRefreshNum to limit how many 
> applications are loaded per pass, giving the server a chance to check for the 
> latest updates in between.
> Here are the benchmarks I ran. Our system has 1K incomplete applications 
> (they were not cleaned up for some reason, which is another issue I need to 
> investigate), and application logs can be gigabytes in size. 
>  
> Not loading incomplete attempts:
> | |Load Count|Load incomplete APPs|All attempts number|Time Cost|Increase with more attempts|
> |1 (current implementation)|All|Yes|13K|2 hours 14 minutes|Yes|
> |2|All|No|13K|31 minutes|Yes|
>  
> Limiting how much to load each time:
> | |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Increase with more attempts|
> |1 (current implementation)|All|Yes|13K|2 hours 14 minutes|Yes|
> |2|3000|Yes|13K|42 minutes except the last 1.6K (the last 1.6K attempts took an extremely long 2.5 hours)|No|
>  
> Limiting how many to load each time, and not loading incomplete attempts:
> | |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Avg|Increase with more attempts|
> |1 (current implementation)|All|Yes|13K|2 hours 14 minutes| |Yes|
> |2|3000|No|12K|17 minutes|10 minutes (41 minutes in total)|No|
>  
> | |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Avg|Increase with more attempts|
> |1 (current implementation)|All|Yes|20K|1 hour 52 minutes| |Yes|
> |2|3000|No|18.5K|20 minutes|18 minutes (2 hours 18 minutes in total)|No|
>  
>  
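
For reference, a sketch of how the two proposed flags would be set; both keys 
come from the attached patch (SPARK-25409.0001.patch) and are not upstream 
Spark configs.

{code:scala}
// Sketch only: the values match the benchmark runs above.
import org.apache.spark.SparkConf

val historyConf = new SparkConf()
  .set("spark.history.fs.load.incomplete", "false") // skip replaying in-progress attempts
  .set("spark.history.fs.appRefreshNum", "3000")    // load at most 3000 applications per scan
{code}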



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25409) Speed up Spark History at start if there are tens of thousands of applications.

2018-09-21 Thread Rong Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623919#comment-16623919
 ] 

Rong Tang commented on SPARK-25409:
---

Closing this JIRA, as https://issues.apache.org/jira/browse/SPARK-6951 has 
already done a great job of speeding up loading.

 

> Speed up Spark History at start if there are tens of thousands of 
> applications.
> ---
>
> Key: SPARK-25409
> URL: https://issues.apache.org/jira/browse/SPARK-25409
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Rong Tang
>Priority: Major
> Attachments: SPARK-25409.0001.patch
>
>
> We have a spark history server, storing 7 days' applications. it usually has 
> 10K to 20K attempts.
> We found that it can take hours at start up,loading/replaying the logs in 
> event-logs folder.  thus, new finished applications have to wait several 
> hours to be seem. So I made 2 improvements for it.
>  # As we run spark on yarn. the on-going applications' information can also 
> be seen via resource manager, so I introduce in a flag 
> spark.history.fs.load.incomplete to say loading logs for incomplete attempts 
> or not.
>  # Incremental loading applications. as I said, we have more then 10K 
> applications stored, it can take hours to load all of them at the first time. 
> so I introduced in a config spark.history.fs.appRefreshNum to say how many 
> application to load each time, then it gets a chance the check the latest 
> updates.
> Here are the benchmark I did.  our system has 1K incomplete application ( it 
> was not cleaned up for some reason, that is another issue that I need 
> investigate), and applications' log size can be gigabytes. 
>  
> Not load incomplete attempts.
> | |Load Count|Load incomplete APPs|All attempts number|Time Cost|Increase 
> with more attempts|
> |1 ( current implementation)|All|Yes|13K|2 hours 14 minutes|Yes|
> |2|All|No|13K|31 minutes| yes|
>  
>  
> Limit each time how much to load.
>  
> | |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Increase 
> with more attempts|
> |1 ( current implementation)|All|Yes|13K|2 hours 14 minutes|Yes|
> |2|3000|Yes|13K|42 minutes except last 1.6K
> (The last 1.6K attempts cost extremely long 2.5 hours)|NO|
>  
>  
> Limit each time how many to load, and not load incomplete jobs.
>  
> | |Load Count|Load incomplete APPs|All attempts number|Worst 
> Cost|Avg|Increase with more attempts|
> |1 ( current implementation)|All|Yes|13K|2 hours 14 minutes| |Yes|
> |2|3000|NO|12K|17minutes
>  |10 minutes
> ( 41 minutes in total)|NO|
>  
>  
> | |Load Count|Load incomplete APPs|All attempts number|Worst 
> Cost|Avg|Increase with more attempts|
> |1 ( current implementation)|All|Yes|20K|1 hour 52 minutes| |Yes|
> |2|3000|NO|18.5K|20minutes|18 minutes
> (2 hours 18 minutes in total)
>  |NO|
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25509) SHS V2 cannot enabled in Windows, because POSIX permissions is not support.

2018-09-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623898#comment-16623898
 ] 

Apache Spark commented on SPARK-25509:
--

User 'jianjianjiao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22520

> SHS V2 cannot enabled in Windows, because POSIX permissions is not support.
> ---
>
> Key: SPARK-25509
> URL: https://issues.apache.org/jira/browse/SPARK-25509
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Rong Tang
>Priority: Major
>
> SHS V2 cannot be enabled on Windows, because Windows doesn't support POSIX 
> permissions. 
> It fails with the exception java.lang.UnsupportedOperationException: 
> 'posix:permissions' not supported as initial attribute.
> Without this fix, the test case 
> org.apache.spark.deploy.history.HistoryServerDiskManagerSuite test("leasing 
> space") fails on Windows.
>  
> PR: https://github.com/apache/spark/pull/22520
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25509) SHS V2 cannot enabled in Windows, because POSIX permissions is not support.

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25509:


Assignee: Apache Spark

> SHS V2 cannot enabled in Windows, because POSIX permissions is not support.
> ---
>
> Key: SPARK-25509
> URL: https://issues.apache.org/jira/browse/SPARK-25509
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Rong Tang
>Assignee: Apache Spark
>Priority: Major
>
> SHS V2 cannot be enabled on Windows, because Windows doesn't support POSIX 
> permissions. 
> It fails with the exception java.lang.UnsupportedOperationException: 
> 'posix:permissions' not supported as initial attribute.
> Without this fix, the test case 
> org.apache.spark.deploy.history.HistoryServerDiskManagerSuite test("leasing 
> space") fails on Windows.
>  
> PR: https://github.com/apache/spark/pull/22520
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25509) SHS V2 cannot enabled in Windows, because POSIX permissions is not support.

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25509:


Assignee: (was: Apache Spark)

> SHS V2 cannot enabled in Windows, because POSIX permissions is not support.
> ---
>
> Key: SPARK-25509
> URL: https://issues.apache.org/jira/browse/SPARK-25509
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Rong Tang
>Priority: Major
>
> SHS V2 cannot be enabled on Windows, because Windows doesn't support POSIX 
> permissions. 
> It fails with the exception java.lang.UnsupportedOperationException: 
> 'posix:permissions' not supported as initial attribute.
> Without this fix, the test case 
> org.apache.spark.deploy.history.HistoryServerDiskManagerSuite test("leasing 
> space") fails on Windows.
>  
> PR: https://github.com/apache/spark/pull/22520
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25509) SHS V2 cannot enabled in Windows, because POSIX permissions is not support.

2018-09-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623896#comment-16623896
 ] 

Apache Spark commented on SPARK-25509:
--

User 'jianjianjiao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22520

> SHS V2 cannot enabled in Windows, because POSIX permissions is not support.
> ---
>
> Key: SPARK-25509
> URL: https://issues.apache.org/jira/browse/SPARK-25509
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Rong Tang
>Priority: Major
>
> SHS V2 cannot be enabled on Windows, because Windows doesn't support POSIX 
> permissions. 
> It fails with the exception java.lang.UnsupportedOperationException: 
> 'posix:permissions' not supported as initial attribute.
> Without this fix, the test case 
> org.apache.spark.deploy.history.HistoryServerDiskManagerSuite test("leasing 
> space") fails on Windows.
>  
> PR: https://github.com/apache/spark/pull/22520
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25509) SHS V2 cannot enabled in Windows, because POSIX permissions is not support.

2018-09-21 Thread Rong Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rong Tang updated SPARK-25509:
--
Description: 
SHS V2 cannot be enabled on Windows, because Windows doesn't support POSIX 
permissions. 

It fails with the exception java.lang.UnsupportedOperationException: 
'posix:permissions' not supported as initial attribute.

Without this fix, the test case 
org.apache.spark.deploy.history.HistoryServerDiskManagerSuite test("leasing 
space") fails on Windows.

 

PR: https://github.com/apache/spark/pull/22520

 

  was:
SHS V2 cannot enabled in Windoes, because windows doesn't support POSIX 
permission. 

with exception: java.lang.UnsupportedOperationException: 'posix:permissions' 
not supported as initial attribute.

test case fails in windows without this fix. 
org.apache.spark.deploy.history.HistoryServerDiskManagerSuite test("leasing 
space")

 


> SHS V2 cannot enabled in Windows, because POSIX permissions is not support.
> ---
>
> Key: SPARK-25509
> URL: https://issues.apache.org/jira/browse/SPARK-25509
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Rong Tang
>Priority: Major
>
> SHS V2 cannot be enabled on Windows, because Windows doesn't support POSIX 
> permissions. 
> It fails with the exception java.lang.UnsupportedOperationException: 
> 'posix:permissions' not supported as initial attribute.
> Without this fix, the test case 
> org.apache.spark.deploy.history.HistoryServerDiskManagerSuite test("leasing 
> space") fails on Windows.
>  
> PR: https://github.com/apache/spark/pull/22520
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25509) SHS V2 cannot enabled in Windows, because POSIX permissions is not support.

2018-09-21 Thread Rong Tang (JIRA)
Rong Tang created SPARK-25509:
-

 Summary: SHS V2 cannot enabled in Windows, because POSIX 
permissions is not support.
 Key: SPARK-25509
 URL: https://issues.apache.org/jira/browse/SPARK-25509
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1, 2.4.0
Reporter: Rong Tang


SHS V2 cannot be enabled on Windows, because Windows doesn't support POSIX 
permissions. 

It fails with the exception java.lang.UnsupportedOperationException: 
'posix:permissions' not supported as initial attribute.

Without this fix, the test case 
org.apache.spark.deploy.history.HistoryServerDiskManagerSuite test("leasing 
space") fails on Windows.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25477) “INSERT OVERWRITE LOCAL DIRECTORY”, the data files allocated on the non-driver node will not be written to the specified output directory

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25477:


Assignee: Apache Spark

> “INSERT OVERWRITE LOCAL DIRECTORY”, the data files allocated on the 
> non-driver node will not be written to the specified output directory
> -
>
> Key: SPARK-25477
> URL: https://issues.apache.org/jira/browse/SPARK-25477
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1
>Reporter: hantiantian
>Assignee: Apache Spark
>Priority: Major
>
> The "INSERT OVERWRITE LOCAL DIRECTORY" features use the local staging 
> directory to load data into the specified output directory. If the executor 
> is executed on a non-driver node, the data files allocated on that node will 
> not be written to the specified output directory.
> The "spark-sql" application has two executors, one on the driver node and one 
> on the non-driver node.
> spark-sql> select * from aa;
> 1 1
> 2 3
> 2 1
> 3 3
> The data files of the aa table on HDFS are as follows:
> /spark/aa/part-0-5adc7acc-6e2d-407a-be77-abdfce61e55f-c000
> /spark/aa/part-1-5adc7acc-6e2d-407a-be77-abdfce61e55f-c000
> /spark/aa/part-2-5adc7acc-6e2d-407a-be77-abdfce61e55f-c000
> /spark/aa/part-3-5adc7acc-6e2d-407a-be77-abdfce61e55f-c000
>  
> spark-sql> insert overwrite local directory "/test" (select * from aa);
> Looking at the contents of the output directory "/test", there are only two 
> data files:
> part-0-16312194-288a-4085-9cb8-de0a685f7af0-c000
>  part-2-16312194-288a-4085-9cb8-de0a685f7af0-c000
>  _SUCCESS
> The executor log on the non-driver node shows that the other two data files 
> were assigned to that node, and that the execution results of those tasks 
> were saved to the local staging directory.
> INFO org.apache.spark.rdd.HadoopRDD: Input split: 
> hdfs://nameservice/spark/aa/part-1-5adc7acc-6e2d-407a-be77-abdfce61e55f-c000:0+8
>  INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved 
> output of task 'attempt_20180920151705_0002_m_01_0' to 
> file:/spark-tmp/stagedir/.hive-staging_hive_2018-09-20_15-17-01_313_4124694630008666159-1/-ext-1/_temporary/0/task_20180920151705_0002_m_01
>  
> INFO org.apache.spark.rdd.HadoopRDD: Input split: 
> hdfs://nameservice/spark/aa/part-3-5adc7acc-6e2d-407a-be77-abdfce61e55f-c000:0+8
> INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output 
> of task 'attempt_20180920151705_0002_m_03_0' to 
> file:/spark-tmp/stagedir/.hive-staging_hive_2018-09-20_15-17-01_313_4124694630008666159-1/-ext-1/_temporary/0/task_20180920151705_0002_m_03
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25374) SafeProjection supports fallback to an interpreted mode

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25374:


Assignee: (was: Apache Spark)

> SafeProjection supports fallback to an interpreted mode
> ---
>
> Key: SPARK-25374
> URL: https://issues.apache.org/jira/browse/SPARK-25374
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In SPARK-23711, UnsafeProjection supports fallback to an interpreted mode. 
> SafeProjection needs to support, too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25374) SafeProjection supports fallback to an interpreted mode

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25374:


Assignee: Apache Spark

> SafeProjection supports fallback to an interpreted mode
> ---
>
> Key: SPARK-25374
> URL: https://issues.apache.org/jira/browse/SPARK-25374
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> In SPARK-23711, UnsafeProjection supports fallback to an interpreted mode. 
> SafeProjection needs to support, too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25477) “INSERT OVERWRITE LOCAL DIRECTORY”, the data files allocated on the non-driver node will not be written to the specified output directory

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25477:


Assignee: (was: Apache Spark)

> “INSERT OVERWRITE LOCAL DIRECTORY”, the data files allocated on the 
> non-driver node will not be written to the specified output directory
> -
>
> Key: SPARK-25477
> URL: https://issues.apache.org/jira/browse/SPARK-25477
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1
>Reporter: hantiantian
>Priority: Major
>
> The "INSERT OVERWRITE LOCAL DIRECTORY" features use the local staging 
> directory to load data into the specified output directory. If the executor 
> is executed on a non-driver node, the data files allocated on that node will 
> not be written to the specified output directory.
> The "spark-sql" application has two executors, one on the driver node and one 
> on the non-driver node.
> spark-sql> select * from aa;
> 1 1
> 2 3
> 2 1
> 3 3
> As follows, view the data file of the aa table on HDFS,
> /spark/aa/part-0-5adc7acc-6e2d-407a-be77-abdfce61e55f-c000
> /spark/aa/part-1-5adc7acc-6e2d-407a-be77-abdfce61e55f-c000
> /spark/aa/part-2-5adc7acc-6e2d-407a-be77-abdfce61e55f-c000
> /spark/aa/part-3-5adc7acc-6e2d-407a-be77-abdfce61e55f-c000
>  
> spark-sql> insert overwrite local directory "/test" (select * from aa);
> As follows, view the contents of the output directory "/test", there are only 
> two data files,
> part-0-16312194-288a-4085-9cb8-de0a685f7af0-c000
>  part-2-16312194-288a-4085-9cb8-de0a685f7af0-c000
>  _SUCCESS
> As follows, view the executor log on non-driver node, the other two data 
> files is Is assigned to this node. And the execution results of the tasks is 
> saved to the local staging directory.
> INFO org.apache.spark.rdd.HadoopRDD: Input split: 
> hdfs://nameservice/spark/aa/part-1-5adc7acc-6e2d-407a-be77-abdfce61e55f-c000:0+8
>  INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved 
> output of task 'attempt_20180920151705_0002_m_01_0' to 
> file:/spark-tmp/stagedir/.hive-staging_hive_2018-09-20_15-17-01_313_4124694630008666159-1/-ext-1/_temporary/0/task_20180920151705_0002_m_01
>  
> INFO org.apache.spark.rdd.HadoopRDD: Input split: 
> hdfs://nameservice/spark/aa/part-3-5adc7acc-6e2d-407a-be77-abdfce61e55f-c000:0+8
> INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output 
> of task 'attempt_20180920151705_0002_m_03_0' to 
> file:/spark-tmp/stagedir/.hive-staging_hive_2018-09-20_15-17-01_313_4124694630008666159-1/-ext-1/_temporary/0/task_20180920151705_0002_m_03
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25449) Don't send zero accumulators in heartbeats

2018-09-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623876#comment-16623876
 ] 

Apache Spark commented on SPARK-25449:
--

User 'mukulmurthy' has created a pull request for this issue:
https://github.com/apache/spark/pull/22473

> Don't send zero accumulators in heartbeats
> --
>
> Key: SPARK-25449
> URL: https://issues.apache.org/jira/browse/SPARK-25449
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Heartbeats sent from executors to the driver every 10 seconds contain metrics 
> and are generally on the order of a few KBs. However, for large jobs with 
> lots of tasks, heartbeats can be on the order of tens of MBs, causing tasks 
> to die with heartbeat failures. We can mitigate this by not sending zero 
> metrics to the driver.
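
A minimal sketch of the idea (a hypothetical helper, not the actual change in 
the pull request): drop accumulators whose value is still zero from each task's 
update list before the executor packs them into the heartbeat.

{code:scala}
// Sketch only: AccumulatorV2.isZero tells whether an accumulator still holds
// its zero value, so such entries can be omitted from the heartbeat payload.
import org.apache.spark.util.AccumulatorV2

def dropZeroUpdates(
    updates: Seq[(Long, Seq[AccumulatorV2[_, _]])]): Seq[(Long, Seq[AccumulatorV2[_, _]])] = {
  updates.map { case (taskId, accums) => (taskId, accums.filterNot(_.isZero)) }
}
{code}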



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25374) SafeProjection supports fallback to an interpreted mode

2018-09-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623878#comment-16623878
 ] 

Apache Spark commented on SPARK-25374:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/22468

> SafeProjection supports fallback to an interpreted mode
> ---
>
> Key: SPARK-25374
> URL: https://issues.apache.org/jira/browse/SPARK-25374
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In SPARK-23711, UnsafeProjection supports fallback to an interpreted mode. 
> SafeProjection needs to support, too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25477) “INSERT OVERWRITE LOCAL DIRECTORY”, the data files allocated on the non-driver node will not be written to the specified output directory

2018-09-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623877#comment-16623877
 ] 

Apache Spark commented on SPARK-25477:
--

User 'httfighter' has created a pull request for this issue:
https://github.com/apache/spark/pull/22487

> “INSERT OVERWRITE LOCAL DIRECTORY”, the data files allocated on the 
> non-driver node will not be written to the specified output directory
> -
>
> Key: SPARK-25477
> URL: https://issues.apache.org/jira/browse/SPARK-25477
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1
>Reporter: hantiantian
>Priority: Major
>
> The "INSERT OVERWRITE LOCAL DIRECTORY" features use the local staging 
> directory to load data into the specified output directory. If the executor 
> is executed on a non-driver node, the data files allocated on that node will 
> not be written to the specified output directory.
> The "spark-sql" application has two executors, one on the driver node and one 
> on the non-driver node.
> spark-sql> select * from aa;
> 1 1
> 2 3
> 2 1
> 3 3
> As follows, view the data file of the aa table on HDFS,
> /spark/aa/part-0-5adc7acc-6e2d-407a-be77-abdfce61e55f-c000
> /spark/aa/part-1-5adc7acc-6e2d-407a-be77-abdfce61e55f-c000
> /spark/aa/part-2-5adc7acc-6e2d-407a-be77-abdfce61e55f-c000
> /spark/aa/part-3-5adc7acc-6e2d-407a-be77-abdfce61e55f-c000
>  
> spark-sql> insert overwrite local directory "/test" (select * from aa);
> As follows, view the contents of the output directory "/test", there are only 
> two data files,
> part-0-16312194-288a-4085-9cb8-de0a685f7af0-c000
>  part-2-16312194-288a-4085-9cb8-de0a685f7af0-c000
>  _SUCCESS
> As follows, view the executor log on non-driver node, the other two data 
> files is Is assigned to this node. And the execution results of the tasks is 
> saved to the local staging directory.
> INFO org.apache.spark.rdd.HadoopRDD: Input split: 
> hdfs://nameservice/spark/aa/part-1-5adc7acc-6e2d-407a-be77-abdfce61e55f-c000:0+8
>  INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved 
> output of task 'attempt_20180920151705_0002_m_01_0' to 
> file:/spark-tmp/stagedir/.hive-staging_hive_2018-09-20_15-17-01_313_4124694630008666159-1/-ext-1/_temporary/0/task_20180920151705_0002_m_01
>  
> INFO org.apache.spark.rdd.HadoopRDD: Input split: 
> hdfs://nameservice/spark/aa/part-3-5adc7acc-6e2d-407a-be77-abdfce61e55f-c000:0+8
> INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output 
> of task 'attempt_20180920151705_0002_m_03_0' to 
> file:/spark-tmp/stagedir/.hive-staging_hive_2018-09-20_15-17-01_313_4124694630008666159-1/-ext-1/_temporary/0/task_20180920151705_0002_m_03
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25449) Don't send zero accumulators in heartbeats

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25449:


Assignee: (was: Apache Spark)

> Don't send zero accumulators in heartbeats
> --
>
> Key: SPARK-25449
> URL: https://issues.apache.org/jira/browse/SPARK-25449
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Heartbeats sent from executors to the driver every 10 seconds contain metrics 
> and are generally on the order of a few KBs. However, for large jobs with 
> lots of tasks, heartbeats can be on the order of tens of MBs, causing tasks 
> to die with heartbeat failures. We can mitigate this by not sending zero 
> metrics to the driver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25449) Don't send zero accumulators in heartbeats

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25449:


Assignee: Apache Spark

> Don't send zero accumulators in heartbeats
> --
>
> Key: SPARK-25449
> URL: https://issues.apache.org/jira/browse/SPARK-25449
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Assignee: Apache Spark
>Priority: Major
>
> Heartbeats sent from executors to the driver every 10 seconds contain metrics 
> and are generally on the order of a few KBs. However, for large jobs with 
> lots of tasks, heartbeats can be on the order of tens of MBs, causing tasks 
> to die with heartbeat failures. We can mitigate this by not sending zero 
> metrics to the driver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25483) Refactor UnsafeArrayDataBenchmark to use main method

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25483:


Assignee: (was: Apache Spark)

> Refactor UnsafeArrayDataBenchmark to use main method
> 
>
> Key: SPARK-25483
> URL: https://issues.apache.org/jira/browse/SPARK-25483
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25476) Refactor AggregateBenchmark to use main method

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25476:


Assignee: Apache Spark

> Refactor AggregateBenchmark to use main method
> --
>
> Key: SPARK-25476
> URL: https://issues.apache.org/jira/browse/SPARK-25476
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25492) Refactor WideSchemaBenchmark to use main method

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25492:


Assignee: (was: Apache Spark)

> Refactor WideSchemaBenchmark to use main method
> ---
>
> Key: SPARK-25492
> URL: https://issues.apache.org/jira/browse/SPARK-25492
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25417) ArrayContains function may return incorrect result when right expression is implicitly down casted

2018-09-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623875#comment-16623875
 ] 

Apache Spark commented on SPARK-25417:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/22448

> ArrayContains function may return incorrect result when right expression is 
> implicitly down casted
> --
>
> Key: SPARK-25417
> URL: https://issues.apache.org/jira/browse/SPARK-25417
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 2.4.0
>
>
> In ArrayContains, we currently cast the right-hand-side expression to match 
> the element type of the left-hand-side array. This may result in a down cast 
> and may return a wrong or questionable result.
> Example :
> {code:java}
> spark-sql> select array_position(array(1), 1.34);
> true
> {code}
>  
> {code:java}
> spark-sql> select array_position(array(1), 'foo');
> null
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25409) Speed up Spark History at start if there are tens of thousands of applications.

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25409:


Assignee: (was: Apache Spark)

> Speed up Spark History at start if there are tens of thousands of 
> applications.
> ---
>
> Key: SPARK-25409
> URL: https://issues.apache.org/jira/browse/SPARK-25409
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Rong Tang
>Priority: Major
> Attachments: SPARK-25409.0001.patch
>
>
> We have a spark history server, storing 7 days' applications. it usually has 
> 10K to 20K attempts.
> We found that it can take hours at start up,loading/replaying the logs in 
> event-logs folder.  thus, new finished applications have to wait several 
> hours to be seem. So I made 2 improvements for it.
>  # As we run spark on yarn. the on-going applications' information can also 
> be seen via resource manager, so I introduce in a flag 
> spark.history.fs.load.incomplete to say loading logs for incomplete attempts 
> or not.
>  # Incremental loading applications. as I said, we have more then 10K 
> applications stored, it can take hours to load all of them at the first time. 
> so I introduced in a config spark.history.fs.appRefreshNum to say how many 
> application to load each time, then it gets a chance the check the latest 
> updates.
> Here are the benchmark I did.  our system has 1K incomplete application ( it 
> was not cleaned up for some reason, that is another issue that I need 
> investigate), and applications' log size can be gigabytes. 
>  
> Not load incomplete attempts.
> | |Load Count|Load incomplete APPs|All attempts number|Time Cost|Increase 
> with more attempts|
> |1 ( current implementation)|All|Yes|13K|2 hours 14 minutes|Yes|
> |2|All|No|13K|31 minutes| yes|
>  
>  
> Limit each time how much to load.
>  
> | |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Increase 
> with more attempts|
> |1 ( current implementation)|All|Yes|13K|2 hours 14 minutes|Yes|
> |2|3000|Yes|13K|42 minutes except last 1.6K
> (The last 1.6K attempts cost extremely long 2.5 hours)|NO|
>  
>  
> Limit each time how many to load, and not load incomplete jobs.
>  
> | |Load Count|Load incomplete APPs|All attempts number|Worst 
> Cost|Avg|Increase with more attempts|
> |1 ( current implementation)|All|Yes|13K|2 hours 14 minutes| |Yes|
> |2|3000|NO|12K|17minutes
>  |10 minutes
> ( 41 minutes in total)|NO|
>  
>  
> | |Load Count|Load incomplete APPs|All attempts number|Worst 
> Cost|Avg|Increase with more attempts|
> |1 ( current implementation)|All|Yes|20K|1 hour 52 minutes| |Yes|
> |2|3000|NO|18.5K|20minutes|18 minutes
> (2 hours 18 minutes in total)
>  |NO|
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25478) Refactor CompressionSchemeBenchmark to use main method

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25478:


Assignee: (was: Apache Spark)

> Refactor CompressionSchemeBenchmark to use main method
> --
>
> Key: SPARK-25478
> URL: https://issues.apache.org/jira/browse/SPARK-25478
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25409) Speed up Spark History at start if there are tens of thousands of applications.

2018-09-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623874#comment-16623874
 ] 

Apache Spark commented on SPARK-25409:
--

User 'jianjianjiao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22444

> Speed up Spark History at start if there are tens of thousands of 
> applications.
> ---
>
> Key: SPARK-25409
> URL: https://issues.apache.org/jira/browse/SPARK-25409
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Rong Tang
>Priority: Major
> Attachments: SPARK-25409.0001.patch
>
>
> We have a spark history server, storing 7 days' applications. it usually has 
> 10K to 20K attempts.
> We found that it can take hours at start up,loading/replaying the logs in 
> event-logs folder.  thus, new finished applications have to wait several 
> hours to be seem. So I made 2 improvements for it.
>  # As we run spark on yarn. the on-going applications' information can also 
> be seen via resource manager, so I introduce in a flag 
> spark.history.fs.load.incomplete to say loading logs for incomplete attempts 
> or not.
>  # Incremental loading applications. as I said, we have more then 10K 
> applications stored, it can take hours to load all of them at the first time. 
> so I introduced in a config spark.history.fs.appRefreshNum to say how many 
> application to load each time, then it gets a chance the check the latest 
> updates.
> Here are the benchmark I did.  our system has 1K incomplete application ( it 
> was not cleaned up for some reason, that is another issue that I need 
> investigate), and applications' log size can be gigabytes. 
>  
> Not load incomplete attempts.
> | |Load Count|Load incomplete APPs|All attempts number|Time Cost|Increase 
> with more attempts|
> |1 ( current implementation)|All|Yes|13K|2 hours 14 minutes|Yes|
> |2|All|No|13K|31 minutes| yes|
>  
>  
> Limit each time how much to load.
>  
> | |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Increase 
> with more attempts|
> |1 ( current implementation)|All|Yes|13K|2 hours 14 minutes|Yes|
> |2|3000|Yes|13K|42 minutes except last 1.6K
> (The last 1.6K attempts cost extremely long 2.5 hours)|NO|
>  
>  
> Limit each time how many to load, and not load incomplete jobs.
>  
> | |Load Count|Load incomplete APPs|All attempts number|Worst 
> Cost|Avg|Increase with more attempts|
> |1 ( current implementation)|All|Yes|13K|2 hours 14 minutes| |Yes|
> |2|3000|NO|12K|17minutes
>  |10 minutes
> ( 41 minutes in total)|NO|
>  
>  
> | |Load Count|Load incomplete APPs|All attempts number|Worst 
> Cost|Avg|Increase with more attempts|
> |1 ( current implementation)|All|Yes|20K|1 hour 52 minutes| |Yes|
> |2|3000|NO|18.5K|20minutes|18 minutes
> (2 hours 18 minutes in total)
>  |NO|
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25409) Speed up Spark History at start if there are tens of thousands of applications.

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25409:


Assignee: Apache Spark

> Speed up Spark History at start if there are tens of thousands of 
> applications.
> ---
>
> Key: SPARK-25409
> URL: https://issues.apache.org/jira/browse/SPARK-25409
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Rong Tang
>Assignee: Apache Spark
>Priority: Major
> Attachments: SPARK-25409.0001.patch
>
>
> We have a Spark history server storing 7 days' applications. It usually has 
> 10K to 20K attempts.
> We found that it can take hours at start-up, loading/replaying the logs in 
> the event-logs folder. Thus, newly finished applications have to wait several 
> hours to be seen. So I made 2 improvements for it.
>  # As we run Spark on YARN, the on-going applications' information can also 
> be seen via the resource manager, so I introduced a flag 
> spark.history.fs.load.incomplete to control whether logs for incomplete 
> attempts are loaded.
>  # Incremental loading of applications. As I said, we have more than 10K 
> applications stored, and it can take hours to load all of them the first 
> time, so I introduced a config spark.history.fs.appRefreshNum to say how many 
> applications to load each time, so the server gets a chance to check for the 
> latest updates.
> Here are the benchmarks I did. Our system has 1K incomplete applications 
> (they were not cleaned up for some reason; that is another issue I need to 
> investigate), and application log sizes can be gigabytes.
> 
> Without loading incomplete attempts:
> | |Load Count|Load incomplete APPs|All attempts number|Time Cost|Increase with more attempts|
> |1 (current implementation)|All|Yes|13K|2 hours 14 minutes|Yes|
> |2|All|No|13K|31 minutes|Yes|
> 
> Limiting how many applications to load each time:
> | |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Increase with more attempts|
> |1 (current implementation)|All|Yes|13K|2 hours 14 minutes|Yes|
> |2|3000|Yes|13K|42 minutes, except the last 1.6K (those took an extremely long 2.5 hours)|No|
> 
> Limiting how many applications to load each time, and not loading incomplete attempts:
> | |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Avg|Increase with more attempts|
> |1 (current implementation)|All|Yes|13K|2 hours 14 minutes| |Yes|
> |2|3000|No|12K|17 minutes|10 minutes (41 minutes in total)|No|
> 
> | |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Avg|Increase with more attempts|
> |1 (current implementation)|All|Yes|20K|1 hour 52 minutes| |Yes|
> |2|3000|No|18.5K|20 minutes|18 minutes (2 hours 18 minutes in total)|No|
> 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25483) Refactor UnsafeArrayDataBenchmark to use main method

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25483:


Assignee: Apache Spark

> Refactor UnsafeArrayDataBenchmark to use main method
> 
>
> Key: SPARK-25483
> URL: https://issues.apache.org/jira/browse/SPARK-25483
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25476) Refactor AggregateBenchmark to use main method

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25476:


Assignee: (was: Apache Spark)

> Refactor AggregateBenchmark to use main method
> --
>
> Key: SPARK-25476
> URL: https://issues.apache.org/jira/browse/SPARK-25476
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25478) Refactor CompressionSchemeBenchmark to use main method

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25478:


Assignee: Apache Spark

> Refactor CompressionSchemeBenchmark to use main method
> --
>
> Key: SPARK-25478
> URL: https://issues.apache.org/jira/browse/SPARK-25478
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25492) Refactor WideSchemaBenchmark to use main method

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25492:


Assignee: Apache Spark

> Refactor WideSchemaBenchmark to use main method
> ---
>
> Key: SPARK-25492
> URL: https://issues.apache.org/jira/browse/SPARK-25492
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20937) Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20937:


Assignee: Apache Spark

> Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, 
> DataFrames and Datasets Guide
> -
>
> Key: SPARK-20937
> URL: https://issues.apache.org/jira/browse/SPARK-20937
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>Priority: Trivial
>
> As a follow-up to SPARK-20297 (and SPARK-10400) in which 
> {{spark.sql.parquet.writeLegacyFormat}} property was recommended for Impala 
> and Hive, Spark SQL docs for [Parquet 
> Files|https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration]
>  should have it documented.
> p.s. It was asked about in [Why can't Impala read parquet files after Spark 
> SQL's write?|https://stackoverflow.com/q/44279870/1305344] on StackOverflow 
> today.
> p.s. It's also covered in [~holden.ka...@gmail.com]'s "High Performance 
> Spark: Best Practices for Scaling and Optimizing Apache Spark" book (in Table 
> 3-10. Parquet data source options) that gives the option some wider publicity.
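
For readers wondering what shape such documentation might take, here is a minimal sketch, assuming a running SparkSession named {{spark}}; the decimal column and output path are made up, while the property itself is the spark.sql.parquet.writeLegacyFormat setting discussed above.

{code:java}
// Minimal sketch: write Parquet in the legacy layout so that older
// Impala/Hive readers can consume it.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

spark.range(10)
  .selectExpr("CAST(id AS DECIMAL(10, 2)) AS amount") // decimal columns are among the types whose layout changes
  .write
  .mode("overwrite")
  .parquet("/tmp/legacy_parquet_example")             // illustrative path
{code}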



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20937) Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20937:


Assignee: (was: Apache Spark)

> Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, 
> DataFrames and Datasets Guide
> -
>
> Key: SPARK-20937
> URL: https://issues.apache.org/jira/browse/SPARK-20937
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> As a follow-up to SPARK-20297 (and SPARK-10400) in which 
> {{spark.sql.parquet.writeLegacyFormat}} property was recommended for Impala 
> and Hive, Spark SQL docs for [Parquet 
> Files|https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration]
>  should have it documented.
> p.s. It was asked about in [Why can't Impala read parquet files after Spark 
> SQL's write?|https://stackoverflow.com/q/44279870/1305344] on StackOverflow 
> today.
> p.s. It's also covered in [~holden.ka...@gmail.com]'s "High Performance 
> Spark: Best Practices for Scaling and Optimizing Apache Spark" book (in Table 
> 3-10. Parquet data source options) that gives the option some wider publicity.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23779) TaskMemoryManager and UnsafeSorter use MemoryBlock

2018-09-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623869#comment-16623869
 ] 

Apache Spark commented on SPARK-23779:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/20992

> TaskMemoryManager and UnsafeSorter use MemoryBlock
> --
>
> Key: SPARK-23779
> URL: https://issues.apache.org/jira/browse/SPARK-23779
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> This JIRA entry tries to use {{MemoryBlock}} in {{TaskMemoryManager}} and 
> classes related to {{UnsafeSorter}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23763) OffHeapColumnVector uses MemoryBlock

2018-09-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623867#comment-16623867
 ] 

Apache Spark commented on SPARK-23763:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/20874

> OffHeapColumnVector uses MemoryBlock
> 
>
> Key: SPARK-23763
> URL: https://issues.apache.org/jira/browse/SPARK-23763
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> This JIRA entry tries to use {{MemoryBlock}} in {{OffHeapColumnVector}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25425) Extra options must overwrite sessions options

2018-09-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623834#comment-16623834
 ] 

Apache Spark commented on SPARK-25425:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22489

> Extra options must overwrite sessions options
> -
>
> Key: SPARK-25425
> URL: https://issues.apache.org/jira/browse/SPARK-25425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> In load() and save() methods of DataSource V2, extra options are overwritten 
> by session options:
> * 
> https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L244-L245
> * 
> https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L205
> but the implementation must be the opposite: more specific extra options set 
> via *.option(...)* must overwrite the more general session options
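
To make the intended precedence concrete, here is a plain-Scala sketch of the map-merge ordering. The option names are illustrative and this is not the actual Spark code; it only shows why the order of the {{++}} matters.

{code:java}
// With Scala's Map ++, keys from the right-hand operand win, so the extra
// (per-call) options must be merged last to take precedence.
val sessionOptions = Map("compression" -> "gzip")   // illustrative session-level defaults
val extraOptions   = Map("compression" -> "snappy") // set explicitly via .option(...)

val current  = extraOptions ++ sessionOptions  // session value silently wins
val expected = sessionOptions ++ extraOptions  // explicit option wins, as intended

println(current("compression"))   // gzip
println(expected("compression"))  // snappy
{code}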



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25508) Refactor OrcReadBenchmark to use main method

2018-09-21 Thread yucai (JIRA)
yucai created SPARK-25508:
-

 Summary: Refactor OrcReadBenchmark to use main method
 Key: SPARK-25508
 URL: https://issues.apache.org/jira/browse/SPARK-25508
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.5.0
Reporter: yucai






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18082) Locality Sensitive Hashing (LSH) - SignRandomProjection

2018-09-21 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623832#comment-16623832
 ] 

Sean Owen commented on SPARK-18082:
---

From the linked JIRA, is this talking about LSH that approximates cosine 
similarity? That's already what the random projection implementation does. Is 
this obsolete / a duplicate?

> Locality Sensitive Hashing (LSH) - SignRandomProjection
> ---
>
> Key: SPARK-18082
> URL: https://issues.apache.org/jira/browse/SPARK-18082
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yun Ni
>Priority: Minor
>
> See linked JIRA for original LSH for details



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24499) Documentation improvement of Spark core and SQL

2018-09-21 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623799#comment-16623799
 ] 

Xiao Li commented on SPARK-24499:
-

ping [~XuanYuan] Any update?

> Documentation improvement of Spark core and SQL
> ---
>
> Key: SPARK-24499
> URL: https://issues.apache.org/jira/browse/SPARK-24499
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> The current documentation in Apache Spark lacks enough code examples and 
> tips. If needed, we should also split the page of 
> https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple 
> separate pages like what we did for 
> https://spark.apache.org/docs/latest/ml-guide.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25179) Document the features that require Pyarrow 0.10

2018-09-21 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25179:

Issue Type: Sub-task  (was: Documentation)
Parent: SPARK-25507

> Document the features that require Pyarrow 0.10
> ---
>
> Key: SPARK-25179
> URL: https://issues.apache.org/jira/browse/SPARK-25179
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Document the features that require Pyarrow 0.10 . For 
> example, https://github.com/apache/spark/pull/20725
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> binary type support requires pyarrow 0.10.0. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25507) Update documents for the new features in 2.4 release

2018-09-21 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25507:
---

 Summary: Update documents for the new features in 2.4 release
 Key: SPARK-25507
 URL: https://issues.apache.org/jira/browse/SPARK-25507
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.4.0
Reporter: Xiao Li






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-09-21 Thread Nicholas Chammas (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623788#comment-16623788
 ] 

Nicholas Chammas commented on SPARK-25150:
--

Given that Spark appears to provide incorrect results when 
spark.sql.crossJoin.enabled is set to true, shall we mark this as a correctness 
issue?

[~petertoth] / [~EeveeB] - Would you agree with that characterization?

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nicholas Chammas
>Priority: Major
> Attachments: output-with-implicit-cross-join.txt, 
> output-without-implicit-cross-join.txt, persons.csv, states.csv, 
> zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not correct in the sense that it should be 
> left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25485) Refactor UnsafeProjectionBenchmark to use main method

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25485:


Assignee: Apache Spark

> Refactor UnsafeProjectionBenchmark to use main method
> -
>
> Key: SPARK-25485
> URL: https://issues.apache.org/jira/browse/SPARK-25485
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: yucai
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25485) Refactor UnsafeProjectionBenchmark to use main method

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25485:


Assignee: (was: Apache Spark)

> Refactor UnsafeProjectionBenchmark to use main method
> -
>
> Key: SPARK-25485
> URL: https://issues.apache.org/jira/browse/SPARK-25485
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: yucai
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25505) The output order of grouping columns in Pivot is different from the input order

2018-09-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623785#comment-16623785
 ] 

Apache Spark commented on SPARK-25505:
--

User 'maryannxue' has created a pull request for this issue:
https://github.com/apache/spark/pull/22519

> The output order of grouping columns in Pivot is different from the input 
> order
> ---
>
> Key: SPARK-25505
> URL: https://issues.apache.org/jira/browse/SPARK-25505
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maryann Xue
>Priority: Minor
>
> For example,
> {code}
> SELECT * FROM (
>   SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, 
> "x" as x, "d" as d, "w" as w
>   FROM courseSales
> )
> PIVOT (
>   sum(earnings)
>   FOR course IN ('dotNET', 'Java')
> )
> {code}
> The output columns should be "a, z, b, y, c, x, d, w, ..." but now it is "a, 
> b, c, d, w, x, y, z, ..."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25505) The output order of grouping columns in Pivot is different from the input order

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25505:


Assignee: (was: Apache Spark)

> The output order of grouping columns in Pivot is different from the input 
> order
> ---
>
> Key: SPARK-25505
> URL: https://issues.apache.org/jira/browse/SPARK-25505
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maryann Xue
>Priority: Minor
>
> For example,
> {code}
> SELECT * FROM (
>   SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, 
> "x" as x, "d" as d, "w" as w
>   FROM courseSales
> )
> PIVOT (
>   sum(earnings)
>   FOR course IN ('dotNET', 'Java')
> )
> {code}
> The output columns should be "a, z, b, y, c, x, d, w, ..." but now it is "a, 
> b, c, d, w, x, y, z, ..."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25505) The output order of grouping columns in Pivot is different from the input order

2018-09-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623784#comment-16623784
 ] 

Apache Spark commented on SPARK-25505:
--

User 'maryannxue' has created a pull request for this issue:
https://github.com/apache/spark/pull/22519

> The output order of grouping columns in Pivot is different from the input 
> order
> ---
>
> Key: SPARK-25505
> URL: https://issues.apache.org/jira/browse/SPARK-25505
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maryann Xue
>Priority: Minor
>
> For example,
> {code}
> SELECT * FROM (
>   SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, 
> "x" as x, "d" as d, "w" as w
>   FROM courseSales
> )
> PIVOT (
>   sum(earnings)
>   FOR course IN ('dotNET', 'Java')
> )
> {code}
> The output columns should be "a, z, b, y, c, x, d, w, ..." but now it is "a, 
> b, c, d, w, x, y, z, ..."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25505) The output order of grouping columns in Pivot is different from the input order

2018-09-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25505:


Assignee: Apache Spark

> The output order of grouping columns in Pivot is different from the input 
> order
> ---
>
> Key: SPARK-25505
> URL: https://issues.apache.org/jira/browse/SPARK-25505
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maryann Xue
>Assignee: Apache Spark
>Priority: Minor
>
> For example,
> {code}
> SELECT * FROM (
>   SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, 
> "x" as x, "d" as d, "w" as w
>   FROM courseSales
> )
> PIVOT (
>   sum(earnings)
>   FOR course IN ('dotNET', 'Java')
> )
> {code}
> The output columns should be "a, z, b, y, c, x, d, w, ..." but now it is "a, 
> b, c, d, w, x, y, z, ..."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10816) EventTime based sessionization

2018-09-21 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623749#comment-16623749
 ] 

Jungtaek Lim edited comment on SPARK-10816 at 9/21/18 3:25 PM:
---

[~msukmanowsky]

Thanks for showing your interest on this! If you are interested on my proposal 
you can even pull my patch and build (though it is marked as WIP even now it 
works... handling state is just a bit suboptimal), and play with custom build. 
(If you read my proposal you may be noticed that the proposal addresses the 
missing part you're seeing from map/flatMapGroupsWithState.)

Your scenario looks like fit to simple gap window, with event time & watermark. 
I guess with my patch it would be represented as SQL statement like:
{code:java}
SELECT userId, session.start AS startTimestampMs, session.end AS 
endTimestampMs, session.end - session.start AS durationMs FROM tbl GROUP BY 
session(event_time, 30 minutes){code}
or DSL in below links.

(Here session.end is defined as gap ends. If you would like to have last event 
timestamp in session, max(event_time) would work.)

In append mode you can only see the sessions which are evicted, and in update 
mode you can see all updated sessions for every batch.

I also added UTs which converts structured sessionization example into session 
window function. Please check it out and let me know if something doesn't work 
as you expect.

Append mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L475-L573]

Update mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L618-L715]

The query will work for both batch and streaming without modification of SQL 
statement or DSL. For batch it doesn't leverage state. 


was (Author: kabhwan):
[~msukmanowsky]

Thanks for showing your interest on this! If you are interested on my proposal 
you can even pull my patch and build (though it is marked as WIP even now it 
works... handling state is just a bit suboptimal), and play with custom build. 
(If you read my proposal you may be noticed that the proposal addresses the 
missing part you're seeing from map/flatMapGroupsWithState.)

Your scenario looks like fit to simple gap window, with event time & watermark. 
I guess with my patch it would be represented as SQL statement like:
{code:java}
SELECT userId, session.start AS startTimestampMs, session.end AS 
endTimestampMs, session.end - session.start AS durationMs FROM tbl GROUP BY 
session(event_time, 30 minutes){code}
or DSL in below links.

(Here session.end is defined as gap ends. If you would like to have last event 
timestamp in session, max(event_time) would work.)

In append mode you can only see the sessions which are evicted, and in update 
mode you can see all updated sessions for every batch.

I also added UTs which converts structured sessionization example into session 
window function. Please check it out and let me know if something doesn't work 
as you expect.

Append mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L475-L573]

Update mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L618-L715]

 

> EventTime based sessionization
> --
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Priority: Major
> Attachments: SPARK-10816 Support session window natively.pdf
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10816) EventTime based sessionization

2018-09-21 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623749#comment-16623749
 ] 

Jungtaek Lim edited comment on SPARK-10816 at 9/21/18 3:24 PM:
---

[~msukmanowsky]

Thanks for showing your interest on this! If you are interested on my proposal 
you can even pull my patch and build (though it is marked as WIP even now it 
works... handling state is just a bit suboptimal), and play with custom build. 
(If you read my proposal you may be noticed that the proposal addresses the 
missing part you're seeing from map/flatMapGroupsWithState.)

Your scenario looks like fit to simple gap window, with event time & watermark. 
I guess with my patch it would be represented as SQL statement like:
{code:java}
SELECT userId, session.start AS startTimestampMs, session.end AS 
endTimestampMs, session.end - session.start AS durationMs FROM tbl GROUP BY 
session(event_time, 30 minutes){code}
or DSL in below links.

(Here session.end is defined as gap ends. If you would like to have last event 
timestamp in session, max(event_time) would work.)

In append mode you can only see the sessions which are evicted, and in update 
mode you can see all updated sessions for every batch.

I also added UTs which converts structured sessionization example into session 
window function. Please check it out and let me know if something doesn't work 
as you expect.

Append mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L475-L573]

Update mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L618-L715]

 


was (Author: kabhwan):
[~msukmanowsky]

Thanks for showing your interest on this! If you are interested on my proposal 
you can even pull my patch and build (though it is marked as WIP even now it 
works... handling state is just a bit suboptimal), and play with custom build. 
(If you read my proposal you may be noticed that the proposal addresses the 
lack you're seeing from map/flatMapGroupsWithState.)

Your scenario looks like fit to simple gap window, with event time & watermark. 
I guess with my patch it would be represented as SQL statement like:
{code:java}
SELECT userId, session.start AS startTimestampMs, session.end AS 
endTimestampMs, session.end - session.start AS durationMs FROM tbl GROUP BY 
session(event_time, 30 minutes){code}
or DSL in below links.

(Here session.end is defined as gap ends. If you would like to have last event 
timestamp in session, max(event_time) would work.)

In append mode you can only see the sessions which are evicted, and in update 
mode you can see all updated sessions for every batch.

I also added UTs which converts structured sessionization example into session 
window function. Please check it out and let me know if something doesn't work 
as you expect.

Append mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L475-L573]

Update mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L618-L715]

 

> EventTime based sessionization
> --
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Priority: Major
> Attachments: SPARK-10816 Support session window natively.pdf
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10816) EventTime based sessionization

2018-09-21 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623749#comment-16623749
 ] 

Jungtaek Lim edited comment on SPARK-10816 at 9/21/18 3:13 PM:
---

[~msukmanowsky]

Thanks for showing your interest on this! If you are interested on my proposal 
you can even pull my patch and build (though it is marked as WIP even now it 
works... handling state is just a bit suboptimal), and play with custom build. 
(If you read my proposal you may be noticed that the proposal addresses the 
lack you're seeing from map/flatMapGroupsWithState.)

Your scenario looks like fit to simple gap window, with event time & watermark. 
I guess with my patch it would be represented as SQL statement like:
{code:java}
SELECT userId, session.start AS startTimestampMs, session.end AS 
endTimestampMs, session.end - session.start AS durationMs FROM tbl GROUP BY 
session(event_time, 30 minutes){code}
or DSL in below links.

(Here session.end is defined as gap ends. If you would like to have last event 
timestamp in session, max(event_time) would work.)

In append mode you can only see the sessions which are evicted, and in update 
mode you can see all updated sessions for every batch.

I also added UTs which converts structured sessionization example into session 
window function. Please check it out and let me know if something doesn't work 
as you expect.

Append mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L475-L573]

Update mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L618-L715]

 


was (Author: kabhwan):
[~msukmanowsky]

Thanks for showing your interest on this! If you are interested on my proposal 
you can even pull my patch and build, and play with custom build. (If you read 
my proposal you may be noticed that the proposal addresses the lack you're 
seeing from map/flatMapGroupsWithState.)

Your scenario looks like fit to simple gap window, with event time & watermark. 
I guess with my patch it would be represented as SQL statement like:
{code:java}
SELECT userId, session.start AS startTimestampMs, session.end AS 
endTimestampMs, session.end - session.start AS durationMs FROM tbl GROUP BY 
session(event_time, 30 minutes){code}
or DSL in below links.

(Here session.end is defined as gap ends. If you would like to have last event 
timestamp in session, max(event_time) would work.)

In append mode you can only see the sessions which are evicted, and in update 
mode you can see all updated sessions for every batch.

I also added UTs which converts structured sessionization example into session 
window function. Please check it out and let me know if something doesn't work 
as you expect.

Append mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L475-L573]

Update mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L618-L715]

 

> EventTime based sessionization
> --
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Priority: Major
> Attachments: SPARK-10816 Support session window natively.pdf
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10816) EventTime based sessionization

2018-09-21 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623749#comment-16623749
 ] 

Jungtaek Lim edited comment on SPARK-10816 at 9/21/18 3:10 PM:
---

[~msukmanowsky]

Thanks for showing your interest on this! If you are interested on my proposal 
you can even pull my patch and build, and play with custom build. (If you read 
my proposal you may be noticed that the proposal addresses the lack you're 
seeing from map/flatMapGroupsWithState.)

Your scenario looks like fit to simple gap window, with event time & watermark. 
I guess with my patch it would be represented as SQL statement like:
{code:java}
SELECT userId, session.start AS startTimestampMs, session.end AS 
endTimestampMs, session.end - session.start AS durationMs FROM tbl GROUP BY 
session(event_time, 30 minutes){code}
or DSL in below links.

(Here session.end is defined as gap ends. If you would like to have last event 
timestamp in session, max(event_time) would work.)

In append mode you can only see the sessions which are evicted, and in update 
mode you can see all updated sessions for every batch.

I also added UTs which converts structured sessionization example into session 
window function. Please check it out and let me know if something doesn't work 
as you expect.

Append mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L475-L573]

Update mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L618-L715]

 


was (Author: kabhwan):
[~msukmanowsky]

Thanks for showing your interest on this! If you are interested on my proposal 
you can even pull my patch and build, and play with custom build.

Your scenario looks like fit to simple gap window, with event time & watermark. 
I guess with my patch it would be represented as SQL statement like:
{code:java}
SELECT userId, session.start AS startTimestampMs, session.end AS 
endTimestampMs, session.end - session.start AS durationMs FROM tbl GROUP BY 
session(event_time, 30 minutes){code}
or DSL in below links.

(Here session.end is defined as gap ends. If you would like to have last event 
timestamp in session, max(event_time) would work.)

In append mode you can only see the sessions which are evicted, and in update 
mode you can see all updated sessions for every batch.

I also added UTs which converts structured sessionization example into session 
window function. Please check it out and let me know if something doesn't work 
as you expect.

Append mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L475-L573]

Update mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L618-L715]

 

> EventTime based sessionization
> --
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Priority: Major
> Attachments: SPARK-10816 Support session window natively.pdf
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10816) EventTime based sessionization

2018-09-21 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623749#comment-16623749
 ] 

Jungtaek Lim edited comment on SPARK-10816 at 9/21/18 3:09 PM:
---

[~msukmanowsky]

Thanks for showing your interest on this! If you are interested on my proposal 
you can even pull my patch and build, and play with custom build.

Your scenario looks like fit to simple gap window, with event time & watermark. 
I guess with my patch it would be represented as SQL statement like:
{code:java}
SELECT userId, session.start AS startTimestampMs, session.end AS 
endTimestampMs, session.end - session.start AS durationMs FROM tbl GROUP BY 
session(event_time, 30 minutes){code}
or DSL in below links.

(Here session.end is defined as gap ends. If you would like to have last event 
timestamp in session, max(event_time) would work.)

In append mode you can only see the sessions which are evicted, and in update 
mode you can see all updated sessions for every batch.

I also added UTs which converts structured sessionization example into session 
window function. Please check it out and let me know if something doesn't work 
as you expect.

Append mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L475-L573]

Update mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L618-L715]

 


was (Author: kabhwan):
[~msukmanowsky]

Thanks for showing your interest on this! If you are interested on my proposal 
you can even pull my patch and build, and play with custom build.

Your scenario looks like fit to simple gap window, with event time & watermark. 
I guess with my patch it would be represented as SQL statement like:
{code:java}
SELECT userId, session.start AS startTimestampMs, session.end AS 
endTimestampMs, session.end - session.start AS durationMs FROM tbl GROUP BY 
session(event_time, 30 minutes){code}
or DSL in below links.

In append mode you can only see the sessions which are evicted, and in update 
mode you can see all updated sessions for every batch.

I also added UTs which converts structured sessionization example into session 
window function. Please check it out and let me know if something doesn't work 
as you expect.

Append mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L475-L573]

Update mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L618-L715]

 

> EventTime based sessionization
> --
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Priority: Major
> Attachments: SPARK-10816 Support session window natively.pdf
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10816) EventTime based sessionization

2018-09-21 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623749#comment-16623749
 ] 

Jungtaek Lim commented on SPARK-10816:
--

[~msukmanowsky]

Thanks for showing your interest on this! If you are interested on my proposal 
you can even pull my patch and build, and play with custom build.

Your scenario looks like fit to simple gap window, with event time & watermark. 
I guess with my patch it would be represented as SQL statement like:
{code:java}
SELECT userId, session.start AS startTimestampMs, session.end AS 
endTimestampMs, session.end - session.start AS durationMs FROM tbl GROUP BY 
session(event_time, 30 minutes){code}
or DSL in below links.

In append mode you can only see the sessions which are evicted, and in update 
mode you can see all updated sessions for every batch.

I also added UTs which converts structured sessionization example into session 
window function. Please check it out and let me know if something doesn't work 
as you expect.

Append mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L475-L573]

Update mode:

[https://github.com/HeartSaVioR/spark/blob/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L618-L715]

 

> EventTime based sessionization
> --
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Priority: Major
> Attachments: SPARK-10816 Support session window natively.pdf
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25506) Spark CSV multiline with CRLF

2018-09-21 Thread eugen yushin (JIRA)
eugen yushin created SPARK-25506:


 Summary: Spark CSV multiline with CRLF
 Key: SPARK-25506
 URL: https://issues.apache.org/jira/browse/SPARK-25506
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 2.3.1, 2.2.0
 Environment: spark 2.2.0 and 2.3.1
scala 2.11.8
Reporter: eugen yushin


Spark produces empty rows (or ']' when printing via a call to `collect`) when 
dealing with a '\r' character at the end of each line in a CSV file. Note that 
no fields are escaped in the original input file.
{code:java}
val multilineDf = sparkSession.read
  .format("csv")
  .options(Map("header" -> "true", "inferSchema" -> "false", "escape" -> "\"", 
"multiLine" -> "true"))
  .load("src/test/resources/multiLineHeader.csv")

val df = sparkSession.read
  .format("csv")
  .options(Map("header" -> "true", "inferSchema" -> "false", "escape" -> "\""))
  .load("src/test/resources/multiLineHeader.csv")

multilineDf.show()
multilineDf.collect().foreach(println)

df.show()
df.collect().foreach(println)
{code}
Result:
{code:java}
++-+
|
++-+
|
|
++-+

]
]

+++
|col1|col2|
+++
|   1|   1|
|   2|   2|
+++

[1,1]
[2,2]
{code}
Input file:
{code:java}
cat -vt src/test/resources/multiLineHeader.csv
col1,col2^M
1,1^M
2,2^M
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24523) InterruptedException when closing SparkContext

2018-09-21 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623712#comment-16623712
 ] 

Imran Rashid commented on SPARK-24523:
--

Wait, are you writing event logs to S3?  I thought the event logs were being 
written to HDFS, and S3 is the regular input & output of the application.  The 
stack traces don't look like they're from the rename ... I think it's that 
{{SQLAppStatusListener}} is just way behind.  This looks like it's during 
{{listenerBus.stop()}}, which is before {{eventLogger.stop()}} (where the 
rename happens): 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1959-L1967

> InterruptedException when closing SparkContext
> --
>
> Key: SPARK-24523
> URL: https://issues.apache.org/jira/browse/SPARK-24523
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0, 2.3.1
> Environment: EMR 5.14.0, S3/HDFS inputs and outputs; EMR 5.17
>  
>  
>  
>Reporter: Umayr Hassan
>Priority: Major
> Attachments: spark-stop-jstack.log.1, spark-stop-jstack.log.2, 
> spark-stop-jstack.log.3
>
>
> I'm running a Scala application in EMR with the following properties:
> {{--master yarn --deploy-mode cluster --driver-memory 13g --executor-memory 
> 30g --executor-cores 5 --conf spark.default.parallelism=400 --conf 
> spark.dynamicAllocation.enabled=true --conf 
> spark.dynamicAllocation.maxExecutors=20 --conf 
> spark.eventLog.dir=hdfs:///var/log/spark/apps --conf 
> spark.eventLog.enabled=true --conf 
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 --conf 
> spark.scheduler.listenerbus.eventqueue.capacity=2 --conf 
> spark.shuffle.service.enabled=true --conf spark.sql.shuffle.partitions=400 
> --conf spark.yarn.maxAppAttempts=1}}
> The application runs fine till SparkContext is (automatically) closed, at 
> which point the SparkContext object throws. 
> {{18/06/10 10:44:43 ERROR Utils: Uncaught exception in thread pool-4-thread-1 
> java.lang.InterruptedException at java.lang.Object.wait(Native Method) at 
> java.lang.Thread.join(Thread.java:1252) at 
> java.lang.Thread.join(Thread.java:1326) at 
> org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:133) at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>  at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at 
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:54) at 
> org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219) at 
> org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1915)
>  at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357) at 
> org.apache.spark.SparkContext.stop(SparkContext.scala:1914) at 
> org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572) 
> at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) 
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988) at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>  at scala.util.Try$.apply(Try.scala:192) at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)}}
>  
> I've not seen this behavior in Spark 2.0.2 and Spark 2.2.0 (for the same 
> application), so I'm not sure which change is causing Spark 2.3 to throw. Any 
> ideas?
> 

[jira] [Commented] (SPARK-25454) Division between operands with negative scale can cause precision loss

2018-09-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623710#comment-16623710
 ] 

Apache Spark commented on SPARK-25454:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/22494

> Division between operands with negative scale can cause precision loss
> --
>
> Key: SPARK-25454
> URL: https://issues.apache.org/jira/browse/SPARK-25454
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Marco Gaido
>Priority: Major
>
> The issue was originally reported by [~bersprockets] here: 
> https://issues.apache.org/jira/browse/SPARK-22036?focusedCommentId=16618104=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16618104.
> The problem consists of a precision loss when the second operand of the 
> division is a decimal with a negative scale. It was also present before 2.3, 
> but it was harder to reproduce: you had to do something like 
> {{lit(BigDecimal(100e6))}}, while now this can happen more frequently with 
> SQL constants.
> The problem is that our logic is taken from Hive and SQL Server, where 
> decimals with negative scales are not allowed. We might also consider 
> enforcing this in 3.0 eventually. Meanwhile we can fix the logic for 
> computing the result type for a division.
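
As a purely illustrative sketch of the construct mentioned above (the concrete values are made up, a running SparkSession named {{spark}} is assumed, and no particular wrong result is asserted here), one can inspect the result type Spark infers for such a division:

{code:java}
// Illustrative only: BigDecimal(100e6) is the kind of literal called out above.
import org.apache.spark.sql.functions.lit

val df = spark.range(1)
  .select((lit(BigDecimal("12345678.9012")) / lit(BigDecimal(100e6))).as("ratio"))

df.printSchema()          // check the precision/scale chosen for the result decimal
df.show(truncate = false)
{code}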



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25505) The output order of grouping columns in Pivot is different from the input order

2018-09-21 Thread Maryann Xue (JIRA)
Maryann Xue created SPARK-25505:
---

 Summary: The output order of grouping columns in Pivot is 
different from the input order
 Key: SPARK-25505
 URL: https://issues.apache.org/jira/browse/SPARK-25505
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maryann Xue


For example,
{code}
SELECT * FROM (
  SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, 
"x" as x, "d" as d, "w" as w
  FROM courseSales
)
PIVOT (
  sum(earnings)
  FOR course IN ('dotNET', 'Java')
)
{code}
The output columns should be "a, z, b, y, c, x, d, w, ..." but now it is "a, b, 
c, d, w, x, y, z, ..."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25499) Refactor BenchmarkBase and Benchmark

2018-09-21 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25499:
---

Assignee: Gengliang Wang

> Refactor BenchmarkBase and Benchmark
> 
>
> Key: SPARK-25499
> URL: https://issues.apache.org/jira/browse/SPARK-25499
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.5.0
>
>
> Currently there are two classes with the same name, BenchmarkBase:
> 1. org.apache.spark.util.BenchmarkBase
> 2. org.apache.spark.sql.execution.benchmark.BenchmarkBase
> Here I propose:
> 1. org.apache.spark.util.BenchmarkBase should be in the test package; move it 
> to org.apache.spark.sql.execution.benchmark.
> 2. Rename org.apache.spark.sql.execution.benchmark.BenchmarkBase to 
> BenchmarkWithCodegen.
> 3. Move org.apache.spark.util.Benchmark to the test package of 
> org.apache.spark.sql.execution.benchmark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25499) Refactor BenchmarkBase and Benchmark

2018-09-21 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25499.
-
   Resolution: Fixed
Fix Version/s: 2.5.0

Issue resolved by pull request 22513
[https://github.com/apache/spark/pull/22513]

> Refactor BenchmarkBase and Benchmark
> 
>
> Key: SPARK-25499
> URL: https://issues.apache.org/jira/browse/SPARK-25499
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.5.0
>
>
> Currently there are two classes with the same name, BenchmarkBase:
> 1. org.apache.spark.util.BenchmarkBase
> 2. org.apache.spark.sql.execution.benchmark.BenchmarkBase
> Here I propose:
> 1. org.apache.spark.util.BenchmarkBase belongs in a test package; move it to 
> org.apache.spark.sql.execution.benchmark.
> 2. Rename org.apache.spark.sql.execution.benchmark.BenchmarkBase to 
> BenchmarkWithCodegen.
> 3. Move org.apache.spark.util.Benchmark to the test package 
> org.apache.spark.sql.execution.benchmark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22739) Additional Expression Support for Objects

2018-09-21 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623672#comment-16623672
 ] 

Wenchen Fan commented on SPARK-22739:
-

Here it is: https://issues.apache.org/jira/browse/SPARK-24768

> Additional Expression Support for Objects
> -
>
> Key: SPARK-22739
> URL: https://issues.apache.org/jira/browse/SPARK-22739
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Aleksander Eskilson
>Priority: Major
>
> Some discussion in Spark-Avro [1] motivates additions and minor changes to 
> the {{Objects}} Expressions API [2]. The proposed changes include
> * a generalized form of {{initializeJavaBean}} taking a sequence of 
> initialization expressions that can be applied to instances of varying objects
> * an object cast that performs a simple Java type cast against a value
> * making {{ExternalMapToCatalyst}} public, for use in outside libraries
> These changes would facilitate the writing of custom encoders for varying 
> objects that cannot already be readily converted to a statically typed 
> dataset by a JavaBean encoder (e.g. Avro).
> [1] -- 
> https://github.com/databricks/spark-avro/pull/217#issuecomment-342599110
> [2] --
>  
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22739) Additional Expression Support for Objects

2018-09-21 Thread Aleksander Eskilson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623665#comment-16623665
 ] 

Aleksander Eskilson edited comment on SPARK-22739 at 9/21/18 2:10 PM:
--

[~cloud_fan], could you perhaps link here the Spark issue (if any) and PR that 
committed built-in Avro support to Spark? 

I would like to take a look at how the Avro support for Datasets created 
through the PR on this ticket, plus an additional PR in Spark-Avro (see 
[#217|https://github.com/databricks/spark-avro/pull/217]), might be folded 
into the new code. I imagine that process would include the same expressions 
this ticket would have added, followed by the new AvroEncoder that was to be 
included in the Spark-Avro project.

Happy to hear Avro will be included in Spark proper now! 

cc: [~marmbrus]


was (Author: aeskilson):
[~cloud_fan], could you perhaps link here the Spark issue (if any) and PR that 
committed built-in Avro support to Spark? 

I would like to take a look at how the Avro support for Datasets created 
through the PR on this ticket, plus an additional PR in Spark-Avro (see 
[#217|https://github.com/databricks/spark-avro/pull/217]), might be folded 
into the new code. I imagine that process would include the same expressions 
this ticket would have added, followed by the new AvroEncoder that was to be 
included in the Spark-Avro project.

cc: [~marmbrus]

> Additional Expression Support for Objects
> -
>
> Key: SPARK-22739
> URL: https://issues.apache.org/jira/browse/SPARK-22739
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Aleksander Eskilson
>Priority: Major
>
> Some discussion in Spark-Avro [1] motivates additions and minor changes to 
> the {{Objects}} Expressions API [2]. The proposed changes include
> * a generalized form of {{initializeJavaBean}} taking a sequence of 
> initialization expressions that can be applied to instances of varying objects
> * an object cast that performs a simple Java type cast against a value
> * making {{ExternalMapToCatalyst}} public, for use in outside libraries
> These changes would facilitate the writing of custom encoders for varying 
> objects that cannot already be readily converted to a statically typed 
> dataset by a JavaBean encoder (e.g. Avro).
> [1] -- 
> https://github.com/databricks/spark-avro/pull/217#issuecomment-342599110
> [2] --
>  
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22739) Additional Expression Support for Objects

2018-09-21 Thread Aleksander Eskilson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623665#comment-16623665
 ] 

Aleksander Eskilson commented on SPARK-22739:
-

[~cloud_fan], could you perhaps link here the Spark issue (if any) and PR that 
committed built-in Avro support to Spark? 

I would like to take a look at how the Avro support for Datasets created 
through the PR on this ticket, plus an additional PR in Spark-Avro (see 
[#217|https://github.com/databricks/spark-avro/pull/217]), might be folded 
into the new code. I imagine that process would include the same expressions 
this ticket would have added, followed by the new AvroEncoder that was to be 
included in the Spark-Avro project.

cc: [~marmbrus]

> Additional Expression Support for Objects
> -
>
> Key: SPARK-22739
> URL: https://issues.apache.org/jira/browse/SPARK-22739
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Aleksander Eskilson
>Priority: Major
>
> Some discussion in Spark-Avro [1] motivates additions and minor changes to 
> the {{Objects}} Expressions API [2]. The proposed changes include
> * a generalized form of {{initializeJavaBean}} taking a sequence of 
> initialization expressions that can be applied to instances of varying objects
> * an object cast that performs a simple Java type cast against a value
> * making {{ExternalMapToCatalyst}} public, for use in outside libraries
> These changes would facilitate the writing of custom encoders for varying 
> objects that cannot already be readily converted to a statically typed 
> dataset by a JavaBean encoder (e.g. Avro).
> [1] -- 
> https://github.com/databricks/spark-avro/pull/217#issuecomment-342599110
> [2] --
>  
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24355) Improve Spark shuffle server responsiveness to non-ChunkFetch requests

2018-09-21 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-24355:
-

Assignee: Sanket Chintapalli

> Improve Spark shuffle server responsiveness to non-ChunkFetch requests
> --
>
> Key: SPARK-24355
> URL: https://issues.apache.org/jira/browse/SPARK-24355
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
> Environment: Hadoop-2.7.4
> Spark-2.3.0
>Reporter: Min Shen
>Assignee: Sanket Chintapalli
>Priority: Major
> Fix For: 2.5.0
>
>
> We run Spark on YARN, and deploy Spark external shuffle service as part of 
> YARN NM aux service.
> One issue we saw with the Spark external shuffle service is the various 
> timeouts experienced by clients, either when registering an executor with the 
> local shuffle server or when establishing a connection to a remote shuffle 
> server.
> Example of a timeout when establishing a connection with a remote shuffle 
> server:
> {code:java}
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
> waiting for task.
>   at 
> org.spark_project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:288)
>   at 
> org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:248)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:106)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:115)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:182)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.org$apache$spark$storage$ShuffleBlockFetcherIterator$$send$1(ShuffleBlockFetcherIterator.scala:396)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:391)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:345)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:57)
> {code}
> Example of a timeout when registering an executor with the local shuffle 
> server:
> {code:java}
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
> waiting for task.
>   at 
> org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
>   at 
> org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
> {code}
> While patches such as SPARK-20640 and config parameters such as 
> spark.shuffle.registration.timeout and spark.shuffle.sasl.timeout (when 
> spark.authenticate is set to true) can help alleviate this type of problem, 
> they do not solve the fundamental issue.
> We have observed that, when the shuffle workload gets very busy in peak 
> hours, client requests can time out even after configuring these parameters 
> to very high values. Further investigation revealed the following:
> Right now, the default number of server-side Netty handler threads is 
> 2 * # cores, and it can be further configured with the parameter 
> spark.shuffle.io.serverThreads.
> Processing a client request requires one available server Netty handler 
> thread.
> However, when the server Netty handler threads start to process 
> ChunkFetchRequests, they get blocked on disk I/O, mostly due to disk 
> contention from the random read operations initiated by all the 
> ChunkFetchRequests received from clients.
> As a result, when the shuffle server is 
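
As a stop-gap illustration only (this is not the fix merged for this ticket), 
the knobs named in the description above can be raised; the values below are 
placeholders, and where each setting is applied (application side vs. the 
external shuffle service's own configuration) depends on the deployment:

{code:scala}
// Sketch with placeholder values: raising the timeouts and the server handler
// thread count mentioned above. Shown as a SparkConf for illustration; the
// serverThreads setting takes effect on the shuffle server side, and the SASL
// timeout only matters when spark.authenticate=true.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.io.serverThreads", "128")         // default is 2 * # cores
  .set("spark.shuffle.registration.timeout", "60000")   // in milliseconds
  .set("spark.shuffle.sasl.timeout", "60s")
{code}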

[jira] [Commented] (SPARK-24355) Improve Spark shuffle server responsiveness to non-ChunkFetch requests

2018-09-21 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623660#comment-16623660
 ] 

Thomas Graves commented on SPARK-24355:
---

The PR that got merged didn't get linked properly: 
https://github.com/apache/spark/pull/22173

> Improve Spark shuffle server responsiveness to non-ChunkFetch requests
> --
>
> Key: SPARK-24355
> URL: https://issues.apache.org/jira/browse/SPARK-24355
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
> Environment: Hadoop-2.7.4
> Spark-2.3.0
>Reporter: Min Shen
>Priority: Major
> Fix For: 2.5.0
>
>
> We run Spark on YARN, and deploy Spark external shuffle service as part of 
> YARN NM aux service.
> One issue we saw with the Spark external shuffle service is the various 
> timeouts experienced by clients, either when registering an executor with the 
> local shuffle server or when establishing a connection to a remote shuffle 
> server.
> Example of a timeout when establishing a connection with a remote shuffle 
> server:
> {code:java}
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
> waiting for task.
>   at 
> org.spark_project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:288)
>   at 
> org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:248)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:106)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:115)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:182)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.org$apache$spark$storage$ShuffleBlockFetcherIterator$$send$1(ShuffleBlockFetcherIterator.scala:396)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:391)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:345)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:57)
> {code}
> Example of a timeout when registering an executor with the local shuffle 
> server:
> {code:java}
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
> waiting for task.
>   at 
> org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
>   at 
> org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
> {code}
> While patches such as SPARK-20640 and config parameters such as 
> spark.shuffle.registration.timeout and spark.shuffle.sasl.timeout (when 
> spark.authenticate is set to true) can help alleviate this type of problem, 
> they do not solve the fundamental issue.
> We have observed that, when the shuffle workload gets very busy in peak 
> hours, client requests can time out even after configuring these parameters 
> to very high values. Further investigation revealed the following:
> Right now, the default number of server-side Netty handler threads is 
> 2 * # cores, and it can be further configured with the parameter 
> spark.shuffle.io.serverThreads.
> Processing a client request requires one available server Netty handler 
> thread.
> However, when the server Netty handler threads start to process 
> ChunkFetchRequests, they get blocked on disk I/O, mostly due to disk 
> contention from the random read operations initiated by all the 
> ChunkFetchRequests received from 

[jira] [Resolved] (SPARK-24355) Improve Spark shuffle server responsiveness to non-ChunkFetch requests

2018-09-21 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-24355.
---
   Resolution: Fixed
Fix Version/s: 2.5.0

> Improve Spark shuffle server responsiveness to non-ChunkFetch requests
> --
>
> Key: SPARK-24355
> URL: https://issues.apache.org/jira/browse/SPARK-24355
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
> Environment: Hadoop-2.7.4
> Spark-2.3.0
>Reporter: Min Shen
>Priority: Major
> Fix For: 2.5.0
>
>
> We run Spark on YARN, and deploy Spark external shuffle service as part of 
> YARN NM aux service.
> One issue we saw with the Spark external shuffle service is the various 
> timeouts experienced by clients, either when registering an executor with the 
> local shuffle server or when establishing a connection to a remote shuffle 
> server.
> Example of a timeout when establishing a connection with a remote shuffle 
> server:
> {code:java}
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
> waiting for task.
>   at 
> org.spark_project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:288)
>   at 
> org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:248)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:106)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:115)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:182)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.org$apache$spark$storage$ShuffleBlockFetcherIterator$$send$1(ShuffleBlockFetcherIterator.scala:396)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:391)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:345)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:57)
> {code}
> Example of a timeout when registering an executor with the local shuffle 
> server:
> {code:java}
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
> waiting for task.
>   at 
> org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
>   at 
> org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
> {code}
> While patches such as SPARK-20640 and config parameters such as 
> spark.shuffle.registration.timeout and spark.shuffle.sasl.timeout (when 
> spark.authenticate is set to true) can help alleviate this type of problem, 
> they do not solve the fundamental issue.
> We have observed that, when the shuffle workload gets very busy in peak 
> hours, client requests can time out even after configuring these parameters 
> to very high values. Further investigation revealed the following:
> Right now, the default number of server-side Netty handler threads is 
> 2 * # cores, and it can be further configured with the parameter 
> spark.shuffle.io.serverThreads.
> Processing a client request requires one available server Netty handler 
> thread.
> However, when the server Netty handler threads start to process 
> ChunkFetchRequests, they get blocked on disk I/O, mostly due to disk 
> contention from the random read operations initiated by all the 
> ChunkFetchRequests received from clients.
> As a result, when the shuffle server is serving many concurrent 
> 
