[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data

2018-12-26 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729357#comment-16729357
 ] 

Saisai Shao commented on SPARK-25299:
-

[~jealous] Can we have a doc about this proposed solution for us to review?

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.
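As a purely illustrative sketch of the "pluggable storage" direction discussed there (the class and method names below are invented for this example and do not come from Spark or the linked document), decoupling shuffle blocks from executor-local disk could look like:

```python
from abc import ABC, abstractmethod

class ShuffleStorage(ABC):
    """Hypothetical storage backend for shuffle blocks (illustrative only)."""

    @abstractmethod
    def write_block(self, shuffle_id: int, map_id: int, data: bytes) -> None: ...

    @abstractmethod
    def read_block(self, shuffle_id: int, map_id: int) -> bytes: ...

class InMemoryStorage(ShuffleStorage):
    """Stands in for a remote store (e.g. object storage) in this sketch."""

    def __init__(self):
        self._blocks = {}

    def write_block(self, shuffle_id, map_id, data):
        self._blocks[(shuffle_id, map_id)] = data

    def read_block(self, shuffle_id, map_id):
        # A remote-backed implementation would fetch over the network here,
        # so blocks survive the loss of the executor that wrote them.
        return self._blocks[(shuffle_id, map_id)]

storage = InMemoryStorage()
storage.write_block(shuffle_id=0, map_id=1, data=b"partition bytes")
assert storage.read_block(0, 1) == b"partition bytes"
```

A remote-backed implementation of the same interface would let shuffle data outlive both the executor and the node that produced it, addressing the first two shortcomings above.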



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26449) Missing Dataframe.transform API in Python API

2018-12-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26449:
-
Summary: Missing Dataframe.transform API in Python API  (was: 
Dataframe.transform)

> Missing Dataframe.transform API in Python API
> -
>
> Key: SPARK-26449
> URL: https://issues.apache.org/jira/browse/SPARK-26449
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Hanan Shteingart
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I would like to chain custom transformations as is suggested in this [blog 
> post|https://medium.com/@mrpowers/chaining-custom-pyspark-transformations-4f38a8c7ae55]
> This would allow writing something like the following:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import lit
>
> spark = SparkSession.builder.getOrCreate()
>
> def with_greeting(df):
>     return df.withColumn("greeting", lit("hi"))
>
> def with_something(df, something):
>     return df.withColumn("something", lit(something))
>
> data = [("jose", 1), ("li", 2), ("liz", 3)]
> source_df = spark.createDataFrame(data, ["name", "age"])
> actual_df = (source_df
>     .transform(with_greeting)
>     .transform(lambda df: with_something(df, "crazy")))
> actual_df.show()
> +----+---+--------+---------+
> |name|age|greeting|something|
> +----+---+--------+---------+
> |jose|  1|      hi|    crazy|
> |  li|  2|      hi|    crazy|
> | liz|  3|      hi|    crazy|
> +----+---+--------+---------+
> {code}
> The only thing needed to accomplish this is the following simple method for 
> DataFrame:
> {code:java}
> from pyspark.sql.dataframe import DataFrame
>
> def transform(self, f):
>     return f(self)
>
> DataFrame.transform = transform
> {code}
> I volunteer to do the pull request if approved (at least the Python part).
>  






[jira] [Resolved] (SPARK-26452) Suppressing exception in finally: Java heap space java.lang.OutOfMemoryError: Java heap space

2018-12-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26452.
--
Resolution: Invalid

Please don't just copy and paste the error message. Include information about 
the expected input and output, reproducible code if possible, and your analysis.

> Suppressing exception in finally: Java heap space java.lang.OutOfMemoryError: 
> Java heap space
> -
>
> Key: SPARK-26452
> URL: https://issues.apache.org/jira/browse/SPARK-26452
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
>Reporter: tommy duan
>Priority: Major
>
> [Stage 1852:===>(896 + 3) / 900]
> [Stage 1852:===>(897 + 3) / 900]
> [Stage 1852:===>(899 + 1) / 900]
> [Stage 1853:> (0 + 0) / 900]
> 18/12/27 06:03:45 WARN util.Utils: Suppressing exception in finally: Java heap space
> java.lang.OutOfMemoryError: Java heap space
>  at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
>  at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271)
>  at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
>  at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
>  at 
> net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
>  at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158)
>  at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>  at 
> java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822)
>  at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719)
>  at java.io.ObjectOutputStream.close(ObjectOutputStream.java:740)
>  at 
> org.apache.spark.serializer.JavaSerializationStream.close(JavaSerializer.scala:57)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$1.apply$mcV$sp(TorrentBroadcast.scala:278)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1346)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:277)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:126)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
>  at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
>  at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:56)
>  at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1488)
>  at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1006)
>  at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:776)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:775)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>  at 
> org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:775)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1259)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1711)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
> Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: 
> Java heap space
>  at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
>  at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271)
>  at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
>  at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
>  at 
> 

[jira] [Commented] (SPARK-26449) Dataframe.transform

2018-12-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729301#comment-16729301
 ] 

Hyukjin Kwon commented on SPARK-26449:
--

It seems the Scala side has it but Python doesn't. Can you open a PR with a 
regression test?
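Since the proposed `transform` is plain function application, the shape of such a regression test can be sketched without Spark at all (a hypothetical test, not from the Spark test suite; `FakeDataFrame` is invented for this sketch):

```python
class FakeDataFrame:
    """Minimal stand-in for pyspark.sql.DataFrame, just for this sketch."""

    def __init__(self, columns):
        self.columns = list(columns)

    def transform(self, f):
        # The proposed API: apply f to self and return the result.
        return f(self)

def with_greeting(df):
    return FakeDataFrame(df.columns + ["greeting"])

# Regression check: transform applies the function and chains.
df = FakeDataFrame(["name", "age"]).transform(with_greeting)
assert df.columns == ["name", "age", "greeting"]
```

A real test against PySpark would do the same assertions on an actual DataFrame.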

> Dataframe.transform
> ---
>
> Key: SPARK-26449
> URL: https://issues.apache.org/jira/browse/SPARK-26449
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Hanan Shteingart
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>






[jira] [Updated] (SPARK-26449) Missing Dataframe.transform API in Python API

2018-12-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26449:
-
Issue Type: Improvement  (was: New Feature)

> Missing Dataframe.transform API in Python API
> -
>
> Key: SPARK-26449
> URL: https://issues.apache.org/jira/browse/SPARK-26449
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Hanan Shteingart
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>






[jira] [Updated] (SPARK-26449) Missing Dataframe.transform API in Python API

2018-12-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26449:
-
Labels:   (was: patch)

> Missing Dataframe.transform API in Python API
> -
>
> Key: SPARK-26449
> URL: https://issues.apache.org/jira/browse/SPARK-26449
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Hanan Shteingart
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>






[jira] [Updated] (SPARK-26449) Missing Dataframe.transform API in Python API

2018-12-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26449:
-
Component/s: PySpark

> Missing Dataframe.transform API in Python API
> -
>
> Key: SPARK-26449
> URL: https://issues.apache.org/jira/browse/SPARK-26449
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Hanan Shteingart
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>






[jira] [Commented] (SPARK-26438) Driver waits to spark.sql.broadcastTimeout before throwing OutOfMemoryError - is this by design?

2018-12-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729300#comment-16729300
 ] 

Hyukjin Kwon commented on SPARK-26438:
--

Please ask questions on the Spark mailing list before filing an issue; you are 
likely to get a better answer there.

> Driver waits to spark.sql.broadcastTimeout before throwing OutOfMemoryError - 
> is this by design?
> 
>
> Key: SPARK-26438
> URL: https://issues.apache.org/jira/browse/SPARK-26438
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Shay Elbaz
>Priority: Trivial
>
> When broadcasting a too-large DataFrame, the driver does not fail immediately 
> when the broadcast thread throws OutOfMemoryError. Instead it waits for 
> `spark.sql.broadcastTimeout` to elapse. Is that by design, or a bug?
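The reported behavior can be mimicked in plain Python (a rough analogy only, not Spark code): if the thread building the broadcast dies without signalling the waiter, the waiting side blocks until its timeout elapses instead of failing fast.

```python
import threading

result_ready = threading.Event()

def broadcast_thread():
    try:
        raise MemoryError("simulated OutOfMemoryError while building broadcast")
    except MemoryError:
        # The failure is swallowed here without signalling result_ready,
        # so the waiter below has no way to fail fast.
        pass

t = threading.Thread(target=broadcast_thread)
t.start()
t.join()

# The waiting side only gives up when its timeout elapses, mirroring how the
# driver waits out spark.sql.broadcastTimeout rather than surfacing the error.
completed = result_ready.wait(timeout=0.1)
assert completed is False
```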






[jira] [Resolved] (SPARK-26438) Driver waits to spark.sql.broadcastTimeout before throwing OutOfMemoryError - is this by design?

2018-12-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26438.
--
Resolution: Invalid

> Driver waits to spark.sql.broadcastTimeout before throwing OutOfMemoryError - 
> is this by design?
> 
>
> Key: SPARK-26438
> URL: https://issues.apache.org/jira/browse/SPARK-26438
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Shay Elbaz
>Priority: Trivial
>






[jira] [Commented] (SPARK-26268) Decouple shuffle data from Spark deployment

2018-12-26 Thread Peiyu Zhuang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729297#comment-16729297
 ] 

Peiyu Zhuang commented on SPARK-26268:
--

See [SPARK-25299|https://issues.apache.org/jira/browse/SPARK-25299]; we are 
implementing a shuffle manager with a storage plugin that can support different 
kinds of external and local storage. The work will be open-sourced soon.

> Decouple shuffle data from Spark deployment
> ---
>
> Key: SPARK-26268
> URL: https://issues.apache.org/jira/browse/SPARK-26268
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Ben Sidhom
>Priority: Major
>
> Right now the batch scheduler assumes that shuffle data is tied to executors. 
> As a result, when an executor is lost, any map tasks that ran on that 
> executor are rescheduled unless the "external" shuffle service is being used. 
> Note that this service is only external in the sense that it does not live 
> within executors themselves; its implementation cannot be swapped out and it 
> is assumed to speak the BlockManager language.
> The following changes would facilitate external shuffle (see SPARK-25299 for 
> motivation):
>  * Do not rerun map tasks on lost executors when shuffle data is stored 
> externally. For example, this could be determined by a property or by an 
> additional method that all ShuffleManagers implement.
>  * Do not assume that shuffle data is stored in the standard BlockManager 
> format or that a BlockManager is or must be available to ShuffleManagers.
> Note that only the first change is actually required to realize the benefits 
> of remote shuffle implementations as a phony (or null) BlockManager can be 
> used by shuffle implementations.
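The first proposed change reduces to one extra predicate in the scheduler's executor-lost handling. A Spark-free sketch of the decision (function and parameter names invented for illustration):

```python
def map_outputs_to_rerun(map_tasks_on_lost_executor, shuffle_data_is_external):
    """Decide which map tasks must be rescheduled after an executor is lost.

    If shuffle data lives in external storage that outlives executors,
    nothing needs to be recomputed; otherwise every map task that wrote
    its output to the lost executor's local disk must rerun.
    """
    if shuffle_data_is_external:
        return []
    return list(map_tasks_on_lost_executor)

assert map_outputs_to_rerun([1, 2, 3], shuffle_data_is_external=True) == []
assert map_outputs_to_rerun([1, 2, 3], shuffle_data_is_external=False) == [1, 2, 3]
```

In Spark the "is external" bit could come from a property or method on the ShuffleManager, as the description suggests.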






[jira] [Comment Edited] (SPARK-25299) Use remote storage for persisting shuffle data

2018-12-26 Thread Peiyu Zhuang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729292#comment-16729292
 ] 

Peiyu Zhuang edited comment on SPARK-25299 at 12/27/18 3:31 AM:


We are currently working on a solution that is similar to option 3 mentioned in 
this [architecture discussion 
document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40].
 The idea is to refactor the current shuffle manager and extract a common 
storage interface, so that users can supply different storage implementations 
for shuffle data and spill data.
 We have some preliminary test results. Since the shuffle manager is critical to 
Spark, we want to make sure it functions just like the original shuffle manager. 
It will be open-sourced in the near future.


was (Author: jealous):
We are currently working on a solution that is similar to option 3 mentioned in 
this [architecture discussion 
document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40].
  The idea is to refactor the current shuffle manager and extract a common 
storage interface.  User could supply different storage implementations for 
shuffle data and spill data.
We have got some preliminary test result.  Since shuffle manager is critical to 
Spark, we want to make sure it functions just as the original shuffle manager.  
And it will be open-sourced in the near future.


> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>






[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data

2018-12-26 Thread Carson Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729294#comment-16729294
 ] 

Carson Wang commented on SPARK-25299:
-

I am on vacation and will be back on January 2, 2019. Please expect a delayed 
response.


> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>






[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data

2018-12-26 Thread Peiyu Zhuang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729292#comment-16729292
 ] 

Peiyu Zhuang commented on SPARK-25299:
--

We are currently working on a solution that is similar to option 3 mentioned in 
this [architecture discussion 
document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40].
The idea is to refactor the current shuffle manager and extract a common 
storage interface, so that users can supply different storage implementations 
for shuffle data and spill data.
We have some preliminary test results. Since the shuffle manager is critical to 
Spark, we want to make sure it functions just like the original shuffle manager. 
It will be open-sourced in the near future.


> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>






[jira] [Resolved] (SPARK-26424) Use java.time API in timestamp/date functions

2018-12-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26424.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23358
[https://github.com/apache/spark/pull/23358]

> Use java.time API in timestamp/date functions 
> --
>
> Key: SPARK-26424
> URL: https://issues.apache.org/jira/browse/SPARK-26424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently, date/time expressions like DateFormatClass, UnixTimestamp, and 
> FromUnixTime use SimpleDateFormat to parse/format dates and timestamps, but 
> SimpleDateFormat cannot parse timestamps with microsecond precision. This 
> ticket aims to switch these expressions to TimestampFormatter, which is able 
> to parse timestamps with microsecond precision.
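To make the precision gap concrete, here is a minimal standalone Java sketch (illustrative only, not Spark code): `SimpleDateFormat` interprets the fraction field as milliseconds, so a six-digit fraction silently corrupts the timestamp, while `java.time`'s `DateTimeFormatter` parses it exactly.

```java
import java.text.SimpleDateFormat;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class MicrosecondParsing {
    public static void main(String[] args) throws Exception {
        String ts = "2018-12-26 10:11:12.123456";

        // java.time: 'S' is fraction-of-second, so SSSSSS means microseconds.
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS");
        LocalDateTime parsed = LocalDateTime.parse(ts, fmt);
        System.out.println(parsed.getNano());  // 123456000 ns: full microsecond precision

        // SimpleDateFormat: 'S' means *milliseconds*, so 123456 is read as
        // 123456 ms (~2 minutes) and rolled into the minute/second fields.
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSSSSS");
        System.out.println(sdf.parse(ts));     // time silently shifted, not 10:11:12.123456
    }
}
```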



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26424) Use java.time API in timestamp/date functions

2018-12-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26424:
---

Assignee: Maxim Gekk

> Use java.time API in timestamp/date functions 
> --
>
> Key: SPARK-26424
> URL: https://issues.apache.org/jira/browse/SPARK-26424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Currently, date/time expressions like DateFormatClass, UnixTimestamp, and 
> FromUnixTime use SimpleDateFormat to parse/format dates and timestamps, but 
> SimpleDateFormat cannot parse timestamps with microsecond precision. This 
> ticket aims to switch these expressions to TimestampFormatter, which is able 
> to parse timestamps with microsecond precision.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26452) Suppressing exception in finally: Java heap space java.lang.OutOfMemoryError: Java heap space

2018-12-26 Thread tommy duan (JIRA)
tommy duan created SPARK-26452:
--

 Summary: Suppressing exception in finally: Java heap space 
java.lang.OutOfMemoryError: Java heap space
 Key: SPARK-26452
 URL: https://issues.apache.org/jira/browse/SPARK-26452
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.2.0
Reporter: tommy duan


[Stage 1852:===>(896 + 3) / 900]
[Stage 1852:===>(897 + 3) / 900]
[Stage 1852:===>(899 + 1) / 900]
[Stage 1853:> (0 + 0) / 900]
18/12/27 06:03:45 WARN util.Utils: Suppressing exception in finally: Java heap space
java.lang.OutOfMemoryError: Java heap space
 at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
 at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
 at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271)
 at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271)
 at org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
 at org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
 at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
 at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158)
 at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
 at java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822)
 at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719)
 at java.io.ObjectOutputStream.close(ObjectOutputStream.java:740)
 at org.apache.spark.serializer.JavaSerializationStream.close(JavaSerializer.scala:57)
 at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$1.apply$mcV$sp(TorrentBroadcast.scala:278)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1346)
 at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:277)
 at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:126)
 at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
 at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
 at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:56)
 at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1488)
 at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1006)
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:776)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:775)
 at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
 at org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:775)
 at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1259)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1711)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space
 at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
 at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
 at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271)
 at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271)
 at org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
 at org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
 at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
 at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158)
 at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
 at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
 at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
 at 

[jira] [Updated] (SPARK-26439) Introduce WorkerOffer reservation mechanism for Barrier TaskSet

2018-12-26 Thread wuyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-26439:
-
Description: 
Currently, Barrier TaskSet has a hard requirement that its tasks can only be 
launched in a single resourceOffers round with enough slots (or sufficient 
resources), but this cannot be guaranteed even with enough slots, due to task 
locality delay scheduling. So, it is very likely that a Barrier TaskSet 
obtains a chunk of sufficient resources after all that trouble, but gives it 
up easily just because one of its pending tasks cannot be scheduled. 
Furthermore, this causes severe resource competition between TaskSets and 
jobs, and introduces unclear semantics for DynamicAllocation.

This JIRA tries to introduce a WorkerOffer reservation mechanism for Barrier 
TaskSet, which allows a Barrier TaskSet to reserve WorkerOffers in each 
resourceOffers round and launch all tasks at the same time once it has 
accumulated sufficient resources. In this way, we relax the resource 
requirement for the Barrier TaskSet. To avoid the deadlock that may be 
introduced by several Barrier TaskSets holding reserved WorkerOffers for a 
long time, we'll ask Barrier TaskSets to forcibly release part of their 
reserved WorkerOffers on demand. So, it is highly likely that each Barrier 
TaskSet will be launched in the end.

To integrate with DynamicAllocation

The most effective way I can imagine is adding new events, e.g. 
ExecutorReservedEvent and ExecutorReleasedEvent, which behave like a busy 
executor with running tasks or an idle executor without running tasks. Thus, 
ExecutionAllocationManager would not let an executor go if it knows there are 
reserved resources on that executor.
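As a rough illustration of the reservation idea (hypothetical names, not the actual scheduler API), a Barrier TaskSet could accumulate offers across resourceOffers rounds and launch all tasks only once it holds enough slots, with a forced-release escape hatch against deadlock:

```java
import java.util.*;

public class BarrierReservation {
    final int required;                         // slots needed to launch the whole set
    final List<String> reserved = new ArrayList<>();

    BarrierReservation(int required) { this.required = required; }

    // Called each resourceOffers round; returns offers to launch on, or empty
    // if the task set is still accumulating reservations.
    List<String> offer(List<String> offers) {
        reserved.addAll(offers);
        if (reserved.size() < required) return Collections.emptyList();
        List<String> launch = new ArrayList<>(reserved);
        reserved.clear();                       // all tasks launch together
        return launch;
    }

    // Forced release of part of the reservation, so several barrier task sets
    // cannot deadlock by each holding offers indefinitely.
    List<String> release(int n) {
        List<String> out = new ArrayList<>(reserved.subList(0, Math.min(n, reserved.size())));
        reserved.removeAll(out);
        return out;
    }

    public static void main(String[] args) {
        BarrierReservation ts = new BarrierReservation(3);
        System.out.println(ts.offer(Arrays.asList("w1", "w2")).size()); // 0: keeps accumulating
        System.out.println(ts.offer(Arrays.asList("w3")).size());       // 3: launches together
    }
}
```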

  was:
Currently, Barrier TaskSet has a hard requirement that its tasks can only be 
launched in a single resourceOffers round with enough slots (or sufficient 
resources), but this cannot be guaranteed even with enough slots, due to task 
locality delay scheduling. So, it is very likely that a Barrier TaskSet 
obtains a chunk of sufficient resources after all that trouble, but gives it 
up easily just because one of its pending tasks cannot be scheduled. 
Furthermore, this causes severe resource competition between TaskSets and 
jobs, and introduces unclear semantics for DynamicAllocation.

This JIRA tries to introduce a WorkOffer reservation mechanism for Barrier 
TaskSet, which allows a Barrier TaskSet to reserve WorkOffers in each 
resourceOffers round and launch all tasks at the same time once it has 
accumulated sufficient resources. In this way, we relax the resource 
requirement for the Barrier TaskSet. To avoid the deadlock that may be 
introduced by several Barrier TaskSets holding reserved WorkOffers for a 
long time, we'll ask Barrier TaskSets to forcibly release part of their 
reserved WorkOffers on demand. So, it is highly likely that each Barrier 
TaskSet will be launched in the end.

To integrate with DynamicAllocation

The most effective way I can imagine is adding new events, e.g. 
ExecutorReservedEvent and ExecutorReleasedEvent, which behave like a busy 
executor with running tasks or an idle executor without running tasks. Thus, 
ExecutionAllocationManager would not let an executor go if it knows there are 
reserved resources on that executor.


> Introduce WorkerOffer reservation mechanism for Barrier TaskSet
> ---
>
> Key: SPARK-26439
> URL: https://issues.apache.org/jira/browse/SPARK-26439
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wuyi
>Priority: Major
>  Labels: performance
> Fix For: 2.4.0
>
>
> Currently, Barrier TaskSet has a hard requirement that its tasks can only be 
> launched in a single resourceOffers round with enough slots (or sufficient 
> resources), but this cannot be guaranteed even with enough slots, due to task 
> locality delay scheduling. So, it is very likely that a Barrier TaskSet 
> obtains a chunk of sufficient resources after all that trouble, but gives it 
> up easily just because one of its pending tasks cannot be scheduled. 
> Furthermore, this causes severe resource competition between TaskSets and 
> jobs, and introduces unclear semantics for DynamicAllocation.
> This JIRA tries to introduce a WorkerOffer reservation mechanism for Barrier 
> TaskSet, which allows a Barrier TaskSet to reserve WorkerOffers in each 
> resourceOffers round and launch all tasks at the same time once it has 
> accumulated sufficient resources. In this way, we relax the resource 
> requirement for the Barrier TaskSet. To avoid the deadlock that may be 
> introduced by several Barrier TaskSets holding reserved WorkerOffers for a 
> long time, we'll ask Barrier TaskSets to force releasing 

[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-12-26 Thread Jacky Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729261#comment-16729261
 ] 

Jacky Li commented on SPARK-24630:
--

Actually I encountered this scenario earlier, so we have implemented some 
commands for using Spark Streaming (or Structured Streaming) on Apache 
CarbonData; you can refer to the StreamSQL section in this 
[doc|http://carbondata.apache.org/streaming-guide.html] for more detail.

It would be good if the Spark community can use similar or the same syntax, if 
possible; then future versions of CarbonData can migrate to Spark's syntax.


> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP V2.pdf
>
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a streaming SQL interface, with which 
> users with little knowledge about streaming can easily develop a stream 
> processing model. In Spark, we can also support a SQL API based on 
> Structured Streaming.
> To support SQL Streaming, there are two key points: 
> 1. The parser should be able to parse streaming-type SQL. 
> 2. The analyzer should be able to map metadata information to the 
> corresponding Relation. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26439) Introduce WorkerOffer reservation mechanism for Barrier TaskSet

2018-12-26 Thread wuyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-26439:
-
Summary: Introduce WorkerOffer reservation mechanism for Barrier TaskSet  
(was: Introduce WorkOffer reservation mechanism for Barrier TaskSet)

> Introduce WorkerOffer reservation mechanism for Barrier TaskSet
> ---
>
> Key: SPARK-26439
> URL: https://issues.apache.org/jira/browse/SPARK-26439
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wuyi
>Priority: Major
>  Labels: performance
> Fix For: 2.4.0
>
>
> Currently, Barrier TaskSet has a hard requirement that its tasks can only be 
> launched in a single resourceOffers round with enough slots (or sufficient 
> resources), but this cannot be guaranteed even with enough slots, due to task 
> locality delay scheduling. So, it is very likely that a Barrier TaskSet 
> obtains a chunk of sufficient resources after all that trouble, but gives it 
> up easily just because one of its pending tasks cannot be scheduled. 
> Furthermore, this causes severe resource competition between TaskSets and 
> jobs, and introduces unclear semantics for DynamicAllocation.
> This JIRA tries to introduce a WorkOffer reservation mechanism for Barrier 
> TaskSet, which allows a Barrier TaskSet to reserve WorkOffers in each 
> resourceOffers round and launch all tasks at the same time once it has 
> accumulated sufficient resources. In this way, we relax the resource 
> requirement for the Barrier TaskSet. To avoid the deadlock that may be 
> introduced by several Barrier TaskSets holding reserved WorkOffers for a 
> long time, we'll ask Barrier TaskSets to forcibly release part of their 
> reserved WorkOffers on demand. So, it is highly likely that each Barrier 
> TaskSet will be launched in the end.
> To integrate with DynamicAllocation
> The most effective way I can imagine is adding new events, e.g. 
> ExecutorReservedEvent and ExecutorReleasedEvent, which behave like a busy 
> executor with running tasks or an idle executor without running tasks. Thus, 
> ExecutionAllocationManager would not let an executor go if it knows there 
> are reserved resources on that executor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26451) Change lead/lag argument name from count to offset

2018-12-26 Thread Deepyaman Datta (JIRA)
Deepyaman Datta created SPARK-26451:
---

 Summary: Change lead/lag argument name from count to offset
 Key: SPARK-26451
 URL: https://issues.apache.org/jira/browse/SPARK-26451
 Project: Spark
  Issue Type: Documentation
  Components: PySpark, SQL
Affects Versions: 2.4.0
Reporter: Deepyaman Datta






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26378) Queries of wide CSV/JSON data slowed after SPARK-26151

2018-12-26 Thread Bruce Robbins (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-26378:
--
Description: 
A recent change significantly slowed queries of wide CSV tables. For example, 
queries against a 6000-column table slowed by 45-48% when queried with a 
single executor.
  
 The [PR for 
SPARK-26151|https://github.com/apache/spark/commit/11e5f1bcd49eec8ab4225d6e68a051b5c6a21cb2]
 changed FailureSafeParser#toResultRow such that the returned function 
recreates every row, even when the associated input record has no parsing 
issues and the user specified no corrupt record field in the schema. This 
extra processing is responsible for the slowdown.

The change to FailureSafeParser also impacted queries of wide JSON tables.

 I propose that a row be recreated only if there is a parsing error or 
columns need to be shifted due to the existence of a corrupt column field in 
the user-supplied schema. Otherwise, the row should be used as-is.
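A small self-contained sketch of the proposed fast path (hypothetical shape, not the actual FailureSafeParser code): choose the identity function once, up front, when no corrupt-record column exists, instead of rebuilding every row.

```java
import java.util.function.UnaryOperator;

public class ToResultRow {
    // Returns the row-conversion function once per parser, not per row.
    static UnaryOperator<Object[]> toResultRow(boolean hasCorruptColumn) {
        if (!hasCorruptColumn) {
            return UnaryOperator.identity();    // fast path: reuse the row as-is
        }
        return row -> {                         // slow path: shift columns to make
            Object[] out = new Object[row.length + 1];
            System.arraycopy(row, 0, out, 0, row.length);
            out[row.length] = null;             // room for the corrupt-record field
            return out;
        };
    }

    public static void main(String[] args) {
        Object[] row = {1, "a"};
        System.out.println(toResultRow(false).apply(row) == row); // true: no copy made
        System.out.println(toResultRow(true).apply(row).length);  // 3: widened row
    }
}
```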

  was:
A recent change significantly slowed the queries of wide CSV tables. For 
example, queries against a 6000 column table slowed by 45-48% when queried with 
a single executor.
  
 The [PR for 
SPARK-26151|https://github.com/apache/spark/commit/11e5f1bcd49eec8ab4225d6e68a051b5c6a21cb2]
 changed FailureSafeParser#toResultRow such that the returned function 
recreates every row, even when the associated input record has no parsing 
issues and the user specified no corrupt record field in his/her schema. This 
extra processing is responsible for the slowdown.
  
 I propose that a row should be recreated only if there is a parsing error or 
columns need to be shifted due to the existence of a corrupt column field in 
the user-supplied schema. Otherwise, the row should be used as-is.


> Queries of wide CSV/JSON data slowed after SPARK-26151
> --
>
> Key: SPARK-26378
> URL: https://issues.apache.org/jira/browse/SPARK-26378
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bruce Robbins
>Priority: Major
>
> A recent change significantly slowed queries of wide CSV tables. For 
> example, queries against a 6000-column table slowed by 45-48% when queried 
> with a single executor.
>   
>  The [PR for 
> SPARK-26151|https://github.com/apache/spark/commit/11e5f1bcd49eec8ab4225d6e68a051b5c6a21cb2]
>  changed FailureSafeParser#toResultRow such that the returned function 
> recreates every row, even when the associated input record has no parsing 
> issues and the user specified no corrupt record field in the schema. This 
> extra processing is responsible for the slowdown.
> The change to FailureSafeParser also impacted queries of wide JSON tables.
>  I propose that a row be recreated only if there is a parsing error or 
> columns need to be shifted due to the existence of a corrupt column field in 
> the user-supplied schema. Otherwise, the row should be used as-is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26378) Queries of wide CSV/JSON data slowed after SPARK-26151

2018-12-26 Thread Bruce Robbins (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-26378:
--
Summary: Queries of wide CSV/JSON data slowed after SPARK-26151  (was: 
Queries of wide CSV data slowed after SPARK-26151)

> Queries of wide CSV/JSON data slowed after SPARK-26151
> --
>
> Key: SPARK-26378
> URL: https://issues.apache.org/jira/browse/SPARK-26378
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bruce Robbins
>Priority: Major
>
> A recent change significantly slowed the queries of wide CSV tables. For 
> example, queries against a 6000 column table slowed by 45-48% when queried 
> with a single executor.
>   
>  The [PR for 
> SPARK-26151|https://github.com/apache/spark/commit/11e5f1bcd49eec8ab4225d6e68a051b5c6a21cb2]
>  changed FailureSafeParser#toResultRow such that the returned function 
> recreates every row, even when the associated input record has no parsing 
> issues and the user specified no corrupt record field in his/her schema. This 
> extra processing is responsible for the slowdown.
>   
>  I propose that a row should be recreated only if there is a parsing error or 
> columns need to be shifted due to the existence of a corrupt column field in 
> the user-supplied schema. Otherwise, the row should be used as-is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26451) Change lead/lag argument name from count to offset

2018-12-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26451:


Assignee: (was: Apache Spark)

> Change lead/lag argument name from count to offset
> --
>
> Key: SPARK-26451
> URL: https://issues.apache.org/jira/browse/SPARK-26451
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Deepyaman Datta
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26451) Change lead/lag argument name from count to offset

2018-12-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26451:


Assignee: Apache Spark

> Change lead/lag argument name from count to offset
> --
>
> Key: SPARK-26451
> URL: https://issues.apache.org/jira/browse/SPARK-26451
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Deepyaman Datta
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26450) Map of schema is built too frequently in some wide queries

2018-12-26 Thread Bruce Robbins (JIRA)
Bruce Robbins created SPARK-26450:
-

 Summary: Map of schema is built too frequently in some wide queries
 Key: SPARK-26450
 URL: https://issues.apache.org/jira/browse/SPARK-26450
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Bruce Robbins


When executing queries with wide projections and wide schemas, Spark rebuilds 
an attribute map for the same schema many times.

For example:
{noformat}
select * from orctbl where id1 = 1
{noformat}
Assume {{orctbl}} has 6000 columns and 34 files. In that case, the above query 
creates an AttributeSeq object 270,000 times[1]. Each AttributeSeq 
instantiation builds a map of the entire list of 6000 attributes (but not until 
lazy val exprIdToOrdinal is referenced).

Whenever OrcFileFormat reads a new file, it generates a new unsafe projection. 
That results in this 
[function|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala#L319]
 getting called:
{code:java}
protected def bind(in: Seq[Expression], inputSchema: Seq[Attribute]): Seq[Expression] =
  in.map(BindReferences.bindReference(_, inputSchema))
{code}
For each column in the projection, this line calls bindReference. Each call 
passes inputSchema, a Sequence of Attributes, to a parameter position expecting 
an AttributeSeq. The compiler implicitly calls the constructor for 
AttributeSeq, which (lazily) builds a map for every attribute in the schema. 
Therefore, this function builds a map of the entire schema once for each column 
in the projection, and it does this for each input file. For the above example 
query, this accounts for 204K instantiations of AttributeSeq.

Readers for CSV and JSON tables do something similar.

In addition, ProjectExec also creates an unsafe projection for each task. As a 
result, this 
[line|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L91]
 gets called, which has the same issue:
{code:java}
  def toBoundExprs(exprs: Seq[Expression], inputSchema: Seq[Attribute]): Seq[Expression] = {
    exprs.map(BindReferences.bindReference(_, inputSchema))
  }
{code}
The above affects all wide queries that have a projection node, regardless of 
the file reader. For the example query, ProjectExec accounts for the additional 
66K instantiations of the AttributeSeq.

Spark can save time by pre-building the AttributeSeq right before the map 
operations in {{bind}} and {{toBoundExprs}}. The time saved depends on size of 
schema, size of projection, number of input files (for Orc), number of file 
splits (for CSV, and JSON tables), and number of tasks.

For a 6000 column CSV table with 500K records and 34 input files, the time 
savings is only 6%[1] because Spark doesn't create as many unsafe projections 
as compared to Orc tables.

On the other hand, for a 6000 column Orc table with 500K records and 34 input 
files, the time savings is about 16%[1].

[1] based on queries run in local mode with 8 executor threads on my laptop.
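The hoisting being proposed can be illustrated with a small standalone Java sketch (the `AttrIndex` class is a hypothetical stand-in for AttributeSeq, not Spark code): build the name-to-ordinal map once per projection rather than once per column.

```java
import java.util.*;

public class PrebuildIndex {
    // Stand-in for AttributeSeq: wraps a schema and builds a name -> ordinal map.
    static final class AttrIndex {
        final Map<String, Integer> ordinal = new HashMap<>();
        AttrIndex(List<String> schema) {
            for (int i = 0; i < schema.size(); i++) ordinal.put(schema.get(i), i);
        }
    }

    // Slow shape: the index is rebuilt for every column, as happens when the
    // implicit Seq[Attribute] -> AttributeSeq conversion fires per bindReference call.
    static int[] bindPerColumn(List<String> projection, List<String> schema) {
        return projection.stream()
                .mapToInt(c -> new AttrIndex(schema).ordinal.get(c))
                .toArray();
    }

    // Fast shape: build the index once up front, then reuse it across the projection.
    static int[] bindOnce(List<String> projection, List<String> schema) {
        AttrIndex idx = new AttrIndex(schema);
        return projection.stream().mapToInt(c -> idx.ordinal.get(c)).toArray();
    }

    public static void main(String[] args) {
        List<String> schema = Arrays.asList("id1", "name", "value");
        int[] a = bindPerColumn(Arrays.asList("value", "id1"), schema);
        int[] b = bindOnce(Arrays.asList("value", "id1"), schema);
        System.out.println(Arrays.toString(a) + " " + Arrays.toString(b)); // [2, 0] [2, 0]
    }
}
```

Both shapes return the same ordinals; the only difference is how many times the map over the full schema is constructed.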



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26449) Dataframe.transform

2018-12-26 Thread Hanan Shteingart (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hanan Shteingart updated SPARK-26449:
-
Shepherd: Maciej Szymkiewicz  (was: Lazy Developer)

> Dataframe.transform
> ---
>
> Key: SPARK-26449
> URL: https://issues.apache.org/jira/browse/SPARK-26449
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Hanan Shteingart
>Priority: Minor
>  Labels: patch
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I would like to chain custom transformations as is suggested in this [blog 
> post|https://medium.com/@mrpowers/chaining-custom-pyspark-transformations-4f38a8c7ae55]
> This will allow writing something like the following:
> {code:java}
> from pyspark.sql.functions import lit
> 
> # assumes an active SparkSession bound to `spark`
> def with_greeting(df):
>     return df.withColumn("greeting", lit("hi"))
> 
> def with_something(df, something):
>     return df.withColumn("something", lit(something))
> 
> data = [("jose", 1), ("li", 2), ("liz", 3)]
> source_df = spark.createDataFrame(data, ["name", "age"])
> 
> actual_df = (source_df
>     .transform(with_greeting)
>     .transform(lambda df: with_something(df, "crazy")))
> actual_df.show()
> +----+---+--------+---------+
> |name|age|greeting|something|
> +----+---+--------+---------+
> |jose|  1|      hi|    crazy|
> |  li|  2|      hi|    crazy|
> | liz|  3|      hi|    crazy|
> +----+---+--------+---------+
> {code}
> The only thing needed to accomplish this is the following simple method for 
> DataFrame:
> {code:java}
> from pyspark.sql.dataframe import DataFrame
> 
> def transform(self, f):
>     return f(self)
> 
> DataFrame.transform = transform
> {code}
> I volunteer to do the pull request if approved (at least the Python part).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26449) Dataframe.transform

2018-12-26 Thread Hanan Shteingart (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hanan Shteingart updated SPARK-26449:
-
Shepherd: Lazy Developer  (was: yc)

> Dataframe.transform
> ---
>
> Key: SPARK-26449
> URL: https://issues.apache.org/jira/browse/SPARK-26449
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Hanan Shteingart
>Priority: Minor
>  Labels: patch
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I would like to chain custom transformations as is suggested in this [blog 
> post|https://medium.com/@mrpowers/chaining-custom-pyspark-transformations-4f38a8c7ae55]
> This will allow writing something like the following:
> {code:java}
> from pyspark.sql.functions import lit
> 
> # assumes an active SparkSession bound to `spark`
> def with_greeting(df):
>     return df.withColumn("greeting", lit("hi"))
> 
> def with_something(df, something):
>     return df.withColumn("something", lit(something))
> 
> data = [("jose", 1), ("li", 2), ("liz", 3)]
> source_df = spark.createDataFrame(data, ["name", "age"])
> 
> actual_df = (source_df
>     .transform(with_greeting)
>     .transform(lambda df: with_something(df, "crazy")))
> actual_df.show()
> +----+---+--------+---------+
> |name|age|greeting|something|
> +----+---+--------+---------+
> |jose|  1|      hi|    crazy|
> |  li|  2|      hi|    crazy|
> | liz|  3|      hi|    crazy|
> +----+---+--------+---------+
> {code}
> The only thing needed to accomplish this is the following simple method for 
> DataFrame:
> {code:java}
> from pyspark.sql.dataframe import DataFrame
> 
> def transform(self, f):
>     return f(self)
> 
> DataFrame.transform = transform
> {code}
> I volunteer to do the pull request if approved (at least the Python part).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26449) Dataframe.transform

2018-12-26 Thread Hanan Shteingart (JIRA)
Hanan Shteingart created SPARK-26449:


 Summary: Dataframe.transform
 Key: SPARK-26449
 URL: https://issues.apache.org/jira/browse/SPARK-26449
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.0
Reporter: Hanan Shteingart


I would like to chain custom transformations as is suggested in this [blog 
post|https://medium.com/@mrpowers/chaining-custom-pyspark-transformations-4f38a8c7ae55]

This will allow to write something like the following:

 

 
{code:java}
 
def with_greeting(df):
return df.withColumn("greeting", lit("hi"))

def with_something(df, something):
return df.withColumn("something", lit(something))

data = [("jose", 1), ("li", 2), ("liz", 3)]
source_df = spark.createDataFrame(data, ["name", "age"])

actual_df = (source_df
.transform(with_greeting)
.transform(lambda df: with_something(df, "crazy")))
print(actual_df.show())
+----+---+--------+---------+
|name|age|greeting|something|
+----+---+--------+---------+
|jose|  1|      hi|    crazy|
|  li|  2|      hi|    crazy|
| liz|  3|      hi|    crazy|
+----+---+--------+---------+

{code}
The only thing needed to accomplish this is the following simple method for 
DataFrame:
{code:python}
from pyspark.sql.dataframe import DataFrame

def transform(self, f):
    return f(self)

DataFrame.transform = transform
{code}
I volunteer to do the pull request if approved (at least the Python part).
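Since the proposed {{transform}} is nothing more than function application, the chaining pattern it enables can be sketched without a Spark installation. In the sketch below, {{FakeFrame}} is a hypothetical stand-in for {{pyspark.sql.DataFrame}}; only its one-line {{transform}} mirrors the proposal.

```python
# Minimal sketch of the proposed transform-chaining pattern.
# FakeFrame is a hypothetical stand-in for pyspark.sql.DataFrame,
# used only so the example runs without Spark.

class FakeFrame:
    def __init__(self, columns):
        self.columns = dict(columns)

    def withColumn(self, name, value):
        # Return a new frame with one extra column, like DataFrame.withColumn.
        return FakeFrame({**self.columns, name: value})

    def transform(self, f):
        # The proposed method: apply f to self, enabling method chaining.
        return f(self)

def with_greeting(df):
    return df.withColumn("greeting", "hi")

def with_something(df, something):
    return df.withColumn("something", something)

df = (FakeFrame({"name": "jose"})
      .transform(with_greeting)
      .transform(lambda df: with_something(df, "crazy")))
print(df.columns)  # {'name': 'jose', 'greeting': 'hi', 'something': 'crazy'}
```

Compared with nested calls like with_something(with_greeting(df), "crazy"), the chained form reads top to bottom in application order.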

 






[jira] [Commented] (SPARK-23959) UnresolvedException with DataSet created from Seq.empty since Spark 2.3.0

2018-12-26 Thread Sam hendley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729148#comment-16729148
 ] 

Sam hendley commented on SPARK-23959:
-------------------------------------

I am working on upgrading a medium-sized project from Spark 2.0.2 to Spark 
2.3.0 and ran into this bug in a few of my unit tests. The reproduction case 
included in this ticket fails in my environment. Adding a `.cache()` to `zs` 
seems to fix the issue, as expected. The code in question does work in my 
production environment, where all of the input datasets are populated and 
loaded from Parquet files.

In my tests I was using createDataset() calls to store intermediate results. If 
I check for empty input data and call .cache() on the resulting frame, my unit 
tests pass.

Do you have any guesses as to what might be different in my environment that 
would make this fail? I tried changing Hadoop versions (2.6.5 and 2.8.3) and 
Spark versions (2.3.0 and 2.3.2) but was still able to reproduce the issue. Is 
there anything I can do to help you debug it?

> UnresolvedException with DataSet created from Seq.empty since Spark 2.3.0
> --------------------------------------------------------------------------
>
> Key: SPARK-23959
> URL: https://issues.apache.org/jira/browse/SPARK-23959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Sam De Backer
>Priority: Major
>
> The following snippet works fine in Spark 2.2.1 but gives a rather cryptic 
> runtime exception in Spark 2.3.0:
> {code:scala}
> import sparkSession.implicits._
> import org.apache.spark.sql.functions._
> case class X(xid: Long, yid: Int)
> case class Y(yid: Int, zid: Long)
> case class Z(zid: Long, b: Boolean)
> val xs = Seq(X(1L, 10)).toDS()
> val ys = Seq(Y(10, 100L)).toDS()
> val zs = Seq.empty[Z].toDS()
> val j = xs
>   .join(ys, "yid")
>   .join(zs, Seq("zid"), "left")
>   .withColumn("BAM", when('b, "B").otherwise("NB"))
> j.show(){code}
> In Spark 2.2.1 it prints to the console
> {noformat}
> +---+---+---+----+---+
> |zid|yid|xid|   b|BAM|
> +---+---+---+----+---+
> |100| 10|  1|null| NB|
> +---+---+---+----+---+{noformat}
> In Spark 2.3.0 it results in:
> {noformat}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> dataType on unresolved object, tree: 'BAM
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:296)
> at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:435)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:157)
> ...{noformat}
> The culprit really seems to be the Dataset being created from an empty 
> Seq[Z]. When you change that to something that also results in an empty 
> Dataset[Z], it works as in Spark 2.2.1, e.g.
> {code:scala}
> val zs = Seq(Z(10L, true)).toDS().filter('zid < Long.MinValue){code}






[jira] [Assigned] (SPARK-26448) retain the difference between 0.0 and -0.0

2018-12-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26448:


Assignee: Wenchen Fan  (was: Apache Spark)

> retain the difference between 0.0 and -0.0
> ------------------------------------------
>
> Key: SPARK-26448
> URL: https://issues.apache.org/jira/browse/SPARK-26448
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Assigned] (SPARK-26448) retain the difference between 0.0 and -0.0

2018-12-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26448:


Assignee: Apache Spark  (was: Wenchen Fan)

> retain the difference between 0.0 and -0.0
> ------------------------------------------
>
> Key: SPARK-26448
> URL: https://issues.apache.org/jira/browse/SPARK-26448
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Created] (SPARK-26448) retain the difference between 0.0 and -0.0

2018-12-26 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-26448:
--------------------------------

 Summary: retain the difference between 0.0 and -0.0
 Key: SPARK-26448
 URL: https://issues.apache.org/jira/browse/SPARK-26448
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan
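The ticket body is empty; for background, IEEE-754 floating point distinguishes +0.0 from -0.0 even though they compare equal, which is presumably the distinction to retain. A small Python illustration of the general float semantics (not Spark code):

```python
import math

# +0.0 and -0.0 compare equal under IEEE-754 arithmetic...
print(0.0 == -0.0)               # True: comparison cannot tell them apart
# ...but the sign bit is still observable:
print(math.copysign(1.0, 0.0))   # 1.0
print(math.copysign(1.0, -0.0))  # -1.0
print(str(-0.0))                 # -0.0: formatting preserves the sign too
```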









[jira] [Assigned] (SPARK-26447) Allow OrcColumnarBatchReader to return less partition columns

2018-12-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26447:


Assignee: (was: Apache Spark)

> Allow OrcColumnarBatchReader to return less partition columns
> -------------------------------------------------------------
>
> Key: SPARK-26447
> URL: https://issues.apache.org/jira/browse/SPARK-26447
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Currently OrcColumnarBatchReader returns all the partition column values in 
> the batch read.
> In data source V2, we can improve it by returning the required partition 
> column values only.






[jira] [Assigned] (SPARK-26447) Allow OrcColumnarBatchReader to return less partition columns

2018-12-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26447:


Assignee: Apache Spark

> Allow OrcColumnarBatchReader to return less partition columns
> -------------------------------------------------------------
>
> Key: SPARK-26447
> URL: https://issues.apache.org/jira/browse/SPARK-26447
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Currently OrcColumnarBatchReader returns all the partition column values in 
> the batch read.
> In data source V2, we can improve it by returning the required partition 
> column values only.






[jira] [Created] (SPARK-26447) Allow OrcColumnarBatchReader to return less partition columns

2018-12-26 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-26447:
-----------------------------------

 Summary: Allow OrcColumnarBatchReader to return less partition 
columns
 Key: SPARK-26447
 URL: https://issues.apache.org/jira/browse/SPARK-26447
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


Currently OrcColumnarBatchReader returns all the partition column values in the 
batch read.
In data source V2, we can improve it by returning the required partition column 
values only.
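Abstractly, the improvement is to materialize only the partition-column values that the query's required schema mentions, instead of all of them. A hypothetical, Spark-agnostic sketch of that pruning step (names are illustrative, not Spark's API):

```python
# Hypothetical sketch: prune a file's partition-column values down to the
# columns the query's required schema actually references.

def prune_partition_columns(partition_values, required_columns):
    """Keep only the partition columns present in the required schema."""
    return {name: value
            for name, value in partition_values.items()
            if name in required_columns}

partition_values = {"year": 2018, "month": 12, "day": 26}
required = {"month", "value"}  # the query reads one partition column + data
print(prune_partition_columns(partition_values, required))  # {'month': 12}
```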







[jira] [Assigned] (SPARK-26446) improve doc on ExecutorAllocationManager

2018-12-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26446:


Assignee: Apache Spark

> improve doc on ExecutorAllocationManager
> ----------------------------------------
>
> Key: SPARK-26446
> URL: https://issues.apache.org/jira/browse/SPARK-26446
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Qingxin Wu
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Assigned] (SPARK-26446) improve doc on ExecutorAllocationManager

2018-12-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26446:


Assignee: (was: Apache Spark)

> improve doc on ExecutorAllocationManager
> ----------------------------------------
>
> Key: SPARK-26446
> URL: https://issues.apache.org/jira/browse/SPARK-26446
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Qingxin Wu
>Priority: Minor
>







[jira] [Updated] (SPARK-26446) improve doc on ExecutorAllocationManager

2018-12-26 Thread Qingxin Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qingxin Wu updated SPARK-26446:
-------------------------------
Component/s: (was: Scheduler)
 Spark Core

> improve doc on ExecutorAllocationManager
> ----------------------------------------
>
> Key: SPARK-26446
> URL: https://issues.apache.org/jira/browse/SPARK-26446
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Qingxin Wu
>Priority: Minor
>







[jira] [Created] (SPARK-26446) improve doc on ExecutorAllocationManager

2018-12-26 Thread Qingxin Wu (JIRA)
Qingxin Wu created SPARK-26446:
-------------------------------

 Summary: improve doc on ExecutorAllocationManager
 Key: SPARK-26446
 URL: https://issues.apache.org/jira/browse/SPARK-26446
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.4.0
Reporter: Qingxin Wu









[jira] [Created] (SPARK-26445) Use ConfigEntry for hardcoded configs for driver/executor categories.

2018-12-26 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26445:
----------------------------------

 Summary: Use ConfigEntry for hardcoded configs for driver/executor 
categories.
 Key: SPARK-26445
 URL: https://issues.apache.org/jira/browse/SPARK-26445
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin


Make the hardcoded "spark.driver" and "spark.executor" configs use 
{{ConfigEntry}}.
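For context, the {{ConfigEntry}} mechanism gives each config key a single typed definition (key string, value type, default) instead of scattering hardcoded strings through the code. The sketch below is a rough Python analogue of that pattern; Spark's real implementation is Scala and its API differs.

```python
# Rough Python analogue of the ConfigEntry pattern: one typed, defaulted
# definition per key, replacing hardcoded "spark.driver.*" string lookups.
# Class and constant names here are illustrative, not Spark's actual API.

class ConfigEntry:
    def __init__(self, key, convert, default):
        self.key = key
        self._convert = convert  # e.g. int, float, str
        self.default = default

    def read_from(self, conf):
        # Look up the raw string and convert it, falling back to the default.
        raw = conf.get(self.key)
        return self.default if raw is None else self._convert(raw)

DRIVER_CORES = ConfigEntry("spark.driver.cores", int, 1)

print(DRIVER_CORES.read_from({"spark.driver.cores": "4"}))  # 4 (typed int)
print(DRIVER_CORES.read_from({}))                           # 1 (the default)
```

Centralizing the key, type, and default in one object means every call site reads the same definition rather than re-hardcoding the string and default.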






[jira] [Commented] (SPARK-26445) Use ConfigEntry for hardcoded configs for driver/executor categories.

2018-12-26 Thread Takuya Ueshin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728974#comment-16728974
 ] 

Takuya Ueshin commented on SPARK-26445:
---------------------------------------

I'm working on this.

> Use ConfigEntry for hardcoded configs for driver/executor categories.
> ---------------------------------------------------------------------
>
> Key: SPARK-26445
> URL: https://issues.apache.org/jira/browse/SPARK-26445
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Make the hardcoded "spark.driver" and "spark.executor" configs use 
> {{ConfigEntry}}.






[jira] [Assigned] (SPARK-26444) Stage color doesn't change with its status

2018-12-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26444:


Assignee: Apache Spark

> Stage color doesn't change with its status
> ------------------------------------------
>
> Key: SPARK-26444
> URL: https://issues.apache.org/jira/browse/SPARK-26444
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Assignee: Apache Spark
>Priority: Major
> Attachments: active.png, complete.png, failed.png
>
>
> On the job page, in the event timeline section, the stage color doesn't 
> change according to its status. See the attached screenshots.
>  






[jira] [Assigned] (SPARK-26444) Stage color doesn't change with its status

2018-12-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26444:


Assignee: (was: Apache Spark)

> Stage color doesn't change with its status
> ------------------------------------------
>
> Key: SPARK-26444
> URL: https://issues.apache.org/jira/browse/SPARK-26444
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Major
> Attachments: active.png, complete.png, failed.png
>
>
> On the job page, in the event timeline section, the stage color doesn't 
> change according to its status. See the attached screenshots.
>  






[jira] [Updated] (SPARK-26444) Stage color doesn't change with its status

2018-12-26 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-26444:
---------------------------------
Attachment: failed.png
complete.png
active.png

> Stage color doesn't change with its status
> ------------------------------------------
>
> Key: SPARK-26444
> URL: https://issues.apache.org/jira/browse/SPARK-26444
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Major
> Attachments: active.png, complete.png, failed.png
>
>
> On the job page, in the event timeline section, the stage color doesn't 
> change according to its status. See the attached screenshots.
>  






[jira] [Updated] (SPARK-26444) Stage color doesn't change with its status

2018-12-26 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-26444:
---------------------------------
Description: 
On the job page, in the event timeline section, the stage color doesn't change 
according to its status. See the attached screenshots.

  was:
On the job page, in the event timeline section, the stage color doesn't change 
according to its status. Below are some screenshots.

active:

!image-2018-12-26-16-14-38-958.png!

complete:

!image-2018-12-26-16-15-55-957.png!

failed:

!image-2018-12-26-16-16-11-697.png!

 

 


> Stage color doesn't change with its status
> ------------------------------------------
>
> Key: SPARK-26444
> URL: https://issues.apache.org/jira/browse/SPARK-26444
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Major
>
> On the job page, in the event timeline section, the stage color doesn't 
> change according to its status. See the attached screenshots.
>  






[jira] [Created] (SPARK-26444) Stage color doesn't change with its status

2018-12-26 Thread Chenxiao Mao (JIRA)
Chenxiao Mao created SPARK-26444:


 Summary: Stage color doesn't change with its status
 Key: SPARK-26444
 URL: https://issues.apache.org/jira/browse/SPARK-26444
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.4.0
Reporter: Chenxiao Mao


On the job page, in the event timeline section, the stage color doesn't change 
according to its status. Below are some screenshots.

active:

!image-2018-12-26-16-14-38-958.png!

complete:

!image-2018-12-26-16-15-55-957.png!

failed:

!image-2018-12-26-16-16-11-697.png!

 

 


