[jira] [Created] (SPARK-43597) Assign a name to the error class _LEGACY_ERROR_TEMP_0017

2023-05-19 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-43597:
---

 Summary: Assign a name to the error class _LEGACY_ERROR_TEMP_0017
 Key: SPARK-43597
 URL: https://issues.apache.org/jira/browse/SPARK-43597
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: BingKun Pan









[jira] [Updated] (SPARK-43596) Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to isnull

2023-05-19 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-43596:
--
Description: 
We can get a decorrelation error because of rewrites that run between 
DecorrelateInnerQuery and rewriteDomainJoins and modify the correlation join 
conditions. In particular, ConstantFolding can transform `innercol <=> null` into 
`isnull(innercol)`, which rewriteDomainJoins does not recognize; it then throws 
the error "Unable to rewrite domain join with conditions: 
ArrayBuffer(isnull(innercol#280))", because the isnull is not an equality and so 
cannot be used to rewrite the domain join.

We can fix this by recognizing `isnull(innercol)` as `innercol <=> null` in 
rewriteDomainJoins.

This area is also fragile in general and other rewrites that run between the 
two steps of decorrelation could potentially break their assumptions, so we may 
want to investigate longer-term follow ups for that.
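A minimal sketch of the proposed recognition (hypothetical helper, not the actual 
patch), assuming Catalyst's IsNull, EqualNullSafe and Literal expression classes:
{code:java}
import org.apache.spark.sql.catalyst.expressions._

// Hypothetical normalization: undo ConstantFolding's `x <=> null` -> `isnull(x)`
// rewrite so the condition is again a null-safe equality that the domain-join
// rewrite can use.
def normalizeDomainJoinCondition(cond: Expression): Expression = cond match {
  case IsNull(a: Attribute) => EqualNullSafe(a, Literal(null, a.dataType))
  case other                => other
} {code}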

  was:
We can get a decorrelation error because of rewrites that run in between 
DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation join 
conditions. In particular, ConstantFolding can transform `innercol <=> null` to 
`isnull(innercol)` and then rewriteDomainJoins does not recognize this and 
throws error Unable to rewrite domain join with conditions: 
ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so it 
isn't usable for rewriting the domain join.

Can fix by recognizing `isnull(x)` as `x <=> null` in rewriteDomainJoins.

This area is also fragile in general and other rewrites that run between the 
two steps of decorrelation could potentially break their assumptions, so we may 
want to investigate longer-term follow ups for that.


> Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to 
> isnull
> 
>
> Key: SPARK-43596
> URL: https://issues.apache.org/jira/browse/SPARK-43596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Priority: Major
>
> We can get a decorrelation error because of rewrites that run in between 
> DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation 
> join conditions. In particular, ConstantFolding can transform `innercol <=> 
> null` to `isnull(innercol)` and then rewriteDomainJoins does not recognize 
> this and throws error Unable to rewrite domain join with conditions: 
> ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so 
> it isn't usable for rewriting the domain join.
> Can fix by recognizing `isnull(innercol)` as `innercol <=> null` in 
> rewriteDomainJoins.
> This area is also fragile in general and other rewrites that run between the 
> two steps of decorrelation could potentially break their assumptions, so we 
> may want to investigate longer-term follow ups for that.






[jira] [Updated] (SPARK-43596) Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to isnull

2023-05-19 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-43596:
--
Description: 
We can get a decorrelation error because of rewrites that run in between 
DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation join 
conditions. In particular, ConstantFolding can transform `innercol <=> null` to 
`isnull(innercol)` and then rewriteDomainJoins does not recognize this and 
throws error Unable to rewrite domain join with conditions: 
ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so it 
isn't usable for rewriting the domain join.

Can fix by recognizing `isnull(x)` as `x <=> null` in rewriteDomainJoins.

This area is also fragile in general and other rewrites that run between the 
two steps of decorrelation could potentially break their assumptions, so we may 
want to investigate longer-term follow ups for that.

  was:
We can get a decorrelation error because of rewrites that run in between 
DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation join 
conditions. In particular, ConstantFolding can transform `innercol <=> null` to 
`isnull(innercol)` and then rewriteDomainJoins does not recognize this and 
throws error Unable to rewrite domain join with conditions: 
ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so it 
isn't usable for rewriting the domain join.

Can fix by recognizing `isnull(x)` as `x <=> null` in 

This area is also fragile in general and other rewrites that run between the 
two steps of decorrelation could potentially break their assumptions, so we may 
want to investigate longer-term follow ups for that.


> Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to 
> isnull
> 
>
> Key: SPARK-43596
> URL: https://issues.apache.org/jira/browse/SPARK-43596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Priority: Major
>
> We can get a decorrelation error because of rewrites that run in between 
> DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation 
> join conditions. In particular, ConstantFolding can transform `innercol <=> 
> null` to `isnull(innercol)` and then rewriteDomainJoins does not recognize 
> this and throws error Unable to rewrite domain join with conditions: 
> ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so 
> it isn't usable for rewriting the domain join.
> Can fix by recognizing `isnull(x)` as `x <=> null` in rewriteDomainJoins.
> This area is also fragile in general and other rewrites that run between the 
> two steps of decorrelation could potentially break their assumptions, so we 
> may want to investigate longer-term follow ups for that.






[jira] [Updated] (SPARK-43596) Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to isnull

2023-05-19 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-43596:
--
Description: 
We can get a decorrelation error because of rewrites that run in between 
DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation join 
conditions. In particular, ConstantFolding can transform `innercol <=> null` to 
`isnull(innercol)` and then rewriteDomainJoins does not recognize this and 
throws error Unable to rewrite domain join with conditions: 
ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so it 
isn't usable for rewriting the domain join.

Can fix by recognizing `isnull(x)` as `x <=> null` in rewriteDomainJoins.

This area is also fragile in general and other rewrites that run between the 
two steps of decorrelation could potentially break their assumptions, so we may 
want to investigate longer-term follow ups for that.

  was:
We can get a decorrelation error because of rewrites that run in between 
DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation join 
conditions. In particular, ConstantFolding can transform `innercol <=> null` to 
`isnull(innercol)` and then rewriteDomainJoins does not recognize this and 
throws error Unable to rewrite domain join with conditions: 
ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so it 
isn't usable for rewriting the domain join.

Can fix by recognizing isnull(x) as x <=> null in rewriteDomainJoins.

This area is also fragile in general and other rewrites that run between the 
two steps of decorrelation could potentially break their assumptions, so we may 
want to investigate longer-term follow ups for that.


> Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to 
> isnull
> 
>
> Key: SPARK-43596
> URL: https://issues.apache.org/jira/browse/SPARK-43596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Priority: Major
>
> We can get a decorrelation error because of rewrites that run in between 
> DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation 
> join conditions. In particular, ConstantFolding can transform `innercol <=> 
> null` to `isnull(innercol)` and then rewriteDomainJoins does not recognize 
> this and throws error Unable to rewrite domain join with conditions: 
> ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so 
> it isn't usable for rewriting the domain join.
> Can fix by recognizing `isnull(x)` as `x <=> null` in rewriteDomainJoins.
> This area is also fragile in general and other rewrites that run between the 
> two steps of decorrelation could potentially break their assumptions, so we 
> may want to investigate longer-term follow ups for that.






[jira] [Created] (SPARK-43596) Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to isnull

2023-05-19 Thread Jack Chen (Jira)
Jack Chen created SPARK-43596:
-

 Summary: Subquery decorrelation rewriteDomainJoins failure from 
ConstantFolding to isnull
 Key: SPARK-43596
 URL: https://issues.apache.org/jira/browse/SPARK-43596
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Jack Chen


We can get a decorrelation error because of rewrites that run in between 
DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation join 
conditions. In particular, ConstantFolding can transform `innercol <=> null` to 
`isnull(innercol)` and then rewriteDomainJoins does not recognize this and 
throws error Unable to rewrite domain join with conditions: 
ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so it 
isn't usable for rewriting the domain join.

Can fix by recognizing isnull(x) as x <=> null in rewriteDomainJoins.

This area is also fragile in general and other rewrites that run between the 
two steps of decorrelation could potentially break their assumptions, so we may 
want to investigate longer-term follow ups for that.






[jira] [Updated] (SPARK-43596) Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to isnull

2023-05-19 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-43596:
--
Description: 
We can get a decorrelation error because of rewrites that run in between 
DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation join 
conditions. In particular, ConstantFolding can transform `innercol <=> null` to 
`isnull(innercol)` and then rewriteDomainJoins does not recognize this and 
throws error Unable to rewrite domain join with conditions: 
ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so it 
isn't usable for rewriting the domain join.

Can fix by recognizing `isnull(x)` as `x <=> null` in rewriteDomainJoins.

This area is also fragile in general and other rewrites that run between the 
two steps of decorrelation could potentially break their assumptions, so we may 
want to investigate longer-term follow ups for that.

  was:
We can get a decorrelation error because of rewrites that run in between 
DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation join 
conditions. In particular, ConstantFolding can transform `innercol <=> null` to 
`isnull(innercol)` and then rewriteDomainJoins does not recognize this and 
throws error Unable to rewrite domain join with conditions: 
ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so it 
isn't usable for rewriting the domain join.

Can fix by recognizing `isnull(x)` as `x <=> null` in rewriteDomainJoins.

This area is also fragile in general and other rewrites that run between the 
two steps of decorrelation could potentially break their assumptions, so we may 
want to investigate longer-term follow ups for that.


> Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to 
> isnull
> 
>
> Key: SPARK-43596
> URL: https://issues.apache.org/jira/browse/SPARK-43596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Priority: Major
>
> We can get a decorrelation error because of rewrites that run in between 
> DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation 
> join conditions. In particular, ConstantFolding can transform `innercol <=> 
> null` to `isnull(innercol)` and then rewriteDomainJoins does not recognize 
> this and throws error Unable to rewrite domain join with conditions: 
> ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so 
> it isn't usable for rewriting the domain join.
> Can fix by recognizing `isnull(x)` as `x <=> null` in rewriteDomainJoins.
> This area is also fragile in general and other rewrites that run between the 
> two steps of decorrelation could potentially break their assumptions, so we 
> may want to investigate longer-term follow ups for that.






[jira] [Created] (SPARK-43595) Update some maven plugins to newest version

2023-05-19 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-43595:
---

 Summary: Update some maven plugins to newest version
 Key: SPARK-43595
 URL: https://issues.apache.org/jira/browse/SPARK-43595
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: BingKun Pan


Update some Maven plugins to the newest versions, including:
 * exec-maven-plugin from 1.6.0 to 3.1.0
 * scala-maven-plugin from 4.8.0 to 4.8.1
 * maven-antrun-plugin from 1.8 to 3.1.0
 * maven-enforcer-plugin from 3.2.1 to 3.3.0
 * build-helper-maven-plugin from 3.3.0 to 3.4.0
 * maven-surefire-plugin from 3.0.0 to 3.1.0
 * maven-assembly-plugin from 3.1.0 to 3.6.0
 * maven-install-plugin from 3.1.0 to 3.1.1
 * maven-deploy-plugin from 3.1.0 to 3.1.1
 * maven-checkstyle-plugin from 3.2.1 to 3.2.2






[jira] [Comment Edited] (SPARK-43523) Memory leak in Spark UI

2023-05-19 Thread Amine Bagdouri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724436#comment-17724436
 ] 

Amine Bagdouri edited comment on SPARK-43523 at 5/19/23 10:37 PM:
--

These units are only considered running from the point of view of Spark UI, 
while in reality they were already finished a long time ago and the application 
is totally idle.

Spark UI is not aware that these leaked units are already finished because the 
queue (AsyncEventQueue) it is using to listen to events (onJobEnd, onTaskEnd, 
onStageCompleted, ...) in order to update its state is full, and new events are 
dropped.

As shown below, we try to add the event to the bounded queue using 
LinkedBlockingQueue::offer. If the queue is full, the event is not inserted and 
the droppedEvents counter is incremented.

[AsyncEventQueue.scala post (line 
152)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala#LL152C1-L164C1]
 :
{code:java}
  def post(event: SparkListenerEvent): Unit = {
    if (stopped.get()) {
      return
    }

    eventCount.incrementAndGet()
    if (eventQueue.offer(event)) {
      return
    }

    eventCount.decrementAndGet()
    droppedEvents.inc()
    droppedEventsCounter.incrementAndGet() {code}
 I will try to reproduce the leak on a more recent version of Spark and provide 
you with the results.
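
As a toy illustration (hypothetical names, not Spark code) of why a dropped end 
event leaks listener state, assuming only that entries are removed on the 
matching end event:
{code:java}
import scala.collection.mutable

final case class TaskStart(id: Long)
final case class TaskEnd(id: Long)

// Entries are removed only when the matching end event arrives, so any end
// event dropped by the bounded queue leaves its entry in the map forever.
class ToyStatusListener {
  val liveTasks = mutable.Map.empty[Long, TaskStart]
  def onTaskStart(e: TaskStart): Unit = liveTasks(e.id) = e
  def onTaskEnd(e: TaskEnd): Unit = liveTasks.remove(e.id)
} {code}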


was (Author: JIRAUSER300423):
These units are only considered running from the point of view of Spark UI, 
while in reality they were already finished a long time ago and the application 
is totally idle.

Spark UI is not aware of the update of the state of these leaked units because 
the queue (AsyncEventQueue) it is using to listen to events (onJobEnd, 
onTaskEnd, onStageCompleted, ...) in order to update its state is full and new 
events are dropped.

As shown below, we try to add the event to the bounded queue using 
LinkedBlockingQueue::offer. If the queue is full, the event is not inserted and 
the droppedEvents counter is incremented.

[AsyncEventQueue.scala post (line 
152)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala#LL152C1-L164C1]
 :
{code:java}
  def post(event: SparkListenerEvent): Unit = {
    if (stopped.get()) {
      return
    }

    eventCount.incrementAndGet()
    if (eventQueue.offer(event)) {
      return
    }

    eventCount.decrementAndGet()
    droppedEvents.inc()
    droppedEventsCounter.incrementAndGet() {code}
 I will try to reproduce the leak on a more recent version of Spark and provide 
you with the results.

> Memory leak in Spark UI
> ---
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: Amine Bagdouri
>Priority: Major
>
> We have a distributed Spark application running on Azure HDInsight using 
> Spark version 2.4.4.
> After a few days of active processing on our application, we have noticed 
> that the GC CPU time ratio of the driver is close to 100%. We suspected a 
> memory leak. Thus, we have produced a heap dump and analyzed it using Eclipse 
> Memory Analyzer.
> Here is some interesting data from the driver's heap dump (heap size is 8 GB):
>  * The estimated retained heap size of String objects (~5M instances) is 3.3 
> GB. It seems that most of these instances correspond to spark events.
>  * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB.
>  * The number of LiveJob objects with status "RUNNING" is 18K, knowing that 
> there shouldn't be more than 16 live running jobs since we use a fixed size 
> thread pool of 16 threads to run spark queries.
>  * The number of LiveTask objects is 485K.
>  * The AsyncEventQueue instance associated to the AppStatusListener has a 
> value of 854 for dropped events count and a value of 10001 for total events 
> count, knowing that the dropped events counter is reset every minute and that 
> the queue's default capacity is 1.
> We think that there is a memory leak in Spark UI. Here is our analysis of the 
> root cause of this leak:
>  * AppStatusListener is notified of Spark events using a bounded queue in 
> AsyncEventQueue.
>  * AppStatusListener updates its state (kvstore, liveTasks, liveStages, 
> liveJobs, ...) based on the received events. For example, onTaskStart adds a 
> task to liveTasks map and onTaskEnd removes the task from liveTasks map.
>  * When the rate of events is very high, the bounded queue in AsyncEventQueue 
> is full, some events are dropped and don't make it to AppStatusListener.
>  * Dropped events that signal the end of a processing unit prevent the state 
> of AppStatusListener from being cleaned. For example, a dropped 

[jira] [Comment Edited] (SPARK-43523) Memory leak in Spark UI

2023-05-19 Thread Amine Bagdouri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724436#comment-17724436
 ] 

Amine Bagdouri edited comment on SPARK-43523 at 5/19/23 10:35 PM:
--

These units are only considered running from the point of view of Spark UI, 
while in reality they were already finished a long time ago and the application 
is totally idle.

Spark UI is not aware of the update of the state of these leaked units because 
the queue (AsyncEventQueue) it is using to listen to events (onJobEnd, 
onTaskEnd, onStageCompleted, ...) in order to update its state is full and new 
events are dropped.

As shown below, we try to add the event to the bounded queue using 
LinkedBlockingQueue::offer. If the queue is full, the event is not inserted and 
the droppedEvents counter is incremented.

[AsyncEventQueue.scala post (line 
152)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala#LL152C1-L164C1]
 :
{code:java}
  def post(event: SparkListenerEvent): Unit = {
    if (stopped.get()) {
      return
    }

    eventCount.incrementAndGet()
    if (eventQueue.offer(event)) {
      return
    }

    eventCount.decrementAndGet()
    droppedEvents.inc()
    droppedEventsCounter.incrementAndGet() {code}
 I will try to reproduce the leak on a more recent version of Spark and provide 
you with the results.


was (Author: JIRAUSER300423):
These units are only considered running from the point of view of Spark UI, 
while in reality they were already finished a long time ago and the application 
is totally idle.

Spark UI is not aware of the update of the state of these leaked units because 
the queue (AsyncEventQueue) it is using to listen to events (onJobEnd, 
onTaskEnd, onStageCompleted, ...) in order to update its state is full and new 
events are dropped.

As shown below, we try to add the event to the bounded queue using 
LinkedBlockingQueue::offer. If the queue is full, the event is not inserted and 
the droppedEvents counter is incremented.

[AsyncEventQueue.scala post (line 
152)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala#LL152C1-L164C1]
 :
{code:java}
  def post(event: SparkListenerEvent): Unit = {
    if (stopped.get()) {
      return
    }

    eventCount.incrementAndGet()
    if (eventQueue.offer(event)) {
      return
    }

    eventCount.decrementAndGet()
    droppedEvents.inc()
    droppedEventsCounter.incrementAndGet() {code}
 I will try to reproduce the leak on a more recent version of Spark and provide 
you with the results.

 

> Memory leak in Spark UI
> ---
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: Amine Bagdouri
>Priority: Major
>
> We have a distributed Spark application running on Azure HDInsight using 
> Spark version 2.4.4.
> After a few days of active processing on our application, we have noticed 
> that the GC CPU time ratio of the driver is close to 100%. We suspected a 
> memory leak. Thus, we have produced a heap dump and analyzed it using Eclipse 
> Memory Analyzer.
> Here is some interesting data from the driver's heap dump (heap size is 8 GB):
>  * The estimated retained heap size of String objects (~5M instances) is 3.3 
> GB. It seems that most of these instances correspond to spark events.
>  * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB.
>  * The number of LiveJob objects with status "RUNNING" is 18K, knowing that 
> there shouldn't be more than 16 live running jobs since we use a fixed size 
> thread pool of 16 threads to run spark queries.
>  * The number of LiveTask objects is 485K.
>  * The AsyncEventQueue instance associated to the AppStatusListener has a 
> value of 854 for dropped events count and a value of 10001 for total events 
> count, knowing that the dropped events counter is reset every minute and that 
> the queue's default capacity is 1.
> We think that there is a memory leak in Spark UI. Here is our analysis of the 
> root cause of this leak:
>  * AppStatusListener is notified of Spark events using a bounded queue in 
> AsyncEventQueue.
>  * AppStatusListener updates its state (kvstore, liveTasks, liveStages, 
> liveJobs, ...) based on the received events. For example, onTaskStart adds a 
> task to liveTasks map and onTaskEnd removes the task from liveTasks map.
>  * When the rate of events is very high, the bounded queue in AsyncEventQueue 
> is full, some events are dropped and don't make it to AppStatusListener.
>  * Dropped events that signal the end of a processing unit prevent the state 
> of AppStatusListener from being cleaned. For example, 

[jira] [Comment Edited] (SPARK-43523) Memory leak in Spark UI

2023-05-19 Thread Amine Bagdouri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724436#comment-17724436
 ] 

Amine Bagdouri edited comment on SPARK-43523 at 5/19/23 10:35 PM:
--

These units are only considered running from the point of view of Spark UI, 
while in reality they were already finished a long time ago and the application 
is totally idle.

Spark UI is not aware of the update of the state of these leaked units because 
the queue (AsyncEventQueue) it is using to listen to events (onJobEnd, 
onTaskEnd, onStageCompleted, ...) in order to update its state is full and new 
events are dropped.

As shown below, we try to add the event to the bounded queue using 
LinkedBlockingQueue::offer. If the queue is full, the event is not inserted and 
the droppedEvents counter is incremented.

[AsyncEventQueue.scala post (line 
152)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala#LL152C1-L164C1]
 :
{code:java}
  def post(event: SparkListenerEvent): Unit = {
    if (stopped.get()) {
      return
    }

    eventCount.incrementAndGet()
    if (eventQueue.offer(event)) {
      return
    }

    eventCount.decrementAndGet()
    droppedEvents.inc()
    droppedEventsCounter.incrementAndGet() {code}
 I will try to reproduce the leak on a more recent version of Spark and provide 
you with the results.

 


was (Author: JIRAUSER300423):
These units are only considered running from the point of view of Spark UI, 
while in reality they were already finished a long time ago and the application 
is totally idle.

Spark UI is not aware of the update of the state of these leaked units because 
the queue (AsyncEventQueue) it is using to listen to events (onJobEnd, 
onTaskEnd, onStageCompleted, ...) in order to update its state is full and new 
events are dropped.

As shown below, we try to add the event to the bounded queue using 
LinkedBlockingQueue::offer. If the queue is full, the event is not inserted and 
the droppedEvents counter is incremented.

[AsyncEventQueue.scala post (line 
152)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala#LL152C1-L164C1]
 :

 
{code:java}
  def post(event: SparkListenerEvent): Unit = {
    if (stopped.get()) {
      return
    }

    eventCount.incrementAndGet()
    if (eventQueue.offer(event)) {
      return
    }

    eventCount.decrementAndGet()
    droppedEvents.inc()
    droppedEventsCounter.incrementAndGet() {code}
 I will try to reproduce the leak on a more recent version of Spark and provide 
you with the results.

 

> Memory leak in Spark UI
> ---
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: Amine Bagdouri
>Priority: Major
>
> We have a distributed Spark application running on Azure HDInsight using 
> Spark version 2.4.4.
> After a few days of active processing on our application, we have noticed 
> that the GC CPU time ratio of the driver is close to 100%. We suspected a 
> memory leak. Thus, we have produced a heap dump and analyzed it using Eclipse 
> Memory Analyzer.
> Here is some interesting data from the driver's heap dump (heap size is 8 GB):
>  * The estimated retained heap size of String objects (~5M instances) is 3.3 
> GB. It seems that most of these instances correspond to spark events.
>  * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB.
>  * The number of LiveJob objects with status "RUNNING" is 18K, knowing that 
> there shouldn't be more than 16 live running jobs since we use a fixed size 
> thread pool of 16 threads to run spark queries.
>  * The number of LiveTask objects is 485K.
>  * The AsyncEventQueue instance associated to the AppStatusListener has a 
> value of 854 for dropped events count and a value of 10001 for total events 
> count, knowing that the dropped events counter is reset every minute and that 
> the queue's default capacity is 1.
> We think that there is a memory leak in Spark UI. Here is our analysis of the 
> root cause of this leak:
>  * AppStatusListener is notified of Spark events using a bounded queue in 
> AsyncEventQueue.
>  * AppStatusListener updates its state (kvstore, liveTasks, liveStages, 
> liveJobs, ...) based on the received events. For example, onTaskStart adds a 
> task to liveTasks map and onTaskEnd removes the task from liveTasks map.
>  * When the rate of events is very high, the bounded queue in AsyncEventQueue 
> is full, some events are dropped and don't make it to AppStatusListener.
>  * Dropped events that signal the end of a processing unit prevent the state 
> of AppStatusListener from being cleaned. For 

[jira] [Commented] (SPARK-43523) Memory leak in Spark UI

2023-05-19 Thread Amine Bagdouri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724436#comment-17724436
 ] 

Amine Bagdouri commented on SPARK-43523:


These units are only considered running from the point of view of Spark UI, 
while in reality they were already finished a long time ago and the application 
is totally idle.

Spark UI is not aware of the update of the state of these leaked units because 
the queue (AsyncEventQueue) it is using to listen to events (onJobEnd, 
onTaskEnd, onStageCompleted, ...) in order to update its state is full and new 
events are dropped.

As shown below, we try to add the event to the bounded queue using 
LinkedBlockingQueue::offer. If the queue is full, the event is not inserted and 
the droppedEvents counter is incremented.

[AsyncEventQueue.scala post (line 
152)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala#LL152C1-L164C1]
 :

 
{code:java}
  def post(event: SparkListenerEvent): Unit = {
    if (stopped.get()) {
      return
    }

    eventCount.incrementAndGet()
    if (eventQueue.offer(event)) {
      return
    }

    eventCount.decrementAndGet()
    droppedEvents.inc()
    droppedEventsCounter.incrementAndGet() {code}
 I will try to reproduce the leak on a more recent version of Spark and provide 
you with the results.

 

> Memory leak in Spark UI
> ---
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: Amine Bagdouri
>Priority: Major
>
> We have a distributed Spark application running on Azure HDInsight using 
> Spark version 2.4.4.
> After a few days of active processing on our application, we have noticed 
> that the GC CPU time ratio of the driver is close to 100%. We suspected a 
> memory leak. Thus, we have produced a heap dump and analyzed it using Eclipse 
> Memory Analyzer.
> Here is some interesting data from the driver's heap dump (heap size is 8 GB):
>  * The estimated retained heap size of String objects (~5M instances) is 3.3 
> GB. It seems that most of these instances correspond to spark events.
>  * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB.
>  * The number of LiveJob objects with status "RUNNING" is 18K, knowing that 
> there shouldn't be more than 16 live running jobs since we use a fixed size 
> thread pool of 16 threads to run spark queries.
>  * The number of LiveTask objects is 485K.
>  * The AsyncEventQueue instance associated to the AppStatusListener has a 
> value of 854 for dropped events count and a value of 10001 for total events 
> count, knowing that the dropped events counter is reset every minute and that 
> the queue's default capacity is 1.
> We think that there is a memory leak in Spark UI. Here is our analysis of the 
> root cause of this leak:
>  * AppStatusListener is notified of Spark events using a bounded queue in 
> AsyncEventQueue.
>  * AppStatusListener updates its state (kvstore, liveTasks, liveStages, 
> liveJobs, ...) based on the received events. For example, onTaskStart adds a 
> task to liveTasks map and onTaskEnd removes the task from liveTasks map.
>  * When the rate of events is very high, the bounded queue in AsyncEventQueue 
> is full, some events are dropped and don't make it to AppStatusListener.
>  * Dropped events that signal the end of a processing unit prevent the state 
> of AppStatusListener from being cleaned. For example, a dropped onTaskEnd 
> event, will prevent the task from being removed from liveTasks map, and the 
> task will remain in the heap until the driver's JVM is stopped.
> We were able to confirm our analysis by reducing the capacity of the 
> AsyncEventQueue (spark.scheduler.listenerbus.eventqueue.capacity=10). After 
> having launched many spark queries using this config, we observed that the 
> number of active jobs in Spark UI increased rapidly and remained high even 
> though all submitted queries have completed. We have also noticed that some 
> executor task counters in Spark UI were negative, which confirms that 
> AppStatusListener state does not accurately reflect the reality and that it 
> can be a victim of event drops.
> Suggested fix:
> There are some limits today on the number of "dead" objects in 
> AppStatusListener's maps (for example: spark.ui.retainedJobs). We suggest 
> enforcing another configurable limit on the number of total objects in 
> AppStatusListener's maps and kvstore. This should limit the leak in the case 
> of high events rate, but AppStatusListener stats will remain inaccurate.
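
The reproduction described in the report above relies on shrinking the 
listener-bus queue; a hedged sketch of that setup (only the config key comes 
from the report, the rest is illustrative):
{code:java}
import org.apache.spark.sql.SparkSession

// Shrink the listener-bus event queue so event drops (and the resulting stuck
// "RUNNING" jobs/tasks in the UI) show up quickly under load.
val spark = SparkSession.builder()
  .appName("listener-queue-drop-repro")
  .config("spark.scheduler.listenerbus.eventqueue.capacity", "10")
  .getOrCreate() {code}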




[jira] [Commented] (SPARK-43523) Memory leak in Spark UI

2023-05-19 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724430#comment-17724430
 ] 

Sean R. Owen commented on SPARK-43523:
--

You would need to report this against a supported Spark version for it to be 
considered, because any change resulting from this would have to be applied to 
master. It's entirely possible it is already fixed, and it sounds like you 
haven't ruled that out. What causes a job to miss events? If that's what's 
happening, then that is the issue, not a memory leak. You're talking about 
memory consumed by things considered 'running'.

> Memory leak in Spark UI
> ---
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: Amine Bagdouri
>Priority: Major
>
> We have a distributed Spark application running on Azure HDInsight using 
> Spark version 2.4.4.
> After a few days of active processing on our application, we have noticed 
> that the GC CPU time ratio of the driver is close to 100%. We suspected a 
> memory leak. Thus, we have produced a heap dump and analyzed it using Eclipse 
> Memory Analyzer.
> Here is some interesting data from the driver's heap dump (heap size is 8 GB):
>  * The estimated retained heap size of String objects (~5M instances) is 3.3 
> GB. It seems that most of these instances correspond to spark events.
>  * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB.
>  * The number of LiveJob objects with status "RUNNING" is 18K, knowing that 
> there shouldn't be more than 16 live running jobs since we use a fixed size 
> thread pool of 16 threads to run spark queries.
>  * The number of LiveTask objects is 485K.
>  * The AsyncEventQueue instance associated to the AppStatusListener has a 
> value of 854 for dropped events count and a value of 10001 for total events 
> count, knowing that the dropped events counter is reset every minute and that 
> the queue's default capacity is 1.
> We think that there is a memory leak in Spark UI. Here is our analysis of the 
> root cause of this leak:
>  * AppStatusListener is notified of Spark events using a bounded queue in 
> AsyncEventQueue.
>  * AppStatusListener updates its state (kvstore, liveTasks, liveStages, 
> liveJobs, ...) based on the received events. For example, onTaskStart adds a 
> task to liveTasks map and onTaskEnd removes the task from liveTasks map.
>  * When the rate of events is very high, the bounded queue in AsyncEventQueue 
> is full, some events are dropped and don't make it to AppStatusListener.
>  * Dropped events that signal the end of a processing unit prevent the state 
> of AppStatusListener from being cleaned. For example, a dropped onTaskEnd 
> event, will prevent the task from being removed from liveTasks map, and the 
> task will remain in the heap until the driver's JVM is stopped.
> We were able to confirm our analysis by reducing the capacity of the 
> AsyncEventQueue (spark.scheduler.listenerbus.eventqueue.capacity=10). After 
> having launched many spark queries using this config, we observed that the 
> number of active jobs in Spark UI increased rapidly and remained high even 
> though all submitted queries have completed. We have also noticed that some 
> executor task counters in Spark UI were negative, which confirms that 
> AppStatusListener state does not accurately reflect the reality and that it 
> can be a victim of event drops.
> Suggested fix:
> There are some limits today on the number of "dead" objects in 
> AppStatusListener's maps (for example: spark.ui.retainedJobs). We suggest 
> enforcing another configurable limit on the number of total objects in 
> AppStatusListener's maps and kvstore. This should limit the leak in the case 
> of high events rate, but AppStatusListener stats will remain inaccurate.






[jira] [Comment Edited] (SPARK-43523) Memory leak in Spark UI

2023-05-19 Thread Amine Bagdouri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724424#comment-17724424
 ] 

Amine Bagdouri edited comment on SPARK-43523 at 5/19/23 10:09 PM:
--

Hi [~srowen], thanks for your reply.

 

However, I still think that it is a memory leak, since many jobs, stages and 
tasks accumulate in memory and are never released.

For example, all jobs that have a "RUNNING" state and that missed the 
"onJobEnd" event (due to event drop from queue) will remain in the heap 
forever, and keep adding up indefinitely.

The same goes for tasks and stages that missed the "onTaskEnd" and the 
"onStageCompleted" events.

 

As for the limits on the number of retained jobs, tasks and stages, they only 
apply to units that are considered finished (as shown below). Thus, these 
limits do not prevent the memory leak. That is why I suggest adding other 
limits that apply even to units that are still running or pending (see the 
sketch after these excerpts).
 * [AppStatusListener.scala cleanupJobs (line 
1263)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#L1263-L1265]
 :

{code:java}
    val toDelete = KVUtils.viewToSeq(view, countToDelete.toInt) { j =>
      j.info.status != JobExecutionStatus.RUNNING && j.info.status != 
JobExecutionStatus.UNKNOWN
    }{code}
 * [AppStatusListener.scala cleanupStagesInKVStore (line 
1314)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1314C1-L1316C1]
 :

{code:java}
    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
      s.info.status != v1.StageStatus.ACTIVE && s.info.status != 
v1.StageStatus.PENDING
    } {code}
 * [AppStatusListener.scala cleanupTasks (line 
1377)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1377C1-L1380C1]
 :

{code:java}
      // Try to delete finished tasks only.
      val toDelete = KVUtils.viewToSeq(view, countToDelete) { t =>
        !live || t.status != TaskState.RUNNING.toString()
      } {code}
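
A rough sketch of the kind of extra cap suggested here (hypothetical class and 
names, not Spark code), assuming a simple insertion-ordered map of live units:
{code:java}
import scala.collection.mutable

final case class LiveUnit(id: Long, status: String)

// Hypothetical cap on the *total* number of tracked units: once the cap is
// exceeded, evict the oldest entries even if they are still RUNNING/PENDING,
// so units whose end events were dropped cannot accumulate without bound.
class CappedLiveStore(maxTotal: Int) {
  private val units = mutable.LinkedHashMap.empty[Long, LiveUnit]

  def update(u: LiveUnit): Unit = {
    units(u.id) = u
    if (units.size > maxTotal) {
      units.keys.take(units.size - maxTotal).toList.foreach(units.remove)
    }
  }

  def remove(id: Long): Unit = units.remove(id)
} {code}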
 

As for the version, I have only used the Spark version 2.4.4. But I don't think 
that upgrading to a more recent version will fix the memory leak since the code 
causing the leak is still there.

 

Regards,


was (Author: JIRAUSER300423):
Hi [~srowen], thanks for your reply.

 

However, I still think that it is a memory leak, since many jobs, stages and 
tasks accumulate in memory and are never released.

For example, all jobs that have a "RUNNING" state and that missed the 
"onJobEnd" event (due to event drop from queue) will remain in the heap 
forever, and keep adding up indefinitely.

The same goes for tasks and stages that missed the "onTaskEnd" and the 
"onStageCompleted" events.

 

As for the limits on the number of retained jobs, tasks and stages, they only 
apply to units that are considered finished (as shown below). Thus, these 
limits do not prevent the memory leak. That is why I suggest adding other 
limits that apply even to units that are still running or pending.
 * [AppStatusListener.scala cleanupJobs (line 
1263)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#L1263-L1265]
 :

{code:java}
    val toDelete = KVUtils.viewToSeq(view, countToDelete.toInt) { j =>
      j.info.status != JobExecutionStatus.RUNNING && j.info.status != 
JobExecutionStatus.UNKNOWN
    }{code}
 * [AppStatusListener.scala cleanupStagesInKVStore (line 
1309)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1314C1-L1316C1]
 :

{code:java}
    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
      s.info.status != v1.StageStatus.ACTIVE && s.info.status != 
v1.StageStatus.PENDING
    } {code}
 * [AppStatusListener.scala cleanupTasks (line 
1314)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1377C1-L1380C1]
 :

{code:java}
      // Try to delete finished tasks only.
      val toDelete = KVUtils.viewToSeq(view, countToDelete) { t =>
        !live || t.status != TaskState.RUNNING.toString()
      } {code}
 

As for the version, I have only used the Spark version 2.4.4. But I don't think 
that upgrading to a more recent version will fix the memory leak since the code 
causing the leak is still there.

 

Regards,

> Memory leak in Spark UI
> ---
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: Amine Bagdouri
>Priority: Major
>
> We have a distributed Spark application running on Azure HDInsight using 
> Spark version 

[jira] [Comment Edited] (SPARK-43523) Memory leak in Spark UI

2023-05-19 Thread Amine Bagdouri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724424#comment-17724424
 ] 

Amine Bagdouri edited comment on SPARK-43523 at 5/19/23 9:59 PM:
-

Hi [~srowen], thanks for your reply.

 

However, I still think that it is a memory leak, since many jobs, stages and 
tasks accumulate in memory and are never released.

For example, all jobs that have a "RUNNING" state and that missed the 
"onJobEnd" event (due to event drop from queue) will remain in the heap 
forever, and keep adding up indefinitely.

The same goes for tasks and stages that missed the "onTaskEnd" and the 
"onStageCompleted" events.

 

As for the limits on the number of retained jobs, tasks and stages, they only 
apply to units that are considered finished (as shown below). Thus, these 
limits do not prevent the memory leak. That is why I suggest adding other 
limits that apply even to units that are still running or pending.
 * [AppStatusListener.scala cleanupJobs (line 
1263)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#L1263-L1265]
 :

{code:java}
    val toDelete = KVUtils.viewToSeq(view, countToDelete.toInt) { j =>
      j.info.status != JobExecutionStatus.RUNNING && j.info.status != 
JobExecutionStatus.UNKNOWN
    }{code}
 * [AppStatusListener.scala cleanupStagesInKVStore (line 
1309)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1314C1-L1316C1]
 :

{code:java}
    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
      s.info.status != v1.StageStatus.ACTIVE && s.info.status != 
v1.StageStatus.PENDING
    } {code}
 * [AppStatusListener.scala cleanupTasks (line 
1314)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1377C1-L1380C1]
 :

{code:java}
      // Try to delete finished tasks only.
      val toDelete = KVUtils.viewToSeq(view, countToDelete) { t =>
        !live || t.status != TaskState.RUNNING.toString()
      } {code}
 

As for the version, I have only used the Spark version 2.4.4. But I don't think 
that upgrading to a more recent version will fix the memory leak since the code 
causing the leak is still there.

 

Regards,


was (Author: JIRAUSER300423):
Hi [~srowen], thanks for your reply.

 

However, I still think that it is a memory leak, since many jobs, stages and 
tasks accumulate in memory and are never released.

For example, all jobs that have a "RUNNING" state and that missed the 
"onJobEnd" event (due to event drop from queue) will remain in the heap 
forever, and keep adding up indefinitely.

The same goes for tasks and stages that missed the "onTaskEnd" and the 
"onStageCompleted" events.

 

As for the limits on the number of retained jobs, tasks and stages, they only 
apply to units that are considered finished (as shown below). Thus, these 
limits do not prevent the memory leak. That is why I suggest adding other 
limits that apply even to units that are still running or pending.
 * [AppStatusListener.scala cleanupJobs (line 
1263)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#L1263-L1265]
 :

{code:java}
    val toDelete = KVUtils.viewToSeq(view, countToDelete.toInt) { j =>
      j.info.status != JobExecutionStatus.RUNNING && j.info.status != 
JobExecutionStatus.UNKNOWN
    }{code}
 * [AppStatusListener.scala cleanupStagesInKVStore (line 
1309)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1314C1-L1316C1]
 :

{code:java}
    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
      s.info.status != v1.StageStatus.ACTIVE && s.info.status != 
v1.StageStatus.PENDING
    } {code}
 * [AppStatusListener.scala cleanupTasks (line 
1314)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1377C1-L1380C1]
 :

{code:java}
      // Try to delete finished tasks only.
      val toDelete = KVUtils.viewToSeq(view, countToDelete) { t =>
        !live || t.status != TaskState.RUNNING.toString()
      } {code}
 

As for the version, I have only used the Spark version 2.4.4. But I don't think 
that upgrading to a more recent version will fix the memory leak since the code 
causing the leak is still there.

 

Regards,

> Memory leak in Spark UI
> ---
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: Amine Bagdouri
>Priority: Major
>
> We have a distributed Spark application running on Azure HDInsight using 
> Spark version 

[jira] [Comment Edited] (SPARK-43523) Memory leak in Spark UI

2023-05-19 Thread Amine Bagdouri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724424#comment-17724424
 ] 

Amine Bagdouri edited comment on SPARK-43523 at 5/19/23 9:58 PM:
-

Hi [~srowen], thanks for your reply.

 

However, I still think that it is a memory leak, since many jobs, stages and 
tasks accumulate in memory and are never released.

For example, all jobs that have a "RUNNING" state and that missed the 
"onJobEnd" event (due to event drop from queue) will remain in the heap 
forever, and keep adding up indefinitely.

The same goes for tasks and stages that missed the "onTaskEnd" and the 
"onStageCompleted" events.

 

As for the limits on the number of retained jobs, tasks and stages, they only 
apply to units that are considered finished (as shown below). Thus, these 
limits do not prevent the memory leak. That is why I suggest adding other 
limits that apply even to units that are still running or pending.
 * [AppStatusListener.scala cleanupJobs (line 
1263)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#L1263-L1265]
 :

{code:java}
    val toDelete = KVUtils.viewToSeq(view, countToDelete.toInt) { j =>
      j.info.status != JobExecutionStatus.RUNNING && j.info.status != 
JobExecutionStatus.UNKNOWN
    }{code}
 * [AppStatusListener.scala cleanupStagesInKVStore (line 
1309)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1314C1-L1316C1]
 :

{code:java}
    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
      s.info.status != v1.StageStatus.ACTIVE && s.info.status != 
v1.StageStatus.PENDING
    } {code}
 * [AppStatusListener.scala cleanupTasks (line 
1314)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1377C1-L1380C1]
 :

{code:java}
      // Try to delete finished tasks only.
      val toDelete = KVUtils.viewToSeq(view, countToDelete) { t =>
        !live || t.status != TaskState.RUNNING.toString()
      } {code}
 

As for the version, I have only used the Spark version 2.4.4. But I don't think 
that upgrading to a more recent version will fix the memory leak since the code 
causing the leak is still there.

 

Regards,


was (Author: JIRAUSER300423):
Hi [~srowen], thanks for your reply.

 

However, I still think that it is a memory leak, since many jobs, stages and 
tasks accumulate in memory and are never released.

For example, all jobs that have a "RUNNING" state and that missed the 
"onJobEnd" event (due to event drop from queue) will remain in the heap forever.

The same goes for tasks and stages that missed the "onTaskEnd" and the 
"onStageCompleted" events.

 

As for the limits on the number of retained jobs, tasks and stages, they only 
apply to units that are considered finished (as shown below). Thus, these 
limits do not prevent the memory leak. That is why I suggest adding other 
limits that apply even to units that are still running or pending.
 * [AppStatusListener.scala cleanupJobs (line 
1263)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#L1263-L1265]
 :

{code:java}
    val toDelete = KVUtils.viewToSeq(view, countToDelete.toInt) { j =>
      j.info.status != JobExecutionStatus.RUNNING && j.info.status != 
JobExecutionStatus.UNKNOWN
    }{code}
 * [AppStatusListener.scala cleanupStagesInKVStore (line 
1309)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1314C1-L1316C1]
 :

{code:java}
    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
      s.info.status != v1.StageStatus.ACTIVE && s.info.status != 
v1.StageStatus.PENDING
    } {code}
 * [AppStatusListener.scala cleanupTasks (line 
1314)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1377C1-L1380C1]
 :

{code:java}
      // Try to delete finished tasks only.
      val toDelete = KVUtils.viewToSeq(view, countToDelete) { t =>
        !live || t.status != TaskState.RUNNING.toString()
      } {code}
 

As for the version, I have only used the Spark version 2.4.4. But I don't think 
that upgrading to a more recent version will fix the memory leak since the code 
causing the leak is still there.

 

Regards,

> Memory leak in Spark UI
> ---
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: Amine Bagdouri
>Priority: Major
>
> We have a distributed Spark application running on Azure HDInsight using 
> Spark version 2.4.4.
> After a few days of active 

[jira] [Resolved] (SPARK-43543) Standardize Nested Complex DataTypes Support

2023-05-19 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-43543.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41147
[https://github.com/apache/spark/pull/41147]

> Standardize Nested Complex DataTypes Support
> 
>
> Key: SPARK-43543
> URL: https://issues.apache.org/jira/browse/SPARK-43543
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Assigned] (SPARK-43543) Standardize Nested Complex DataTypes Support

2023-05-19 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng reassigned SPARK-43543:


Assignee: Xinrong Meng

> Standardize Nested Complex DataTypes Support
> 
>
> Key: SPARK-43543
> URL: https://issues.apache.org/jira/browse/SPARK-43543
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>







[jira] [Comment Edited] (SPARK-43523) Memory leak in Spark UI

2023-05-19 Thread Amine Bagdouri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724424#comment-17724424
 ] 

Amine Bagdouri edited comment on SPARK-43523 at 5/19/23 9:55 PM:
-

Hi [~srowen], thanks for your reply.

 

However, I still think that it is a memory leak, since many jobs, stages and 
tasks accumulate in memory and are never released.

For example, all jobs that have a "RUNNING" state and that missed the 
"onJobEnd" event (due to event drop from queue) will remain in the heap forever.

The same goes for tasks and stages that missed the "onTaskEnd" and the 
"onStageCompleted" events.

 

As for the limits on the number of retained jobs, tasks and stages, they only 
apply to units that are considered finished (as shown below). Thus, these 
limits do not prevent the memory leak. That is why I suggest adding other 
limits that apply even to units that are still running or pending; a rough, 
illustrative sketch of such a limit follows the snippets below.
 * [AppStatusListener.scala cleanupJobs (line 
1263)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#L1263-L1265]
 :

{code:java}
    val toDelete = KVUtils.viewToSeq(view, countToDelete.toInt) { j =>
      j.info.status != JobExecutionStatus.RUNNING &&
        j.info.status != JobExecutionStatus.UNKNOWN
    }{code}
 * [AppStatusListener.scala cleanupStagesInKVStore (line 
1309)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1314C1-L1316C1]
 :

{code:java}
    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
      s.info.status != v1.StageStatus.ACTIVE &&
        s.info.status != v1.StageStatus.PENDING
    } {code}
 * [AppStatusListener.scala cleanupTasks (line 
1314)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1377C1-L1380C1]
 :

{code:java}
      // Try to delete finished tasks only.
      val toDelete = KVUtils.viewToSeq(view, countToDelete) { t =>
        !live || t.status != TaskState.RUNNING.toString()
      } {code}
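To make the suggestion concrete, here is a minimal, self-contained sketch of 
the kind of cap I mean. It is plain Scala, not Spark's actual AppStatusListener 
code; TrackedJob and BoundedLiveStore are made-up names for illustration:

{code:java}
import scala.collection.mutable

final case class TrackedJob(jobId: Int, submissionTime: Long, running: Boolean)

// Keeps at most `maxRunning` entries that are still marked as running, so jobs
// that never receive an "end" event cannot accumulate without bound.
class BoundedLiveStore(maxRunning: Int) {
  private val jobs = mutable.LinkedHashMap.empty[Int, TrackedJob]

  def onJobStart(job: TrackedJob): Unit = {
    jobs(job.jobId) = job
    // Evict the oldest still-running entries once the cap is exceeded.
    val running = jobs.values.filter(_.running).toSeq.sortBy(_.submissionTime)
    running.dropRight(maxRunning).foreach(old => jobs.remove(old.jobId))
  }

  def onJobEnd(jobId: Int): Unit = jobs.remove(jobId)

  def size: Int = jobs.size
}
{code}

The real change would of course have to plug into the existing kvstore-based 
cleanup, but the point is simply that a second, unconditional limit on 
running/pending units bounds the heap even when end events are dropped.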
 

As for the version, I have only used the Spark version 2.4.4. But I don't think 
that upgrading to a more recent version will fix the memory leak since the code 
causing the leak is still there.

 

Regards,


was (Author: JIRAUSER300423):
Hi [~srowen], thanks for your reply.

 

However, I still think that it is a memory leak, since many jobs, stages and 
tasks accumulate in memory and are never released.

For example, all jobs that have a "RUNNING" state and that missed the 
"onJobEnd" event (due to event drop from queue) will remain in the heap forever.

The same goes for tasks and stages that missed the "onTaskEnd" and the 
"onStageCompleted" events.

 

As for the limits on the number of retained jobs, tasks and stages, they only 
apply to units that are considered finished (as shown below). Thus, these 
limits do not prevent the memory leak. That is why I suggest adding other 
limits that apply even to units that are still running or pending.
 * [AppStatusListener.scala cleanupJobs (line 
1263)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#L1263-L1265]
 :

{code:java}
    val toDelete = KVUtils.viewToSeq(view, countToDelete.toInt) { j =>
      j.info.status != JobExecutionStatus.RUNNING && j.info.status != 
JobExecutionStatus.UNKNOWN
    }{code}
 * [AppStatusListener.scala cleanupStagesInKVStore (line 
1309)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1314C1-L1316C1]
 :

{code:java}
    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
      s.info.status != v1.StageStatus.ACTIVE && s.info.status != 
v1.StageStatus.PENDING
    } {code}
 * [AppStatusListener.scala cleanupTasks (line 
1314)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1377C1-L1380C1]
 :

{code:java}
      // Try to delete finished tasks only.
      val toDelete = KVUtils.viewToSeq(view, countToDelete) { t =>
        !live || t.status != TaskState.RUNNING.toString()
      } {code}
 

As for the version, I have only used the Spark version 2.4.4. But I don't think 
that upgrading to a more recent version will fix the memory leak since the code 
causing the leak is still there.

 

Regards,

> Memory leak in Spark UI
> ---
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: Amine Bagdouri
>Priority: Major
>
> We have a distributed Spark application running on Azure HDInsight using 
> Spark version 2.4.4.
> After a few days of active processing on our application, we 

[jira] [Commented] (SPARK-43523) Memory leak in Spark UI

2023-05-19 Thread Amine Bagdouri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724424#comment-17724424
 ] 

Amine Bagdouri commented on SPARK-43523:


Hi [~srowen], thanks for your reply.

 

However, I still think that it is a memory leak, since many jobs, stages and 
tasks accumulate in memory and are never released.

For example, all jobs that have a "RUNNING" state and that missed the 
"onJobEnd" event (due to event drop from queue) will remain in the heap forever.

The same goes for tasks and stages that missed the "onTaskEnd" and the 
"onStageCompleted" events.

 

As for the limits on the number of retained jobs, tasks and stages, they only 
apply to units that are considered finished (as shown below). Thus, these 
limits do not prevent the memory leak. That is why I suggest adding other 
limits that apply even to units that are still running or pending.
 * [AppStatusListener.scala cleanupJobs (line 
1263)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#L1263-L1265]
 :

{code:java}
    val toDelete = KVUtils.viewToSeq(view, countToDelete.toInt) { j =>
      j.info.status != JobExecutionStatus.RUNNING && j.info.status != 
JobExecutionStatus.UNKNOWN
    }{code}
 * [AppStatusListener.scala cleanupStagesInKVStore (line 
1309)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1314C1-L1316C1]
 :

{code:java}
    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
      s.info.status != v1.StageStatus.ACTIVE && s.info.status != 
v1.StageStatus.PENDING
    } {code}
 * [AppStatusListener.scala cleanupTasks (line 
1314)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1377C1-L1380C1]
 :

{code:java}
      // Try to delete finished tasks only.
      val toDelete = KVUtils.viewToSeq(view, countToDelete) { t =>
        !live || t.status != TaskState.RUNNING.toString()
      } {code}
 

As for the version, I have only used the Spark version 2.4.4. But I don't think 
that upgrading to a more recent version will fix the memory leak since the code 
causing the leak is still there.

 

Regards,

> Memory leak in Spark UI
> ---
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: Amine Bagdouri
>Priority: Major
>
> We have a distributed Spark application running on Azure HDInsight using 
> Spark version 2.4.4.
> After a few days of active processing on our application, we have noticed 
> that the GC CPU time ratio of the driver is close to 100%. We suspected a 
> memory leak. Thus, we have produced a heap dump and analyzed it using Eclipse 
> Memory Analyzer.
> Here is some interesting data from the driver's heap dump (heap size is 8 GB):
>  * The estimated retained heap size of String objects (~5M instances) is 3.3 
> GB. It seems that most of these instances correspond to spark events.
>  * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB.
>  * The number of LiveJob objects with status "RUNNING" is 18K, knowing that 
> there shouldn't be more than 16 live running jobs since we use a fixed size 
> thread pool of 16 threads to run spark queries.
>  * The number of LiveTask objects is 485K.
> The AsyncEventQueue instance associated with the AppStatusListener has a 
> value of 854 for dropped events count and a value of 10001 for total events 
> count, knowing that the dropped events counter is reset every minute and that 
> the queue's default capacity is 10000.
> We think that there is a memory leak in Spark UI. Here is our analysis of the 
> root cause of this leak:
>  * AppStatusListener is notified of Spark events using a bounded queue in 
> AsyncEventQueue.
>  * AppStatusListener updates its state (kvstore, liveTasks, liveStages, 
> liveJobs, ...) based on the received events. For example, onTaskStart adds a 
> task to liveTasks map and onTaskEnd removes the task from liveTasks map.
>  * When the rate of events is very high, the bounded queue in AsyncEventQueue 
> is full, some events are dropped and don't make it to AppStatusListener.
>  * Dropped events that signal the end of a processing unit prevent the state 
> of AppStatusListener from being cleaned. For example, a dropped onTaskEnd 
> event, will prevent the task from being removed from liveTasks map, and the 
> task will remain in the heap until the driver's JVM is stopped.
> We were able to confirm our analysis by reducing the capacity of the 
> AsyncEventQueue (spark.scheduler.listenerbus.eventqueue.capacity=10). After 
> having launched many spark queries using this config, we observed that the 
> number of 

[jira] [Assigned] (SPARK-43539) Assign a name to the error class _LEGACY_ERROR_TEMP_0003

2023-05-19 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-43539:


Assignee: BingKun Pan

> Assign a name to the error class _LEGACY_ERROR_TEMP_0003
> 
>
> Key: SPARK-43539
> URL: https://issues.apache.org/jira/browse/SPARK-43539
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43539) Assign a name to the error class _LEGACY_ERROR_TEMP_0003

2023-05-19 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43539.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41200
[https://github.com/apache/spark/pull/41200]

> Assign a name to the error class _LEGACY_ERROR_TEMP_0003
> 
>
> Key: SPARK-43539
> URL: https://issues.apache.org/jira/browse/SPARK-43539
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`

2023-05-19 Thread Jim Huang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724387#comment-17724387
 ] 

Jim Huang commented on SPARK-40500:
---

Just to provide better context on the situation between pandas 2.x and the 
Spark Python APIs.
  
Starting with the pandas 2.0.0 release, there is [a long list of removals of 
prior (pandas) deprecations/changes 
published|https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#removal-of-prior-version-deprecations-changes].
  
This current SPARK-40500, [pandas 
DataFrame.iteritems()|https://github.com/pandas-dev/pandas/issues/45321], 
addresses only one of many items in that list.

Is there a more holistic or comprehensive effort, already done or in progress, 
that identifies all the incompatibilities between pandas 2.x and the Spark 
Python pandas APIs? If the answer is "yes", how can I find this information? Is 
it in the form of a "tag:value" that I can search for, or is there an umbrella 
JIRA that is tracking this? I am simply guessing that the backporting effort 
might benefit from first getting synced / caught up with the current pandas 2.x 
API.

Thank you.

> Use `pd.items` instead of `pd.iteritems`
> 
>
> Key: SPARK-40500
> URL: https://issues.apache.org/jira/browse/SPARK-40500
> Project: Spark
>  Issue Type: Improvement
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43534) Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided

2023-05-19 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724373#comment-17724373
 ] 

Ignite TC Bot commented on SPARK-43534:
---

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/41195

> Add log4j-1.2-api and log4j-slf4j2-impl to classpath if active hadoop-provided
> --
>
> Key: SPARK-43534
> URL: https://issues.apache.org/jira/browse/SPARK-43534
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: hadoop log jars.png, log4j-1.2-api-2.20.0.jar, 
> log4j-slf4j2-impl-2.20.0.jar
>
>
> Build Spark:
> {code:sh}
> ./dev/make-distribution.sh --name default --tgz -Phive -Phive-thriftserver 
> -Pyarn -Phadoop-provided
> tar -zxf spark-3.5.0-SNAPSHOT-bin-default.tgz {code}
> Remove the following jars from spark-3.5.0-SNAPSHOT-bin-default:
> {noformat}
> jars/log4j-1.2-api-2.20.0.jar
> jars/log4j-slf4j2-impl-2.20.0.jar
> {noformat}
> Add a new log4j2.properties to spark-3.5.0-SNAPSHOT-bin-default/conf:
> {code:none}
> rootLogger.level = info
> rootLogger.appenderRef.file.ref = File
> rootLogger.appenderRef.stderr.ref = console
> appender.console.type = Console
> appender.console.name = console
> appender.console.target = SYSTEM_ERR
> appender.console.layout.type = PatternLayout
> appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L 
> : %m%n
> appender.file.type = RollingFile
> appender.file.name = File
> appender.file.fileName = /tmp/spark/logs/spark.log
> appender.file.filePattern = /tmp/spark/logs/spark.%d{MMdd-HH}.log
> appender.file.append = true
> appender.file.layout.type = PatternLayout
> appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{2}:%L : 
> %m%n
> appender.file.policies.type = Policies
> appender.file.policies.time.type = TimeBasedTriggeringPolicy
> appender.file.policies.time.interval = 1
> appender.file.policies.time.modulate = true
> appender.file.policies.size.type = SizeBasedTriggeringPolicy
> appender.file.policies.size.size = 256M
> appender.file.strategy.type = DefaultRolloverStrategy
> appender.file.strategy.max = 100
> {code}
> Start Spark thriftserver:
> {code:java}
> sbin/start-thriftserver.sh
> {code}
> Check the log:
> {code:sh}
> cat /tmp/spark/logs/spark.log
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43584) Update sbt-assembly, sbt-revolver, sbt-mima-plugin plugins

2023-05-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43584:
--
Summary: Update sbt-assembly, sbt-revolver, sbt-mima-plugin plugins  (was: 
Update some sbt plugins)

> Update sbt-assembly, sbt-revolver, sbt-mima-plugin plugins
> --
>
> Key: SPARK-43584
> URL: https://issues.apache.org/jira/browse/SPARK-43584
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>
> Update some sbt plugins to the newest versions, including:
>  * sbt-assembly from 2.1.0 to 2.1.1
>  * sbt-mima-plugin from 1.1.0 to 1.1.2
>  * sbt-revolver from 0.9.1 to 0.10.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43584) Update some sbt plugins

2023-05-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43584.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41226
[https://github.com/apache/spark/pull/41226]

> Update some sbt plugins
> ---
>
> Key: SPARK-43584
> URL: https://issues.apache.org/jira/browse/SPARK-43584
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>
> Update some sbt plugins to the newest versions, including:
>  * sbt-assembly from 2.1.0 to 2.1.1
>  * sbt-mima-plugin from 1.1.0 to 1.1.2
>  * sbt-revolver from 0.9.1 to 0.10.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43584) Update some sbt plugins

2023-05-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43584:
-

Assignee: BingKun Pan

> Update some sbt plugins
> ---
>
> Key: SPARK-43584
> URL: https://issues.apache.org/jira/browse/SPARK-43584
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>
> Update some sbt plugins to the newest versions, including:
>  * sbt-assembly from 2.1.0 to 2.1.1
>  * sbt-mima-plugin from 1.1.0 to 1.1.2
>  * sbt-revolver from 0.9.1 to 0.10.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43594) Add LocalDateTime to anyToMicros

2023-05-19 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-43594:


 Summary: Add LocalDateTime to anyToMicros
 Key: SPARK-43594
 URL: https://issues.apache.org/jira/browse/SPARK-43594
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Fokko Driesprong






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43589) Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString`

2023-05-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43589.
---
Fix Version/s: 3.3.3
   3.4.1
 Assignee: Dongjoon Hyun
   Resolution: Fixed

> Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString`
> ---
>
> Key: SPARK-43589
> URL: https://issues.apache.org/jira/browse/SPARK-43589
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.3.3, 3.4.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42760) The partition of result data frame of join is always 1

2023-05-19 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-42760.
--
Resolution: Not A Problem

> The partition of result data frame of join is always 1
> --
>
> Key: SPARK-42760
> URL: https://issues.apache.org/jira/browse/SPARK-42760
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.3.2
> Environment: standard spark 3.0.3/3.3.2, using in jupyter notebook, 
> local mode
>Reporter: binyang
>Priority: Major
>
> I am using pyspark. The partition of result data frame of join is always 1.
> Here is my code from 
> https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join
>  
> print(spark.version)
> def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4):
>     spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
>     spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
>     df1 = spark.range(1, 1000).repartition(data_partitions)
>     df2 = spark.range(1, 2000).repartition(data_partitions)
>     df3 = spark.range(1, 3000).repartition(data_partitions)
>     print("Data partitions is: {}. Shuffle partitions is 
> {}".format(data_partitions, shuffle_partitions))
>     print("Data partitions before join: 
> {}".format(df1.rdd.getNumPartitions()))
>     df = (df1.join(df2, df1.id == df2.id)
>           .join(df3, df1.id == df3.id))
>     print("Data partitions after join : {}".format(df.rdd.getNumPartitions()))
> example_shuffle_partitions()
>  
> In Spark 3.0.3, it prints out:
> 3.0.3
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 4
> However, it prints out the following in the latest 3.3.2
> 3.3.2
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 1



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43188) Cannot write to Azure Datalake Gen2 (abfs/abfss) after Spark 3.1.2

2023-05-19 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724306#comment-17724306
 ] 

Sean R. Owen commented on SPARK-43188:
--

Looks like a local disk problem of some kind, not really a Spark issue
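If the root cause is the ABFS output stream not finding a writable local buffer 
directory (the stack trace fails in DataBlocks$DiskBlockFactory.createTmpFileForWrite), 
one thing worth checking is the hadoop-azure buffer directory. A hedged sketch, 
assuming the fs.azure.buffer.dir option applies to your Hadoop version and that 
/mnt/tmp/abfs is an example of a writable local path with free space:

{code:java}
import org.apache.spark.sql.SparkSession

// Sketch only: point the ABFS disk block buffer at a known-writable local directory.
// fs.azure.buffer.dir is a hadoop-azure option; verify it against your Hadoop version.
val spark = SparkSession.builder()
  .appName("abfs-buffer-dir-check")
  .config("spark.hadoop.fs.azure.buffer.dir", "/mnt/tmp/abfs")
  .getOrCreate()
{code}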

> Cannot write to Azure Datalake Gen2 (abfs/abfss) after Spark 3.1.2
> --
>
> Key: SPARK-43188
> URL: https://issues.apache.org/jira/browse/SPARK-43188
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Nicolas PHUNG
>Priority: Major
>
> Hello,
> I have an issue with Spark 3.3.2 & Spark 3.4.0 to write into Azure Data Lake 
> Storage Gen2 (abfs/abfss scheme). I've got the following errors:
> {code:java}
> warn 13:12:47.554: StdErr from Kernel Process 23/04/19 13:12:47 ERROR 
> FileFormatWriter: Aborting job 
> 6a75949c-1473-4445-b8ab-d125be3f0f21.org.apache.spark.SparkException: Job 
> aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent 
> failure: Lost task 1.0 in stage 0.0 (TID 1) (myhost executor driver): 
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any 
> valid local directory for datablock-0001-    at 
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:462)
>     at 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:165)
>     at 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>     at 
> org.apache.hadoop.fs.store.DataBlocks$DiskBlockFactory.createTmpFileForWrite(DataBlocks.java:980)
>     at 
> org.apache.hadoop.fs.store.DataBlocks$DiskBlockFactory.create(DataBlocks.java:960)
>     at 
> org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.createBlockIfNeeded(AbfsOutputStream.java:262)
>     at 
> org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.(AbfsOutputStream.java:173)
>     at 
> org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.createFile(AzureBlobFileSystemStore.java:580)
>     at 
> org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.create(AzureBlobFileSystem.java:301)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1195)    at 
> org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1175)    at 
> org.apache.parquet.hadoop.util.HadoopOutputFile.create(HadoopOutputFile.java:74)
>     at 
> org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:347)
>     at 
> org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:314)
>     at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:480)
>     at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:420)
>     at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:409)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:36)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$$anon$1.newInstance(ParquetUtils.scala:490)
>     at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)
>     at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:146)
>     at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:389)
>     at 
> org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)    
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)    
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)    at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:328)    at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)    at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)    
> at org.apache.spark.scheduler.Task.run(Task.scala:139)    at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)    
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)    
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Driver stacktrace:    at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785)
>     at 
> 

[jira] [Commented] (SPARK-43322) Spark SQL docs for explode_outer and posexplode_outer omit behavior for null/empty

2023-05-19 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724304#comment-17724304
 ] 

Sean R. Owen commented on SPARK-43322:
--

Can you add examples?
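For reference, the behavior being asked about can be shown with a few 
one-liners (a sketch for spark-shell, where `spark` is the active session; the 
expected results are noted in the comments):

{code:java}
spark.sql("SELECT explode(CAST(array() AS ARRAY<INT>))").show()        // no rows
spark.sql("SELECT explode_outer(CAST(array() AS ARRAY<INT>))").show()  // one row containing NULL
spark.sql("SELECT explode_outer(CAST(NULL AS ARRAY<INT>))").show()     // one row containing NULL
spark.sql("SELECT posexplode_outer(CAST(NULL AS ARRAY<INT>))").show()  // one row: pos and col both NULL
{code}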

> Spark SQL docs for explode_outer and posexplode_outer omit behavior for 
> null/empty
> --
>
> Key: SPARK-43322
> URL: https://issues.apache.org/jira/browse/SPARK-43322
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Robert Juchnicki
>Priority: Minor
>
> The Spark SQL documentation for 
> [explode_outer|https://spark.apache.org/docs/latest/api/sql/index.html#explode_outer]
>  and 
> [posexplode_outer|https://spark.apache.org/docs/latest/api/sql/index.html#posexplode_outer]
>  omits mentioning that null or empty arrays produce nulls. The descriptions 
> do not appear to be written down in a doc file and are likely pulled from the 
> `ExpressionDescription` tags for the `Explode` and `PosExplode` generators 
> when the `GeneratorOuter` wrapper is used.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43297) Make improvement to LocalKMeans

2023-05-19 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724305#comment-17724305
 ] 

Sean R. Owen commented on SPARK-43297:
--

Can you make a pull request to show what you mean?
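To make the idea concrete in the meantime, here is a small, self-contained 
sketch of what "using parallel collections" could look like for the expensive 
distance-to-nearest-center pass. It is illustrative only, not the actual 
LocalKMeans code; the point and center representations are simplified 
placeholders:

{code:java}
// On Scala 2.13, parallel collections additionally need:
// import scala.collection.parallel.CollectionConverters._
object ParallelSeedingSketch {
  type Point = Array[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => val d = x - y; d * d }.sum

  // Assumes `centers` is non-empty; the per-point work runs in parallel via .par.
  def costToNearestCenter(points: Array[Point], centers: Seq[Point]): Array[Double] =
    points.par
      .map(p => centers.map(c => squaredDistance(p, c)).min)
      .toArray
}
{code}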

> Make improvement to LocalKMeans
> ---
>
> Key: SPARK-43297
> URL: https://issues.apache.org/jira/browse/SPARK-43297
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 3.3.0
>Reporter: wenweijian
>Priority: Minor
>
> There are two initialization modes in KMeans: random mode and parallel mode.
> The parallel mode uses kMeansPlusPlus to generate the center points, but 
> kMeansPlusPlus is a local method that runs in the driver.
> If the number of points is huge, kMeansPlusPlus will run for a long time, 
> because it is a single-threaded method running in the driver.
> We can make this method run in parallel to make it faster, for example by 
> using Scala parallel collections. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43369) Address comments about /etc/pam.d/su

2023-05-19 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-43369:
-
Priority: Minor  (was: Major)

> Address comments about /etc/pam.d/su
> 
>
> Key: SPARK-43369
> URL: https://issues.apache.org/jira/browse/SPARK-43369
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Docker
>Affects Versions: 3.5.0
>Reporter: Yikun Jiang
>Priority: Minor
>
> echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su &&
> I am unsure what this is for?  As far as I can tell, this means that only 
> members of the administrative group wheel (or 0 if there is no wheel) can 
> switch to another user using the su command. That might make sense on a 
> regular multi-user system, but I am unsure why it would matter for a 
> container.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43368) Address DOI comments about /etc/passwd

2023-05-19 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-43368:
-
Priority: Minor  (was: Major)

> Address DOI comments about /etc/passwd
> --
>
> Key: SPARK-43368
> URL: https://issues.apache.org/jira/browse/SPARK-43368
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Docker
>Affects Versions: 3.5.0
>Reporter: Yikun Jiang
>Priority: Minor
>
> chgrp root /etc/passwd && chmod ug+rw /etc/passwd
> Wider permissions on /etc/passwd is concerning. What use case is broken if 
> the running user id doesn't exist?
> echo ... >> /etc/passwd
> Having the entrypoint itself modify /etc/passwd is fragile. Are there 
> features that are broken if the user doesn't exist in /etc/passwd (like 
> PostgreSQL's initdb that refuses to run)? Minimally, this should probably use 
> useradd and usermod rather than hand editing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43366) Spark Driver Bind Address is off-by-one

2023-05-19 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724303#comment-17724303
 ] 

Sean R. Owen commented on SPARK-43366:
--

Was the original port in use? it'll try the next one then

> Spark Driver Bind Address is off-by-one
> ---
>
> Key: SPARK-43366
> URL: https://issues.apache.org/jira/browse/SPARK-43366
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 3.3.3
>Reporter: Derek Brown
>Priority: Major
>
> I have the following environment variable set in my driver pod configuration:
> {code:java}
> SPARK_DRIVER_BIND_ADDRESS=10.244.0.53{code}
> However, I see an off-by-one IP address being referred to in the Spark logs:
> {code:java}
> 23/05/04 02:37:03 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered 
> executor NettyRpcEndpointRef(spark-client://Executor) (10.244.0.54:53140) 
> with ID 1,  ResourceProfileId 0
> 23/05/04 02:37:03 INFO BlockManagerMasterEndpoint: Registering block manager 
> 10.244.0.54:32805 with 413.9 MiB RAM, BlockManagerId(1, 10.244.0.54, 32805, 
> None){code}
> I am not sure why this might be the case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43382) Read and write csv and json files. Archive files such as zip or gz are supported

2023-05-19 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-43382:
-
Issue Type: Improvement  (was: Bug)
  Priority: Minor  (was: Major)

> Read and write csv and json files. Archive files such as zip or gz are 
> supported
> 
>
> Key: SPARK-43382
> URL: https://issues.apache.org/jira/browse/SPARK-43382
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: melin
>Priority: Minor
>
> snowflake data import and export, support fixed files. For example:
>  
> {code:java}
> COPY INTO @mystage/data.csv.gz 
>  
> COPY INTO mytable 
> FROM @my_ext_stage/tutorials/dataloading/sales.json.gz; 
> FILE_FORMAT = (TYPE = 'JSON') 
> MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE'; 
>  
> {code}
> Can spark directly read archive files?
> {code:java}
> spark.read.csv("/tutorials/dataloading/sales.json.gz")
> {code}
> @[~kaifeiYi] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43388) Latest docker Spark image has critical CVE

2023-05-19 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724301#comment-17724301
 ] 

Sean R. Owen commented on SPARK-43388:
--

Generally speaking - please also make an argument that these affect Spark when 
reporting. (But this one is already updated, yes)

> Latest docker Spark image has critical CVE
> --
>
> Key: SPARK-43388
> URL: https://issues.apache.org/jira/browse/SPARK-43388
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Docker
>Affects Versions: 3.4.0
>Reporter: mahiki jones
>Priority: Major
> Attachments: spark-docker.CVE-everywhere.png
>
>
> I pulled the latest Spark 3.4.0 image from Docker Hub on 2023-04-28 and found, 
> after scanning it with Docker Desktop, several critical CVEs (see screenshot).
> It seems like some changes to github actions are needed to rebuild with 
> updated dependencies on a regular cadence.
>  
> Notes:
> Original project issue: https://issues.apache.org/jira/browse/SPARK-40513
> [https://hub.docker.com/r/apache/spark/tags]
> https://github.com/apache/spark-docker/actions
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34124) Upgrade jackson version to fix CVE-2020-36179 in Spark 2.4

2023-05-19 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34124.
--
Resolution: Won't Fix

> Upgrade jackson version to fix CVE-2020-36179 in Spark 2.4
> --
>
> Key: SPARK-34124
> URL: https://issues.apache.org/jira/browse/SPARK-34124
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.7
>Reporter: Yang Jie
>Priority: Minor
>
>  
> {code:java}
> FasterXML jackson-databind 2.x before 2.9.10.8 mishandles the interaction 
> between serialization gadgets and typing, related to 
> oadd.org.apache.commons.dbcp.cpdsadapter.DriverAdapterCPDS.{code}
>  
> [CVE-2020-36179|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-36179]
> Spark 2.4.7 still using Jackson 2.6.7



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43389) spark.read.csv throws NullPointerException when lineSep is set to None

2023-05-19 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724299#comment-17724299
 ] 

Sean R. Owen commented on SPARK-43389:
--

Looks like we should handle the case where it's explicitly set to null in 
CSVOptions:256. Just use a match statement to handle null and None the same way 
- return None. Are you up for making a PR?
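Something along these lines (a hand-written sketch against the snippet in the 
stack trace, not a tested patch) should treat an absent key and an explicit 
null value the same way:

{code:java}
// Sketch of the suggested guard in CSVOptions.lineSeparator; the real code also
// validates the separator's length and charset.
val lineSeparator: Option[String] = parameters.get("lineSep") match {
  case None | Some(null) => None   // option not set, or explicitly set to null/None from Python
  case Some(sep) =>
    require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
    Some(sep)
}
{code}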

> spark.read.csv throws NullPointerException when lineSep is set to None
> --
>
> Key: SPARK-43389
> URL: https://issues.apache.org/jira/browse/SPARK-43389
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.1
>Reporter: Zach Liu
>Priority: Trivial
>
> lineSep was defined as Optional[str], yet I'm unable to explicitly set it to 
> None:
> reader = spark.read.format("csv")
> read_options={'inferSchema': False, 'header': True, 'mode': 'DROPMALFORMED', 
> 'sep': '\t', 'escape': '\\', 'multiLine': False, 'lineSep': None}
> for option, option_value in read_options.items():
> reader = reader.option(option, option_value)
> df = reader.load("s3://")
> raises exception:
> py4j.protocol.Py4JJavaError: An error occurred while calling o126.load.
> : java.lang.NullPointerException
>   at 
> scala.collection.immutable.StringOps$.length$extension(StringOps.scala:51)
>   at scala.collection.immutable.StringOps.length(StringOps.scala:51)
>   at 
> scala.collection.IndexedSeqOptimized.isEmpty(IndexedSeqOptimized.scala:30)
>   at 
> scala.collection.IndexedSeqOptimized.isEmpty$(IndexedSeqOptimized.scala:30)
>   at scala.collection.immutable.StringOps.isEmpty(StringOps.scala:33)
>   at scala.collection.TraversableOnce.nonEmpty(TraversableOnce.scala:143)
>   at scala.collection.TraversableOnce.nonEmpty$(TraversableOnce.scala:143)
>   at scala.collection.immutable.StringOps.nonEmpty(StringOps.scala:33)
>   at 
> org.apache.spark.sql.catalyst.csv.CSVOptions.$anonfun$lineSeparator$1(CSVOptions.scala:216)
>   at scala.Option.map(Option.scala:230)
>   at 
> org.apache.spark.sql.catalyst.csv.CSVOptions.(CSVOptions.scala:215)
>   at 
> org.apache.spark.sql.catalyst.csv.CSVOptions.(CSVOptions.scala:47)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:60)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$11(DataSource.scala:210)
>   at scala.Option.orElse(Option.scala:447)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:207)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:411)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
>   at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:282)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at 
> py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
>   at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
>   at java.lang.Thread.run(Thread.java:750)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43576) Remove unused declarations from Core module

2023-05-19 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-43576:

Description: There are some unused declarations in the `core` module, and 
we need to clean it to make code clean.  (was: There are some unused 
declarations in the `core` module, and we need to clean it and keep the code 
clean.)

> Remove unused declarations from Core module
> ---
>
> Key: SPARK-43576
> URL: https://issues.apache.org/jira/browse/SPARK-43576
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>
> There are some unused declarations in the `core` module, and we need to clean 
> it to make code clean.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43576) Remove unused declarations from Core module

2023-05-19 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-43576:

Description: There are some unused declarations in the `core` module, and 
we need to clean it and keep the code clean.  (was: There are some xxx code in 
the xxx module, and we need to clean it and keep the code clean)

> Remove unused declarations from Core module
> ---
>
> Key: SPARK-43576
> URL: https://issues.apache.org/jira/browse/SPARK-43576
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>
> There are some unused declarations in the `core` module, and we need to clean 
> it and keep the code clean.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43576) Remove unused declarations from Core module

2023-05-19 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-43576:

Description: There are some xxx code in the xxx module, and we need to 
clean it and keep the code clean

> Remove unused declarations from Core module
> ---
>
> Key: SPARK-43576
> URL: https://issues.apache.org/jira/browse/SPARK-43576
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>
> There are some xxx code in the xxx module, and we need to clean it and keep 
> the code clean



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43576) Remove unused declarations from Core module

2023-05-19 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724297#comment-17724297
 ] 

BingKun Pan commented on SPARK-43576:
-

Related pr: https://github.com/apache/spark/pull/41218

> Remove unused declarations from Core module
> ---
>
> Key: SPARK-43576
> URL: https://issues.apache.org/jira/browse/SPARK-43576
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43584) Update some sbt plugins

2023-05-19 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724296#comment-17724296
 ] 

BingKun Pan commented on SPARK-43584:
-

Let me update more detailed information later.

Related pr: https://github.com/apache/spark/pull/41226

> Update some sbt plugins
> ---
>
> Key: SPARK-43584
> URL: https://issues.apache.org/jira/browse/SPARK-43584
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>
> Update some sbt plugins to the newest versions, including:
>  * sbt-assembly from 2.1.0 to 2.1.1
>  * sbt-mima-plugin from 1.1.0 to 1.1.2
>  * sbt-revolver from 0.9.1 to 0.10.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43584) Update some sbt plugins

2023-05-19 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-43584:

Description: 
Update some sbt plugins to the newest versions, including:
 * sbt-assembly from 2.1.0 to 2.1.1
 * sbt-mima-plugin from 1.1.0 to 1.1.2
 * sbt-revolver from 0.9.1 to 0.10.0

> Update some sbt plugins
> ---
>
> Key: SPARK-43584
> URL: https://issues.apache.org/jira/browse/SPARK-43584
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>
> Update some sbt plugins to the newest versions, including:
>  * sbt-assembly from 2.1.0 to 2.1.1
>  * sbt-mima-plugin from 1.1.0 to 1.1.2
>  * sbt-revolver from 0.9.1 to 0.10.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43593) Support the minimum number of range shuffle partitions

2023-05-19 Thread Wan Kun (Jira)
Wan Kun created SPARK-43593:
---

 Summary: Support the minimum number of range shuffle partitions
 Key: SPARK-43593
 URL: https://issues.apache.org/jira/browse/SPARK-43593
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.1
Reporter: Wan Kun


If there are few distinct values in the range partitioning keys, the 
RangePartitioner will produce very few partitions, and those partitions can be 
very large. We can append a random expression to the ordering to increase the 
number of partitions.
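A rough sketch of the idea at the DataFrame level (illustrative only; df, key 
and the constants are placeholders, and the actual change would live inside the 
planner rather than in user code):

{code:java}
import org.apache.spark.sql.functions.{col, floor, rand}

// Salt a low-cardinality range key so RangePartitioner can still produce enough
// partitions; the salt only acts as a tie-breaker within each key.
val salted = df.withColumn("salt", floor(rand() * 16))
val ranged = salted.repartitionByRange(200, col("key"), col("salt"))
{code}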



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43592) NoSuchMethodError in Spark 3.4 with JDK8u362 & JDK8u372

2023-05-19 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724291#comment-17724291
 ] 

Yuming Wang commented on SPARK-43592:
-

Maybe same issue: SPARK-32475?
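If it is indeed the same problem as SPARK-32475 (class files built on JDK 9+ 
without --release 8 link against the covariant java.nio overrides that a JDK 8 
runtime does not have), the usual source-level workaround is to pin the Buffer 
supertype. A sketch of that pattern, not a quote of Spark's ParserUtils code:

{code:java}
import java.nio.{Buffer, CharBuffer}

val buf = CharBuffer.allocate(16)
// Compiled on JDK 9+ without --release 8, a plain buf.position(0) links against
// CharBuffer.position(I)Ljava/nio/CharBuffer;, which does not exist on JDK 8.
(buf: Buffer).position(0)   // upcasting forces the JDK 8-compatible Buffer descriptor
(buf: Buffer).flip()
{code}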

> NoSuchMethodError in Spark 3.4 with JDK8u362 & JDK8u372
> ---
>
> Key: SPARK-43592
> URL: https://issues.apache.org/jira/browse/SPARK-43592
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
> Environment: JDK: JDK8u362, JDK8u372
> Kubernetes
> Spark 3.4
>Reporter: Shivam Kasat
>Priority: Critical
>  Labels: JDK1.8, java, jdk11
>
> My project was on Spark 3.3 with JDK8u362 and I tried updating it to Spark 
> 3.4. The official documentation of Spark 3.4 says it works with JDK8u362 and 
> above, but when I upgraded the Docker base image of Spark to JDK8u362 or 
> JDK8u372 it fails at runtime with the error below: for JDK8u362 it throws the 
> error for the java.nio.CharBuffer.position method, and for JDK8u372 it throws 
> the error for the java.nio.ByteBuffer.flip method. When I run with a JDK11 
> image in the Spark Dockerfile it works fine. Am I missing anything, and how 
> can I fix this issue, as I want to run it with JDK8?
> {code:java}
> java.lang.NoSuchMethodError: 
> java.nio.CharBuffer.position(I)Ljava/nio/CharBuffer;
> at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.unescapeSQLString(ParserUtils.scala:220)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.string(ParserUtils.scala:95)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$createString$2(AstBuilder.scala:2632)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.Iterator.foreach(Iterator.scala:943) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.Iterator.foreach$(Iterator.scala:943) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.IterableLike.foreach(IterableLike.scala:74) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.IterableLike.foreach$(IterableLike.scala:73) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.AbstractIterable.foreach(Iterable.scala:56) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.TraversableLike.map(TraversableLike.scala:286) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.TraversableLike.map$(TraversableLike.scala:279) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.AbstractTraversable.map(Traversable.scala:108) 
> ~[scala-library-2.12.17.jar:?]
> at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.createString(AstBuilder.scala:2632)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitStringLiteral$1(AstBuilder.scala:2618)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:160)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitStringLiteral(AstBuilder.scala:2618)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitStringLiteral(AstBuilder.scala:58)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$StringLiteralContext.accept(SqlBaseParser.java:19511)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:73)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParserBaseVisitor.visitConstantDefault(SqlBaseParserBaseVisitor.java:1735)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$ConstantDefaultContext.accept(SqlBaseParser.java:18373)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:73)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParserBaseVisitor.visitValueExpressionDefault(SqlBaseParserBaseVisitor.java:1567)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$ValueExpressionDefaultContext.accept(SqlBaseParser.java:17491)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:63)
>  

[jira] [Updated] (SPARK-43592) NoSuchMethodError in Spark 3.4 with JDK8u362 & JDK8u372

2023-05-19 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-43592:

Target Version/s:   (was: 3.4.0)

> NoSuchMethodError in Spark 3.4 with JDK8u362 & JDK8u372
> ---
>
> Key: SPARK-43592
> URL: https://issues.apache.org/jira/browse/SPARK-43592
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
> Environment: JDK: JDK8u362, JDK8u372
> Kubernetes
> Spark 3.4
>Reporter: Shivam Kasat
>Priority: Critical
>  Labels: JDK1.8, java, jdk11
>
> My project was on Spark 3.3 with JDK8u362 and I tried updating it to Spark 
> 3.4. The official documentation of Spark 3.4 says it works with JDK8u362 and 
> above, but when I upgraded the Docker base image of Spark to JDK8u362 or 
> JDK8u372 it fails at runtime with the error below: for JDK8u362 it throws the 
> error for the java.nio.CharBuffer.position method, and for JDK8u372 it throws 
> the error for the java.nio.ByteBuffer.flip method. When I run with a JDK11 
> image in the Spark Dockerfile it works fine. Am I missing anything, and how 
> can I fix this issue, as I want to run it with JDK8?
> {code:java}
> java.lang.NoSuchMethodError: 
> java.nio.CharBuffer.position(I)Ljava/nio/CharBuffer;
> at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.unescapeSQLString(ParserUtils.scala:220)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.string(ParserUtils.scala:95)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$createString$2(AstBuilder.scala:2632)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.Iterator.foreach(Iterator.scala:943) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.Iterator.foreach$(Iterator.scala:943) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.IterableLike.foreach(IterableLike.scala:74) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.IterableLike.foreach$(IterableLike.scala:73) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.AbstractIterable.foreach(Iterable.scala:56) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.TraversableLike.map(TraversableLike.scala:286) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.TraversableLike.map$(TraversableLike.scala:279) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.AbstractTraversable.map(Traversable.scala:108) 
> ~[scala-library-2.12.17.jar:?]
> at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.createString(AstBuilder.scala:2632)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitStringLiteral$1(AstBuilder.scala:2618)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:160)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitStringLiteral(AstBuilder.scala:2618)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitStringLiteral(AstBuilder.scala:58)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$StringLiteralContext.accept(SqlBaseParser.java:19511)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:73)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParserBaseVisitor.visitConstantDefault(SqlBaseParserBaseVisitor.java:1735)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$ConstantDefaultContext.accept(SqlBaseParser.java:18373)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:73)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParserBaseVisitor.visitValueExpressionDefault(SqlBaseParserBaseVisitor.java:1567)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$ValueExpressionDefaultContext.accept(SqlBaseParser.java:17491)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:63)
>  

[jira] [Commented] (SPARK-43024) Upgrade pandas to 2.0.0

2023-05-19 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724276#comment-17724276
 ] 

GridGain Integration commented on SPARK-43024:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/41211

> Upgrade pandas to 2.0.0
> ---
>
> Key: SPARK-43024
> URL: https://issues.apache.org/jira/browse/SPARK-43024
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Since pandas 2.0.0 was released on Apr 03, 2023,
>  
> we should update our infra and docs to support it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43448) Remove dummy hadoop-openstack

2023-05-19 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-43448.
--
Fix Version/s: 3.5.0
 Assignee: Cheng Pan
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/41133

> Remove dummy hadoop-openstack
> -
>
> Key: SPARK-43448
> URL: https://issues.apache.org/jira/browse/SPARK-43448
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Trivial
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43448) Remove dummy hadoop-openstack

2023-05-19 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-43448:
-
Priority: Trivial  (was: Major)

> Remove dummy hadoop-openstack
> -
>
> Key: SPARK-43448
> URL: https://issues.apache.org/jira/browse/SPARK-43448
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43592) NoSuchMethodError in Spark 3.4 with JDK8u362 & JDK8u372

2023-05-19 Thread Shivam Kasat (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724256#comment-17724256
 ] 

Shivam Kasat commented on SPARK-43592:
--

Posted the same on stackoverflow: 
https://stackoverflow.com/questions/76286857/nosuchmethoderror-in-spark-3-4-with-jdk8u362-jdk8u372

> NoSuchMethodError in Spark 3.4 with JDK8u362 & JDK8u372
> ---
>
> Key: SPARK-43592
> URL: https://issues.apache.org/jira/browse/SPARK-43592
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
> Environment: JDK: JDK8u362, JDK8u372
> Kubernetes
> Spark 3.4
>Reporter: Shivam Kasat
>Priority: Critical
>  Labels: JDK1.8, java, jdk11
>
> My project was on Spark 3.3 with JDK8u362 and I tried updating it to Spark 3.4. 
> The official documentation of Spark 3.4 says it works with JDK8u362 and above, 
> but when I upgraded the Docker base image of Spark to JDK8u362 or JDK8u372, it 
> fails at runtime with the error below. With JDK8u362 it throws the error for the 
> java.nio.CharBuffer.position method, and with JDK8u372 it throws the error for 
> the java.nio.ByteBuffer.flip method. But when I run with a JDK11 image in the 
> Spark Dockerfile, it works fine. Am I missing anything, or how can I fix this 
> issue, as I want to run it with JDK8?
> {code:java}
> java.lang.NoSuchMethodError: 
> java.nio.CharBuffer.position(I)Ljava/nio/CharBuffer;
> at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.unescapeSQLString(ParserUtils.scala:220)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.string(ParserUtils.scala:95)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$createString$2(AstBuilder.scala:2632)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.Iterator.foreach(Iterator.scala:943) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.Iterator.foreach$(Iterator.scala:943) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.IterableLike.foreach(IterableLike.scala:74) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.IterableLike.foreach$(IterableLike.scala:73) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.AbstractIterable.foreach(Iterable.scala:56) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.TraversableLike.map(TraversableLike.scala:286) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.TraversableLike.map$(TraversableLike.scala:279) 
> ~[scala-library-2.12.17.jar:?]
> at scala.collection.AbstractTraversable.map(Traversable.scala:108) 
> ~[scala-library-2.12.17.jar:?]
> at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.createString(AstBuilder.scala:2632)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitStringLiteral$1(AstBuilder.scala:2618)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:160)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitStringLiteral(AstBuilder.scala:2618)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitStringLiteral(AstBuilder.scala:58)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$StringLiteralContext.accept(SqlBaseParser.java:19511)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:73)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParserBaseVisitor.visitConstantDefault(SqlBaseParserBaseVisitor.java:1735)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$ConstantDefaultContext.accept(SqlBaseParser.java:18373)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:73)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParserBaseVisitor.visitValueExpressionDefault(SqlBaseParserBaseVisitor.java:1567)
>  ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
> at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$ValueExpressionDefaultContext.accept(SqlBaseParser.java:17491)
>  

[jira] [Created] (SPARK-43592) NoSuchMethodError in Spark 3.4 with JDK8u362 & JDK8u372

2023-05-19 Thread Shivam Kasat (Jira)
Shivam Kasat created SPARK-43592:


 Summary: NoSuchMethodError in Spark 3.4 with JDK8u362 & JDK8u372
 Key: SPARK-43592
 URL: https://issues.apache.org/jira/browse/SPARK-43592
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.0
 Environment: JDK: JDK8u362, JDK8u372
Kubernetes
Spark 3.4
Reporter: Shivam Kasat


My project was on Spark 3.3 with JDK8u362 and I tried updating it to Spark 3.4. The 
official documentation of Spark 3.4 says it works with JDK8u362 and above, but when 
I upgraded the Docker base image of Spark to JDK8u362 or JDK8u372, it fails at 
runtime with the error below. With JDK8u362 it throws the error for the 
java.nio.CharBuffer.position method, and with JDK8u372 it throws the error for the 
java.nio.ByteBuffer.flip method. But when I run with a JDK11 image in the Spark 
Dockerfile, it works fine. Am I missing anything, or how can I fix this issue, as I 
want to run it with JDK8?
{code:java}
java.lang.NoSuchMethodError: java.nio.CharBuffer.position(I)Ljava/nio/CharBuffer;
at 
org.apache.spark.sql.catalyst.parser.ParserUtils$.unescapeSQLString(ParserUtils.scala:220)
 ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
at 
org.apache.spark.sql.catalyst.parser.ParserUtils$.string(ParserUtils.scala:95) 
~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
at 
org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$createString$2(AstBuilder.scala:2632)
 ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) 
~[scala-library-2.12.17.jar:?]
at scala.collection.Iterator.foreach(Iterator.scala:943) 
~[scala-library-2.12.17.jar:?]
at scala.collection.Iterator.foreach$(Iterator.scala:943) 
~[scala-library-2.12.17.jar:?]
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) 
~[scala-library-2.12.17.jar:?]
at scala.collection.IterableLike.foreach(IterableLike.scala:74) 
~[scala-library-2.12.17.jar:?]
at scala.collection.IterableLike.foreach$(IterableLike.scala:73) 
~[scala-library-2.12.17.jar:?]
at scala.collection.AbstractIterable.foreach(Iterable.scala:56) 
~[scala-library-2.12.17.jar:?]
at scala.collection.TraversableLike.map(TraversableLike.scala:286) 
~[scala-library-2.12.17.jar:?]
at scala.collection.TraversableLike.map$(TraversableLike.scala:279) 
~[scala-library-2.12.17.jar:?]
at scala.collection.AbstractTraversable.map(Traversable.scala:108) 
~[scala-library-2.12.17.jar:?]
at 
org.apache.spark.sql.catalyst.parser.AstBuilder.createString(AstBuilder.scala:2632)
 ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
at 
org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitStringLiteral$1(AstBuilder.scala:2618)
 ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
at 
org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:160)
 ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
at 
org.apache.spark.sql.catalyst.parser.AstBuilder.visitStringLiteral(AstBuilder.scala:2618)
 ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
at 
org.apache.spark.sql.catalyst.parser.AstBuilder.visitStringLiteral(AstBuilder.scala:58)
 ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
at 
org.apache.spark.sql.catalyst.parser.SqlBaseParser$StringLiteralContext.accept(SqlBaseParser.java:19511)
 ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
at 
org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:73)
 ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
at 
org.apache.spark.sql.catalyst.parser.SqlBaseParserBaseVisitor.visitConstantDefault(SqlBaseParserBaseVisitor.java:1735)
 ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
at 
org.apache.spark.sql.catalyst.parser.SqlBaseParser$ConstantDefaultContext.accept(SqlBaseParser.java:18373)
 ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
at 
org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:73)
 ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
at 
org.apache.spark.sql.catalyst.parser.SqlBaseParserBaseVisitor.visitValueExpressionDefault(SqlBaseParserBaseVisitor.java:1567)
 ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
at 
org.apache.spark.sql.catalyst.parser.SqlBaseParser$ValueExpressionDefaultContext.accept(SqlBaseParser.java:17491)
 ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
at 
org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:63) 
~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
at 
org.apache.spark.sql.catalyst.parser.AstBuilder.expression(AstBuilder.scala:1630)
 ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
at 
org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$withPredicate$1(AstBuilder.scala:1870)
 ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
at 
org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:160)
 ~[spark-catalyst_2.12-3.4.0.jar:3.4.0]
at 
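
As background on this class of failure (a general note, not a diagnosis of this 
ticket): java.nio.ByteBuffer.flip() and java.nio.CharBuffer.position(int) gained 
covariant return types in Java 9, so bytecode compiled against JDK 9+ without 
--release 8 references descriptors such as CharBuffer.position(I)Ljava/nio/CharBuffer; 
that do not exist on a JDK 8 runtime, which then fails with exactly this kind of 
NoSuchMethodError. A minimal Scala sketch of the usual source-level workaround, 
pinning calls to the java.nio.Buffer base type (the object and method names are 
illustrative, not Spark's actual fix):
{code:scala}
import java.nio.{Buffer, ByteBuffer, CharBuffer}

object BufferCompat {
  // Referencing the methods through the Buffer base type makes the compiler
  // emit the JDK 8-compatible descriptors (returning java.nio.Buffer), so the
  // same bytecode links on both JDK 8 and JDK 9+ runtimes.
  def flipCompat(buf: ByteBuffer): ByteBuffer = {
    (buf: Buffer).flip()
    buf
  }

  def positionCompat(buf: CharBuffer, pos: Int): CharBuffer = {
    (buf: Buffer).position(pos)
    buf
  }
}
{code}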

[jira] [Commented] (SPARK-43523) Memory leak in Spark UI

2023-05-19 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724254#comment-17724254
 ] 

Sean R. Owen commented on SPARK-43523:
--

I don't think that's a memory leak; you're just observing that it saves a lot 
of state when you ask it to. You can also reduce things like the number of 
tasks/jobs that it retains info about. You're also on a long-since EOL version 
of Spark which would not be supported.

> Memory leak in Spark UI
> ---
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: Amine Bagdouri
>Priority: Major
>
> We have a distributed Spark application running on Azure HDInsight using 
> Spark version 2.4.4.
> After a few days of active processing on our application, we have noticed 
> that the GC CPU time ratio of the driver is close to 100%. We suspected a 
> memory leak. Thus, we have produced a heap dump and analyzed it using Eclipse 
> Memory Analyzer.
> Here is some interesting data from the driver's heap dump (heap size is 8 GB):
>  * The estimated retained heap size of String objects (~5M instances) is 3.3 
> GB. It seems that most of these instances correspond to spark events.
>  * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB.
>  * The number of LiveJob objects with status "RUNNING" is 18K, knowing that 
> there shouldn't be more than 16 live running jobs since we use a fixed size 
> thread pool of 16 threads to run spark queries.
>  * The number of LiveTask objects is 485K.
>  * The AsyncEventQueue instance associated with the AppStatusListener has a 
> value of 854 for dropped events count and a value of 10001 for total events 
> count, knowing that the dropped events counter is reset every minute and that 
> the queue's default capacity is 10000.
> We think that there is a memory leak in Spark UI. Here is our analysis of the 
> root cause of this leak:
>  * AppStatusListener is notified of Spark events using a bounded queue in 
> AsyncEventQueue.
>  * AppStatusListener updates its state (kvstore, liveTasks, liveStages, 
> liveJobs, ...) based on the received events. For example, onTaskStart adds a 
> task to liveTasks map and onTaskEnd removes the task from liveTasks map.
>  * When the rate of events is very high, the bounded queue in AsyncEventQueue 
> is full, some events are dropped and don't make it to AppStatusListener.
>  * Dropped events that signal the end of a processing unit prevent the state 
> of AppStatusListener from being cleaned. For example, a dropped onTaskEnd 
> event, will prevent the task from being removed from liveTasks map, and the 
> task will remain in the heap until the driver's JVM is stopped.
> We were able to confirm our analysis by reducing the capacity of the 
> AsyncEventQueue (spark.scheduler.listenerbus.eventqueue.capacity=10). After 
> having launched many spark queries using this config, we observed that the 
> number of active jobs in Spark UI increased rapidly and remained high even 
> though all submitted queries have completed. We have also noticed that some 
> executor task counters in Spark UI were negative, which confirms that 
> AppStatusListener state does not accurately reflect the reality and that it 
> can be a victim of event drops.
> Suggested fix:
> There are some limits today on the number of "dead" objects in 
> AppStatusListener's maps (for example: spark.ui.retainedJobs). We suggest 
> enforcing another configurable limit on the number of total objects in 
> AppStatusListener's maps and kvstore. This should limit the leak in the case 
> of high events rate, but AppStatusListener stats will remain inaccurate.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
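
The state-retention and queue-capacity settings discussed in this thread are 
ordinary Spark configuration properties. A minimal sketch of how they can be set, 
assuming the mitigation route of dropping fewer events and retaining less UI state 
(the values below are illustrative placeholders, not the fix proposed in the ticket):
{code:scala}
import org.apache.spark.SparkConf

// Raise the listener bus queue capacity so fewer events are dropped, and cap
// how much per-job/stage/task state the UI retains. Defaults shown in comments
// are for recent Spark versions; tune for your own event rate and driver heap.
val conf = new SparkConf()
  .set("spark.scheduler.listenerbus.eventqueue.capacity", "20000") // default: 10000
  .set("spark.ui.retainedJobs", "500")                             // default: 1000
  .set("spark.ui.retainedStages", "500")                           // default: 1000
  .set("spark.ui.retainedTasks", "50000")                          // default: 100000

// Pass the conf when building the context/session, e.g.
// SparkSession.builder.config(conf).getOrCreate()
{code}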



[jira] [Commented] (SPARK-43584) Update some sbt plugins

2023-05-19 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724252#comment-17724252
 ] 

Sean R. Owen commented on SPARK-43584:
--

What is this about? There is no detail.

> Update some sbt plugins
> ---
>
> Key: SPARK-43584
> URL: https://issues.apache.org/jira/browse/SPARK-43584
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43576) Remove unused declarations from Core module

2023-05-19 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724253#comment-17724253
 ] 

Sean R. Owen commented on SPARK-43576:
--

Same here; there is no information in this JIRA.

> Remove unused declarations from Core module
> ---
>
> Key: SPARK-43576
> URL: https://issues.apache.org/jira/browse/SPARK-43576
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43591) Assign a name to the error class _LEGACY_ERROR_TEMP_0013

2023-05-19 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-43591:
---

 Summary: Assign a name to the error class _LEGACY_ERROR_TEMP_0013
 Key: SPARK-43591
 URL: https://issues.apache.org/jira/browse/SPARK-43591
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43590) Make `CheckConnectJvmClientCompatibility` compare client and protobuf

2023-05-19 Thread Yang Jie (Jira)
Yang Jie created SPARK-43590:


 Summary: Make `CheckConnectJvmClientCompatibility` compare 
client and protobuf
 Key: SPARK-43590
 URL: https://issues.apache.org/jira/browse/SPARK-43590
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.5.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43589) Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString`

2023-05-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724208#comment-17724208
 ] 

ASF GitHub Bot commented on SPARK-43589:


User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/41232

> Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString`
> ---
>
> Key: SPARK-43589
> URL: https://issues.apache.org/jira/browse/SPARK-43589
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43589) Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString`

2023-05-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724206#comment-17724206
 ] 

ASF GitHub Bot commented on SPARK-43589:


User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/41232

> Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString`
> ---
>
> Key: SPARK-43589
> URL: https://issues.apache.org/jira/browse/SPARK-43589
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43588) Upgrade ASM to 9.5

2023-05-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724203#comment-17724203
 ] 

ASF GitHub Bot commented on SPARK-43588:


User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/41231

> Upgrade ASM to 9.5
> --
>
> Key: SPARK-43588
> URL: https://issues.apache.org/jira/browse/SPARK-43588
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0
>
>
> ASM 9.5 is for Java 21
>  
> https://asm.ow2.io/versions.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43588) Upgrade ASM to 9.5

2023-05-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724201#comment-17724201
 ] 

ASF GitHub Bot commented on SPARK-43588:


User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/41231

> Upgrade ASM to 9.5
> --
>
> Key: SPARK-43588
> URL: https://issues.apache.org/jira/browse/SPARK-43588
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0
>
>
> ASM 9.5 is for Java 21
>  
> https://asm.ow2.io/versions.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43588) Upgrade ASM to 9.5

2023-05-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43588.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41231
[https://github.com/apache/spark/pull/41231]

> Upgrade ASM to 9.5
> --
>
> Key: SPARK-43588
> URL: https://issues.apache.org/jira/browse/SPARK-43588
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0
>
>
> ASM 9.5 is for Java 21
>  
> https://asm.ow2.io/versions.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43588) Upgrade ASM to 9.5

2023-05-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43588:
-

Assignee: Yang Jie

> Upgrade ASM to 9.5
> --
>
> Key: SPARK-43588
> URL: https://issues.apache.org/jira/browse/SPARK-43588
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
> ASM 9.5 is for Java 21
>  
> https://asm.ow2.io/versions.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43346) Assign a name to the error class _LEGACY_ERROR_TEMP_1206

2023-05-19 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43346.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41020
[https://github.com/apache/spark/pull/41020]

> Assign a name to the error class _LEGACY_ERROR_TEMP_1206
> 
>
> Key: SPARK-43346
> URL: https://issues.apache.org/jira/browse/SPARK-43346
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
>  Labels: starter
> Fix For: 3.5.0
>
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_1206* defined in 
> {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
> short but complete (look at the examples in error-classes.json).
> Add a test that triggers the error from user code if such a test doesn't exist 
> yet. Check the exception fields by using {*}checkError(){*}. That function 
> checks only the valuable error fields and avoids depending on the error text 
> message. This way, tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate other 
> tests that might trigger the error to checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace 
> the error with an internal error; see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is not 
> clear. Propose a solution so users can avoid and fix such errors.
> Please look at the PRs below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
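
For the test-writing step described above, the checkError() pattern looks roughly 
like the schematic Scala sketch below; the suite mix-ins, SQL statement, error class 
name, and parameters are placeholders for illustration, not the actual rename done 
for this ticket:
{code:scala}
import org.apache.spark.sql.{AnalysisException, QueryTest}
import org.apache.spark.sql.test.SharedSparkSession

// Schematic only: assumes Spark's SQL test harness, where checkError() is available.
class RenamedErrorClassSuite extends QueryTest with SharedSparkSession {
  test("renamed error class is raised from user code") {
    checkError(
      exception = intercept[AnalysisException] {
        sql("SELECT * FROM nonexistent_table")       // placeholder query triggering the error
      },
      errorClass = "TABLE_OR_VIEW_NOT_FOUND",         // placeholder: use the newly assigned name
      parameters = Map("relationName" -> "`nonexistent_table`"))
  }
}
{code}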



[jira] [Assigned] (SPARK-43346) Assign a name to the error class _LEGACY_ERROR_TEMP_1206

2023-05-19 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-43346:


Assignee: Terry Kim

> Assign a name to the error class _LEGACY_ERROR_TEMP_1206
> 
>
> Key: SPARK-43346
> URL: https://issues.apache.org/jira/browse/SPARK-43346
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
>  Labels: starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_1206* defined in 
> {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
> short but complete (look at the examples in error-classes.json).
> Add a test that triggers the error from user code if such a test doesn't exist 
> yet. Check the exception fields by using {*}checkError(){*}. That function 
> checks only the valuable error fields and avoids depending on the error text 
> message. This way, tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate other 
> tests that might trigger the error to checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace 
> the error with an internal error; see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is not 
> clear. Propose a solution so users can avoid and fix such errors.
> Please look at the PRs below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43345) Assign a name to the error class _LEGACY_ERROR_TEMP_0041

2023-05-19 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-43345:


Assignee: Terry Kim

> Assign a name to the error class _LEGACY_ERROR_TEMP_0041
> 
>
> Key: SPARK-43345
> URL: https://issues.apache.org/jira/browse/SPARK-43345
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
>  Labels: starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_0041* defined in 
> {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
> short but complete (look at the examples in error-classes.json).
> Add a test that triggers the error from user code if such a test doesn't exist 
> yet. Check the exception fields by using {*}checkError(){*}. That function 
> checks only the valuable error fields and avoids depending on the error text 
> message. This way, tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate other 
> tests that might trigger the error to checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace 
> the error with an internal error; see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is not 
> clear. Propose a solution so users can avoid and fix such errors.
> Please look at the PRs below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43345) Assign a name to the error class _LEGACY_ERROR_TEMP_0041

2023-05-19 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43345.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41020
[https://github.com/apache/spark/pull/41020]

> Assign a name to the error class _LEGACY_ERROR_TEMP_0041
> 
>
> Key: SPARK-43345
> URL: https://issues.apache.org/jira/browse/SPARK-43345
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
>  Labels: starter
> Fix For: 3.5.0
>
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_0041* defined in 
> {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
> short but complete (look at the examples in error-classes.json).
> Add a test that triggers the error from user code if such a test doesn't exist 
> yet. Check the exception fields by using {*}checkError(){*}. That function 
> checks only the valuable error fields and avoids depending on the error text 
> message. This way, tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate other 
> tests that might trigger the error to checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace 
> the error with an internal error; see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is not 
> clear. Propose a solution so users can avoid and fix such errors.
> Please look at the PRs below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43589) Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString`

2023-05-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43589:
--
Affects Version/s: 3.3.2

> Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString`
> ---
>
> Key: SPARK-43589
> URL: https://issues.apache.org/jira/browse/SPARK-43589
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43589) Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString`

2023-05-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43589:
--
Priority: Minor  (was: Trivial)

> Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString`
> ---
>
> Key: SPARK-43589
> URL: https://issues.apache.org/jira/browse/SPARK-43589
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43589) Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString`

2023-05-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43589:
--
Issue Type: Bug  (was: Improvement)

> Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString`
> ---
>
> Key: SPARK-43589
> URL: https://issues.apache.org/jira/browse/SPARK-43589
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43589) Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString`

2023-05-19 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-43589:
-

 Summary: Fix `cannotBroadcastTableOverMaxTableBytesError` to use 
`bytesToString`
 Key: SPARK-43589
 URL: https://issues.apache.org/jira/browse/SPARK-43589
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43587) Run HealthTrackerIntegrationSuite in a dedicated JVM

2023-05-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43587:
-

Assignee: Dongjoon Hyun

> Run HealthTrackerIntegrationSuite in a dedicated JVM
> ---
>
> Key: SPARK-43587
> URL: https://issues.apache.org/jira/browse/SPARK-43587
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43587) Run HealthTrackerIntegrationSuite in a dedicated JVM

2023-05-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43587.
---
Fix Version/s: 3.3.3
   3.5.0
   3.4.1
   Resolution: Fixed

Issue resolved by pull request 41229
[https://github.com/apache/spark/pull/41229]

> Run HealthTrackerIntegrationSuite in a dedicated JVM
> ---
>
> Key: SPARK-43587
> URL: https://issues.apache.org/jira/browse/SPARK-43587
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.3.3, 3.5.0, 3.4.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org