[jira] [Updated] (SPARK-44772) Reading blocks from remote executors causes timeout issue

2023-08-10 Thread nebi mert aydin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nebi mert aydin updated SPARK-44772:

Component/s: Shuffle
 Spark Core

> Reading blocks from remote executors causes timeout issue
> --
>
> Key: SPARK-44772
> URL: https://issues.apache.org/jira/browse/SPARK-44772
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, PySpark, Shuffle, Spark Core
>Affects Versions: 3.1.2
>Reporter: nebi mert aydin
>Priority: Major
>
> I'm using EMR 6.5 with Spark 3.1.2.
> I'm shuffling 1.5 TiB of data across 3000 executors, each with 4 cores and 23 GiB of memory.
> Speculative execution is also enabled.
> {code:java}
> // df.repartition(6000) {code}
> I see lots of failures like the following:
> {code:java}
> 2023-08-11 01:01:09,846 ERROR 
> org.apache.spark.network.server.ChunkFetchRequestHandler 
> (shuffle-server-4-95): Error sending result 
> ChunkFetchSuccess[streamChunkId=StreamChunkId[streamId=779084003612,chunkIndex=323],buffer=FileSegmentManagedBuffer[file=/mnt3/yarn/usercache/zeppelin/appcache/application_1691438567823_0012/blockmgr-0d82ca05-9429-4ff2-9f61-e779e8e60648/07/shuffle_5_114492_0.data,offset=1836997,length=618]]
>  to /172.31.20.110:36654; closing connection
> java.nio.channels.ClosedChannelException
>   at 
> org.sparkproject.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
>   at 
> org.sparkproject.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
>   at 
> org.sparkproject.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>   at 
> org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>   at 
> org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:709)
>   at 
> org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:792)
>   at 
> org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:702)
>   at 
> org.sparkproject.io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:110)
>   at 
> org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>   at 
> org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:709)
>   at 
> org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:792)
>   at 
> org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:702)
>   at 
> org.sparkproject.io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302)
>   at 
> org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>   at 
> org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>   at 
> org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:790)
>   at 
> org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:758)
>   at 
> org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:808)
>   at 
> org.sparkproject.io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1025)
>   at 
> org.sparkproject.io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:294)
>   at 
> org.apache.spark.network.server.ChunkFetchRequestHandler.respond(ChunkFetchRequestHandler.java:142)
>   at 
> org.apache.spark.network.server.ChunkFetchRequestHandler.processFetchRequest(ChunkFetchRequestHandler.java:116)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:107)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
>   at 
> org.sparkproject.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>   at 
> org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   at 
> org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   at 
> 

[jira] [Updated] (SPARK-44772) Reading blocks from remote executors causes timeout issue

2023-08-10 Thread nebi mert aydin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nebi mert aydin updated SPARK-44772:

Description: 
I'm using EMR 6.5 with Spark 3.1.2.

I'm shuffling 1.5 TiB of data across 3000 executors, each with 4 cores and 23 GiB of memory.

Speculative execution is also enabled.
{code:java}
// df.repartition(6000) {code}
I see lots of failures like the following:
{code:java}
2023-08-11 01:01:09,846 ERROR 
org.apache.spark.network.server.ChunkFetchRequestHandler (shuffle-server-4-95): 
Error sending result 
ChunkFetchSuccess[streamChunkId=StreamChunkId[streamId=779084003612,chunkIndex=323],buffer=FileSegmentManagedBuffer[file=/mnt3/yarn/usercache/zeppelin/appcache/application_1691438567823_0012/blockmgr-0d82ca05-9429-4ff2-9f61-e779e8e60648/07/shuffle_5_114492_0.data,offset=1836997,length=618]]
 to /172.31.20.110:36654; closing connection
java.nio.channels.ClosedChannelException
at 
org.sparkproject.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
at 
org.sparkproject.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
at 
org.sparkproject.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:709)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:792)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:702)
at 
org.sparkproject.io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:110)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:709)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:792)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:702)
at 
org.sparkproject.io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:790)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:758)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:808)
at 
org.sparkproject.io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1025)
at 
org.sparkproject.io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:294)
at 
org.apache.spark.network.server.ChunkFetchRequestHandler.respond(ChunkFetchRequestHandler.java:142)
at 
org.apache.spark.network.server.ChunkFetchRequestHandler.processFetchRequest(ChunkFetchRequestHandler.java:116)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:107)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
at 
org.sparkproject.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at 
org.sparkproject.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at 

[jira] [Updated] (SPARK-44772) Reading blocks from remote executors causes timeout issue

2023-08-10 Thread nebi mert aydin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nebi mert aydin updated SPARK-44772:

Description: 
I'm using EMR 6.5 with Spark 3.1.2.

I'm shuffling 1.5 TiB of data across 3000 executors, each with 4 cores and 23 GiB of memory.
{code:java}
// df.repartition(6000) {code}
I see lots of failures like the following:
{code:java}
2023-08-11 01:01:09,846 ERROR 
org.apache.spark.network.server.ChunkFetchRequestHandler (shuffle-server-4-95): 
Error sending result 
ChunkFetchSuccess[streamChunkId=StreamChunkId[streamId=779084003612,chunkIndex=323],buffer=FileSegmentManagedBuffer[file=/mnt3/yarn/usercache/zeppelin/appcache/application_1691438567823_0012/blockmgr-0d82ca05-9429-4ff2-9f61-e779e8e60648/07/shuffle_5_114492_0.data,offset=1836997,length=618]]
 to /172.31.20.110:36654; closing connection
java.nio.channels.ClosedChannelException
at 
org.sparkproject.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
at 
org.sparkproject.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
at 
org.sparkproject.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:709)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:792)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:702)
at 
org.sparkproject.io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:110)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:709)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:792)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:702)
at 
org.sparkproject.io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:790)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:758)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:808)
at 
org.sparkproject.io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1025)
at 
org.sparkproject.io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:294)
at 
org.apache.spark.network.server.ChunkFetchRequestHandler.respond(ChunkFetchRequestHandler.java:142)
at 
org.apache.spark.network.server.ChunkFetchRequestHandler.processFetchRequest(ChunkFetchRequestHandler.java:116)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:107)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
at 
org.sparkproject.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at 
org.sparkproject.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at 

[jira] [Created] (SPARK-44772) Reading blocks from remote executors causes timeout issue

2023-08-10 Thread nebi mert aydin (Jira)
nebi mert aydin created SPARK-44772:
---

 Summary: Reading blocks from remote executors causes timeout issue
 Key: SPARK-44772
 URL: https://issues.apache.org/jira/browse/SPARK-44772
 Project: Spark
  Issue Type: Bug
  Components: EC2, PySpark
Affects Versions: 3.1.2
Reporter: nebi mert aydin


I'm using EMR 6.5 with Spark 3.1.2.

I'm shuffling 1.5 TiB of data across 3000 executors, each with 4 cores and 23 GiB of memory per executor.
`df.repartition(6000)`
I see lots of failures like the following:

```

2023-08-11 01:01:09,847 ERROR 
org.apache.spark.network.server.ChunkFetchRequestHandler (shuffle-server-4-95): 
Error sending result 
ChunkFetchSuccess[streamChunkId=StreamChunkId[streamId=779084003612,chunkIndex=324],buffer=FileSegmentManagedBuffer[file=/mnt1/yarn/usercache/zeppelin/appcache/application_1691438567823_0012/blockmgr-b2f9bea5-068c-45c8-b324-1f132c87de54/24/shuffle_5_115515_0.data,offset=680394,length=255]]
 to /172.31.20.110:36654; closing connection

```

I tried turning off these NIC offload settings with ethtool:

```

sudo ethtool -K eth0 tso off
sudo ethtool -K eth0 sg off

```

That didn't help. I suspect the external shuffle service is unable to send data to 
other executors for some reason.
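
For reference, below is a minimal PySpark sketch of the fetch-timeout and retry settings that are often raised for this class of shuffle failure. The property names are standard Spark settings, but the values and the tiny placeholder DataFrame are illustrative assumptions, not a verified fix for this job.

```

from pyspark.sql import SparkSession

# Sketch only: raise the network timeout and shuffle-fetch retries so transient
# ClosedChannelException / fetch failures are retried instead of failing the stage.
# Values are illustrative, not tuned for this workload.
spark = (
    SparkSession.builder
    .appName("shuffle-timeout-tuning-sketch")
    .config("spark.network.timeout", "600s")        # default 120s
    .config("spark.shuffle.io.maxRetries", "10")    # default 3
    .config("spark.shuffle.io.retryWait", "30s")    # default 5s
    .config("spark.reducer.maxReqsInFlight", "64")  # cap concurrent fetch requests
    .getOrCreate()
)

df = spark.range(0, 1000)       # placeholder for the real 1.5 TiB input
df.repartition(6000).count()    # mirrors the df.repartition(6000) in the report

```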

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24578) Reading remote cache block behavior changes and causes timeout issue

2023-08-10 Thread nebi mert aydin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17753038#comment-17753038
 ] 

nebi mert aydin commented on SPARK-24578:
-

I still see this problem with Spark 3.1.2 on Amazon EMR. Any ideas?

> Reading remote cache block behavior changes and causes timeout issue
> 
>
> Key: SPARK-24578
> URL: https://issues.apache.org/jira/browse/SPARK-24578
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Wenbo Zhao
>Assignee: Wenbo Zhao
>Priority: Blocker
> Fix For: 2.3.2, 2.4.0
>
>
> After Spark 2.3, we observed lots of errors like the following in some of our 
> production jobs:
> {code:java}
> 18/06/15 20:59:42 ERROR TransportRequestHandler: Error sending result 
> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=91672904003, 
> chunkIndex=0}, 
> buffer=org.apache.spark.storage.BlockManagerManagedBuffer@783a9324} to 
> /172.22.18.7:60865; closing connection
> java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.writeNioBuffer(MessageWithHeader.java:156)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.copyByteBuf(MessageWithHeader.java:142)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:123)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:355)
> at 
> io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:224)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:382)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
> at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:362)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:901)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1321)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at 
> io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at 
> io.netty.channel.DefaultChannelPipeline.flush(DefaultChannelPipeline.java:983)
> at io.netty.channel.AbstractChannel.flush(AbstractChannel.java:248)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$1.run(AbstractNioByteChannel.java:284)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> {code}
>  
> Here is a small reproducer for a cluster of 2 executors (say host-1 and 
> host-2), each with 8 cores. The memory of the driver and executors is not an 
> important factor here as long as it is big enough, say 20G. 
> {code:java}
> val n = 1
> val df0 = sc.parallelize(1 to n).toDF
> val df = df0.withColumn("x0", rand()).withColumn("x0", rand()
> ).withColumn("x1", rand()
> ).withColumn("x2", rand()
> ).withColumn("x3", rand()
> ).withColumn("x4", rand()
> ).withColumn("x5", rand()
> ).withColumn("x6", rand()
> ).withColumn("x7", rand()
> ).withColumn("x8", rand()
> ).withColumn("x9", rand())

[jira] [Commented] (SPARK-44719) NoClassDefFoundError when using Hive UDF

2023-08-10 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17753037#comment-17753037
 ] 

Snoot.io commented on SPARK-44719:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/42447

> NoClassDefFoundError when using Hive UDF
> 
>
> Key: SPARK-44719
> URL: https://issues.apache.org/jira/browse/SPARK-44719
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: HiveUDFs-1.0-SNAPSHOT.jar
>
>
> How to reproduce:
> {noformat}
> spark-sql (default)> add jar 
> /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar;
> Time taken: 0.413 seconds
> spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as 
> 'net.petrabarus.hiveudfs.LongToIP';
> Time taken: 0.038 seconds
> spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10);
> 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT 
> long_to_ip(2130706433L) FROM range(10)]
> java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory
>   at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44461) Enable Process Isolation for streaming python worker

2023-08-10 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17753036#comment-17753036
 ] 

Snoot.io commented on SPARK-44461:
--

User 'WweiL' has created a pull request for this issue:
https://github.com/apache/spark/pull/42443

> Enable Process Isolation for streaming python worker
> 
>
> Key: SPARK-44461
> URL: https://issues.apache.org/jira/browse/SPARK-44461
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.4.1
>Reporter: Raghu Angadi
>Priority: Major
>
> Enable PI for Python worker used for foreachBatch() & streaming listener in 
> Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44768) Improve WSCG handling of row buffer by accounting for executor memory. Exploding nested arrays can easily lead to out of memory errors.

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44768:

Description: 
The code sample below is to showcase the wholestagecodegen generated when 
exploding nested arrays.  The data sample in the dataframe is quite small so it 
will not trigger the Out of Memory error . 

However if the array is larger and the row size is large , you will definitely 
end up with an OOM error .  

 

Consider a scenario where you flatten a nested array.

// e.g. you can use the following steps to create the dataframe

//create a partClass case class
case class partClass (PARTNAME: String , PartNumber: String , PartPrice : 
Double )

//create a nested array array class
case  class array_array_class (
 col_int: Int,
 arr_arr_string : Seq[Seq[String]],
 arr_arr_bigint : Seq[Seq[Long]],
 col_string     : String,
 parts_s        : Seq[Seq[partClass]]
 
)

//create a dataframe
var df_array_array = sc.parallelize(
 Seq(
 (1,Seq(Seq("a","b" ,"c" ,"d") ,Seq("aa","bb" ,"cc","dd")) , 
Seq(Seq(1000,2), Seq(3,-1)),"ItemPart1",
  Seq(Seq(partClass("PNAME1","P1",20.75),partClass("PNAME1_1","P1_1",30.75)))
 ) ,
 
 (2,Seq(Seq("ab","bc" ,"cd" ,"de") ,Seq("aab","bbc" 
,"ccd","dde"),Seq("aab")) , Seq(Seq(-1000,-2,-1,-2), 
Seq(0,3,-1)),"ItemPart2",
  
Seq(Seq(partClass("PNAME2","P2",170.75),partClass("PNAME2_1","P2_1",33.75),partClass("PNAME2_2","P2_2",73.75)))
 )
  
 )

).toDF("c1" ,"c2" ,"c3" ,"c4" ,"c5")

//explode a nested array 

var  result   =  df_array_array.select( col("c1"), 
explode(col("c2"))).select('c1 , explode('col))

result.explain

 

The physical plan for this query is shown below.

-
Physical plan 

== Physical Plan ==
*(1) Generate explode(col#27), [c1#17|#17], false, [col#30|#30]
+- *(1) Filter ((size(col#27, true) > 0) AND isnotnull(col#27))
   +- *(1) Generate explode(c2#18), [c1#17|#17], false, [col#27|#27]
      +- *(1) Project [_1#6 AS c1#17, _2#7 AS c2#18|#6 AS c1#17, _2#7 AS c2#18]
         +- *(1) Filter ((size(_2#7, true) > 0) AND isnotnull(_2#7))
            +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, 
scala.Tuple5, true]))._1 AS _1#6, mapobjects(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -1), 
mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, 
-2), staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -2), StringType, ObjectType(class java.lang.String)), 
true, false, true), validateexternaltype(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -1), ArrayType(StringType,true), 
ObjectType(interface scala.collection.Seq)), None), 
knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._2, None) AS _2#7, 
mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, 
-3), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), 
true, -4), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -4), IntegerType, IntegerType)), 
validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -3), ArrayType(IntegerType,false), 
ObjectType(interface scala.collection.Seq)), None), 
knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._3, None) AS _3#8, 
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._4, true, 
false, true) AS _4#9, mapobjects(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -5), mapobjects(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -6), if 
(isnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -6), StructField(PARTNAME,StringType,true), 
StructField(PartNumber,StringType,true), 
StructField(PartPrice,DoubleType,false), ObjectType(class 
$line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$partClass null else 
named_struct(PARTNAME, staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
knownnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -6), StructField(PARTNAME,StringType,true), 
StructField(PartNumber,StringType,true), 
StructField(PartPrice,DoubleType,false), ObjectType(class 
$line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$partClass))).PARTNAME, true, 
false, true), PartNumber, staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
knownnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -6), StructField(PARTNAME,StringType,true), 
StructField(PartNumber,StringType,true), 
StructField(PartPrice,DoubleType,false), ObjectType(class 

[jira] [Commented] (SPARK-44768) Improve WSCG handling of row buffer by accounting for executor memory. Exploding nested arrays can easily lead to out of memory errors.

2023-08-10 Thread Franck Tago (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17753035#comment-17753035
 ] 

Franck Tago commented on SPARK-44768:
-

!image-2023-08-10-20-32-55-684.png!

> Improve WSCG handling of row buffer by accounting for executor memory  .  
> Exploding nested arrays can easily lead to out of memory errors. 
> ---
>
> Key: SPARK-44768
> URL: https://issues.apache.org/jira/browse/SPARK-44768
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.2, 3.4.0, 3.4.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-20-32-55-684.png, 
> spark-jira_wscg_code.txt
>
>
> consider a scenario where you flatten  a nested array 
> // e.g you can use the following steps to create the dataframe 
> //create a partClass case class
> case class partClass (PARTNAME: String , PartNumber: String , PartPrice : 
> Double )
> //create a nested array array class
> case  class array_array_class (
>  col_int: Int,
>  arr_arr_string : Seq[Seq[String]],
>  arr_arr_bigint : Seq[Seq[Long]],
>  col_string     : String,
>  parts_s        : Seq[Seq[partClass]]
>  
> )
> //create a dataframe
> var df_array_array = sc.parallelize(
>  Seq(
>  (1,Seq(Seq("a","b" ,"c" ,"d") ,Seq("aa","bb" ,"cc","dd")) , 
> Seq(Seq(1000,2), Seq(3,-1)),"ItemPart1",
>   Seq(Seq(partClass("PNAME1","P1",20.75),partClass("PNAME1_1","P1_1",30.75)))
>  ) ,
>  
>  (2,Seq(Seq("ab","bc" ,"cd" ,"de") ,Seq("aab","bbc" 
> ,"ccd","dde"),Seq("aab")) , Seq(Seq(-1000,-2,-1,-2), 
> Seq(0,3,-1)),"ItemPart2",
>   
> Seq(Seq(partClass("PNAME2","P2",170.75),partClass("PNAME2_1","P2_1",33.75),partClass("PNAME2_2","P2_2",73.75)))
>  )
>   
>  )
> ).toDF("c1" ,"c2" ,"c3" ,"c4" ,"c5")
> //explode a nested array 
> var  result   =  df_array_array.select( col("c1"), 
> explode(col("c2"))).select('c1 , explode('col))
> result.explain
>  
> The physical plan for this query is shown below.
> -
> Physical plan 
> == Physical Plan ==
> *(1) Generate explode(col#27), [c1#17|#17], false, [col#30|#30]
> +- *(1) Filter ((size(col#27, true) > 0) AND isnotnull(col#27))
>    +- *(1) Generate explode(c2#18), [c1#17|#17], false, [col#27|#27]
>       +- *(1) Project [_1#6 AS c1#17, _2#7 AS c2#18|#6 AS c1#17, _2#7 AS 
> c2#18]
>          +- *(1) Filter ((size(_2#7, true) > 0) AND isnotnull(_2#7))
>             +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, 
> scala.Tuple5, true]))._1 AS _1#6, mapobjects(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -1), 
> mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), 
> true, -2), staticinvoke(class org.apache.spark.unsafe.types.UTF8String, 
> StringType, fromString, validateexternaltype(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -2), StringType, ObjectType(class 
> java.lang.String)), true, false, true), 
> validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -1), ArrayType(StringType,true), 
> ObjectType(interface scala.collection.Seq)), None), 
> knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._2, None) AS _2#7, 
> mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), 
> true, -3), mapobjects(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -4), 
> assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -4), IntegerType, IntegerType)), 
> validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -3), ArrayType(IntegerType,false), 
> ObjectType(interface scala.collection.Seq)), None), 
> knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._3, None) AS _3#8, 
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._4, 
> true, false, true) AS _4#9, mapobjects(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -5), 
> mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), 
> true, -6), if (isnull(validateexternaltype(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -6), 
> StructField(PARTNAME,StringType,true), 
> StructField(PartNumber,StringType,true), 
> StructField(PartPrice,DoubleType,false), ObjectType(class 
> $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$partClass null else 
> named_struct(PARTNAME, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> knownnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), 

[jira] [Updated] (SPARK-44768) Improve WSCG handling of row buffer by accounting for executor memory . Exploding nested arrays can easily lead to out of memory errors.

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44768:

Attachment: image-2023-08-10-20-32-55-684.png

> Improve WSCG handling of row buffer by accounting for executor memory  .  
> Exploding nested arrays can easily lead to out of memory errors. 
> ---
>
> Key: SPARK-44768
> URL: https://issues.apache.org/jira/browse/SPARK-44768
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.2, 3.4.0, 3.4.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-20-32-55-684.png, 
> spark-jira_wscg_code.txt
>
>
> consider a scenario where you flatten  a nested array 
> // e.g you can use the following steps to create the dataframe 
> //create a partClass case class
> case class partClass (PARTNAME: String , PartNumber: String , PartPrice : 
> Double )
> //create a nested array array class
> case  class array_array_class (
>  col_int: Int,
>  arr_arr_string : Seq[Seq[String]],
>  arr_arr_bigint : Seq[Seq[Long]],
>  col_string     : String,
>  parts_s        : Seq[Seq[partClass]]
>  
> )
> //create a dataframe
> var df_array_array = sc.parallelize(
>  Seq(
>  (1,Seq(Seq("a","b" ,"c" ,"d") ,Seq("aa","bb" ,"cc","dd")) , 
> Seq(Seq(1000,2), Seq(3,-1)),"ItemPart1",
>   Seq(Seq(partClass("PNAME1","P1",20.75),partClass("PNAME1_1","P1_1",30.75)))
>  ) ,
>  
>  (2,Seq(Seq("ab","bc" ,"cd" ,"de") ,Seq("aab","bbc" 
> ,"ccd","dde"),Seq("aab")) , Seq(Seq(-1000,-2,-1,-2), 
> Seq(0,3,-1)),"ItemPart2",
>   
> Seq(Seq(partClass("PNAME2","P2",170.75),partClass("PNAME2_1","P2_1",33.75),partClass("PNAME2_2","P2_2",73.75)))
>  )
>   
>  )
> ).toDF("c1" ,"c2" ,"c3" ,"c4" ,"c5")
> //explode a nested array 
> var  result   =  df_array_array.select( col("c1"), 
> explode(col("c2"))).select('c1 , explode('col))
> result.explain
>  
> The physical plan for this query is shown below.
> -
> Physical plan 
> == Physical Plan ==
> *(1) Generate explode(col#27), [c1#17|#17], false, [col#30|#30]
> +- *(1) Filter ((size(col#27, true) > 0) AND isnotnull(col#27))
>    +- *(1) Generate explode(c2#18), [c1#17|#17], false, [col#27|#27]
>       +- *(1) Project [_1#6 AS c1#17, _2#7 AS c2#18|#6 AS c1#17, _2#7 AS 
> c2#18]
>          +- *(1) Filter ((size(_2#7, true) > 0) AND isnotnull(_2#7))
>             +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, 
> scala.Tuple5, true]))._1 AS _1#6, mapobjects(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -1), 
> mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), 
> true, -2), staticinvoke(class org.apache.spark.unsafe.types.UTF8String, 
> StringType, fromString, validateexternaltype(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -2), StringType, ObjectType(class 
> java.lang.String)), true, false, true), 
> validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -1), ArrayType(StringType,true), 
> ObjectType(interface scala.collection.Seq)), None), 
> knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._2, None) AS _2#7, 
> mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), 
> true, -3), mapobjects(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -4), 
> assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -4), IntegerType, IntegerType)), 
> validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -3), ArrayType(IntegerType,false), 
> ObjectType(interface scala.collection.Seq)), None), 
> knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._3, None) AS _3#8, 
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._4, 
> true, false, true) AS _4#9, mapobjects(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -5), 
> mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), 
> true, -6), if (isnull(validateexternaltype(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -6), 
> StructField(PARTNAME,StringType,true), 
> StructField(PartNumber,StringType,true), 
> StructField(PartPrice,DoubleType,false), ObjectType(class 
> $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$partClass null else 
> named_struct(PARTNAME, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> knownnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -6), 

[jira] [Commented] (SPARK-43194) PySpark 3.4.0 cannot convert timestamp-typed objects to pandas with pandas 2.0

2023-08-10 Thread Berg Lloyd-Haig (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17753034#comment-17753034
 ] 

Berg Lloyd-Haig commented on SPARK-43194:
-

This also affects us when reading Parquet from the AWS Glue Catalog (a 
Hive-compatible metastore) with TimestampType fields.
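
For context, here is a minimal PySpark sketch of the toPandas() call path the quoted issue below describes. The Arrow setting shown is a commonly mentioned mitigation, and pinning pandas below 2.0 is another; both are assumptions here, not verified fixes for SPARK-43194.

{code:python}
from pyspark.sql import SparkSession

# Sketch only: exercises the timestamp-to-pandas conversion described below.
# The Arrow config is a commonly mentioned mitigation (an assumption, not a verified fix).
spark = (
    SparkSession.builder
    .appName("timestamp-topandas-sketch")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

pdf = spark.sql("select now() as ts").toPandas()
print(pdf.dtypes)  # expect datetime64[ns] (possibly tz-aware) when conversion succeeds
{code}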

> PySpark 3.4.0 cannot convert timestamp-typed objects to pandas with pandas 2.0
> --
>
> Key: SPARK-43194
> URL: https://issues.apache.org/jira/browse/SPARK-43194
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
> Environment: {code}
> In [4]: import pandas as pd
> In [5]: pd.__version__
> Out[5]: '2.0.0'
> In [6]: import pyspark as ps
> In [7]: ps.__version__
> Out[7]: '3.4.0'
> {code}
>Reporter: Phillip Cloud
>Priority: Major
>
> {code}
> In [1]: from pyspark.sql import SparkSession
> In [2]: session = SparkSession.builder.appName("test").getOrCreate()
> 23/04/19 09:21:42 WARN Utils: Your hostname, albatross resolves to a loopback 
> address: 127.0.0.2; using 192.168.1.170 instead (on interface enp5s0)
> 23/04/19 09:21:42 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 23/04/19 09:21:42 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> In [3]: session.sql("select now()").toPandas()
> {code}
> Results in:
> {code}
> ...
> TypeError: Casting to unit-less dtype 'datetime64' is not supported. Pass 
> e.g. 'datetime64[ns]' instead.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44771) Remove 'sudo' in 'pip install' suggestions in the dev scripts

2023-08-10 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-44771.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42444
[https://github.com/apache/spark/pull/42444]

> Remove 'sudo' in 'pip install' suggestions in the dev scripts
> -
>
> Key: SPARK-44771
> URL: https://issues.apache.org/jira/browse/SPARK-44771
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Trivial
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42132) DeduplicateRelations rule breaks plan when co-grouping the same DataFrame

2023-08-10 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan resolved SPARK-42132.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

> DeduplicateRelations rule breaks plan when co-grouping the same DataFrame
> -
>
> Key: SPARK-42132
> URL: https://issues.apache.org/jira/browse/SPARK-42132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.3.1, 3.2.3, 3.4.0, 3.5.0
>Reporter: Enrico Minack
>Priority: Major
>  Labels: correctness
> Fix For: 3.5.0
>
>
> Co-grouping two DataFrames that share references breaks on the 
> DeduplicateRelations rule:
> {code:java}
> val df = spark.range(3)
> val left_grouped_df = df.groupBy("id").as[Long, Long]
> val right_grouped_df = df.groupBy("id").as[Long, Long]
> val cogroup_df = left_grouped_df.cogroup(right_grouped_df) {
>   case (key, left, right) => left
> }
> cogroup_df.explain()
> {code}
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- SerializeFromObject [input[0, bigint, false] AS value#12L]
>+- CoGroup, id#0: bigint, id#0: bigint, id#0: bigint, [id#13L], [id#13L], 
> [id#13L], [id#13L], obj#11: bigint
>   :- !Sort [id#13L ASC NULLS FIRST], false, 0
>   :  +- !Exchange hashpartitioning(id#13L, 200), ENSURE_REQUIREMENTS, 
> [plan_id=16]
>   : +- Range (0, 3, step=1, splits=16)
>   +- Sort [id#13L ASC NULLS FIRST], false, 0
>  +- Exchange hashpartitioning(id#13L, 200), ENSURE_REQUIREMENTS, 
> [plan_id=17]
> +- Range (0, 3, step=1, splits=16)
> {code}
> The DataFrame cannot be computed:
> {code:java}
> cogroup_df.show()
> {code}
> {code:java}
> java.lang.IllegalStateException: Couldn't find id#13L in [id#0L]
> {code}
> The rule replaces `id#0L` on the right side with `id#13L` while replacing all 
> occurrences in `CoGroup`. Some occurrences of `id#0L` in `CoGroup` refer to 
> the left side and should not be replaced. Further, `id#0L` of the right 
> deserializer is not replaced.
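
For readers following the quoted Scala reproducer from PySpark, below is a rough analogue: co-grouping two groupings of the same DataFrame. It is an illustration only (the names, schema string, and function are not from the ticket), and on affected versions the DeduplicateRelations problem may or may not surface through this exact path.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cogroup-self-sketch").getOrCreate()
df = spark.range(3)  # single LongType column named "id"

def take_left(key, left, right):
    # Return the left-side rows, mirroring `case (key, left, right) => left`.
    return left

cogrouped = (
    df.groupBy("id").cogroup(df.groupBy("id"))
      .applyInPandas(take_left, schema="id long")
)
cogrouped.explain()
cogrouped.show()
{code}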



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44771) Remove 'sudo' in 'pip install' suggestions in the dev scripts

2023-08-10 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-44771:
--

 Summary: Remove 'sudo' in 'pip install' suggestions in the dev 
scripts
 Key: SPARK-44771
 URL: https://issues.apache.org/jira/browse/SPARK-44771
 Project: Spark
  Issue Type: Task
  Components: Build
Affects Versions: 4.0.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44763) Fix a bug of promoting string as double in binary arithmetic with interval

2023-08-10 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-44763.

Target Version/s: 4.0.0
  Resolution: Fixed

> Fix a bug of promoting string as double in binary arithmetic with interval  
> 
>
> Key: SPARK-44763
> URL: https://issues.apache.org/jira/browse/SPARK-44763
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> The following query works on branch-3.5 or below, but fails on the latest 
> master:
> ```
> select concat(DATE'2020-12-31', ' ', date_format('09:03:08', 'HH:mm:ss')) + 
> (INTERVAL '03' HOUR)
> ```
>  
> The direct reason is that we now mark `cast(date as string)` as resolved during 
> type coercion after the changes in [https://github.com/apache/spark/pull/42089]. 
> As a result, there are two transforms from CombinedTypeCoercionRule:
> ```
> Rule ConcatCoercion Transformed concat(2020-12-31,  , 
> date_format(cast(09:03:08 as timestamp), HH:mm:ss, 
> Some(America/Los_Angeles))) to concat(cast(2020-12-31 as string),  , 
> date_format(cast(09:03:08 as timestamp), HH:mm:ss, Some(America/Los_Angeles)))
> Rule PromoteStrings Transformed (concat(cast(2020-12-31 as string),  , 
> date_format(cast(09:03:08 as timestamp), HH:mm:ss, 
> Some(America/Los_Angeles))) + INTERVAL '03' HOUR) to 
> (cast(concat(cast(2020-12-31 as string),  , date_format(cast(09:03:08 as 
> timestamp), HH:mm:ss, Some(America/Los_Angeles))) as double) + INTERVAL '03' 
> HOUR)
> ```  
> The second transform doesn't happen in previous releases since 
> cast(2020-12-31 as string)  used to be unresolved after the first transform.
>  
> The fix is simple: the analyzer should not promote string to double in binary 
> arithmetic with an ANSI interval. The changes in 
> [https://github.com/apache/spark/pull/42089] are valid and we should keep them.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44763) Fix a bug of promoting string as double in binary arithmetic with interval

2023-08-10 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17753020#comment-17753020
 ] 

Gengliang Wang commented on SPARK-44763:


Resolved in https://github.com/apache/spark/pull/42436

> Fix a bug of promoting string as double in binary arithmetic with interval  
> 
>
> Key: SPARK-44763
> URL: https://issues.apache.org/jira/browse/SPARK-44763
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> The following query works on branch-3.5 or below, but fails on the latest 
> master:
> ```
> select concat(DATE'2020-12-31', ' ', date_format('09:03:08', 'HH:mm:ss')) + 
> (INTERVAL '03' HOUR)
> ```
>  
> The direct reason is that we now mark `cast(date as string)` as resolved during 
> type coercion after the changes in [https://github.com/apache/spark/pull/42089]. 
> As a result, there are two transforms from CombinedTypeCoercionRule:
> ```
> Rule ConcatCoercion Transformed concat(2020-12-31,  , 
> date_format(cast(09:03:08 as timestamp), HH:mm:ss, 
> Some(America/Los_Angeles))) to concat(cast(2020-12-31 as string),  , 
> date_format(cast(09:03:08 as timestamp), HH:mm:ss, Some(America/Los_Angeles)))
> Rule PromoteStrings Transformed (concat(cast(2020-12-31 as string),  , 
> date_format(cast(09:03:08 as timestamp), HH:mm:ss, 
> Some(America/Los_Angeles))) + INTERVAL '03' HOUR) to 
> (cast(concat(cast(2020-12-31 as string),  , date_format(cast(09:03:08 as 
> timestamp), HH:mm:ss, Some(America/Los_Angeles))) as double) + INTERVAL '03' 
> HOUR)
> ```  
> The second transform doesn't happen in previous releases since 
> cast(2020-12-31 as string)  used to be unresolved after the first transform.
>  
> The fix is simple: the analyzer should not promote string to double in binary 
> arithmetic with an ANSI interval. The changes in 
> [https://github.com/apache/spark/pull/42089] are valid and we should keep them.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43781) IllegalStateException when cogrouping two datasets derived from the same source

2023-08-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-43781.
-
Fix Version/s: 3.5.0
 Assignee: Jia Fan
   Resolution: Fixed

> IllegalStateException when cogrouping two datasets derived from the same 
> source
> ---
>
> Key: SPARK-43781
> URL: https://issues.apache.org/jira/browse/SPARK-43781
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1, 3.4.0
> Environment: Reproduces in a unit test, using Spark 3.3.1, the Java 
> API, and a {{local[2]}} SparkSession.
>Reporter: Derek Murray
>Assignee: Jia Fan
>Priority: Major
> Fix For: 3.5.0
>
>
> Attempting to {{cogroup}} two datasets derived from the same source dataset 
> yields an {{IllegalStateException}} when the query is executed.
> Minimal reproducer:
> {code:java}
> StructType inputType = DataTypes.createStructType(
> new StructField[]{
> DataTypes.createStructField("id", DataTypes.LongType, false),
> DataTypes.createStructField("type", DataTypes.StringType, false)
> }
> );
> StructType keyType = DataTypes.createStructType(
> new StructField[]{
> DataTypes.createStructField("id", DataTypes.LongType, false)
> }
> );
> List inputRows = new ArrayList<>();
> inputRows.add(RowFactory.create(1L, "foo"));
> inputRows.add(RowFactory.create(1L, "bar"));
> inputRows.add(RowFactory.create(2L, "foo"));
> Dataset input = sparkSession.createDataFrame(inputRows, inputType);
> KeyValueGroupedDataset fooGroups = input
> .filter("type = 'foo'")
> .groupBy("id")
> .as(RowEncoder.apply(keyType), RowEncoder.apply(inputType));
> KeyValueGroupedDataset barGroups = input
> .filter("type = 'bar'")
> .groupBy("id")
> .as(RowEncoder.apply(keyType), RowEncoder.apply(inputType));
> Dataset result = fooGroups.cogroup(
> barGroups,
> (CoGroupFunction) (row, iterator, iterator1) -> new 
> ArrayList().iterator(),
> RowEncoder.apply(inputType));
> result.explain();
> result.show();{code}
> Explain output (note mismatch in column IDs between Sort/Exchagne and 
> LocalTableScan on the first input to the CoGroup):
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- SerializeFromObject 
> [validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, id), LongType, false) AS id#37L, 
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 1, type), StringType, false), true, false, 
> true) AS type#38]
>    +- CoGroup 
> org.apache.spark.sql.KeyValueGroupedDataset$$Lambda$1478/1869116781@77856cc5, 
> createexternalrow(id#16L, StructField(id,LongType,false)), 
> createexternalrow(id#16L, type#17.toString, StructField(id,LongType,false), 
> StructField(type,StringType,false)), createexternalrow(id#16L, 
> type#17.toString, StructField(id,LongType,false), 
> StructField(type,StringType,false)), [id#39L], [id#39L], [id#39L, type#40], 
> [id#39L, type#40], obj#36: org.apache.spark.sql.Row
>       :- !Sort [id#39L ASC NULLS FIRST], false, 0
>       :  +- !Exchange hashpartitioning(id#39L, 2), ENSURE_REQUIREMENTS, 
> [plan_id=19]
>       :     +- LocalTableScan [id#16L, type#17]
>       +- Sort [id#39L ASC NULLS FIRST], false, 0
>          +- Exchange hashpartitioning(id#39L, 2), ENSURE_REQUIREMENTS, 
> [plan_id=20]
>             +- LocalTableScan [id#39L, type#40]{code}
> Exception:
> {code:java}
> java.lang.IllegalStateException: Couldn't find id#39L in [id#16L,type#17]
>         at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>         at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
>         at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589)
>         at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:75)
>         at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:35)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589)
>         at 
> 

[jira] [Reopened] (SPARK-44760) Index Out Of Bound for JIRA resolution in merge_spark_pr

2023-08-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-44760:
--
  Assignee: (was: Kent Yao)

Reverted at 
https://github.com/apache/spark/commit/3164ff51dd249670b8505a5ea8d361cf53c9db94

> Index Out Of Bound for JIRA resolution in merge_spark_pr
> 
>
> Key: SPARK-44760
> URL: https://issues.apache.org/jira/browse/SPARK-44760
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>
> I



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44760) Index Out Of Bound for JIRA resolution in merge_spark_pr

2023-08-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-44760:
-
Fix Version/s: (was: 4.0.0)

> Index Out Of Bound for JIRA resolution in merge_spark_pr
> 
>
> Key: SPARK-44760
> URL: https://issues.apache.org/jira/browse/SPARK-44760
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> I



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43872) Enable DataFramePlotMatplotlibTests for pandas 2.0.0.

2023-08-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43872:


Assignee: Haejoon Lee

> Enable DataFramePlotMatplotlibTests for pandas 2.0.0.
> -
>
> Key: SPARK-43872
> URL: https://issues.apache.org/jira/browse/SPARK-43872
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark, PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> test list:
>  * test_area_plot
>  * test_area_plot_stacked_false
>  * test_area_plot_y
>  * test_bar_plot
>  * test_bar_with_x_y
>  * test_barh_plot_with_x_y
>  * test_barh_plot
>  * test_line_plot
>  * test_pie_plot
>  * test_scatter_plot
>  * test_hist_plot
>  * test_kde_plot



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44705) Make PythonRunner single-threaded

2023-08-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44705.
--
  Assignee: Utkarsh Agarwal
Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/42385

> Make PythonRunner single-threaded
> -
>
> Key: SPARK-44705
> URL: https://issues.apache.org/jira/browse/SPARK-44705
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Utkarsh Agarwal
>Assignee: Utkarsh Agarwal
>Priority: Major
>
> PythonRunner, a utility that executes Python UDFs in Spark, uses two threads 
> in a producer-consumer model today. This multi-threading model is problematic 
> and confusing as Spark's execution model within a task is commonly understood 
> to be single-threaded. 
> More importantly, this departure into double-threaded execution resulted in a 
> series of customer issues involving [race 
> conditions|https://issues.apache.org/jira/browse/SPARK-33277] and 
> [deadlocks|https://issues.apache.org/jira/browse/SPARK-38677] between threads, 
> as the code was hard to reason about. There have been multiple attempts to 
> rein in these issues, viz., [fix 
> 1|https://issues.apache.org/jira/browse/SPARK-22535], [fix 
> 2|https://github.com/apache/spark/pull/30177], [fix 
> 3|https://github.com/apache/spark/commit/243c321db2f02f6b4d926114bd37a6e74c2be185].
>  Moreover, the fixes have made the code base somewhat abstruse by introducing 
> multiple daemon [monitor 
> threads|https://github.com/apache/spark/blob/a3a32912be04d3760cb34eb4b79d6d481bbec502/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala#L579]
>  to detect deadlocks.
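For readers skimming the thread, a minimal, self-contained sketch of the two-thread producer/consumer shape described above; the names and the queue-based hand-off are illustrative only and are not the actual PythonRunner internals.

{code:scala}
import java.util.concurrent.ArrayBlockingQueue

// Illustrative two-thread producer/consumer shape (not PythonRunner itself).
object TwoThreadSketch {
  def main(args: Array[String]): Unit = {
    val queue = new ArrayBlockingQueue[Int](16)

    // Writer thread: feeds "input rows" to the worker side.
    val producer: Runnable = () => {
      (1 to 100).foreach(i => queue.put(i))
      queue.put(-1) // end-of-stream marker
    }
    val writer = new Thread(producer, "writer")
    writer.start()

    // Task thread: consumes results. If the writer dies or blocks while the
    // queue is full, this side can hang -- the kind of cross-thread reasoning
    // a single-threaded runner avoids.
    Iterator.continually(queue.take()).takeWhile(_ != -1).foreach { row =>
      println(s"got row $row")
    }
    writer.join()
  }
}
{code}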



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44765) Make ReleaseExecute retry in ExecutePlanResponseReattachableIterator reuse common mechanism

2023-08-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44765:


Assignee: Juliusz Sompolski

> Make ReleaseExecute retry in ExecutePlanResponseReattachableIterator reuse 
> common mechanism
> ---
>
> Key: SPARK-44765
> URL: https://issues.apache.org/jira/browse/SPARK-44765
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43032) Add StreamingQueryManager API

2023-08-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-43032:
-
Fix Version/s: 3.5.0
   4.0.0

> Add StreamingQueryManager API
> -
>
> Key: SPARK-43032
> URL: https://issues.apache.org/jira/browse/SPARK-43032
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Raghu Angadi
>Assignee: Wei Liu
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> Add the StreamingQueryManager API. It would include the APIs that can be 
> directly supported. APIs like registering a streaming listener will be handled 
> separately. 
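As a rough illustration of the surface being added, this is the kind of StreamingQueryManager usage the Connect client would need to cover (existing Scala API shape; the rate source and console sink are only for the example).

{code:scala}
import org.apache.spark.sql.SparkSession

// Minimal sketch of StreamingQueryManager usage with the existing Scala API.
object StreamingQueryManagerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sqm").getOrCreate()

    // Start a trivial streaming query so the manager has something to report.
    val query = spark.readStream.format("rate").load()
      .writeStream.format("console").start()

    val manager = spark.streams                 // StreamingQueryManager
    manager.active.foreach(q => println(s"${q.id} active=${q.isActive}"))
    println(manager.get(query.id).status)       // look up a query by id

    query.stop()
    spark.stop()
  }
}
{code}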



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44766) Cache the pandas converter for Python UDTFs

2023-08-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44766.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42439
[https://github.com/apache/spark/pull/42439]

> Cache the pandas converter for Python UDTFs
> ---
>
> Key: SPARK-44766
> URL: https://issues.apache.org/jira/browse/SPARK-44766
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> We should cache the pandas converter when serializing a pandas DataFrame to an 
> Arrow batch.
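The change itself lives in the PySpark serializer, but the underlying pattern is plain memoization: build the per-type converter once and reuse it across batches. A generic sketch of that pattern follows (illustrative names, not the PySpark code).

{code:scala}
import scala.collection.mutable

// Generic caching pattern: construct the converter once per data type and
// reuse it for every batch, instead of rebuilding it each time.
object ConverterCacheSketch {
  type Converter = Any => Any

  private val cache = mutable.HashMap.empty[String, Converter]

  def converterFor(dataType: String): Converter =
    cache.getOrElseUpdate(dataType, buildConverter(dataType))

  // Stand-in for the (expensive) per-type construction.
  private def buildConverter(dataType: String): Converter = {
    println(s"building converter for $dataType") // happens once per type
    identity
  }

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq(1, 2), Seq(3, 4), Seq(5))
    // Only the first batch triggers buildConverter for "int".
    batches.foreach(batch => batch.map(converterFor("int")))
  }
}
{code}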



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44766) Cache the pandas converter for Python UDTFs

2023-08-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44766:


Assignee: Allison Wang

> Cache the pandas converter for Python UDTFs
> ---
>
> Key: SPARK-44766
> URL: https://issues.apache.org/jira/browse/SPARK-44766
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>
> We should cache the pandas converter when serializing a pandas DataFrame to an 
> Arrow batch.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44768) Improve WSCG handling of row buffer by accounting for executor memory . Exploding nested arrays can easily lead to out of memory errors.

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44768:

Description: 
Consider a scenario where you flatten a nested array.

// e.g. you can use the following steps to create the dataframe

//create a partClass case class
case class partClass (PARTNAME: String , PartNumber: String , PartPrice : 
Double )

//create a nested array array class
case  class array_array_class (
 col_int: Int,
 arr_arr_string : Seq[Seq[String]],
 arr_arr_bigint : Seq[Seq[Long]],
 col_string     : String,
 parts_s        : Seq[Seq[partClass]]
 
)

//create a dataframe
var df_array_array = sc.parallelize(
 Seq(
 (1,Seq(Seq("a","b" ,"c" ,"d") ,Seq("aa","bb" ,"cc","dd")) , 
Seq(Seq(1000,2), Seq(3,-1)),"ItemPart1",
  Seq(Seq(partClass("PNAME1","P1",20.75),partClass("PNAME1_1","P1_1",30.75)))
 ) ,
 
 (2,Seq(Seq("ab","bc" ,"cd" ,"de") ,Seq("aab","bbc" 
,"ccd","dde"),Seq("aab")) , Seq(Seq(-1000,-2,-1,-2), 
Seq(0,3,-1)),"ItemPart2",
  
Seq(Seq(partClass("PNAME2","P2",170.75),partClass("PNAME2_1","P2_1",33.75),partClass("PNAME2_2","P2_2",73.75)))
 )
  
 )

).toDF("c1" ,"c2" ,"c3" ,"c4" ,"c5")

//explode a nested array 

var  result   =  df_array_array.select( col("c1"), 
explode(col("c2"))).select('c1 , explode('col))

result.explain

 

The physical plan for this operator is seen below.

-
Physical plan 

== Physical Plan ==
*(1) Generate explode(col#27), [c1#17|#17], false, [col#30|#30]
+- *(1) Filter ((size(col#27, true) > 0) AND isnotnull(col#27))
   +- *(1) Generate explode(c2#18), [c1#17|#17], false, [col#27|#27]
      +- *(1) Project [_1#6 AS c1#17, _2#7 AS c2#18|#6 AS c1#17, _2#7 AS c2#18]
         +- *(1) Filter ((size(_2#7, true) > 0) AND isnotnull(_2#7))
            +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, 
scala.Tuple5, true]))._1 AS _1#6, mapobjects(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -1), 
mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, 
-2), staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -2), StringType, ObjectType(class java.lang.String)), 
true, false, true), validateexternaltype(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -1), ArrayType(StringType,true), 
ObjectType(interface scala.collection.Seq)), None), 
knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._2, None) AS _2#7, 
mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, 
-3), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), 
true, -4), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -4), IntegerType, IntegerType)), 
validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -3), ArrayType(IntegerType,false), 
ObjectType(interface scala.collection.Seq)), None), 
knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._3, None) AS _3#8, 
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._4, true, 
false, true) AS _4#9, mapobjects(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -5), mapobjects(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -6), if 
(isnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -6), StructField(PARTNAME,StringType,true), 
StructField(PartNumber,StringType,true), 
StructField(PartPrice,DoubleType,false), ObjectType(class 
$line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$partClass null else 
named_struct(PARTNAME, staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
knownnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -6), StructField(PARTNAME,StringType,true), 
StructField(PartNumber,StringType,true), 
StructField(PartPrice,DoubleType,false), ObjectType(class 
$line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$partClass))).PARTNAME, true, 
false, true), PartNumber, staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
knownnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -6), StructField(PARTNAME,StringType,true), 
StructField(PartNumber,StringType,true), 
StructField(PartPrice,DoubleType,false), ObjectType(class 
$line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$partClass))).PartNumber, true, 
false, true), PartPrice, 
knownnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -6), StructField(PARTNAME,StringType,true), 
StructField(PartNumber,StringType,true), 

[jira] [Updated] (SPARK-44768) Improve WSCG handling of row buffer by accounting for executor memory . Exploding nested arrays can easily lead to out of memory errors.

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44768:

Attachment: spark-jira_wscg_code.txt

> Improve WSCG handling of row buffer by accounting for executor memory  .  
> Exploding nested arrays can easily lead to out of memory errors. 
> ---
>
> Key: SPARK-44768
> URL: https://issues.apache.org/jira/browse/SPARK-44768
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.2, 3.4.0, 3.4.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: spark-jira_wscg_code.txt
>
>
> Consider a scenario where you flatten a nested array.
> // e.g. you can use the following steps to create the dataframe
> //create a partClass case class
> case class partClass (PARTNAME: String , PartNumber: String , PartPrice : 
> Double )
> //create a nested array array class
> case  class array_array_class (
>  col_int: Int,
>  arr_arr_string : Seq[Seq[String]],
>  arr_arr_bigint : Seq[Seq[Long]],
>  col_string     : String,
>  parts_s        : Seq[Seq[partClass]]
>  
> )
> //create a dataframe
> var df_array_array = sc.parallelize(
>  Seq(
>  (1,Seq(Seq("a","b" ,"c" ,"d") ,Seq("aa","bb" ,"cc","dd")) , 
> Seq(Seq(1000,2), Seq(3,-1)),"ItemPart1",
>   Seq(Seq(partClass("PNAME1","P1",20.75),partClass("PNAME1_1","P1_1",30.75)))
>  ) ,
>  
>  (2,Seq(Seq("ab","bc" ,"cd" ,"de") ,Seq("aab","bbc" 
> ,"ccd","dde"),Seq("aab")) , Seq(Seq(-1000,-2,-1,-2), 
> Seq(0,3,-1)),"ItemPart2",
>   
> Seq(Seq(partClass("PNAME2","P2",170.75),partClass("PNAME2_1","P2_1",33.75),partClass("PNAME2_2","P2_2",73.75)))
>  )
>   
>  )
> ).toDF("c1" ,"c2" ,"c3" ,"c4" ,"c5")
> //explode a nested array 
> var  result   =  df_array_array.select( col("c1"), 
> explode(col("c2"))).select('c1 , explode('col))
> result.explain
>  
> The physical plan for this operator is seen below.
> -
> Physical plan 
> == Physical Plan ==
> *(1) Generate explode(col#27), [c1#17], false, [col#30]
> +- *(1) Filter ((size(col#27, true) > 0) AND isnotnull(col#27))
>    +- *(1) Generate explode(c2#18), [c1#17], false, [col#27]
>       +- *(1) Project [_1#6 AS c1#17, _2#7 AS c2#18]
>          +- *(1) Filter ((size(_2#7, true) > 0) AND isnotnull(_2#7))
>             +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, 
> scala.Tuple5, true]))._1 AS _1#6, mapobjects(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -1), 
> mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), 
> true, -2), staticinvoke(class org.apache.spark.unsafe.types.UTF8String, 
> StringType, fromString, validateexternaltype(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -2), StringType, ObjectType(class 
> java.lang.String)), true, false, true), 
> validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -1), ArrayType(StringType,true), 
> ObjectType(interface scala.collection.Seq)), None), 
> knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._2, None) AS _2#7, 
> mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), 
> true, -3), mapobjects(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -4), 
> assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -4), IntegerType, IntegerType)), 
> validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -3), ArrayType(IntegerType,false), 
> ObjectType(interface scala.collection.Seq)), None), 
> knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._3, None) AS _3#8, 
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._4, 
> true, false, true) AS _4#9, mapobjects(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -5), 
> mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), 
> true, -6), if (isnull(validateexternaltype(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -6), 
> StructField(PARTNAME,StringType,true), 
> StructField(PartNumber,StringType,true), 
> StructField(PartPrice,DoubleType,false), ObjectType(class 
> $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$partClass null else 
> named_struct(PARTNAME, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> knownnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -6), StructField(PARTNAME,StringType,true), 
> StructField(PartNumber,StringType,true), 
> 

[jira] [Created] (SPARK-44770) Add a displayOrder variable to WebUITab to specify the order in which tabs appear

2023-08-10 Thread Jason Li (Jira)
Jason Li created SPARK-44770:


 Summary: Add a displayOrder variable to WebUITab to specify the 
order in which tabs appear
 Key: SPARK-44770
 URL: https://issues.apache.org/jira/browse/SPARK-44770
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.5.1
Reporter: Jason Li


Add a displayOrder variable to WebUITab to specify the order in which tabs 
appear. Currently, tabs are ordered by when they are attached, which isn't 
always desirable. The default is MIN_VALUE, meaning a tab that doesn't specify 
displayOrder keeps its attach order and appears before any tab with a 
non-default displayOrder. For example, we would like the SQL tab to appear 
before the Connect tab; however, based on the code flow, the Connect tab is 
attached first, so with the current logic it would also appear first.
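A minimal sketch of the proposed ordering semantics; Tab here is an illustrative stand-in, not the real org.apache.spark.ui.WebUITab class.

{code:scala}
// Sketch only: the real WebUITab has more members and different wiring.
case class Tab(name: String, displayOrder: Int = Int.MinValue)

object TabOrderingSketch {
  def main(args: Array[String]): Unit = {
    // Attach order: Connect first, then SQL. Connect opts into an explicit
    // displayOrder so it sorts after tabs that keep the default.
    val attached = Seq(Tab("Connect", displayOrder = 1), Tab("SQL"))

    // sortBy is stable, so tabs keeping the default Int.MinValue stay in
    // attach order and render before any tab with an explicit displayOrder.
    val rendered = attached.sortBy(_.displayOrder).map(_.name)

    println(rendered.mkString(", ")) // SQL, Connect
  }
}
{code}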



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44769) Add SQL statement to create an empty array with a type

2023-08-10 Thread Holden Karau (Jira)
Holden Karau created SPARK-44769:


 Summary: Add SQL statement to create an empty array with a type
 Key: SPARK-44769
 URL: https://issues.apache.org/jira/browse/SPARK-44769
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Holden Karau






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44768) Improve WSCG handling of row buffer by accounting for executor memory . Exploding nested arrays can easily lead to out of memory errors.

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44768:

Summary: Improve WSCG handling of row buffer by accounting for executor 
memory  .  Exploding nested arrays can easily lead to out of memory errors.   
(was: Improve WSCG handling of row buffer by accounting for executor memory)

> Improve WSCG handling of row buffer by accounting for executor memory  .  
> Exploding nested arrays can easily lead to out of memory errors. 
> ---
>
> Key: SPARK-44768
> URL: https://issues.apache.org/jira/browse/SPARK-44768
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.2, 3.4.0, 3.4.1
>Reporter: Franck Tago
>Priority: Major
>
> Consider a scenario where you flatten a nested array.
> // e.g. you can use the following steps to create the dataframe
> //create a partClass case class
> case class partClass (PARTNAME: String , PartNumber: String , PartPrice : 
> Double )
> //create a nested array array class
> case  class array_array_class (
>  col_int: Int,
>  arr_arr_string : Seq[Seq[String]],
>  arr_arr_bigint : Seq[Seq[Long]],
>  col_string     : String,
>  parts_s        : Seq[Seq[partClass]]
>  
> )
> //create a dataframe
> var df_array_array = sc.parallelize(
>  Seq(
>  (1,Seq(Seq("a","b" ,"c" ,"d") ,Seq("aa","bb" ,"cc","dd")) , 
> Seq(Seq(1000,2), Seq(3,-1)),"ItemPart1",
>   Seq(Seq(partClass("PNAME1","P1",20.75),partClass("PNAME1_1","P1_1",30.75)))
>  ) ,
>  
>  (2,Seq(Seq("ab","bc" ,"cd" ,"de") ,Seq("aab","bbc" 
> ,"ccd","dde"),Seq("aab")) , Seq(Seq(-1000,-2,-1,-2), 
> Seq(0,3,-1)),"ItemPart2",
>   
> Seq(Seq(partClass("PNAME2","P2",170.75),partClass("PNAME2_1","P2_1",33.75),partClass("PNAME2_2","P2_2",73.75)))
>  )
>   
>  )
> ).toDF("c1" ,"c2" ,"c3" ,"c4" ,"c5")
> //explode a nested array 
> var  result   =  df_array_array.select( col("c1"), 
> explode(col("c2"))).select('c1 , explode('col))
> result.explain
>  
> The physical plan for this operator is seen below.
> -
> Physical plan 
> == Physical Plan ==
> *(1) Generate explode(col#27), [c1#17], false, [col#30]
> +- *(1) Filter ((size(col#27, true) > 0) AND isnotnull(col#27))
>    +- *(1) Generate explode(c2#18), [c1#17], false, [col#27]
>       +- *(1) Project [_1#6 AS c1#17, _2#7 AS c2#18]
>          +- *(1) Filter ((size(_2#7, true) > 0) AND isnotnull(_2#7))
>             +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, 
> scala.Tuple5, true]))._1 AS _1#6, mapobjects(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -1), 
> mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), 
> true, -2), staticinvoke(class org.apache.spark.unsafe.types.UTF8String, 
> StringType, fromString, validateexternaltype(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -2), StringType, ObjectType(class 
> java.lang.String)), true, false, true), 
> validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -1), ArrayType(StringType,true), 
> ObjectType(interface scala.collection.Seq)), None), 
> knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._2, None) AS _2#7, 
> mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), 
> true, -3), mapobjects(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -4), 
> assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -4), IntegerType, IntegerType)), 
> validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -3), ArrayType(IntegerType,false), 
> ObjectType(interface scala.collection.Seq)), None), 
> knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._3, None) AS _3#8, 
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._4, 
> true, false, true) AS _4#9, mapobjects(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -5), 
> mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), 
> true, -6), if (isnull(validateexternaltype(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -6), 
> StructField(PARTNAME,StringType,true), 
> StructField(PartNumber,StringType,true), 
> StructField(PartPrice,DoubleType,false), ObjectType(class 
> $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$partClass null else 
> named_struct(PARTNAME, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> knownnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> 

[jira] [Created] (SPARK-44768) Improve WSCG handling of row buffer by accounting for executor memory

2023-08-10 Thread Franck Tago (Jira)
Franck Tago created SPARK-44768:
---

 Summary: Improve WSCG handling of row buffer by accounting for 
executor memory
 Key: SPARK-44768
 URL: https://issues.apache.org/jira/browse/SPARK-44768
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 3.4.1, 3.4.0, 3.3.2
Reporter: Franck Tago


Consider a scenario where you flatten a nested array.

// e.g. you can use the following steps to create the dataframe

//create a partClass case class
case class partClass (PARTNAME: String , PartNumber: String , PartPrice : 
Double )

//create a nested array array class
case  class array_array_class (
 col_int: Int,
 arr_arr_string : Seq[Seq[String]],
 arr_arr_bigint : Seq[Seq[Long]],
 col_string     : String,
 parts_s        : Seq[Seq[partClass]]
 
)

//create a dataframe
var df_array_array = sc.parallelize(
 Seq(
 (1,Seq(Seq("a","b" ,"c" ,"d") ,Seq("aa","bb" ,"cc","dd")) , 
Seq(Seq(1000,2), Seq(3,-1)),"ItemPart1",
  Seq(Seq(partClass("PNAME1","P1",20.75),partClass("PNAME1_1","P1_1",30.75)))
 ) ,
 
 (2,Seq(Seq("ab","bc" ,"cd" ,"de") ,Seq("aab","bbc" 
,"ccd","dde"),Seq("aab")) , Seq(Seq(-1000,-2,-1,-2), 
Seq(0,3,-1)),"ItemPart2",
  
Seq(Seq(partClass("PNAME2","P2",170.75),partClass("PNAME2_1","P2_1",33.75),partClass("PNAME2_2","P2_2",73.75)))
 )
  
 )

).toDF("c1" ,"c2" ,"c3" ,"c4" ,"c5")

//explode a nested array 

var  result   =  df_array_array.select( col("c1"), 
explode(col("c2"))).select('c1 , explode('col))

result.explain

 

The physical plan for this operator is seen below.

-
Physical plan 

== Physical Plan ==
*(1) Generate explode(col#27), [c1#17], false, [col#30]
+- *(1) Filter ((size(col#27, true) > 0) AND isnotnull(col#27))
   +- *(1) Generate explode(c2#18), [c1#17], false, [col#27]
      +- *(1) Project [_1#6 AS c1#17, _2#7 AS c2#18]
         +- *(1) Filter ((size(_2#7, true) > 0) AND isnotnull(_2#7))
            +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, 
scala.Tuple5, true]))._1 AS _1#6, mapobjects(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -1), 
mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, 
-2), staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -2), StringType, ObjectType(class java.lang.String)), 
true, false, true), validateexternaltype(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -1), ArrayType(StringType,true), 
ObjectType(interface scala.collection.Seq)), None), 
knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._2, None) AS _2#7, 
mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, 
-3), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), 
true, -4), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -4), IntegerType, IntegerType)), 
validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -3), ArrayType(IntegerType,false), 
ObjectType(interface scala.collection.Seq)), None), 
knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._3, None) AS _3#8, 
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._4, true, 
false, true) AS _4#9, mapobjects(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -5), mapobjects(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -6), if 
(isnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -6), StructField(PARTNAME,StringType,true), 
StructField(PartNumber,StringType,true), 
StructField(PartPrice,DoubleType,false), ObjectType(class 
$line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$partClass null else 
named_struct(PARTNAME, staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
knownnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -6), StructField(PARTNAME,StringType,true), 
StructField(PartNumber,StringType,true), 
StructField(PartPrice,DoubleType,false), ObjectType(class 
$line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$partClass))).PARTNAME, true, 
false, true), PartNumber, staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
knownnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -6), StructField(PARTNAME,StringType,true), 
StructField(PartNumber,StringType,true), 
StructField(PartPrice,DoubleType,false), ObjectType(class 
$line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$partClass))).PartNumber, true, 
false, true), PartPrice, 

[jira] [Commented] (SPARK-44767) Plugin API for PySpark and SparkR workers

2023-08-10 Thread Willi Raschkowski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752926#comment-17752926
 ] 

Willi Raschkowski commented on SPARK-44767:
---

I put up a proposal implementation here: 
https://github.com/apache/spark/pull/42440

> Plugin API for PySpark and SparkR workers
> -
>
> Key: SPARK-44767
> URL: https://issues.apache.org/jira/browse/SPARK-44767
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Willi Raschkowski
>Priority: Major
>
> An API to customize Python and R workers allows for extensibility beyond what 
> can be expressed via static configs and environment variables like, e.g., 
> {{spark.pyspark.python}}.
> A use case for this is overriding {{PATH}} when using {{spark.archives}} 
> with, say, conda-pack (as documented 
> [here|https://spark.apache.org/docs/3.1.1/api/python/user_guide/python_packaging.html#using-conda]).
>  Some packages rely on binaries. And if we want to use those packages in 
> Spark, we need to include their binaries in the {{PATH}}.
> But we can't set the {{PATH}} via some config because 1) the environment with 
> its binaries may be at a dynamic location (archives are unpacked on the 
> driver [into a directory with random 
> name|https://github.com/apache/spark/blob/5db87787d5cc1cefb51ec77e49bac7afaa46d300/core/src/main/scala/org/apache/spark/SparkFiles.scala#L33-L37]),
>  and 2) we may not want to override the {{PATH}} that's pre-configured on the 
> hosts.
> Other use cases unlocked by this include overriding the executable 
> dynamically (e.g., to select a version) or forking/redirecting the worker's 
> output stream.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44767) Plugin API for PySpark and SparkR workers

2023-08-10 Thread Willi Raschkowski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Willi Raschkowski updated SPARK-44767:
--
Summary: Plugin API for PySpark and SparkR workers  (was: Plugin API for 
PySpark and SparkR subprocesses)

> Plugin API for PySpark and SparkR workers
> -
>
> Key: SPARK-44767
> URL: https://issues.apache.org/jira/browse/SPARK-44767
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Willi Raschkowski
>Priority: Major
>
> An API to customize Python and R workers allows for extensibility beyond what 
> can be expressed via static configs and environment variables like, e.g., 
> {{spark.pyspark.python}}.
> A use case for this is overriding {{PATH}} when using {{spark.archives}} 
> with, say, conda-pack (as documented 
> [here|https://spark.apache.org/docs/3.1.1/api/python/user_guide/python_packaging.html#using-conda]).
>  Some packages rely on binaries. And if we want to use those packages in 
> Spark, we need to include their binaries in the {{PATH}}.
> But we can't set the {{PATH}} via some config because 1) the environment with 
> its binaries may be at a dynamic location (archives are unpacked on the 
> driver [into a directory with random 
> name|https://github.com/apache/spark/blob/5db87787d5cc1cefb51ec77e49bac7afaa46d300/core/src/main/scala/org/apache/spark/SparkFiles.scala#L33-L37]),
>  and 2) we may not want to override the {{PATH}} that's pre-configured on the 
> hosts.
> Other use cases unlocked by this include overriding the executable 
> dynamically (e.g., to select a version) or forking/redirecting the worker's 
> output stream.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44767) Plugin API for PySpark and SparkR subprocesses

2023-08-10 Thread Willi Raschkowski (Jira)
Willi Raschkowski created SPARK-44767:
-

 Summary: Plugin API for PySpark and SparkR subprocesses
 Key: SPARK-44767
 URL: https://issues.apache.org/jira/browse/SPARK-44767
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.4.1
Reporter: Willi Raschkowski


An API to customize Python and R workers allows for extensibility beyond what 
can be expressed via static configs and environment variables like, e.g., 
{{spark.pyspark.python}}.

A use case we had for this is overriding {{PATH}} when using {{spark.archives}} 
with, say, conda-pack (as documented 
[here|https://spark.apache.org/docs/3.1.1/api/python/user_guide/python_packaging.html#using-conda]).
 Some packages rely on binaries. And if we want to use those packages in Spark, 
we need to include their binaries in the {{PATH}}.

But we can't set the {{PATH}} via some config because 1) the environment with 
its binaries may be at a dynamic location (archives are unpacked on the driver 
[into a directory with random 
name|https://github.com/apache/spark/blob/5db87787d5cc1cefb51ec77e49bac7afaa46d300/core/src/main/scala/org/apache/spark/SparkFiles.scala#L33-L37]),
 and 2) we may not want to override the {{PATH}} that's pre-configured on the 
hosts.

Other use cases unlocked by this include overriding the executable dynamically 
(e.g., to select a version) or forking/redirecting the worker's output stream.
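To make the idea concrete, a purely hypothetical sketch of what a worker-environment plugin could look like. The trait name, method, and wiring below are invented for illustration and are not an existing Spark API; the actual interface is whatever the linked proposal settles on. Only SparkFiles.getRootDirectory() is a real call, and the "environment/bin" path assumes a conda-pack archive unpacked under that name.

{code:scala}
// Hypothetical plugin shape: Spark would call this before launching a worker.
trait PythonWorkerEnvPlugin {
  /** Given the env Spark would use, return the env the worker should get. */
  def customize(env: Map[String, String]): Map[String, String]
}

// Example: prepend the unpacked conda environment's bin/ directory to PATH.
// The archive location is only known at runtime, which is why a static config
// cannot express this.
class CondaPathPlugin extends PythonWorkerEnvPlugin {
  override def customize(env: Map[String, String]): Map[String, String] = {
    val condaBin = s"${org.apache.spark.SparkFiles.getRootDirectory()}/environment/bin"
    env.updated("PATH", s"$condaBin:${env.getOrElse("PATH", "")}")
  }
}
{code}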



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44767) Plugin API for PySpark and SparkR subprocesses

2023-08-10 Thread Willi Raschkowski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Willi Raschkowski updated SPARK-44767:
--
Description: 
An API to customize Python and R workers allows for extensibility beyond what 
can be expressed via static configs and environment variables like, e.g., 
{{spark.pyspark.python}}.

A use case for this is overriding {{PATH}} when using {{spark.archives}} with, 
say, conda-pack (as documented 
[here|https://spark.apache.org/docs/3.1.1/api/python/user_guide/python_packaging.html#using-conda]).
 Some packages rely on binaries. And if we want to use those packages in Spark, 
we need to include their binaries in the {{PATH}}.

But we can't set the {{PATH}} via some config because 1) the environment with 
its binaries may be at a dynamic location (archives are unpacked on the driver 
[into a directory with random 
name|https://github.com/apache/spark/blob/5db87787d5cc1cefb51ec77e49bac7afaa46d300/core/src/main/scala/org/apache/spark/SparkFiles.scala#L33-L37]),
 and 2) we may not want to override the {{PATH}} that's pre-configured on the 
hosts.

Other use cases unlocked by this include overriding the executable dynamically 
(e.g., to select a version) or forking/redirecting the worker's output stream.

  was:
An API to customize Python and R workers allows for extensibility beyond what 
can be expressed via static configs and environment variables like, e.g., 
{{spark.pyspark.python}}.

A use case we had for this is overriding {{PATH}} when using {{spark.archives}} 
with, say, conda-pack (as documented 
[here|https://spark.apache.org/docs/3.1.1/api/python/user_guide/python_packaging.html#using-conda]).
 Some packages rely on binaries. And if we want to use those packages in Spark, 
we need to include their binaries in the {{PATH}}.

But we can't set the {{PATH}} via some config because 1) the environment with 
its binaries may be at a dynamic location (archives are unpacked on the driver 
[into a directory with random 
name|https://github.com/apache/spark/blob/5db87787d5cc1cefb51ec77e49bac7afaa46d300/core/src/main/scala/org/apache/spark/SparkFiles.scala#L33-L37]),
 and 2) we may not want to override the {{PATH}} that's pre-configured on the 
hosts.

Other use cases unlocked by this include overriding the executable dynamically 
(e.g., to select a version) or forking/redirecting the worker's output stream.


> Plugin API for PySpark and SparkR subprocesses
> --
>
> Key: SPARK-44767
> URL: https://issues.apache.org/jira/browse/SPARK-44767
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Willi Raschkowski
>Priority: Major
>
> An API to customize Python and R workers allows for extensibility beyond what 
> can be expressed via static configs and environment variables like, e.g., 
> {{spark.pyspark.python}}.
> A use case for this is overriding {{PATH}} when using {{spark.archives}} 
> with, say, conda-pack (as documented 
> [here|https://spark.apache.org/docs/3.1.1/api/python/user_guide/python_packaging.html#using-conda]).
>  Some packages rely on binaries. And if we want to use those packages in 
> Spark, we need to include their binaries in the {{PATH}}.
> But we can't set the {{PATH}} via some config because 1) the environment with 
> its binaries may be at a dynamic location (archives are unpacked on the 
> driver [into a directory with random 
> name|https://github.com/apache/spark/blob/5db87787d5cc1cefb51ec77e49bac7afaa46d300/core/src/main/scala/org/apache/spark/SparkFiles.scala#L33-L37]),
>  and 2) we may not want to override the {{PATH}} that's pre-configured on the 
> hosts.
> Other use cases unlocked by this include overriding the executable 
> dynamically (e.g., to select a version) or forking/redirecting the worker's 
> output stream.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44766) Cache the pandas converter for Python UDTFs

2023-08-10 Thread Allison Wang (Jira)
Allison Wang created SPARK-44766:


 Summary: Cache the pandas converter for Python UDTFs
 Key: SPARK-44766
 URL: https://issues.apache.org/jira/browse/SPARK-44766
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Allison Wang


We should cache the pandas converter when serializing a pandas DataFrame to an 
Arrow batch.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44765) Make ReleaseExecute retry in ExecutePlanResponseReattachableIterator reuse common mechanism

2023-08-10 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-44765:
-

 Summary: Make ReleaseExecute retry in 
ExecutePlanResponseReattachableIterator reuse common mechanism
 Key: SPARK-44765
 URL: https://issues.apache.org/jira/browse/SPARK-44765
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.5.0
Reporter: Juliusz Sompolski






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44764) Streaming process improvement

2023-08-10 Thread Wei Liu (Jira)
Wei Liu created SPARK-44764:
---

 Summary: Streaming process improvement
 Key: SPARK-44764
 URL: https://issues.apache.org/jira/browse/SPARK-44764
 Project: Spark
  Issue Type: New Feature
  Components: Connect, Structured Streaming
Affects Versions: 4.0.0
Reporter: Wei Liu


# Deduplicate or remove StreamingPythonRunner if possible; it is very similar 
to the existing PythonRunner
 # Use the DAEMON mode for starting a new process



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44759) Do not combine multiple Generate operators in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Description: 
This has been an issue since the Generate node was implemented in WSCG.

Because WSCG computes rows in batches, the combination of WSCG and the explode 
operation consumes a lot of the dedicated executor memory. This is even more 
true when the WSCG node contains multiple explode nodes, which is the case when 
flattening a nested array.

The Generate node used to flatten an array generally produces significantly 
more output rows than input rows.

The number of output rows is drastically higher when flattening a nested array.

When we combine more than one Generate node in the same WholeStageCodeGen node, 
we run a high risk of running out of memory for multiple reasons.

1. As you can see from the snapshots added in the comments, the rows created in 
the nested loop are saved in a writer buffer. In this case, because the rows 
were big, the job failed with an OutOfMemory error.

2. The generated WholeStageCodeGen results in a nested loop that, for each row, 
explodes the parent array and then explodes the inner array. The rows are 
accumulated in the writer buffer without accounting for the row size.

Please view the attached Spark UI and Spark DAG screenshots.

In my case the WholeStageCodeGen node includes two explode nodes.

Because the array elements are large, we end up with an OutOfMemory error.

I recommend that we do not merge multiple explode nodes into the same 
WholeStageCodeGen node. Doing so leads to potential memory issues.

In our case, the job execution failed with an OOM error because the WSCG 
executed as a nested for loop. 
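One way to see the combined-Generate plan and compare behavior without codegen is to toggle the existing spark.sql.codegen.wholeStage config. A small diagnostic sketch follows; the tiny DataFrame is only a stand-in for the real nested-array data, and disabling codegen here is a comparison aid, not a general recommendation.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Diagnostic sketch: compare plans with and without whole-stage codegen.
object WscgExplodeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("wscg-explode").getOrCreate()
    import spark.implicits._

    def buildAndExplain(): Unit =
      Seq((1, Seq(Seq("a", "b"), Seq("c", "d"))))
        .toDF("c1", "c2")
        .select(col("c1"), explode(col("c2")))
        .select(col("c1"), explode(col("col"))) // explode's default output column is "col"
        .explain()

    buildAndExplain() // both Generate nodes appear inside one WholeStageCodegen block

    spark.conf.set("spark.sql.codegen.wholeStage", "false")
    buildAndExplain() // same operators, no WholeStageCodegen wrapper

    spark.stop()
  }
}
{code}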

 

 

 

  was:
This is an issue since the WSCG  implementation of the generate node. 

The generate node used to flatten array generally  produces an amount of output 
rows that is significantly higher than the input rows to the node. 

the number of output rows generated is even drastically higher when flattening 
a nested array .

When we combine more that 1 generate node in the same WholeStageCodeGen  node, 
we run  a high risk of running out of memory for multiple reasons. 

1- As you can see from the attachment ,  the rows created in the nested loop 
are saved in writer buffer.  In my case because the rows were big , I hit an 
Out Of Memory Exception error .2_ The generated WholeStageCodeGen result in a 
nested loop that for each row  , will explode the parent array and then explode 
the inner array.  This is prone to OutOfmerry errors

 

Please view the attached Spark Gui and Spark Dag 

In my case the wholestagecodegen includes 2 explode nodes. 

Because the array elements are large , we end up with an Out Of Memory error. 

 

I recommend that we do not merge  multiple explode nodes in the same whole 
stage code gen node . Doing so leads to potential memory issues.

In our case , the job execution failed with an  OOM error because the the WSCG 
executed  into a nested for loop . 

 

 

 


> Do not combine multiple Generate operators in the same WholeStageCodeGen node 
> because it can  easily cause OOM failures if arrays are relatively large
> --
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, 
> image-2023-08-10-09-33-47-788.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> This has been an issue since the Generate node was implemented in WSCG.
> Because WSCG computes rows in batches, the combination of WSCG and the 
> explode operation consumes a lot of the dedicated executor memory. This is 
> even more true when the WSCG node contains multiple explode nodes, which is 
> the case when flattening a nested array.
> The Generate node used to flatten an array generally produces significantly 
> more output rows than input rows.
> The number of output rows is drastically higher when flattening a nested 
> array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory for multiple reasons.
> 1. As you can see from the snapshots added in the comments, the rows created 
> in the nested loop are saved in a writer buffer. In this case 

[jira] [Created] (SPARK-44763) Fix a bug of promoting string as double in binary arithmetic with interval

2023-08-10 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-44763:
--

 Summary: Fix a bug of promoting string as double in binary 
arithmetic with interval  
 Key: SPARK-44763
 URL: https://issues.apache.org/jira/browse/SPARK-44763
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


The following query works on branch-3.5 or below, but fails on the latest 
master:

```
select concat(DATE'2020-12-31', ' ', date_format('09:03:08', 'HH:mm:ss')) + 
(INTERVAL '03' HOUR)
```
 
The direct reason is that we now mark `cast(date as string)` as resolved during 
type coercion after the changes in https://github.com/apache/spark/pull/42089. 
As a result, there are two transforms from CombinedTypeCoercionRule:
```
Rule ConcatCoercion Transformed concat(2020-12-31,  , date_format(cast(09:03:08 
as timestamp), HH:mm:ss, Some(America/Los_Angeles))) to concat(cast(2020-12-31 
as string),  , date_format(cast(09:03:08 as timestamp), HH:mm:ss, 
Some(America/Los_Angeles)))

Rule PromoteStrings Transformed (concat(cast(2020-12-31 as string),  , 
date_format(cast(09:03:08 as timestamp), HH:mm:ss, Some(America/Los_Angeles))) 
+ INTERVAL '03' HOUR) to (cast(concat(cast(2020-12-31 as string),  , 
date_format(cast(09:03:08 as timestamp), HH:mm:ss, Some(America/Los_Angeles))) 
as double) + INTERVAL '03' HOUR)
```  
The second transform doesn't happen in previous releases, since cast(2020-12-31 
as string) used to be unresolved after the first transform.
 
The fix is simple: the analyzer should not promote string to double in binary 
arithmetic with an ANSI interval. The changes in 
https://github.com/apache/spark/pull/42089 are valid and we should keep them.
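For anyone wanting to try the repro outside the SQL shell, the same query can be run through a small Scala driver; the query string is the one quoted above, and whether it fails depends on which build you run it against.

{code:scala}
import org.apache.spark.sql.SparkSession

object Spark44763Repro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("spark-44763").getOrCreate()

    // Works on branch-3.5 and earlier; fails on the affected master builds.
    spark.sql(
      "select concat(DATE'2020-12-31', ' ', date_format('09:03:08', 'HH:mm:ss')) " +
        "+ (INTERVAL '03' HOUR)"
    ).show(truncate = false)

    spark.stop()
  }
}
{code}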
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44759) Do not combine multiple Generate operators in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Summary: Do not combine multiple Generate operators in the same 
WholeStageCodeGen node because it can  easily cause OOM failures if arrays are 
relatively large  (was: Do not combine multiple Generate nodes in the same 
WholeStageCodeGen node because it can  easily cause OOM failures if arrays are 
relatively large)

> Do not combine multiple Generate operators in the same WholeStageCodeGen node 
> because it can  easily cause OOM failures if arrays are relatively large
> --
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, 
> image-2023-08-10-09-33-47-788.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> This has been an issue since the Generate node was implemented in WSCG. 
> The Generate node used to flatten an array generally produces significantly 
> more output rows than input rows to the node. 
> The number of output rows is drastically higher when flattening a nested 
> array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory for multiple reasons. 
> 1. As you can see from the attachment, the rows created in the nested loop 
> are saved in a writer buffer. In my case, because the rows were big, I hit an 
> OutOfMemory error. 2. The generated WholeStageCodeGen results in a nested 
> loop that, for each row, explodes the parent array and then explodes the 
> inner array. This is prone to OutOfMemory errors.
>  
> Please view the attached Spark UI and Spark DAG screenshots. 
> In my case the WholeStageCodeGen node includes two explode nodes. 
> Because the array elements are large, we end up with an OutOfMemory error. 
>  
> I recommend that we do not merge multiple explode nodes into the same 
> WholeStageCodeGen node. Doing so leads to potential memory issues.
> In our case, the job execution failed with an OOM error because the WSCG 
> executed as a nested for loop. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Summary: Do not combine multiple Generate nodes in the same 
WholeStageCodeGen node because it can  easily cause OOM failures if arrays are 
relatively large  (was: Do not combine multiple Generate nodes in the same 
WholeStageCodeGen nodebecause it can  easily cause OOM failures if arrays are 
relatively large)

> Do not combine multiple Generate nodes in the same WholeStageCodeGen node 
> because it can  easily cause OOM failures if arrays are relatively large
> --
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, 
> image-2023-08-10-09-33-47-788.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> This has been an issue since the Generate node was implemented in WSCG. 
> The Generate node used to flatten an array generally produces significantly 
> more output rows than input rows to the node. 
> The number of output rows is drastically higher when flattening a nested 
> array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory for multiple reasons. 
> 1. As you can see from the attachment, the rows created in the nested loop 
> are saved in a writer buffer. In my case, because the rows were big, I hit an 
> OutOfMemory error. 2. The generated WholeStageCodeGen results in a nested 
> loop that, for each row, explodes the parent array and then explodes the 
> inner array. This is prone to OutOfMemory errors.
>  
> Please view the attached Spark UI and Spark DAG screenshots. 
> In my case the WholeStageCodeGen node includes two explode nodes. 
> Because the array elements are large, we end up with an OutOfMemory error. 
>  
> I recommend that we do not merge multiple explode nodes into the same 
> WholeStageCodeGen node. Doing so leads to potential memory issues.
> In our case, the job execution failed with an OOM error because the WSCG 
> executed as a nested for loop. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen nodebecause it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Description: 
This has been an issue since the Generate node was implemented in WSCG. 

The Generate node used to flatten an array generally produces significantly more 
output rows than input rows to the node. 

The number of output rows is drastically higher when flattening a nested array.

When we combine more than one Generate node in the same WholeStageCodeGen node, 
we run a high risk of running out of memory for multiple reasons. 

1. As you can see from the attachment, the rows created in the nested loop are 
saved in a writer buffer. In my case, because the rows were big, I hit an 
OutOfMemory error. 2. The generated WholeStageCodeGen results in a nested loop 
that, for each row, explodes the parent array and then explodes the inner array. 
This is prone to OutOfMemory errors.

Please view the attached Spark UI and Spark DAG screenshots. 

In my case the WholeStageCodeGen node includes two explode nodes. 

Because the array elements are large, we end up with an OutOfMemory error. 

I recommend that we do not merge multiple explode nodes into the same 
WholeStageCodeGen node. Doing so leads to potential memory issues.

In our case, the job execution failed with an OOM error because the WSCG 
executed as a nested for loop. 

 

 

 

  was:
The generate node used to flatten array generally  produces an amount of output 
rows that is significantly higher than the input rows to the node. 

the number of output rows generated is even drastically higher when flattening 
a nested array .

When we combine more that 1 generate node in the same WholeStageCodeGen  node, 
we run  a high risk of running out of memory for multiple reasons. 

1- As you can see from the attachment ,  the rows created in the nested loop 
are saved in writer buffer.  In my case because the rows were big , I hit an 
Out Of Memory Exception error .2_ The generated WholeStageCodeGen result in a 
nested loop that for each row  , will explode the parent array and then explode 
the inner array.  This is prone to OutOfmerry errors

 

Please view the attached Spark Gui and Spark Dag 

In my case the wholestagecodegen includes 2 explode nodes. 

Because the array elements are large , we end up with an Out Of Memory error. 

 

I recommend that we do not merge  multiple explode nodes in the same whole 
stage code gen node . Doing so leads to potential memory issues.

In our case , the job execution failed with an  OOM error because the the WSCG 
executed  into a nested for loop . 

 

 

 


> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> nodebecause it can  easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, 
> image-2023-08-10-09-33-47-788.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> This has been an issue since the Generate node was implemented in WSCG. 
> The Generate node used to flatten an array generally produces significantly 
> more output rows than input rows to the node. 
> The number of output rows is drastically higher when flattening a nested 
> array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory for multiple reasons. 
> 1. As you can see from the attachment, the rows created in the nested loop 
> are saved in a writer buffer. In my case, because the rows were big, I hit an 
> OutOfMemory error. 2. The generated WholeStageCodeGen results in a nested 
> loop that, for each row, explodes the parent array and then explodes the 
> inner array. This is prone to OutOfMemory errors.
>  
> Please view the attached Spark UI and Spark DAG screenshots. 
> In my case the WholeStageCodeGen node includes two explode nodes. 
> Because the array elements are large, we end up with an OutOfMemory error. 
>  
> I recommend that we do not merge multiple explode nodes into the same 
> WholeStageCodeGen node. Doing so leads to potential memory issues.
> In our case, the job execution failed with an OOM error because the 
> WSCG executed  

[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen nodebecause it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Affects Version/s: 3.4.1
   3.4.0
   3.3.2
   3.2.4
   3.2.3
   3.2.2
   3.2.1
   3.1.3
   3.2.0
   3.1.2
   3.1.1
   3.1.0
   3.0.3
   3.0.2
   3.0.1
   3.0.0

> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> nodebecause it can  easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, 
> image-2023-08-10-09-33-47-788.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> The Generate node used to flatten an array generally produces significantly 
> more output rows than it receives as input. The number of output rows is even 
> more dramatic when flattening a nested array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory, for multiple reasons:
> 1. As you can see from the attachment, the rows created in the nested loop 
> are saved in the writer buffer. In my case the rows were large, so I hit an 
> OutOfMemoryError.
> 2. The generated WholeStageCodeGen results in a nested loop that, for each 
> row, explodes the parent array and then explodes the inner array. This is 
> prone to OutOfMemory errors.
>  
> Please view the attached Spark GUI and Spark DAG. 
> In my case the WholeStageCodeGen includes 2 explode nodes. 
> Because the array elements are large, we end up with an OutOfMemory error. 
>  
> I recommend that we do not merge multiple explode nodes into the same 
> WholeStageCodeGen node. Doing so leads to potential memory issues.
> In our case, the job execution failed with an OOM error because the WSCG 
> executed a nested for loop.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen nodebecause it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Description: 
The Generate node used to flatten an array generally produces significantly more 
output rows than it receives as input.

The number of output rows is even more dramatic when flattening a nested array.

When we combine more than one Generate node in the same WholeStageCodeGen node, 
we run a high risk of running out of memory, for multiple reasons:

1. As you can see from the attachment, the rows created in the nested loop are 
saved in the writer buffer. In my case the rows were large, so I hit an 
OutOfMemoryError.
2. The generated WholeStageCodeGen results in a nested loop that, for each row, 
explodes the parent array and then explodes the inner array. This is prone to 
OutOfMemory errors.

Please view the attached Spark GUI and Spark DAG.

In my case the WholeStageCodeGen includes 2 explode nodes.

Because the array elements are large, we end up with an OutOfMemory error.

I recommend that we do not merge multiple explode nodes into the same 
WholeStageCodeGen node. Doing so leads to potential memory issues.

In our case, the job execution failed with an OOM error because the WSCG 
executed a nested for loop.

 

 

 

  was:
The Generate node used to flatten an array generally produces significantly more 
output rows than it receives as input.

The number of output rows is even more dramatic when flattening a nested array.

When we combine more than one Generate node in the same WholeStageCodeGen node, 
we run a high risk of running out of memory, for multiple reasons:

1. As you can see from the attachment, the rows created in the nested loop are 
saved in the writer buffer. In my case the rows were large, so I hit an 
OutOfMemoryError.
2. The generated WholeStageCodeGen results in a nested loop that, for each row, 
explodes the parent array and then explodes the inner array. This is prone to 
OutOfMemory errors.

Please view the attached Spark GUI and Spark DAG.

In my case the WholeStageCodeGen includes 2 explode nodes.

Because the array elements are large, we end up with an OutOfMemory error.

I recommend that we do not merge multiple explode nodes into the same 
WholeStageCodeGen node. Doing so leads to potential memory issues.

 

 

 

 

 

 

 

WSCG method for first Generate node

!image-2023-08-10-09-22-29-949.png!

 

WSCG for second Generate node 

As you can see the execution of generate_doConsume_1 and generate_doConsume_0  
triggers a nested loop.

!image-2023-08-10-09-24-02-755.png!

 


> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> nodebecause it can  easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, 
> image-2023-08-10-09-33-47-788.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> The Generate node used to flatten an array generally produces significantly 
> more output rows than it receives as input. The number of output rows is even 
> more dramatic when flattening a nested array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory, for multiple reasons:
> 1. As you can see from the attachment, the rows created in the nested loop 
> are saved in the writer buffer. In my case the rows were large, so I hit an 
> OutOfMemoryError.
> 2. The generated WholeStageCodeGen results in a nested loop that, for each 
> row, explodes the parent array and then explodes the inner array. This is 
> prone to OutOfMemory errors.
>  
> Please view the attached Spark GUI and Spark DAG. 
> In my case the WholeStageCodeGen includes 2 explode nodes. 
> Because the array elements are large, we end up with an OutOfMemory error. 
>  
> I recommend that we do not merge multiple explode nodes into the same 
> WholeStageCodeGen node. Doing so leads to potential memory issues.
> In our case, the job execution failed with an OOM error because the WSCG 
> executed a nested for loop.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen nodebecause it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Description: 
The Generate node used to flatten an array generally produces significantly more 
output rows than it receives as input.

The number of output rows is even more dramatic when flattening a nested array.

When we combine more than one Generate node in the same WholeStageCodeGen node, 
we run a high risk of running out of memory, for multiple reasons:

1. As you can see from the attachment, the rows created in the nested loop are 
saved in the writer buffer. In my case the rows were large, so I hit an 
OutOfMemoryError.
2. The generated WholeStageCodeGen results in a nested loop that, for each row, 
explodes the parent array and then explodes the inner array. This is prone to 
OutOfMemory errors.

Please view the attached Spark GUI and Spark DAG.

In my case the WholeStageCodeGen includes 2 explode nodes.

Because the array elements are large, we end up with an OutOfMemory error.

I recommend that we do not merge multiple explode nodes into the same 
WholeStageCodeGen node. Doing so leads to potential memory issues.

 

 

 

 

 

 

 

WSCG method for first Generate node

!image-2023-08-10-09-22-29-949.png!

 

WSCG for second Generate node 

As you can see the execution of generate_doConsume_1 and generate_doConsume_0  
triggers a nested loop.

!image-2023-08-10-09-24-02-755.png!

 

  was:
The Generate node used to flatten an array generally produces significantly more 
output rows than it receives as input.

The number of output rows is even more dramatic when flattening a nested array.

When we combine more than one Generate node in the same WholeStageCodeGen node, 
we run a high risk of running out of memory, for multiple reasons:

1. As you can see from the attachment, the rows created in the nested loop are 
saved in the writer buffer. In my case the rows were large, so I hit an 
OutOfMemoryError.
2. The generated WholeStageCodeGen results in a nested loop that, for each row, 
explodes the parent array and then explodes the inner array. This is prone to 
OutOfMemory errors.

Please view the attached Spark GUI and Spark DAG.

In my case the WholeStageCodeGen includes 2 explode nodes.

Because the array elements are large, we end up with an OutOfMemory error.

I recommend that we do not merge multiple explode nodes into the same 
WholeStageCodeGen node. Doing so leads to potential memory issues.

 

!image-2023-08-10-02-18-24-758.png!

 

!image-2023-08-10-09-19-23-973.png!

 

!image-2023-08-10-09-21-32-371.png!

 

WSCG method for first Generate node

!image-2023-08-10-09-22-29-949.png!

 

WSCG for second Generate node 

As you can see the execution of generate_doConsume_1 and generate_doConsume_0  
triggers a nested loop.

!image-2023-08-10-09-24-02-755.png!

 


> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> nodebecause it can  easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, 
> image-2023-08-10-09-33-47-788.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> The Generate node used to flatten an array generally produces significantly 
> more output rows than it receives as input. The number of output rows is even 
> more dramatic when flattening a nested array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory, for multiple reasons:
> 1. As you can see from the attachment, the rows created in the nested loop 
> are saved in the writer buffer. In my case the rows were large, so I hit an 
> OutOfMemoryError.
> 2. The generated WholeStageCodeGen results in a nested loop that, for each 
> row, explodes the parent array and then explodes the inner array. This is 
> prone to OutOfMemory errors.
>  
> Please view the attached Spark GUI and Spark DAG. 
> In my case the WholeStageCodeGen includes 2 explode nodes. 
> Because the array elements are large, we end up with an OutOfMemory error. 
>  
> I recommend that we do not merge multiple explode nodes into the same 
> WholeStageCodeGen node. Doing so leads to potential memory issues.

[jira] [Commented] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen nodebecause it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752870#comment-17752870
 ] 

Franck Tago commented on SPARK-44759:
-

Spark DAG for the use case. The failure comes from the execution of 
WholeStageCodegen(2).

!image-2023-08-10-09-33-47-788.png!

> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> nodebecause it can  easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, 
> image-2023-08-10-09-33-47-788.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> The Generate node used to flatten an array generally produces significantly 
> more output rows than it receives as input. The number of output rows is even 
> more dramatic when flattening a nested array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory, for multiple reasons:
> 1. As you can see from the attachment, the rows created in the nested loop 
> are saved in the writer buffer. In my case the rows were large, so I hit an 
> OutOfMemoryError.
> 2. The generated WholeStageCodeGen results in a nested loop that, for each 
> row, explodes the parent array and then explodes the inner array. This is 
> prone to OutOfMemory errors.
>  
> Please view the attached Spark GUI and Spark DAG. 
> In my case the WholeStageCodeGen includes 2 explode nodes. 
> Because the array elements are large, we end up with an OutOfMemory error. 
>  
> I recommend that we do not merge multiple explode nodes into the same 
> WholeStageCodeGen node. Doing so leads to potential memory issues.
>  
> !image-2023-08-10-02-18-24-758.png!
>  
> !image-2023-08-10-09-19-23-973.png!
>  
> !image-2023-08-10-09-21-32-371.png!
>  
> WSCG method for first Generate node
> !image-2023-08-10-09-22-29-949.png!
>  
> WSCG for second Generate node 
> As you can see the execution of generate_doConsume_1 and generate_doConsume_0 
>  triggers a nested loop.
> !image-2023-08-10-09-24-02-755.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen nodebecause it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Attachment: image-2023-08-10-09-33-47-788.png

> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> nodebecause it can  easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, 
> image-2023-08-10-09-33-47-788.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> The Generate node used to flatten an array generally produces significantly 
> more output rows than it receives as input. The number of output rows is even 
> more dramatic when flattening a nested array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory, for multiple reasons:
> 1. As you can see from the attachment, the rows created in the nested loop 
> are saved in the writer buffer. In my case the rows were large, so I hit an 
> OutOfMemoryError.
> 2. The generated WholeStageCodeGen results in a nested loop that, for each 
> row, explodes the parent array and then explodes the inner array. This is 
> prone to OutOfMemory errors.
>  
> Please view the attached Spark GUI and Spark DAG. 
> In my case the WholeStageCodeGen includes 2 explode nodes. 
> Because the array elements are large, we end up with an OutOfMemory error. 
>  
> I recommend that we do not merge multiple explode nodes into the same 
> WholeStageCodeGen node. Doing so leads to potential memory issues.
>  
> !image-2023-08-10-02-18-24-758.png!
>  
> !image-2023-08-10-09-19-23-973.png!
>  
> !image-2023-08-10-09-21-32-371.png!
>  
> WSCG method for first Generate node
> !image-2023-08-10-09-22-29-949.png!
>  
> WSCG for second Generate node 
> As you can see the execution of generate_doConsume_1 and generate_doConsume_0 
>  triggers a nested loop.
> !image-2023-08-10-09-24-02-755.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen nodebecause it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752868#comment-17752868
 ] 

Franck Tago commented on SPARK-44759:
-

WSCG-generated code for the second Generate node

!image-2023-08-10-09-32-46-163.png!
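
For readers of the archive who cannot open the screenshot, here is a rough, 
hand-written approximation of the control flow that the fused 
generate_doConsume_0 / generate_doConsume_1 methods amount to; this is an 
illustrative sketch, not the actual Janino-generated code:

{code:scala}
// Illustrative only: hand-written approximation of the nested-loop shape that
// fusing two Generate (explode) nodes into one WholeStageCodegen stage produces.
// Every emitted row is appended to the stage's writer buffer before the stage
// can release it, which is why large array elements translate into memory pressure.
def consumeRow(nested: Array[Array[AnyRef]], appendToWriterBuffer: AnyRef => Unit): Unit = {
  var i = 0
  while (i < nested.length) {            // loop emitted for the outer explode
    val inner = nested(i)
    var j = 0
    while (j < inner.length) {           // loop emitted for the inner explode
      appendToWriterBuffer(inner(j))     // each output row lands in the writer buffer
      j += 1
    }
    i += 1
  }
}
{code}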

> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> nodebecause it can  easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> The Generate node used to flatten an array generally produces significantly 
> more output rows than it receives as input. The number of output rows is even 
> more dramatic when flattening a nested array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory, for multiple reasons:
> 1. As you can see from the attachment, the rows created in the nested loop 
> are saved in the writer buffer. In my case the rows were large, so I hit an 
> OutOfMemoryError.
> 2. The generated WholeStageCodeGen results in a nested loop that, for each 
> row, explodes the parent array and then explodes the inner array. This is 
> prone to OutOfMemory errors.
>  
> Please view the attached Spark GUI and Spark DAG. 
> In my case the WholeStageCodeGen includes 2 explode nodes. 
> Because the array elements are large, we end up with an OutOfMemory error. 
>  
> I recommend that we do not merge multiple explode nodes into the same 
> WholeStageCodeGen node. Doing so leads to potential memory issues.
>  
> !image-2023-08-10-02-18-24-758.png!
>  
> !image-2023-08-10-09-19-23-973.png!
>  
> !image-2023-08-10-09-21-32-371.png!
>  
> WSCG method for first Generate node
> !image-2023-08-10-09-22-29-949.png!
>  
> WSCG for second Generate node 
> As you can see the execution of generate_doConsume_1 and generate_doConsume_0 
>  triggers a nested loop.
> !image-2023-08-10-09-24-02-755.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen nodebecause it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Attachment: image-2023-08-10-09-32-46-163.png

> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> nodebecause it can  easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> The Generate node used to flatten an array generally produces significantly 
> more output rows than it receives as input. The number of output rows is even 
> more dramatic when flattening a nested array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory, for multiple reasons:
> 1. As you can see from the attachment, the rows created in the nested loop 
> are saved in the writer buffer. In my case the rows were large, so I hit an 
> OutOfMemoryError.
> 2. The generated WholeStageCodeGen results in a nested loop that, for each 
> row, explodes the parent array and then explodes the inner array. This is 
> prone to OutOfMemory errors.
>  
> Please view the attached Spark GUI and Spark DAG. 
> In my case the WholeStageCodeGen includes 2 explode nodes. 
> Because the array elements are large, we end up with an OutOfMemory error. 
>  
> I recommend that we do not merge multiple explode nodes into the same 
> WholeStageCodeGen node. Doing so leads to potential memory issues.
>  
> !image-2023-08-10-02-18-24-758.png!
>  
> !image-2023-08-10-09-19-23-973.png!
>  
> !image-2023-08-10-09-21-32-371.png!
>  
> WSCG method for first Generate node
> !image-2023-08-10-09-22-29-949.png!
>  
> WSCG for second Generate node 
> As you can see the execution of generate_doConsume_1 and generate_doConsume_0 
>  triggers a nested loop.
> !image-2023-08-10-09-24-02-755.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen nodebecause it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Attachment: image-2023-08-10-09-29-24-804.png

> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> nodebecause it can  easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> The Generate node used to flatten an array generally produces significantly 
> more output rows than it receives as input. The number of output rows is even 
> more dramatic when flattening a nested array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory, for multiple reasons:
> 1. As you can see from the attachment, the rows created in the nested loop 
> are saved in the writer buffer. In my case the rows were large, so I hit an 
> OutOfMemoryError.
> 2. The generated WholeStageCodeGen results in a nested loop that, for each 
> row, explodes the parent array and then explodes the inner array. This is 
> prone to OutOfMemory errors.
>  
> Please view the attached Spark GUI and Spark DAG. 
> In my case the WholeStageCodeGen includes 2 explode nodes. 
> Because the array elements are large, we end up with an OutOfMemory error. 
>  
> I recommend that we do not merge multiple explode nodes into the same 
> WholeStageCodeGen node. Doing so leads to potential memory issues.
>  
> !image-2023-08-10-02-18-24-758.png!
>  
> !image-2023-08-10-09-19-23-973.png!
>  
> !image-2023-08-10-09-21-32-371.png!
>  
> WSCG method for first Generate node
> !image-2023-08-10-09-22-29-949.png!
>  
> WSCG for second Generate node 
> As you can see the execution of generate_doConsume_1 and generate_doConsume_0 
>  triggers a nested loop.
> !image-2023-08-10-09-24-02-755.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen nodebecause it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752866#comment-17752866
 ] 

Franck Tago commented on SPARK-44759:
-

WSCG-generated code for the first Generate node

!image-2023-08-10-09-29-24-804.png!

> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> nodebecause it can  easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> The Generate node used to flatten an array generally produces significantly 
> more output rows than it receives as input. The number of output rows is even 
> more dramatic when flattening a nested array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory, for multiple reasons:
> 1. As you can see from the attachment, the rows created in the nested loop 
> are saved in the writer buffer. In my case the rows were large, so I hit an 
> OutOfMemoryError.
> 2. The generated WholeStageCodeGen results in a nested loop that, for each 
> row, explodes the parent array and then explodes the inner array. This is 
> prone to OutOfMemory errors.
>  
> Please view the attached Spark GUI and Spark DAG. 
> In my case the WholeStageCodeGen includes 2 explode nodes. 
> Because the array elements are large, we end up with an OutOfMemory error. 
>  
> I recommend that we do not merge multiple explode nodes into the same 
> WholeStageCodeGen node. Doing so leads to potential memory issues.
>  
> !image-2023-08-10-02-18-24-758.png!
>  
> !image-2023-08-10-09-19-23-973.png!
>  
> !image-2023-08-10-09-21-32-371.png!
>  
> WSCG method for first Generate node
> !image-2023-08-10-09-22-29-949.png!
>  
> WSCG for second Generate node 
> As you can see the execution of generate_doConsume_1 and generate_doConsume_0 
>  triggers a nested loop.
> !image-2023-08-10-09-24-02-755.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen nodebecause it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752864#comment-17752864
 ] 

Franck Tago edited comment on SPARK-44759 at 8/10/23 4:28 PM:
--

WSCG-generated code that calls generate_doConsume_0

!image-2023-08-10-09-27-24-124.png!


was (Author: tafra...@gmail.com):
!image-2023-08-10-09-27-24-124.png!

> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> nodebecause it can  easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> The Generate node used to flatten an array generally produces significantly 
> more output rows than it receives as input. The number of output rows is even 
> more dramatic when flattening a nested array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory, for multiple reasons:
> 1. As you can see from the attachment, the rows created in the nested loop 
> are saved in the writer buffer. In my case the rows were large, so I hit an 
> OutOfMemoryError.
> 2. The generated WholeStageCodeGen results in a nested loop that, for each 
> row, explodes the parent array and then explodes the inner array. This is 
> prone to OutOfMemory errors.
>  
> Please view the attached Spark GUI and Spark DAG. 
> In my case the WholeStageCodeGen includes 2 explode nodes. 
> Because the array elements are large, we end up with an OutOfMemory error. 
>  
> I recommend that we do not merge multiple explode nodes into the same 
> WholeStageCodeGen node. Doing so leads to potential memory issues.
>  
> !image-2023-08-10-02-18-24-758.png!
>  
> !image-2023-08-10-09-19-23-973.png!
>  
> !image-2023-08-10-09-21-32-371.png!
>  
> WSCG method for first Generate node
> !image-2023-08-10-09-22-29-949.png!
>  
> WSCG for second Generate node 
> As you can see the execution of generate_doConsume_1 and generate_doConsume_0 
>  triggers a nested loop.
> !image-2023-08-10-09-24-02-755.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen nodebecause it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Attachment: image-2023-08-10-09-27-24-124.png

> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> nodebecause it can  easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> The Generate node used to flatten an array generally produces significantly 
> more output rows than it receives as input. The number of output rows is even 
> more dramatic when flattening a nested array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory, for multiple reasons:
> 1. As you can see from the attachment, the rows created in the nested loop 
> are saved in the writer buffer. In my case the rows were large, so I hit an 
> OutOfMemoryError.
> 2. The generated WholeStageCodeGen results in a nested loop that, for each 
> row, explodes the parent array and then explodes the inner array. This is 
> prone to OutOfMemory errors.
>  
> Please view the attached Spark GUI and Spark DAG. 
> In my case the WholeStageCodeGen includes 2 explode nodes. 
> Because the array elements are large, we end up with an OutOfMemory error. 
>  
> I recommend that we do not merge multiple explode nodes into the same 
> WholeStageCodeGen node. Doing so leads to potential memory issues.
>  
> !image-2023-08-10-02-18-24-758.png!
>  
> !image-2023-08-10-09-19-23-973.png!
>  
> !image-2023-08-10-09-21-32-371.png!
>  
> WSCG method for first Generate node
> !image-2023-08-10-09-22-29-949.png!
>  
> WSCG for second Generate node 
> As you can see the execution of generate_doConsume_1 and generate_doConsume_0 
>  triggers a nested loop.
> !image-2023-08-10-09-24-02-755.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen nodebecause it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752864#comment-17752864
 ] 

Franck Tago commented on SPARK-44759:
-

!image-2023-08-10-09-27-24-124.png!

> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> nodebecause it can  easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> The Generate node used to flatten an array generally produces significantly 
> more output rows than it receives as input. The number of output rows is even 
> more dramatic when flattening a nested array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory, for multiple reasons:
> 1. As you can see from the attachment, the rows created in the nested loop 
> are saved in the writer buffer. In my case the rows were large, so I hit an 
> OutOfMemoryError.
> 2. The generated WholeStageCodeGen results in a nested loop that, for each 
> row, explodes the parent array and then explodes the inner array. This is 
> prone to OutOfMemory errors.
>  
> Please view the attached Spark GUI and Spark DAG. 
> In my case the WholeStageCodeGen includes 2 explode nodes. 
> Because the array elements are large, we end up with an OutOfMemory error. 
>  
> I recommend that we do not merge multiple explode nodes into the same 
> WholeStageCodeGen node. Doing so leads to potential memory issues.
>  
> !image-2023-08-10-02-18-24-758.png!
>  
> !image-2023-08-10-09-19-23-973.png!
>  
> !image-2023-08-10-09-21-32-371.png!
>  
> WSCG method for first Generate node
> !image-2023-08-10-09-22-29-949.png!
>  
> WSCG for second Generate node 
> As you can see the execution of generate_doConsume_1 and generate_doConsume_0 
>  triggers a nested loop.
> !image-2023-08-10-09-24-02-755.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen nodebecause it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Summary: Do not combine multiple Generate nodes in the same 
WholeStageCodeGen nodebecause it can  easily cause OOM failures if arrays are 
relatively large  (was: Do not combine multiple Generate nodes in the same 
WholeStageCodeGen because it can  easily cause OOM failures)

> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> nodebecause it can  easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> The Generate node used to flatten an array generally produces significantly 
> more output rows than it receives as input. The number of output rows is even 
> more dramatic when flattening a nested array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory, for multiple reasons:
> 1. As you can see from the attachment, the rows created in the nested loop 
> are saved in the writer buffer. In my case the rows were large, so I hit an 
> OutOfMemoryError.
> 2. The generated WholeStageCodeGen results in a nested loop that, for each 
> row, explodes the parent array and then explodes the inner array. This is 
> prone to OutOfMemory errors.
>  
> Please view the attached Spark GUI and Spark DAG. 
> In my case the WholeStageCodeGen includes 2 explode nodes. 
> Because the array elements are large, we end up with an OutOfMemory error. 
>  
> I recommend that we do not merge multiple explode nodes into the same 
> WholeStageCodeGen node. Doing so leads to potential memory issues.
>  
> !image-2023-08-10-02-18-24-758.png!
>  
> !image-2023-08-10-09-19-23-973.png!
>  
> !image-2023-08-10-09-21-32-371.png!
>  
> WSCG method for first Generate node
> !image-2023-08-10-09-22-29-949.png!
>  
> WSCG for second Generate node 
> As you can see the execution of generate_doConsume_1 and generate_doConsume_0 
>  triggers a nested loop.
> !image-2023-08-10-09-24-02-755.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen because it can easily cause OOM failures

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Description: 
The Generate node used to flatten an array generally produces significantly more 
output rows than it receives as input.

The number of output rows is even more dramatic when flattening a nested array.

When we combine more than one Generate node in the same WholeStageCodeGen node, 
we run a high risk of running out of memory, for multiple reasons:

1. As you can see from the attachment, the rows created in the nested loop are 
saved in the writer buffer. In my case the rows were large, so I hit an 
OutOfMemoryError.
2. The generated WholeStageCodeGen results in a nested loop that, for each row, 
explodes the parent array and then explodes the inner array. This is prone to 
OutOfMemory errors.

Please view the attached Spark GUI and Spark DAG.

In my case the WholeStageCodeGen includes 2 explode nodes.

Because the array elements are large, we end up with an OutOfMemory error.

I recommend that we do not merge multiple explode nodes into the same 
WholeStageCodeGen node. Doing so leads to potential memory issues.

 

!image-2023-08-10-02-18-24-758.png!

 

!image-2023-08-10-09-19-23-973.png!

 

!image-2023-08-10-09-21-32-371.png!

 

WSCG method for first Generate node

!image-2023-08-10-09-22-29-949.png!

 

WSCG for second Generate node 

As you can see the execution of generate_doConsume_1 and generate_doConsume_0  
triggers a nested loop.

!image-2023-08-10-09-24-02-755.png!

 

  was:
The Generate node used to flatten an array generally produces significantly more 
output rows than it receives as input.

The number of output rows is even more dramatic when flattening a nested array.

When we combine more than one Generate node in the same WholeStageCodeGen node, 
we run a high risk of running out of memory, for multiple reasons:

1. As you can see from the attachment, the rows created in the nested loop are 
saved in the writer buffer. In my case the rows were large, so I hit an 
OutOfMemoryError.
2. The generated WholeStageCodeGen results in a nested loop that, for each row, 
explodes the parent array and then explodes the inner array. This is prone to 
OutOfMemory errors.

Please view the attached Spark GUI and Spark DAG.

In my case the WholeStageCodeGen includes 2 explode nodes.

Because the array elements are large, we end up with an OutOfMemory error.

I recommend that we do not merge multiple explode nodes into the same 
WholeStageCodeGen node. Doing so leads to potential memory issues.

 

!image-2023-08-10-02-18-24-758.png!

 


> Do not combine multiple Generate nodes in the same WholeStageCodeGen because 
> it can  easily cause OOM failures
> --
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> The Generate node used to flatten an array generally produces significantly 
> more output rows than it receives as input. The number of output rows is even 
> more dramatic when flattening a nested array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory, for multiple reasons:
> 1. As you can see from the attachment, the rows created in the nested loop 
> are saved in the writer buffer. In my case the rows were large, so I hit an 
> OutOfMemoryError.
> 2. The generated WholeStageCodeGen results in a nested loop that, for each 
> row, explodes the parent array and then explodes the inner array. This is 
> prone to OutOfMemory errors.
>  
> Please view the attached Spark GUI and Spark DAG. 
> In my case the WholeStageCodeGen includes 2 explode nodes. 
> Because the array elements are large, we end up with an OutOfMemory error. 
>  
> I recommend that we do not merge multiple explode nodes into the same 
> WholeStageCodeGen node. Doing so leads to potential memory issues.
>  
> !image-2023-08-10-02-18-24-758.png!
>  
> !image-2023-08-10-09-19-23-973.png!
>  
> !image-2023-08-10-09-21-32-371.png!
>  
> WSCG method for first Generate node
> !image-2023-08-10-09-22-29-949.png!
>  
> WSCG for second Generate node 
> As you can see the execution of generate_doConsume_1 and generate_doConsume_0 
>  triggers a nested loop.
> !image-2023-08-10-09-24-02-755.png!
>  



--
This 

[jira] [Resolved] (SPARK-44760) Index Out Of Bound for JIRA resolution in merge_spark_pr

2023-08-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44760.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42429
[https://github.com/apache/spark/pull/42429]

> Index Out Of Bound for JIRA resolution in merge_spark_pr
> 
>
> Key: SPARK-44760
> URL: https://issues.apache.org/jira/browse/SPARK-44760
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
> Fix For: 4.0.0
>
>
> I



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44762) Add more documentation and examples for using job tags for interrupt

2023-08-10 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-44762:
-

 Summary: Add more documentation and examples for using job tags 
for interrupt
 Key: SPARK-44762
 URL: https://issues.apache.org/jira/browse/SPARK-44762
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.5.0
Reporter: Juliusz Sompolski


Add documentation for the spark.addTag job-tag API, with examples and an 
explanation similar to those for SparkContext.setJobGroup.
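
A minimal sketch of the kind of example such documentation could include, 
assuming the SparkSession tag methods this work targets (addTag / removeTag / 
interruptTag) in the Spark Connect Scala client; the endpoint, tag name and 
query are illustrative:

{code:scala}
import org.apache.spark.sql.SparkSession

// Assumes a Spark Connect server is reachable at this (illustrative) address.
val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()

// Tag all jobs started from this session, similar in spirit to
// SparkContext.setJobGroup in classic Spark.
spark.addTag("nightly-report")
try {
  spark.range(1000000000L).selectExpr("sum(id)").collect()
} finally {
  spark.removeTag("nightly-report")
}

// From another thread holding the same session, everything still running
// under the tag can be cancelled:
// spark.interruptTag("nightly-report")
{code}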



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44741) Support regex-based MetricFilter in StatsdSink

2023-08-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44741:
--
Affects Version/s: 4.0.0
   (was: 3.4.1)

> Support regex-based MetricFilter in StatsdSink
> --
>
> Key: SPARK-44741
> URL: https://issues.apache.org/jira/browse/SPARK-44741
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: rameshkrishnan muthusamy
>Assignee: rameshkrishnan muthusamy
>Priority: Minor
>  Labels: Metrics, Sink, statsd
> Fix For: 4.0.0
>
>
> The Spark StatsD metrics sink currently does not support a metrics filtering 
> option. 
> Although this option is available in the underlying reporters, it is not 
> exposed in the StatsD sink. An example of such a filter can be seen at 
> [https://github.com/apache/spark/blob/be9ffb37585fe421705ceaa52fe49b89c50703a3/core/src/main/scala/org/apache/spark/metrics/sink/GraphiteSink.scala#L76]
>  
> This is a critical option for teams that do not want every metric Spark 
> exposes pushed to their monitoring platforms, preferring to switch to 
> detailed metrics only as needed. 
>  
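
For reference, the Graphite sink already exposes a regex property (see the link 
above) that is turned into a MetricFilter; below is a hedged sketch of what the 
requested StatsD equivalent could look like in conf/metrics.properties. The 
*.sink.statsd.regex key is an assumed analogue of the existing Graphite key and 
may differ from the final change; hosts and patterns are illustrative:

{code}
# conf/metrics.properties (sketch)

# Existing behaviour: the Graphite sink builds a MetricFilter from this regex.
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.regex=jvm.*

# Assumed StatsD analogue requested by this ticket (key name may differ in the final patch).
*.sink.statsd.class=org.apache.spark.metrics.sink.StatsdSink
*.sink.statsd.host=statsd.example.com
*.sink.statsd.port=8125
*.sink.statsd.regex=executor.*
{code}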



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44741) Support regex-based MetricFilter in StatsdSink

2023-08-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44741:
--
Summary: Support regex-based MetricFilter in StatsdSink  (was: Spark StatsD 
metrics reported to support metrics filter option )

> Support regex-based MetricFilter in StatsdSink
> --
>
> Key: SPARK-44741
> URL: https://issues.apache.org/jira/browse/SPARK-44741
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: rameshkrishnan muthusamy
>Assignee: rameshkrishnan muthusamy
>Priority: Minor
>  Labels: Metrics, Sink, statsd
> Fix For: 4.0.0
>
>
> The Spark StatsD metrics sink currently does not support a metrics filtering 
> option. 
> Although this option is available in the underlying reporters, it is not 
> exposed in the StatsD sink. An example of such a filter can be seen at 
> [https://github.com/apache/spark/blob/be9ffb37585fe421705ceaa52fe49b89c50703a3/core/src/main/scala/org/apache/spark/metrics/sink/GraphiteSink.scala#L76]
>  
> This is a critical option for teams that do not want every metric Spark 
> exposes pushed to their monitoring platforms, preferring to switch to 
> detailed metrics only as needed. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44741) Spark StatsD metrics reported to support metrics filter option

2023-08-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44741:
-

Assignee: rameshkrishnan muthusamy

> Spark StatsD metrics reported to support metrics filter option 
> ---
>
> Key: SPARK-44741
> URL: https://issues.apache.org/jira/browse/SPARK-44741
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: rameshkrishnan muthusamy
>Assignee: rameshkrishnan muthusamy
>Priority: Minor
>  Labels: Metrics, Sink, statsd
>
> The Spark StatsD metrics sink currently does not support a metrics filtering 
> option. 
> Although this option is available in the underlying reporters, it is not 
> exposed in the StatsD sink. An example of such a filter can be seen at 
> [https://github.com/apache/spark/blob/be9ffb37585fe421705ceaa52fe49b89c50703a3/core/src/main/scala/org/apache/spark/metrics/sink/GraphiteSink.scala#L76]
>  
> This is a critical option for teams that do not want every metric Spark 
> exposes pushed to their monitoring platforms, preferring to switch to 
> detailed metrics only as needed. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44741) Spark StatsD metrics reported to support metrics filter option

2023-08-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44741.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42416
[https://github.com/apache/spark/pull/42416]

> Spark StatsD metrics reported to support metrics filter option 
> ---
>
> Key: SPARK-44741
> URL: https://issues.apache.org/jira/browse/SPARK-44741
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: rameshkrishnan muthusamy
>Assignee: rameshkrishnan muthusamy
>Priority: Minor
>  Labels: Metrics, Sink, statsd
> Fix For: 4.0.0
>
>
> The Spark StatsD metrics sink currently does not support a metrics filtering 
> option. 
> Although this option is available in the underlying reporters, it is not 
> exposed in the StatsD sink. An example of such a filter can be seen at 
> [https://github.com/apache/spark/blob/be9ffb37585fe421705ceaa52fe49b89c50703a3/core/src/main/scala/org/apache/spark/metrics/sink/GraphiteSink.scala#L76]
>  
> This is a critical option for teams that do not want every metric Spark 
> exposes pushed to their monitoring platforms, preferring to switch to 
> detailed metrics only as needed. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)

2023-08-10 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-44700:

Fix Version/s: 3.3.0

> Rule OptimizeCsvJsonExprs should not be applied to expression like 
> from_json(regexp_replace)
> 
>
> Key: SPARK-44700
> URL: https://issues.apache.org/jira/browse/SPARK-44700
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: jiahong.li
>Priority: Minor
> Fix For: 3.3.0
>
>
> SQL like the following:
> select tmp.* 
>  from
>  (select
>         device_id, ads_id, 
>         from_json(regexp_replace(device_personas, '(?<=(\\{|,))"device_', 
> '"user_device_'), ${device_schema}) as tmp
>         from input )
> ${device_schema} includes more than 100 fields.
> If the rule OptimizeCsvJsonExprs is applied, the regexp_replace expression 
> will be invoked many times, which is very costly.
>  
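
A small sketch that reproduces the shape of the reported query (schema, column 
names and sample data are illustrative, standing in for the real 100+-field 
${device_schema}); inspecting the optimized plan shows whether regexp_replace 
ends up duplicated once per projected field:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json, regexp_replace}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("from-json-regexp-sketch").getOrCreate()
import spark.implicits._

// Stand-in for ${device_schema}; the real schema has 100+ fields.
val deviceSchema = StructType((1 to 100).map(i => StructField(s"user_device_f$i", StringType)))

val input = Seq("""{"device_f1":"a","device_f2":"b"}""").toDF("device_personas")

val tmp = input.select(
  from_json(
    regexp_replace(col("device_personas"), """(?<=(\{|,))"device_""", "\"user_device_"),
    deviceSchema
  ).as("tmp")
)

// Expanding tmp.* projects one expression per schema field; check the optimized
// plan for how many times regexp_replace appears after optimization.
tmp.select("tmp.*").explain(true)
{code}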



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)

2023-08-10 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-44700.
-
Resolution: Fixed

Please upgrade Spark to the latest version to fix this issue.

> Rule OptimizeCsvJsonExprs should not be applied to expression like 
> from_json(regexp_replace)
> 
>
> Key: SPARK-44700
> URL: https://issues.apache.org/jira/browse/SPARK-44700
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: jiahong.li
>Priority: Minor
>
> SQL like the following:
> select tmp.* 
>  from
>  (select
>         device_id, ads_id, 
>         from_json(regexp_replace(device_personas, '(?<=(\\{|,))"device_', 
> '"user_device_'), ${device_schema}) as tmp
>         from input )
> ${device_schema} includes more than 100 fields.
> If the rule OptimizeCsvJsonExprs is applied, the regexp_replace expression 
> will be invoked many times, which is very costly.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)

2023-08-10 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-44700:

Affects Version/s: 3.1.1
   (was: 3.4.0)
   (was: 3.4.1)

> Rule OptimizeCsvJsonExprs should not be applied to expression like 
> from_json(regexp_replace)
> 
>
> Key: SPARK-44700
> URL: https://issues.apache.org/jira/browse/SPARK-44700
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: jiahong.li
>Priority: Minor
>
> SQL like the following:
> select tmp.* 
>  from
>  (select
>         device_id, ads_id, 
>         from_json(regexp_replace(device_personas, '(?<=(\\{|,))"device_', 
> '"user_device_'), ${device_schema}) as tmp
>         from input )
> ${device_schema} includes more than 100 fields.
> If the rule OptimizeCsvJsonExprs is applied, the regexp_replace expression 
> will be invoked many times, which is very costly.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44761) Add DataStreamWriter.foreachBatch(org.apache.spark.api.java.function.VoidFunction2) signature

2023-08-10 Thread Jira
Herman van Hövell created SPARK-44761:
-

 Summary: Add 
DataStreamWriter.foreachBatch(org.apache.spark.api.java.function.VoidFunction2) 
signature 
 Key: SPARK-44761
 URL: https://issues.apache.org/jira/browse/SPARK-44761
 Project: Spark
  Issue Type: New Feature
  Components: Connect
Affects Versions: 3.5.0
Reporter: Herman van Hövell
Assignee: Herman van Hövell
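
For reference, a hedged sketch of how this overload is used today on the 
classic DataStreamWriter (the Connect client would presumably mirror it); the 
rate source and the println side effect are illustrative:

{code:scala}
import org.apache.spark.api.java.function.VoidFunction2
import org.apache.spark.sql.{Dataset, Row, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("foreachBatch-sketch").getOrCreate()

val stream = spark.readStream.format("rate").load()

// The Java-friendly signature: VoidFunction2[Dataset[Row], java.lang.Long].
val handler = new VoidFunction2[Dataset[Row], java.lang.Long] {
  override def call(batch: Dataset[Row], batchId: java.lang.Long): Unit = {
    // Illustrative per-micro-batch side effect.
    println(s"batch $batchId contains ${batch.count()} rows")
  }
}

val query = stream.writeStream.foreachBatch(handler).start()
query.awaitTermination(30000)  // run briefly for this sketch
query.stop()
{code}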






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44756) Executor hangs when RetryingBlockTransferor fails to initiate retry

2023-08-10 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752785#comment-17752785
 ] 

GridGain Integration commented on SPARK-44756:
--

User 'hdaikoku' has created a pull request for this issue:
https://github.com/apache/spark/pull/42426

> Executor hangs when RetryingBlockTransferor fails to initiate retry
> ---
>
> Key: SPARK-44756
> URL: https://issues.apache.org/jira/browse/SPARK-44756
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 3.3.1
>Reporter: Harunobu Daikoku
>Priority: Minor
>
> We have observed this issue several times in our production environment, 
> where some executors get stuck at BlockTransferService#fetchBlockSync().
> After some investigation, the issue seems to be caused by an unhandled edge 
> case in RetryingBlockTransferor.
> 1. Shuffle transfer fails for whatever reason
> {noformat}
> java.io.IOException: Cannot allocate memory
>   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>   at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
>   at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>   at sun.nio.ch.IOUtil.write(IOUtil.java:51)
>   at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
>   at 
> org.apache.spark.network.shuffle.SimpleDownloadFile$SimpleDownloadWritableChannel.write(SimpleDownloadFile.java:78)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher$DownloadCallback.onData(OneForOneBlockFetcher.java:340)
>   at 
> org.apache.spark.network.client.StreamInterceptor.handle(StreamInterceptor.java:79)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.feedInterceptor(TransportFrameDecoder.java:263)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:87)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
> {noformat}
> 2. The above exception caught by 
> [AbstractChannelHandlerContext#invokeChannelRead()|https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L381],
>  and propagated to the exception handler
> 3. Exception reaches 
> [RetryingBlockTransferor#initiateRetry()|https://github.com/apache/spark/blob/v3.3.1/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java#L178-L180],
>  and it tries to initiate retry
> {noformat}
> 23/08/09 16:58:37 shuffle-client-4-2 INFO RetryingBlockTransferor: Retrying 
> fetch (1/3) for 1 outstanding blocks after 5000 ms
> {noformat}
> 4. Retry initiation fails (in our case, it fails to create a new thread)
> 5. Exception caught by 
> [AbstractChannelHandlerContext#invokeExceptionCaught()|https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L305-L309],
>  and not further processed
> {noformat}
> 23/08/09 16:58:53 shuffle-client-4-2 DEBUG AbstractChannelHandlerContext: An 
> exception java.lang.OutOfMemoryError: unable to create new native thread
>   at java.lang.Thread.start0(Native Method)
>   at java.lang.Thread.start(Thread.java:719)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor.initiateRetry(RetryingBlockTransferor.java:182)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor.access$500(RetryingBlockTransferor.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor$RetryingBlockTransferListener.handleBlockTransferFailure(RetryingBlockTransferor.java:230)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor$RetryingBlockTransferListener.onBlockFetchFailure(RetryingBlockTransferor.java:260)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher.failRemainingBlocks(OneForOneBlockFetcher.java:318)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher.access$300(OneForOneBlockFetcher.java:55)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher$DownloadCallback.onFailure(OneForOneBlockFetcher.java:357)
>   at 
> org.apache.spark.network.client.StreamInterceptor.exceptionCaught(StreamInterceptor.java:56)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.exceptionCaught(TransportFrameDecoder.java:231)
>   at 
> 

[jira] [Commented] (SPARK-44756) Executor hangs when RetryingBlockTransferor fails to initiate retry

2023-08-10 Thread Harunobu Daikoku (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752781#comment-17752781
 ] 

Harunobu Daikoku commented on SPARK-44756:
--

PR raised: https://github.com/apache/spark/pull/42426

> Executor hangs when RetryingBlockTransferor fails to initiate retry
> ---
>
> Key: SPARK-44756
> URL: https://issues.apache.org/jira/browse/SPARK-44756
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 3.3.1
>Reporter: Harunobu Daikoku
>Priority: Minor
>
> We have observed this issue several times in our production environment, 
> where some executors get stuck at BlockTransferService#fetchBlockSync().
> After some investigation, the issue seems to be caused by an unhandled edge 
> case in RetryingBlockTransferor.
> 1. Shuffle transfer fails for whatever reason
> {noformat}
> java.io.IOException: Cannot allocate memory
>   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>   at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
>   at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>   at sun.nio.ch.IOUtil.write(IOUtil.java:51)
>   at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
>   at 
> org.apache.spark.network.shuffle.SimpleDownloadFile$SimpleDownloadWritableChannel.write(SimpleDownloadFile.java:78)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher$DownloadCallback.onData(OneForOneBlockFetcher.java:340)
>   at 
> org.apache.spark.network.client.StreamInterceptor.handle(StreamInterceptor.java:79)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.feedInterceptor(TransportFrameDecoder.java:263)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:87)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
> {noformat}
> 2. The above exception caught by 
> [AbstractChannelHandlerContext#invokeChannelRead()|https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L381],
>  and propagated to the exception handler
> 3. Exception reaches 
> [RetryingBlockTransferor#initiateRetry()|https://github.com/apache/spark/blob/v3.3.1/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java#L178-L180],
>  and it tries to initiate retry
> {noformat}
> 23/08/09 16:58:37 shuffle-client-4-2 INFO RetryingBlockTransferor: Retrying 
> fetch (1/3) for 1 outstanding blocks after 5000 ms
> {noformat}
> 4. Retry initiation fails (in our case, it fails to create a new thread)
> 5. Exception caught by 
> [AbstractChannelHandlerContext#invokeExceptionCaught()|https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L305-L309],
>  and not further processed
> {noformat}
> 23/08/09 16:58:53 shuffle-client-4-2 DEBUG AbstractChannelHandlerContext: An 
> exception java.lang.OutOfMemoryError: unable to create new native thread
>   at java.lang.Thread.start0(Native Method)
>   at java.lang.Thread.start(Thread.java:719)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor.initiateRetry(RetryingBlockTransferor.java:182)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor.access$500(RetryingBlockTransferor.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor$RetryingBlockTransferListener.handleBlockTransferFailure(RetryingBlockTransferor.java:230)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor$RetryingBlockTransferListener.onBlockFetchFailure(RetryingBlockTransferor.java:260)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher.failRemainingBlocks(OneForOneBlockFetcher.java:318)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher.access$300(OneForOneBlockFetcher.java:55)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher$DownloadCallback.onFailure(OneForOneBlockFetcher.java:357)
>   at 
> org.apache.spark.network.client.StreamInterceptor.exceptionCaught(StreamInterceptor.java:56)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.exceptionCaught(TransportFrameDecoder.java:231)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
> {noformat}
> 6. 

[jira] [Updated] (SPARK-44756) Executor hangs when RetryingBlockTransferor fails to initiate retry

2023-08-10 Thread Harunobu Daikoku (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harunobu Daikoku updated SPARK-44756:
-
Component/s: Spark Core

> Executor hangs when RetryingBlockTransferor fails to initiate retry
> ---
>
> Key: SPARK-44756
> URL: https://issues.apache.org/jira/browse/SPARK-44756
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 3.3.1
>Reporter: Harunobu Daikoku
>Priority: Minor
>
> We have observed this issue several times in our production environment, 
> where some executors get stuck at BlockTransferService#fetchBlockSync().
> After some investigation, the issue seems to be caused by an unhandled edge 
> case in RetryingBlockTransferor.
> 1. Shuffle transfer fails for whatever reason
> {noformat}
> java.io.IOException: Cannot allocate memory
>   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>   at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
>   at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>   at sun.nio.ch.IOUtil.write(IOUtil.java:51)
>   at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
>   at 
> org.apache.spark.network.shuffle.SimpleDownloadFile$SimpleDownloadWritableChannel.write(SimpleDownloadFile.java:78)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher$DownloadCallback.onData(OneForOneBlockFetcher.java:340)
>   at 
> org.apache.spark.network.client.StreamInterceptor.handle(StreamInterceptor.java:79)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.feedInterceptor(TransportFrameDecoder.java:263)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:87)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
> {noformat}
> 2. The above exception caught by 
> [AbstractChannelHandlerContext#invokeChannelRead()|https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L381],
>  and propagated to the exception handler
> 3. Exception reaches 
> [RetryingBlockTransferor#initiateRetry()|https://github.com/apache/spark/blob/v3.3.1/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java#L178-L180],
>  and it tries to initiate retry
> {noformat}
> 23/08/09 16:58:37 shuffle-client-4-2 INFO RetryingBlockTransferor: Retrying 
> fetch (1/3) for 1 outstanding blocks after 5000 ms
> {noformat}
> 4. Retry initiation fails (in our case, it fails to create a new thread)
> 5. Exception caught by 
> [AbstractChannelHandlerContext#invokeExceptionCaught()|https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L305-L309],
>  and not further processed
> {noformat}
> 23/08/09 16:58:53 shuffle-client-4-2 DEBUG AbstractChannelHandlerContext: An 
> exception java.lang.OutOfMemoryError: unable to create new native thread
>   at java.lang.Thread.start0(Native Method)
>   at java.lang.Thread.start(Thread.java:719)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor.initiateRetry(RetryingBlockTransferor.java:182)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor.access$500(RetryingBlockTransferor.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor$RetryingBlockTransferListener.handleBlockTransferFailure(RetryingBlockTransferor.java:230)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor$RetryingBlockTransferListener.onBlockFetchFailure(RetryingBlockTransferor.java:260)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher.failRemainingBlocks(OneForOneBlockFetcher.java:318)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher.access$300(OneForOneBlockFetcher.java:55)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher$DownloadCallback.onFailure(OneForOneBlockFetcher.java:357)
>   at 
> org.apache.spark.network.client.StreamInterceptor.exceptionCaught(StreamInterceptor.java:56)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.exceptionCaught(TransportFrameDecoder.java:231)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
> {noformat}
> 6. In the end, the retry never happens and the executor thread ends up being 

[jira] [Created] (SPARK-44760) Index Out Of Bound for JIRA resolution in merge_spark_pr

2023-08-10 Thread Kent Yao (Jira)
Kent Yao created SPARK-44760:


 Summary: Index Out Of Bound for JIRA resolution in merge_spark_pr
 Key: SPARK-44760
 URL: https://issues.apache.org/jira/browse/SPARK-44760
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Kent Yao


I






[jira] [Updated] (SPARK-44758) Support memory limit configurable

2023-08-10 Thread zhuml (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuml updated SPARK-44758:
--
Description: Currently the memory request and limit are both set to the sum of 
spark.\{driver,executor}.memory and 
spark.\{driver,executor}.memoryOverhead. Making the memory limit configurable 
can bring some benefits: for example, memory above the fixed request can be 
used for the OS page cache, reducing the disk I/O of shuffle reads and 
improving performance.  (was: Currently the 
memory request and limit are set by summing the values of 
spark.\{driver,executor}.memory and spark.\{driver,executor}.memoryOverhead. 
Supporting configurable memory limits can bring some benefits. For example, use 
unfixed memory to use page cache, reduce disk IO of shuffle read to improve 
performance.)

> Support memory limit configurable
> -
>
> Key: SPARK-44758
> URL: https://issues.apache.org/jira/browse/SPARK-44758
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.1
>Reporter: zhuml
>Priority: Minor
>
> Currently the memory request and limit are both set to the sum of 
> spark.\{driver,executor}.memory and spark.\{driver,executor}.memoryOverhead. 
> Making the memory limit configurable can bring some benefits: for example, 
> memory above the fixed request can be used for the OS page cache, reducing 
> the disk I/O of shuffle reads and improving performance.






[jira] [Updated] (SPARK-44758) Support memory limit configurable

2023-08-10 Thread zhuml (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuml updated SPARK-44758:
--
Summary: Support memory limit configurable  (was: Support memory limit)

> Support memory limit configurable
> -
>
> Key: SPARK-44758
> URL: https://issues.apache.org/jira/browse/SPARK-44758
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.1
>Reporter: zhuml
>Priority: Minor
>
> Currently the memory request and limit are both set to the sum of 
> spark.\{driver,executor}.memory and spark.\{driver,executor}.memoryOverhead. 
> Making the memory limit configurable can bring some benefits: for example, 
> memory above the fixed request can be used for the OS page cache, reducing 
> the disk I/O of shuffle reads and improving performance.






[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen because it can easily cause OOM failures

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Attachment: (was: spark-verbosewithcodegenenabled)

> Do not combine multiple Generate nodes in the same WholeStageCodeGen because 
> it can  easily cause OOM failures
> --
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> The Generate node used to flatten an array generally produces significantly 
> more output rows than input rows, and the fan-out is even more drastic when 
> flattening a nested array.
> When we combine more than one Generate node in the same WholeStageCodegen 
> node, we run a high risk of running out of memory, for multiple reasons:
> 1. As you can see from the attachment, the rows created in the nested loop 
> are buffered in the writer buffer. In my case, because the rows were big, I 
> hit an Out Of Memory error.
> 2. The generated WholeStageCodegen code results in a nested loop that, for 
> each row, explodes the parent array and then explodes the inner array. This 
> is prone to Out Of Memory errors.
>  
> Please see the attached Spark GUI and Spark DAG.
> In my case the WholeStageCodegen includes 2 explode nodes.
> Because the array elements are large, we end up with an Out Of Memory error.
>  
> I recommend that we do not merge multiple explode nodes into the same 
> whole-stage code gen node. Doing so leads to potential memory issues.
>  
> !image-2023-08-10-02-18-24-758.png!
>  






[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen because it can easily cause OOM failures

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Attachment: wholestagecodegen_wc1_debug_wholecodegen_passed

> Do not combine multiple Generate nodes in the same WholeStageCodeGen because 
> it can  easily cause OOM failures
> --
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> The Generate node used to flatten an array generally produces significantly 
> more output rows than input rows, and the fan-out is even more drastic when 
> flattening a nested array.
> When we combine more than one Generate node in the same WholeStageCodegen 
> node, we run a high risk of running out of memory, for multiple reasons:
> 1. As you can see from the attachment, the rows created in the nested loop 
> are buffered in the writer buffer. In my case, because the rows were big, I 
> hit an Out Of Memory error.
> 2. The generated WholeStageCodegen code results in a nested loop that, for 
> each row, explodes the parent array and then explodes the inner array. This 
> is prone to Out Of Memory errors.
>  
> Please see the attached Spark GUI and Spark DAG.
> In my case the WholeStageCodegen includes 2 explode nodes.
> Because the array elements are large, we end up with an Out Of Memory error.
>  
> I recommend that we do not merge multiple explode nodes into the same 
> whole-stage code gen node. Doing so leads to potential memory issues.
>  
> !image-2023-08-10-02-18-24-758.png!
>  






[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen because it can easily cause OOM failures

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Attachment: spark-verbosewithcodegenenabled

> Do not combine multiple Generate nodes in the same WholeStageCodeGen because 
> it can  easily cause OOM failures
> --
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: spark-verbosewithcodegenenabled
>
>
> The Generate node used to flatten an array generally produces significantly 
> more output rows than input rows, and the fan-out is even more drastic when 
> flattening a nested array.
> When we combine more than one Generate node in the same WholeStageCodegen 
> node, we run a high risk of running out of memory, for multiple reasons:
> 1. As you can see from the attachment, the rows created in the nested loop 
> are buffered in the writer buffer. In my case, because the rows were big, I 
> hit an Out Of Memory error.
> 2. The generated WholeStageCodegen code results in a nested loop that, for 
> each row, explodes the parent array and then explodes the inner array. This 
> is prone to Out Of Memory errors.
>  
> Please see the attached Spark GUI and Spark DAG.
> In my case the WholeStageCodegen includes 2 explode nodes.
> Because the array elements are large, we end up with an Out Of Memory error.
>  
> I recommend that we do not merge multiple explode nodes into the same 
> whole-stage code gen node. Doing so leads to potential memory issues.
>  
> !image-2023-08-10-02-18-24-758.png!
>  






[jira] [Created] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen because it can easily cause OOM failures

2023-08-10 Thread Franck Tago (Jira)
Franck Tago created SPARK-44759:
---

 Summary: Do not combine multiple Generate nodes in the same 
WholeStageCodeGen because it can  easily cause OOM failures
 Key: SPARK-44759
 URL: https://issues.apache.org/jira/browse/SPARK-44759
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 3.3.1, 3.3.0
Reporter: Franck Tago


The Generate node used to flatten an array generally produces significantly more 
output rows than input rows, and the fan-out is even more drastic when flattening 
a nested array.

When we combine more than one Generate node in the same WholeStageCodegen node, 
we run a high risk of running out of memory, for multiple reasons:

1. As you can see from the attachment, the rows created in the nested loop are 
buffered in the writer buffer. In my case, because the rows were big, I hit an 
Out Of Memory error.
2. The generated WholeStageCodegen code results in a nested loop that, for each 
row, explodes the parent array and then explodes the inner array. This is prone 
to Out Of Memory errors.

Please see the attached Spark GUI and Spark DAG.

In my case the WholeStageCodegen includes 2 explode nodes.

Because the array elements are large, we end up with an Out Of Memory error.

I recommend that we do not merge multiple explode nodes into the same whole-stage 
code gen node. Doing so leads to potential memory issues.

 

!image-2023-08-10-02-18-24-758.png!
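To make the reported shape concrete, here is a minimal Scala sketch of a query with two chained explodes over a nested array (synthetic data; the codegen switch mentioned in the last comment is a general-purpose knob, not a fix proposed by this ticket):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

object NestedExplodeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("SPARK-44759-sketch").getOrCreate()
    import spark.implicits._

    // A nested array column; each input row fans out to the total number of inner elements.
    val df = Seq(Seq(Seq(1, 2, 3), Seq(4, 5))).toDF("nested")

    val flattened = df
      .select(explode($"nested").as("inner"))  // first Generate node
      .select(explode($"inner").as("value"))   // second Generate node

    // With both Generate nodes fused into one WholeStageCodegen subtree, the
    // generated code is a nested loop whose intermediate rows are buffered,
    // as described in the report above.
    flattened.explain("formatted")

    // Coarse mitigation while rows are very wide (disables fusion entirely):
    // spark.conf.set("spark.sql.codegen.wholeStage", "false")
  }
}
{code}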

 






[jira] [Updated] (SPARK-44757) Vulnerabilities in Spark3.4

2023-08-10 Thread Anand Balasubramaniam (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Balasubramaniam updated SPARK-44757:
--
Priority: Major  (was: Minor)

> Vulnerabilities in Spark3.4
> ---
>
> Key: SPARK-44757
> URL: https://issues.apache.org/jira/browse/SPARK-44757
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Anand Balasubramaniam
>Priority: Major
>
> A StackRox scan of the Spark 3.4 package reports the third-party libraries 
> (TPLs) below as having known vulnerabilities. Is there any ETA on fixing them? 
> Kindly apprise us on the same.
> h2. Vulnerabilities in Spark3.4
> |*CVE*|*Description*|*Severity*|
> |CVE-2018-21234|Jodd before 5.0.4 performs Deserialization of Untrusted JSON 
> Data when setClassMetadataName is set.|CVSS Score:9.8Critical|
> |CVE-2022-42004|In FasterXML jackson-databind before 2.13.4, resource 
> exhaustion can occur because of a lack of a check in 
> BeanDeserializer._deserializeFromArray to prevent use of deeply nested 
> arrays. An application is vulnerable only with certain customized choices for 
> deserialization.|CVSS Score 7.5Important|
> | CVE-2022-42003|In FasterXML jackson-databind before 2.14.0-rc1, resource 
> exhaustion can occur because of a lack of a check in primitive value 
> deserializers to avoid deep wrapper array nesting, when the 
> UNWRAP_SINGLE_VALUE_ARRAYS feature is enabled. Additional fix version in 
> 2.13.4.1 and 2.12.17.1|CVSS Score 7.5Important|
> |CVE-2022-40152|Those using Woodstox to parse XML data may be vulnerable to 
> Denial of Service attacks (DOS) if DTD support is enabled. If the parser is 
> running on user supplied input, an attacker may supply content that causes 
> the parser to crash by stackoverflow. This effect may support a denial of 
> service attack.|CVSS Score 7.5Important|
> |CVE-2022-3171|A parsing issue with binary data in protobuf-java core and 
> lite versions prior to 3.21.7, 3.20.3, 3.19.6 and 3.16.3 can lead to a denial 
> of service attack. Inputs containing multiple instances of non-repeated 
> embedded messages with repeated or unknown fields causes objects to be 
> converted back-n-forth between mutable and immutable forms, resulting in 
> potentially long garbage collection pauses. We recommend updating to the 
> versions mentioned above.|CVSS Score 7.5Important|
> |CVE-2021-34538|Apache Hive before 3.1.3 "CREATE" and "DROP" function 
> operations does not check for necessary authorization of involved entities in 
> the query. It was found that an unauthorized user can manipulate an existing 
> UDF without having the privileges to do so. This allowed unauthorized or 
> underprivileged users to drop and recreate UDFs pointing them to new jars 
> that could be potentially malicious.|CVSS Score 7.5Important|
> |CVE-2020-13949|In Apache Thrift 0.9.3 to 0.13.0, malicious RPC clients could 
> send short messages which would result in a large memory allocation, 
> potentially leading to denial of service.|CVSS Score 7.5Important|
> |CVE-2018-10237|Unbounded memory allocation in Google Guava 11.0 through 24.x 
> before 24.1.1 allows remote attackers to conduct denial of service attacks 
> against servers that depend on this library and deserialize attacker-provided 
> data, because the AtomicDoubleArray class (when serialized with Java 
> serialization) and the CompoundOrdering class (when serialized with GWT 
> serialization) perform eager allocation without appropriate checks on what a 
> client has sent and whether the data size is reasonable.|CVSS 5.9Moderate|
> |CVE-2021-22569|An issue in protobuf-java allowed the interleaving of 
> com.google.protobuf.UnknownFieldSet fields in such a way that would be 
> processed out of order. A small malicious payload can occupy the parser for 
> several minutes by creating large numbers of short-lived objects that cause 
> frequent, repeated pauses. We recommend upgrading libraries beyond the 
> vulnerable versions.|CVSS 5.9Moderate|
> |CVE-2020-8908|A temp directory creation vulnerability exists in all versions 
> of Guava, allowing an attacker with access to the machine to potentially 
> access data in a temporary directory created by the Guava API 
> [com.google.common.io|https://urldefense.com/v3/__http:/com.google.common.io/__;!!KpaPruflFCEp!hUy3fNZoxf_mnbeTP7GUWkbaKtRLDswR2fRnQ9Gm_AoaeVUncE_plq53EqTWyd1ZfAI7tIFOgmmEBPoGRw$].Files.createTempDir().
>  By default, on unix-like systems, the created directory is world-readable 
> (readable by an attacker with access to the system). The method in question 
> has been marked @Deprecated in versions 30.0 and later and should not be 
> used. For Android developers, we recommend choosing a temporary directory API 
> provided by Android, such as context.getCacheDir(). For other Java 
> 

[jira] [Created] (SPARK-44758) Support memory limit

2023-08-10 Thread zhuml (Jira)
zhuml created SPARK-44758:
-

 Summary: Support memory limit
 Key: SPARK-44758
 URL: https://issues.apache.org/jira/browse/SPARK-44758
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.4.1
Reporter: zhuml


Currently the memory request and limit are both set to the sum of 
spark.\{driver,executor}.memory and spark.\{driver,executor}.memoryOverhead. 
Making the memory limit configurable can bring some benefits: for example, 
memory above the fixed request can be used for the OS page cache, reducing the 
disk I/O of shuffle reads and improving performance.
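As a rough illustration of the proposal (the spark.kubernetes.executor.memoryLimit key below is hypothetical and does not exist in Spark; only the first two settings are real Spark configs):

{code:scala}
import org.apache.spark.SparkConf

object MemoryLimitSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .set("spark.executor.memory", "4g")          // executor heap
      .set("spark.executor.memoryOverhead", "1g")  // non-heap overhead
      // Today both the pod memory request and the limit become 4g + 1g = 5g.
      // A hypothetical knob of the kind proposed here would keep the request at
      // 5g but allow a larger hard limit, leaving headroom for the OS page cache:
      .set("spark.kubernetes.executor.memoryLimit", "8g") // NOT an existing config
    println(conf.toDebugString)
  }
}
{code}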






[jira] [Created] (SPARK-44757) Vulnerabilities in Spark3.4

2023-08-10 Thread Anand Balasubramaniam (Jira)
Anand Balasubramaniam created SPARK-44757:
-

 Summary: Vulnerabilities in Spark3.4
 Key: SPARK-44757
 URL: https://issues.apache.org/jira/browse/SPARK-44757
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Anand Balasubramaniam


A StackRox scan of the Spark 3.4 package reports the third-party libraries 
(TPLs) below as having known vulnerabilities. Is there any ETA on fixing them? 
Kindly apprise us on the same.
h2. Vulnerabilities in Spark3.4
|*CVE*|*Description*|*Severity*|
|CVE-2018-21234|Jodd before 5.0.4 performs Deserialization of Untrusted JSON 
Data when setClassMetadataName is set.|CVSS Score:9.8Critical|
|CVE-2022-42004|In FasterXML jackson-databind before 2.13.4, resource 
exhaustion can occur because of a lack of a check in 
BeanDeserializer._deserializeFromArray to prevent use of deeply nested arrays. 
An application is vulnerable only with certain customized choices for 
deserialization.|CVSS Score 7.5Important|
| CVE-2022-42003|In FasterXML jackson-databind before 2.14.0-rc1, resource 
exhaustion can occur because of a lack of a check in primitive value 
deserializers to avoid deep wrapper array nesting, when the 
UNWRAP_SINGLE_VALUE_ARRAYS feature is enabled. Additional fix version in 
2.13.4.1 and 2.12.17.1|CVSS Score 7.5Important|
|CVE-2022-40152|Those using Woodstox to parse XML data may be vulnerable to 
Denial of Service attacks (DOS) if DTD support is enabled. If the parser is 
running on user supplied input, an attacker may supply content that causes the 
parser to crash by stackoverflow. This effect may support a denial of service 
attack.|CVSS Score 7.5Important|
|CVE-2022-3171|A parsing issue with binary data in protobuf-java core and lite 
versions prior to 3.21.7, 3.20.3, 3.19.6 and 3.16.3 can lead to a denial of 
service attack. Inputs containing multiple instances of non-repeated embedded 
messages with repeated or unknown fields causes objects to be converted 
back-n-forth between mutable and immutable forms, resulting in potentially long 
garbage collection pauses. We recommend updating to the versions mentioned 
above.|CVSS Score 7.5Important|
|CVE-2021-34538|Apache Hive before 3.1.3 "CREATE" and "DROP" function 
operations does not check for necessary authorization of involved entities in 
the query. It was found that an unauthorized user can manipulate an existing 
UDF without having the privileges to do so. This allowed unauthorized or 
underprivileged users to drop and recreate UDFs pointing them to new jars that 
could be potentially malicious.|CVSS Score 7.5Important|
|CVE-2020-13949|In Apache Thrift 0.9.3 to 0.13.0, malicious RPC clients could 
send short messages which would result in a large memory allocation, 
potentially leading to denial of service.|CVSS Score 7.5Important|
|CVE-2018-10237|Unbounded memory allocation in Google Guava 11.0 through 24.x 
before 24.1.1 allows remote attackers to conduct denial of service attacks 
against servers that depend on this library and deserialize attacker-provided 
data, because the AtomicDoubleArray class (when serialized with Java 
serialization) and the CompoundOrdering class (when serialized with GWT 
serialization) perform eager allocation without appropriate checks on what a 
client has sent and whether the data size is reasonable.|CVSS 5.9Moderate|
|CVE-2021-22569|An issue in protobuf-java allowed the interleaving of 
com.google.protobuf.UnknownFieldSet fields in such a way that would be 
processed out of order. A small malicious payload can occupy the parser for 
several minutes by creating large numbers of short-lived objects that cause 
frequent, repeated pauses. We recommend upgrading libraries beyond the 
vulnerable versions.|CVSS 5.9Moderate|
|CVE-2020-8908|A temp directory creation vulnerability exists in all versions 
of Guava, allowing an attacker with access to the machine to potentially access 
data in a temporary directory created by the Guava API 
[com.google.common.io|https://urldefense.com/v3/__http:/com.google.common.io/__;!!KpaPruflFCEp!hUy3fNZoxf_mnbeTP7GUWkbaKtRLDswR2fRnQ9Gm_AoaeVUncE_plq53EqTWyd1ZfAI7tIFOgmmEBPoGRw$].Files.createTempDir().
 By default, on unix-like systems, the created directory is world-readable 
(readable by an attacker with access to the system). The method in question has 
been marked @Deprecated in versions 30.0 and later and should not be used. For 
Android developers, we recommend choosing a temporary directory API provided by 
Android, such as context.getCacheDir(). For other Java developers, we recommend 
migrating to the Java 7 API java.nio.file.Files.createTempDirectory() which 
explicitly configures permissions of 700, or configuring the Java runtime's 

[jira] [Assigned] (SPARK-43705) Enable TimedeltaIndexTests.test_properties for pandas 2.0.0.

2023-08-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43705:


Assignee: Haejoon Lee

> Enable TimedeltaIndexTests.test_properties for pandas 2.0.0.
> 
>
> Key: SPARK-43705
> URL: https://issues.apache.org/jira/browse/SPARK-43705
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Enable TimedeltaIndexTests.test_properties for pandas 2.0.0.






[jira] [Resolved] (SPARK-43245) Fix DatetimeIndex.microsecond to return 'int32' instead of 'int64' type of Index.

2023-08-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43245.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42271
[https://github.com/apache/spark/pull/42271]

> Fix DatetimeIndex.microsecond to return 'int32' instead of 'int64' type of 
> Index.
> -
>
> Key: SPARK-43245
> URL: https://issues.apache.org/jira/browse/SPARK-43245
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 4.0.0
>
>
> https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#index-can-now-hold-numpy-numeric-dtypes






[jira] [Resolved] (SPARK-43705) Enable TimedeltaIndexTests.test_properties for pandas 2.0.0.

2023-08-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43705.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42271
[https://github.com/apache/spark/pull/42271]

> Enable TimedeltaIndexTests.test_properties for pandas 2.0.0.
> 
>
> Key: SPARK-43705
> URL: https://issues.apache.org/jira/browse/SPARK-43705
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 4.0.0
>
>
> Enable TimedeltaIndexTests.test_properties for pandas 2.0.0.






[jira] [Assigned] (SPARK-43245) Fix DatetimeIndex.microsecond to return 'int32' instead of 'int64' type of Index.

2023-08-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43245:


Assignee: Haejoon Lee

> Fix DatetimeIndex.microsecond to return 'int32' instead of 'int64' type of 
> Index.
> -
>
> Key: SPARK-43245
> URL: https://issues.apache.org/jira/browse/SPARK-43245
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#index-can-now-hold-numpy-numeric-dtypes






[jira] [Commented] (SPARK-44742) Add Spark version drop down to the PySpark doc site

2023-08-10 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752689#comment-17752689
 ] 

BingKun Pan commented on SPARK-44742:
-

I'll work on it.

> Add Spark version drop down to the PySpark doc site
> ---
>
> Key: SPARK-44742
> URL: https://issues.apache.org/jira/browse/SPARK-44742
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, the PySpark documentation does not have a version dropdown. While 
> by default we want people to land on the latest version, a version dropdown 
> would make it easier for people to find the docs for other releases. 
> Other libraries such as numpy already have such a dropdown.
> !image-2023-08-09-09-38-00-805.png|width=214,height=189!






[jira] [Updated] (SPARK-44756) Executor hangs when RetryingBlockTransferor fails to initiate retry

2023-08-10 Thread Harunobu Daikoku (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harunobu Daikoku updated SPARK-44756:
-
Description: 
We have observed this issue several times in our production environment, where 
some executors get stuck at BlockTransferService#fetchBlockSync().

After some investigation, the issue seems to be caused by an unhandled edge 
case in RetryingBlockTransferor.

1. Shuffle transfer fails for whatever reason
{noformat}
java.io.IOException: Cannot allocate memory
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:51)
at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
at 
org.apache.spark.network.shuffle.SimpleDownloadFile$SimpleDownloadWritableChannel.write(SimpleDownloadFile.java:78)
at 
org.apache.spark.network.shuffle.OneForOneBlockFetcher$DownloadCallback.onData(OneForOneBlockFetcher.java:340)
at 
org.apache.spark.network.client.StreamInterceptor.handle(StreamInterceptor.java:79)
at 
org.apache.spark.network.util.TransportFrameDecoder.feedInterceptor(TransportFrameDecoder.java:263)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:87)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
{noformat}
2. The above exception caught by 
[AbstractChannelHandlerContext#invokeChannelRead()|https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L381],
 and propagated to the exception handler

3. Exception reaches 
[RetryingBlockTransferor#initiateRetry()|https://github.com/apache/spark/blob/v3.3.1/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java#L178-L180],
 and it tries to initiate retry
{noformat}
23/08/09 16:58:37 shuffle-client-4-2 INFO RetryingBlockTransferor: Retrying 
fetch (1/3) for 1 outstanding blocks after 5000 ms
{noformat}
4. Retry initiation fails (in our case, it fails to create a new thread)

5. Exception caught by 
[AbstractChannelHandlerContext#invokeExceptionCaught()|https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L305-L309],
 and not further processed
{noformat}
23/08/09 16:58:53 shuffle-client-4-2 DEBUG AbstractChannelHandlerContext: An 
exception java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:719)
at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
at 
java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
at 
org.apache.spark.network.shuffle.RetryingBlockTransferor.initiateRetry(RetryingBlockTransferor.java:182)
at 
org.apache.spark.network.shuffle.RetryingBlockTransferor.access$500(RetryingBlockTransferor.java:43)
at 
org.apache.spark.network.shuffle.RetryingBlockTransferor$RetryingBlockTransferListener.handleBlockTransferFailure(RetryingBlockTransferor.java:230)
at 
org.apache.spark.network.shuffle.RetryingBlockTransferor$RetryingBlockTransferListener.onBlockFetchFailure(RetryingBlockTransferor.java:260)
at 
org.apache.spark.network.shuffle.OneForOneBlockFetcher.failRemainingBlocks(OneForOneBlockFetcher.java:318)
at 
org.apache.spark.network.shuffle.OneForOneBlockFetcher.access$300(OneForOneBlockFetcher.java:55)
at 
org.apache.spark.network.shuffle.OneForOneBlockFetcher$DownloadCallback.onFailure(OneForOneBlockFetcher.java:357)
at 
org.apache.spark.network.client.StreamInterceptor.exceptionCaught(StreamInterceptor.java:56)
at 
org.apache.spark.network.util.TransportFrameDecoder.exceptionCaught(TransportFrameDecoder.java:231)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
{noformat}
6. In the end, the retry never happens and the executor thread ends up being stuck 
at 
[BlockTransferService#fetchBlockSync()|https://github.com/apache/spark/blob/v3.3.1/core/src/main/scala/org/apache/spark/network/BlockTransferService.scala#L103],
 waiting for the transfer to complete/fail
{noformat}
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)

[jira] [Updated] (SPARK-44756) Executor hangs when RetryingBlockTransferor fails to initiate retry

2023-08-10 Thread Harunobu Daikoku (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harunobu Daikoku updated SPARK-44756:
-
Description: 
We have observed this issue several times in our production environment, where 
some executors get stuck at BlockTransferService.fetchBlockSync.

After some investigation, the issue seems to be caused by an unhandled edge 
case in RetryingBlockTransferor.

1. Shuffle transfer fails for whatever reason
{noformat}
java.io.IOException: Cannot allocate memory
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:51)
at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
at 
org.apache.spark.network.shuffle.SimpleDownloadFile$SimpleDownloadWritableChannel.write(SimpleDownloadFile.java:78)
at 
org.apache.spark.network.shuffle.OneForOneBlockFetcher$DownloadCallback.onData(OneForOneBlockFetcher.java:340)
at 
org.apache.spark.network.client.StreamInterceptor.handle(StreamInterceptor.java:79)
at 
org.apache.spark.network.util.TransportFrameDecoder.feedInterceptor(TransportFrameDecoder.java:263)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:87)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
{noformat}
2. The above exception caught by 
[AbstractChannelHandlerContext#invokeChannelRead()|https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L381],
 and propagated to the exception handler

3. Exception reaches 
[RetryingBlockTransferor#initiateRetry()|https://github.com/apache/spark/blob/v3.3.1/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java#L178-L180],
 and it tries to initiate retry
{noformat}
23/08/09 16:58:37 shuffle-client-4-2 INFO RetryingBlockTransferor: Retrying 
fetch (1/3) for 1 outstanding blocks after 5000 ms
{noformat}
4. Retry initiation fails (in our case, it fails to create a new thread)

5. Exception caught by 
[AbstractChannelHandlerContext#invokeExceptionCaught()|https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L305-L309],
 and not further processed
{noformat}
23/08/09 16:58:53 shuffle-client-4-2 DEBUG AbstractChannelHandlerContext: An 
exception java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:719)
at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
at 
java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
at 
org.apache.spark.network.shuffle.RetryingBlockTransferor.initiateRetry(RetryingBlockTransferor.java:182)
at 
org.apache.spark.network.shuffle.RetryingBlockTransferor.access$500(RetryingBlockTransferor.java:43)
at 
org.apache.spark.network.shuffle.RetryingBlockTransferor$RetryingBlockTransferListener.handleBlockTransferFailure(RetryingBlockTransferor.java:230)
at 
org.apache.spark.network.shuffle.RetryingBlockTransferor$RetryingBlockTransferListener.onBlockFetchFailure(RetryingBlockTransferor.java:260)
at 
org.apache.spark.network.shuffle.OneForOneBlockFetcher.failRemainingBlocks(OneForOneBlockFetcher.java:318)
at 
org.apache.spark.network.shuffle.OneForOneBlockFetcher.access$300(OneForOneBlockFetcher.java:55)
at 
org.apache.spark.network.shuffle.OneForOneBlockFetcher$DownloadCallback.onFailure(OneForOneBlockFetcher.java:357)
at 
org.apache.spark.network.client.StreamInterceptor.exceptionCaught(StreamInterceptor.java:56)
at 
org.apache.spark.network.util.TransportFrameDecoder.exceptionCaught(TransportFrameDecoder.java:231)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
{noformat}
6. In the end, the retry never happens and the executor thread ends up being stuck 
at 
[BlockTransferService#fetchBlockSync()|https://github.com/apache/spark/blob/v3.3.1/core/src/main/scala/org/apache/spark/network/BlockTransferService.scala#L103],
 waiting for the transfer to complete/fail
{noformat}
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)

[jira] [Updated] (SPARK-44756) Executor hangs when RetryingBlockTransferor fails to initiate retry

2023-08-10 Thread Harunobu Daikoku (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harunobu Daikoku updated SPARK-44756:
-
Summary: Executor hangs when RetryingBlockTransferor fails to initiate 
retry  (was: Executor hangs when RetryingBlockTransferor fails to submit retry 
request)

> Executor hangs when RetryingBlockTransferor fails to initiate retry
> ---
>
> Key: SPARK-44756
> URL: https://issues.apache.org/jira/browse/SPARK-44756
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.3.1
>Reporter: Harunobu Daikoku
>Priority: Minor
>
> We have observed this issue several times in our production environment, 
> where some executors get stuck at BlockTransferService.fetchBlockSync.
> After some investigation, the issue seems to be caused by an unhandled edge 
> case in RetryingBlockTransferor.
> 1. Shuffle transfer fails for whatever reason
> {noformat}
> java.io.IOException: Cannot allocate memory
>   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>   at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
>   at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>   at sun.nio.ch.IOUtil.write(IOUtil.java:51)
>   at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
>   at 
> org.apache.spark.network.shuffle.SimpleDownloadFile$SimpleDownloadWritableChannel.write(SimpleDownloadFile.java:78)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher$DownloadCallback.onData(OneForOneBlockFetcher.java:340)
>   at 
> org.apache.spark.network.client.StreamInterceptor.handle(StreamInterceptor.java:79)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.feedInterceptor(TransportFrameDecoder.java:263)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:87)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
> {noformat}
> 2. The above exception caught by 
> [AbstractChannelHandlerContext#invokeChannelRead()|https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L381],
>  and propagated to the exception handler
> 3. Exception reaches 
> [RetryingBlockTransferor#initiateRetry()|https://github.com/apache/spark/blob/v3.3.1/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java#L178-L180],
>  and it tries to initiate retry
> {noformat}
> 23/08/09 16:58:37 shuffle-client-4-2 INFO RetryingBlockTransferor: Retrying 
> fetch (1/3) for 1 outstanding blocks after 5000 ms
> {noformat}
> 4. Retry initiation fails (in our case, it fails to create a new thread)
> 5. Exception caught by 
> [AbstractChannelHandlerContext#invokeExceptionCaught()|https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L299],
>  and not further processed
> {noformat}
> 23/08/09 16:58:53 shuffle-client-4-2 DEBUG AbstractChannelHandlerContext: An 
> exception java.lang.OutOfMemoryError: unable to create new native thread
>   at java.lang.Thread.start0(Native Method)
>   at java.lang.Thread.start(Thread.java:719)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor.initiateRetry(RetryingBlockTransferor.java:182)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor.access$500(RetryingBlockTransferor.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor$RetryingBlockTransferListener.handleBlockTransferFailure(RetryingBlockTransferor.java:230)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor$RetryingBlockTransferListener.onBlockFetchFailure(RetryingBlockTransferor.java:260)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher.failRemainingBlocks(OneForOneBlockFetcher.java:318)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher.access$300(OneForOneBlockFetcher.java:55)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher$DownloadCallback.onFailure(OneForOneBlockFetcher.java:357)
>   at 
> org.apache.spark.network.client.StreamInterceptor.exceptionCaught(StreamInterceptor.java:56)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.exceptionCaught(TransportFrameDecoder.java:231)
>   at 
> 

[jira] [Created] (SPARK-44756) Executor hangs when RetryingBlockTransferor fails to submit retry request

2023-08-10 Thread Harunobu Daikoku (Jira)
Harunobu Daikoku created SPARK-44756:


 Summary: Executor hangs when RetryingBlockTransferor fails to 
submit retry request
 Key: SPARK-44756
 URL: https://issues.apache.org/jira/browse/SPARK-44756
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 3.3.1
Reporter: Harunobu Daikoku


We have observed this issue several times in our production environment, where 
some executors get stuck at BlockTransferService.fetchBlockSync.

After some investigation, the issue seems to be caused by an unhandled edge 
case in RetryingBlockTransferor.

1. Shuffle transfer fails for whatever reason
{noformat}
java.io.IOException: Cannot allocate memory
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:51)
at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
at 
org.apache.spark.network.shuffle.SimpleDownloadFile$SimpleDownloadWritableChannel.write(SimpleDownloadFile.java:78)
at 
org.apache.spark.network.shuffle.OneForOneBlockFetcher$DownloadCallback.onData(OneForOneBlockFetcher.java:340)
at 
org.apache.spark.network.client.StreamInterceptor.handle(StreamInterceptor.java:79)
at 
org.apache.spark.network.util.TransportFrameDecoder.feedInterceptor(TransportFrameDecoder.java:263)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:87)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
{noformat}
2. The above exception is caught by 
[AbstractChannelHandlerContext#invokeChannelRead()|https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L381],
 and propagated to the exception handler

3. The exception reaches 
[RetryingBlockTransferor#initiateRetry()|https://github.com/apache/spark/blob/v3.3.1/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java#L178-L180],
 which tries to initiate a retry
{noformat}
23/08/09 16:58:37 shuffle-client-4-2 INFO RetryingBlockTransferor: Retrying 
fetch (1/3) for 1 outstanding blocks after 5000 ms
{noformat}
4. Retry initiation fails (in our case, it fails to create a new thread)

5. The exception is caught by 
[AbstractChannelHandlerContext#invokeExceptionCaught()|https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L299],
 and is not processed further
{noformat}
23/08/09 16:58:53 shuffle-client-4-2 DEBUG AbstractChannelHandlerContext: An 
exception java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:719)
at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
at 
java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
at 
org.apache.spark.network.shuffle.RetryingBlockTransferor.initiateRetry(RetryingBlockTransferor.java:182)
at 
org.apache.spark.network.shuffle.RetryingBlockTransferor.access$500(RetryingBlockTransferor.java:43)
at 
org.apache.spark.network.shuffle.RetryingBlockTransferor$RetryingBlockTransferListener.handleBlockTransferFailure(RetryingBlockTransferor.java:230)
at 
org.apache.spark.network.shuffle.RetryingBlockTransferor$RetryingBlockTransferListener.onBlockFetchFailure(RetryingBlockTransferor.java:260)
at 
org.apache.spark.network.shuffle.OneForOneBlockFetcher.failRemainingBlocks(OneForOneBlockFetcher.java:318)
at 
org.apache.spark.network.shuffle.OneForOneBlockFetcher.access$300(OneForOneBlockFetcher.java:55)
at 
org.apache.spark.network.shuffle.OneForOneBlockFetcher$DownloadCallback.onFailure(OneForOneBlockFetcher.java:357)
at 
org.apache.spark.network.client.StreamInterceptor.exceptionCaught(StreamInterceptor.java:56)
at 
org.apache.spark.network.util.TransportFrameDecoder.exceptionCaught(TransportFrameDecoder.java:231)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
{noformat}
6. As a result, the retry never happens and the executor thread ends up stuck 
at 
[BlockTransferService#fetchBlockSync()|https://github.com/apache/spark/blob/v3.3.1/core/src/main/scala/org/apache/spark/network/BlockTransferService.scala#L103],
 waiting for the transfer to complete or fail (see the sketch after the thread 
dump below)
{noformat}
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
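
For illustration only, here is a minimal sketch of the kind of guard described 
in steps 4-6. The class and listener below are simplified stand-ins, not 
Spark's actual RetryingBlockTransferor API: the idea is that if submitting the 
retry task itself throws, the outstanding blocks are failed explicitly so that 
a caller blocked in a synchronous fetch is released instead of hanging forever.
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical, simplified guard: if scheduling the retry fails (e.g.
// OutOfMemoryError: unable to create new native thread), report the failure
// to the listener instead of letting it be swallowed by the Netty pipeline.
public class RetryGuardSketch {

  /** Minimal stand-in for a block-fetch listener (not Spark's interface). */
  interface FetchListener {
    void onFetchFailure(String blockId, Throwable cause);
  }

  private final ExecutorService retryExecutor = Executors.newSingleThreadExecutor();

  void initiateRetry(String[] outstandingBlocks, FetchListener listener, Runnable retryTask) {
    try {
      retryExecutor.submit(retryTask);
    } catch (Throwable t) {
      // Without this fallback the error only reaches exceptionCaught() in the
      // channel pipeline, the listener is never completed, and the caller
      // stays parked in BlockTransferService.fetchBlockSync.
      for (String blockId : outstandingBlocks) {
        listener.onFetchFailure(blockId, t);
      }
    }
  }
}
{code}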

[jira] [Updated] (SPARK-44573) Couldn't submit Spark application to Kubernetes in version v1.27.3

2023-08-10 Thread Siddaraju G C (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddaraju G C updated SPARK-44573:
--
Priority: Blocker  (was: Major)

> Couldn't submit Spark application to Kubernetes in version v1.27.3
> --
>
> Key: SPARK-44573
> URL: https://issues.apache.org/jira/browse/SPARK-44573
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Submit
>Affects Versions: 3.4.1
>Reporter: Siddaraju G C
>Priority: Blocker
>
> Spark-submit (cluster mode on Kubernetes) results in the error 
> *io.fabric8.kubernetes.client.KubernetesClientException* on my 3-node k8s 
> cluster.
> Steps followed:
>  * Using IBM Cloud, created 3 instances
>  * The 1st instance acts as the master node and the other two act as worker nodes
>  
> {noformat}
> root@vsi-spark-master:/opt# kubectl get nodes
> NAME                 STATUS   ROLES                  AGE   VERSION
> vsi-spark-master     Ready    control-plane,master   2d    v1.27.3+k3s1
> vsi-spark-worker-1   Ready    <none>                 47h   v1.27.3+k3s1
> vsi-spark-worker-2   Ready    <none>                 47h   v1.27.3+k3s1{noformat}
>  * Copied spark-3.4.1-bin-hadoop3.tgz into the /opt/spark folder
>  * Ran Spark using the command below
>  
> {noformat}
> root@vsi-spark-master:/opt# /opt/spark/bin/spark-submit --master 
> k8s://http://:6443 --conf 
> spark.kubernetes.authenticate.submission.oauthToken=$TOKEN --deploy-mode 
> cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf 
> spark.executor.instances=5 --conf 
> spark.kubernetes.authenticate.driver.serviceAccountName=spark  --conf 
> spark.kubernetes.container.image=sushmakorati/testrepo:pyrandomGB 
> local:///opt/spark/examples/jars/spark-examples_2.12-3.4.1.jar{noformat}
>  * And got the error message below.
> {noformat}
> 23/07/27 12:56:26 WARN Utils: Kubernetes master URL uses HTTP instead of HTTPS.
> 23/07/27 12:56:26 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 23/07/27 12:56:26 INFO SparkKubernetesClientFactory: Auto-configuring K8S 
> client using current context from users K8S config file
> 23/07/27 12:56:26 INFO KerberosConfDriverFeatureStep: You have not specified 
> a krb5.conf file locally or via a ConfigMap. Make sure that you have the 
> krb5.conf locally on the driver image.
> 23/07/27 12:56:27 ERROR Client: Please check "kubectl auth can-i create pod" 
> first. It should be yes.
> Exception in thread "main" 
> io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
>     at 
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:129)
>     at 
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:122)
>     at 
> io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:44)
>     at 
> io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:1113)
>     at 
> io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:93)
>     at 
> org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:153)
>     at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5(KubernetesClientApplication.scala:250)
>     at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5$adapted(KubernetesClientApplication.scala:244)
>     at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2786)
>     at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:244)
>     at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:216)
>     at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1020)
>     at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
>     at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
>     at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
>     at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1120)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.io.IOException: Connection reset
>     at 
> io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:535)
>     at 
> io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:558)
>     at 
> io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleCreate(OperationSupport.java:349)
>     at 
> 
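
As the "kubectl auth can-i create pod" hint in the error suggests, one thing 
worth verifying is that the submitting identity and the `spark` service account 
named in the command have permission to create pods. A minimal RBAC setup 
sketch, following the standard Spark-on-Kubernetes documentation and assuming 
the default namespace:
{noformat}
kubectl create serviceaccount spark --namespace=default
kubectl create clusterrolebinding spark-role --clusterrole=edit \
  --serviceaccount=default:spark --namespace=default
kubectl auth can-i create pod
{noformat}
Separately, the master URL in the command uses http:// with port 6443, which is 
normally the TLS port on k3s (note the earlier WARN about HTTP vs HTTPS); the 
"Connection reset" could simply be the API server dropping a plain-HTTP 
connection, in which case a k8s://https://... master URL would be needed.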

[jira] [Assigned] (SPARK-44691) Move Subclasses of Analysis to sql/api

2023-08-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-44691:
---

Assignee: Yihong He

> Move Subclasses of Analysis to sql/api
> --
>
> Key: SPARK-44691
> URL: https://issues.apache.org/jira/browse/SPARK-44691
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yihong He
>Assignee: Yihong He
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44691) Move Subclasses of Analysis to sql/api

2023-08-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-44691.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

> Move Subclasses of Analysis to sql/api
> --
>
> Key: SPARK-44691
> URL: https://issues.apache.org/jira/browse/SPARK-44691
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yihong He
>Assignee: Yihong He
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44755) Local tmp data is not cleared when using Spark Streaming to consume from Kafka

2023-08-10 Thread leesf (Jira)
leesf created SPARK-44755:
-

 Summary: Local tmp data is not cleared when using Spark Streaming to 
consume from Kafka
 Key: SPARK-44755
 URL: https://issues.apache.org/jira/browse/SPARK-44755
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: leesf


We are using Spark 3.2 to consume data from Kafka and then `collectAsMap` to 
send the result to the driver. We found that the local temp files do not get 
cleared if the data consumed from Kafka is larger than 200m 
(spark.network.maxRemoteBlockSizeFetchToMem).

!https://intranetproxy.alipay.com/skylark/lark/0/2023/png/320711/1691419276170-2dd0964f-4cf4-4b15-9fbe-9622116671da.png!
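
For context, spark.network.maxRemoteBlockSizeFetchToMem is the threshold above 
which fetched remote blocks are streamed to local temp files instead of being 
kept in memory, which is why temp files appear once the collected data exceeds 
200m. As a hedged mitigation only (it does not fix the missing cleanup, and it 
assumes there is enough memory headroom to hold such blocks), the threshold can 
be raised so that blocks of this size are fetched to memory and no temp file is 
created:
{noformat}
--conf spark.network.maxRemoteBlockSizeFetchToMem=1g
{noformat}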



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44754) Improve DeduplicateRelations rewriteAttrs compatibility

2023-08-10 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan updated SPARK-44754:

Description: 
Following [https://github.com/apache/spark/pull/41554], we should add tests for

{{MapPartitionsInR}}, {{MapPartitionsInRWithArrow}}, 
{{MapElements}}, {{MapGroups}}, {{FlatMapGroupsWithState}}, 
{{FlatMapGroupsInR}}, {{FlatMapGroupsInRWithArrow}}, {{FlatMapGroupsInPandas}}, 
{{MapInPandas}}, {{PythonMapInArrow}}, {{FlatMapGroupsInPandasWithState}} 
and {{FlatMapCoGroupsInPandas}},

to make sure DeduplicateRelations rewriteAttrs rewrites their attributes 
correctly. We should also fix the incorrect behavior, following 
[https://github.com/apache/spark/pull/41554].

  was:
Follow [https://github.com/apache/spark/pull/41554], we should add test for

{{MapPartitionsInR}}, {{MapPartitionsInRWithArrow}}, 
{{MapElements}}, {{MapGroups}}, {{FlatMapGroupsWithState}}, 
{{FlatMapGroupsInR}}, {{FlatMapGroupsInRWithArrow}}, {{FlatMapGroupsInPandas}}, 
{{MapInPandas}}, {{PythonMapInArrow}}, {{FlatMapGroupsInPandasWithState}} 
and {{FlatMapCoGroupsInPandas}}

To make sure DeduplicateRelations rewriteAttrs will rewrite its attribute 
normally. Also should fix the error behavior follow 
[https://github.com/apache/spark/pull/41554].


> Improve DeduplicateRelations rewriteAttrs compatibility
> ---
>
> Key: SPARK-44754
> URL: https://issues.apache.org/jira/browse/SPARK-44754
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jia Fan
>Priority: Major
>
> Following [https://github.com/apache/spark/pull/41554], we should add tests 
> for
> {{MapPartitionsInR}}, {{MapPartitionsInRWithArrow}}, {{MapElements}}, 
> {{MapGroups}}, {{FlatMapGroupsWithState}}, {{FlatMapGroupsInR}}, 
> {{FlatMapGroupsInRWithArrow}}, {{FlatMapGroupsInPandas}}, {{MapInPandas}}, 
> {{PythonMapInArrow}}, {{FlatMapGroupsInPandasWithState}} and 
> {{FlatMapCoGroupsInPandas}},
> to make sure DeduplicateRelations rewriteAttrs rewrites their attributes 
> correctly. We should also fix the incorrect behavior, following 
> [https://github.com/apache/spark/pull/41554].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44754) Improve DeduplicateRelations rewriteAttrs compatibility

2023-08-10 Thread Jia Fan (Jira)
Jia Fan created SPARK-44754:
---

 Summary: Improve DeduplicateRelations rewriteAttrs compatibility
 Key: SPARK-44754
 URL: https://issues.apache.org/jira/browse/SPARK-44754
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Jia Fan


Follow [https://github.com/apache/spark/pull/41554], we should add test for

{{MapPartitionsInR}}, {{MapPartitionsInRWithArrow}}, 
{{MapElements}}, {{MapGroups}}, {{FlatMapGroupsWithState}}, 
{{FlatMapGroupsInR}}, {{FlatMapGroupsInRWithArrow}}, {{FlatMapGroupsInPandas}}, 
{{MapInPandas}}, {{PythonMapInArrow}}, {{FlatMapGroupsInPandasWithState}} 
and {{FlatMapCoGroupsInPandas}}

To make sure DeduplicateRelations rewriteAttrs will rewrite its attribute 
normally. Also should fix the error behavior follow 
[https://github.com/apache/spark/pull/41554].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org