Question regarding how to make Spark (Scala) evenly divide the spark job between executors

2022-07-16 Thread Orkhan Dadashov
Hi, I am working on a project and have provisioned 40 executors with 14 GB of memory per executor as resources. I am trying to optimize my Spark job so that Spark distributes the work evenly across the executors. Could you please give me some advice? Kind regards,
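
A minimal sketch of the usual starting points: size the shuffle parallelism to the total core count and repartition on a well-distributed key. The core count, partition numbers, paths and the column name are illustrative assumptions, not from the thread.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sizing: 40 executors * 4 cores = 160 cores total.
val spark = SparkSession.builder()
  .appName("even-distribution-sketch")
  .config("spark.executor.instances", "40")
  .config("spark.executor.memory", "14g")
  // 2-3 tasks per core usually keeps every executor busy.
  .config("spark.sql.shuffle.partitions", "320")
  // Adaptive query execution can also rebalance skewed shuffle partitions.
  .config("spark.sql.adaptive.enabled", "true")
  .getOrCreate()

val df = spark.read.parquet("/data/input")            // illustrative path

// Repartitioning by a well-distributed key spreads records across all
// executors and avoids a few oversized tasks ("some_key" is hypothetical).
val balanced = df.repartition(320, df("some_key"))
balanced.write.parquet("/data/output")
```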

Re: [EXTERNAL] RDD.pipe() for binary data

2022-07-16 Thread Andrew Melo
I'm curious about using shared memory to speed up the JVM->Python round trip. Is there any sane way to do anonymous shared memory in Java/Scala? On Sat, Jul 16, 2022 at 16:10 Sebastian Piu wrote: > Other alternatives are to look at how PythonRDD does it in spark, you > could also try to go for
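
The JDK does not expose truly anonymous shared memory directly; a common stand-in is a memory-mapped temp file that both the JVM and a co-located Python worker open. A rough sketch under that assumption (not the mechanism PythonRDD uses):

```scala
import java.nio.channels.FileChannel
import java.nio.file.{Files, StandardOpenOption}

// Create a temp file and map it into the JVM's address space.
val path = Files.createTempFile("spark-shm-", ".buf")
val channel = FileChannel.open(path,
  StandardOpenOption.READ, StandardOpenOption.WRITE)

val size = 4L * 1024 * 1024                      // 4 MiB region, illustrative
val buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, size)

// Write a binary payload for the Python side to pick up.
buffer.put("hello from the JVM".getBytes("UTF-8"))
buffer.force()                                   // flush mapped pages

// A Python worker given `path` could read the same region with:
//   mmap.mmap(open(path, "r+b").fileno(), size)
channel.close()
```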

Re: [EXTERNAL] RDD.pipe() for binary data

2022-07-16 Thread Sebastian Piu
Other alternatives are to look at how PythonRDD does it in Spark; you could also try to go for a more traditional setup where you expose your Python functions behind a local/remote service and call that from Scala - say over thrift/grpc/http/local socket etc. Another option, but I've never done it
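
Staying with the thread's subject, one workaround for RDD.pipe()'s line-oriented, text-only protocol is to base64-encode binary records before piping them to the external script. A minimal sketch; "process_binary.py" is a hypothetical script that reads base64 lines from stdin and prints base64 results.

```scala
import java.util.Base64
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pipe-binary-sketch").getOrCreate()
val sc = spark.sparkContext

// Illustrative binary records.
val binaryRdd = sc.parallelize(Seq("record-1".getBytes, "record-2".getBytes))

// Base64-encode so each record survives pipe()'s newline-delimited protocol.
val encoded = binaryRdd.map(bytes => Base64.getEncoder.encodeToString(bytes))

// The hypothetical script decodes stdin lines, applies the Python function,
// and prints base64-encoded output lines.
val piped = encoded.pipe("python3 process_binary.py")

// Decode the script's output back into bytes on the JVM side.
val results = piped.map(line => Base64.getDecoder.decode(line))
results.count()
```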

Re: [EXTERNAL] RDD.pipe() for binary data

2022-07-16 Thread Sean Owen
Use GraphFrames? On Sat, Jul 16, 2022 at 3:54 PM Yuhao Zhang wrote: > Hi Shay, > > Thanks for your reply! I would very much like to use pyspark. However, my > project depends on GraphX, which is only available in the Scala API as far > as I know. So I'm locked with Scala and trying to find a

Re: [EXTERNAL] RDD.pipe() for binary data

2022-07-16 Thread Yuhao Zhang
Hi Shay, Thanks for your reply! I would very much like to use pyspark. However, my project depends on GraphX, which is only available in the Scala API as far as I know. So I'm locked into Scala and trying to find a way out. I wonder if there's a way to get around it. Best regards, Yuhao Zhang
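
For context on the GraphFrames suggestion: it is a DataFrame-based graph library (a separate package, e.g. via --packages graphframes:graphframes) that covers many GraphX-style algorithms and exposes the same API from Python. Whether it can replace GraphX depends on which features the project uses. A minimal sketch:

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder().appName("graphframes-sketch").getOrCreate()
import spark.implicits._

// The vertex DataFrame needs an "id" column; edges need "src" and "dst".
val vertices = Seq(("a", "Alice"), ("b", "Bob"), ("c", "Carol")).toDF("id", "name")
val edges = Seq(("a", "b"), ("b", "c"), ("c", "a")).toDF("src", "dst")

val g = GraphFrame(vertices, edges)

// PageRank, one of several GraphX-style algorithms GraphFrames provides.
val ranks = g.pageRank.resetProbability(0.15).maxIter(10).run()
ranks.vertices.select("id", "pagerank").show()
```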

Re: [EXTERNAL] spark re-use shuffle files not happening

2022-07-16 Thread Koert Kuipers
Ok, thanks. Guess I am simply misremembering that I saw the shuffle files getting re-used across jobs (actions). It was probably across stages for the same job. In structured streaming this is a pretty big deal: if you join a streaming dataframe with a large static dataframe each microbatch
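
Since each microbatch is its own job and shuffle files are not reused across jobs, the usual mitigations for a stream-static join are to cache the static side (so it is at least not re-read every microbatch) or, if it fits in executor memory, broadcast it so its shuffle disappears entirely. A rough sketch, assuming both sides share a "key" column and illustrative paths/brokers:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("stream-static-join-sketch").getOrCreate()

// Cache the static side so every microbatch does not re-read the source.
val static = spark.read.parquet("/data/static").cache()       // illustrative path

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")            // illustrative
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

// A broadcast hint removes the static side's shuffle from each microbatch,
// at the cost of shipping the whole static table to every executor.
val joined = events.join(broadcast(static), Seq("key"))

joined.writeStream.format("console").start().awaitTermination()
```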

Re: [EXTERNAL] spark re-use shuffle files not happening

2022-07-16 Thread Shay Elbaz
Spark can reuse shuffle stages in the same job (action), not across jobs.
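
A minimal sketch of that distinction, under default settings (exchange reuse enabled); the reuse shows up as a skipped stage in the Spark UI for the single-action case, while separate actions re-run the shuffle unless the result is cached:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("shuffle-reuse-sketch").getOrCreate()
import spark.implicits._

val shuffled = spark.range(0, 1000000L)
  .withColumn("bucket", $"id" % 100)
  .groupBy("bucket")
  .count()

// One action, one job: both join inputs have identical plans, so the
// exchange is reused (ReusedExchange / skipped stage).
shuffled.as("a").join(shuffled.as("b"), "bucket").count()

// Two separate actions are two jobs: the shuffle is recomputed for each,
// unless `shuffled` is cached/persisted first.
shuffled.count()
shuffled.count()
```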

spark re-use shuffle files not happening

2022-07-16 Thread Koert Kuipers
I have seen many jobs where Spark re-uses shuffle files (and skips a stage of a job), which is an awesome feature given how expensive shuffles are, and I generally now assume this will happen. However I feel like I am going a little crazy today. I did the simplest test in Spark 3.3.0, basically I

Spark Convert Column to String

2022-07-16 Thread Gibson
Hi Folks, I have created a UDF that queries a Confluent schema registry for a schema, which is then used within a Dataset select with the from_avro function to decode an Avro-encoded value (reading from a bunch of Kafka topics): Dataset recordDF = df.select(
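
A rough sketch of that decode-then-stringify flow using the spark-avro package's from_avro; the schema literal stands in for the schema-registry lookup, and the brokers/topics are illustrative. Note that Confluent-framed messages carry a 5-byte header (magic byte + schema id) before the Avro payload, which plain from_avro does not strip, so this assumes raw Avro values.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("avro-decode-sketch").getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")      // illustrative
  .option("subscribe", "topic-a,topic-b")
  .load()

// Stand-in for the schema fetched from the registry by the UDF.
val jsonSchema =
  """{"type":"record","name":"Event","fields":[{"name":"msg","type":"string"}]}"""

// Decode the Avro-encoded Kafka value into a struct column.
val decoded = df.select(from_avro(col("value"), jsonSchema).as("record"))

// Converting the decoded struct (or one of its fields) to a string column.
val asString = decoded.select(col("record").cast("string").as("record_str"))
```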