Hi,
I am working on a project, and as resources I have provisioned 40 executors
with 14 GB of memory per executor.
I am trying to optimize my Spark job so that Spark distributes the work
evenly across the executors.
Could you please give me some advice?
Kind regards,
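(Not from the original thread.) A minimal configuration sketch for the question above, using the 40 executors / 14 GB figures quoted; the core count, partition count, and input path are assumptions, not anything the poster stated:

```python
# Sketch only: session-level settings matching the 40 executors / 14 GB
# described above, plus an explicit repartition so work spreads evenly.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("even-distribution-sketch")
    .config("spark.executor.instances", "40")
    .config("spark.executor.memory", "14g")
    .config("spark.executor.cores", "4")            # assumed value
    .config("spark.sql.shuffle.partitions", "160")  # ~ executors * cores
    .getOrCreate()
)

df = spark.read.parquet("input_path")  # hypothetical input
# Repartition to a multiple of total cores so every executor gets work;
# skewed or too-few partitions are the usual cause of uneven distribution.
df = df.repartition(160)
```

Whether this actually balances the load depends on key skew in the data; `repartition` with no column argument does a round-robin shuffle, which is the simplest way to force an even split.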
I'm curious about using shared memory to speed up the JVM->Python round
trip. Is there any sane way to do anonymous shared memory in Java/Scala?
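(Sketch, not from the thread.) On the Python side of that round trip, the standard library already provides named shared memory, backed by `/dev/shm` on Linux; a JVM process could `mmap` the same file by name. The segment name and payload below are invented:

```python
from multiprocessing import shared_memory

# Create a named shared-memory segment; another process -- including a
# JVM that memory-maps the matching /dev/shm file -- can attach by name.
shm = shared_memory.SharedMemory(create=True, size=1024, name="spark_py_demo")
shm.buf[:5] = b"hello"

# A second process would attach like this:
peer = shared_memory.SharedMemory(name="spark_py_demo")
data = bytes(peer.buf[:5])

peer.close()
shm.close()
shm.unlink()  # free the segment
print(data)
```

Truly *anonymous* (unnamed) shared memory is harder to hand across unrelated processes; a named segment like this is the usual compromise.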
On Sat, Jul 16, 2022 at 16:10 Sebastian Piu wrote:
Other alternatives are to look at how PythonRDD does it in Spark; you could
also try a more traditional setup where you expose your Python functions
behind a local/remote service and call that from Scala, say over
Thrift/gRPC/HTTP/a local socket, etc.
Another option, but I've never done it
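(Not from the thread.) To make the "Python functions behind a local socket" option concrete, here is a minimal sketch; the function, port handling, and line-delimited JSON protocol are all invented for illustration. A Scala caller would open a plain `java.net.Socket` to the same port and speak the same protocol:

```python
import json
import socket
import threading

def py_function(payload):
    # Stand-in for the real Python logic you want to expose.
    return {"result": payload["x"] * 2}

def serve_once(server):
    # Accept one connection, read a JSON request line, reply with JSON.
    conn, _ = server.accept()
    with conn:
        request = json.loads(conn.makefile().readline())
        conn.sendall((json.dumps(py_function(request)) + "\n").encode())

# Bind to an ephemeral local port; the JVM side would connect here.
server = socket.create_server(("127.0.0.1", 0))
port = server.getsockname()[1]
threading.Thread(target=serve_once, args=(server,), daemon=True).start()

# Client side (what the Scala caller would do, shown in Python for brevity)
with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b'{"x": 21}\n')
    reply = json.loads(client.makefile().readline())
server.close()
print(reply)
```

The same shape works over Thrift, gRPC, or HTTP; the raw socket version just has the fewest moving parts for a local sidecar process.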
Use GraphFrames?
On Sat, Jul 16, 2022 at 3:54 PM Yuhao Zhang wrote:
Hi Shay,
Thanks for your reply! I would very much like to use PySpark. However, my
project depends on GraphX, which is only available in the Scala API as far
as I know. So I'm locked into Scala and trying to find a way out. I wonder
if there's a way to work around it.
Best regards,
Yuhao Zhang
ok thanks. guess i am simply misremembering that i saw the shuffle files
getting re-used across jobs (actions). it was probably across stages for
the same job.
in structured streaming this is a pretty big deal. if you join a streaming
dataframe with a large static dataframe each microbatch
Spark can reuse shuffle stages within the same job (action), not across jobs.
From: Koert Kuipers
Sent: Saturday, July 16, 2022 6:43 PM
To: user
Subject: [EXTERNAL] spark re-use shuffle files not happening
i have seen many jobs where spark re-uses shuffle files (and skips a stage
of a job), which is an awesome feature given how expensive shuffles are,
and i generally now assume this will happen.
however i feel like i am going a little crazy today. i did the simplest
test in spark 3.3.0, basically i
Hi Folks,
Have created a UDF that queries a Confluent Schema Registry for a schema,
which is then used within a Dataset select with the from_avro function to
decode an Avro-encoded value (reading from a bunch of Kafka topics)
Dataset recordDF = df.select(