I use Spark in standalone mode. It works well, and the instructions on the
Spark site are accurate for the most part. The only thing that didn't work for
me was the start-all.sh script. Instead, I use a simple script that starts the
master node, then uses SSH to connect to the worker machines and start the
worker nodes.
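
A minimal sketch of that kind of script, assuming Spark is installed at the
same path on every node (SPARK_HOME, the port, and the worker hostnames below
are placeholders to adjust for your environment):

#!/usr/bin/env bash
# Placeholders: install path, master port, and worker hostnames.
SPARK_HOME=/opt/spark
MASTER_URL="spark://$(hostname -f):7077"
WORKERS="node2 node3 node4"

# Start the master on this machine.
"${SPARK_HOME}/sbin/start-master.sh"

# Start a worker on each remote machine, pointing it at the master.
# (On older Spark releases the script is named start-slave.sh.)
for host in ${WORKERS}; do
  ssh "${host}" "${SPARK_HOME}/sbin/start-worker.sh ${MASTER_URL}"
done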

All the nodes will need access to the same data, so you will need some sort
of shared file system. You could use an NFS share mounted to the same point
on each machine, S3, or HDFS.
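
For example, with NFS it could be as simple as mounting the same export at the
same path on every node (the server name and paths here are made up):

# Run on every node; "fileserver" and the paths are placeholders.
sudo mkdir -p /data/shared
sudo mount -t nfs fileserver:/export/shared /data/shared

Jobs running on any node can then read and write paths under
file:///data/shared.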

The standalone scheduler also gives an application every available core by
default, so only one application runs at a time. You can cap the resources
allocated to each application to allow multiple concurrent applications, or
you can enable dynamic allocation so resources scale up and down per
application as needed.
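
As a rough example (the master URL, resource values, and application file are
placeholders, not anything specific to your setup):

# Cap this application's share of the cluster so others can run alongside it.
spark-submit \
  --master spark://master-host:7077 \
  --conf spark.cores.max=8 \
  --conf spark.executor.memory=16g \
  my_job.py

# Or enable dynamic allocation; on standalone this also needs shuffle tracking
# (or the external shuffle service) so idle executors can be released safely.
spark-submit \
  --master spark://master-host:7077 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  my_job.py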

On Fri, Sep 15, 2023 at 5:56 AM Ilango <elango...@gmail.com> wrote:

>
> Hi all,
>
> We have 4 HPC nodes and have installed Spark individually on each node.
>
> Spark is used in local mode (each driver/executor has 8 cores and 65
> GB) in Sparklyr/pyspark using RStudio/Posit Workbench. Slurm is used as the
> scheduler.
>
> As this is local mode, we are facing performance issues (there is only one
> executor) when dealing with large datasets.
>
> Can I convert these 4 nodes into a Spark standalone cluster? We don't have
> Hadoop, so YARN mode is out of scope.
>
> Shall I follow the official documentation for setting up a standalone
> cluster? Will it work? Do I need to be aware of anything else?
> Can you please share your thoughts?
>
> Thanks,
> Elango
>
