How to Process S3 Data in a Scalable Manner Using the Spark API (wholeTextFiles VERY SLOW and NOT scalable)

2021-10-02 Thread Alchemist
Issue: We are using the wholeTextFiles() API to read files from S3, but this API is extremely SLOW for the reasons outlined below. The question is how to fix this issue. Our analysis so far: Spark's wholeTextFiles() works in two steps. …
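A common way around this (a minimal sketch, not from the thread itself; the bucket name, key list, and partition count are placeholder assumptions) is to distribute the per-file reads across executors rather than funnel everything through wholeTextFiles():

    # Hedged sketch: distribute S3 reads instead of relying on
    # wholeTextFiles(), whose listing and split planning happen on the
    # driver. Bucket, keys, and partition count are placeholders.
    import boto3  # assumed installed on the executors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-whole-files").getOrCreate()

    # Option 1: DataFrame reader with one row per file (Spark >= 2.2).
    df = spark.read.option("wholetext", True).text("s3a://my-bucket/prefix/")

    # Option 2: parallelize a pre-built key list (e.g. from an S3 inventory
    # report) and fetch the objects on the executors, so that no single
    # node performs all the S3 round trips.
    keys = ["prefix/part-0001.json", "prefix/part-0002.json"]

    def fetch_partition(key_iter):
        s3 = boto3.client("s3")  # one client per partition
        for key in key_iter:
            obj = s3.get_object(Bucket="my-bucket", Key=key)
            yield key, obj["Body"].read().decode("utf-8")

    contents = spark.sparkContext.parallelize(keys, 64) \
        .mapPartitions(fetch_partition)

The usual diagnosis behind this sketch is that wholeTextFiles() plans its splits from a driver-side listing and reads each file whole (it is not splittable), so moving both the key list and the byte transfer onto the executors is what restores scalability.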

Re: Choice of IDE for Spark

2021-10-02 Thread Christian Pfarr
We use Jupyter on Hadoop (https://jupyterhub-on-hadoop.readthedocs.io/en/latest/) for developing Spark jobs directly inside the cluster they will run on. With that you have direct access to YARN and HDFS (fully secured) without any migration steps. You can also control the size of your Jupyter YARN containers; see the sketch below.
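For the sizing part, here is a minimal sketch of jupyterhub_config.py, assuming the yarnspawner package that jupyterhub-on-hadoop is built around; the values are placeholders:

    # jupyterhub_config.py -- minimal sketch; assumes the YarnSpawner from
    # the jupyterhub-yarnspawner package, with placeholder limits.
    c.JupyterHub.spawner_class = "yarnspawner.YarnSpawner"

    c.YarnSpawner.mem_limit = "4 G"   # memory of each user's YARN container
    c.YarnSpawner.cpu_limit = 2       # vcores of each user's YARN container
    c.YarnSpawner.queue = "default"   # YARN queue notebooks are submitted to

Each user's notebook server then runs as its own YARN application, subject to the scheduler limits of the chosen queue.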

Re: Choice of IDE for Spark

2021-10-02 Thread Паша
Disclaimer: I'm a developer avocado for data engineering at JetBrains, so I'm definitely biased. And if someone likes Zeppelin, there is an awesome integration of Zeppelin into IDEA via the Big Data Tools plugin: one can perform any explorations they want or need and then extract all their work into real …