Code is always distributed for any operation on a DataFrame or RDD. The size
of your code is irrelevant except with respect to JVM memory limits. For most
jobs the entire application jar and all of its dependencies are put on the
classpath of every executor.
There are some exceptions, but generally you should …
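One way to see the mechanism: a plain-Python sketch using the stdlib `marshal`
module, loosely analogous to the cloudpickle-based serialization PySpark
actually performs before shipping a task to executors. None of this is Spark
API; it only illustrates "serialize the function on the driver, rebuild and
run it on the executor".

```python
import marshal
import types

def task(x):
    """A task function written on the driver side."""
    return x * x

# Serialize the function's code object -- loosely what PySpark's
# cloudpickle does before shipping a task to an executor.
payload = marshal.dumps(task.__code__)

# "Executor" side: rebuild the function from the received bytes
# and apply it to a local partition of the data.
rebuilt = types.FunctionType(marshal.loads(payload), globals(), "task")
partition = [1, 2, 3]
print([rebuilt(x) for x in partition])  # [1, 4, 9]
```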
Hi Tufan,
Thanks for the answers. However, by the second point I meant: where does my
code reside? Will it be copied to all the executors, since the code size
would be small, or will it be kept on the driver's side? I know that the
driver converts the code to a DAG, and when an action is …
Please find the answers inline.
1) Can I apply predicate pushdown filters if I have data stored in S3 or it
should be used only while reading from DBs?
It can be applied on S3 if you store the data in Parquet, CSV, JSON, or Avro
format. It does not depend on a DB; it is also supported on object stores.
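To make the answer above concrete, here is a toy plain-Python sketch
(hypothetical data and field names, not the real Parquet reader) of what
pushdown buys you on an object store: the reader checks per-row-group min/max
statistics and skips row groups that cannot match the filter, instead of
downloading and scanning every row.

```python
# Hypothetical row-group metadata, as a Parquet footer would describe it.
row_groups = [
    {"min_year": 2018, "max_year": 2019, "rows": ["..."]},
    {"min_year": 2020, "max_year": 2021, "rows": ["..."]},
    {"min_year": 2022, "max_year": 2023, "rows": ["..."]},
]

def groups_to_scan(groups, year):
    """Keep only row groups whose [min, max] range can contain `year`."""
    return [g for g in groups if g["min_year"] <= year <= g["max_year"]]

# A filter on year = 2020 touches 1 of the 3 row groups.
print(len(groups_to_scan(row_groups, 2020)))  # 1
```

In Spark itself the same effect falls out of writing the filter against the
source, e.g. spark.read.parquet("s3a://bucket/path").filter("year = 2020"),
provided the format carries statistics (Parquet/ORC do).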
Hi Team,
I have various doubts as below:
1) Can I apply predicate pushdown filters if I have data stored in S3 or it
should be used only while reading from DBs?
2) While running the data in distributed form, is my code copied to each and
every executor? As I understand it, that should be the case, since …
Hi,
Thanks for your answers. Much appreciated.
I know that we can cache the data frame in memory or on disk, but I want to
understand: when the data frame is loaded initially, where does it reside by
default?
Thanks,
Sid
On Wed, Jun 22, 2022 at 6:10 AM Yong Walt wrote:
> These are the basic
These are the basic concepts in Spark :)
You may take a bit of time to read this small book:
https://cloudcache.net/resume/PDDWS2-V2.pdf
regards
On Wed, Jun 22, 2022 at 3:17 AM Sid wrote:
> Hi Team,
>
> I have a few doubts about the below questions:
>
> 1) data frame will reside where? memory?
Dear Sid,
You are asking questions whose answers exist on the Apache Spark website, in
books, in MOOCs, or at other URLs.
For example, take a look at this one:
https://sparkbyexamples.com/spark/spark-dataframe-cache-and-persist-explained/
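Worth noting alongside that article: a DataFrame is a lazy description of a
computation, so by default nothing is materialized anywhere until an action
runs; cache()/persist() only change what happens after that first
materialization. A plain-Python analogy for the laziness (a generator; this
is not Spark API):

```python
evaluated = []

def load_rows():
    """Stand-in for reading a source: records when rows are actually read."""
    for i in range(3):
        evaluated.append(i)   # side effect so we can observe evaluation
        yield i * 10

rows = load_rows()            # like a transformation: nothing has run yet
assert evaluated == []        # no data has been materialized
result = list(rows)           # like an action: the pipeline now executes
print(result)                 # [0, 10, 20]
```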
Hi Team,
I have a few doubts about the below questions:
1) Where does a data frame reside? In memory? On disk? How is memory
allocated for a data frame?
2) How do you configure each partition?
3) Is there any way to calculate the exact partitions needed to load a
specific file?
Thanks,
Sid
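For question 3, the partition count for a file scan can be estimated. The
sketch below is my plain-Python approximation of the split-size formula Spark
uses for file sources, governed by spark.sql.files.maxPartitionBytes and
spark.sql.files.openCostInBytes (the defaults shown are Spark's shipped
defaults). Treat it as an estimate only: the real planner slices and
bin-packs splits file by file.

```python
import math

def estimate_partitions(file_sizes,
                        max_partition_bytes=128 * 1024 * 1024,  # spark.sql.files.maxPartitionBytes
                        open_cost=4 * 1024 * 1024,              # spark.sql.files.openCostInBytes
                        default_parallelism=8):
    """Approximate the number of input partitions for a file scan."""
    # Each file is padded with the open cost, then the split size is
    # min(maxPartitionBytes, max(openCost, totalBytes / parallelism)).
    total = sum(file_sizes) + open_cost * len(file_sizes)
    bytes_per_core = total / default_parallelism
    max_split = min(max_partition_bytes, max(open_cost, bytes_per_core))
    # Splits are bin-packed into partitions of ~max_split bytes each.
    return math.ceil(total / max_split)

# One 1 GiB file with the default settings:
print(estimate_partitions([1024 ** 3]))  # 9
```

With more cores available (e.g. default_parallelism=32), the split size
shrinks below 128 MiB so that every core gets work, which is why the same
file can produce more partitions on a bigger cluster.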