Re: Spark Doubts

2022-06-25 Thread russell.spitzer
Code is always distributed for any operation on a DataFrame or RDD. The size 
of your code is irrelevant except with respect to JVM memory limits. For most 
jobs the entire application jar and all of its dependencies are put on the 
classpath of every executor. 

There are some exceptions, but generally you should think of all data 
processing as occurring on executor JVMs.



Re: Spark Doubts

2022-06-25 Thread Sid
Hi Tufan,

Thanks for the answers. However, by the second point I meant to ask where my
code would reside. Will it be copied to all the executors, since the code
size would be small, or will it be kept on the driver's side? I know that
the driver converts the code to a DAG, and when an action is called it is
submitted to the DAG scheduler, and so on...

Thanks,
Sid



Re: Spark Doubts

2022-06-25 Thread Tufan Rakshit
Please find the answers inline.
1) Can I apply predicate pushdown filters if I have data stored in S3, or
should they be used only while reading from DBs?
It can be applied in S3 if you store the data in Parquet, CSV, JSON, or Avro
format. It does not depend on the DB; it is supported in object stores like
S3 as well.
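
For illustration, a minimal sketch (the bucket and column names are made
up); the pushed filter shows up in the physical plan:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()

df = spark.read.parquet("s3a://my-bucket/events/")  # hypothetical path
filtered = df.filter(df["year"] == 2022)

# For Parquet the filter is pushed down to the file scan; look for
# "PushedFilters" in the plan output:
filtered.explain()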

2) While running the data in distributed form, is my code copied to each
and every executor? I assume it should be, since code.zip would be small
enough to copy to each worker node.
If you are trying to join two datasets, one of which is small, Spark by
default will try to broadcast the smaller dataset to the executors rather
than going for a sort-merge join. There is a property which is enabled by
default from Spark 3.1; the limit for the smaller DataFrame to be broadcast
is 10 MB, and it can be changed to a higher value with a config setting.
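
A minimal sketch of both the threshold and the explicit hint, with
hypothetical tables (the 50 MB value is arbitrary):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (SparkSession.builder
         .appName("broadcast-join-demo")
         # default is 10 MB (10485760 bytes); set to -1 to disable auto-broadcast
         .config("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)
         .getOrCreate())

big = spark.range(10_000_000).withColumnRenamed("id", "key")
small = spark.range(1_000).withColumnRenamed("id", "key")

# An explicit hint also forces a broadcast, regardless of the threshold:
joined = big.join(broadcast(small), "key")
joined.explain()  # shows BroadcastHashJoin rather than SortMergeJoin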

3) Also, my understanding of data shuffling is: "It is moving one partition
to another partition, or moving the data (keys) of one partition to another
partition of those keys. It increases memory usage, since before shuffling
it copies the data in memory and then transfers it to another partition."
Is it correct? If not, please correct me.

It depends on the context of distributed computing: your data does not sit
on one machine, nor on one disk. A shuffle is involved when you trigger
operations like group by or sort, because all rows with the same key must
be brought to one executor to do the computation. Likewise, when a
sort-merge join is triggered, both datasets are sorted, and this is a
global sort, not a partition-wise sort. Yes, it is a memory-intensive
operation; if a lot of shuffle is involved, it is best to use SSDs
(M5d-based machines in AWS). For really big jobs, where TBs worth of data
have to be joined, it is not possible to do all of the operations in RAM.
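
To see where the shuffle appears, a small sketch (the key column is made up):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

df = spark.range(1_000_000).withColumn("key", F.col("id") % 100)

# groupBy must co-locate all rows with the same key, which requires a
# shuffle; the physical plan shows this as an Exchange:
df.groupBy("key").count().explain()

# The number of post-shuffle partitions (default 200) is configurable:
spark.conf.set("spark.sql.shuffle.partitions", "400")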


Hope that helps.

Best
Tufan





Spark Doubts

2022-06-25 Thread Sid
Hi Team,

I have various doubts as below:

1) Can I apply predicate pushdown filters if I have data stored in S3, or
should they be used only while reading from DBs?

2) While running the data in distributed form, is my code copied to each
and every executor? I assume it should be, since code.zip would be small
enough to copy to each worker node.

3) Also, my understanding of data shuffling is: "It is moving one partition
to another partition, or moving the data (keys) of one partition to another
partition of those keys. It increases memory usage, since before shuffling
it copies the data in memory and then transfers it to another partition."
Is it correct? If not, please correct me.

Please help me understand these things in layman's terms if my assumptions
are not correct.

Thanks,
Sid


Re: Spark Doubts

2022-06-22 Thread Sid
Hi,

Thanks for your answers. Much appreciated.

I know that we can cache a data frame in memory or on disk, but I want to
understand when the data frame is loaded initially and where it resides by
default.


Thanks,
Sid



Re: Spark Doubts

2022-06-21 Thread Yong Walt
These are the basic concepts in Spark :)
You may take a bit of time to read this small book:
https://cloudcache.net/resume/PDDWS2-V2.pdf

regards




Re: Spark Doubts

2022-06-21 Thread Apostolos N. Papadopoulos

Dear Sid,

You are asking questions whose answers exist on the Apache Spark website,
in books, in MOOCs, and at other URLs.

For example, take a look at these:

https://sparkbyexamples.com/spark/spark-dataframe-cache-and-persist-explained/

https://spark.apache.org/docs/latest/sql-programming-guide.html
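
On question 1 specifically: a DataFrame is just a lazy plan and is not
materialized anywhere by default; data flows through executor memory while
an action runs, and it is kept around only if you persist it. A minimal
sketch (the input path is a placeholder):

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

df = spark.read.parquet("/path/to/data")  # placeholder path; nothing is read yet
df.persist(StorageLevel.MEMORY_AND_DISK)  # df.cache() is shorthand for this level
df.count()      # persistence is lazy; an action materializes the cache
df.unpersist()  # free the cached blocks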

What do you mean by question 2?

About question 3, it depends on how you load the file. For example, if you
have a text file in HDFS and you want to use an RDD, then initially the
number of partitions equals the number of HDFS blocks, unless you specify
the number of partitions when you create the RDD from the file.
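
For instance (a sketch; the HDFS path is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.textFile("hdfs:///data/big.txt")  # defaults to one partition per HDFS block
print(rdd.getNumPartitions())

# Ask for at least 64 partitions instead:
rdd64 = sc.textFile("hdfs:///data/big.txt", minPartitions=64)
print(rdd64.getNumPartitions())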

I would suggest first going through a book devoted to Spark, like The
Definitive Guide, or any other similar resource. I would also suggest
taking a MOOC on Spark (e.g., on Coursera, edX, etc.).


All the best,

Apostolos




--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papad...@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark Doubts

2022-06-21 Thread Sid
Hi Team,

I have a few doubts about the below questions:

1) Where will a data frame reside? In memory? On disk? How is memory
allocated for a data frame?
2) How do you configure each partition?
3) Is there any way to calculate the exact number of partitions needed to
load a specific file?

Thanks,
Sid