Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-19 Thread Varun Shah
Hi @Mich Talebzadeh, community,

Where can I find such insights on the Spark architecture?

I found a few sites below which cover the internals:
1. https://github.com/JerryLead/SparkInternals
2. https://books.japila.pl/apache-spark-internals/overview/
3. https://stackoverflow.com/questions/30691385/how-spark-works-internally

Most of them are quite old, and hoping the basic internals have not changed,
where can we find more information on the internals? Asking in case you or
someone from the community has more articles / videos / document links to
share.

Appreciate your help.


Regards,
Varun Shah



On Fri, Mar 15, 2024, 03:10 Mich Talebzadeh 
wrote:

> Hi,
>
> When you create a DataFrame from Python objects using
> spark.createDataFrame, here it goes:
>
>
> *Initial Local Creation:*
> The DataFrame is initially created in the memory of the driver node. The
> data is not yet distributed to executors at this point.
>
> *The role of lazy Evaluation:*
>
> Spark applies lazy evaluation, *meaning transformations are not executed
> immediately*.  It constructs a logical plan describing the operations,
> but data movement does not occur yet.
>
> *Action Trigger:*
>
> When you initiate an action (things like show(), collect(), etc), Spark
> triggers the execution.
>
>
>
> *When partitioning and distribution come in:*
> Spark partitions the DataFrame into logical chunks for parallel
> processing. It divides the data based on a partitioning scheme (default is
> hash partitioning). Each partition is sent to different executor nodes for
> distributed execution. This stage involves data transfer across the
> cluster, but it is not the expensive shuffle you may have heard of.
> Shuffles happen during repartitioning or certain join operations.
>
> *Storage on Executors:*
>
> Executors receive their assigned partitions and store them in their
> memory. If memory is limited, Spark spills partitions to disk. Look at the
> Stages tab in the Spark UI (default port 4040).
>
>
> *In summary:*
> No Data Transfer During Creation: --> Data transfer occurs only when an
> action is triggered.
> Distributed Processing: --> DataFrames are distributed for parallel
> execution, not stored entirely on the driver node.
> Lazy Evaluation Optimization: --> Delaying data transfer until necessary
> enhances performance.
> Shuffle vs. Partitioning: --> Data movement during partitioning is not
> considered a shuffle in Spark terminology.
> Shuffles involve more complex data rearrangement.
>
> *Considerations: *
> Large DataFrames: For very large DataFrames
>
>- manage memory carefully to avoid out-of-memory errors. Consider
>options like:
>- Increasing executor memory
>- Using partitioning strategies to optimize memory usage
>- Employing techniques like checkpointing to persistent storage (hard
>disks) or caching for memory efficiency
>- You can get additional info from Spark UI default port 4040 tabs
>like SQL and executors
>- Spark uses Catalyst optimiser for efficient execution plans.
>df.explain("extended") shows both logical and physical plans
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions" (Werner von Braun).
>
>
> On Thu, 14 Mar 2024 at 19:46, Sreyan Chakravarty 
> wrote:
>
>> I am trying to understand Spark Architecture.
>>
>> For DataFrames that are created from Python objects, i.e. that are *created
>> in memory*, where are they stored?
>>
>> Take following example:
>>
>> from pyspark.sql import Row
>> import datetime
>>
>> courses = [
>>     {
>>         'course_id': 1,
>>         'course_title': 'Mastering Python',
>>         'course_published_dt': datetime.date(2021, 1, 14),
>>         'is_active': True,
>>         'last_updated_ts': datetime.datetime(2021, 2, 18, 16, 57, 25)
>>     }
>> ]
>>
>> courses_df = spark.createDataFrame([Row(**course) for course in courses])
>>
>>
>> Where is the dataframe stored when I invoke the call:
>>
>> courses_df = spark.createDataFrame([Row(**course) for course in courses])
>>
>> Does it:
>>
>>    1. Send the data to a random executor?
>>       - Does this mean this counts as a shuffle?
>>    2. Or does it stay on the driver node?
>>       - That does not make sense when the DataFrame grows large.
>>
>>
>> --
>> Regards,
>> Sreyan Chakravarty
>>
>


Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Sreyan Chakravarty
On Mon, Mar 18, 2024 at 1:16 PM Mich Talebzadeh 
wrote:

>
> "I may need something like that for synthetic data for testing. Any way to
> do that ?"
>
> Have a look at this.
>
> https://github.com/joke2k/faker
>

No, I was not actually referring to data that can be faked. I want the data
to actually reside on the storage or the executors.

Maybe this will be better tackled in a separate thread here:

https://lists.apache.org/thread/w6f7rq7m8fj6hzwpyhvvx3c42wbmkwdq

-- 
Regards,
Sreyan Chakravarty


Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Mich Talebzadeh
Yes, transformations are indeed executed on the worker nodes, but they are
only performed when necessary, usually when an action is called. This lazy
evaluation lets Spark optimize the execution plan, for example by pipelining
transformations and removing unnecessary computations.

"I may need something like that for synthetic data for testing. Any way to
do that ?"

Have a look at this.

https://github.com/joke2k/faker
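
For illustration, a rough sketch of two approaches (assuming Faker is
installed, an existing SparkSession named spark, and purely illustrative
column names):

from faker import Faker
from pyspark.sql import functions as F

fake = Faker()

# Option 1: build rows on the driver with Faker, then parallelise them
rows = [(i, fake.name(), fake.date_this_decade()) for i in range(10_000)]
df_small = spark.createDataFrame(rows, ["id", "name", "signup_dt"])

# Option 2: for genuinely large datasets, generate directly on the executors
# with spark.range, so the data never has to fit on the driver
df_large = (spark.range(0, 100_000_000)
            .withColumn("value", F.rand())
            .withColumn("bucket", F.col("id") % 100))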

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions" (Werner von Braun).


On Mon, 18 Mar 2024 at 07:16, Sreyan Chakravarty  wrote:

>
> On Fri, Mar 15, 2024 at 3:10 AM Mich Talebzadeh 
> wrote:
>
>>
>> No Data Transfer During Creation: --> Data transfer occurs only when an
>> action is triggered.
>> Distributed Processing: --> DataFrames are distributed for parallel
>> execution, not stored entirely on the driver node.
>> Lazy Evaluation Optimization: --> Delaying data transfer until necessary
>> enhances performance.
>> Shuffle vs. Partitioning: --> Data movement during partitioning is not
>> considered a shuffle in Spark terminology.
>> Shuffles involve more complex data rearrangement.
>>
>
> So just to be clear, the transformations are always executed on the worker
> nodes, but the data is not transferred until an action on the DataFrame is
> triggered.
>
> Am I correct?
>
> If so, then how do I generate a large dataset?
>
> I may need something like that for synthetic data for testing. Any way to
> do that?
>
>
> --
> Regards,
> Sreyan Chakravarty
>


Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Sreyan Chakravarty
On Fri, Mar 15, 2024 at 3:10 AM Mich Talebzadeh 
wrote:

>
> No Data Transfer During Creation: --> Data transfer occurs only when an
> action is triggered.
> Distributed Processing: --> DataFrames are distributed for parallel
> execution, not stored entirely on the driver node.
> Lazy Evaluation Optimization: --> Delaying data transfer until necessary
> enhances performance.
> Shuffle vs. Partitioning: --> Data movement during partitioning is not
> considered a shuffle in Spark terminology.
> Shuffles involve more complex data rearrangement.
>

So just to be clear, the transformations are always executed on the worker
nodes, but the data is not transferred until an action on the DataFrame is
triggered.

Am I correct?

If so, then how do I generate a large dataset?

I may need something like that for synthetic data for testing. Any way to
do that?


-- 
Regards,
Sreyan Chakravarty


Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-14 Thread Mich Talebzadeh
Hi,

When you create a DataFrame from Python objects using
spark.createDataFrame, here it goes:


*Initial Local Creation:*
The DataFrame is initially created in the memory of the driver node. The
data is not yet distributed to executors at this point.

*The role of lazy Evaluation:*

Spark applies lazy evaluation, *meaning transformations are not executed
immediately*.  It constructs a logical plan describing the operations, but
data movement does not occur yet.

*Action Trigger:*

When you initiate an action (things like show(), collect(), etc.), Spark
triggers the execution.
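
For illustration, a minimal sketch (assuming an existing SparkSession named
spark; the column names are made up):

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
filtered = df.filter(df.id > 1)  # transformation: only extends the logical plan, nothing runs yet
filtered.show()                  # action: this is what actually triggers execution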



*When partitioning and distribution come in:*
Spark partitions the DataFrame into logical chunks for parallel processing.
It divides the data based on a partitioning scheme (default is hash
partitioning). Each partition is sent to different executor nodes for
distributed execution. This stage involves data transfer across the
cluster, but it is not the expensive shuffle you may have heard of.
Shuffles happen during repartitioning or certain join operations.
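
As a rough illustration (a sketch using the courses_df from the question
below), you can inspect and change the partitioning yourself:

print(courses_df.rdd.getNumPartitions())   # number of partitions Spark chose for this DataFrame
repartitioned = courses_df.repartition(8)  # an explicit repartition like this one does involve a shuffle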

*Storage on Executors:*

Executors receive their assigned partitions and store them in their
memory. If memory is limited, Spark spills partitions to disk. Look at the
Stages tab in the Spark UI (default port 4040).
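
For example, a minimal sketch (again assuming the courses_df from the
question below) of keeping partitions in memory and letting Spark spill to
disk when needed:

from pyspark import StorageLevel

courses_df.persist(StorageLevel.MEMORY_AND_DISK)  # keep partitions in executor memory, spill to disk under pressure
courses_df.count()  # an action materialises the cached partitions on the executors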


*In summary:*
No Data Transfer During Creation: --> Data transfer occurs only when an
action is triggered.
Distributed Processing: --> DataFrames are distributed for parallel
execution, not stored entirely on the driver node.
Lazy Evaluation Optimization: --> Delaying data transfer until necessary
enhances performance.
Shuffle vs. Partitioning: --> Data movement during partitioning is not
considered a shuffle in Spark terminology.
Shuffles involve more complex data rearrangement.

*Considerations: *
Large DataFrames: For very large DataFrames

   - manage memory carefully to avoid out-of-memory errors. Consider
   options like:
   - Increasing executor memory
   - Using partitioning strategies to optimize memory usage
   - Employing techniques like checkpointing to persistent storage (hard
   disks) or caching for memory efficiency
   - You can get additional info from Spark UI default port 4040 tabs like
   SQL and executors
   - Spark uses the Catalyst optimiser for efficient execution plans;
   df.explain("extended") shows both logical and physical plans (see the
   sketch just after this list)

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions" (Werner von Braun).


On Thu, 14 Mar 2024 at 19:46, Sreyan Chakravarty  wrote:

> I am trying to understand Spark Architecture.
>
> For DataFrames that are created from Python objects, i.e. that are *created
> in memory*, where are they stored?
>
> Take following example:
>
> from pyspark.sql import Row
> import datetime
>
> courses = [
>     {
>         'course_id': 1,
>         'course_title': 'Mastering Python',
>         'course_published_dt': datetime.date(2021, 1, 14),
>         'is_active': True,
>         'last_updated_ts': datetime.datetime(2021, 2, 18, 16, 57, 25)
>     }
> ]
>
> courses_df = spark.createDataFrame([Row(**course) for course in courses])
>
>
> Where is the dataframe stored when I invoke the call:
>
> courses_df = spark.createDataFrame([Row(**course) for course in courses])
>
> Does it:
>
>    1. Send the data to a random executor?
>       - Does this mean this counts as a shuffle?
>    2. Or does it stay on the driver node?
>       - That does not make sense when the DataFrame grows large.
>
>
> --
> Regards,
> Sreyan Chakravarty
>


pyspark - Where are Dataframes created from Python objects stored?

2024-03-14 Thread Sreyan Chakravarty
I am trying to understand Spark Architecture.

For DataFrames that are created from Python objects, i.e. that are *created
in memory*, where are they stored?

Take following example:

from pyspark.sql import Row
import datetime

courses = [
    {
        'course_id': 1,
        'course_title': 'Mastering Python',
        'course_published_dt': datetime.date(2021, 1, 14),
        'is_active': True,
        'last_updated_ts': datetime.datetime(2021, 2, 18, 16, 57, 25)
    }
]

courses_df = spark.createDataFrame([Row(**course) for course in courses])


Where is the dataframe stored when I invoke the call:

courses_df = spark.createDataFrame([Row(**course) for course in courses])

Does it:

   1. Send the data to a random executor?
      - Does this mean this counts as a shuffle?
   2. Or does it stay on the driver node?
      - That does not make sense when the DataFrame grows large.


-- 
Regards,
Sreyan Chakravarty