Thank you for your response, Sir. My understanding is that the final ```df3.count()``` is the only action in the code I have attached. In fact, I tried running the rest of the code (commenting out just the final df3.count()) and, as I expected, no computations were triggered.
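For what it's worth, here is a minimal sketch of that check. It uses SparkContext.statusTracker() as just one way of confirming that no job runs before the action; the path and schema are simply the ones from the example further down this thread:

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, StructField

spark = SparkSession.builder.getOrCreate()
tracker = spark.sparkContext.statusTracker()

schema = StructType([StructField('index', StringType(), True),
                     StructField('0', StringType(), True)])

# Transformations only -- nothing should execute yet (the explicit schema
# also avoids the small job Spark would otherwise run to infer one).
df1 = spark.read.csv("./df1.csv", header=True, schema=schema)
df2 = df1.groupby("index").agg(F.mean("0"))
df3 = df1.join(df2, on='index')

print(tracker.getJobIdsForGroup())  # expected: [] -- no jobs have run so far
df3.count()                         # the single action; jobs are submitted here
print(tracker.getJobIdsForGroup())  # now non-empty
```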
On Sun, 7 May, 2023, 20:16 Mich Talebzadeh, <mich.talebza...@gmail.com> wrote:

"...However, in my case here I am calling just one action..."

OK, which line in your code calls that one action?

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom

View my LinkedIn profile: https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Sun, 7 May 2023 at 14:13, Nitin Siwach <nitinsiw...@gmail.com> wrote:

@Vikas Kumar <vku...@etsy.com> I am sorry, but I thought that you had answered the other question I raised to the same email address yesterday. That one was about the SQL tab in the web UI and the output of .explain showing different plans.

I get how, using .cache, I can ensure that the data from a particular checkpoint is reused and the computations do not happen again.

However, in my case here I am calling just one action. Within the purview of one action, Spark should not rerun the overlapping parts of the DAG, and I do not understand why the file scan is happening several times. I can easily mitigate the issue by using window functions and creating all the columns in one go, without having to use several joins later on (a rough sketch of that rewrite is just below). That being said, this particular behavior is what I am trying to understand. The golden rule that "overlapping parts of the DAG won't run several times for one action" seems not to hold here. If you can shed some light on this matter, I would appreciate it.

@weiruanl...@gmail.com <weiruanl...@gmail.com> My datasets are very small, as you can see in the sample example that I create in the first part of the code.

Really appreciate you guys helping me out with this :)
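Concretely, something like the following is what I have in mind as the window-function alternative. This is only a minimal sketch; the column name "avg(0)" just mirrors what the groupBy-and-join version produces, and the read uses the same file and schema as the example at the bottom of this thread:

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import StructType, StringType, StructField

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField('index', StringType(), True),
                     StructField('0', StringType(), True)])
df1 = spark.read.csv("./df1.csv", header=True, schema=schema)

# Compute the per-index mean as a window aggregate, so the result becomes
# an extra column in one pass and does not have to be joined back onto df1.
w = Window.partitionBy("index")
df3_alt = df1.withColumn("avg(0)", F.mean("0").over(w))

df3_alt.count()
```

Since there is no self-join in this version, the plan only needs a single scan of df1; comparing df3_alt.explain() against the join version should make the difference visible.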
On Sun, May 7, 2023 at 12:23 PM Winston Lai <weiruanl...@gmail.com> wrote:

When your memory is not sufficient to keep the cached data for your jobs in two different stages, it might be read twice, because Spark might have to clear the previous cache for other jobs. In those cases a spill may be triggered when Spark writes your data from memory to disk.

One way to check is to read the Spark UI. When Spark caches the data, you will see a little green dot connected to the blue rectangle in the Spark UI. If you see this green dot twice on your two stages, Spark likely spilled the data after your first job and read it again in the second run. You can also confirm it with other metrics in the Spark UI.

That is my personal understanding based on what I have read and seen on my job runs. If there is any mistake, feel free to correct me.

Thank You & Best Regards
Winston Lai

------------------------------
From: Nitin Siwach <nitinsiw...@gmail.com>
Sent: Sunday, May 7, 2023 12:22:32 PM
To: Vikas Kumar <vku...@etsy.com>
Cc: User <user@spark.apache.org>
Subject: Re: Does spark read the same file twice, if two stages are using the same DataFrame?

Thank you tons, Vikas :). That makes so much sense now.

I'm in the learning phase and was just browsing through various concepts of Spark with small self-made examples.

It didn't make sense to me that the two physical plans should be different, but now I understand what you're saying.

Again, thank you for helping me out.

On Sun, 7 May, 2023, 07:48 Vikas Kumar, <vku...@etsy.com> wrote:

Spark came up with a plan, but that may or may not be the optimal plan given the system settings. If you do df1.cache(), I am guessing Spark will not read df1 twice (a sketch of this workaround is at the bottom of the thread).

By the way, why do you have adaptive execution set to false?

On Sat, May 6, 2023, 1:46 PM Nitin Siwach <nitinsiw...@gmail.com> wrote:

I hope this email finds you well :)

The following code reads the same CSV twice even though only one action is called.

End-to-end runnable example:
```
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.arange(1_000).reshape(-1, 1))
df1.index = np.random.choice(range(10), size=1000)
df1.to_csv("./df1.csv", index_label="index")

############################################################################

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, StructField

spark = SparkSession.builder.config("spark.sql.autoBroadcastJoinThreshold", "-1").\
    config("spark.sql.adaptive.enabled", "false").getOrCreate()

schema = StructType([StructField('index', StringType(), True),
                     StructField('0', StringType(), True)])

df1 = spark.read.csv("./df1.csv", header=True, schema=schema)

df2 = df1.groupby("index").agg(F.mean("0"))
df3 = df1.join(df2, on='index')

df3.explain()
df3.count()
```

The SQL tab in the web UI shows the following:

[image: screencapture-localhost-4040-SQL-execution-2023-05-06-19_48_41.png]

As you can see, the df1 file is read twice. Is this the expected behaviour? Why is that happening? I have just one action, so the same part of the pipeline should not run multiple times.

I have read the answer here: <https://stackoverflow.com/questions/37894099/does-spark-read-the-same-file-twice-if-two-stages-are-using-the-same-rdd>. The question is almost the same; it is just that that question uses RDDs while I am using DataFrames in the PySpark API. There it is suggested that if multiple file scans are to be avoided then the DataFrame API would help, and that this is the reason the DataFrame API exists in the first place.

However, as it turns out, I am facing the exact same issue with DataFrames as well. It seems rather odd for Spark, which is celebrated for its efficiency, to be this inefficient (most likely I am just missing something and that is not valid criticism :)).
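A sketch of the df1.cache() workaround suggested upthread, applied to this example. It assumes df1 fits comfortably in memory; whether the extra scan actually disappears still has to be confirmed in the SQL tab, since an under-sized cache can spill or be evicted as Winston describes:

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, StructField

spark = SparkSession.builder.config("spark.sql.autoBroadcastJoinThreshold", "-1").\
    config("spark.sql.adaptive.enabled", "false").getOrCreate()

schema = StructType([StructField('index', StringType(), True),
                     StructField('0', StringType(), True)])

# Cache df1 so that both branches of the join (df1 itself and the
# aggregate derived from it) can read the cached data instead of
# scanning the CSV file a second time.
df1 = spark.read.csv("./df1.csv", header=True, schema=schema).cache()

df2 = df1.groupby("index").agg(F.mean("0"))
df3 = df1.join(df2, on='index')

df3.explain()  # the cached read should show up as InMemoryTableScan
df3.count()

df1.unpersist()  # release the cached data when done
```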