Cf. also https://spark.apache.org/docs/latest/job-scheduling.html
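
The pattern discussed below — two independent actions submitted from separate threads against one shared, cached DataFrame — could be sketched roughly as follows. This is only a sketch: the input path, column names (`colA`, `colB`, `value`), and output paths are placeholders, and it assumes `spark.scheduler.mode=FAIR` so the concurrent jobs share executors instead of queuing FIFO.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("concurrent-actions")
  .config("spark.scheduler.mode", "FAIR") // let concurrent jobs share resources
  .getOrCreate()

val df: DataFrame = spark.read.parquet("/path/to/input") // placeholder input
df.cache()
df.count() // materialize the cache once, before the threads race to fill it

// Each Future submits an independent Spark job. The DataFrame itself is an
// immutable query plan, so sharing it read-only across threads is safe; the
// operations to avoid issuing concurrently are the stateful ones mentioned
// in the thread (cache/unpersist/checkpoint).
val job1 = Future {
  df.groupBy("colA").agg(count("*")).write.parquet("/out/agg1")
}
val job2 = Future {
  df.groupBy("colB").agg(sum("value")).write.parquet("/out/agg2")
}

Await.result(Future.sequence(Seq(job1, job2)), Duration.Inf)
df.unpersist()
```
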

> On 12 Feb 2017, at 11:30, Jörn Franke <jornfra...@gmail.com> wrote:
> 
> I think you should have a look at the Spark documentation. It has something 
> called the scheduler, which does exactly this. In more sophisticated 
> environments, YARN or Mesos do this for you.
> 
> Using threads for transformations does not make sense. 
> 
>> On 12 Feb 2017, at 09:50, Mendelson, Assaf <assaf.mendel...@rsa.com> wrote:
>> 
>> I know Spark takes care of executing everything in a distributed manner; 
>> however, Spark also supports having multiple threads on the same Spark 
>> session/context and knows (through the fair scheduler) to distribute the 
>> tasks from them in round-robin fashion.
>>  
>> The question is, can those two actions (with a different set of 
>> transformations) be applied to the SAME dataframe.
>>  
>> Let’s say I want to do something like:
>>  
>>  
>>  
>> val df = ???
>> df.cache()
>> df.count()
>>  
>> def f1(df: DataFrame): Unit = {
>>   val df1 = df.groupBy(something).agg(some aggs)
>>   df1.write.parquet("some path")
>> }
>>  
>> def f2(df: DataFrame): Unit = {
>>   val df2 = df.groupBy(something else).agg(some different aggs)
>>   df2.write.parquet("some path 2")
>> }
>>  
>> f1(df)
>> f2(df)
>>  
>> df.unpersist()
>>  
>> If the aggregations do not use the full cluster (e.g. because of data 
>> skew, because there aren't enough partitions, or for any other reason), 
>> this would leave the cluster underutilized.
>>  
>> However, if I called f1 and f2 on different threads, then f2 could use the 
>> free resources f1 has not consumed, and overall utilization would improve.
>>  
>> Of course, I can do this only if the operations on the dataframe are thread 
>> safe. For example, if I cached in f1 and unpersisted in f2, I would get an 
>> inconsistent result. So my question is: what operations, if any, are legal 
>> to use on a dataframe so that I could do the above?
>>  
>> Thanks,
>>                 Assaf.
>>  
>> From: Jörn Franke [mailto:jornfra...@gmail.com] 
>> Sent: Sunday, February 12, 2017 10:39 AM
>> To: Mendelson, Assaf
>> Cc: user
>> Subject: Re: is dataframe thread safe?
>>  
>> I am not sure what you are trying to achieve here. Spark takes care of 
>> executing the transformations in a distributed fashion. This means you must 
>> not use threads - it does not make sense. Hence you will not find 
>> documentation about it.
>> 
>> On 12 Feb 2017, at 09:06, Mendelson, Assaf <assaf.mendel...@rsa.com> wrote:
>> 
>> Hi,
>> I was wondering whether a dataframe is considered thread safe. I know the 
>> Spark session and Spark context are thread safe (and actually have tools to 
>> manage jobs from different threads), but the question is: can I use the 
>> same dataframe from multiple threads?
>> The idea would be to create a dataframe in the main thread and then, in two 
>> sub-threads, do different transformations and actions on it.
>> I understand that some things might not be thread safe (e.g. if I unpersist 
>> in one thread, it would affect the other; checkpointing would cause similar 
>> issues); however, I can’t find any documentation as to what operations (if 
>> any) are thread safe.
>>  
>> Thanks,
>>                 Assaf.
>>  
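
For the fair-scheduler route the linked documentation describes, each thread can additionally tag its jobs with a scheduler pool via a thread-local property. A minimal sketch, assuming a pool named "pool1" has been defined in fairscheduler.xml:

```scala
// Inside the worker thread, before triggering the action:
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
df.groupBy("colA").agg(count("*")).write.parquet("/out/agg1") // placeholder names

// Clear the property afterwards so later jobs submitted from this thread
// fall back to the default pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", null)
```

The property is thread-local, which is exactly what makes this work with the multi-threaded pattern above: each thread's jobs land in its own pool and are scheduled fairly against the others.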
