DataFrame is immutable, so it should be thread safe, right? On Sun, Feb 12, 2017 at 6:45 PM, Sean Owen <so...@cloudera.com> wrote:
> No this use case is perfectly sensible. Yes it is thread safe. > > > On Sun, Feb 12, 2017, 10:30 Jörn Franke <jornfra...@gmail.com> wrote: > >> I think you should have a look at the spark documentation. It has >> something called scheduler who does exactly this. In more sophisticated >> environments yarn or mesos do this for you. >> >> Using threads for transformations does not make sense. >> >> On 12 Feb 2017, at 09:50, Mendelson, Assaf <assaf.mendel...@rsa.com> >> wrote: >> >> I know spark takes care of executing everything in a distributed manner, >> however, spark also supports having multiple threads on the same spark >> session/context and knows (Through fair scheduler) to distribute the tasks >> from them in a round robin. >> >> >> >> The question is, can those two actions (with a different set of >> transformations) be applied to the SAME dataframe. >> >> >> >> Let’s say I want to do something like: >> >> >> >> >> >> >> >> Val df = ??? >> >> df.cache() >> >> df.count() >> >> >> >> def f1(df: DataFrame): Unit = { >> >> val df1 = df.groupby(something).agg(some aggs) >> >> df1.write.parquet(“some path”) >> >> } >> >> >> >> def f2(df: DataFrame): Unit = { >> >> val df2 = df.groupby(something else).agg(some different aggs) >> >> df2.write.parquet(“some path 2”) >> >> } >> >> >> >> f1(df) >> >> f2(df) >> >> >> >> df.unpersist() >> >> >> >> if the aggregations do not use the full cluster (e.g. because of data >> skewness, because there aren’t enough partitions or any other reason) then >> this would leave the cluster under utilized. >> >> >> >> However, if I would call f1 and f2 on different threads, then df2 can use >> free resources f1 has not consumed and the overall utilization would >> improve. >> >> >> >> Of course, I can do this only if the operations on the dataframe are >> thread safe. For example, if I would do a cache in f1 and an unpersist in >> f2 I would get an inconsistent result. So my question is, what, if any are >> the legal operations to use on a dataframe so I could do the above. >> >> >> >> Thanks, >> >> Assaf. >> >> >> >> *From:* Jörn Franke [mailto:jornfra...@gmail.com <jornfra...@gmail.com>] >> *Sent:* Sunday, February 12, 2017 10:39 AM >> *To:* Mendelson, Assaf >> *Cc:* user >> *Subject:* Re: is dataframe thread safe? >> >> >> >> I am not sure what you are trying to achieve here. Spark is taking care >> of executing the transformations in a distributed fashion. This means you >> must not use threads - it does not make sense. Hence, you do not find >> documentation about it. >> >> >> On 12 Feb 2017, at 09:06, Mendelson, Assaf <assaf.mendel...@rsa.com> >> wrote: >> >> Hi, >> >> I was wondering if dataframe is considered thread safe. I know the spark >> session and spark context are thread safe (and actually have tools to >> manage jobs from different threads) but the question is, can I use the same >> dataframe in both threads. >> >> The idea would be to create a dataframe in the main thread and then in >> two sub threads do different transformations and actions on it. >> >> I understand that some things might not be thread safe (e.g. if I >> unpersist in one thread it would affect the other. Checkpointing would >> cause similar issues), however, I can’t find any documentation as to what >> operations (if any) are thread safe. >> >> >> >> Thanks, >> >> Assaf. >> >> >> >>