Yes, iterating over a dataframe and making changes is not uncommon.
Of course RDDs, DataFrames and Datasets are immutable, but the optimizer can
potentially dampen the effect/impact of creating a new RDD, DataFrame or
Dataset on each pass.
Also, the use case you cited is similar to what is done in regression,
clustering and other iterative algorithms, i.e. you keep deriving a new
DataFrame/Dataset until the desired condition (e.g. convergence) is reached.
E.g. see the snippet below and the setting of the iteration ceiling (maxIter):

import org.apache.spark.ml.classification.LogisticRegression

// instantiate the base classifier with an iteration ceiling
val classifier = new LogisticRegression().setMaxIter(10)

Now the impact of that depends on a variety of things.
E.g. if the data is completely contained in memory and there is no spillover
to disk, it might not be a big issue (of course there will still be memory,
CPU and network overhead/latency).
If you are storing the data on disk (e.g. as part of a checkpoint or explicit
persistence), then there can be substantial I/O activity.
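
For concreteness, here is a minimal sketch of that loop-until-stable pattern.
The column name, update rule, stopping condition and checkpoint interval are
all illustrative assumptions, and it presumes a SparkSession named `spark` is
already in scope (e.g. in spark-shell):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// checkpoints go to disk, so this is where the I/O cost shows up
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

var df: DataFrame = spark.range(0, 1000).withColumn("value", rand())
var changed = true
var i = 0
val maxIter = 20  // iteration ceiling, as in the ML algorithms above

while (changed && i < maxIter) {
  // an "update" really builds a new immutable DataFrame from the old one
  val next = df.withColumn("value",
    when(col("value") < 0.5, col("value") * 2).otherwise(col("value")))
  changed = next.filter(col("value") < 0.5).count() > 0  // anything left?
  // checkpoint every few passes so the lineage/plan does not grow unboundedly
  df = if (i % 5 == 4) next.checkpoint() else next
  i += 1
}

How often to checkpoint is the trade-off between lineage growth and the disk
activity described above.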

From: Xi Shen
Date: Monday, October 17, 2016 at 2:54 AM
To: Divya Gehlot, Mungeol Heo
Cc: "user @spark"
Subject: Re: Is spark a right tool for updating a dataframe repeatedly

I think most of the "big data" tools, like Spark and Hive, are not designed to
edit data; they are designed to query it. I wonder in what scenario you need
to update a large volume of data repeatedly.

On Mon, Oct 17, 2016 at 2:00 PM Divya Gehlot wrote:
If my understanding of your query is correct: in Spark, DataFrames are
immutable; you can't update a DataFrame in place. You have to create a new
DataFrame to "update" the current one.
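
A one-line illustration of that point (the names are made up for the example,
and it assumes a `spark` session is in scope, as in spark-shell):

import org.apache.spark.sql.functions._

val df = spark.range(5).toDF("id")
val updated = df.withColumn("id", col("id") + 1)  // returns a NEW DataFrame
// df itself is unchanged; `updated` holds the modified result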


On 17 October 2016 at 09:50, Mungeol Heo wrote:
Hello, everyone.

As I mentioned in the title, I wonder whether Spark is the right tool for
updating a data frame repeatedly until there is no more data to update.

For example:

while (there was an update in the previous pass) {
  update data frame A
}

If it is the right tool, then what is the best practice for this kind of work?
Thank you.



David S.
