If you always want the latest data in the files, it's best to recreate the
DataFrame every time.
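
As a rough sketch of what "recreate" could look like on a cadence: the
holder below rebuilds its value from scratch once it is older than a given
age, rather than calling unpersist/persist on the same object. The names
RefreshingHolder, load, release, max_age_s and clock are all made up for
illustration and are not part of Spark. With Spark, load might be something
like `lambda: spark.read.csv(filename).persist()` and release
`lambda df: df.unpersist()`, but that wiring is an assumption and untested
here.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class RefreshingHolder(Generic[T]):
    """Hypothetical helper: rebuilds its value once it is older than max_age_s."""

    def __init__(
        self,
        load: Callable[[], T],
        max_age_s: float,
        clock: Callable[[], float] = time.monotonic,
        release: Optional[Callable[[T], None]] = None,
    ) -> None:
        self._load = load            # builds a brand-new value (e.g. a new DataFrame)
        self._max_age_s = max_age_s  # refresh cadence in seconds
        self._clock = clock          # injectable so the cadence can be tested
        self._release = release      # optional cleanup for the replaced value
        self._value: Optional[T] = None
        self._loaded_at: Optional[float] = None

    def get(self) -> T:
        now = self._clock()
        stale = self._loaded_at is None or now - self._loaded_at >= self._max_age_s
        if stale:
            had_old = self._loaded_at is not None
            old = self._value
            # Build a new object instead of reusing the old one.
            self._value = self._load()
            self._loaded_at = now
            # Release the old value only after the new one is in place.
            if had_old and self._release is not None:
                self._release(old)
        return self._value
```

Callers would go through get() each time they need the blacklist, so a
refresh happens on whichever call first sees the value as stale.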

On Wed, May 3, 2017 at 7:30 AM, JayeshLalwani <[email protected]> wrote:

> We have a Structured Streaming application that reads accounts from Kafka
> into a streaming data frame. We have a blacklist of accounts stored in S3,
> and we want to filter out all the blacklisted accounts. So we load the
> blacklisted accounts into a batch data frame and join it with the streaming
> data frame to filter out the bad accounts.
>
> The blacklist doesn't change very often, once a week at most. So we wanted
> to cache the blacklist data frame to avoid going out to S3 every time.
> Since the blacklist might change, we want to be able to refresh the cache
> at a cadence, without restarting the whole app.
>
> To begin with, we wrote a simple app that caches and refreshes a simple
> data frame. The steps we followed are:
> 1. Create a CSV file.
> 2. Load the CSV into a DF: df = spark.read.csv(filename)
> 3. Persist the data frame: df.persist
> 4. Call df.show; we see the contents of the CSV.
> 5. Change the CSV and call df.show again; the old contents are still
>    displayed, proving that the df is cached.
> 6. df.unpersist
> 7. df.persist
> 8. df.show
>
> What we see is that the rows that were modified in the CSV are reloaded,
> but new rows aren't.
> Is this expected behavior? Is there a better way to refresh cached data
> without restarting the Spark application?
