If you always want the latest data from the files, it's best to recreate the DataFrame each time rather than unpersisting and re-persisting the existing one.
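
A rough sketch of what that could look like in PySpark, assuming the blacklist is re-read at a fixed cadence. RefreshingFrame, load_blacklist and max_age_seconds are made-up names for illustration, not Spark APIs; only spark.read.csv, persist and unpersist are actual Spark calls.

```python
import time
from typing import Any, Callable, Optional


class RefreshingFrame:
    """Hypothetical helper: keeps a cached DataFrame and rebuilds it
    from scratch once it is older than max_age_seconds."""

    def __init__(self, load: Callable[[], Any], max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load                # builds a brand-new DataFrame
        self._max_age = max_age_seconds
        self._clock = clock              # injectable so the cadence can be tested
        self._frame: Optional[Any] = None
        self._loaded_at = 0.0

    def get(self) -> Any:
        now = self._clock()
        if self._frame is None or now - self._loaded_at >= self._max_age:
            old = self._frame
            # Recreate the DataFrame instead of calling unpersist/persist
            # on the old one, so the source is read again.
            self._frame = self._load()
            self._loaded_at = now
            # Release the previous cached copy, if there was one.
            if old is not None and hasattr(old, "unpersist"):
                old.unpersist()
        return self._frame


def load_blacklist(spark: Any, path: str) -> Any:
    # Assumed loader: read the CSV again and cache the new DataFrame.
    return spark.read.csv(path).persist()
```

Usage would be something like blacklist = RefreshingFrame(lambda: load_blacklist(spark, path), 3600), then calling blacklist.get() wherever the blacklist DataFrame is needed. The one-hour cadence and the path variable are placeholders.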

On Wed, May 3, 2017 at 7:30 AM, JayeshLalwani <[email protected]> wrote:

> We have a Structured Streaming application that gets accounts from Kafka into
> a streaming data frame. We have a blacklist of accounts stored in S3 and we
> want to filter out all the accounts that are blacklisted. So, we are loading
> the blacklisted accounts into a batch data frame and joining it with the
> streaming data frame to filter out the bad accounts.
>
> Now, the blacklist doesn't change very often.. once a week at max. SO, we
> wanted to cache the blacklist data frame to prevent going out to S3
> everytime. Since, the blacklist might change, we want to be able to refresh
> the cache at a cadence, without restarting the whole app.
>
> So, to begin with we wrote a simple app that caches and refreshes a simple
> data frame. The steps we followed are
>
> Create a CSV file
> load CSV into a DF: df = spark.read.csv(filename)
> Persist the data frame: df.persist
> Now when we do df.show, we see the contents of the csv.
> We change the CSV, and call df.show, we can see that the old contents are
> being displayed, proving that the df is cached
> df.unpersist
> df.persist
> df.show
>
> What we see is that the rows that were modified in the CSV are reloaded..
> But new rows aren't
>
> Is this expected behavior? Is there a better way to refresh cached data
> without restarting the Spark application?
