Yes, you will have to recreate the streaming Dataframe along with the static Dataframe, and restart the query. There isnt a currently feasible to do this without a query restart. But restarting a query WITHOUT restarting the whole application + spark cluster, is reasonably fast. If your applicatoin can tolerate 10 second latencies, then stopping and restarting a query within the same Spark application is a reasonable solution.
On Wed, May 3, 2017 at 4:13 PM, Lalwani, Jayesh < [email protected]> wrote: > Thanks, TD for answering this question on the Spark mailing list. > > > > A follow-up. So, let’s say we are joining a cached dataframe with a > streaming dataframe, and we recreate the cached dataframe, do we have to > recreate the streaming dataframe too? > > > > One possible solution that we have is > > > > val dfBlackList = spark.read.csv(….) //batch dataframe… assume that this > dataframe has a single column namedAccountName > dfBlackList.createOrReplaceTempView(“blacklist”) > val dfAccount = spark.readStream.kafka(…..) // assume for brevity’s sake > that we have parsed the kafka payload and have a data frame here with > multiple columns.. one of them called accountName > > dfAccount. createOrReplaceTempView(“account”) > > val dfBlackListedAccount = spark.sql(“select * from account inner join > blacklist on account.accountName = blacklist.accountName”) > > df.writeStream(…..).start() // boom started > > > > Now some time later while the query is running we do > > > > val dfRefreshedBlackList = spark.read.csv(….) > dfRefreshedBlackList.createOrReplaceTempView(“blacklist”) > > > > Now, will dfBlackListedAccount pick up the newly created blacklist? Or > will it continue to hold the reference to the old dataframe? What if we had > done RDD operations instead of using Spark SQL to join the dataframes? > > > > *From: *Tathagata Das <[email protected]> > *Date: *Wednesday, May 3, 2017 at 6:32 PM > *To: *"Lalwani, Jayesh" <[email protected]> > *Cc: *user <[email protected]> > *Subject: *Re: Refreshing a persisted RDD > > > > If you want to always get the latest data in files, its best to always > recreate the DataFrame. > > > > On Wed, May 3, 2017 at 7:30 AM, JayeshLalwani < > [email protected]> wrote: > > We have a Structured Streaming application that gets accounts from Kafka > into > a streaming data frame. We have a blacklist of accounts stored in S3 and we > want to filter out all the accounts that are blacklisted. So, we are > loading > the blacklisted accounts into a batch data frame and joining it with the > streaming data frame to filter out the bad accounts. > Now, the blacklist doesn't change very often.. once a week at max. SO, we > wanted to cache the blacklist data frame to prevent going out to S3 > everytime. Since, the blacklist might change, we want to be able to refresh > the cache at a cadence, without restarting the whole app. > So, to begin with we wrote a simple app that caches and refreshes a simple > data frame. The steps we followed are > /Create a CSV file > load CSV into a DF: df = spark.read.csv(filename) > Persist the data frame: df.persist > Now when we do df.show, we see the contents of the csv. > We change the CSV, and call df.show, we can see that the old contents are > being displayed, proving that the df is cached > df.unpersist > df.persist > df.show/ > > What we see is that the rows that were modified in the CSV are reloaded.. > But new rows aren't > Is this expected behavior? Is there a better way to refresh cached data > without restarting the Spark application? > > > > -- > View this message in context: http://apache-spark-user-list. > 1001560.n3.nabble.com/Refreshing-a-persisted-RDD-tp28642.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > --------------------------------------------------------------------- > To unsubscribe e-mail: [email protected] > > > > ------------------------------ > > The information contained in this e-mail is confidential and/or > proprietary to Capital One and/or its affiliates and may only be used > solely in performance of work or services for Capital One. The information > transmitted herewith is intended only for use by the individual or entity > to which it is addressed. If the reader of this message is not the intended > recipient, you are hereby notified that any review, retransmission, > dissemination, distribution, copying or other use of, or taking of any > action in reliance upon this information is strictly prohibited. If you > have received this communication in error, please contact the sender and > delete the material from your computer. >
