Re: Refreshing a persisted RDD

Tathagata Das Wed, 03 May 2017 16:31:42 -0700

Yes, you will have to recreate the streaming Dataframe along with the
static Dataframe, and restart the query. There isnt a currently feasible to
do this without a query restart. But restarting a query WITHOUT restarting
the whole application + spark cluster, is reasonably fast. If your
applicatoin can tolerate 10 second latencies, then stopping and restarting
a query within the same Spark application is a reasonable solution.


On Wed, May 3, 2017 at 4:13 PM, Lalwani, Jayesh <
[email protected]> wrote:

> Thanks, TD for answering this question on the Spark mailing list.
>
>
>
> A follow-up. So, let’s say we are joining a cached dataframe with a
> streaming dataframe, and we recreate the cached dataframe, do we have to
> recreate the streaming dataframe too?
>
>
>
> One possible solution that we have is
>
>
>
> val dfBlackList = spark.read.csv(….) //batch dataframe… assume that this
> dataframe has a single column namedAccountName
> dfBlackList.createOrReplaceTempView(“blacklist”)
> val dfAccount = spark.readStream.kafka(…..) // assume for brevity’s sake
> that we have parsed the kafka payload and have a data frame here with
> multiple columns.. one of them called accountName
>
> dfAccount. createOrReplaceTempView(“account”)
>
> val dfBlackListedAccount = spark.sql(“select * from account inner join
> blacklist on account.accountName = blacklist.accountName”)
>
> df.writeStream(…..).start() // boom started
>
>
>
> Now some time later while the query is running we do
>
>
>
> val dfRefreshedBlackList = spark.read.csv(….)
> dfRefreshedBlackList.createOrReplaceTempView(“blacklist”)
>
>
>
> Now, will dfBlackListedAccount pick up the newly created blacklist? Or
> will it continue to hold the reference to the old dataframe? What if we had
> done RDD operations instead of using Spark SQL to join the dataframes?
>
>
>
> *From: *Tathagata Das <[email protected]>
> *Date: *Wednesday, May 3, 2017 at 6:32 PM
> *To: *"Lalwani, Jayesh" <[email protected]>
> *Cc: *user <[email protected]>
> *Subject: *Re: Refreshing a persisted RDD
>
>
>
> If you want to always get the latest data in files, its best to always
> recreate the DataFrame.
>
>
>
> On Wed, May 3, 2017 at 7:30 AM, JayeshLalwani <
> [email protected]> wrote:
>
> We have a Structured Streaming application that gets accounts from Kafka
> into
> a streaming data frame. We have a blacklist of accounts stored in S3 and we
> want to filter out all the accounts that are blacklisted. So, we are
> loading
> the blacklisted accounts into a batch data frame and joining it with the
> streaming data frame to filter out the bad accounts.
> Now, the blacklist doesn't change very often.. once a week at max. SO, we
> wanted to cache the blacklist data frame to prevent going out to S3
> everytime. Since, the blacklist might change, we want to be able to refresh
> the cache at a cadence, without restarting the whole app.
> So, to begin with we wrote a simple app that caches and refreshes a simple
> data frame. The steps we followed are
> /Create a CSV file
> load CSV into a DF: df = spark.read.csv(filename)
> Persist the data frame: df.persist
> Now when we do df.show, we see the contents of the csv.
> We change the CSV, and call df.show, we can see that the old contents are
> being displayed, proving that the df is cached
> df.unpersist
> df.persist
> df.show/
>
> What we see is that the rows that were modified in the CSV are reloaded..
> But new rows aren't
> Is this expected behavior? Is there a better way to refresh cached data
> without restarting the Spark application?
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Refreshing-a-persisted-RDD-tp28642.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: [email protected]
>
>
>
> ------------------------------
>
> The information contained in this e-mail is confidential and/or
> proprietary to Capital One and/or its affiliates and may only be used
> solely in performance of work or services for Capital One. The information
> transmitted herewith is intended only for use by the individual or entity
> to which it is addressed. If the reader of this message is not the intended
> recipient, you are hereby notified that any review, retransmission,
> dissemination, distribution, copying or other use of, or taking of any
> action in reliance upon this information is strictly prohibited. If you
> have received this communication in error, please contact the sender and
> delete the material from your computer.
>

Re: Refreshing a persisted RDD

Reply via email to