Re: How to put an object in cache for ever in Streaming
That should also get cleaned up through the GC, though you may have to explicitly run the GC periodically for faster cleanup.

RDDs are by definition distributed across the executors in partitions. When an RDD is cached, its partitions are cached in memory across those executors, so the cached data is available cluster-wide rather than on a single executor.
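The suggestion above to "explicitly run GC periodically" can be sketched as a small driver-side helper. This is a hypothetical illustration (the `PeriodicGC` name and interface are invented, not part of Spark):

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}

// Hypothetical driver-side helper that nudges the JVM GC on a fixed schedule,
// so that RDDs, shuffle files, and broadcasts that are no longer referenced
// get cleaned up sooner than they would with on-demand GC alone.
object PeriodicGC {
  private val scheduler: ScheduledExecutorService =
    Executors.newSingleThreadScheduledExecutor()

  def start(intervalSeconds: Long): Unit =
    scheduler.scheduleAtFixedRate(
      new Runnable { override def run(): Unit = System.gc() },
      intervalSeconds, intervalSeconds, TimeUnit.SECONDS)

  def stop(): Unit = scheduler.shutdown()
}
```

Calling `PeriodicGC.start(600)` once on the driver would then trigger a GC every ten minutes until `stop()` is called.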
Re: How to put an object in cache for ever in Streaming
What about cleaning up the temp data that gets generated by shuffles? We have a lot of shuffle temp data accumulating in the /tmp folder; that's why we are using the ttl. Also, if I keep an RDD in cache, is it available across all the executors or just on the same executor?
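As an aside on the /tmp issue: the directory Spark uses for shuffle and spill files can usually be redirected with the `spark.local.dir` setting rather than relying on a ttl. A minimal sketch (the path and app name here are illustrative):

```scala
import org.apache.spark.SparkConf

// Sketch: point Spark's scratch space (shuffle output, spill files) at a
// larger disk instead of the default /tmp. The path is an example only.
val conf = new SparkConf()
  .setAppName("streaming-app")
  .set("spark.local.dir", "/data/spark-scratch")
```

Note that on YARN this location is instead controlled by the cluster's local directory configuration.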
Re: How to put an object in cache for ever in Streaming
Setting a ttl is not recommended any more, as Spark works with the Java GC to clean up anything (RDDs, shuffles, broadcasts, etc.) that is no longer referenced.

So you can keep an RDD cached in Spark, and every minute unpersist the previous one and cache a new one.

TD
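The cache-and-swap pattern described above might look roughly like this inside a streaming job. This is a sketch, not a definitive implementation: `buildLookupRdd` is a hypothetical function that produces the refreshed data each interval.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Sketch: swap in a freshly cached RDD each interval and unpersist the old
// one, relying on normal GC (not a ttl) to reclaim anything unreferenced.
def refresh[T](sc: SparkContext,
               previous: Option[RDD[T]],
               buildLookupRdd: SparkContext => RDD[T]): RDD[T] = {
  val next = buildLookupRdd(sc).persist(StorageLevel.MEMORY_ONLY)
  next.count()                                    // materialize the new cache
  previous.foreach(_.unpersist(blocking = false)) // then drop the old copy
  next
}
```

Materializing the new RDD before unpersisting the old one avoids a window in which neither copy is cached.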
How to put an object in cache for ever in Streaming
Hi,

How do I put a changing object in cache forever in Streaming? I know that we can do rdd.cache, but I think the cache would be cleaned up if we set a ttl in Streaming. Our requirement is to keep an object in memory; the object would be updated every minute based on the records that we get in our Streaming job.

Currently I am keeping that in updateStateByKey. But my updateStateByKey is tracking the realtime session information as well, so my updateStateByKey state has 4 fields that track session information plus this object that tracks the performance info separately. I was thinking it may be too much to keep so much data in updateStateByKey.

Is it recommended to hold a lot of data using updateStateByKey?

Thanks,
Swetha

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-put-an-object-in-cache-for-ever-in-Streaming-tp25098.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
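For reference, an updateStateByKey state function has the shape `(Seq[V], Option[S]) => Option[S]`. A minimal sketch of the session-tracking half, with invented field names (the original post doesn't show its state class), kept pure so it can be unit-tested without a SparkContext:

```scala
// Hypothetical session state; field names are illustrative only.
case class SessionState(events: Long, lastSeenMs: Long)

// Pure update function of the shape updateStateByKey expects:
// (Seq[V], Option[S]) => Option[S]. Values here are event timestamps (ms).
def updateSession(newEvents: Seq[Long],
                  state: Option[SessionState]): Option[SessionState] = {
  if (newEvents.isEmpty) state // no new data this batch; keep existing state
  else {
    val prev = state.getOrElse(SessionState(0L, 0L))
    Some(SessionState(prev.events + newEvents.size,
                      math.max(prev.lastSeenMs, newEvents.max)))
  }
}
```

Wiring it in is then `stream.updateStateByKey(updateSession _)`; whether to also carry a separate performance object inside the same state, as the post describes, is the design question the thread is debating.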