You can use HyperLogLog with Spark Streaming to accomplish this. Here is an example from my fluxcapacitor GitHub repo:
https://github.com/fluxcapacitor/pipeline/tree/master/myapps/spark/streaming/src/main/scala/com/advancedspark/streaming/rating/approx

Here's an accompanying SlideShare presentation from one of my recent meetups (slides 70-83):
http://www.slideshare.net/cfregly/spark-after-dark-20-apache-big-data-conf-vancouver-may-11-2016-61970037

And a YouTube video for those that prefer video (starting at 32 mins into the video for your convenience):
https://youtu.be/wM9Z0PLx3cw?t=1922

On Tue, May 17, 2016 at 12:17 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Ok but how about something similar to
>
> val countByValueAndWindow = price.filter(_ > 95.0).countByValueAndWindow(Seconds(windowLength), Seconds(slidingInterval))
>
> Using a new count => countDistinctByValueAndWindow?
>
> val countDistinctByValueAndWindow = price.filter(_ > 95.0).countDistinctByValueAndWindow(Seconds(windowLength), Seconds(slidingInterval))
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 17 May 2016 at 20:02, Michael Armbrust <mich...@databricks.com> wrote:
>
>> In 2.0 you won't be able to do this. The long term vision would be to
>> make this possible, but a window will be required (like the 24 hours you
>> suggest).
>>
>> On Tue, May 17, 2016 at 1:36 AM, Todd <bit1...@163.com> wrote:
>>
>>> Hi,
>>> We have a requirement to do count(distinct) in a processing batch
>>> against all the streaming data (e.g., last 24 hours' data), that is, when we do
>>> count(distinct), we actually want to compute distinct against last 24 hours'
>>> data.
>>> Does structured streaming support this scenario? Thanks!

--
*Chris Fregly*
Research Scientist @ Flux Capacitor AI
"Bringing AI Back to the Future!"
San Francisco, CA
http://fluxcapacitor.com
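
P.S. For anyone who wants the gist of the HyperLogLog idea without opening the repo, here is a minimal, dependency-free sketch in plain Scala. It is not the code from the repo linked above, and the names `Hll` and `WindowedDistinct` are made up for illustration. It assumes string keys, a 32-bit MurmurHash3 hash, and 2^12 registers (roughly 1.6% standard error). The point is that each batch is summarized by a small sketch, and sketches merge with a register-wise max, so a windowed distinct count never needs the raw values of the whole window:

```scala
import scala.util.hashing.MurmurHash3

// Illustrative HyperLogLog sketch (hypothetical class, not Algebird or stream-lib).
// registers.length must be a power of two, m = 2^p, with p >= 7 for the alpha formula below.
final case class Hll(registers: Vector[Byte]) {
  private def m: Int = registers.length

  // Record one item: the top p hash bits pick a register, the remaining bits give the rank.
  def add(item: String): Hll = {
    val p = Integer.numberOfTrailingZeros(m)
    val h = MurmurHash3.stringHash(item)
    val idx = h >>> (32 - p)
    val rest = h << p
    val rank = math.min(Integer.numberOfLeadingZeros(rest) + 1, 32 - p + 1)
    if (rank > registers(idx)) copy(registers.updated(idx, rank.toByte)) else this
  }

  // Merging two sketches is a register-wise max; it is associative and commutative,
  // which is what makes it usable as a reduce function over a window.
  def merge(other: Hll): Hll = {
    require(m == other.registers.length, "sketches must use the same number of registers")
    Hll(registers.zip(other.registers).map { case (a, b) => if (a >= b) a else b })
  }

  // Raw HyperLogLog estimate, with the usual linear-counting correction for small cardinalities.
  def estimate: Double = {
    val alpha = 0.7213 / (1.0 + 1.079 / m)
    val raw = alpha * m * m / registers.map(r => math.pow(2.0, -r.toDouble)).sum
    val zeros = registers.count(_ == 0)
    if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else raw
  }
}

object Hll {
  def empty(p: Int = 12): Hll = Hll(Vector.fill(1 << p)(0.toByte))
  def of(items: Iterable[String], p: Int = 12): Hll = items.foldLeft(empty(p))(_.add(_))
}

object WindowedDistinct {
  // One sketch per batch, merged across all batches that fall inside the window.
  def estimate(batches: Seq[Iterable[String]], p: Int = 12): Double =
    batches.map(Hll.of(_, p)).foldLeft(Hll.empty(p))(_.merge(_)).estimate
}

// Untested outline of how this could plug into a DStream[Double] named `price`
// (building one sketch per record is wasteful; a real job would sketch per partition):
//   price.filter(_ > 95.0)
//        .map(v => Hll.empty().add(v.toString))
//        .reduceByWindow(_.merge(_), Seconds(windowLength), Seconds(slidingInterval))
//        .map(_.estimate)
```

In practice a tested implementation such as Algebird's HyperLogLog monoid is a better choice than hand-rolling one; the sketch above is only meant to show why the merge step fits a sliding window.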