There are a few different ways to apply approximation algorithms and probabilistic data structures to your Spark data - including Spark's countApproxDistinct() methods, as you pointed out.
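For context, countApproxDistinct() is backed by a HyperLogLog-style estimator under the hood. A toy, plain-Scala sketch of the core idea - purely illustrative, and nothing like Spark's actual implementation - might look like this:

```scala
import scala.util.hashing.MurmurHash3

// Toy HyperLogLog: 2^b registers, each recording the longest run of
// leading zero bits seen among hashed items routed to that register.
class ToyHLL(b: Int = 12) {
  private val m = 1 << b
  private val registers = new Array[Int](m)

  def add(item: String): Unit = {
    val h = MurmurHash3.stringHash(item)
    val idx = h >>> (32 - b)                 // top b bits pick a register
    val rest = (h << b) | (1 << (b - 1))     // guard bit caps the zero run
    val rank = Integer.numberOfLeadingZeros(rest) + 1
    if (rank > registers(idx)) registers(idx) = rank
  }

  def estimate: Double = {
    val alpha = 0.7213 / (1 + 1.079 / m)     // standard bias-correction constant
    val raw = alpha * m * m / registers.map(r => math.pow(2.0, -r)).sum
    val zeros = registers.count(_ == 0)
    // linear-counting correction for small cardinalities
    if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else raw
  }
}

val hll = new ToyHLL()
(1 to 1000).foreach(i => hll.add(s"user_$i"))
(1 to 1000).foreach(i => hll.add(s"user_$i")) // re-adding duplicates changes nothing
```

The key property for your use case: state is a fixed-size register array regardless of how many items you add, and duplicates never inflate the count - which is exactly why these structures merge cheaply across partitions.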
There's also Twitter Algebird, and Redis HyperLogLog (PFCOUNT, PFADD). Here are some examples from my *pipeline GitHub project* <https://github.com/fluxcapacitor/pipeline/wiki> that demonstrate how to use these in a streaming context, if that's of interest:

*1) Algebird CountMin Sketch*
https://github.com/fluxcapacitor/pipeline/blob/master/myapps/streaming/src/main/scala/com/advancedspark/streaming/rating/approx/AlgebirdCountMinSketchTopK.scala

*2) Algebird HyperLogLog*
https://github.com/fluxcapacitor/pipeline/blob/master/myapps/streaming/src/main/scala/com/advancedspark/streaming/rating/approx/AlgebirdHyperLogLog.scala

*3) Redis HyperLogLog*
https://github.com/fluxcapacitor/pipeline/blob/master/myapps/streaming/src/main/scala/com/advancedspark/streaming/rating/approx/RedisHyperLogLog.scala

In addition, my *Global Advanced Apache Spark Meetup* (based in SF) has an entire evening dedicated to this exact topic next month:
http://www.meetup.com/Advanced-Apache-Spark-Meetup/events/226122226/

Video, slides, and live-streaming URLs will be available the day of the meetup.

On Mon, Dec 14, 2015 at 12:20 PM, Krishna Rao <krishnanj...@gmail.com> wrote:

> Thanks for the response Jörn. So to elaborate, I have a large dataset with
> userIds, each tagged with a property, e.g.:
>
> user_1 prop1=X
> user_2 prop1=Y prop2=A
> user_3 prop2=B
>
> I would like to be able to get the number of distinct users that have a
> particular property (or combination of properties). The cardinality of each
> property is in the 1000s and will only grow, as will the number of
> properties. I'm happy to use approximate values, trading accuracy for
> performance.
>
> Spark's performance when doing this via spark-shell is more than excellent
> using the "countApproxDistinct" method on a "javaRDD". However, I've no
> idea what's the best way to run such a query programmatically, like I
> can do manually via spark-shell.
>
> Hope this clarifies things.
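As a footnote to item 1 in the list above: the linked example uses Algebird, but the underlying Count-Min Sketch idea fits in a few lines of plain Scala. A toy version (illustrative only - this is not Algebird's API) might look like:

```scala
import scala.util.hashing.MurmurHash3

// Toy Count-Min Sketch: d rows of w counters. Each item hashes to one
// counter per row; the estimate is the minimum across rows, so it can
// over-count (hash collisions) but never under-count.
class ToyCMS(w: Int = 1024, d: Int = 5) {
  private val table = Array.ofDim[Long](d, w)

  private def bucket(item: String, row: Int): Int =
    (MurmurHash3.stringHash(item, row) & Int.MaxValue) % w  // row index doubles as hash seed

  def add(item: String, count: Long = 1L): Unit =
    (0 until d).foreach(row => table(row)(bucket(item, row)) += count)

  def estimate(item: String): Long =
    (0 until d).map(row => table(row)(bucket(item, row))).min
}

val cms = new ToyCMS()
(1 to 100).foreach(_ => cms.add("prop1=X")) // a heavy hitter
cms.add("prop2=B", 5)
```

The one-sided error (estimates are always >= the true count) is what makes Count-Min Sketch a good fit for top-K / heavy-hitter queries over a stream, and like HyperLogLog it uses fixed memory no matter how many events flow through.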
> On 14 December 2015 at 17:04, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Can you elaborate a little bit more on the use case? It looks a little
>> bit like an abuse of Spark in general. Interactive queries that are not
>> suitable for in-memory batch processing might be better supported by Ignite,
>> which has in-memory indexes, a concept of hot, warm, and cold data, etc., or
>> by Hive on Tez + LLAP.
>>
>> > On 14 Dec 2015, at 17:19, Krishna Rao <krishnanj...@gmail.com> wrote:
>> >
>> > Hi all,
>> >
>> > What's the best way to run ad-hoc queries against cached RDDs?
>> >
>> > For example, say I have an RDD that has been processed, and persisted
>> > to memory only. I want to be able to run a count (actually
>> > "countApproxDistinct") after filtering by a value that is unknown at
>> > compile time (specified by the query).
>> >
>> > I've experimented with using (abusing) Spark Streaming, by streaming
>> > queries and running these against the cached RDD. However, as I say, I
>> > don't think that this is an intended use case of Streaming.
>> >
>> > Cheers,
>> >
>> > Krishna

-- 
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com