There are a few different ways to apply approximation algorithms and
probabilistic data structures to your Spark data, including Spark's
countApproxDistinct() methods, as you pointed out.
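
For a quick flavor of the Spark API, here's a minimal sketch of
countApproxDistinct() on a filtered RDD (the local SparkContext and the
(userId, property) pairs are made up for the example; relativeSD controls
the target estimation error):

import org.apache.spark.{SparkConf, SparkContext}

object CountApproxDistinctExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("approx-distinct").setMaster("local[*]"))

    // hypothetical (userId, property) pairs
    val events = sc.parallelize(Seq(
      ("user_1", "prop1=X"),
      ("user_2", "prop1=Y"),
      ("user_2", "prop2=A"),
      ("user_3", "prop2=B")))

    // approximate number of distinct users tagged with prop2=A;
    // 0.01 is the target relative standard deviation of the estimate
    val approxUsers = events
      .filter { case (_, prop) => prop == "prop2=A" }
      .map { case (userId, _) => userId }
      .countApproxDistinct(0.01)

    println(s"approx distinct users: $approxUsers")
    sc.stop()
  }
}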

There's also Twitter's Algebird and Redis HyperLogLog (PFCOUNT, PFADD).

Here are some examples from my *pipeline GitHub project*
<https://github.com/fluxcapacitor/pipeline/wiki> that demonstrate how to
use these in a streaming context, in case that's of interest:

*1) Algebird CountMin Sketch*

https://github.com/fluxcapacitor/pipeline/blob/master/myapps/streaming/src/main/scala/com/advancedspark/streaming/rating/approx/AlgebirdCountMinSketchTopK.scala
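
To give a rough feel for the Algebird CMS API itself (this is a standalone
sketch, not the code from the file above; the item IDs and parameters are
made up), frequency estimates and heavy hitters look roughly like:

import com.twitter.algebird.{TopCMS, TopPctCMS}

// eps bounds the per-query estimation error, delta the failure
// probability; heavyHittersPct is the fraction of the total count an
// item must exceed to be reported as a heavy hitter
val cmsMonoid = TopPctCMS.monoid[Long](0.001, 1e-10, 1, 0.01)

val itemIds = Seq(1L, 2L, 2L, 3L, 2L, 1L)

// build one sketch per item and merge them all with the monoid
val cms: TopCMS[Long] = itemIds.map(cmsMonoid.create).reduce(cmsMonoid.plus)

println(s"approx count of item 2: ${cms.frequency(2L).estimate}")
println(s"heavy hitters: ${cms.heavyHitters}")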

*2) Algebird HyperLogLog*

https://github.com/fluxcapacitor/pipeline/blob/master/myapps/streaming/src/main/scala/com/advancedspark/streaming/rating/approx/AlgebirdHyperLogLog.scala
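
Similarly, a bare-bones Algebird HyperLogLog sketch (again not the code
from the file above; the user IDs are invented) looks roughly like:

import com.twitter.algebird.{HLL, HyperLogLogMonoid}

// 12 bits -> 4096 registers, roughly 1-2% standard error
val hllMonoid = new HyperLogLogMonoid(bits = 12)

val userIds = Seq("user_1", "user_2", "user_2", "user_3")

// HLL instances are built from byte arrays and merged via the monoid
val hll: HLL = userIds
  .map(id => hllMonoid.create(id.getBytes("UTF-8")))
  .reduce(hllMonoid.plus)

println(s"approx distinct users: ${hll.approximateSize.estimate}")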

*3) Redis HyperLogLog*

https://github.com/fluxcapacitor/pipeline/blob/master/myapps/streaming/src/main/scala/com/advancedspark/streaming/rating/approx/RedisHyperLogLog.scala
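
The Redis version boils down to PFADD/PFCOUNT; here's a bare-bones sketch
using the Jedis client (the key names and the localhost Redis instance are
assumptions for the example):

import redis.clients.jedis.Jedis

val jedis = new Jedis("localhost", 6379)

// PFADD registers elements in the HyperLogLog stored at the given key
jedis.pfadd("users:prop2=A", "user_2", "user_42")
jedis.pfadd("users:prop1=X", "user_1")

// PFCOUNT returns the approximate cardinality (~0.81% standard error)
val approxA: Long = jedis.pfcount("users:prop2=A")
println(s"approx distinct users with prop2=A: $approxA")

// PFCOUNT over several keys estimates the cardinality of their union
val approxUnion: Long = jedis.pfcount("users:prop2=A", "users:prop1=X")
println(s"approx distinct users with either property: $approxUnion")

jedis.close()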

In addition, my *Global Advanced Apache Spark Meetup* (based in SF) has an
entire evening dedicated to this exact topic next month:

http://www.meetup.com/Advanced-Apache-Spark-Meetup/events/226122226/

Video, slides, and live-streaming URLs will be available on the day of
the meetup.


On Mon, Dec 14, 2015 at 12:20 PM, Krishna Rao <krishnanj...@gmail.com>
wrote:

> Thanks for the response, Jörn. So to elaborate, I have a large dataset with
> userIds, each tagged with a property, e.g.:
>
> user_1    prop1=X
> user_2    prop1=Y    prop2=A
> user_3    prop2=B
>
>
> I would like to be able to get the number of distinct users that have a
> particular property (or combination of properties). The cardinality of each
> property is in the 1000s and will only grow, as will the number of
> properties. I'm happy with approximate values to trade accuracy for
> performance.
>
> Spark's performance when doing this via spark-shell is more than excellent
> using the "countApproxDistinct" method on a "javaRDD". However, I've no
> idea of the best way to run such a query programmatically, like I can do
> manually via spark-shell.
>
> Hope this clarifies things.
>
>
> On 14 December 2015 at 17:04, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Can you elaborate a little bit more on the use case? It looks a little
>> bit like an abuse of Spark in general. Interactive queries that are not
>> suitable for in-memory batch processing might be better supported by
>> Ignite, which has in-memory indexes and a concept of hot, warm, and cold
>> data, or by Hive on Tez + LLAP.
>>
>> > On 14 Dec 2015, at 17:19, Krishna Rao <krishnanj...@gmail.com> wrote:
>> >
>> > Hi all,
>> >
>> > What's the best way to run ad-hoc queries against a cached RDD?
>> >
>> > For example, say I have an RDD that has been processed and persisted
>> > as memory-only. I want to be able to run a count (actually
>> > "countApproxDistinct") after filtering by a value that is unknown at
>> > compile time (it is specified by the query).
>> >
>> > I've experimented with using (abusing) Spark Streaming, by streaming
>> > queries and running them against the cached RDD. However, as I say, I
>> > don't think this is an intended use case of Streaming.
>> >
>> > Cheers,
>> >
>> > Krishna
>>
>
>


-- 

*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
