Re: Counting distinct values for a key?

2015-07-20 Thread N B
Hi Jerry, In fact, the HashSet approach is what we tried earlier. However, it did not work with a Windowed DStream (i.e. when we provide both a forward and an inverse reduce operation). The reason is that the inverse reduce tries to remove values that may still exist elsewhere in the window and should not have
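
One possible way around the problem described above, offered only as a sketch and not as what the thread settled on, is to keep a per-value occurrence count instead of a plain set. Subtracting the counts of the batch that leaves the window then removes a value only once no other batch in the window still contributes it. The class and method names below (WindowedDistinct, add, subtract, batch) are hypothetical, and plain Java maps stand in for the per-key state; in Spark Streaming, functions of this shape would presumably be the forward and inverse arguments of something like reduceByKeyAndWindow.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: plain Java, no Spark dependency.
class WindowedDistinct {

    // Forward reduce: add the value counts of the batch entering the window.
    static Map<String, Integer> add(Map<String, Integer> window, Map<String, Integer> entering) {
        Map<String, Integer> out = new HashMap<>(window);
        for (Map.Entry<String, Integer> e : entering.entrySet()) {
            out.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return out;
    }

    // Inverse reduce: subtract the counts of the batch leaving the window.
    // A value is dropped only when its count reaches zero.
    static Map<String, Integer> subtract(Map<String, Integer> window, Map<String, Integer> leaving) {
        Map<String, Integer> out = new HashMap<>(window);
        for (Map.Entry<String, Integer> e : leaving.entrySet()) {
            int remaining = out.getOrDefault(e.getKey(), 0) - e.getValue();
            if (remaining > 0) {
                out.put(e.getKey(), remaining);
            } else {
                out.remove(e.getKey());
            }
        }
        return out;
    }

    // Builds the value counts for one batch of values belonging to a single key.
    static Map<String, Integer> batch(String... values) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> first = batch("v1", "v2");
        Map<String, Integer> second = batch("v1");

        Map<String, Integer> window = add(first, second);
        System.out.println(window.size()); // distinct values while both batches are in the window

        window = subtract(window, first);
        System.out.println(window.size()); // v1 survives because the second batch still holds it
    }
}
```

The distinct count for a key is the size of its map. The cost is that the state grows with the number of distinct values per key in the window.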

Re: Counting distinct values for a key?

2015-07-19 Thread suyog choudhari
Maybe you need to follow these steps: 1) Swap key and value 2) Use the sortByKey API 3) Swap key and value back 4) Reduce the result for the top keys http://stackoverflow.com/questions/29003246/how-to-achieve-sort-by-value-in-spark-java On Sun, Jul 19, 2015 at 5:48 PM, N B wrote: > Hi Suyog, > > That code out
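
The four steps above can be illustrated without Spark. The following is a rough sketch that uses plain Java lists in place of a pair RDD; the class name SortByValue, the method topByValue, and the sample data are hypothetical, and descending order is an assumption about what "top keys" means here.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: plain Java, no Spark dependency.
class SortByValue {

    // Returns the n pairs with the largest values, highest first.
    static List<Map.Entry<String, Integer>> topByValue(List<Map.Entry<String, Integer>> pairs, int n) {
        // 1) Swap key and value so the count becomes the key.
        List<Map.Entry<Integer, String>> swapped = new ArrayList<>();
        for (Map.Entry<String, Integer> p : pairs) {
            swapped.add(new SimpleEntry<>(p.getValue(), p.getKey()));
        }
        // 2) Sort by the new key (the original value), descending.
        swapped.sort((a, b) -> Integer.compare(b.getKey(), a.getKey()));
        // 3) Swap back and 4) keep only the top n entries.
        List<Map.Entry<String, Integer>> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, swapped.size()); i++) {
            Map.Entry<Integer, String> e = swapped.get(i);
            top.add(new SimpleEntry<>(e.getValue(), e.getKey()));
        }
        return top;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("a", 1));
        pairs.add(new SimpleEntry<>("b", 3));
        pairs.add(new SimpleEntry<>("c", 2));
        System.out.println(topByValue(pairs, 2)); // prints [b=3, c=2]
    }
}
```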

Re: Counting distinct values for a key?

2015-07-19 Thread Jerry Lam
Hi Nikunj, Sorry, I totally misread your question. I think you need to first groupByKey (to get all values of the same key together), followed by mapValues (probably put the values into a set and then take its size, because you want a distinct count) HTH, Jerry Sent from my iPhone > On
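
The two-step idea above (group the values per key, then map each group to the size of its set) can be sketched with ordinary Java collections. This is only an illustration of the logic, not Spark code: the class DistinctPerKey, the helper pair, and the sample data are hypothetical, and on an RDD or DStream the same steps would go through groupByKey and mapValues.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch: plain Java, no Spark dependency.
class DistinctPerKey {

    // Convenience constructor for a (key, value) pair.
    static Map.Entry<String, String> pair(String key, String value) {
        return new SimpleEntry<>(key, value);
    }

    // Equivalent of: SELECT key, count(distinct(value)) FROM table GROUP BY key
    static Map<String, Integer> countDistinct(List<Map.Entry<String, String>> pairs) {
        // "groupByKey" step: collect each key's values into a set, which drops duplicates.
        Map<String, Set<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new HashSet<>()).add(p.getValue());
        }
        // "mapValues" step: replace each set by its size.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : grouped.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = Arrays.asList(
                pair("k1", "v1"), pair("k1", "v1"), pair("k1", "v2"), pair("k2", "v1"));
        System.out.println(countDistinct(pairs)); // prints {k1=2, k2=1}
    }
}
```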

Re: Counting distinct values for a key?

2015-07-19 Thread N B
Hi Suyog, That code outputs the following: key2 val22 : 1 key1 val1 : 2 key2 val2 : 2 while the output I want to achieve would have been (with your example): key1 : 2 key2 : 2 because there are 2 distinct types of values for each key ( regardless of their actual duplicate counts .. hence the u

Re: Counting distinct values for a key?

2015-07-19 Thread N B
Hi Jerry, It does not work directly for 2 reasons: 1. I am trying to do this using Spark Streaming (Window DStreams), and the DataFrames API does not work with Streaming yet. 2. The query equivalent has a "distinct" embedded in it, i.e. I am looking to achieve the equivalent of SELECT key, count(dist

Re: Counting distinct values for a key?

2015-07-19 Thread suyog choudhari
public static void main(String[] args) { SparkConf sparkConf = new SparkConf().setAppName("CountDistinct"); JavaSparkContext jsc = new JavaSparkContext(sparkConf); List<Tuple2<String, String>> list = new ArrayList<Tuple2<String, String>>(); list.add(new Tuple2<String, String>("key1", "val1")); list.add(new Tuple2<String, String>("key1", "val1")); list.add(new T

Re: Counting distinct values for a key?

2015-07-19 Thread Jerry Lam
You mean this does not work? SELECT key, count(value) from table group by key On Sun, Jul 19, 2015 at 2:28 PM, N B wrote: > Hello, > > How do I go about performing the equivalent of the following SQL clause in > Spark Streaming? I will be using this on a Windowed DStream. > > SELECT key, coun

Counting distinct values for a key?

2015-07-19 Thread N B
Hello, How do I go about performing the equivalent of the following SQL clause in Spark Streaming? I will be using this on a Windowed DStream. SELECT key, count(distinct(value)) from table group by key; so for example, given the following dataset in the table: key | value -+--- k1 |