Hi Jerry,
In fact, the HashSet approach is what we took earlier. However, it did not
work with a Windowed DStream (i.e. if we provide a forward and an inverse
reduce operation). The reason is that the inverse reduce tries to remove
values that may still exist elsewhere in the window and should not have
been removed.
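One workaround I can think of (an assumption on my part, not something settled in this thread) is to keep a count per value instead of a bare HashSet, i.e. a multiset, so the inverse reduce only drops a value once its count reaches zero. A plain-Java sketch of the idea, with the Spark wiring left out:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: a multiset of values. The forward reduce increments a value's
// count, the inverse reduce decrements it, and a value only disappears
// from the window when its count hits zero. Class and method names here
// are illustrative, not part of any Spark API.
public class ValueMultiset {
    private final Map<String, Integer> counts = new HashMap<>();

    // forward reduce: merge one occurrence in
    public void add(String value) {
        counts.merge(value, 1, Integer::sum);
    }

    // inverse reduce: remove one occurrence; the value survives if it
    // still occurs elsewhere in the window
    public void remove(String value) {
        counts.computeIfPresent(value, (v, c) -> c > 1 ? c - 1 : null);
    }

    public int distinctCount() {
        return counts.size();
    }

    public static void main(String[] args) {
        ValueMultiset m = new ValueMultiset();
        m.add("val1");
        m.add("val1");
        m.add("val2");
        m.remove("val1"); // "val1" still occurs once elsewhere in the window
        System.out.println(m.distinctCount()); // prints 2
    }
}
```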
Maybe you need to do the steps below:
1) Swap key and value
2) Use sortByKey API
3) Swap key and value
4) Reduce result for top keys
http://stackoverflow.com/questions/29003246/how-to-achieve-sort-by-value-in-spark-java
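Roughly, the four steps look like this in plain Java (SimpleEntry standing in for Tuple2, and the method name below being my own, not Spark's; in Spark it would be mapToPair + sortByKey(false) + mapToPair + take(n)):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of swap / sortByKey / swap back / take top-n, using plain
// collections in place of a JavaPairRDD.
public class TopByValue {
    public static List<SimpleEntry<String, Integer>> top(Map<String, Integer> counts, int n) {
        // 1) swap key and value
        List<SimpleEntry<Integer, String>> swapped = new ArrayList<>();
        counts.forEach((k, v) -> swapped.add(new SimpleEntry<>(v, k)));
        // 2) "sortByKey": sort on the former value, largest first
        swapped.sort((a, b) -> b.getKey().compareTo(a.getKey()));
        // 3) swap key and value back, and 4) keep only the top n
        List<SimpleEntry<String, Integer>> top = new ArrayList<>();
        for (SimpleEntry<Integer, String> e : swapped.subList(0, Math.min(n, swapped.size()))) {
            top.add(new SimpleEntry<>(e.getValue(), e.getKey()));
        }
        return top;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("key1", 3, "key2", 1, "key3", 2);
        System.out.println(top(counts, 2)); // prints [key1=3, key3=2]
    }
}
```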
On Sun, Jul 19, 2015 at 5:48 PM, N B wrote:
Hi Nikunj,
Sorry, I totally misread your question.
I think you need to first groupByKey (to get all values of the same key together),
then follow with mapValues (probably put the values into a set and then take its
size, because you want a distinct count).
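Something like this, sketched with plain Java collections (the real thing would use groupByKey/mapValues on the pair DStream; the helper below and its input shape are just illustrative):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the suggestion above: group the values by key ("groupByKey"),
// collect each key's values into a Set, and take the set's size
// ("mapValues"). Pairs are two-element String arrays here just to keep
// the example dependency-free.
public class DistinctPerKey {
    public static Map<String, Integer> distinctPerKey(List<String[]> pairs) {
        // "groupByKey": collect all values seen for each key into a set
        Map<String, Set<String>> grouped = new HashMap<>();
        for (String[] kv : pairs) {
            grouped.computeIfAbsent(kv[0], k -> new HashSet<>()).add(kv[1]);
        }
        // "mapValues": replace each set by its size (the distinct count)
        Map<String, Integer> sizes = new HashMap<>();
        grouped.forEach((k, vals) -> sizes.put(k, vals.size()));
        return sizes;
    }

    public static void main(String[] args) {
        List<String[]> pairs = List.of(
                new String[]{"key1", "val1"},
                new String[]{"key1", "val1"},
                new String[]{"key2", "val2"},
                new String[]{"key2", "val22"});
        // key1 has 1 distinct value, key2 has 2
        System.out.println(distinctPerKey(pairs));
    }
}
```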
HTH,
Jerry
Sent from my iPhone
Hi Suyog,
That code outputs the following:
key2 val22 : 1
key1 val1 : 2
key2 val2 : 2
while the output I want to achieve would have been (with your example):
key1 : 2
key2 : 2
because there are 2 distinct types of values for each key (regardless of
their actual duplicate counts, hence the use of distinct).
Hi Jerry,
It does not work directly for 2 reasons:
1. I am trying to do this using Spark Streaming (Window DStreams) and
DataFrames API does not work with Streaming yet.
2. The query equivalent has a "distinct" embedded in it, i.e. I am looking
to achieve the equivalent of
SELECT key, count(distinct(value)) from table group by key;
public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("CountDistinct");
    JavaSparkContext jsc = new JavaSparkContext(sparkConf);
    List<Tuple2<String, String>> list = new ArrayList<>();
    list.add(new Tuple2<>("key1", "val1"));
    list.add(new Tuple2<>("key1", "val1"));
    // remaining rows and the count step reconstructed from the output quoted above
    list.add(new Tuple2<>("key2", "val2"));
    list.add(new Tuple2<>("key2", "val2"));
    list.add(new Tuple2<>("key2", "val22"));
    jsc.parallelize(list).mapToPair(t -> new Tuple2<>(t, 1)).reduceByKey(Integer::sum)
        .collect().forEach(t -> System.out.println(t._1._1 + " " + t._1._2 + " : " + t._2));
    jsc.close();
}
You mean this does not work?
SELECT key, count(value) from table group by key
On Sun, Jul 19, 2015 at 2:28 PM, N B wrote:
Hello,
How do I go about performing the equivalent of the following SQL clause in
Spark Streaming? I will be using this on a Windowed DStream.
SELECT key, count(distinct(value)) from table group by key;
so for example, given the following dataset in the table:
key | value
----+------
k1 |