Dataset.distinct - Question on deterministic results

Will Bastian Thu, 09 Aug 2018 09:13:29 -0700

I'm operating on a data set with some challenges to overcome. They are:

   1. There may be multiple entries for a single key, and
   2. For a single key, there may be multiple unique value-tuples


For example:
key, val1, val2, val3
1,      0,    0,    0
1,      0,    0,    0
1,      1,    0,    0
2,      1,    1,    1
2,      1,    1,    1
2,      1,    1,    0
1,      0,    0,    0

When I execute mySet.distinct(_.key) on the above, my final results
suggest that distinct isn't pulling the same record/value-tuple for a
given key on every run.

I fully understand that this use of distinct isn't optimal (we don't know
or care which value-tuple we get; we just want it to be consistent on each
run), but I wanted to validate whether what I believe I'm observing is
accurate. Specifically, in this example, is Flink reducing by key with no
concern for the values, so that we should expect to possibly get a
different instance back on each distinct call?
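
For reference, one way to make the choice of record per key repeatable is
to replace "whichever record arrives first" with an explicit, order-
independent rule, such as keeping the smallest value-tuple per key. The
sketch below illustrates this with plain Scala collections rather than
Flink itself. The names Rec, pick, and distinctByKey are hypothetical, and
the column types are assumed to be Int. In the Flink Scala DataSet API, the
same pick function could presumably be passed to
mySet.groupBy(_.key).reduce(...), though that is an assumption about the
job and not something tested here.

```scala
// Hypothetical record type mirroring the example columns (types assumed).
case class Rec(key: Int, val1: Int, val2: Int, val3: Int)

object DeterministicDistinct {
  // Total order on the value-tuple, so ties between records with the
  // same key are always broken the same way.
  private val byValues: Ordering[Rec] =
    Ordering.by((r: Rec) => (r.val1, r.val2, r.val3))

  // Keeps the smaller record. Because "minimum" is commutative and
  // associative, the result does not depend on arrival order.
  def pick(a: Rec, b: Rec): Rec = if (byValues.lteq(a, b)) a else b

  // Plain-collections stand-in for a keyed reduce: one record per key.
  def distinctByKey(rows: Seq[Rec]): Map[Int, Rec] =
    rows.groupBy(_.key).map { case (k, rs) => k -> rs.reduce(pick) }

  // The rows from the example table above.
  val sample: Seq[Rec] = Seq(
    Rec(1, 0, 0, 0), Rec(1, 0, 0, 0), Rec(1, 1, 0, 0),
    Rec(2, 1, 1, 1), Rec(2, 1, 1, 1), Rec(2, 1, 1, 0),
    Rec(1, 0, 0, 0)
  )

  def main(args: Array[String]): Unit = {
    val forward  = distinctByKey(sample)
    val reversed = distinctByKey(sample.reverse)
    // Same result regardless of the order the rows are seen in.
    assert(forward == reversed)
    forward.toSeq.sortBy(_._1).foreach { case (_, r) => println(r) }
  }
}
```

On the sample rows this keeps (1, 0, 0, 0) for key 1 and (2, 1, 1, 0) for
key 2, whatever order the input is traversed in. Any other total order on
the values would work equally well; the minimum is just a convenient rule
when it doesn't matter which value-tuple is kept.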

Thanks,
Will
