thanks, this direction seems to be in line with what I want.
what I really want is:
groupBy() and then, for the rows in each group, get an Iterator and run
each element from the iterator through a local function (specifically SGD).
right now the Dataset API provides this, but it's literally an Iterator over
a group that the shuffle has already fully materialized.
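
A minimal sketch of that Dataset pattern (Spark 2.0; the Sample class, the
input path, and the localSgd stub are illustrative stand-ins, not from the
thread):

import org.apache.spark.sql.SparkSession

case class Sample(modelId: Long, label: Double, features: Array[Double])

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Stand-in for the real per-group solver; a concrete version is sketched
// at the end of the thread.
def localSgd(rows: Iterator[Sample]): Array[Double] = ???

// Illustrative input path.
val samples = spark.read.parquet("/path/to/samples").as[Sample]

// groupByKey + mapGroups hands each group to a local function as an
// Iterator, but the shuffle still materializes the whole group first.
val models = samples
  .groupByKey(_.modelId)
  .mapGroups { (modelId, rows) => (modelId, localSgd(rows)) }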
thanks.
this is exactly what I ended up doing in the end. though it seemed to work,
there seems to be no guarantee that the randomness after
sortWithinPartitions() would be preserved after I do a further groupBy
(see the sketch below the quote).
On Fri, Oct 21, 2016 at 3:55 PM, Cheng Lian wrote:
> I think it would be much easier to use the DataFrame API to do this [...]
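
Regarding that concern: the randomness will not, in general, survive a
further groupBy, because groupBy triggers another shuffle that reorders
rows arbitrarily. One shuffle-free alternative (a sketch reusing the
illustrative Sample class and localSgd stub from above) is to sort each
partition by the key first and a random value second, then cut the
partition into per-model runs with mapPartitions:

import org.apache.spark.sql.functions.randn

// Key first, randomness second: each model's rows end up contiguous in
// their partition and randomly ordered within the model. No later groupBy
// is needed, so nothing reshuffles them.
val arranged = samples
  .repartition($"modelId")
  .sortWithinPartitions($"modelId", randn(42))

val trained = arranged.mapPartitions { rows =>
  val it = rows.buffered
  new Iterator[(Long, Array[Double])] {
    def hasNext: Boolean = it.hasNext
    def next(): (Long, Array[Double]) = {
      val id = it.head.modelId
      // Lazily walk this model's contiguous run of rows.
      val run = new Iterator[Sample] {
        def hasNext: Boolean = it.hasNext && it.head.modelId == id
        def next(): Sample = it.next()
      }
      val w = localSgd(run)
      while (run.hasNext) run.next() // drain so the next run starts cleanly
      (id, w)
    }
  }
}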
groupBy always materializes the entire group (on disk or in memory), which
is why you should avoid it for large groups.
The key is to never materialize the grouped and shuffled data.
To see one approach to do this, take a look at
https://github.com/tresata/spark-sorted
It's basically a combination of smart partitioning and a secondary sort,
so each group can be streamed instead of materialized.
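
At the RDD level, the core of that trick is
repartitionAndSortWithinPartitions with a composite key. Roughly (a sketch
of the idea, not spark-sorted's actual code):

import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.{HashPartitioner, Partitioner}
import org.apache.spark.rdd.RDD

// Partition on modelId alone, but sort each partition by the full
// (modelId, random) key: every model's records arrive contiguous and in
// random order, without ever materializing a group.
class ModelPartitioner(partitions: Int) extends Partitioner {
  private val hp = new HashPartitioner(partitions)
  def numPartitions: Int = partitions
  def getPartition(key: Any): Int = key match {
    case (modelId, _) => hp.getPartition(modelId)
  }
}

def arrange[V: ClassTag](rdd: RDD[(Long, V)], partitions: Int): RDD[((Long, Double), V)] =
  rdd
    .map { case (id, v) => ((id, Random.nextDouble()), v) } // a seeded RNG would make this reproducible
    .repartitionAndSortWithinPartitions(new ModelPartitioner(partitions))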
I think it would be much easier to use the DataFrame API to do this, by
doing a local sort using randn() as the key. For example, in Spark 2.0:
val df = spark.range(100)
val shuffled = df.repartition($"id" % 10).sortWithinPartitions(randn(42))
Replace df with a DataFrame wrapping your RDD, and $"id" % 10 with your
actual partitioning expression.
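
A quick way to sanity-check the resulting layout (spark_partition_id is
provided by org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.spark_partition_id

// Shows which partition each row landed in, in within-partition order.
shuffled.withColumn("pid", spark_partition_id()).show(20, false)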
in my application, I group the training samples by their model_id's
(the input table contains training samples for 100k different models),
and each group ends up having about 1 million training samples.
I then feed that group of samples to a little Logistic Regression solver
(SGD), but SGD requires the samples to arrive in random order.
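
For concreteness, a minimal sketch of the kind of local solver meant here:
one streaming pass of logistic-regression SGD over a group's iterator. It
could fill in the localSgd stub from the first sketch; the fixed
dimensionality, learning rate, and {0, 1} labels are assumptions:

// One SGD pass of logistic regression over one group's rows; holds only
// a single sample in memory at a time, but assumes the rows already
// arrive in random order.
def localSgd(rows: Iterator[Sample], dim: Int = 100, lr: Double = 0.01): Array[Double] = {
  val w = new Array[Double](dim)
  for (s <- rows) {
    var dot = 0.0
    var i = 0
    while (i < dim) { dot += w(i) * s.features(i); i += 1 }
    val pred = 1.0 / (1.0 + math.exp(-dot)) // sigmoid
    val err = pred - s.label                // log-loss gradient scale
    i = 0
    while (i < dim) { w(i) -= lr * err * s.features(i); i += 1 }
  }
  w
}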