Hello all,

I have one big RDD, in which there is a column of groups A1, A2, B1, B2, B3, 
C1, D1, ..., XY.
From it, I am using map() to build an RDD[LabeledPoint] with dense vectors, 
which is the input type that Logistic Regression expects.
I would like to run a logistic regression for each one of these N groups (the 
group is NOT part of any features used in the model itself), but I could not 
find a proper way.

1.      Can't programmatically create sub-RDDs with a loop: 
org.apache.spark.SparkException: RDD transformations and actions can only be 
invoked by the driver, not inside of other transformations;

2.      Can't create RDDs manually with split(), since the number of groups is 
unknown and large

3.      Pair RDDs seemed a tempting choice with some of the reduce/combine/values 
by-key functions, but none of them returns a data type usable as an 
RDD[LabeledPoint], which is later the input for the Logistic Regressions. Any 
programmatic way to get sub-RDDs gets me back to item 1.
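
To make item 1 concrete: the exception is raised when an RDD is used inside 
another RDD's closure (a filter() called within a map(), for example). A loop 
that runs on the driver over a collected list of group keys does not nest RDD 
operations. Below is a minimal sketch of that pattern, under assumptions: the 
group key is kept next to each point as RDD[(String, LabeledPoint)], and the 
name fitPerGroup is made up for illustration. It scans the data once per group, 
so it may be too slow when N is very large.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical helper: fits one binary logistic regression per group key.
// Assumes `data` pairs each LabeledPoint with its group key (A1, A2, B1, ...).
def fitPerGroup(data: RDD[(String, LabeledPoint)]): Map[String, LogisticRegressionModel] = {
  // The same RDD is scanned once per group, so keep it in memory.
  data.cache()

  // Bring the distinct group keys back to the driver as a local array.
  val groups: Array[String] = data.map(_._1).distinct().collect()

  // This is a plain Scala loop on the driver: each iteration defines a new
  // filtered RDD and trains on it, with no RDD referenced inside another
  // RDD's closure.
  groups.map { g =>
    val subset: RDD[LabeledPoint] = data.filter(_._1 == g).map(_._2)
    g -> new LogisticRegressionWithLBFGS().run(subset)
  }.toMap
}
```

The result is a local Map from group key to its fitted model, so a model can be 
looked up by group afterwards.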

The logit has a simple binary dependent variable and n features; I just need 
to run one logit for each group.
There may be some mathematical equivalent that would let me run this as one big 
regression, but so far I'm out of ideas.

Saif
