Hi,

I am trying to use StreamingLogisticRegressionWithSGD to build a CTR
prediction model.

The documentation at

http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression

mentions that numFeatures should be *constant*.

The problem I am facing is this: since most of my variables are
categorical, numFeatures has to be the final number of features after the
categorical variables have been encoded and parsed into LabeledPoint
format.

Suppose that for a categorical variable x1 I have 10 distinct values in the
current window.

In the next window, some new values/items get added to x1 and the number
of distinct values increases. How should I handle numFeatures in this
case, since it will now change?
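
To make the problem concrete, here is a minimal sketch in plain Python (not
Spark code; the one_hot helper and the sample values are made up purely for
illustration) of how one-hot encoding each window separately changes the
length of the feature vector:

```python
def one_hot(value, vocabulary):
    """Encode `value` as a one-hot list over the sorted vocabulary."""
    index = {v: i for i, v in enumerate(sorted(vocabulary))}
    vec = [0.0] * len(index)
    vec[index[value]] = 1.0
    return vec

# Hypothetical values of x1 seen in two consecutive windows.
window1 = ["a", "b", "c"]
window2 = ["a", "b", "c", "d"]  # "d" appears for the first time here

# The same value "a" is encoded against each window's own vocabulary.
vec1 = one_hot("a", set(window1))
vec2 = one_hot("a", set(window2))

# The vector length is what numFeatures would have to match.
num_features_window1 = len(vec1)
num_features_window2 = len(vec2)
print(num_features_window1, num_features_window2)
```

The two lengths differ as soon as a new value shows up, which is exactly
the case where a fixed numFeatures no longer fits.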

Basically, my question is: how should I handle new values of categorical
variables in a streaming model?

Thanks,
Kundan
