General question on using StringIndexer in SparkML

Vishnu Viswanath Sat, 28 Nov 2015 12:34:20 -0800

Hi All,

I have a general question on using StringIndexer.
StringIndexer assigns an index to each distinct label in a column, starting
from 0 (0 for the most frequent label).


Suppose I am building a model, and I use StringIndexer to transform one
of my columns.
For example, suppose A is the most frequent word, followed by B and C.

So the StringIndexer will generate

A  0.0
B  1.0
C  2.0

After building the model, I am going to make predictions with it, so I
apply the same transformation to the new data I need to predict on.
Suppose the new dataset has C as the most frequent word, followed by B
and A. The StringIndexer will then assign the indices as

C 0.0
B 1.0
A 2.0

These indices are different from the ones used to train the model. So won’t
this give me wrong predictions if I use StringIndexer?
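
To make the concern concrete, here is a minimal plain-Python sketch. It is
not Spark code: fit_index is a hypothetical helper that only mimics the
frequency-descending ordering described above, and the two small datasets
are made up to match the A/B/C example. It compares re-deriving the mapping
from the new data against reusing the mapping derived from the training data.

```python
from collections import Counter


def fit_index(values):
    """Hypothetical helper (not a Spark API): map labels to indices,
    giving 0.0 to the most frequent label."""
    counts = Counter(values)
    # Sort by count descending, then by label so ties are deterministic.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Made-up data matching the example: A > B > C in training, C > B > A in new data.
train = ["A"] * 3 + ["B"] * 2 + ["C"]
new = ["C"] * 3 + ["B"] * 2 + ["A"]

train_mapping = fit_index(train)   # mapping learned from the training data
refit_mapping = fit_index(new)     # mapping re-learned from the new data

# Encode the new data both ways.
reused = [train_mapping[v] for v in new]  # consistent with training
refit = [refit_mapping[v] for v in new]   # indices no longer mean the same labels

print("training mapping:", train_mapping)
print("re-fitted mapping:", refit_mapping)
print("new data, training mapping reused:", reused)
print("new data, mapping re-fitted:", refit)
```

In this sketch the indices only stay consistent when the mapping learned
from the training data is reused; re-fitting on the new data swaps the
indices of A and C.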

-- 
Thanks and Regards,
Vishnu Viswanath,
*www.vishnuviswanath.com <http://www.vishnuviswanath.com>*
