Re: Random Forest and StringIndexer in pyspark ML Pipeline

2015-08-21 Thread Yanbo Liang
The ML package aims to provide machine learning pipelines that let users
build machine learning workflows more efficiently.
Keeping StringIndexer as a separate stage that can be chained with any kind
of Estimator is the more general design.
I think we can make the StringIndexer step and its reverse process automatic
in the future.
If you want to recover your original labels, you can use IndexToString.
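
To make the mapping concrete, here is a minimal plain-Python sketch of the
frequency-based indexing described in this thread and of reversing it. It is
not Spark code: the function names (fit_label_index, to_index, to_label) are
hypothetical, and the tie-breaking rule is an assumption made only so the
example is deterministic.

```python
from collections import Counter


def fit_label_index(labels):
    """Return the distinct labels (as strings), most frequent first.

    A label's position in the returned list is its index. Ties are broken
    by the label string; that rule is an assumption of this sketch.
    """
    counts = Counter(str(v) for v in labels)
    return sorted(counts, key=lambda lab: (-counts[lab], lab))


def to_index(ordered, labels):
    """Map each original label to its frequency-based index (as a float)."""
    lookup = {lab: float(i) for i, lab in enumerate(ordered)}
    return [lookup[str(v)] for v in labels]


def to_label(ordered, indices):
    """Reverse the mapping: turn indices back into the original labels."""
    return [ordered[int(i)] for i in indices]


# Binary labels where 1 is more frequent than 0.
raw = [1, 1, 1, 0, 1, 0]

ordered = fit_label_index(raw)         # '1' comes first, so it gets index 0
indexed = to_index(ordered, raw)       # the 0/1 labels end up flipped
restored = to_label(ordered, indexed)  # original labels, as strings

print(ordered)
print(indexed)
print(restored)
```

The ordered list plays the role of the lookup table: keeping it around is
what lets predicted indices be turned back into the original labels.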

2015-08-11 6:56 GMT+08:00 pkphlam pkph...@gmail.com:

 Hi,

 If I understand the RandomForest model in the ML Pipeline implementation in
 the ml package correctly, I have to first run my outcome label variable
 through the StringIndexer, even if my labels are numeric. The StringIndexer
 then converts the labels into numeric indices based on frequency of the
 label.

 This could create a situation where, if I'm classifying binary outcomes and
 my original labels are simply 0 and 1, the StringIndexer actually flips my
 labels so that 0s become 1s and 1s become 0s when my original 1s are more
 frequent. This transformation would then carry over to the predictions. In
 the old mllib implementation, the RF does not require the labels to be
 changed, and I could use 0/1 labels without worrying about them being
 transformed.

 I was wondering:
 1. Why is this the default implementation for the Pipeline RF? This seems
 like it could cause a lot of confusion in cases like the one I outlined
 above.
 2. Is there a way to avoid this by either controlling how the indices are
 created in StringIndexer or bypassing StringIndexer altogether?
 3. If 2 is not possible, is there an easy way to see how my original labels
 mapped onto the indices so that I can revert the predictions back to the
 original labels rather than the transformed labels? I suppose I could do
 this by counting the original labels and mapping by frequency, but it seems
 like there should be a more straightforward way to get it out of the
 StringIndexer.

 Thanks!



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Random-Forest-and-StringIndexer-in-pyspark-ML-Pipeline-tp24200.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Random Forest and StringIndexer in pyspark ML Pipeline

2015-08-10 Thread pkphlam
Hi,

If I understand the RandomForest model in the ML Pipeline implementation in
the ml package correctly, I have to first run my outcome label variable
through the StringIndexer, even if my labels are numeric. The StringIndexer
then converts the labels into numeric indices based on frequency of the
label. 

This could create a situation where, if I'm classifying binary outcomes and
my original labels are simply 0 and 1, the StringIndexer actually flips my
labels so that 0s become 1s and 1s become 0s when my original 1s are more
frequent. This transformation would then carry over to the predictions. In
the old mllib implementation, the RF does not require the labels to be
changed, and I could use 0/1 labels without worrying about them being
transformed.

I was wondering:
1. Why is this the default implementation for the Pipeline RF? This seems
like it could cause a lot of confusion in cases like the one I outlined
above.
2. Is there a way to avoid this by either controlling how the indices are
created in StringIndexer or bypassing StringIndexer altogether?
3. If 2 is not possible, is there an easy way to see how my original labels
mapped onto the indices so that I can revert the predictions back to the
original labels rather than the transformed labels? I suppose I could do
this by counting the original labels and mapping by frequency, but it seems
like there should be a more straightforward way to get it out of the
StringIndexer.

Thanks!


