Re: Random Forest and StringIndexer in pyspark ML Pipeline

2015-08-21 Thread Yanbo Liang

The ML package aims to provide machine learning pipelines that make it more
efficient for users to build machine learning workflows.
Letting StringIndexer chain with any kind of Estimator keeps the design more
general.
I think we can make the StringIndexer step and its reverse automatic in the
future.
If you want to recover your original labels, you can use IndexToString.
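
To illustrate what reversing the indexing amounts to, here is a minimal
plain-Python sketch. It does not call Spark: it assumes you already have the
ordered label list that the fitted indexer holds (as far as I know, exposed as
the `labels` attribute of the fitted StringIndexerModel in Spark 1.5 and
later), where position i is the original label assigned to index i. The
function name `index_to_label` and the sample values are hypothetical.

```python
def index_to_label(predictions, labels):
    """Map predicted indices back to original labels.

    `labels` is assumed to be ordered so that labels[i] is the original
    label that was assigned index i when the indexer was fitted.
    """
    return [labels[int(p)] for p in predictions]


# Hypothetical ordering: "1" was the most frequent label, so it got index 0.
labels = ["1", "0"]
predictions = [0.0, 1.0, 0.0]  # indices as a classifier would emit them

restored = index_to_label(predictions, labels)
print(restored)  # ['1', '0', '1']
```

IndexToString performs this lookup as a pipeline stage, so the predictions
column can be converted back without handling the mapping by hand.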

2015-08-11 6:56 GMT+08:00 pkphlam pkph...@gmail.com:

Hi,

If I understand the RandomForest model in the ML Pipeline implementation in
the ml package correctly, I have to first run my outcome label variable
through the StringIndexer, even if my labels are numeric. The StringIndexer
then converts the labels into numeric indices based on frequency of the
label.

This can cause problems: if I'm classifying binary outcomes where my original
labels are simply 0 and 1, the StringIndexer may flip my labels, so that 0s
become 1s and 1s become 0s, whenever my original 1s are more frequent. This
transformation then carries over to the predictions. In the old mllib
implementation, the RF does not require the labels to be changed, and I could
use 0/1 labels without worrying about them being transformed.
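
The flip described above can be reproduced with a small plain-Python sketch
that emulates frequency-ordered indexing. It does not call Spark; the helper
`fit_frequency_index` is hypothetical, and the alphabetical tie-break is an
assumption made only to keep the sketch deterministic.

```python
from collections import Counter


def fit_frequency_index(values):
    """Return the distinct labels (as strings), most frequent first."""
    counts = Counter(str(v) for v in values)
    # Sort by descending count; break ties by label text (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in ranked]


original = [1, 1, 1, 0, 0]  # label 1 is the majority class

ordered = fit_frequency_index(original)
to_index = {label: float(i) for i, label in enumerate(ordered)}
indexed = [to_index[str(v)] for v in original]

print(ordered)  # ['1', '0']
print(indexed)  # [0.0, 0.0, 0.0, 1.0, 1.0]
```

Because 1 is more frequent, it receives index 0.0 and the original 0 receives
index 1.0, which is exactly the swap in question.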

I was wondering:
1. Why is this the default implementation for the Pipeline RF? This seems
like it could cause a lot of confusion in cases like the one I outlined
above.
2. Is there a way to avoid this by either controlling how the indices are
created in StringIndexer or bypassing StringIndexer altogether?
3. If 2 is not possible, is there an easy way to see how my original labels
mapped onto the indices so that I can revert the predictions back to the
original labels rather than the transformed labels? I suppose I could do
this by counting the original labels and mapping by frequency, but it seems
like there should be a more straightforward way to get it out of the
StringIndexer.

Thanks!

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Random-Forest-and-StringIndexer-in-pyspark-ML-Pipeline-tp24200.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Random Forest and StringIndexer in pyspark ML Pipeline

2015-08-10 Thread pkphlam
