[jira] [Commented] (SPARK-23414) Plotting using matplotlib in MLlib pyspark

2018-02-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363029#comment-16363029
 ] 

Sean Owen commented on SPARK-23414:
---

matplotlib doesn't interact with Spark, so issues with using it are unlikely to 
be relevant to Spark itself anyway.
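
To make the separation concrete: once the label and prediction columns have been collected to the driver, they are plain Python values, and everything after that point is ordinary local computation with no Spark involvement. The following is a minimal sketch under that assumption, using only the standard library. The helper name `row_normalized_confusion` and the sample values are hypothetical, not part of Spark, scikit-learn, or the reporter's data, and class indices are assumed to run from 0 to n_classes - 1.

```python
from collections import Counter


def row_normalized_confusion(y_true, y_pred, n_classes):
    """Return an n_classes x n_classes matrix whose rows (true classes) sum to 1."""
    # Count how often each (true class, predicted class) pair occurs.
    counts = Counter(zip(y_true, y_pred))
    matrix = []
    for t in range(n_classes):
        row = [counts[(t, p)] for p in range(n_classes)]
        total = sum(row)
        # A class absent from the test set gets a row of zeros
        # instead of triggering a division by zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical values standing in for the collected
# "indexedLabel" and "prediction" columns.
y_test = [0, 0, 1, 1, 1, 2]
y_predicted = [0, 1, 1, 1, 0, 2]
cm = row_normalized_confusion(y_test, y_predicted, n_classes=3)
```

The resulting nested list is a regular 2-D structure, so it can be handed to a plotting call such as the `sns.heatmap` call in the quoted snippet; any problem at that stage would be a plotting question rather than a Spark one.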

> Plotting using matplotlib in MLlib pyspark 
> ---
>
>              Key: SPARK-23414
>              URL: https://issues.apache.org/jira/browse/SPARK-23414
>          Project: Spark
>       Issue Type: Question
>       Components: MLlib
> Affects Versions: 2.2.1
>         Reporter: Waleed Esmail
>         Priority: Major
>
> Dear MLlib experts,
> I want to plot a confusion matrix (true values vs. predicted values) like 
> the one produced by the seaborn module in Python, so I did the 
> following:
> {code:python}
> import numpy as np
> import matplotlib.pyplot as plt
> import seaborn as sns
> from pyspark.ml import Pipeline
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.feature import StringIndexer, VectorIndexer
> from sklearn.metrics import confusion_matrix
>
> # Index the string labels as numeric class indices.
> labelIndexer = StringIndexer(inputCol="label",
>                              outputCol="indexedLabel").fit(output)
> # Automatically identify categorical features, and index them.
> # maxCategories is not set here, so the VectorIndexer default (20) applies:
> # features with > 20 distinct values are treated as continuous.
> featureIndexer = VectorIndexer(inputCol="features",
>                                outputCol="indexedFeatures").fit(output)
> # Split the data into training and test sets (30% held out for testing)
> (trainingData, testData) = output.randomSplit([0.7, 0.3])
> dt = DecisionTreeClassifier(labelCol="indexedLabel",
>                             featuresCol="indexedFeatures", maxDepth=15)
> # Chain indexers and tree in a Pipeline
> pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])
> # Train model.  This also runs the indexers.
> model = pipeline.fit(trainingData)
> # Make predictions.
> predictions = model.transform(testData)
> # Collect both columns in a single pass so predictions and labels stay
> # aligned, then flatten the Row objects into 1-D arrays.
> rows = predictions.select("prediction", "indexedLabel").collect()
> y_predicted = np.array([row["prediction"] for row in rows])
> y_test = np.array([row["indexedLabel"] for row in rows])
> figcm, ax = plt.subplots()
> cm = confusion_matrix(y_test, y_predicted)
> # Normalize each row so entries are fractions of the true class.
> cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
> sns.heatmap(cm, square=True, annot=True, cbar=False, ax=ax)
> ax.set_xlabel('prediction')
> ax.set_ylabel('true value')
> plt.show()
> {code}
> Is this the right way to do it? Please note that I am new to Spark and MLlib.
>  
> Thank you in advance,



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23414) Plotting using matplotlib in MLlib pyspark

2018-02-13 Thread Waleed Esmail (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363025#comment-16363025
 ] 

Waleed Esmail commented on SPARK-23414:
---

I am sorry, I didn't get it. What do you mean by "orthogonal"?



