[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

2016-09-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/13584


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

2016-09-01 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13584#discussion_r77290933
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala ---
@@ -35,13 +35,37 @@ object RWrapperUtils extends Logging {
*/
   def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
 if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
-  logWarning("data containing 'label' column, so change its name to avoid conflict")
-  rFormula.setLabelCol(rFormula.getLabelCol + "_output")
+  val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames)
--- End diff --

fair enough. that makes sense, thanks





[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

2016-09-01 Thread keypointt
Github user keypointt commented on a diff in the pull request:

https://github.com/apache/spark/pull/13584#discussion_r77273428
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala ---
@@ -35,13 +35,37 @@ object RWrapperUtils extends Logging {
*/
   def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
 if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
-  logWarning("data containing 'label' column, so change its name to avoid conflict")
-  rFormula.setLabelCol(rFormula.getLabelCol + "_output")
+  val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames)
--- End diff --

I think in `if (data.schema.fieldNames.contains(rFormula.getFeaturesCol))`, it's checking `label` only.

And in `convertToUniqueName()`, `_output` will be appended, resulting in `label_output`: `var newName = originalName + "_output"`, and then `label_output` is checked at `while (fieldNames.contains(newName))`.
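The naming logic described above can be sketched as a standalone helper. This is a minimal sketch of the behavior being discussed, not the exact Spark implementation (the real helper lives in `RWrapperUtils.scala` and its signature may differ):

```scala
object UniqueNameSketch {
  // Sketch of convertToUniqueName: append "_output" to the original name;
  // if that still collides with an existing field, append an incrementing
  // counter until the candidate is unique.
  def convertToUniqueName(originalName: String, fieldNames: Array[String]): String = {
    var newName = originalName + "_output"
    if (fieldNames.contains(newName)) {
      var x = 1
      while (fieldNames.contains(newName + x)) {
        x += 1
      }
      newName += x
    }
    newName
  }

  def main(args: Array[String]): Unit = {
    // "label" collides, and so does "label_output", so a counter is appended.
    println(convertToUniqueName("label", Array("label", "features", "label_output")))
  }
}
```

This matches the two-step check the reviewers describe: the caller first tests whether the original name collides at all, then the helper finds a free replacement.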





[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

2016-09-01 Thread keypointt
Github user keypointt commented on a diff in the pull request:

https://github.com/apache/spark/pull/13584#discussion_r77272735
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala ---
@@ -35,13 +35,37 @@ object RWrapperUtils extends Logging {
*/
   def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
 if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
-  logWarning("data containing 'label' column, so change its name to avoid conflict")
-  rFormula.setLabelCol(rFormula.getLabelCol + "_output")
+  val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames)
--- End diff --

OK, that's a better way.





[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

2016-09-01 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13584#discussion_r77272532
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala ---
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.RFormula
+import org.apache.spark.sql.Dataset
+
+object RWrapperUtils extends Logging {
+
+  /**
+   * DataFrame column check.
+   * When loading data, default columns "features" and "label" will be added. And these two names
+   * would conflict with RFormula default feature and label column names.
+   * Here is to change the column name to avoid "column already exists" error.
+   *
+   * @param rFormula RFormula instance
+   * @param data Input dataset
+   * @return Unit
+   */
+  def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
+if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
+  val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames)
+  logWarning(
+s"data containing ${rFormula.getLabelCol} column, using new name $newLabelName instead")
+  rFormula.setLabelCol(newLabelName)
+}
+
+if (data.schema.fieldNames.contains(rFormula.getFeaturesCol)) {
+  val newFeaturesName = convertToUniqueName(rFormula.getFeaturesCol, data.schema.fieldNames)
+  logWarning(
+s"data containing ${rFormula.getFeaturesCol} column, using new name $newFeaturesName")
--- End diff --

let's make this consistent with the message above in L40?





[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

2016-09-01 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13584#discussion_r77272395
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala ---
@@ -35,13 +35,37 @@ object RWrapperUtils extends Logging {
*/
   def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
 if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
-  logWarning("data containing 'label' column, so change its name to avoid conflict")
-  rFormula.setLabelCol(rFormula.getLabelCol + "_output")
+  val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames)
--- End diff --

nit: I think we end up checking for `label_output` twice, once in `if (data.schema.fieldNames.contains(rFormula.getFeaturesCol))` and a second time within `convertToUniqueName`? Perhaps we merge them?





[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

2016-09-01 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13584#discussion_r77272270
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala ---
@@ -35,13 +35,37 @@ object RWrapperUtils extends Logging {
*/
   def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
 if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
-  logWarning("data containing 'label' column, so change its name to avoid conflict")
-  rFormula.setLabelCol(rFormula.getLabelCol + "_output")
+  val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames)
+  logWarning(
+s"data containing ${rFormula.getLabelCol} column, changing its name to $newLabelName")
+  rFormula.setLabelCol(newLabelName)
 }
 
 if (data.schema.fieldNames.contains(rFormula.getFeaturesCol)) {
-  logWarning("data containing 'features' column, so change its name to avoid conflict")
-  rFormula.setFeaturesCol(rFormula.getFeaturesCol + "_output")
+  val newFeaturesName = convertToUniqueName(rFormula.getFeaturesCol, data.schema.fieldNames)
+  logWarning(
+s"data containing ${rFormula.getFeaturesCol} column, changing its name to $newFeaturesName")
--- End diff --

same here?





[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

2016-09-01 Thread keypointt
Github user keypointt commented on a diff in the pull request:

https://github.com/apache/spark/pull/13584#discussion_r77271690
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala ---
@@ -35,13 +35,37 @@ object RWrapperUtils extends Logging {
*/
   def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
 if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
-  logWarning("data containing 'label' column, so change its name to avoid conflict")
-  rFormula.setLabelCol(rFormula.getLabelCol + "_output")
+  val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames)
+  logWarning(
+s"data containing ${rFormula.getLabelCol} column, changing its name to $newLabelName")
--- End diff --

sure, I'll change it





[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

2016-09-01 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13584#discussion_r77271384
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala ---
@@ -35,13 +35,37 @@ object RWrapperUtils extends Logging {
*/
   def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
 if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
-  logWarning("data containing 'label' column, so change its name to avoid conflict")
-  rFormula.setLabelCol(rFormula.getLabelCol + "_output")
+  val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames)
+  logWarning(
+s"data containing ${rFormula.getLabelCol} column, changing its name to $newLabelName")
--- End diff --

this sounds a bit like we are renaming the existing `label` column?
perhaps just change it to `s"data containing ${rFormula.getLabelCol} column, using new name $newLabelName instead"`?





[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

2016-08-31 Thread keypointt
Github user keypointt commented on a diff in the pull request:

https://github.com/apache/spark/pull/13584#discussion_r77098961
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala ---
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.RFormula
+import org.apache.spark.sql.Dataset
+
+object RWrapperUtils extends Logging {
+
+  /**
+   * DataFrame column check.
+   * When loading data, default columns "features" and "label" will be added. And these two names
+   * would conflict with RFormula default feature and label column names.
+   * Here is to change the column name to avoid "column already exists" error.
+   *
+   * @param rFormula RFormula instance
+   * @param data Input dataset
+   * @return Unit
+   */
+  def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
+if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
+  logWarning("data containing 'label' column, so change its name to avoid conflict")
+  rFormula.setLabelCol(rFormula.getLabelCol + "_output")
--- End diff --

Sure, I'll add this logic, incrementing until there is no conflict.





[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

2016-08-31 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13584#discussion_r76940615
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala ---
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.RFormula
+import org.apache.spark.sql.Dataset
+
+object RWrapperUtils extends Logging {
+
+  /**
+   * DataFrame column check.
+   * When loading data, default columns "features" and "label" will be added. And these two names
+   * would conflict with RFormula default feature and label column names.
+   * Here is to change the column name to avoid "column already exists" error.
+   *
+   * @param rFormula RFormula instance
+   * @param data Input dataset
+   * @return Unit
+   */
+  def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
+if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
+  logWarning("data containing 'label' column, so change its name to avoid conflict")
+  rFormula.setLabelCol(rFormula.getLabelCol + "_output")
--- End diff --

what if `something_output` is also already in the DataFrame?
should we check for it?
I thought the earlier discussion calls for appending a sequence number, like `something_output1`, and incrementing until it is not already there?





[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

2016-08-31 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13584#discussion_r76940367
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala ---
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.RFormula
+import org.apache.spark.sql.Dataset
+
+object RWrapperUtils extends Logging {
+
+  /**
+   * DataFrame column check.
+   * When loading data, default columns "features" and "label" will be 
added. And these two names
+   * would conflict with RFormula default feature and label column names.
+   * Here is to change the column name to avoid "column already exists" 
error.
+   *
+   * @param rFormula RFormula instance
+   * @param data Input dataset
+   * @return Unit
+   */
+  def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
+if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
+  logWarning("data containing 'label' column, so change its name to avoid conflict")
--- End diff --

is it possible to include the featurecol name in logging?





[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

2016-08-28 Thread keypointt
Github user keypointt commented on a diff in the pull request:

https://github.com/apache/spark/pull/13584#discussion_r76543182
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
@@ -54,9 +54,6 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul
 intercept[IllegalArgumentException] {
   formula.fit(original)
 }
-intercept[IllegalArgumentException] {
--- End diff --

This is just a duplication of the one above.





[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

2016-08-28 Thread keypointt
Github user keypointt commented on a diff in the pull request:

https://github.com/apache/spark/pull/13584#discussion_r76543133
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/r/RWrapperUtilsSuite.scala ---
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.feature.{RFormula, RFormulaModel}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+
+class RWrapperUtilsSuite extends SparkFunSuite with MLlibTestSparkContext {
+
+  test("avoid column name conflicting") {
+val rFormula = new RFormula().setFormula("label ~ features")
+val data = spark.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt")
--- End diff --

Here I used `"../data/"`. I'm not sure if there is a better way to do it, something like `$current_directory/data/mllib/sample_libsvm_data.txt`?

All I found is like this: `val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")`
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/NaiveBayesExample.scala#L36





[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

2016-06-09 Thread keypointt
GitHub user keypointt opened a pull request:

https://github.com/apache/spark/pull/13584

[SPARK-15509][ML][SparkR] R MLlib algorithms should support input columns "features" and "label"

https://issues.apache.org/jira/browse/SPARK-15509

## What changes were proposed in this pull request?

Currently in SparkR, when you load a LibSVM dataset using the sqlContext and then pass it to an MLlib algorithm, the ML wrappers will fail since they will try to create a "features" column, which conflicts with the existing "features" column from the LibSVM loader. E.g., using the "mnist" dataset from LibSVM:

`training <- loadDF(sqlContext, ".../mnist", "libsvm")`
`model <- naiveBayes(label ~ features, training)`

This fails with:
```
16/05/24 11:52:41 ERROR RBackendHandler: fit on org.apache.spark.ml.r.NaiveBayesWrapper failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  java.lang.IllegalArgumentException: Output column features already exists.
at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:120)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:179)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:131)
at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:169)
at org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:62)
at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.sca
```
The same issue appears for the "label" column once you rename the "features" column.

The cause is that, when using `loadDF()` to generate dataframes, they sometimes come with the default column names `"label"` and `"features"`, and these two names conflict with the default column names `setDefault(labelCol, "label")` and `setDefault(featuresCol, "features")` in `SharedParams.scala`.
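From the Scala side, the wrapper-level workaround under discussion amounts to pointing RFormula's output columns at fresh names before fitting. A rough sketch, assuming a local Spark session and the bundled sample data file (the `_output` names here are illustrative):

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.SparkSession

object ColumnConflictSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    // The libsvm source always yields "label" and "features" columns.
    val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    val formula = new RFormula().setFormula("label ~ features")
    // Without renaming, formula.fit(data) fails with
    // "Output column features already exists."
    // Steering RFormula's output columns to unused names avoids the collision:
    formula.setLabelCol("label_output").setFeaturesCol("features_output")
    val model = formula.fit(data)
    model.transform(data).printSchema()
    spark.stop()
  }
}
```

This is what `checkDataColumns` automates on behalf of the R wrappers, so SparkR users never have to do the renaming themselves.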




## How was this patch tested?

Tested on my local machine.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/keypointt/spark SPARK-15509

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13584.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13584


commit cfed8844cbadbd760f73c2f906a1591806001a93
Author: Xin Ren 
Date:   2016-06-08T23:39:28Z

[SPARK-15509] remove duplicate of intercept[IllegalArgumentException]

commit 77886fe59463027f24c6ca909638731145b46ee2
Author: Xin Ren 
Date:   2016-06-09T20:59:38Z

[SPARK-15509] no column exists error for naivebayes. expand to other wrappers

commit e112ac0c0685f399f72e9ed60be00964ec4fcdc4
Author: Xin Ren 
Date:   2016-06-09T21:04:56Z

[SPARK-15509] add a util function for all wrappers

commit ef3702ee5beefad1ee51fe15cb01e1716aeda362
Author: Xin Ren 
Date:   2016-06-09T22:27:37Z

[SPARK-15509] expand column check to other wrappers

commit aab3a12fe09cf3039708468a80837fa421739c69
Author: Xin Ren 
Date:   2016-06-09T23:05:51Z

[SPARK-15509] add unit test

commit f68ac34907f3a7d1d66e98572ada34d47df3eab9
Author: Xin Ren 
Date:   2016-06-10T00:01:44Z

[SPARK-15509] some clean up

commit c8e30e9452031908fc829e527ab82a8e93598302
Author: Xin Ren 
Date:   2016-06-10T00:45:53Z

[SPARK-15509] fix path

commit 43b2f8c5fb9e0d74579b948b1d52cad4faa76b66
Author: Xin Ren 
Date:   2016-06-10T00:48:36Z

[SPARK-15509] fix path



