[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13584 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/13584#discussion_r77290933 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala --- @@ -35,13 +35,37 @@ object RWrapperUtils extends Logging { */ def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = { if (data.schema.fieldNames.contains(rFormula.getLabelCol)) { - logWarning("data containing 'label' column, so change its name to avoid conflict") - rFormula.setLabelCol(rFormula.getLabelCol + "_output") + val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames) --- End diff -- fair enough. that makes sense, thanks
[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...
Github user keypointt commented on a diff in the pull request: https://github.com/apache/spark/pull/13584#discussion_r77273428 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala --- @@ -35,13 +35,37 @@ object RWrapperUtils extends Logging { */ def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = { if (data.schema.fieldNames.contains(rFormula.getLabelCol)) { - logWarning("data containing 'label' column, so change its name to avoid conflict") - rFormula.setLabelCol(rFormula.getLabelCol + "_output") + val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames) --- End diff -- I think `if (data.schema.fieldNames.contains(rFormula.getFeaturesCol))` is checking `label` only, and in `convertToUniqueName()` `_output` will be appended, resulting in `label_output`: `var newName = originalName + "_output"`; `label_output` is then checked at `while (fieldNames.contains(newName))`
[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...
Github user keypointt commented on a diff in the pull request: https://github.com/apache/spark/pull/13584#discussion_r77272735 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala --- @@ -35,13 +35,37 @@ object RWrapperUtils extends Logging { */ def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = { if (data.schema.fieldNames.contains(rFormula.getLabelCol)) { - logWarning("data containing 'label' column, so change its name to avoid conflict") - rFormula.setLabelCol(rFormula.getLabelCol + "_output") + val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames) --- End diff -- ok, it's a better way
[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/13584#discussion_r77272532 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala --- @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.r + +import org.apache.spark.internal.Logging +import org.apache.spark.ml.feature.RFormula +import org.apache.spark.sql.Dataset + +object RWrapperUtils extends Logging { + + /** + * DataFrame column check. + * When loading data, default columns "features" and "label" will be added. And these two names + * would conflict with RFormula default feature and label column names. + * Here is to change the column name to avoid "column already exists" error. 
 + * + * @param rFormula RFormula instance + * @param data Input dataset + * @return Unit + */ + def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = { +if (data.schema.fieldNames.contains(rFormula.getLabelCol)) { + val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames) + logWarning( +s"data containing ${rFormula.getLabelCol} column, using new name $newLabelName instead") + rFormula.setLabelCol(newLabelName) +} + +if (data.schema.fieldNames.contains(rFormula.getFeaturesCol)) { + val newFeaturesName = convertToUniqueName(rFormula.getFeaturesCol, data.schema.fieldNames) + logWarning( +s"data containing ${rFormula.getFeaturesCol} column, using new name $newFeaturesName") --- End diff -- let's make this consistent with the message above in L40?
[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/13584#discussion_r77272395 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala --- @@ -35,13 +35,37 @@ object RWrapperUtils extends Logging { */ def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = { if (data.schema.fieldNames.contains(rFormula.getLabelCol)) { - logWarning("data containing 'label' column, so change its name to avoid conflict") - rFormula.setLabelCol(rFormula.getLabelCol + "_output") + val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames) --- End diff -- nit: i think we end up checking for `label_output` twice, once in `if (data.schema.fieldNames.contains(rFormula.getFeaturesCol))` and second time within `convertToUniqueName`? Perhaps we merge them?
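One way to read the "merge them" suggestion above is a single helper that both detects the conflict and resolves it, returning `None` when no rename is needed, so each candidate name is scanned against the schema once rather than twice. This is only a sketch of the suggestion in plain Scala, not code from the PR; the helper name `uniqueNameIfNeeded` is hypothetical.

```scala
// Sketch of the "merge the two checks" suggestion: return Some(newName)
// only when the original name actually conflicts with an existing column.
// The name uniqueNameIfNeeded is hypothetical, not from the PR.
object MergedCheckSketch {
  def uniqueNameIfNeeded(name: String, fieldNames: Array[String]): Option[String] = {
    if (!fieldNames.contains(name)) {
      None // no conflict: keep the RFormula default column name
    } else {
      var newName = name + "_output"
      var counter = 1
      while (fieldNames.contains(newName)) {
        newName = name + "_output" + counter
        counter += 1
      }
      Some(newName)
    }
  }
}
```

With this shape, `checkDataColumns` would call `rFormula.setLabelCol` (or `setFeaturesCol`) only when the `Option` is defined, folding the outer `contains` check into the helper.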
[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/13584#discussion_r77272270 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala --- @@ -35,13 +35,37 @@ object RWrapperUtils extends Logging { */ def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = { if (data.schema.fieldNames.contains(rFormula.getLabelCol)) { - logWarning("data containing 'label' column, so change its name to avoid conflict") - rFormula.setLabelCol(rFormula.getLabelCol + "_output") + val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames) + logWarning( +s"data containing ${rFormula.getLabelCol} column, changing its name to $newLabelName") + rFormula.setLabelCol(newLabelName) } if (data.schema.fieldNames.contains(rFormula.getFeaturesCol)) { - logWarning("data containing 'features' column, so change its name to avoid conflict") - rFormula.setFeaturesCol(rFormula.getFeaturesCol + "_output") + val newFeaturesName = convertToUniqueName(rFormula.getFeaturesCol, data.schema.fieldNames) + logWarning( +s"data containing ${rFormula.getFeaturesCol} column, changing its name to $newFeaturesName") --- End diff -- same here?
[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...
Github user keypointt commented on a diff in the pull request: https://github.com/apache/spark/pull/13584#discussion_r77271690 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala --- @@ -35,13 +35,37 @@ object RWrapperUtils extends Logging { */ def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = { if (data.schema.fieldNames.contains(rFormula.getLabelCol)) { - logWarning("data containing 'label' column, so change its name to avoid conflict") - rFormula.setLabelCol(rFormula.getLabelCol + "_output") + val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames) + logWarning( +s"data containing ${rFormula.getLabelCol} column, changing its name to $newLabelName") --- End diff -- sure, I'll change it
[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/13584#discussion_r77271384 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala --- @@ -35,13 +35,37 @@ object RWrapperUtils extends Logging { */ def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = { if (data.schema.fieldNames.contains(rFormula.getLabelCol)) { - logWarning("data containing 'label' column, so change its name to avoid conflict") - rFormula.setLabelCol(rFormula.getLabelCol + "_output") + val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames) + logWarning( +s"data containing ${rFormula.getLabelCol} column, changing its name to $newLabelName") --- End diff -- this sounds a bit like we are renaming the existing `label` column? perhaps just change to `s"data containing ${rFormula.getLabelCol} column, using new name to $newLabelName instead"`?
[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...
Github user keypointt commented on a diff in the pull request: https://github.com/apache/spark/pull/13584#discussion_r77098961 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala --- @@ -0,0 +1,47 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.r + +import org.apache.spark.internal.Logging +import org.apache.spark.ml.feature.RFormula +import org.apache.spark.sql.Dataset + +object RWrapperUtils extends Logging { + + /** + * DataFrame column check. + * When loading data, default columns "features" and "label" will be added. And these two names + * would conflict with RFormula default feature and label column names. + * Here is to change the column name to avoid "column already exists" error. 
 + * + * @param rFormula RFormula instance + * @param data Input dataset + * @return Unit + */ + def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = { +if (data.schema.fieldNames.contains(rFormula.getLabelCol)) { + logWarning("data containing 'label' column, so change its name to avoid conflict") + rFormula.setLabelCol(rFormula.getLabelCol + "_output") --- End diff -- sure I'll add this logic, incrementing until no conflict
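The incrementing logic agreed on above can be sketched in plain Scala. The helper name `convertToUniqueName` matches the PR, but the body here is a hedged reconstruction from the fragments quoted elsewhere in this thread (an `_output` suffix, then an incrementing counter), not the exact merged code.

```scala
// Hedged sketch of the uniquifying helper discussed in this thread.
// The name convertToUniqueName follows the PR; the exact body is an
// assumption reconstructed from the quoted fragments.
object RWrapperUtilsSketch {
  /** Append "_output", then an incrementing counter, until the name is free. */
  def convertToUniqueName(originalName: String, fieldNames: Array[String]): String = {
    var newName = originalName + "_output"
    var counter = 1
    while (fieldNames.contains(newName)) {
      newName = originalName + "_output" + counter
      counter += 1
    }
    newName
  }
}
```

For example, against a schema that already contains both `label` and `label_output`, this sketch yields `label_output1`.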
[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/13584#discussion_r76940615 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala --- @@ -0,0 +1,47 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.r + +import org.apache.spark.internal.Logging +import org.apache.spark.ml.feature.RFormula +import org.apache.spark.sql.Dataset + +object RWrapperUtils extends Logging { + + /** + * DataFrame column check. + * When loading data, default columns "features" and "label" will be added. And these two names + * would conflict with RFormula default feature and label column names. + * Here is to change the column name to avoid "column already exists" error. + * + * @param rFormula RFormula instance + * @param data Input dataset + * @return Unit + */ + def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = { +if (data.schema.fieldNames.contains(rFormula.getLabelCol)) { + logWarning("data containing 'label' column, so change its name to avoid conflict") + rFormula.setLabelCol(rFormula.getLabelCol + "_output") --- End diff -- what if `something_output` is also already in the DataFrame? should we check for it? 
I thought the earlier discussion calls for appending a sequence number, like `something_output1`, and incrementing until it is not already there?
[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/13584#discussion_r76940367 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala --- @@ -0,0 +1,47 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.r + +import org.apache.spark.internal.Logging +import org.apache.spark.ml.feature.RFormula +import org.apache.spark.sql.Dataset + +object RWrapperUtils extends Logging { + + /** + * DataFrame column check. + * When loading data, default columns "features" and "label" will be added. And these two names + * would conflict with RFormula default feature and label column names. + * Here is to change the column name to avoid "column already exists" error. + * + * @param rFormula RFormula instance + * @param data Input dataset + * @return Unit + */ + def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = { +if (data.schema.fieldNames.contains(rFormula.getLabelCol)) { + logWarning("data containing 'label' column, so change its name to avoid conflict") --- End diff -- is it possible to include the featurecol name in logging? 
[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...
Github user keypointt commented on a diff in the pull request: https://github.com/apache/spark/pull/13584#discussion_r76543182 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -54,9 +54,6 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul intercept[IllegalArgumentException] { formula.fit(original) } -intercept[IllegalArgumentException] { --- End diff -- here is just a duplication of above
[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...
Github user keypointt commented on a diff in the pull request: https://github.com/apache/spark/pull/13584#discussion_r76543133 --- Diff: mllib/src/test/scala/org/apache/spark/ml/r/RWrapperUtilsSuite.scala --- @@ -0,0 +1,47 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.r + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.feature.{RFormula, RFormulaModel} +import org.apache.spark.mllib.util.MLlibTestSparkContext + +class RWrapperUtilsSuite extends SparkFunSuite with MLlibTestSparkContext { + + test("avoid column name conflicting") { +val rFormula = new RFormula().setFormula("label ~ features") +val data = spark.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt") --- End diff -- Here I used `"../data/"`, I'm not sure if there is a better way to do it, something like `$current_directory/data/mllib/sample_libsvm_data.txt`? 
All I found is like this `val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")` https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/NaiveBayesExample.scala#L36
[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...
GitHub user keypointt opened a pull request: https://github.com/apache/spark/pull/13584 [SPARK-15509][ML][SparkR] R MLlib algorithms should support input columns "features" and "label" https://issues.apache.org/jira/browse/SPARK-15509

## What changes were proposed in this pull request?

Currently in SparkR, when you load a LibSVM dataset using the sqlContext and then pass it to an MLlib algorithm, the ML wrappers will fail since they will try to create a "features" column, which conflicts with the existing "features" column from the LibSVM loader. E.g., using the "mnist" dataset from LibSVM:

`training <- loadDF(sqlContext, ".../mnist", "libsvm")`
`model <- naiveBayes(label ~ features, training)`

This fails with:

```
16/05/24 11:52:41 ERROR RBackendHandler: fit on org.apache.spark.ml.r.NaiveBayesWrapper failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  java.lang.IllegalArgumentException: Output column features already exists.
  at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:120)
  at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
  at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
  at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
  at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
  at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
  at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:179)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67)
  at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:131)
  at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:169)
  at org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:62)
  at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.sca
```

The same issue appears for the "label" column once you rename the "features" column.

The cause is that when using `loadDF()` to generate DataFrames, they sometimes come with the default column names `label` and `features`, and these two names conflict with the default column names `setDefault(labelCol, "label")` and `setDefault(featuresCol, "features")` in `SharedParams.scala`.

## How was this patch tested?

Tested on my local machine.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/keypointt/spark SPARK-15509

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13584.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13584

commit cfed8844cbadbd760f73c2f906a1591806001a93 Author: Xin Ren Date: 2016-06-08T23:39:28Z [SPARK-15509] remove duplicate of intercept[IllegalArgumentException]
commit 77886fe59463027f24c6ca909638731145b46ee2 Author: Xin Ren Date: 2016-06-09T20:59:38Z [SPARK-15509] no column exists error for naivebayes. expand to other wrappers
commit e112ac0c0685f399f72e9ed60be00964ec4fcdc4 Author: Xin Ren Date: 2016-06-09T21:04:56Z [SPARK-15509] add a util function for all wrappers
commit ef3702ee5beefad1ee51fe15cb01e1716aeda362 Author: Xin Ren Date: 2016-06-09T22:27:37Z [SPARK-15509] expand column check to other wrappers
commit aab3a12fe09cf3039708468a80837fa421739c69 Author: Xin Ren Date: 2016-06-09T23:05:51Z [SPARK-15509] add unit test
commit f68ac34907f3a7d1d66e98572ada34d47df3eab9 Author: Xin Ren Date: 2016-06-10T00:01:44Z [SPARK-15509] some clean up
commit c8e30e9452031908fc829e527ab82a8e93598302 Author: Xin Ren Date: 2016-06-10T00:45:53Z [SPARK-15509] fix path
commit 43b2f8c5fb9e0d74579b948b1d52cad4faa76b66 Author: Xin Ren Date: 2016-06-10T00:48:36Z [SPARK-15509] fix path