Hi Marco,

Do not call any individual fit/transform yourself; you only need to call `pipeline.fit` / `pipelineModel.transform`, like the following:
val assembler = new VectorAssembler()
  .setInputCols(inputData.columns.filter(_ != "Severity"))
  .setOutputCol("features")

val labelIndexer = new StringIndexer()
  .setInputCol("Severity")
  .setOutputCol("indexedLabel")

val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(5) // features with > 4 distinct values are treated as continuous.

val Array(trainingData, testData) = inputData.randomSplit(Array(0.8, 0.2))

// Train a DecisionTree model.
val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")

// Convert indexed labels back to original labels. IndexToString reads the
// labels from the prediction column's metadata, so setLabels is not needed
// here (labelIndexer is not fitted yet, so it has no labels to pass anyway).
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")

// Chain assembler, indexers and tree in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(assembler, labelIndexer, featureIndexer, dt, labelConverter))

trainingData.cache()
testData.cache()

// Train model. This also runs the assembler and the indexers.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

Thanks.

On Wed, Dec 20, 2017 at 5:26 AM, Marco Mistroni <mmistr...@gmail.com> wrote:

> Hello Weichen
> I will try it out and let you know.
> But if I add the assembler to the pipeline, do I still have to call
> assembler.transform and XXXIndexer.fit()?
> kind regards
> Marco
>
> On Tue, Dec 19, 2017 at 2:45 AM, Weichen Xu <weichen...@databricks.com> wrote:
>
>> Hi Marco,
>>
>> If you add the assembler at the start of the pipeline, like:
>> ```
>> val pipeline = new Pipeline()
>>   .setStages(Array(assembler, labelIndexer, featureIndexer, dt, labelConverter))
>> ```
>>
>> which error did you get?
>>
>> I think it works fine if the `assembler` is added to the pipeline.
>>
>> Thanks.
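As an aside: once every stage lives in the Pipeline, the fitted pieces can still be pulled back out of the PipelineModel when you need them individually. A rough sketch against the Spark 2.x ML API (`model` here is assumed to be the fitted PipelineModel from the code above):

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.feature.StringIndexerModel

// model.stages holds the *fitted* transformers, in the same order as the
// setStages array: assembler, labelIndexer, featureIndexer, dt, labelConverter.
val fitted: PipelineModel = model

val labelModel = fitted.stages(1).asInstanceOf[StringIndexerModel]
println(labelModel.labels.mkString(", ")) // the original Severity values

val treeModel = fitted.stages(3).asInstanceOf[DecisionTreeClassificationModel]
println(treeModel.toDebugString) // the learned tree, as text
```

Because the indexes mirror the setStages array, adding or removing a stage changes them; matching on the stage type is more robust than hard-coded positions.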
>>
>> On Tue, Dec 19, 2017 at 6:08 AM, Marco Mistroni <mmistr...@gmail.com> wrote:
>>
>>> Hello Weichen
>>> Sorry to bother you again with my ML issue... but I feel you have more
>>> experience than I do in this, and perhaps you can tell me whether I am
>>> following the correct steps, as I seem to get confused by the different
>>> examples on Decision Trees.
>>>
>>> So, as a starting point I have this dataframe:
>>>
>>> [BI-RADS, Age, Shape, Margin, Density, Severity]
>>>
>>> The label is 'Severity' and all the others are features.
>>> I am following these steps, and I was wondering if you can advise whether
>>> I am doing the correct thing, as I am unable to add the assembler at the
>>> beginning of the pipeline, resorting instead to the following code
>>> (inputData is the original DataFrame):
>>>
>>> val assembler = new VectorAssembler()
>>>   .setInputCols(inputData.columns.filter(_ != "Severity"))
>>>   .setOutputCol("features")
>>>
>>> val data = assembler.transform(inputData)
>>>
>>> val labelIndexer = new StringIndexer()
>>>   .setInputCol("Severity")
>>>   .setOutputCol("indexedLabel")
>>>   .fit(data)
>>>
>>> val featureIndexer = new VectorIndexer()
>>>   .setInputCol("features")
>>>   .setOutputCol("indexedFeatures")
>>>   .setMaxCategories(5) // features with > 4 distinct values are treated as continuous.
>>>   .fit(data)
>>>
>>> val Array(trainingData, testData) = data.randomSplit(Array(0.8, 0.2))
>>>
>>> // Train a DecisionTree model.
>>> val dt = new DecisionTreeClassifier()
>>>   .setLabelCol("indexedLabel")
>>>   .setFeaturesCol("indexedFeatures")
>>>
>>> // Convert indexed labels back to original labels.
>>> val labelConverter = new IndexToString()
>>>   .setInputCol("prediction")
>>>   .setOutputCol("predictedLabel")
>>>   .setLabels(labelIndexer.labels)
>>>
>>> // Chain indexers and tree in a Pipeline.
>>> val pipeline = new Pipeline()
>>>   .setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))
>>>
>>> trainingData.cache()
>>> testData.cache()
>>>
>>> // Train model. This also runs the indexers.
>>> val model = pipeline.fit(trainingData)
>>>
>>> // Make predictions.
>>> val predictions = model.transform(testData)
>>>
>>> // Select example rows to display.
>>> predictions.select("predictedLabel", "indexedLabel", "indexedFeatures").show(5)
>>>
>>> // Select (prediction, true label) and compute test error.
>>> val evaluator = new MulticlassClassificationEvaluator()
>>>   .setLabelCol("indexedLabel")
>>>   .setPredictionCol("prediction")
>>>   .setMetricName("accuracy")
>>> val accuracy = evaluator.evaluate(predictions)
>>> println("Test Error = " + (1.0 - accuracy))
>>>
>>> Could you advise if this is the proper way to proceed when using an
>>> Assembler?
>>> I was unable to add the Assembler at the beginning of the pipeline... it
>>> seems it didn't get invoked, as at the moment of calling the featureIndexer
>>> the column 'features' was not found.
>>>
>>> This is not urgent; I'll appreciate it if you can give me your comments.
>>> kind regards
>>> marco
>>>
>>> On Sun, Dec 17, 2017 at 2:48 AM, Weichen Xu <weichen...@databricks.com> wrote:
>>>
>>>> Hi Marco,
>>>>
>>>> Yes, you can apply `VectorAssembler` first in the pipeline to assemble
>>>> multiple feature columns.
>>>>
>>>> Thanks.
>>>>
>>>> On Sun, Dec 17, 2017 at 6:33 AM, Marco Mistroni <mmistr...@gmail.com> wrote:
>>>>
>>>>> Hello Wei
>>>>> Thanks, I should have checked the data.
>>>>> My data has this format:
>>>>>
>>>>> |col1|col2|col3|label|
>>>>>
>>>>> so it looks like I cannot use VectorIndexer directly (it accepts a
>>>>> Vector column).
>>>>> I am guessing that what I should do is something like this (given I
>>>>> have a few categorical features):
>>>>>
>>>>> val assembler = new VectorAssembler()
>>>>>   .setInputCols(inputData.columns.filter(_ != "Label"))
>>>>>   .setOutputCol("features")
>>>>>
>>>>> val transformedData = assembler.transform(inputData)
>>>>>
>>>>> val featureIndexer = new VectorIndexer()
>>>>>   .setInputCol("features")
>>>>>   .setOutputCol("indexedFeatures")
>>>>>   .setMaxCategories(5) // features with > 4 distinct values are treated as continuous.
>>>>>   .fit(transformedData)
>>>>>
>>>>> ?
>>>>> Apologies for the basic question, but the last time I worked on an ML
>>>>> project I was using Spark 1.x.
>>>>>
>>>>> kr
>>>>> marco
>>>>>
>>>>> On Dec 16, 2017 1:24 PM, "Weichen Xu" <weichen...@databricks.com> wrote:
>>>>>
>>>>>> Hi, Marco,
>>>>>>
>>>>>> val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
>>>>>>
>>>>>> The data now includes a feature column with the name "features":
>>>>>>
>>>>>> val featureIndexer = new VectorIndexer()
>>>>>>   .setInputCol("features") // <------ Here specify the "features" column to index.
>>>>>>   .setOutputCol("indexedFeatures")
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> On Sat, Dec 16, 2017 at 6:26 AM, Marco Mistroni <mmistr...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>> I am trying to run a sample decision tree, following the examples here
>>>>>>> (for MLlib):
>>>>>>>
>>>>>>> https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier
>>>>>>>
>>>>>>> The example seems to use a VectorIndexer; however, I am missing
>>>>>>> something.
>>>>>>> How does the featureIndexer know which columns are features?
>>>>>>> Isn't there something missing? Or is the featureIndexer able to figure
>>>>>>> out by itself which columns of the DataFrame are features?
>>>>>>>
>>>>>>> val labelIndexer = new StringIndexer()
>>>>>>>   .setInputCol("label")
>>>>>>>   .setOutputCol("indexedLabel")
>>>>>>>   .fit(data)
>>>>>>>
>>>>>>> // Automatically identify categorical features, and index them.
>>>>>>> val featureIndexer = new VectorIndexer()
>>>>>>>   .setInputCol("features")
>>>>>>>   .setOutputCol("indexedFeatures")
>>>>>>>   .setMaxCategories(4) // features with > 4 distinct values are treated as continuous.
>>>>>>>   .fit(data)
>>>>>>>
>>>>>>> Using this code I am getting back this exception:
>>>>>>>
>>>>>>> Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
>>>>>>>   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
>>>>>>>   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
>>>>>>>   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>>>>>>>   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
>>>>>>>   at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
>>>>>>>   at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
>>>>>>>   at org.apache.spark.ml.feature.VectorIndexer.transformSchema(VectorIndexer.scala:141)
>>>>>>>   at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
>>>>>>>   at org.apache.spark.ml.feature.VectorIndexer.fit(VectorIndexer.scala:118)
>>>>>>>
>>>>>>> What am I missing?
>>>>>>>
>>>>>>> w/kindest regards
>>>>>>>
>>>>>>> marco
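Pulling the whole thread together: a compact end-to-end sketch of the assembler-in-pipeline approach, assuming Spark 2.x and a DataFrame `inputData` with the schema described above (numeric feature columns plus a "Severity" label). The column names come from the messages in this thread; everything else is illustrative, not a definitive implementation:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler, VectorIndexer}

// Every column except the label becomes part of the "features" vector.
val assembler = new VectorAssembler()
  .setInputCols(inputData.columns.filter(_ != "Severity"))
  .setOutputCol("features")
val labelIndexer = new StringIndexer()
  .setInputCol("Severity").setOutputCol("indexedLabel")
val featureIndexer = new VectorIndexer()
  .setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(5)
val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures")
// IndexToString picks the labels up from the prediction column's metadata.
val labelConverter = new IndexToString()
  .setInputCol("prediction").setOutputCol("predictedLabel")

// No manual fit/transform calls: pipeline.fit runs each stage in order,
// so "features" exists by the time featureIndexer needs it.
val pipeline = new Pipeline()
  .setStages(Array(assembler, labelIndexer, featureIndexer, dt, labelConverter))

val Array(trainingData, testData) = inputData.randomSplit(Array(0.8, 0.2))
val model = pipeline.fit(trainingData)
val predictions = model.transform(testData)

val accuracy = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel").setPredictionCol("prediction")
  .setMetricName("accuracy")
  .evaluate(predictions)
println(s"Test Error = ${1.0 - accuracy}")
```

Note that splitting happens on the raw `inputData`, so the indexers are fitted only on training rows; the fitted PipelineModel then applies the same transformations to the test rows.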