Hi Marco,

Do not call any individual fit/transform yourself; you only need to call `pipeline.fit` / `pipelineModel.transform`, like the following:
val assembler = new VectorAssembler()
  .setInputCols(inputData.columns.filter(_ != "Severity"))
  .setOutputCol("features")

val labelIndexer = new StringIndexer()
  .setInputCol("Severity")
  .setOutputCol("indexedLabel")

val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(5) // features with > 4 distinct values are treated as continuous.

val Array(trainingData, testData) = inputData.randomSplit(Array(0.8, 0.2))

// Train a DecisionTree model.
val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")

// Convert indexed labels back to original labels. IndexToString reads the
// labels from the prediction column's metadata, so setLabels is not needed
// here (labelIndexer is not fitted yet, so it has no labels to pass anyway).
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")

// Chain assembler, indexers and tree in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(assembler, labelIndexer, featureIndexer, dt, labelConverter))

trainingData.cache()
testData.cache()

// Train model. This also runs the assembler and the indexers.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

Thanks.

On Wed, Dec 20, 2017 at 5:26 AM, Marco Mistroni <mmistr...@gmail.com> wrote:

> Hello Weichen
> I will try it out and let you know.
> But if I add the assembler to the pipeline, do I still have to call
> assembler.transform and XXXIndexer.fit()?
> kind regards
> Marco
>
> On Tue, Dec 19, 2017 at 2:45 AM, Weichen Xu <weichen...@databricks.com> wrote:
>
>> Hi Marco,
>>
>> If you add the assembler at the start of the pipeline, like:
>> ```
>> val pipeline = new Pipeline()
>>   .setStages(Array(assembler, labelIndexer, featureIndexer, dt, labelConverter))
>> ```
>>
>> which error did you get?
>>
>> I think it works fine if the `assembler` is added to the pipeline.
>>
>> Thanks.
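As an aside: once every stage lives in the Pipeline, the fitted pieces can still be pulled back out of the PipelineModel when you need them individually. A rough sketch against the Spark 2.x ML API (`model` here is assumed to be the fitted PipelineModel from the code above):

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.feature.StringIndexerModel

// model.stages holds the *fitted* transformers, in the same order as the
// setStages array: assembler, labelIndexer, featureIndexer, dt, labelConverter.
val fitted: PipelineModel = model

val labelModel = fitted.stages(1).asInstanceOf[StringIndexerModel]
println(labelModel.labels.mkString(", ")) // the original Severity values

val treeModel = fitted.stages(3).asInstanceOf[DecisionTreeClassificationModel]
println(treeModel.toDebugString) // the learned tree, as text
```

Because the indexes mirror the setStages array, adding or removing a stage changes them; matching on the stage type is more robust than hard-coded positions.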
>>
>> On Tue, Dec 19, 2017 at 6:08 AM, Marco Mistroni <mmistr...@gmail.com> wrote:
>>
>>> Hello Weichen
>>> Sorry to bother you again with my ML issue... but I feel you have more
>>> experience than I do in this, and perhaps you can tell me whether I am
>>> following the correct steps, as I seem to get confused by the different
>>> examples on Decision Trees.
>>>
>>> So, as a starting point I have this dataframe:
>>>
>>> [BI-RADS, Age, Shape, Margin, Density, Severity]
>>>
>>> The label is 'Severity' and all the others are features.
>>> I am following these steps, and I was wondering if you can advise whether
>>> I am doing the correct thing, as I am unable to add the assembler at the
>>> beginning of the pipeline, resorting instead to the following code
>>> (inputData is the original DataFrame):
>>>
>>> val assembler = new VectorAssembler()
>>>   .setInputCols(inputData.columns.filter(_ != "Severity"))
>>>   .setOutputCol("features")
>>>
>>> val data = assembler.transform(inputData)
>>>
>>> val labelIndexer = new StringIndexer()
>>>   .setInputCol("Severity")
>>>   .setOutputCol("indexedLabel")
>>>   .fit(data)
>>>
>>> val featureIndexer = new VectorIndexer()
>>>   .setInputCol("features")
>>>   .setOutputCol("indexedFeatures")
>>>   .setMaxCategories(5) // features with > 4 distinct values are treated as continuous.
>>>   .fit(data)
>>>
>>> val Array(trainingData, testData) = data.randomSplit(Array(0.8, 0.2))
>>>
>>> // Train a DecisionTree model.
>>> val dt = new DecisionTreeClassifier()
>>>   .setLabelCol("indexedLabel")
>>>   .setFeaturesCol("indexedFeatures")
>>>
>>> // Convert indexed labels back to original labels.
>>> val labelConverter = new IndexToString()
>>>   .setInputCol("prediction")
>>>   .setOutputCol("predictedLabel")
>>>   .setLabels(labelIndexer.labels)
>>>
>>> // Chain indexers and tree in a Pipeline.
>>> val pipeline = new Pipeline()
>>>   .setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))
>>>
>>> trainingData.cache()
>>> testData.cache()
>>>
>>> // Train model. This also runs the indexers.
>>> val model = pipeline.fit(trainingData)
>>>
>>> // Make predictions.
>>> val predictions = model.transform(testData)
>>>
>>> // Select example rows to display.
>>> predictions.select("predictedLabel", "indexedLabel", "indexedFeatures").show(5)
>>>
>>> // Select (prediction, true label) and compute test error.
>>> val evaluator = new MulticlassClassificationEvaluator()
>>>   .setLabelCol("indexedLabel")
>>>   .setPredictionCol("prediction")
>>>   .setMetricName("accuracy")
>>> val accuracy = evaluator.evaluate(predictions)
>>> println("Test Error = " + (1.0 - accuracy))
>>>
>>> Could you advise if this is the proper way to proceed when using an
>>> Assembler?
>>> I was unable to add the Assembler at the beginning of the pipeline... it
>>> seems it didn't get invoked, as at the moment of calling the featureIndexer
>>> the column 'features' was not found.
>>>
>>> This is not urgent; I'll appreciate it if you can give me your comments.
>>> kind regards
>>> marco
>>>
>>> On Sun, Dec 17, 2017 at 2:48 AM, Weichen Xu <weichen...@databricks.com> wrote:
>>>
>>>> Hi Marco,
>>>>
>>>> Yes, you can apply `VectorAssembler` first in the pipeline to assemble
>>>> multiple feature columns.
>>>>
>>>> Thanks.
>>>>
>>>> On Sun, Dec 17, 2017 at 6:33 AM, Marco Mistroni <mmistr...@gmail.com> wrote:
>>>>
>>>>> Hello Wei
>>>>> Thanks, I should have checked the data.
>>>>> My data has this format:
>>>>>
>>>>> |col1|col2|col3|label|
>>>>>
>>>>> so it looks like I cannot use VectorIndexer directly (it accepts a
>>>>> Vector column).
>>>>> I am guessing that what I should do is something like this (given I
>>>>> have a few categorical features):
>>>>>
>>>>> val assembler = new VectorAssembler()
>>>>>   .setInputCols(inputData.columns.filter(_ != "Label"))
>>>>>   .setOutputCol("features")
>>>>>
>>>>> val transformedData = assembler.transform(inputData)
>>>>>
>>>>> val featureIndexer = new VectorIndexer()
>>>>>   .setInputCol("features")
>>>>>   .setOutputCol("indexedFeatures")
>>>>>   .setMaxCategories(5) // features with > 4 distinct values are treated as continuous.
>>>>>   .fit(transformedData)
>>>>>
>>>>> ?
>>>>> Apologies for the basic question, but the last time I worked on an ML
>>>>> project I was using Spark 1.x.
>>>>>
>>>>> kr
>>>>> marco
>>>>>
>>>>> On Dec 16, 2017 1:24 PM, "Weichen Xu" <weichen...@databricks.com> wrote:
>>>>>
>>>>>> Hi, Marco,
>>>>>>
>>>>>> val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
>>>>>>
>>>>>> The data now includes a feature column with the name "features":
>>>>>>
>>>>>> val featureIndexer = new VectorIndexer()
>>>>>>   .setInputCol("features") // <------ Here specify the "features" column to index.
>>>>>>   .setOutputCol("indexedFeatures")
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> On Sat, Dec 16, 2017 at 6:26 AM, Marco Mistroni <mmistr...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>> I am trying to run a sample decision tree, following the examples here
>>>>>>> (for MLlib):
>>>>>>>
>>>>>>> https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier
>>>>>>>
>>>>>>> The example seems to use a VectorIndexer; however, I am missing
>>>>>>> something.
>>>>>>> How does the featureIndexer know which columns are features?
>>>>>>> Isn't there something missing? Or is the featureIndexer able to figure
>>>>>>> out by itself which columns of the DataFrame are features?
>>>>>>>
>>>>>>> val labelIndexer = new StringIndexer()
>>>>>>>   .setInputCol("label")
>>>>>>>   .setOutputCol("indexedLabel")
>>>>>>>   .fit(data)
>>>>>>>
>>>>>>> // Automatically identify categorical features, and index them.
>>>>>>> val featureIndexer = new VectorIndexer()
>>>>>>>   .setInputCol("features")
>>>>>>>   .setOutputCol("indexedFeatures")
>>>>>>>   .setMaxCategories(4) // features with > 4 distinct values are treated as continuous.
>>>>>>>   .fit(data)
>>>>>>>
>>>>>>> Using this code I am getting back this exception:
>>>>>>>
>>>>>>> Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
>>>>>>>   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
>>>>>>>   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
>>>>>>>   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>>>>>>>   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
>>>>>>>   at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
>>>>>>>   at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
>>>>>>>   at org.apache.spark.ml.feature.VectorIndexer.transformSchema(VectorIndexer.scala:141)
>>>>>>>   at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
>>>>>>>   at org.apache.spark.ml.feature.VectorIndexer.fit(VectorIndexer.scala:118)
>>>>>>>
>>>>>>> What am I missing?
>>>>>>>
>>>>>>> w/kindest regards
>>>>>>>
>>>>>>> marco
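Pulling the whole thread together: a compact end-to-end sketch of the assembler-in-pipeline approach, assuming Spark 2.x and a DataFrame `inputData` with the schema described above (numeric feature columns plus a "Severity" label). The column names come from the messages in this thread; everything else is illustrative, not a definitive implementation:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler, VectorIndexer}

// Every column except the label becomes part of the "features" vector.
val assembler = new VectorAssembler()
  .setInputCols(inputData.columns.filter(_ != "Severity"))
  .setOutputCol("features")
val labelIndexer = new StringIndexer()
  .setInputCol("Severity").setOutputCol("indexedLabel")
val featureIndexer = new VectorIndexer()
  .setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(5)
val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures")
// IndexToString picks the labels up from the prediction column's metadata.
val labelConverter = new IndexToString()
  .setInputCol("prediction").setOutputCol("predictedLabel")

// No manual fit/transform calls: pipeline.fit runs each stage in order,
// so "features" exists by the time featureIndexer needs it.
val pipeline = new Pipeline()
  .setStages(Array(assembler, labelIndexer, featureIndexer, dt, labelConverter))

val Array(trainingData, testData) = inputData.randomSplit(Array(0.8, 0.2))
val model = pipeline.fit(trainingData)
val predictions = model.transform(testData)

val accuracy = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel").setPredictionCol("prediction")
  .setMetricName("accuracy")
  .evaluate(predictions)
println(s"Test Error = ${1.0 - accuracy}")
```

Note that splitting happens on the raw `inputData`, so the indexers are fitted only on training rows; the fitted PipelineModel then applies the same transformations to the test rows.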