Hi Marco,

Yes, you can apply `VectorAssembler` first in the pipeline to assemble
multiple feature columns into a single vector column.
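
For example, a minimal sketch (assuming your DataFrame is called
`inputData` and has columns col1, col2, col3 plus a "label" column, as in
your example — adjust the names to your schema):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}

// Assemble the raw feature columns into a single vector column "features".
val assembler = new VectorAssembler()
  .setInputCols(Array("col1", "col2", "col3"))
  .setOutputCol("features")

// Index categorical features inside the assembled vector.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(5) // features with > 5 distinct values are treated as continuous

// Chaining both stages in a Pipeline fits them in order, so VectorIndexer
// sees the "features" column that VectorAssembler produces.
val pipeline = new Pipeline().setStages(Array(assembler, featureIndexer))
val model = pipeline.fit(inputData)
val indexed = model.transform(inputData) // has "features" and "indexedFeatures"
```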

Thanks.

On Sun, Dec 17, 2017 at 6:33 AM, Marco Mistroni <mmistr...@gmail.com> wrote:

> Hello Wei
>  Thanks, I should have checked the data.
> My data has this format
> |col1|col2|col3|label|
>
> so it looks like I cannot use VectorIndexer directly (it accepts a Vector
> column).
> I am guessing what I should do is something like this (given I have a few
> categorical features):
>
> val assembler = new VectorAssembler()
>   .setInputCols(inputData.columns.filter(_ != "label"))
>   .setOutputCol("features")
>
> val transformedData = assembler.transform(inputData)
>
>
> val featureIndexer = new VectorIndexer()
>   .setInputCol("features")
>   .setOutputCol("indexedFeatures")
>   .setMaxCategories(5) // features with > 5 distinct values are treated as continuous
>   .fit(transformedData)
>
> ?
> Apologies for the basic question, but the last time I worked on an ML
> project I was using Spark 1.x
>
> kr
>  marco
>
> On Dec 16, 2017 1:24 PM, "Weichen Xu" <weichen...@databricks.com> wrote:
>
>> Hi, Marco,
>>
>> val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
>>
>> The loaded DataFrame now includes a feature vector column named "features",
>>
>> val featureIndexer = new VectorIndexer()
>>   .setInputCol("features")   // <------ here, specify the "features" column to index
>>   .setOutputCol("indexedFeatures")
>>
>>
>> Thanks.
>>
>>
>> On Sat, Dec 16, 2017 at 6:26 AM, Marco Mistroni <mmistr...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>  I am trying to run a sample decision tree, following the examples here
>>> (for MLlib):
>>>
>>> https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier
>>>
>>> The example seems to use a VectorIndexer, however I am missing
>>> something.
>>> How does the featureIndexer know which columns are features?
>>> Isn't there something missing? Or is the featureIndexer able to figure
>>> out by itself which columns of the DataFrame are features?
>>>
>>> val labelIndexer = new StringIndexer()
>>>   .setInputCol("label")
>>>   .setOutputCol("indexedLabel")
>>>   .fit(data)
>>>
>>> // Automatically identify categorical features, and index them.
>>> val featureIndexer = new VectorIndexer()
>>>   .setInputCol("features")
>>>   .setOutputCol("indexedFeatures")
>>>   .setMaxCategories(4) // features with > 4 distinct values are treated as continuous.
>>>   .fit(data)
>>>
>>> Using this code I am getting back this exception:
>>>
>>> Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
>>>         at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
>>>         at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
>>>         at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>>>         at scala.collection.AbstractMap.getOrElse(Map.scala:59)
>>>         at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
>>>         at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
>>>         at org.apache.spark.ml.feature.VectorIndexer.transformSchema(VectorIndexer.scala:141)
>>>         at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
>>>         at org.apache.spark.ml.feature.VectorIndexer.fit(VectorIndexer.scala:118)
>>>
>>> What am I missing?
>>>
>>> w/kindest regards
>>>
>>>  marco
>>>
>>>
>>
