many thanks Sean!
kr
 marco

On Wed, Sep 14, 2016 at 10:33 PM, Sean Owen <[email protected]> wrote:

> If it helps, I've already updated that code for the 2nd edition, which
> will be based on ~Spark 2.1:
>
> https://github.com/sryza/aas/blob/master/ch04-rdf/src/main/scala/com/cloudera/datascience/rdf/RunRDF.scala#L220
>
> This should be an equivalent working example that deals with
> categoricals via VectorIndexer.
>
> You're right that you must use it because it adds the metadata that
> says it's categorical. I'm not sure of another way to do it?
>
> Sean
>
>
> On Wed, Sep 14, 2016 at 10:18 PM, Marco Mistroni <[email protected]>
> wrote:
> > hi all
> >  i have been toying around with this well known RandomForestExample code
> >
> > val forest = RandomForest.trainClassifier(
> >   trainData, 7, Map(10 -> 4, 11 -> 40), 20,
> >   "auto", "entropy", 30, 300)
> >
> > This comes from this link
> > (https://www.safaribooksonline.com/library/view/advanced-analytics-with/9781491912751/ch04.html),
> > and also Sean Owen's presentation
> >
> > (https://www.youtube.com/watch?v=ObiCMJ24ezs)
> >
> >
> >
> > and now I want to migrate it to use the ML libraries.
> > The problem I have is that the MLlib example has categorical features, and
> > I cannot find a way to use categorical features with ML.
> > Apparently I should use VectorIndexer, but VectorIndexer assumes only one
> > input column for features.
> > I am at the moment using VectorAssembler instead, but I cannot find a way
> > to achieve the same.
> > I have checked the Spark samples, but all I can see is RandomForestClassifier
> > using VectorIndexer for 1 feature
> >
> >
> >
> > Could anyone assist?
> > This is my current code. What do I need to add to take into account
> > categorical features?
> >
> > import org.apache.spark.ml.Pipeline
> > import org.apache.spark.ml.classification.RandomForestClassifier
> > import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}
> >
> > // Index the string label in Col0
> > val labelIndexer = new StringIndexer()
> >   .setInputCol("Col0")
> >   .setOutputCol("indexedLabel")
> >   .fit(data)
> >
> > // Combine the ten feature columns into a single vector column
> > val features = new VectorAssembler()
> >   .setInputCols(Array(
> >     "Col1", "Col2", "Col3", "Col4", "Col5",
> >     "Col6", "Col7", "Col8", "Col9", "Col10"))
> >   .setOutputCol("features")
> >
> > // Map predicted indices back to the original label strings
> > val labelConverter = new IndexToString()
> >   .setInputCol("prediction")
> >   .setOutputCol("predictedLabel")
> >   .setLabels(labelIndexer.labels)
> >
> > val rf = new RandomForestClassifier()
> >   .setLabelCol("indexedLabel")
> >   .setFeaturesCol("features")
> >   .setNumTrees(20)
> >   .setMaxDepth(30)
> >   .setMaxBins(300)
> >   .setImpurity("entropy")
> >
> > println("Kicking off pipeline..")
> >
> > val pipeline = new Pipeline()
> >   .setStages(Array(labelIndexer, features, rf, labelConverter))
> >
> > thanks in advance and regards
> >  Marco
> >
>
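
For readers following the thread, here is a minimal sketch of the approach Sean describes. VectorAssembler first combines the input columns into one vector column, and VectorIndexer then runs on that single column and attaches the categorical metadata that the tree learner reads. This is not the code at the RunRDF.scala link above, only an illustration. Assumptions that are not in the original messages: the toy DataFrame and its two feature columns, the intermediate column name "rawFeatures", and maxCategories = 4.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer,
  VectorAssembler, VectorIndexer, VectorIndexerModel}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]").appName("rf-categorical-sketch").getOrCreate()
import spark.implicits._

// Hypothetical toy data: Col0 is the label, Col1 is continuous,
// Col2 is a categorical feature already encoded as 0.0 / 1.0 / 2.0.
val data = Seq(
  ("a", 0.10, 0.0), ("a", 0.35, 1.0), ("a", 0.52, 0.0), ("a", 0.71, 2.0),
  ("b", 1.20, 1.0), ("b", 1.45, 2.0), ("b", 1.63, 2.0), ("b", 1.98, 1.0)
).toDF("Col0", "Col1", "Col2")

val labelIndexer = new StringIndexer()
  .setInputCol("Col0")
  .setOutputCol("indexedLabel")
  .fit(data)

// Step 1: many columns -> one vector column ("rawFeatures" is an assumed name).
val assembler = new VectorAssembler()
  .setInputCols(Array("Col1", "Col2"))
  .setOutputCol("rawFeatures")

// Step 2: VectorIndexer reads that one vector column and marks every feature
// with at most maxCategories distinct values as categorical.
val featureIndexer = new VectorIndexer()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .setMaxCategories(4) // assumed value for the toy data

val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")
  .setNumTrees(20)
  .setMaxDepth(30)
  .setMaxBins(300)
  .setImpurity("entropy")

val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, assembler, featureIndexer, rf, labelConverter))

val model = pipeline.fit(data)

// Inspect which vector positions were detected as categorical.
val indexerModel = model.stages(2).asInstanceOf[VectorIndexerModel]
println(s"Categorical feature indices: ${indexerModel.categoryMaps.keys.toSeq.sorted}")

model.transform(data).select("Col0", "predictedLabel").show()
spark.stop()
```

Two things to keep in mind with this design. VectorIndexer decides purely by the number of distinct values, so a continuous column that happens to have few distinct values is also treated as categorical; maxCategories has to be chosen with that in mind. Also, maxBins on the classifier must be at least as large as the biggest number of categories in any categorical feature.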
