many thanks Sean!
kr
marco

On Wed, Sep 14, 2016 at 10:33 PM, Sean Owen <[email protected]> wrote:
> If it helps, I've already updated that code for the 2nd edition, which
> will be based on ~Spark 2.1:
>
> https://github.com/sryza/aas/blob/master/ch04-rdf/src/main/scala/com/cloudera/datascience/rdf/RunRDF.scala#L220
>
> This should be an equivalent working example that deals with
> categoricals via VectorIndexer.
>
> You're right that you must use it because it adds the metadata that
> says it's categorical. I'm not sure of another way to do it?
>
> Sean
>
>
> On Wed, Sep 14, 2016 at 10:18 PM, Marco Mistroni <[email protected]> wrote:
> > hi all
> > i have been toying around with this well known RandomForestExample code
> >
> > val forest = RandomForest.trainClassifier(
> >   trainData, 7, Map(10 -> 4, 11 -> 40), 20,
> >   "auto", "entropy", 30, 300)
> >
> > This comes from this link
> > (https://www.safaribooksonline.com/library/view/advanced-analytics-with/9781491912751/ch04.html),
> > and also Sean Owen's presentation
> > (https://www.youtube.com/watch?v=ObiCMJ24ezs)
> >
> > and now i want to migrate it to use ML Libraries.
> > The problem i have is that the MLlib example has categorical features, and
> > i cannot find a way to use categorical features with ML.
> > Apparently i should use VectorIndexer, but VectorIndexer assumes only one
> > input column for features.
> > I am at the moment using VectorAssembler instead, but i cannot find a way
> > to achieve the same.
> > I have checked spark samples, but all i can see is RandomForestClassifier
> > using VectorIndexer for 1 feature
> >
> > Could anyone assist?
> > This is my current code....what do i need to add to take into account
> > categorical features?
> >
> > val labelIndexer = new StringIndexer()
> >   .setInputCol("Col0")
> >   .setOutputCol("indexedLabel")
> >   .fit(data)
> >
> > val features = new VectorAssembler()
> >   .setInputCols(Array(
> >     "Col1", "Col2", "Col3", "Col4", "Col5",
> >     "Col6", "Col7", "Col8", "Col9", "Col10"))
> >   .setOutputCol("features")
> >
> > val labelConverter = new IndexToString()
> >   .setInputCol("prediction")
> >   .setOutputCol("predictedLabel")
> >   .setLabels(labelIndexer.labels)
> >
> > val rf = new RandomForestClassifier()
> >   .setLabelCol("indexedLabel")
> >   .setFeaturesCol("features")
> >   .setNumTrees(20)
> >   .setMaxDepth(30)
> >   .setMaxBins(300)
> >   .setImpurity("entropy")
> >
> > println("Kicking off pipeline..")
> >
> > val pipeline = new Pipeline()
> >   .setStages(Array(labelIndexer, features, rf, labelConverter))
> >
> > thanks in advance and regards
> > Marco
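
For reference, a minimal sketch of how the two pieces discussed above might fit together: VectorAssembler first merges the raw columns into one vector column, and VectorIndexer then runs on that single column and attaches the categorical metadata. This is not the code from the linked RunRDF.scala. The wrapper name buildPipeline is hypothetical, the column names are copied from the quoted code, and the maxCategories value of 40 is an assumption taken from the largest arity in the original Map(10 -> 4, 11 -> 40).

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler, VectorIndexer}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: builds an unfitted Pipeline for a DataFrame laid out
// like the one in the quoted message (label in Col0, features in Col1..Col10).
def buildPipeline(data: DataFrame): Pipeline = {
  // Map the raw label values to indices, as in the quoted code.
  val labelIndexer = new StringIndexer()
    .setInputCol("Col0")
    .setOutputCol("indexedLabel")
    .fit(data)

  // Step 1: merge all raw feature columns into a single vector column.
  // Any categorical columns must be part of this list to be indexed later.
  val assembler = new VectorAssembler()
    .setInputCols(Array(
      "Col1", "Col2", "Col3", "Col4", "Col5",
      "Col6", "Col7", "Col8", "Col9", "Col10"))
    .setOutputCol("features")

  // Step 2: VectorIndexer reads that one vector column and marks every
  // feature with at most maxCategories distinct values as categorical.
  // Assumption: 40 is the largest number of categories in the data.
  val featureIndexer = new VectorIndexer()
    .setInputCol("features")
    .setOutputCol("indexedFeatures")
    .setMaxCategories(40)

  // The forest trains on the indexed vector, not on the raw one.
  val rf = new RandomForestClassifier()
    .setLabelCol("indexedLabel")
    .setFeaturesCol("indexedFeatures")
    .setNumTrees(20)
    .setMaxDepth(30)
    .setMaxBins(300)
    .setImpurity("entropy")

  // Convert predicted indices back to the original label values.
  val labelConverter = new IndexToString()
    .setInputCol("prediction")
    .setOutputCol("predictedLabel")
    .setLabels(labelIndexer.labels)

  new Pipeline()
    .setStages(Array(labelIndexer, assembler, featureIndexer, rf, labelConverter))
}
```

One caveat with this approach: VectorIndexer decides purely by distinct-value count, so a numeric column that happens to have 40 or fewer distinct values would also be treated as categorical. Calling buildPipeline(data).fit(trainData) would then train the whole chain, where trainData stands for whatever training split is in use.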
