Hi Jorge,

Unfortunately, I couldn't transform the data as you suggested.

This is what I get:

+---+---------+-------------+
| id|pageIndex|      pageVec|
+---+---------+-------------+
|0.0|      3.0|    (3,[],[])|
|1.0|      0.0|(3,[0],[1.0])|
|2.0|      2.0|(3,[2],[1.0])|
|3.0|      1.0|(3,[1],[1.0])|
+---+---------+-------------+

This is the snippet:

JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
    RowFactory.create(0.0, "PageA", 1.0, 2.0, 3.0),
    RowFactory.create(1.0, "PageB", 4.0, 5.0, 6.0),
    RowFactory.create(2.0, "PageC", 7.0, 8.0, 9.0),
    RowFactory.create(3.0, "PageD", 10.0, 11.0, 12.0)
));

StructType schema = new StructType(new StructField[] {
    new StructField("id", DataTypes.DoubleType, false, Metadata.empty()),
    new StructField("page", DataTypes.StringType, false, Metadata.empty()),
    new StructField("Nov", DataTypes.DoubleType, false, Metadata.empty()),
    new StructField("Dec", DataTypes.DoubleType, false, Metadata.empty()),
    new StructField("Jan", DataTypes.DoubleType, false, Metadata.empty())
});

DataFrame df = sqlContext.createDataFrame(jrdd, schema);

StringIndexerModel indexer = new StringIndexer()
    .setInputCol("page")
    .setInputCol("Nov")
    .setInputCol("Dec")
    .setInputCol("Jan")
    .setOutputCol("pageIndex")
    .fit(df);

OneHotEncoder encoder = new OneHotEncoder()
    .setInputCol("pageIndex")
    .setOutputCol("pageVec");

DataFrame indexed = indexer.transform(df);
DataFrame encoded = encoder.transform(indexed);
encoded.select("id", "pageIndex", "pageVec").show();

Could you please let me know what I'm doing wrong?

PS: My cluster is running Spark 1.3.0, which doesn't support StringIndexer or OneHotEncoder, but for testing this I've installed 1.6.0 on my local machine.

Cheers.

On 2 February 2016 at 10:25, Jorge Machado <jom...@me.com> wrote:

> Hi Guru,
>
> Any results ? :)
>
> On 01/02/2016, at 14:34, diplomatic Guru <diplomaticg...@gmail.com> wrote:
>
> Hi Jorge,
>
> Thank you for the reply and your example. I'll try your suggestion and
> will let you know the outcome.
>
> Cheers
>
>
> On 1 February 2016 at 13:17, Jorge Machado <jom...@me.com> wrote:
>
>> Hi Guru,
>>
>> So first transform your page names with OneHotEncoder (
>> https://spark.apache.org/docs/latest/ml-features.html#onehotencoder)
>> then do the same thing for the months.
>>
>> You will end up with something like:
>> (the first three are the page name, the others the month)
>> (0,0,1,0,0,1)
>>
>> Then you have your label, which is what you want to predict. At the end you
>> will have a LabeledPoint with (10000 -> (0,0,1,0,0,1)); this will represent
>> (10000 -> (PageA, UV_NOV)).
>> After that, try a regression tree with
>>
>> val model = DecisionTree.trainRegressor(trainingData,
>> categoricalFeaturesInfo, impurity, maxDepth, maxBins)
>>
>>
>> Regards
>> Jorge
>>
>> On 01/02/2016, at 12:29, diplomatic Guru <diplomaticg...@gmail.com>
>> wrote:
>>
>> Any suggestions please?
>>
>>
>> On 29 January 2016 at 22:31, diplomatic Guru <diplomaticg...@gmail.com>
>> wrote:
>>
>>> Hello guys,
>>>
>>> I'm trying to understand how I could predict the next month's page views
>>> based on the previous access pattern.
>>>
>>> For example, I've collected statistics on page views:
>>>
>>> e.g.
>>> Page,UniqueView
>>> -------------------------
>>> pageA, 10000
>>> pageB, 999
>>> ...
>>> pageZ,200
>>>
>>> I aggregate the statistics monthly.
>>>
>>> I've prepared a file containing the last 3 months, like this:
>>>
>>> e.g.
>>> Page,UV_NOV, UV_DEC, UV_JAN
>>> ---------------------------------------------------
>>> pageA, 10000,9989,11000
>>> pageB, 999,500,700
>>> ...
>>> pageZ,200,50,34
>>>
>>>
>>> Based on the above information, I want to predict the next month (FEB).
>>>
>>> Which algorithm do you think would suit best? I think linear regression
>>> is the safe bet. However, I'm struggling to prepare this data for LR ML,
>>> especially how to prepare the X,Y relationship.
>>>
>>> The Y is easy (unique visitors), but I'm not sure about the X (it should be
>>> Page, right?). However, how do I plot those three months of data?
>>>
>>> Could you give me an example based on the above example data?
>>>
>>>
>>>
>>> Page,UV_NOV, UV_DEC, UV_JAN
>>> ---------------------------------------------------
>>> 1, 10000,9989,11000
>>> 2, 999,500,700
>>> ...
>>> 26,200,50,34
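
A minimal sketch of the row layout Jorge describes (one-hot page slots followed by one-hot month slots, with the unique-view count as the label) may help when reading the output at the top of the thread. It is plain Java with no Spark dependency, and the names PageViewFeatures, LabeledRow and toLabeledRows are hypothetical, not part of any library. The slot order (pages first in file order, then NOV, DEC, JAN) is an assumption, so the exact positions of the 1s differ from Jorge's (0,0,1,0,0,1) illustration. Only the three pages actually listed in the sample file are used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a Spark API: reshapes a page-by-month table
// into one (label, features) row per (page, month) pair.
final class PageViewFeatures {

    // One training row: label = unique views, features = one-hot vector.
    static final class LabeledRow {
        final double label;
        final double[] features;

        LabeledRow(double label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    // views[p][m] holds the unique views of page p in month m.
    // Assumed layout: the first views.length slots encode the page,
    // the remaining slots encode the month (no category is dropped).
    static List<LabeledRow> toLabeledRows(double[][] views) {
        int pageCount = views.length;
        int monthCount = views[0].length;
        List<LabeledRow> rows = new ArrayList<>();
        for (int p = 0; p < pageCount; p++) {
            for (int m = 0; m < monthCount; m++) {
                double[] features = new double[pageCount + monthCount];
                features[p] = 1.0;              // page slot
                features[pageCount + m] = 1.0;  // month slot
                rows.add(new LabeledRow(views[p][m], features));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // pageA, pageB, pageZ over UV_NOV, UV_DEC, UV_JAN (sample data from the thread).
        double[][] views = {
            {10000, 9989, 11000},
            {999, 500, 700},
            {200, 50, 34}
        };
        for (LabeledRow row : toLabeledRows(views)) {
            System.out.println(row.label + " -> " + Arrays.toString(row.features));
        }
    }
}
```

Each row has the shape a LabeledPoint would hold in Jorge's suggestion. Two things in the Spark snippet are worth checking against this. First, setInputCol is a plain setter, so each of the four chained calls overwrites the previous one and the indexer ends up fitted on "Jan" rather than "page"; a single setInputCol("page") is presumably what was intended. Second, OneHotEncoder drops the last category by default (its dropLast parameter), which is why four distinct values produce vectors of size 3 and index 3.0 appears as the all-zero vector (3,[],[]). The notation (3,[1],[1.0]) is a sparse vector: size 3, with value 1.0 at position 1.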