I was thinking of using org.apache.hadoop.mapred.join.TupleWritable to implement my clustering. In your opinion, is this the right choice? If not, how could I implement my scenario?
Thank you
Angelo

2013/12/3 Angelo Immediata <[email protected]>

> Well, similarity between data should be calculated by taking into account
> the following variables: meteo, manifestation, day of the week, month of
> the year and vacation
>
> 2013/12/3 Ted Dunning <[email protected]>
>
>> The key first question is how you plan to compute similarity between data
>> points. It isn't clear how you should do this with your data.
>>
>> On Mon, Dec 2, 2013 at 1:31 AM, Angelo Immediata <[email protected]> wrote:
>>
>> > Hi
>> >
>> > I'm pretty new to machine learning and above all to Apache Mahout, so
>> > please pardon my basic questions.
>> >
>> > I need to do some cluster analysis on some data. At the beginning this
>> > data may not be very large, but after some time it can become really
>> > huge (I did some calculations and after 1 year this data can be around
>> > 37 billion records). Since I have this huge amount of data, I decided
>> > to do the cluster analysis by using Mahout on top of Apache Hadoop and
>> > its HDFS. As for where to store this large amount of data, I decided to
>> > use Apache HBase, also on top of Apache Hadoop HDFS.
>> >
>> > Now I need to do this cluster analysis by considering some environment
>> > variables. These variables may be the following:
>> >
>> > - *recordId* = id of the record
>> > - *arcId* = id of the arc between 2 points of my "street graph"
>> > - *mediumVelocity* = medium velocity of the considered arc in the
>> >   specified
>> > - *vehiclesNumber* = number of the monitored vehicles in order to get
>> >   that velocity
>> > - *meteo* = weather condition (a numeric representing if there is sun,
>> >   rain etc...)
>> > - *manifestation* = a numeric representing if there is any kind of
>> >   manifestation (sport manifestation or other)
>> > - *day of the week*
>> > - *month of the year*
>> > - *hour of the day*
>> > - *vacation* = a numeric representing if it's a vacation day or a
>> >   working day
>> >
>> > So my data are formatted like this (raw representation):
>> >
>> > recordId  arcId  mediumVelocity  vehiclesNumber  meteo  manifestation  weekDay  yearMonth  dayHour  vacation
>> > 1         1      34.5            20              1      3              4        2011       10       3
>> > 2         156    66.5            3               2      5              1        2008       6        2
>> >
>> > As far as I know, in order to do the cluster analysis in Mahout I need
>> > to format my data in Mahout format (that is, in a SequenceFile). The
>> > question is: how can I format my data, represented as in the table
>> > above, in a SequenceFile? I tried to find something but I was not able
>> > to find any good sample. Any suggestion would be really appreciated.
>> >
>> > Thank you
>> > Angelo
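
For reference, the first half of the conversion asked about in the quoted message (raw row to numeric feature vector) could be sketched as below. This is only a minimal sketch under stated assumptions: the class name `TrafficRecordParser` and its methods are hypothetical, the column order is taken from the header row quoted above, and the five variables named for similarity (meteo, manifestation, weekDay, yearMonth, vacation) are simply copied as doubles, which treats categorical codes as magnitudes and may not be what you want. Writing the result to a SequenceFile (typically by wrapping the array in Mahout's DenseVector and VectorWritable) is left out because it needs Hadoop and Mahout on the classpath.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Mahout or Hadoop.
final class TrafficRecordParser {
    // Column positions, following the header row quoted in the thread.
    private static final int COLUMN_COUNT = 10;
    private static final int RECORD_ID = 0;
    private static final int METEO = 4;
    private static final int MANIFESTATION = 5;
    private static final int WEEK_DAY = 6;
    private static final int YEAR_MONTH = 7;
    private static final int VACATION = 9;

    // Columns used for similarity, as listed in the reply to Ted Dunning.
    private static final int[] SIMILARITY_COLUMNS =
            {METEO, MANIFESTATION, WEEK_DAY, YEAR_MONTH, VACATION};

    private TrafficRecordParser() {}

    // Splits one whitespace-separated row and checks the column count.
    private static String[] split(String line) {
        String[] cols = line.trim().split("\\s+");
        if (cols.length != COLUMN_COUNT) {
            throw new IllegalArgumentException(
                    "Expected " + COLUMN_COUNT + " columns but got " + cols.length);
        }
        return cols;
    }

    // Returns the recordId column, a natural candidate for the SequenceFile key.
    static String recordId(String line) {
        return split(line)[RECORD_ID];
    }

    // Returns the five similarity variables as doubles, in the order listed above.
    static double[] toFeatures(String line) {
        String[] cols = split(line);
        double[] features = new double[SIMILARITY_COLUMNS.length];
        for (int i = 0; i < SIMILARITY_COLUMNS.length; i++) {
            features[i] = Double.parseDouble(cols[SIMILARITY_COLUMNS[i]]);
        }
        return features;
    }

    public static void main(String[] args) {
        String row = "2 156 66.5 3 2 5 1 2008 6 2"; // second sample row from the thread
        System.out.println(recordId(row) + " -> " + Arrays.toString(toFeatures(row)));
    }
}
```

Keeping the parsing separate from the Hadoop I/O means the same method can be called from a plain loader or from a mapper reading the rows out of HBase, and the choice of features can be changed without touching the writer.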
