Hi there,

I'm reopening this old topic because I now have more information after talking with my customer.

Basically, my customer wants the following: using historical data, we have to group the records with some cluster analysis over a set of environment variables; for each cluster we have to find the average velocity of its records. When a user wants to know the velocity on a street for a known period (e.g. 24 January 2014), I should be able to find which cluster the new data belongs to and propose the average velocity of that cluster to the user.

I have to consider the following environment variables:
- arcId = id of the arc between 2 points of my "street graph"
- startTime = start time of the pre-clustering measurement
- endTime = end time of the pre-clustering measurement
- mediumVelocity = average velocity on the considered arc in the specified time range
- vehiclesNumber = number of vehicles monitored to obtain that velocity in that time range
- meteo = weather condition (a numeric value representing sun, rain, etc.)
- manifestation = a numeric value representing whether there is any kind of event (sports event or other)
- day of the week
- month of the year
- hour of the day
- vacation = a numeric value representing whether it's a holiday or a working day

And maybe some other variables.

My idea was to use Mahout to do the cluster analysis with k-means and canopy. Moreover, the data for the cluster analysis can be pretty huge (in one year it can reach around 37 billion records in one table), so I decided to run Mahout on top of a Hadoop cluster and to use HBase to store and read the data.

So what I would like to know is whether my solution makes sense (it seems good to me, but as I said I'm a newbie with these technologies and, on the other hand, I need performance too).

If this solution is OK, how should I Map/Reduce my historical data in order to pass it to Mahout for the cluster analysis?

I hope I didn't make too many mistakes :)

Thank you
Angelo

2013/10/16 Bertrand Dechoux <[email protected]>

> That's why I was asking a bit more about the problem. It looks to me that
> what will bring more value at the beginning is to find the shortest path,
> which is a classical graph algorithm. Then the results could be improved by
> changing the speed of each route according to additional information. As a
> client, if it's raining, I only want to know if I should turn left or
> right. Estimating the speed of each route with good enough accuracy is
> more complex and is relevant only if there is a single long enough route.
>
> If you are dealing with a large volume of data, there are also graph
> solutions for Hadoop like Giraph or Hama.
>
> IMHO, YMMV...
>
> Bertrand
>
> On Tue, Oct 15, 2013 at 10:01 PM, Angelo Immediata <[email protected]> wrote:
>
> > Hi all,
> >
> > First of all thank you for the great suggestions you gave me; you are
> > simply great :)
> > Anyway, returning to my problem, I'll try to be as clear as
> > possible... As far as I know (but we are still collecting requirements and
> > understanding which kind of data we will have) we should have a situation
> > of this type:
> > on street XYZ in Spring without any events (an event can be a
> > demonstration, parade, etc.) the average velocity is 50 km/h
> > on street XYZ in Spring with an event the average velocity is 20 km/h
> > on street XYZ in Autumn without any events (an event can be a
> > demonstration, parade, etc.) the average velocity is 40 km/h
> > on street XYZ in Autumn with an event the average velocity is 15 km/h
> >
> > and so on for all the streets of interest (basically using the Open Street
> > Map data); note that we are not interested in the worst case, that is the
> > case with an accident (at least as far as I know).
> >
> > Now my customer would like to offer this kind of functionality to the clients:
> > a client connects to the site (or downloads an app) and wants to go
> > by car to restaurant W; he/she would like to know if it's a good idea
> > to take that street or to search for a different one; so by knowing the
> > period of time (Spring, Autumn, Summer or Winter) and by knowing if there
> > are some events (demonstrations, parades, etc.)
> > I should tell him/her: if
> > you go on street XYZ you will probably travel at 50 km/h or 20 km/h (the best
> > would be if I could suggest a different way... but this is another topic :) )
> >
> > So, since I should use old data in order to suggest to the client the
> > velocity he/she may have on street XYZ, I was thinking of using Mahout... but
> > maybe I was wrong (sadly I'm really new to this kind of world... though I'm
> > finding it amazing)
> >
> > Now by using the "old" data (the one I listed previously)
> >
> > 2013/10/15 Andrew Butkus <[email protected]>
> >
> > > After giving it some more thought, you could do something like this:
> > >
> > > Store:
> > >
> > > route
> > > {
> > >   road
> > >   {
> > >     timestamp,
> > >     time_to_run_road,
> > >   }
> > > }
> > >
> > > then build up a bigger model, which extracts the timestamp from the road on
> > > the route and the time it takes to run that road, and calculates an average
> > > on a per-day basis (for example, if you travel this route every Monday at
> > > 9am, then extract the timestamps which match every Monday at 9am, and
> > > average the time_to_run_road data you have collected on a Monday for that
> > > road).
> > > If you want to see how long it takes to run a road every Monday at
> > > 9am in January, then you extract all timestamps that match that road for
> > > January at 9am on Monday.
> > >
> > > Not entirely sure where Mahout fits in here, but this could be a potential
> > > way forward for you (assuming you can collect/have data about the road).
> > >
> > > Hope that helps
> > >
> > > Andy
> > >
> > > On 15 Oct 2013, at 13:09, Andrew Butkus <[email protected]> wrote:
> > >
> > > > Also, to add to this, you probably wouldn't want to do it by route, but
> > > > maybe break it down by road; this gives more coverage and greater
> > > > granularity.
> > > >
> > > > Sent from my Windows Phone
> > > > From: Andrew Butkus
> > > > Sent: 15/10/2013 13:07
> > > > To: Bertrand Dechoux; [email protected]
> > > > Subject: RE: Information
> > > >
> > > > I'm not sure; I think the last 2 can be predicted. For example, in
> > > > January in the UK we get bad weather which causes delays, and on average
> > > > it will take longer to run a route in this month because of that.
> > > >
> > > > Considering weather as a variable is probably not scalable; recording
> > > > the time to run a route with a timestamp should be good enough.
> > > >
> > > > Also consider that once a year there is a festival in Reading, so over this
> > > > weekend routes through Reading will always take longer.
> > > >
> > > > I'm not sure where Mahout can fit this problem, but if you can
> > > > train route time and add a timestamp this would give you something
> > > > scalable. Then figure out on average how long it takes to run a route
> > > > at a similar timestamp, for example, minute, hour, week, month, year.
> > > >
> > > > Sent from my Windows Phone
> > > > From: Bertrand Dechoux
> > > > Sent: 15/10/2013 08:33
> > > > To: [email protected]
> > > > Subject: Re: Information
> > > >
> > > > The biggest point is what data do you have and what exactly is your
> > > > problem.
> > > >
> > > > The maximum speed of the route can be easily known and in the best case
> > > > that would be your speed. From a very broad point of view, there are three
> > > > reasons for a slowdown:
> > > > 1) traffic jam
> > > > 2) accident
> > > > 3) bad weather
> > > >
> > > > But without up-to-date observations, those three points are non-trivial to
> > > > predict (especially the last two). Doing simple statistics (like an average)
> > > > can be a good start to see the variations and understand what factors
> > > > should be taken into account.
> > > >
> > > > In the end, you want to do a regression, but classification and clustering
> > > > might help before that. Hard to say more without knowing why the average
> > > > speed is important, for which area, at which time...
> > > >
> > > > Bertrand
> > > >
> > > > On Tue, Oct 15, 2013 at 9:14 AM, Pavan K Narayanan <[email protected]> wrote:
> > > >
> > > >> Based on the information you have provided, street routing is potentially a
> > > >> Vehicle Routing Problem, which is based on TSPs. You can check out the
> > > >> link below:
> > > >> https://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman
> > > >> Secondly, if you want to use Mahout for forecasting, it is not possible yet
> > > >> as the solution methodology for forecasting (LWR) is still an open problem.
> > > >> https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
> > > >>
> > > >> Bottom line: IMHO, you cannot use Mahout for forecasting at the moment; good
> > > >> luck with your project.
> > > >>
> > > >> Also, you can explore parallel computing paradigms if you have relatively
> > > >> high volumes of data.
> > > >>
> > > >> On 15 October 2013 12:19, Angelo Immediata <[email protected]> wrote:
> > > >>
> > > >>> Hi there
> > > >>>
> > > >>> I'm pretty new to machine learning and Apache Mahout as well, so pardon me
> > > >>> if this question is not quite correct :)
> > > >>>
> > > >>> I'm in a street routing project where, besides other functionality, we
> > > >>> have to make forecasts. Precisely, we should be able to forecast the
> > > >>> average speed on a street in a known season (e.g. we should be
> > > >>> able to answer this kind of question: on the American Route 66, what
> > > >>> will be the average speed in spring 2015?)
> > > >>> As far as I know, in order to offer this functionality we should use some
> > > >>> machine learning; this is the reason I'm checking Mahout (moreover we need
> > > >>> to guarantee high performance, and since Mahout is based on Apache Hadoop
> > > >>> and uses Map/Reduce, it seems very promising to me)
> > > >>> The first question I'd love to ask is: can I use Apache Mahout in order to
> > > >>> implement the functionality described above?
> > > >>> If I can use it, I'll surely need some data in order to "train"
> > > >>> Mahout... can I train Mahout at a different time from when I need the
> > > >>> forecast? I mean: can I run the training, let's say, every week at 10pm
> > > >>> and then offer the forecasting functionality only when a user is
> > > >>> interested in it? Should I store the training result in some way?
> > > >>> And last, but not least :), always assuming I can use Mahout... which
> > > >>> algorithm should I use in order to implement my scenario?
> > > >>>
> > > >>> Thank you for the help and pardon me if I was not too correct
> > > >>>
>
> --
> Bertrand Dechoux
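
P.S. To make the lookup step described at the top of this message more concrete, here is a minimal sketch in plain Python (not Mahout). It assumes the centroids have already been computed elsewhere, for example by a k-means job; the two-feature encoding (season, event), the function names and the toy numbers are all hypothetical and only loosely follow the Spring example quoted above. Treating encoded categorical variables with Euclidean distance is also just an assumption for illustration.

```python
from collections import defaultdict
from math import dist


def nearest(centroids, features):
    """Return the index of the centroid closest to `features` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: dist(centroids[i], features))


def mean_speed_per_cluster(centroids, records):
    """Assign each (features, speed) record to its nearest centroid and
    return {cluster_index: average speed of the records in that cluster}."""
    totals = defaultdict(lambda: [0.0, 0])  # cluster -> [sum of speeds, count]
    for features, speed in records:
        bucket = totals[nearest(centroids, features)]
        bucket[0] += speed
        bucket[1] += 1
    return {cluster: total / count for cluster, (total, count) in totals.items()}


def predict_speed(centroids, cluster_means, features):
    """Propose the average speed of the cluster the new data point falls into
    (None if that cluster had no historical records)."""
    return cluster_means.get(nearest(centroids, features))


# Hypothetical toy data: features = (season code, event flag), speed in km/h.
centroids = [(0.0, 0.0), (0.0, 1.0)]  # assumed output of a clustering step
history = [((0, 0), 50.0), ((0, 0), 52.0), ((0, 1), 20.0), ((0, 1), 18.0)]

means = mean_speed_per_cluster(centroids, history)
print(predict_speed(centroids, means, (0, 1)))  # speed proposed for "Spring with an event"
```

In this form the per-cluster averages can be computed offline (e.g. in a periodic batch) and stored, so answering a user request only needs the nearest-centroid lookup.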
