Vasil- Would you consider adding your estimation algorithm to this patch? https://issues.apache.org/jira/browse/MAHOUT-563
The estimator in there now is stupid- a real one would make the Canopy
algorithms orders of magnitude more useful.

Lance

On Fri, Jan 21, 2011 at 7:16 AM, Ted Dunning <[email protected]> wrote:
> On Fri, Jan 21, 2011 at 12:39 AM, Vasil Vasilev <[email protected]> wrote:
>
>>
>> dimension 1: Using linear regression with gradient descent algorithm I find
>> what is the trend of the line, i.e. is it increasing, decreasing or
>> straight line
>> dimension 2: Knowing the approximating line (from the linear regression) I
>> count how many times this line gets crossed by the original signal. This
>> helps in separating the cyclic data from all the rest
>> dimension 3: What is the biggest increase/decrease of a single signal line.
>> This helps find shifts
>>
>> So to say - I put a semantics for the data that are to be clustered (I
>> don't know if it is correct to do that, but I couldn't think of how an
>> algorithm could cope with the task without such additional semantics)
>>
>
> It is very common for feature extraction like this to be the key for
> data-mining projects. Such features are absolutely critical for most time
> series mining and are highly application dependent.
>
> One key aspect of your features is that they are shift invariant.
>
>
>> Also I developed a small swing application which visualizes the clustered
>> signals and which helped me in playing with the algorithms.
>>
>
> Great idea.
>

--
Lance Norskog
[email protected]
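For anyone following the thread, here is a rough sketch of how the three dimensions Vasil describes might be computed for a single signal. This is not Vasil's code and not anything in Mahout. The names (fit_line_gd, extract_features), the learning rate, the step count, the tolerance, and the choice to rescale the sample index to [0, 1] are all assumptions made for illustration. "Biggest increase/decrease" is read here as the largest absolute change between two consecutive samples, which may differ from what Vasil actually does.

```python
def fit_line_gd(ys, lr=0.1, steps=5000):
    """Fit y = a*x + b by batch gradient descent on mean squared error.

    x is the sample index rescaled to [0, 1] (an assumption of this sketch),
    so the same learning rate behaves similarly for any signal length.
    """
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = [i / (n - 1) for i in range(n)]
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = a * x + b - y
            grad_a += 2.0 * err * x / n
            grad_b += 2.0 * err / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b, xs


def extract_features(ys, tol=1e-9):
    """Return (slope, crossings, max_jump) for one signal.

    slope     : trend of the fitted line (dimension 1)
    crossings : times the signal crosses the fitted line (dimension 2)
    max_jump  : largest absolute change between neighbours (dimension 3)
    """
    a, b, xs = fit_line_gd(ys)
    residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
    # Ignore points lying (numerically) on the line so noise is not counted.
    signs = [r > 0 for r in residuals if abs(r) > tol]
    crossings = sum(1 for s, t in zip(signs, signs[1:]) if s != t)
    max_jump = max(abs(v - u) for u, v in zip(ys, ys[1:]))
    return a, crossings, max_jump


if __name__ == "__main__":
    print(extract_features([0.0, 1.0, 2.0, 3.0, 4.0, 5.0]))    # steady trend
    print(extract_features([1.0, -1.0, 1.0, -1.0, 1.0, -1.0]))  # cyclic
    print(extract_features([0.0, 0.0, 0.0, 5.0, 5.0, 5.0]))    # shift
```

A cyclic signal should produce many crossings, a shifted one a large max_jump, and a trending one a clearly non-zero slope. Each signal then becomes a 3-element vector that could be handed to a clustering algorithm.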
