Darius's comments are good. You also have to think about what similar means to you. From the data you describe, I see several possibilities:

- geo-location from the machine id (if it includes an IP address)
- content of the queries
- frequency of posting
- diurnal phase of posting (which tells us the time zone)

Once you know what similar means, you can meaningfully talk about next steps.

If you assume that only query content matters, then I would consider several approaches:

- cluster directly on query histories using IDF weighting (the results are likely to be kinda sorta lousy)
- use cooccurrence analysis to augment the query histories and repeat the clustering
- use SVD or ALS to generate user vectors and query term vectors, cluster users using the user vectors, and then look for coherence

If you want to use geo, the question of scaling comes in. If you want to use time, you have to derive some sort of features. I find latent variable methods useful for this.

On Fri, Sep 6, 2013 at 1:25 AM, Darius Miliauskas <[email protected]> wrote:

> Dear Vishal,
>
> can you give some code how you performed your mentioned steps:
>
> #) Created custom VectorIterable by inheriting Iterable<Vector>.
> #) Created custom VectorItertor by inheriting AbstractIterator<Vector>
> #) Model class which will be responsible to pass attribute values
>    (username or data etc) to custom VectorIterator
> #) Custom VectorIterator.computeNext() will read line, create dense
>    vector having size equal to number of attribute in a row.
>
> Can you compile the code?
>
> Best,
>
> Darius
>
> 2013/9/6 Vishal Danech <[email protected]>
>
> > Hi
> >
> > I have a custom log data which contains following details.
> >
> > 1) UserName
> > 2) MachineId
> > 3) DateTime
> > 4) Data - which contains text - search term etc
> >
> > I would like to use this data to know
> > #) how much time they are spending on browsing etc.
> > #) User based search pattern
> >
> > First problem can be addressed using Hive query.
> >
> > For second problem, I suppose clustering can be applied and for this I have
> > converted data to vectors.
> > I have used dense vector and applied Canopy algorithm on it. I got an
> > output which I provided as an input to ClusterDump utility but the out I
> > got was not in readable form, I figured out that I need to use named
> > vectors so that Key can be displayed as a output. Here I am facing issue,
> > how to use NamedVector ?
> >
> > I am performing following steps to generate vectors..
> > #) Created custom VectorIterable by inheriting Iterable<Vector>.
> > #) Created custom VectorItertor by inheriting AbstractIterator<Vector>
> > #) Model class which will be responsible to pass attribute values
> >    (username or data etc) to custom VectorIterator
> > #) Custom VectorIterator.computeNext() will read line, create dense
> >    vector having size equal to number of attribute in a row.
> >
> > Please let me know how to add NamedVector here so that I can get some
> > readable output from ClusterDump utility.
> >
> > --
> > Thanks and Regards
> > Vishal Danech
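
For the first query-content option above (clustering directly on IDF-weighted query histories), a minimal sketch might look like the following. It is plain Python rather than Mahout, the log rows and all function names are made up for illustration, and the smoothed IDF formula is just one common choice. It stops at user-to-user cosine similarity, which is the kind of input a clusterer would then work from.

```python
import math
from collections import Counter, defaultdict

# Hypothetical log rows: (user, query text). Users and queries are invented.
LOG = [
    ("alice", "mahout clustering canopy"),
    ("alice", "mahout named vector"),
    ("bob", "canopy clustering mahout"),
    ("carol", "cheap flights paris"),
    ("carol", "paris hotels"),
]

def user_term_counts(log):
    """Collapse each user's query history into a bag of terms."""
    counts = defaultdict(Counter)
    for user, query in log:
        counts[user].update(query.lower().split())
    return counts

def idf_weights(counts):
    """Smoothed IDF, treating each user's whole history as one document."""
    n_users = len(counts)
    df = Counter()
    for terms in counts.values():
        df.update(terms.keys())
    return {t: math.log((1 + n_users) / (1 + d)) + 1.0 for t, d in df.items()}

def tfidf_vectors(counts, idf):
    """Sparse TF-IDF vector per user, normalised to unit length."""
    vectors = {}
    for user, terms in counts.items():
        vec = {t: c * idf[t] for t, c in terms.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors[user] = {t: w / norm for t, w in vec.items()}
    return vectors

def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * b[t] for t, w in a.items() if t in b)

counts = user_term_counts(LOG)
vectors = tfidf_vectors(counts, idf_weights(counts))
```

Keying the vectors by user name plays the same role here that a named vector plays in the cluster dump: each row of output can be traced back to a user.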
