You are absolutely right, but I have simplified the problem here. Content similarity can be treated as one more way to enrich the features. Features can be defined in many ways; I would like to start with the simplest one, visited or not, and add more features later if the results do not meet expectations.
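To make the "visited or not" feature concrete, here is a minimal sketch in plain Java of how the per-user binary vectors could be built before anything is handed to Mahout. It is only an illustration under assumptions: the class name `VisitVectors` and its methods are hypothetical, each log line is assumed to hold three whitespace-separated fields (user id, timestamp, page), and reading the gzip archive is left out. No Mahout API is used here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Mahout.
final class VisitVectors {

    // Groups visited pages by user. Assumes each line is "userId timestamp page"
    // separated by whitespace; lines with fewer than three fields are skipped.
    static Map<String, Set<String>> pagesByUser(List<String> lines) {
        Map<String, Set<String>> visits = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 3) {
                continue;
            }
            visits.computeIfAbsent(fields[0], k -> new TreeSet<>()).add(fields[2]);
        }
        return visits;
    }

    // Builds the global page list in sorted order; a page's position is its vector index.
    static List<String> pageIndex(Map<String, Set<String>> visits) {
        Set<String> all = new TreeSet<>();
        for (Set<String> pages : visits.values()) {
            all.addAll(pages);
        }
        return new ArrayList<>(all);
    }

    // Returns 1 at position i if the user visited page i, otherwise 0.
    static int[] toVector(Set<String> visited, List<String> index) {
        int[] v = new int[index.size()];
        for (int i = 0; i < v.length; i++) {
            v[i] = visited.contains(index.get(i)) ? 1 : 0;
        }
        return v;
    }

    public static void main(String[] args) {
        // Made-up sample data in the assumed three-field format.
        List<String> log = Arrays.asList(
                "user1 t11 page1", "user2 t21 page1", "user1 t12 page2",
                "user2 t22 page2", "user3 t31 page3", "user3 t32 page4");
        Map<String, Set<String>> visits = pagesByUser(log);
        List<String> index = pageIndex(visits);
        for (Map.Entry<String, Set<String>> e : visits.entrySet()) {
            System.out.println(e.getKey() + " " + Arrays.toString(toVector(e.getValue(), index)));
        }
    }
}
```

With real data the page set will be large and each user will have visited only a few pages, so a sparse vector representation would probably be a better fit than a dense `int[]`; the dense array is used here only to mirror the [1 1 0 0] example.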

On Fri, Nov 9, 2012 at 10:57 AM, Dmitriy Lyubimov <[email protected]> wrote:

> sorry you probably meant that anyway. your trained input should be labeled
> by groups and your prediction request input is not labeled.
>
> looks like a job for a classification like sgd except visited pages make up
> poor categorical source without looking into their content similarities.
>
> On Nov 9, 2012 8:49 AM, "Dmitriy Lyubimov" <[email protected]> wrote:
>
> > if it is supervised classification, your input should contain the groups.
> > te idea is that you extend knowledge hidden in a smaller perhaps expert
> > labeled dataset to the rest of the universe.
> >
> > On Nov 9, 2012 8:43 AM, "qiaoresearcher" <[email protected]> wrote:
> >
> >> It is a supervised classification problem.
> >>
> >> For example, a very simple case:
> >> say, overall we collect 4 pages from the data set:
> >> { web_page 1 web_page 2 web_page 3 web_page 4 }
> >> then users may have input vectors like:
> >> user1 [1 1 0 0]
> >> user2 [1 1 0 0]
> >> user3 [0 0 1 1]
> >> user4 [0 0 1 1]
> >> user5 [0 0 1 1]
> >> ... ....
> >>
> >> then whatever classification algorithm mahout has should return
> >> classification results as
> >> group 1 { user1, user2}
> >> group 2 { user3, user4, user5 }
> >>
> >> On Fri, Nov 9, 2012 at 10:29 AM, Sean Owen <[email protected]> wrote:
> >>
> >> > First: what question are you trying to answer from this data? You are
> >> > trying to classify users into what, for what purpose?
> >> >
> >> > On Fri, Nov 9, 2012 at 4:20 PM, qiaoresearcher <[email protected]> wrote:
> >> >
> >> > > Hi All,
> >> > >
> >> > > Assume the data is stored in a gzip file which includes many text files.
> >> > > Within each text file, each line represents an activity of a user, for
> >> > > example, a click on a web page.
> >> > > the text file will look like:
> >> > >
> >> > > ----------------------------------------------------------------------------------
> >> > > user 1 time11 visiting_web_page11
> >> > > user 2 time21 visiting_web_page21
> >> > > user 1 time12 visiting_web_page12
> >> > > user 1 time13 visiting_web_page13
> >> > > user 2 time22 visiting_web_page22
> >> > > user 3 time31 visiting_web_page31
> >> > > user 1 time14 visiting_web_page14
> >> > > ... .... ..........
> >> > >
> >> > > I am thinking to first construct a web page set like
> >> > > { visiting_web_page11, visiting_web_page12, visiting_web_page31, ....... }
> >> > >
> >> > > then for each user, we form a vector [ 1 0 0 1 0 0 ..... ] where
> >> > > '1' means the user visited that page and 0 means he did not
> >> > > then use mahout to classify the users based on the vectors
> >> > >
> >> > > does mahout has example like this? if not, what kind of java code we need
> >> > > to write to process this task?
> >> > >
> >> > > thanks for any suggestions in advance !
