OK. I guess you can try factorization first (of the user vs. pages matrix) and then use the resulting user factor vectors as predictors for SGD. However, it will not work well if your user/page matrix is too sparse. IMO you need to prototype this approach in R first, before moving to scale, to see whether you can get an acceptable result at all.
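To make the two steps concrete, here is a rough sketch in plain Python (not R and not Mahout's API, just the standard library). It factors the toy user/page matrix from the example further down the thread and then trains a logistic regression with SGD on the user factors. The function names, the rank, the learning rates, the group labels and the choice to treat every 0 as an observed value are all assumptions for illustration.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def factorize(matrix, rank=2, epochs=500, lr=0.05, reg=0.01, seed=0):
    """Factor a dense user x page matrix into user and page factors with SGD.

    Assumption: every cell (including the zeros) is treated as observed.
    """
    rng = random.Random(seed)
    n_users, n_pages = len(matrix), len(matrix[0])
    users = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    pages = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_pages)]
    cells = [(i, j) for i in range(n_users) for j in range(n_pages)]
    for _ in range(epochs):
        rng.shuffle(cells)
        for i, j in cells:
            u, p = users[i], pages[j]
            err = matrix[i][j] - dot(u, p)
            for k in range(rank):
                u_k = u[k]  # keep the old value for the page update
                u[k] += lr * (err * p[k] - reg * u_k)
                p[k] += lr * (err * u_k - reg * p[k])
    return users, pages


def train_logistic_sgd(features, labels, epochs=1000, lr=0.5, seed=0):
    """Binary logistic regression trained one example at a time."""
    rng = random.Random(seed)
    weights = [0.0] * len(features[0])
    bias = 0.0
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for idx in order:
            x, y = features[idx], labels[idx]
            prob = 1.0 / (1.0 + math.exp(-(bias + dot(weights, x))))
            step = lr * (y - prob)
            weights = [w + step * x_k for w, x_k in zip(weights, x)]
            bias += step
    return weights, bias


def predict(weights, bias, x):
    return 1 if bias + dot(weights, x) > 0 else 0


# Toy visited / not-visited matrix from the thread: rows are user1..user5.
visits = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
user_factors, _page_factors = factorize(visits)

# Hypothetical labels: only user1, user3 and user4 are labeled (0 = group 1, 1 = group 2).
labeled = {0: 0, 2: 1, 3: 1}
weights, bias = train_logistic_sgd(
    [user_factors[i] for i in labeled], [labeled[i] for i in labeled]
)

# Predict the group for the unlabeled users (user2 and user5).
predictions = {i: predict(weights, bias, user_factors[i]) for i in (1, 4)}
```

On real click data the matrix would be far larger and sparser than this, which is exactly the case where the factors may stop being useful predictors, so a small prototype like this only shows the mechanics.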
On Fri, Nov 9, 2012 at 9:06 AM, qiaoresearcher <[email protected]> wrote:

> You are absolutely right, but here I have simplified the problem. Content
> similarity can be regarded as one to enrich the features. Features can be
> defined in many ways, here I would like to start with most simple feature:
> visited or not, later on I will add more features if the results can not
> meet expectation
>
> On Fri, Nov 9, 2012 at 10:57 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > sorry you probably meant that anyway. your trained input should be labeled
> > by groups and your prediction request input is not labeled.
> >
> > looks like a job for a classification like sgd except visited pages make up
> > poor categorical source without looking into their content similarities.
> >
> > On Nov 9, 2012 8:49 AM, "Dmitriy Lyubimov" <[email protected]> wrote:
> >
> > > if it is supervised classification, your input should contain the groups.
> > > the idea is that you extend knowledge hidden in a smaller perhaps expert
> > > labeled dataset to the rest of the universe.
> > >
> > > On Nov 9, 2012 8:43 AM, "qiaoresearcher" <[email protected]> wrote:
> > >
> > >> It is a supervised classification problem.
> > >>
> > >> For example, a very simple case:
> > >> say, overall we collect 4 pages from the data set:
> > >> { web_page 1  web_page 2  web_page 3  web_page 4 }
> > >> then users may have input vectors like:
> > >> user1 [1 1 0 0]
> > >> user2 [1 1 0 0]
> > >> user3 [0 0 1 1]
> > >> user4 [0 0 1 1]
> > >> user5 [0 0 1 1]
> > >> ... ....
> > >>
> > >> then whatever classification algorithm mahout has should return
> > >> classification results as
> > >> group 1 { user1, user2}
> > >> group 2 { user3, user4, user5 }
> > >>
> > >> On Fri, Nov 9, 2012 at 10:29 AM, Sean Owen <[email protected]> wrote:
> > >>
> > >> > First: what question are you trying to answer from this data? You are
> > >> > trying to classify users into what, for what purpose?
> > >> >
> > >> > On Fri, Nov 9, 2012 at 4:20 PM, qiaoresearcher <[email protected]> wrote:
> > >> >
> > >> > > Hi All,
> > >> > >
> > >> > > Assume the data is stored in a gzip file which includes many text files.
> > >> > > Within each text file, each line represents an activity of a user, for
> > >> > > example, a click on a web page.
> > >> > > the text file will look like:
> > >> > >
> > >> > > ----------------------------------------------------------------------
> > >> > > user 1 time11 visiting_web_page11
> > >> > > user 2 time21 visiting_web_page21
> > >> > > user 1 time12 visiting_web_page12
> > >> > > user 1 time13 visiting_web_page13
> > >> > > user 2 time22 visiting_web_page22
> > >> > > user 3 time31 visiting_web_page31
> > >> > > user 1 time14 visiting_web_page14
> > >> > > ... .... ..........
> > >> > >
> > >> > > I am thinking to first construct a web page set like
> > >> > > { visiting_web_page11, visiting_web_page12, visiting_web_page31, ....... }
> > >> > >
> > >> > > then for each user, we form a vector [ 1 0 0 1 0 0 ..... ] where
> > >> > > '1' means the user visited that page and 0 means he did not
> > >> > > then use mahout to classify the users based on the vectors
> > >> > >
> > >> > > does mahout has example like this? if not, what kind of java code we need
> > >> > > to write to process this task?
> > >> > >
> > >> > > thanks for any suggestions in advance !
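For the vector construction described in the original question at the bottom of the thread, a minimal Python sketch could look like the following. It assumes each log line has exactly three whitespace-separated fields (user id, timestamp, page), which is a guess at the real format; the function name and the sample lines are made up for illustration.

```python
def build_visit_vectors(lines):
    """Turn 'user time page' log lines into one 0/1 vector per user.

    Assumption: three whitespace-separated fields per line; anything else is skipped.
    """
    visited = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        user, _timestamp, page = parts
        visited.setdefault(user, set()).add(page)

    # Fix a column order for the pages so every user vector lines up.
    pages = sorted({page for user_pages in visited.values() for page in user_pages})
    vectors = {
        user: [1 if page in user_pages else 0 for page in pages]
        for user, user_pages in visited.items()
    }
    return pages, vectors


# Made-up sample lines in the assumed format.
sample_lines = [
    "user1 time11 page_a",
    "user2 time21 page_b",
    "user1 time12 page_b",
    "user3 time31 page_c",
]
pages, vectors = build_visit_vectors(sample_lines)
```

With many distinct pages these vectors become very long and mostly zero, so a sparse representation (for example, keeping only the set of visited page indices per user) would be the more practical form at scale.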
