Sorry, you probably meant that anyway. Your training input should be labeled by group, while your prediction request input is not labeled.
This looks like a job for a classifier such as SGD, except that visited pages make a poor source of categorical features unless you look into their content similarities.

On Nov 9, 2012 8:49 AM, "Dmitriy Lyubimov" <[email protected]> wrote:

> If it is supervised classification, your input should contain the groups.
> The idea is that you extend knowledge hidden in a smaller, perhaps
> expert-labeled dataset to the rest of the universe.
> On Nov 9, 2012 8:43 AM, "qiaoresearcher" <[email protected]> wrote:
>
>> It is a supervised classification problem.
>>
>> For example, a very simple case:
>> say, overall we collect 4 pages from the data set:
>> { web_page 1 web_page 2 web_page 3 web_page 4 }
>> then users may have input vectors like:
>> user1 [1 1 0 0]
>> user2 [1 1 0 0]
>> user3 [0 0 1 1]
>> user4 [0 0 1 1]
>> user5 [0 0 1 1]
>> ... ....
>>
>> then whatever classification algorithm mahout has should return
>> classification results as
>> group 1 { user1, user2 }
>> group 2 { user3, user4, user5 }
>>
>>
>>
>> On Fri, Nov 9, 2012 at 10:29 AM, Sean Owen <[email protected]> wrote:
>>
>> > First: what question are you trying to answer from this data? You are
>> > trying to classify users into what, for what purpose?
>> >
>> >
>> > On Fri, Nov 9, 2012 at 4:20 PM, qiaoresearcher <[email protected]> wrote:
>> >
>> > > Hi All,
>> > >
>> > > Assume the data is stored in a gzip file which includes many text files.
>> > > Within each text file, each line represents an activity of a user, for
>> > > example, a click on a web page.
>> > > the text file will look like:
>> > >
>> > > ------------------------------------------------------------
>> > > user 1 time11 visiting_web_page11
>> > > user 2 time21 visiting_web_page21
>> > > user 1 time12 visiting_web_page12
>> > > user 1 time13 visiting_web_page13
>> > > user 2 time22 visiting_web_page22
>> > > user 3 time31 visiting_web_page31
>> > > user 1 time14 visiting_web_page14
>> > > ... .... ..........
>> > >
>> > > I am thinking to first construct a web page set like
>> > > { visiting_web_page11, visiting_web_page12, visiting_web_page31, ....... }
>> > >
>> > > then for each user, we form a vector [ 1 0 0 1 0 0 ..... ] where
>> > > '1' means the user visited that page and 0 means he did not
>> > > then use mahout to classify the users based on the vectors
>> > >
>> > > does mahout have an example like this? if not, what kind of java code
>> > > do we need to write to process this task?
>> > >
>> > > thanks for any suggestions in advance!
>> > >
>> >
>> >
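
For the vector-building step described in the original question, here is a minimal sketch in plain Python. It is not Mahout code and not necessarily how anyone on this thread would do it. It assumes each log line is whitespace-separated, with the timestamp and the page as the last two fields and everything before them as the user id (so "user 1 time11 visiting_web_page11" parses as user "user 1"). The function name build_user_vectors is made up for illustration, and reading the gzip file is left out.

```python
from collections import defaultdict


def build_user_vectors(lines):
    """Turn 'user time page' log lines into one binary page-visit vector per user.

    Assumed (hypothetical) format: the last two whitespace-separated fields
    are the timestamp and the page; everything before them is the user id.
    """
    visits = defaultdict(set)  # user id -> set of pages that user visited
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        *user_parts, _time, page = parts
        visits[" ".join(user_parts)].add(page)

    # Fix a column order: one column per distinct page, sorted by name.
    pages = sorted({page for seen in visits.values() for page in seen})
    column = {page: i for i, page in enumerate(pages)}

    vectors = {}
    for user, seen in visits.items():
        vec = [0] * len(pages)
        for page in seen:
            vec[column[page]] = 1  # 1 means the user visited this page
        vectors[user] = vec
    return pages, vectors
```

These vectors on their own are only what you would send at prediction time. For supervised training, each training user would also need a group label attached, as noted at the top of this message.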
