You are right. I have labels for each user; I just need some example code to get the job running quickly.
The example code should have steps similar to what I described: read the gzip file, construct the web page set, form the input vector for each user, then call a classification/clustering algorithm. Does Mahout have an example like this?

On Fri, Nov 9, 2012 at 10:49 AM, Dmitriy Lyubimov <[email protected]> wrote:

> If it is supervised classification, your input should contain the groups.
> The idea is that you extend knowledge hidden in a smaller, perhaps
> expert-labeled, dataset to the rest of the universe.
> On Nov 9, 2012 8:43 AM, "qiaoresearcher" <[email protected]> wrote:
>
> > It is a supervised classification problem.
> >
> > For example, a very simple case:
> > say, overall we collect 4 pages from the data set:
> > { web_page 1  web_page 2  web_page 3  web_page 4 }
> > then users may have input vectors like:
> > user1 [1 1 0 0]
> > user2 [1 1 0 0]
> > user3 [0 0 1 1]
> > user4 [0 0 1 1]
> > user5 [0 0 1 1]
> > ... ....
> >
> > then whatever classification algorithm Mahout has should return
> > classification results as
> > group 1 { user1, user2 }
> > group 2 { user3, user4, user5 }
> >
> > On Fri, Nov 9, 2012 at 10:29 AM, Sean Owen <[email protected]> wrote:
> >
> > > First: what question are you trying to answer from this data? You are
> > > trying to classify users into what, for what purpose?
> > >
> > > On Fri, Nov 9, 2012 at 4:20 PM, qiaoresearcher <[email protected]> wrote:
> > >
> > > > Hi All,
> > > >
> > > > Assume the data is stored in a gzip file which includes many text files.
> > > > Within each text file, each line represents an activity of a user, for
> > > > example, a click on a web page.
> > > > The text file will look like:
> > > >
> > > > ----------------------------------------------------------------------------------
> > > > user 1 time11 visiting_web_page11
> > > > user 2 time21 visiting_web_page21
> > > > user 1 time12 visiting_web_page12
> > > > user 1 time13 visiting_web_page13
> > > > user 2 time22 visiting_web_page22
> > > > user 3 time31 visiting_web_page31
> > > > user 1 time14 visiting_web_page14
> > > > ... .... ..........
> > > >
> > > > I am thinking of first constructing a web page set like
> > > > { visiting_web_page11, visiting_web_page12, visiting_web_page31, ....... }
> > > >
> > > > then for each user, we form a vector [ 1 0 0 1 0 0 ..... ] where
> > > > '1' means the user visited that page and '0' means he did not,
> > > > then use Mahout to classify the users based on the vectors.
> > > >
> > > > Does Mahout have an example like this? If not, what kind of Java code do
> > > > we need to write to process this task?
> > > >
> > > > Thanks in advance for any suggestions!
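
For reference, the preprocessing steps described in the original question (read the gzip file, build the web page set, form a 0/1 vector per user) can be sketched in plain Java roughly as follows. This is only a sketch under several assumptions, not a Mahout example: it assumes the input is a single gzip-compressed text stream (a tar.gz of many files would need a tar reader first), that the last two whitespace-separated tokens on each line are the time and the page, and that everything before them is the user id. The class and method names (`UserPageVectors`, `collectVisits`, `indexPages`, `toVector`, `readGzipLines`) are made up for illustration. The per-user labels are not part of the sample format shown in the thread, so they are not handled here, and the call into a Mahout classifier is left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.zip.GZIPInputStream;

public class UserPageVectors {

    /** Groups "user time page" lines into user -> set of visited pages. */
    public static Map<String, Set<String>> collectVisits(List<String> lines) {
        Map<String, Set<String>> visits = new LinkedHashMap<>();
        for (String line : lines) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens.length < 3) {
                continue; // skip blank lines, separators, and malformed rows
            }
            // Assumption: last token is the page, second-to-last is the time,
            // and whatever precedes them is the user id (e.g. "user 1").
            String page = tokens[tokens.length - 1];
            String user = String.join(" ", Arrays.copyOf(tokens, tokens.length - 2));
            visits.computeIfAbsent(user, k -> new LinkedHashSet<>()).add(page);
        }
        return visits;
    }

    /** Assigns each distinct page a column index, in first-seen order. */
    public static Map<String, Integer> indexPages(Map<String, Set<String>> visits) {
        Map<String, Integer> pageIndex = new LinkedHashMap<>();
        for (Set<String> pages : visits.values()) {
            for (String page : pages) {
                pageIndex.putIfAbsent(page, pageIndex.size());
            }
        }
        return pageIndex;
    }

    /** Builds the 0/1 vector for one user: 1.0 where the page was visited. */
    public static double[] toVector(Set<String> visited, Map<String, Integer> pageIndex) {
        double[] vector = new double[pageIndex.size()];
        for (String page : visited) {
            Integer column = pageIndex.get(page);
            if (column != null) {
                vector[column] = 1.0;
            }
        }
        return vector;
    }

    /** Reads all lines of a single gzip-compressed text file. */
    public static List<String> readGzipLines(String path) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(Paths.get(path))),
                StandardCharsets.UTF_8))) {
            List<String> lines = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
            return lines;
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the gzip file (hypothetical input).
        Map<String, Set<String>> visits = collectVisits(readGzipLines(args[0]));
        Map<String, Integer> pageIndex = indexPages(visits);
        for (Map.Entry<String, Set<String>> entry : visits.entrySet()) {
            double[] vector = toVector(entry.getValue(), pageIndex);
            System.out.println(entry.getKey() + " " + Arrays.toString(vector));
        }
    }
}
```

With a large number of distinct pages the dense `double[]` will be mostly zeros, so a sparse vector representation would likely be a better fit before handing the data to a classifier. The sketch also loads the whole file into memory, which is only reasonable for small inputs.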
