Re: need help on mahout

qiaoresearcher Fri, 09 Nov 2012 09:07:01 -0800

You are absolutely right, but here I have simplified the problem. Content
similarity can be regarded as one way to enrich the features. Features can be
defined in many ways; here I would like to start with the simplest feature:
visited or not. Later on I will add more features if the results do not
meet expectations.
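
As a rough sketch of the "visited or not" feature, the log lines could be
turned into 0/1 vectors roughly as follows. This is plain Java and does not
use any Mahout API. The class name VisitVectors and the method fromLines are
hypothetical, and it assumes each line has exactly three fields (user, time,
page) separated by a tab or by two or more spaces, as in the sample log.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not part of Mahout): click-log lines to binary visit vectors. */
final class VisitVectors {

    /** Page vocabulary in first-seen order; index i is column i of every vector. */
    final List<String> pages = new ArrayList<>();

    /** One 0/1 vector per user, keyed by user id, in first-seen order of users. */
    final Map<String, int[]> vectors = new LinkedHashMap<>();

    static VisitVectors fromLines(List<String> lines) {
        // Pass 1: collect the set of pages each user visited and the global page set.
        Map<String, Set<String>> visited = new LinkedHashMap<>();
        Set<String> pageSet = new LinkedHashSet<>();
        for (String line : lines) {
            // Assumption: fields are separated by two or more spaces, or by a tab.
            String[] f = line.trim().split("\\s{2,}|\\t");
            if (f.length != 3) {
                continue; // skip lines that do not match the assumed layout
            }
            visited.computeIfAbsent(f[0], k -> new LinkedHashSet<>()).add(f[2]);
            pageSet.add(f[2]);
        }

        // Pass 2: emit one vector per user; 1 = visited that page, 0 = did not.
        VisitVectors out = new VisitVectors();
        out.pages.addAll(pageSet);
        for (Map.Entry<String, Set<String>> e : visited.entrySet()) {
            int[] v = new int[out.pages.size()];
            for (int i = 0; i < v.length; i++) {
                if (e.getValue().contains(out.pages.get(i))) {
                    v[i] = 1;
                }
            }
            out.vectors.put(e.getKey(), v);
        }
        return out;
    }

    public static void main(String[] args) {
        VisitVectors vv = fromLines(Arrays.asList(
                "user 1   time11  visiting_web_page11",
                "user 2   time21  visiting_web_page21",
                "user 1   time12  visiting_web_page12"));
        System.out.println(vv.pages);
        vv.vectors.forEach((user, v) -> System.out.println(user + " " + Arrays.toString(v)));
    }
}
```

The timestamp is ignored on purpose, since the feature is only visited or not.
For supervised training, each vector would still need a group label attached.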


On Fri, Nov 9, 2012 at 10:57 AM, Dmitriy Lyubimov <[email protected]> wrote:

> sorry, you probably meant that anyway. your training input should be labeled
> by groups and your prediction request input is not labeled.
>
> looks like a job for a classifier like sgd, except visited pages make up a
> poor categorical source without looking into their content similarities.
> On Nov 9, 2012 8:49 AM, "Dmitriy Lyubimov" <[email protected]> wrote:
>
> > if it is supervised classification, your input should contain the groups.
> > the idea is that you extend knowledge hidden in a smaller perhaps expert
> > labeled dataset to the rest of the universe.
> > On Nov 9, 2012 8:43 AM, "qiaoresearcher" <[email protected]> wrote:
> >
> >> It is a supervised classification problem.
> >>
> >> For example, a very simple case:
> >> say, overall we collect 4 pages from the data set:
> >> { web_page 1  web_page 2  web_page 3  web_page 4 }
> >> then users may have input vectors like:
> >> user1 [1 1  0  0]
> >> user2 [1 1  0  0]
> >> user3 [0 0  1  1]
> >> user4 [0 0  1  1]
> >> user5 [0 0  1  1]
> >>   ...       ....
> >>
> >> then whatever classification algorithm mahout has should return
> >> classification results as
> >> group 1 { user1, user2}
> >> group 2 { user3, user4, user5 }
> >>
> >>
> >>
> >> On Fri, Nov 9, 2012 at 10:29 AM, Sean Owen <[email protected]> wrote:
> >>
> >> > First: what question are you trying to answer from this data? You are
> >> > trying to classify users into what, for what purpose?
> >> >
> >> >
> >> > On Fri, Nov 9, 2012 at 4:20 PM, qiaoresearcher <[email protected]> wrote:
> >> >
> >> > > Hi All,
> >> > >
> >> > > Assume the data is stored in a gzip file which includes many text files.
> >> > > Within each text file, each line represents an activity of a user, for
> >> > > example, a click on a web page.
> >> > > the text file will look like:
> >> > >
> >> > > ----------------------------------------------------------------------------------
> >> > > user 1   time11  visiting_web_page11
> >> > > user 2   time21  visiting_web_page21
> >> > > user 1   time12  visiting_web_page12
> >> > > user 1   time13  visiting_web_page13
> >> > > user 2   time22  visiting_web_page22
> >> > > user 3   time31  visiting_web_page31
> >> > > user 1   time14  visiting_web_page14
> >> > >  ...           ....                ..........
> >> > >
> >> > > I am thinking to first construct a web page set like
> >> > > { visiting_web_page11, visiting_web_page12, visiting_web_page31, ....... }
> >> > >
> >> > > then for each user, we form a vector [ 1  0 0  1 0  0  .....    ] where
> >> > > '1' means the user visited that page and 0 means he did not
> >> > > then use mahout to classify the users based on the vectors
> >> > >
> >> > > does mahout have an example like this? if not, what kind of java code
> >> > > do we need to write to process this task?
> >> > >
> >> > > thanks for any suggestions in advance !
> >> > >
> >> >
> >>
> >
>
