A full sort is usually neither feasible nor desirable. It is better to keep a fixed-size pool of samples and replace randomly chosen members of the pool as new data arrives.
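That replacement-pool idea is classic reservoir sampling. A minimal standalone sketch (plain Java; the class name and types are illustrative, not Mahout API):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    /**
     * Keeps a fixed-size, uniformly random sample of a stream
     * (Vitter's Algorithm R). Illustrative sketch, not Mahout code.
     */
    public class ReservoirSampler<T> {
      private final int capacity;
      private final List<T> pool;
      private final Random rand = new Random();
      private long seen = 0;

      public ReservoirSampler(int capacity) {
        this.capacity = capacity;
        this.pool = new ArrayList<T>(capacity);
      }

      public void add(T item) {
        seen++;
        if (pool.size() < capacity) {
          pool.add(item);  // fill the pool first
        } else {
          // replace a random member with probability capacity / seen
          long j = (long) (rand.nextDouble() * seen);
          if (j < capacity) {
            pool.set((int) j, item);
          }
        }
      }

      public List<T> sample() {
        return pool;
      }
    }

Every item of the stream ends up in the pool with equal probability capacity / seen, so the pool stays an unbiased sample with no sort at all.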
On Thu, Jun 16, 2011 at 2:41 AM, Lance Norskog <[email protected]> wrote:
> Use a crypto-hash on the base data as the sorting key. The base data
> is the value (payload). That should randomly permute things.
>
> On Wed, Jun 15, 2011 at 2:50 PM, Ted Dunning <[email protected]> wrote:
> > It is already in Mahout, I think.
> >
> > On Tue, Jun 14, 2011 at 5:48 AM, Lance Norskog <[email protected]> wrote:
> >
> >> Coding a permutation like this in Map/Reduce is a good beginner exercise.
> >>
> >> On Sun, Jun 12, 2011 at 11:34 PM, Ted Dunning <[email protected]> wrote:
> >> > But the key is that you have to have both kinds of samples. Moreover,
> >> > for all of the stochastic gradient descent work, you need to have them
> >> > in a random-ish order. You can't show all of one category and then
> >> > all of another. It is even worse if you sort your data.
> >> >
> >> > On Mon, Jun 13, 2011 at 5:35 AM, Hector Yee <[email protected]> wrote:
> >> >> If you have a much larger background set you can try online passive
> >> >> aggressive in Mahout 0.6, as it uses hinge loss and does not update
> >> >> the model if it gets things correct. Log loss will always have a
> >> >> gradient, in contrast.
> >> >>
> >> >> On Jun 12, 2011 7:54 AM, "Joscha Feth" <[email protected]> wrote:
> >> >>> Hi Ted,
> >> >>>
> >> >>> I see. Only for the OLR or also for any other algorithm? What if my
> >> >>> other category theoretically contains an infinite number of samples?
> >> >>>
> >> >>> Cheers,
> >> >>> Joscha
> >> >>>
> >> >>> On 12.06.2011, at 15:08, Ted Dunning <[email protected]> wrote:
> >> >>>
> >> >>>> Joscha,
> >> >>>>
> >> >>>> There is no implicit training. You need to give negative examples
> >> >>>> as well as positive ones.
> >> >>>>
> >> >>>> On Sat, Jun 11, 2011 at 9:08 AM, Joscha Feth <[email protected]> wrote:
> >> >>>>> Hello Ted,
> >> >>>>>
> >> >>>>> Thanks for your response!
> >> >>>>> What I want to accomplish is actually quite simple in theory: I
> >> >>>>> have some sentences which have things in common (like some similar
> >> >>>>> words, for example). I want to train my model with these example
> >> >>>>> sentences. Once it is trained, I want to give an unknown sentence
> >> >>>>> to my classifier and get back a percentage indicating how similar
> >> >>>>> the unknown sentence is to the sentences I trained my model with.
> >> >>>>> So basically I have two categories (sentence is similar, and
> >> >>>>> sentence is not similar). To my understanding it only makes sense
> >> >>>>> to train my model with the positives (i.e. the sample sentences)
> >> >>>>> and put them all into the same category (I chose category 0,
> >> >>>>> because the .classifyScalar() method seems to return the
> >> >>>>> probability for the first category, i.e. category 0). All other
> >> >>>>> sentences are implicitly (but not trained) in the second category
> >> >>>>> (category 1).
> >> >>>>>
> >> >>>>> Does that make sense, or am I completely off here?
> >> >>>>>
> >> >>>>> Kind regards,
> >> >>>>> Joscha Feth
> >> >>>>>
> >> >>>>> On Sat, Jun 11, 2011 at 03:46, Ted Dunning <[email protected]> wrote:
> >> >>>>>>
> >> >>>>>> The target variable here is always zero.
> >> >>>>>>
> >> >>>>>> Shouldn't it vary?
> >> >>>>>>
> >> >>>>>> On Fri, Jun 10, 2011 at 9:54 AM, Joscha Feth <[email protected]> wrote:
> >> >>>>>>> algorithm.train(0, generateVector(animal));
>
> --
> Lance Norskog
> [email protected]
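Lance's crypto-hash trick amounts to keying each record by a digest of its payload and letting the sort order of the keys supply the permutation; in a MapReduce job, the shuffle/sort phase does the sorting for free. A minimal single-machine sketch, assuming MD5 is an acceptable mixing function (plain Java, not Mahout API):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Arrays;
    import java.util.Comparator;

    /** Permutes records deterministically by sorting on a hash of each record. */
    public class HashShuffle {

      static byte[] digest(String s) {
        try {
          return MessageDigest.getInstance("MD5")
              .digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
          throw new IllegalStateException(e);  // MD5 is always available
        }
      }

      public static String[] shuffle(String[] records) {
        String[] out = records.clone();
        // Sorting on the hash of the payload approximates a random
        // permutation as long as the hash mixes uniformly.
        Arrays.sort(out, Comparator.comparing((String r) -> toHex(digest(r))));
        return out;
      }

      static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
      }
    }

In the MapReduce version, the mapper would emit (digest(record), record) and an identity reducer would write the records back out in key order.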

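Putting the thread's advice together for Joscha's case: train on both categories, in shuffled order. A hedged sketch against Mahout's SGD learner of the 0.5/0.6 era (OnlineLogisticRegression with an L1 prior); the Example holder and the similar/not-similar labeling are illustrative, and featurization is left to the caller just as generateVector() was in the original code:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Vector;

    public class TwoClassTraining {

      /** Label/feature pair; illustrative holder, not a Mahout type. */
      static class Example {
        final int label;        // 0 = similar, 1 = not similar
        final Vector features;  // whatever generateVector() produces
        Example(int label, Vector features) {
          this.label = label;
          this.features = features;
        }
      }

      static OnlineLogisticRegression train(List<Example> positives,
                                            List<Example> negatives,
                                            int numFeatures) {
        List<Example> all = new ArrayList<Example>();
        all.addAll(positives);
        all.addAll(negatives);
        // SGD wants a random-ish order: never all of one category first.
        Collections.shuffle(all, new Random(42));

        OnlineLogisticRegression learner =
            new OnlineLogisticRegression(2, numFeatures, new L1());
        for (Example e : all) {
          learner.train(e.label, e.features);  // both target values occur
        }
        return learner;
      }
    }

With explicit negatives in the mix, classifyScalar() on the two-category model should return a score for category 1 (not similar, under this labeling), so 1 - score is the similarity probability Joscha was after.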