A full sort is usually neither feasible nor desirable. It is better to keep a fixed-size pool of samples and replace randomly chosen members of the pool as new data arrives.
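That replacement-pool idea is classic reservoir sampling. A minimal standalone sketch (plain Java; the class name and types are illustrative, not Mahout API):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    /**
     * Keeps a fixed-size, uniformly random sample of a stream
     * (Vitter's Algorithm R). Illustrative sketch, not Mahout code.
     */
    public class ReservoirSampler<T> {
      private final int capacity;
      private final List<T> pool;
      private final Random rand = new Random();
      private long seen = 0;

      public ReservoirSampler(int capacity) {
        this.capacity = capacity;
        this.pool = new ArrayList<T>(capacity);
      }

      public void add(T item) {
        seen++;
        if (pool.size() < capacity) {
          pool.add(item);  // fill the pool first
        } else {
          // replace a random member with probability capacity / seen
          long j = (long) (rand.nextDouble() * seen);
          if (j < capacity) {
            pool.set((int) j, item);
          }
        }
      }

      public List<T> sample() {
        return pool;
      }
    }

Every item of the stream ends up in the pool with equal probability capacity / seen, so the pool stays an unbiased sample with no sort at all.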
On Thu, Jun 16, 2011 at 2:41 AM, Lance Norskog <[email protected]> wrote:
> Use a crypto-hash on the base data as the sorting key. The base data
> is the value (payload). That should randomly permute things.
>
> On Wed, Jun 15, 2011 at 2:50 PM, Ted Dunning <[email protected]> wrote:
> > It is already in Mahout, I think.
> >
> > On Tue, Jun 14, 2011 at 5:48 AM, Lance Norskog <[email protected]> wrote:
> >
> >> Coding a permutation like this in Map/Reduce is a good beginner exercise.
> >>
> >> On Sun, Jun 12, 2011 at 11:34 PM, Ted Dunning <[email protected]> wrote:
> >> > But the key is that you have to have both kinds of samples. Moreover,
> >> > for all of the stochastic gradient descent work, you need to have them
> >> > in a random-ish order. You can't show all of one category and then
> >> > all of another. It is even worse if you sort your data.
> >> >
> >> > On Mon, Jun 13, 2011 at 5:35 AM, Hector Yee <[email protected]> wrote:
> >> >> If you have a much larger background set you can try online passive
> >> >> aggressive in Mahout 0.6, as it uses hinge loss and does not update
> >> >> the model if it gets things correct. Log loss will always have a
> >> >> gradient, in contrast.
> >> >>
> >> >> On Jun 12, 2011 7:54 AM, "Joscha Feth" <[email protected]> wrote:
> >> >>> Hi Ted,
> >> >>>
> >> >>> I see. Only for the OLR or also for any other algorithm? What if my
> >> >>> other category theoretically contains an infinite number of samples?
> >> >>>
> >> >>> Cheers,
> >> >>> Joscha
> >> >>>
> >> >>> On 12.06.2011, at 15:08, Ted Dunning <[email protected]> wrote:
> >> >>>
> >> >>>> Joscha,
> >> >>>>
> >> >>>> There is no implicit training. You need to give negative examples
> >> >>>> as well as positive ones.
> >> >>>>
> >> >>>> On Sat, Jun 11, 2011 at 9:08 AM, Joscha Feth <[email protected]> wrote:
> >> >>>>> Hello Ted,
> >> >>>>>
> >> >>>>> Thanks for your response!
> >> >>>>> What I want to accomplish is actually quite simple in theory: I
> >> >>>>> have some sentences which have things in common (like some similar
> >> >>>>> words, for example). I want to train my model with these example
> >> >>>>> sentences. Once it is trained, I want to give an unknown sentence
> >> >>>>> to my classifier and get back a percentage indicating how similar
> >> >>>>> the unknown sentence is to the sentences I trained my model with.
> >> >>>>> So basically I have two categories (sentence is similar, and
> >> >>>>> sentence is not similar). To my understanding it only makes sense
> >> >>>>> to train my model with the positives (i.e. the sample sentences)
> >> >>>>> and put them all into the same category (I chose category 0,
> >> >>>>> because the .classifyScalar() method seems to return the
> >> >>>>> probability for the first category, i.e. category 0). All other
> >> >>>>> sentences are implicitly (but not trained) in the second category
> >> >>>>> (category 1).
> >> >>>>>
> >> >>>>> Does that make sense, or am I completely off here?
> >> >>>>>
> >> >>>>> Kind regards,
> >> >>>>> Joscha Feth
> >> >>>>>
> >> >>>>> On Sat, Jun 11, 2011 at 03:46, Ted Dunning <[email protected]> wrote:
> >> >>>>>>
> >> >>>>>> The target variable here is always zero.
> >> >>>>>>
> >> >>>>>> Shouldn't it vary?
> >> >>>>>>
> >> >>>>>> On Fri, Jun 10, 2011 at 9:54 AM, Joscha Feth <[email protected]> wrote:
> >> >>>>>>> algorithm.train(0, generateVector(animal));
>
> --
> Lance Norskog
> [email protected]
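Lance's crypto-hash trick amounts to keying each record by a digest of its payload and letting the sort order of the keys supply the permutation; in a MapReduce job, the shuffle/sort phase does the sorting for free. A minimal single-machine sketch, assuming MD5 is an acceptable mixing function (plain Java, not Mahout API):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Arrays;
    import java.util.Comparator;

    /** Permutes records deterministically by sorting on a hash of each record. */
    public class HashShuffle {

      static byte[] digest(String s) {
        try {
          return MessageDigest.getInstance("MD5")
              .digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
          throw new IllegalStateException(e);  // MD5 is always available
        }
      }

      public static String[] shuffle(String[] records) {
        String[] out = records.clone();
        // Sorting on the hash of the payload approximates a random
        // permutation as long as the hash mixes uniformly.
        Arrays.sort(out, Comparator.comparing((String r) -> toHex(digest(r))));
        return out;
      }

      static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
      }
    }

In the MapReduce version, the mapper would emit (digest(record), record) and an identity reducer would write the records back out in key order.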

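Putting the thread's advice together for Joscha's case: train on both categories, in shuffled order. A hedged sketch against Mahout's SGD learner of the 0.5/0.6 era (OnlineLogisticRegression with an L1 prior); the Example holder and the similar/not-similar labeling are illustrative, and featurization is left to the caller just as generateVector() was in the original code:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Vector;

    public class TwoClassTraining {

      /** Label/feature pair; illustrative holder, not a Mahout type. */
      static class Example {
        final int label;        // 0 = similar, 1 = not similar
        final Vector features;  // whatever generateVector() produces
        Example(int label, Vector features) {
          this.label = label;
          this.features = features;
        }
      }

      static OnlineLogisticRegression train(List<Example> positives,
                                            List<Example> negatives,
                                            int numFeatures) {
        List<Example> all = new ArrayList<Example>();
        all.addAll(positives);
        all.addAll(negatives);
        // SGD wants a random-ish order: never all of one category first.
        Collections.shuffle(all, new Random(42));

        OnlineLogisticRegression learner =
            new OnlineLogisticRegression(2, numFeatures, new L1());
        for (Example e : all) {
          learner.train(e.label, e.features);  // both target values occur
        }
        return learner;
      }
    }

With explicit negatives in the mix, classifyScalar() on the two-category model should return a score for category 1 (not similar, under this labeling), so 1 - score is the similarity probability Joscha was after.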