Use a crypto-hash on the base data as the sorting key. The base data is the value (payload). That should randomly permute things.
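Concretely, a minimal sketch of that permutation as a Hadoop MapReduce job, assuming plain Text records (the class names here are hypothetical): the mapper keys each record by a SHA-1 digest of its payload, and the framework's sort-by-key during the shuffle then acts as a pseudo-random permutation.

import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class HashShuffle {

  // Key each record by a cryptographic hash of its payload. Since the
  // digest is effectively uniform, sorting by it puts the records in an
  // order unrelated to the input order.
  public static class HashMapper extends Mapper<Object, Text, Text, Text> {
    private MessageDigest md;

    @Override
    protected void setup(Context ctx) {
      try {
        md = MessageDigest.getInstance("SHA-1");
      } catch (NoSuchAlgorithmException e) {
        throw new RuntimeException(e); // SHA-1 ships with every JVM
      }
    }

    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      byte[] payload = Arrays.copyOf(value.getBytes(), value.getLength());
      byte[] digest = md.digest(payload); // digest() also resets md
      // Hex-encode so the keys sort bytewise as text.
      StringBuilder hex = new StringBuilder();
      for (byte b : digest) {
        hex.append(String.format("%02x", b));
      }
      ctx.write(new Text(hex.toString()), value);
    }
  }

  // Identity reducer that drops the hash key and emits the payloads,
  // now in hash order, i.e. randomly permuted.
  public static class DropKeyReducer
      extends Reducer<Text, Text, NullWritable, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      for (Text v : values) {
        ctx.write(NullWritable.get(), v);
      }
    }
  }
}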
On Wed, Jun 15, 2011 at 2:50 PM, Ted Dunning <[email protected]> wrote:
> It is already in Mahout, I think.
>
> On Tue, Jun 14, 2011 at 5:48 AM, Lance Norskog <[email protected]> wrote:
>
>> Coding a permutation like this in Map/Reduce is a good beginner exercise.
>>
>> On Sun, Jun 12, 2011 at 11:34 PM, Ted Dunning <[email protected]> wrote:
>> > But the key is that you have to have both kinds of samples. Moreover,
>> > for all of the stochastic gradient descent work, you need to have them
>> > in a random-ish order. You can't show all of one category and then
>> > all of another. It is even worse if you sort your data.
>> >
>> > On Mon, Jun 13, 2011 at 5:35 AM, Hector Yee <[email protected]> wrote:
>> >> If you have a much larger background set, you can try online passive
>> >> aggressive in Mahout 0.6, as it uses hinge loss and does not update
>> >> the model if it gets things correct. Log loss, in contrast, will
>> >> always have a gradient.
>> >> On Jun 12, 2011 7:54 AM, "Joscha Feth" <[email protected]> wrote:
>> >>> Hi Ted,
>> >>>
>> >>> I see. Only for the OLR, or also for any other algorithm? What if my
>> >>> other category theoretically contains an infinite number of samples?
>> >>>
>> >>> Cheers,
>> >>> Joscha
>> >>>
>> >>> On 12.06.2011 at 15:08, Ted Dunning <[email protected]> wrote:
>> >>>
>> >>>> Joscha,
>> >>>>
>> >>>> There is no implicit training. You need to give negative examples
>> >>>> as well as positive ones.
>> >>>>
>> >>>> On Sat, Jun 11, 2011 at 9:08 AM, Joscha Feth <[email protected]> wrote:
>> >>>>> Hello Ted,
>> >>>>>
>> >>>>> thanks for your response!
>> >>>>> What I wanted to accomplish is actually quite simple in theory: I
>> >>>>> have some sentences which have things in common (like some similar
>> >>>>> words, for example). I want to train my model with these example
>> >>>>> sentences. Once the model is trained, I want to give an unknown
>> >>>>> sentence to my classifier and get back a percentage expressing how
>> >>>>> similar the unknown sentence is to the sentences I trained my model
>> >>>>> with. So basically I have two categories (sentence is similar and
>> >>>>> sentence is not similar). To my understanding, it only makes sense
>> >>>>> to train my model with the positives (i.e. the sample sentences)
>> >>>>> and put them all into the same category (I chose category 0,
>> >>>>> because the .classifyScalar() method seems to return the
>> >>>>> probability for the first category, i.e. category 0). All other
>> >>>>> sentences fall implicitly (without being trained) into the second
>> >>>>> category (category 1).
>> >>>>>
>> >>>>> Does that make sense, or am I completely off here?
>> >>>>>
>> >>>>> Kind regards,
>> >>>>> Joscha Feth
>> >>>>>
>> >>>>> On Sat, Jun 11, 2011 at 03:46, Ted Dunning <[email protected]> wrote:
>> >>>>>>
>> >>>>>> The target variable here is always zero.
>> >>>>>>
>> >>>>>> Shouldn't it vary?
>> >>>>>>
>> >>>>>> On Fri, Jun 10, 2011 at 9:54 AM, Joscha Feth <[email protected]> wrote:
>> >>>>>>> algorithm.train(0, generateVector(animal));
>> >>>>>>>
>> >>>>>
>> >>>>>
>>
>> --
>> Lance Norskog
>> [email protected]
>>
>

--
Lance Norskog
[email protected]
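For reference, the two-category training Ted describes above looks roughly like this with Mahout's OnlineLogisticRegression; the encode() helper and the example sentences are hypothetical stand-ins for the generateVector() used in the thread. Note that Mahout's javadoc describes classifyScalar() as returning the score for category 1, so the label convention is worth double-checking against the assumption earlier in the thread.

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class TwoCategoryExample {
  private static final int FEATURES = 1000; // hashed feature space size

  // Hashed-feature encoding of a sentence; a stand-in for the
  // generateVector() helper from the thread.
  static Vector encode(String sentence) {
    StaticWordValueEncoder enc = new StaticWordValueEncoder("words");
    Vector v = new RandomAccessSparseVector(FEATURES);
    for (String word : sentence.split("\\s+")) {
      enc.addToVector(word, v);
    }
    return v;
  }

  public static void main(String[] args) {
    // Two categories: 1 = similar, 0 = not similar.
    OnlineLogisticRegression olr =
        new OnlineLogisticRegression(2, FEATURES, new L1());

    // Both categories must appear, interleaved in random-ish order;
    // never all of one category followed by all of the other.
    olr.train(1, encode("the quick brown fox jumps"));
    olr.train(0, encode("stock prices fell sharply today"));
    olr.train(1, encode("a quick red fox leaps"));
    olr.train(0, encode("parliament passed the budget bill"));

    // classifyScalar() returns a single score in the binary case.
    double score = olr.classifyScalar(encode("the quick fox runs"));
    System.out.println("score = " + score);
  }
}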
