Sean, by 'offset' do you mean basically subtracting the mean, but only from the non-zero elements in each row?

On Wed, 10 Aug 2016 at 19:02, Sean Owen <so...@cloudera.com> wrote:
> Yeah I had thought the same, that perhaps it's fine to let the
> StandardScaler proceed, if it's explicitly asked to center, rather
> than refuse to. It's not really much more rope to let a user hang
> herself with, and, blocks legitimate usages (we ran into this last
> week and couldn't use StandardScaler as a result).
>
> I'm personally supportive of the change and don't see a JIRA. I think
> you could at least make one.
>
> On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
> > Thanks Sean, I agree with 100% that the math is math and dense vs sparse is
> > just a matter of representation. I was trying to convince a co-worker of
> > this to no avail. Sending this email was mainly a sanity check.
> >
> > I think having an offset would be a great idea, although I am not sure how
> > to implement this. However, if anything should be done to rectify this
> > issue, it should be done in the standardScaler, not vectorAssembler. There
> > should not be any forcing of vectorAssembler to produce only dense vectors
> > so as to avoid performance problems with data that does not fit in memory.
> > Furthermore, not every machine learning algo requires standardization.
> > Instead, standardScaler should have withmean=True as default and should
> > apply an offset if the vector is sparse, whereas there would be normal
> > subtraction if the vector is dense. This way the default behavior of
> > standardScaler will always be what is generally understood to be
> > standardization, as opposed to people thinking they are standardizing when
> > they actually are not.
> >
> > Can anyone confirm whether there is a jira already?
> >
> > On Wed, Aug 10, 2016 at 10:58 AM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> Dense vs sparse is just a question of representation, so doesn't make
> >> an operation on a vector more or less important as a result. You've
> >> identified the reason that subtracting the mean can be undesirable: a
> >> notionally billion-element sparse vector becomes too big to fit in
> >> memory at once.
> >>
> >> I know this came up as a problem recently (I think there's a JIRA?)
> >> because VectorAssembler will *sometimes* output a small dense vector
> >> and sometimes output a small sparse vector based on how many zeroes
> >> there are. But that's bad because then the StandardScaler can't
> >> process the output at all. You can work on this if you're interested;
> >> I think the proposal was to be able to force a dense representation
> >> only in VectorAssembler. I don't know if that's the nature of the
> >> problem you're hitting.
> >>
> >> It can be meaningful to only scale the dimension without centering it,
> >> but it's not the same thing, no. The math is the math.
> >>
> >> This has come up a few times -- it's necessary to center a sparse
> >> vector but prohibitive to do so. One idea I'd toyed with in the past
> >> was to let a sparse vector have an 'offset' value applied to all
> >> elements. That would let you shift all values while preserving a
> >> sparse representation. I'm not sure if it's worth implementing but
> >> would help this case.
> >>
> >> On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
> >> > Hi everyone,
> >> >
> >> > I am doing some standardization using standardScaler on data from
> >> > VectorAssembler which is represented as sparse vectors. I plan to fit a
> >> > regularized model. However, standardScaler does not allow the mean to be
> >> > subtracted from sparse vectors. It will only divide by the standard
> >> > deviation, which I understand is to keep the vector sparse. Thus I am trying
> >> > to convert my sparse vectors into dense vectors, but this may not be
> >> > worthwhile.
> >> >
> >> > So my questions are:
> >> > Is subtracting the mean during standardization only important when working
> >> > with dense vectors? Does it not matter for sparse vectors? Is just dividing
> >> > by the standard deviation with sparse vectors equivalent to also dividing by
> >> > standard deviation w and subtracting mean with dense vectors?
> >> >
> >> > Thank you,
> >> > Tobi
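
For what it's worth, here is one possible reading of the 'offset' idea from this thread, sketched in plain Python. It is an assumption for illustration only, not Spark's StandardScaler API and not necessarily what Sean had in mind. The sparse rows are left untouched, and the centering is folded into a single shared term that is computed once per weight vector, so a dot product only visits the non-zero entries. All function names (column_stats, shared_offset, lazy_standardized_dot, dense_standardize) are hypothetical, rows are modelled as {index: value} dicts, and the population standard deviation is used for simplicity, which may differ from what a given library computes.

```python
import math


def column_stats(rows, size):
    """Per-feature mean and population std of sparse rows ({index: value} dicts).

    Indices missing from a row are implicit zeros.
    """
    n = len(rows)
    sums = [0.0] * size
    sq_sums = [0.0] * size
    for row in rows:
        for j, v in row.items():
            sums[j] += v
            sq_sums[j] += v * v
    means = [s / n for s in sums]
    stds = [math.sqrt(max(sq / n - m * m, 0.0)) for sq, m in zip(sq_sums, means)]
    return means, stds


def dense_standardize(row, means, stds):
    """Reference only: materialize the centered and scaled dense vector.

    Assumes every std is non-zero (no constant features).
    """
    return [(row.get(j, 0.0) - means[j]) / stds[j] for j in range(len(means))]


def shared_offset(weights, means, stds):
    """Row-independent part of the standardized dot product.

    sum_j ((x_j - mu_j) / sigma_j) * w_j
        = sum_j x_j * w_j / sigma_j  -  sum_j mu_j * w_j / sigma_j
    The second sum is the 'offset'; it is the same for every row.
    """
    return sum(m * w / s for m, w, s in zip(means, weights, stds))


def lazy_standardized_dot(row, weights, stds, offset):
    """Dot product with the standardized row, touching only non-zero entries."""
    sparse_part = sum(v * weights[j] / stds[j] for j, v in row.items())
    return sparse_part - offset


# Tiny made-up example: three rows, three features.
rows = [{0: 2.0}, {1: 4.0, 2: 1.0}, {0: 4.0, 2: 3.0}]
weights = [0.5, -1.0, 2.0]
means, stds = column_stats(rows, 3)
offset = shared_offset(weights, means, stds)
for row in rows:
    dense = dense_standardize(row, means, stds)
    reference = sum(d * w for d, w in zip(dense, weights))
    print(lazy_standardized_dot(row, weights, stds, offset), reference)
```

The two printed numbers on each line should agree up to floating-point rounding. The dense reference also shows why scaling alone is not the same thing: after centering, the positions that were zero in the sparse row become -mean/std, which is generally non-zero.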