Re: Standardization with Sparse Vectors

Nick Pentreath Wed, 10 Aug 2016 10:11:05 -0700

Sean, by 'offset' do you mean basically subtracting the mean, but only from
the non-zero elements in each row?
On Wed, 10 Aug 2016 at 19:02, Sean Owen <so...@cloudera.com> wrote:


> Yeah I had thought the same, that perhaps it's fine to let the
> StandardScaler proceed, if it's explicitly asked to center, rather
> than refuse to. Allowing it doesn't really give a user much more rope
> to hang herself with, and refusing blocks legitimate usages (we ran
> into this last week and couldn't use StandardScaler as a result).
>
> I'm personally supportive of the change and don't see a JIRA. I think
> you could at least make one.
>
> On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
> > Thanks Sean, I agree 100% that the math is the math and that dense vs
> > sparse is just a matter of representation. I was trying to convince a
> > co-worker of this, to no avail. Sending this email was mainly a sanity
> > check.
> >
> > I think having an offset would be a great idea, although I am not sure
> > how to implement it. However, if anything is done to rectify this
> > issue, it should be done in StandardScaler, not VectorAssembler.
> > VectorAssembler should not be forced to produce only dense vectors, so
> > as to avoid performance problems with data that does not fit in memory.
> > Furthermore, not every machine learning algo requires standardization.
> > Instead, StandardScaler should have withMean=True as the default and
> > should apply an offset if the vector is sparse, and do a normal
> > subtraction if the vector is dense. This way the default behavior of
> > StandardScaler will always be what is generally understood as
> > standardization, as opposed to people thinking they are standardizing
> > when they actually are not.
> >
> > Can anyone confirm whether there is a JIRA already?
> >
> > On Wed, Aug 10, 2016 at 10:58 AM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> Dense vs sparse is just a question of representation, so doesn't make
> >> an operation on a vector more or less important as a result. You've
> >> identified the reason that subtracting the mean can be undesirable: a
> >> notionally billion-element sparse vector becomes too big to fit in
> >> memory at once.
> >>
> >> I know this came up as a problem recently (I think there's a JIRA?)
> >> because VectorAssembler will *sometimes* output a small dense vector
> >> and sometimes output a small sparse vector based on how many zeroes
> >> there are. But that's bad because then the StandardScaler can't
> >> process the output at all. You can work on this if you're interested;
> >> I think the proposal was to be able to force a dense representation
> >> only in VectorAssembler. I don't know if that's the nature of the
> >> problem you're hitting.
> >>
> >> It can be meaningful to only scale the dimension without centering it,
> >> but it's not the same thing, no. The math is the math.
> >>
> >> This has come up a few times -- it's necessary to center a sparse
> >> vector but prohibitive to do so. One idea I'd toyed with in the past
> >> was to let a sparse vector have an 'offset' value applied to all
> >> elements. That would let you shift all values while preserving a
> >> sparse representation. I'm not sure if it's worth implementing but
> >> would help this case.
> >>
> >>
> >>
> >>
> >> On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
> >> > Hi everyone,
> >> >
> >> > I am doing some standardization using StandardScaler on data from
> >> > VectorAssembler, which is represented as sparse vectors. I plan to
> >> > fit a regularized model. However, StandardScaler does not allow the
> >> > mean to be subtracted from sparse vectors. It will only divide by
> >> > the standard deviation, which I understand is to keep the vector
> >> > sparse. Thus I am trying to convert my sparse vectors into dense
> >> > vectors, but this may not be worthwhile.
> >> >
> >> > So my questions are:
> >> > Is subtracting the mean during standardization only important when
> >> > working with dense vectors? Does it not matter for sparse vectors?
> >> > Is just dividing by the standard deviation with sparse vectors
> >> > equivalent to both subtracting the mean and dividing by the
> >> > standard deviation with dense vectors?
> >> >
> >> > Thank you,
> >> > Tobi
> >
> >
>
>
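
For anyone following along, here is a rough sketch of what the 'offset'
idea from earlier in the thread could look like: a sparse vector that also
carries one scalar added to every element, so shifting all values never
touches the stored entries. The class and method names are hypothetical
and are not part of Spark. It only covers a single scalar shift per
vector; subtracting a different mean per feature would need more than
this.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class OffsetSparseVector:
    """Hypothetical sparse vector: element i is offset + values.get(i, 0.0)."""
    size: int
    values: Dict[int, float] = field(default_factory=dict)
    offset: float = 0.0

    def __getitem__(self, i: int) -> float:
        if not 0 <= i < self.size:
            raise IndexError(i)
        return self.offset + self.values.get(i, 0.0)

    def shifted(self, delta: float) -> "OffsetSparseVector":
        # Shifting every element only changes the scalar offset,
        # so the number of stored entries stays the same.
        return OffsetSparseVector(self.size, dict(self.values),
                                  self.offset + delta)

    def scaled(self, factor: float) -> "OffsetSparseVector":
        # (offset + v) * factor == offset * factor + v * factor
        scaled_values = {i: v * factor for i, v in self.values.items()}
        return OffsetSparseVector(self.size, scaled_values,
                                  self.offset * factor)


# A notionally billion-element vector with two stored entries.
v = OffsetSparseVector(1_000_000_000, {3: 2.0, 7: 5.0})
centered = v.shifted(-0.5)      # subtract 0.5 from every element
print(centered[0], centered[3], len(centered.values))
```

The shifted vector still stores only two entries, which is the property
that would let centering proceed without materializing a dense vector.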
