All elements, I think. Imagine a sparse vector 1:3 3:7, which conceptually represents 0 3 0 7. Now imagine it also stores a single offset that applies to all elements: if the offset is -2, the vector represents -2 1 -2 5, at the cost of only one extra stored value. It only helps with storage of a shifted sparse vector; iterating over it still typically means touching every element.
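To make that concrete, here is a rough sketch of what such a representation could look like. This is purely hypothetical -- nothing like OffsetSparseVector exists in MLlib -- and it only illustrates the semantics described above:

```scala
// Hypothetical sketch only; not part of MLlib. Stored positions hold
// `value + offset`, and every unstored position implicitly holds `offset`.
case class OffsetSparseVector(
    size: Int,
    indices: Array[Int],   // sorted, strictly increasing
    values: Array[Double],
    offset: Double) {

  // Value at position i: stored value plus offset, or just the offset.
  def apply(i: Int): Double = {
    val j = java.util.Arrays.binarySearch(indices, i)
    if (j >= 0) values(j) + offset else offset
  }

  // Materialize densely, e.g. to check the semantics on small vectors.
  def toDense: Array[Double] = Array.tabulate(size)(apply)
}

// The example from above: 1:3 3:7 over 4 elements, shifted by -2.
val v = OffsetSparseVector(4, Array(1, 3), Array(3.0, 7.0), -2.0)
assert(v.toDense.sameElements(Array(-2.0, 1.0, -2.0, 5.0)))
```

As noted, the win is purely in storage: dot products, iteration and the like generally still have to account for every logical element once the offset is non-zero.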
Probably, where this would help, the caller can track this offset and apply that knowledge even more efficiently. I remember digging into this when looking at how sparse covariance matrices are computed; it almost, but not quite, enabled an optimization.

On Wed, Aug 10, 2016, 18:10 Nick Pentreath <nick.pentre...@gmail.com> wrote:

> Sean, by 'offset' do you mean basically subtracting the mean, but only from the non-zero elements in each row?
>
> On Wed, 10 Aug 2016 at 19:02, Sean Owen <so...@cloudera.com> wrote:
>
>> Yeah, I had thought the same: perhaps it's fine to let StandardScaler proceed if it's explicitly asked to center, rather than refuse to. It's not really much more rope for a user to hang herself with, and it blocks legitimate usages (we ran into this last week and couldn't use StandardScaler as a result).
>>
>> I'm personally supportive of the change and don't see a JIRA. I think you could at least make one.
>>
>> On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
>>
>> > Thanks Sean, I agree 100% that the math is the math and dense vs. sparse is just a matter of representation. I was trying to convince a co-worker of this to no avail. Sending this email was mainly a sanity check.
>> >
>> > I think having an offset would be a great idea, although I am not sure how to implement it. However, if anything should be done to rectify this issue, it should be done in StandardScaler, not VectorAssembler. There should not be any forcing of VectorAssembler to produce only dense vectors just to avoid performance problems with data that does not fit in memory. Furthermore, not every machine learning algorithm requires standardization. Instead, StandardScaler should have withMean=True as the default, and should apply an offset if the vector is sparse and do normal subtraction if the vector is dense. That way the default behavior of StandardScaler will always be what is generally understood to be standardization, as opposed to people thinking they are standardizing when they actually are not.
>> >
>> > Can anyone confirm whether there is a JIRA already?
>> >
>> > On Wed, Aug 10, 2016 at 10:58 AM, Sean Owen <so...@cloudera.com> wrote:
>> >
>> >> Dense vs. sparse is just a question of representation, so it doesn't make an operation on a vector more or less important. You've identified the reason that subtracting the mean can be undesirable: a notionally billion-element sparse vector becomes too big to fit in memory at once.
>> >>
>> >> I know this came up as a problem recently (I think there's a JIRA?) because VectorAssembler will *sometimes* output a small dense vector and sometimes output a small sparse vector, based on how many zeroes there are. That's bad because then StandardScaler can't process the output at all. You can work on this if you're interested; I think the proposal was to be able to force a dense-only representation in VectorAssembler. I don't know if that's the nature of the problem you're hitting.
>> >>
>> >> It can be meaningful to only scale a dimension without centering it, but it's not the same thing, no. The math is the math.
>> >>
>> >> This has come up a few times -- it's necessary to center a sparse vector but prohibitive to do so. One idea I'd toyed with in the past was to let a sparse vector have an 'offset' value applied to all elements.
>> >> That would let you shift all values while preserving a sparse representation. I'm not sure it's worth implementing, but it would help this case.
>> >>
>> >> On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
>> >>
>> >> > Hi everyone,
>> >> >
>> >> > I am doing some standardization using StandardScaler on data from VectorAssembler, which is represented as sparse vectors. I plan to fit a regularized model. However, StandardScaler does not allow the mean to be subtracted from sparse vectors; it will only divide by the standard deviation, which I understand is to keep the vector sparse. So I am trying to convert my sparse vectors into dense vectors, but this may not be worthwhile.
>> >> >
>> >> > So my questions are: Is subtracting the mean during standardization only important when working with dense vectors? Does it not matter for sparse vectors? Is just dividing by the standard deviation with sparse vectors equivalent to dividing by the standard deviation and subtracting the mean with dense vectors?
>> >> >
>> >> > Thank you,
>> >> > Tobi
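For anyone hitting the same wall, the densify-first workaround Tobi mentions looks roughly like the sketch below. This is not from the thread; it assumes a DataFrame `assembled` holding a VectorAssembler output column named "features" (both names are made up for illustration), and it is only reasonable when each vector is small enough to hold in dense form.

```scala
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// `assembled` and the "features" column name are assumptions for this sketch:
// the DataFrame produced by VectorAssembler, possibly containing sparse vectors.
def centerAndScale(assembled: DataFrame): DataFrame = {
  // Densify so that centering is possible; only viable when each vector
  // comfortably fits in memory as a dense vector.
  val toDense = udf { v: Vector => v.toDense: Vector }
  val densified = assembled.withColumn("denseFeatures", toDense(col("features")))

  val scaler = new StandardScaler()
    .setInputCol("denseFeatures")
    .setOutputCol("scaledFeatures")
    .setWithMean(true)  // centering, now applied to the dense representation
    .setWithStd(true)

  scaler.fit(densified).transform(densified)
}
```

The trade-off is exactly the one discussed above: you pay the memory cost of a dense representation in exchange for true standardization (center and scale) rather than scaling alone.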