No, that doesn't describe the change being discussed, since you've copied the discussion about adding an 'offset'. That's orthogonal. You're also suggesting making withMean=True the default, which we don't want. The point is that if this is *explicitly* requested, the scaler shouldn't refuse to subtract the mean from a sparse vector, and fail.
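For reference, the sparse-to-dense conversion attempted in the thread below can be written as a vector-returning UDF. A minimal sketch, assuming Spark 2.x (on 1.x the same classes live in pyspark.mllib.linalg) and a hypothetical DataFrame df with a "features" column; the key is that the declared return type must be VectorUDT(), since no primitive type such as DoubleType() can represent a vector of doubles:

    from pyspark.ml.linalg import Vectors, VectorUDT
    from pyspark.sql.functions import udf

    # The UDF returns a whole vector, so declare VectorUDT(), not
    # DoubleType(); a DenseVector cannot be cast to a single float.
    sparse_to_dense = udf(lambda v: Vectors.dense(v.toArray()), VectorUDT())

    dense_df = df.withColumn("denseFeatures", sparse_to_dense("features"))

Note that toArray() densifies each row on the workers, so this only moves the memory cost around rather than avoiding it.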
On Wed, Aug 10, 2016 at 8:47 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
> Sean,
>
> I have created a jira; I hope you don't mind that I borrowed your
> explanation of "offset". https://issues.apache.org/jira/browse/SPARK-17001
>
> So what did you do to standardize your data, if you didn't use
> standardScaler? Did you write a udf to subtract the mean and divide by
> the standard deviation?
>
> Although I know this is not the best approach for something I plan to
> put in production, I have been trying to write a udf to turn the sparse
> vector into a dense one and apply the udf in withColumn(). withColumn()
> complains that the data is a tuple. I think the issue might be the data
> type parameter. The function returns a vector of doubles, but no
> primitive type is adequate for that.
>
> sparseToDense=udf(lambda data: float(DenseVector([data.toArray()])),
>     DoubleType())
> denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
>     sparseToDense("features"))
>
> The function works outside the udf, but I am unable to add an arbitrary
> column to the data frame I started out working with. Thoughts?
>
> denseFeatures=TrainingRdf.select("features").map(lambda data:
>     DenseVector([data.features.toArray()]))
> denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
>     denseFeatures)
>
> Thanks,
> Tobi
>
>
> On Wed, Aug 10, 2016 at 1:01 PM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>>
>> Ah right, got it. As you say, for storage it helps significantly, but
>> for operations I suspect it puts one back in a "dense-like" position.
>> Still, for online / mini-batch algorithms it may be feasible, I guess.
>>
>> On Wed, 10 Aug 2016 at 19:50, Sean Owen <so...@cloudera.com> wrote:
>>>
>>> All elements, I think. Imagine a sparse vector 1:3 3:7, which
>>> conceptually represents 0 3 0 7. Imagine it also has an offset stored
>>> which applies to all elements. If it is -2, then it now represents
>>> -2 1 -2 5, but this requires just one extra value to store. It only
>>> helps with storage of a shifted sparse vector; iterating still
>>> typically requires iterating all elements.
>>>
>>> Probably, where this would help, the caller can track this offset and
>>> even more efficiently apply this knowledge. I remember digging into
>>> this in how sparse covariance matrices are computed. It almost, but
>>> not quite, enabled an optimization.
>>>
>>> On Wed, Aug 10, 2016, 18:10 Nick Pentreath <nick.pentre...@gmail.com>
>>> wrote:
>>>>
>>>> Sean, by 'offset' do you mean basically subtracting the mean, but
>>>> only from the non-zero elements in each row?
>>>>
>>>> On Wed, 10 Aug 2016 at 19:02, Sean Owen <so...@cloudera.com> wrote:
>>>>>
>>>>> Yeah, I had thought the same: perhaps it's fine to let the
>>>>> StandardScaler proceed, if it's explicitly asked to center, rather
>>>>> than refuse to. It's not really much more rope to let a user hang
>>>>> herself with, and it blocks legitimate usages (we ran into this
>>>>> last week and couldn't use StandardScaler as a result).
>>>>>
>>>>> I'm personally supportive of the change and don't see a JIRA. I
>>>>> think you could at least make one.
>>>>>
>>>>> On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede <ani.to...@gmail.com>
>>>>> wrote:
>>>>> > Thanks Sean, I agree 100% that the math is the math, and dense
>>>>> > vs sparse is just a matter of representation. I was trying to
>>>>> > convince a co-worker of this to no avail. Sending this email was
>>>>> > mainly a sanity check.
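A rough sketch of the 'offset' idea Sean describes above; this is a hypothetical class, not an existing Spark type, built around his example vector:

    # A sparse vector plus one scalar 'offset' applied to every element,
    # stored and unstored alike: one extra value shifts the whole vector.
    class OffsetSparseVector:
        def __init__(self, size, indices, values, offset=0.0):
            self.size = size
            self.nonzeros = dict(zip(indices, values))
            self.offset = offset

        def __getitem__(self, i):
            return self.nonzeros.get(i, 0.0) + self.offset

    v = OffsetSparseVector(4, [1, 3], [3.0, 7.0])  # represents 0 3 0 7
    v.offset = -2.0                                # now represents -2 1 -2 5
    assert [v[i] for i in range(4)] == [-2.0, 1.0, -2.0, 5.0]

As noted above, storage stays sparse but iteration does not: once the offset is nonzero, any whole-vector operation has to visit all size elements.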
>>>>> >
>>>>> > I think having an offset would be a great idea, although I am not
>>>>> > sure how to implement it. However, if anything should be done to
>>>>> > rectify this issue, it should be done in standardScaler, not
>>>>> > vectorAssembler. vectorAssembler should not be forced to produce
>>>>> > only dense vectors; that would cause performance problems with
>>>>> > data that does not fit in memory. Furthermore, not every machine
>>>>> > learning algo requires standardization. Instead, standardScaler
>>>>> > should have withMean=True as the default, and should apply an
>>>>> > offset if the vector is sparse, with normal subtraction if the
>>>>> > vector is dense. This way the default behavior of standardScaler
>>>>> > will always be what is generally understood to be standardization,
>>>>> > as opposed to people thinking they are standardizing when they
>>>>> > actually are not.
>>>>> >
>>>>> > Can anyone confirm whether there is a jira already?
>>>>> >
>>>>> > On Wed, Aug 10, 2016 at 10:58 AM, Sean Owen <so...@cloudera.com>
>>>>> > wrote:
>>>>> >>
>>>>> >> Dense vs sparse is just a question of representation, so it
>>>>> >> doesn't make an operation on a vector more or less important as
>>>>> >> a result. You've identified the reason that subtracting the mean
>>>>> >> can be undesirable: a notionally billion-element sparse vector
>>>>> >> becomes too big to fit in memory at once.
>>>>> >>
>>>>> >> I know this came up as a problem recently (I think there's a
>>>>> >> JIRA?) because VectorAssembler will *sometimes* output a small
>>>>> >> dense vector and sometimes output a small sparse vector, based
>>>>> >> on how many zeroes there are. But that's bad because then the
>>>>> >> StandardScaler can't process the output at all. You can work on
>>>>> >> this if you're interested; I think the proposal was to be able
>>>>> >> to force a dense representation only in VectorAssembler. I don't
>>>>> >> know if that's the nature of the problem you're hitting.
>>>>> >>
>>>>> >> It can be meaningful to only scale a dimension without centering
>>>>> >> it, but it's not the same thing, no. The math is the math.
>>>>> >>
>>>>> >> This has come up a few times -- it's necessary to center a
>>>>> >> sparse vector but prohibitive to do so. One idea I'd toyed with
>>>>> >> in the past was to let a sparse vector have an 'offset' value
>>>>> >> applied to all elements. That would let you shift all values
>>>>> >> while preserving a sparse representation. I'm not sure it's
>>>>> >> worth implementing, but it would help this case.
>>>>> >>
>>>>> >> On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede <ani.to...@gmail.com>
>>>>> >> wrote:
>>>>> >> > Hi everyone,
>>>>> >> >
>>>>> >> > I am doing some standardization using standardScaler on data
>>>>> >> > from VectorAssembler, which is represented as sparse vectors.
>>>>> >> > I plan to fit a regularized model. However, standardScaler
>>>>> >> > does not allow the mean to be subtracted from sparse vectors.
>>>>> >> > It will only divide by the standard deviation, which I
>>>>> >> > understand is to keep the vector sparse. Thus I am trying to
>>>>> >> > convert my sparse vectors into dense vectors, but this may not
>>>>> >> > be worthwhile.
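The refusal described above looks roughly like this; a sketch assuming the behavior at the time of the thread (before SPARK-17001 was resolved) and a hypothetical DataFrame df whose "features" column holds sparse vectors:

    from pyspark.ml.feature import StandardScaler

    scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                            withMean=True, withStd=True)
    model = scaler.fit(df)
    # With SparseVector rows and withMean=True, transform() raises an
    # error instead of densifying; withMean=False runs, but then it only
    # divides by the standard deviation.
    scaled = model.transform(df)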
>>>>> >> >
>>>>> >> > So my questions are: Is subtracting the mean during
>>>>> >> > standardization only important when working with dense
>>>>> >> > vectors? Does it not matter for sparse vectors? Is just
>>>>> >> > dividing by the standard deviation with sparse vectors
>>>>> >> > equivalent to dividing by the standard deviation and
>>>>> >> > subtracting the mean with dense vectors?
>>>>> >> >
>>>>> >> > Thank you,
>>>>> >> > Tobi
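On that last question: no, the two are not equivalent. Dividing by the standard deviation without subtracting the mean leaves every coordinate shifted by mean/std relative to true standardization. A quick plain-Python check (illustrative values only):

    xs = [0.0, 3.0, 0.0, 7.0]
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5
    scaled_only = [x / std for x in xs]            # divide by std only
    standardized = [(x - mean) / std for x in xs]  # subtract mean too
    # Every coordinate differs by the same constant, mean / std:
    assert all(abs((s - z) - mean / std) < 1e-12
               for s, z in zip(scaled_only, standardized))

So scale-only output has unit variance but not zero mean, which is why it is not what is generally understood to be standardization.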