Re: Standardization with Sparse Vectors

2016-08-11 Thread Tobi Bosede
Can someone also provide input on why my code may not be working? Below I
have pasted the part of my previous reply that describes the issue. I am
really more perplexed about the first block of code below; I know why the
second block doesn't work, as it was just something I initially tried.

>> Although I know this is not the best approach for something I plan to put in
>> production, I have been trying to write a udf to turn the sparse vector into
>> a dense one and apply the udf in withcolumn(). withColumn() complains that
>> the data is a tuple. I think the issue might be the datatype parameter. The
>> function returns a vector of doubles but there is no type that would be
>> adequate for this.
>>



>> sparseToDense=udf(lambda data: float(DenseVector([data.toArray()])),
>> DoubleType())
>> denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
>> sparseToDense("features"))
>>
>> The function works outside the udf, but I am unable to add an
>> arbitrary column to the data frame I started out working with. Thoughts?
>>
>> denseFeatures=TrainingRdf.select("features").map(lambda data:
>> DenseVector([data.features.toArray()]))
>> denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
>> denseFeatures)
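
For reference, a minimal sketch of how the UDF in the first snippet could be
written so that it returns a vector column: drop the float() wrapper and the
extra list around data.toArray(), and declare VectorUDT (rather than
DoubleType) as the return type. This assumes the trainingRdfAssemb DataFrame
from the snippet above; on Spark 2.x the classes live in pyspark.ml.linalg,
on 1.x in pyspark.mllib.linalg.

from pyspark.sql.functions import udf
from pyspark.ml.linalg import DenseVector, VectorUDT  # pyspark.mllib.linalg on Spark 1.x

# Return the DenseVector itself and declare the vector UDT as the return type,
# so withColumn() receives a vector column rather than a double.
sparseToDense = udf(lambda v: DenseVector(v.toArray()), VectorUDT())
denseTrainingRdf = trainingRdfAssemb.withColumn("denseFeatures", sparseToDense("features"))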

On Thu, Aug 11, 2016 at 12:55 PM, Sean Owen  wrote:

> I should be more clear, since the outcome of the discussion above was
> not that obvious actually.
>
> - I agree a change should be made to StandardScaler, and not
> VectorAssembler
> - However I do think withMean should still be false by default and be
> explicitly enabled
> - The 'offset' idea is orthogonal, and as Nick says may be problematic
> anyway a step or two down the line. I'm proposing just converting to
> dense vectors if asked to center (which is why it shouldn't be the
> default)
>
> Indeed to answer your question, that's how I had resolved this in user
> code earlier. It's the same thing you're suggesting here, to make a
> UDF that converts the vectors to dense vectors manually.
>
> I updated the JIRA accordingly, to suggest converting to DenseVector
> in StandardScaler if withMean is set explicitly to true. I think we
> should consider something like the 'offset' idea separately if at all.
>
> On Thu, Aug 11, 2016 at 11:02 AM, Sean Owen  wrote:
> > No, that doesn't describe the change being discussed, since you've
> > copied the discussion about adding an 'offset'. That's orthogonal.
> > You're also suggesting making withMean=True the default, which we
> > don't want. The point is that if this is *explicitly* requested, the
> > scaler shouldn't refuse to subtract the mean from a sparse vector, and
> > fail.
> >
> > On Wed, Aug 10, 2016 at 8:47 PM, Tobi Bosede 
> wrote:
> >> Sean,
> >>
> >> I have created a jira; I hope you don't mind that I borrowed your
> >> explanation of "offset". https://issues.apache.org/
> jira/browse/SPARK-17001
> >>
> >> So what did you do to standardize your data, if you didn't use
> >> standardScaler? Did you write a udf to subtract mean and divide by
> standard
> >> deviation?
> >>
> >> Although I know this is not the best approach for something I plan to
> put in
> >> production, I have been trying to write a udf to turn the sparse vector
> into
> >> a dense one and apply the udf in withcolumn(). withColumn() complains
> that
> >> the data is a tuple. I think the issue might be the datatype parameter.
> The
> >> function returns a vector of doubles but there is no type that would be
> >> adequate for this.
> >>
> >> sparseToDense=udf(lambda data: float(DenseVector([data.toArray()])),
> >> DoubleType())
> >> denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
> >> sparseToDense("features"))
> >>
> >> However the function works outside the udf, but I am unable to add an
> >> arbitrary column to the data frame I started out working with. Thoughts?
> >>
> >> denseFeatures=TrainingRdf.select("features").map(lambda data:
> >> DenseVector([data.features.toArray()]))
> >> denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
> >> denseFeatures)
> >>
> >> Thanks,
> >> Tobi
> >>
> >>
> >> On Wed, Aug 10, 2016 at 1:01 PM, Nick Pentreath <
> nick.pentre...@gmail.com>
> >> wrote:
> >>>
> >>> Ah right, got it. As you say for storage it helps significantly, but
> for
> >>> operations I suspect it puts one back in a "dense-like" position.
> Still, for
> >>> online / mini-batch algorithms it may still be feasible I guess.
> >>> On Wed, 10 Aug 2016 at 19:50, Sean Owen  wrote:
> 
 All elements, I think. Imagine a sparse vector 1:3 3:7 which conceptually
 represents 0 3 0 7. Imagine it also has an offset stored which applies to
 all elements. If it is -2 then it now represents -2 1 -2 5, but this
 requires just one extra value to store. It only helps with storage of a
 shifted sparse vector; iterating still typically requires iterating all
 elements.

 Probably, where this would help, the caller can track this offset and even
 more efficiently apply this knowledge.

Re: Standardization with Sparse Vectors

2016-08-11 Thread Sean Owen
I should be more clear, since the outcome of the discussion above was
not that obvious actually.

- I agree a change should be made to StandardScaler, and not VectorAssembler
- However I do think withMean should still be false by default and be
explicitly enabled
- The 'offset' idea is orthogonal, and as Nick says may be problematic
anyway a step or two down the line. I'm proposing just converting to
dense vectors if asked to center (which is why it shouldn't be the
default)

Indeed to answer your question, that's how I had resolved this in user
code earlier. It's the same thing you're suggesting here, to make a
UDF that converts the vectors to dense vectors manually.

I updated the JIRA accordingly, to suggest converting to DenseVector
in StandardScaler if withMean is set explicitly to true. I think we
should consider something like the 'offset' idea separately if at all.
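
A rough sketch of that user-code workaround, densifying the assembled column
before asking StandardScaler to center (the DataFrame and column names here
are only illustrative; pyspark.ml.linalg would be pyspark.mllib.linalg on
Spark 1.x):

from pyspark.sql.functions import udf
from pyspark.ml.feature import StandardScaler
from pyspark.ml.linalg import DenseVector, VectorUDT

# Densify the sparse 'features' column so centering is possible.
toDense = udf(lambda v: DenseVector(v.toArray()), VectorUDT())
denseDf = assembledDf.withColumn("denseFeatures", toDense("features"))

# With dense vectors, the scaler can both center and scale.
scaler = StandardScaler(inputCol="denseFeatures", outputCol="scaledFeatures",
                        withMean=True, withStd=True)
scaledDf = scaler.fit(denseDf).transform(denseDf)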

On Thu, Aug 11, 2016 at 11:02 AM, Sean Owen  wrote:
> No, that doesn't describe the change being discussed, since you've
> copied the discussion about adding an 'offset'. That's orthogonal.
> You're also suggesting making withMean=True the default, which we
> don't want. The point is that if this is *explicitly* requested, the
> scaler shouldn't refuse to subtract the mean from a sparse vector, and
> fail.
>
> On Wed, Aug 10, 2016 at 8:47 PM, Tobi Bosede  wrote:
>> Sean,
>>
>> I have created a jira; I hope you don't mind that I borrowed your
>> explanation of "offset". https://issues.apache.org/jira/browse/SPARK-17001
>>
>> So what did you do to standardize your data, if you didn't use
>> standardScaler? Did you write a udf to subtract mean and divide by standard
>> deviation?
>>
>> Although I know this is not the best approach for something I plan to put in
>> production, I have been trying to write a udf to turn the sparse vector into
>> a dense one and apply the udf in withcolumn(). withColumn() complains that
>> the data is a tuple. I think the issue might be the datatype parameter. The
>> function returns a vector of doubles but there is no type that would be
>> adequate for this.
>>
>> sparseToDense=udf(lambda data: float(DenseVector([data.toArray()])),
>> DoubleType())
>> denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
>> sparseToDense("features"))
>>
>> However the function works outside the udf, but I am unable to add an
>> arbitrary column to the data frame I started out working with. Thoughts?
>>
>> denseFeatures=TrainingRdf.select("features").map(lambda data:
>> DenseVector([data.features.toArray()]))
>> denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
>> denseFeatures)
>>
>> Thanks,
>> Tobi
>>
>>
>> On Wed, Aug 10, 2016 at 1:01 PM, Nick Pentreath 
>> wrote:
>>>
>>> Ah right, got it. As you say for storage it helps significantly, but for
>>> operations I suspect it puts one back in a "dense-like" position. Still, for
>>> online / mini-batch algorithms it may still be feasible I guess.
>>> On Wed, 10 Aug 2016 at 19:50, Sean Owen  wrote:

 All elements, I think. Imagine a sparse vector 1:3 3:7 which conceptually
 represents 0 3 0 7. Imagine it also has an offset stored which applies to
 all elements. If it is -2 then it now represents -2 1 -2 5, but this
 requires just one extra value to store. It only helps with storage of a
 shifted sparse vector; iterating still typically requires iterating all
 elements.

 Probably, where this would help, the caller can track this offset and
 even more efficiently apply this knowledge. I remember digging into this in
 how sparse covariance matrices are computed. It almost but not quite
 enabled an optimization.


 On Wed, Aug 10, 2016, 18:10 Nick Pentreath 
 wrote:
>
> Sean by 'offset' do you mean basically subtracting the mean but only
> from the non-zero elements in each row?
> On Wed, 10 Aug 2016 at 19:02, Sean Owen  wrote:
>>
>> Yeah I had thought the same, that perhaps it's fine to let the
>> StandardScaler proceed, if it's explicitly asked to center, rather
>> than refuse to. It's not really much more rope to let a user hang
>> herself with, and, blocks legitimate usages (we ran into this last
>> week and couldn't use StandardScaler as a result).
>>
>> I'm personally supportive of the change and don't see a JIRA. I think
>> you could at least make one.
>>
>> On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede 
>> wrote:
>> > Thanks Sean, I agree with 100% that the math is math and dense vs
>> > sparse is
>> > just a matter of representation. I was trying to convince a co-worker
>> > of
>> > this to no avail. Sending this email was mainly a sanity check.
>> >
>> > I think having an offset would be a great idea, although I am not
>> > sure how
>> > to implement this. However, if anything should be done to rectify
>> > this
> > issue, it should be done in the standardScaler, not vectorAssembler.

Re: Standardization with Sparse Vectors

2016-08-11 Thread Tobi Bosede
I'm opening this follow-up question to the entire mailing list. Does anyone
have thoughts on how I can add a column of dense vectors (created by
converting a column of sparse features) to a data frame? My efforts are below.

Although I know this is not the best approach for something I plan to put
in production, I have been trying to write a UDF to turn the sparse vector
into a dense one and apply the UDF in withColumn(). withColumn() complains
that the data is a tuple. I think the issue might be the data type
parameter: the function returns a vector of doubles, but there is no type
that would be adequate for this.


sparseToDense=udf(lambda data: float(DenseVector([data.toArray()])), DoubleType())
denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures", sparseToDense("features"))

The function works outside the UDF, but I am unable to add an arbitrary
column to the data frame I started out working with.

denseFeatures=TrainingRdf.select("features").map(lambda data: DenseVector([data.features.toArray()]))
denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures", denseFeatures)
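
As a side note, the second attempt cannot work as written, since withColumn()
only accepts a Column expression derived from the same DataFrame, not a
separately computed RDD. Besides the UDF route, one rough alternative sketch
is to rebuild the DataFrame from its RDD (assuming a Spark 2.x SparkSession
named spark; use sqlContext.createDataFrame and pyspark.mllib.linalg on 1.x):

from pyspark.sql import Row
from pyspark.ml.linalg import DenseVector

def withDense(row):
    # Copy the existing columns and append a densified copy of 'features'.
    fields = row.asDict()
    fields["denseFeatures"] = DenseVector(row["features"].toArray())
    return Row(**fields)

# The vector type is inferred from the Row objects when the DataFrame is rebuilt.
denseTrainingRdf = spark.createDataFrame(trainingRdfAssemb.rdd.map(withDense))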

Thanks,
Tobi

On Thu, Aug 11, 2016 at 5:02 AM, Sean Owen  wrote:

> No, that doesn't describe the change being discussed, since you've
> copied the discussion about adding an 'offset'. That's orthogonal.
> You're also suggesting making withMean=True the default, which we
> don't want. The point is that if this is *explicitly* requested, the
> scaler shouldn't refuse to subtract the mean from a sparse vector, and
> fail.
>
> On Wed, Aug 10, 2016 at 8:47 PM, Tobi Bosede  wrote:
> > Sean,
> >
> > I have created a jira; I hope you don't mind that I borrowed your
> > explanation of "offset". https://issues.apache.org/
> jira/browse/SPARK-17001
> >
> > So what did you do to standardize your data, if you didn't use
> > standardScaler? Did you write a udf to subtract mean and divide by
> standard
> > deviation?
> >
> > Although I know this is not the best approach for something I plan to
> put in
> > production, I have been trying to write a udf to turn the sparse vector
> into
> > a dense one and apply the udf in withcolumn(). withColumn() complains
> that
> > the data is a tuple. I think the issue might be the datatype parameter.
> The
> > function returns a vector of doubles but there is no type that would be
> > adequate for this.
> >
> > sparseToDense=udf(lambda data: float(DenseVector([data.toArray()])),
> > DoubleType())
> > denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
> > sparseToDense("features"))
> >
> > However the function works outside the udf, but I am unable to add an
> > arbitrary column to the data frame I started out working with. Thoughts?
> >
> > denseFeatures=TrainingRdf.select("features").map(lambda data:
> > DenseVector([data.features.toArray()]))
> > denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
> > denseFeatures)
> >
> > Thanks,
> > Tobi
> >
> >
> > On Wed, Aug 10, 2016 at 1:01 PM, Nick Pentreath <
> nick.pentre...@gmail.com>
> > wrote:
> >>
> >> Ah right, got it. As you say for storage it helps significantly, but for
> >> operations I suspect it puts one back in a "dense-like" position.
> Still, for
> >> online / mini-batch algorithms it may still be feasible I guess.
> >> On Wed, 10 Aug 2016 at 19:50, Sean Owen  wrote:
> >>>
> >>> All elements, I think. Imagine a sparse vector 1:3 3:7 which
> conceptually
> >>> represents 0 3 0 7. Imagine it also has an offset stored which applies
> to
> >>> all elements. If it is -2 then it now represents -2 1 -2 5, but this
> >>> requires just one extra value to store. It only helps with storage of a
> >>> shifted sparse vector; iterating still typically requires iterating all
> >>> elements.
> >>>
> >>> Probably, where this would help, the caller can track this offset and
> >>> even more efficiently apply this knowledge. I remember digging into
> this in
> >>> how sparse covariance matrices are computed. It almost but not quite
> enabled
> >>> an optimization.
> >>>
> >>>
> >>> On Wed, Aug 10, 2016, 18:10 Nick Pentreath 
> >>> wrote:
> 
>  Sean by 'offset' do you mean basically subtracting the mean but only
>  from the non-zero elements in each row?
>  On Wed, 10 Aug 2016 at 19:02, Sean Owen  wrote:
> >
> > Yeah I had thought the same, that perhaps it's fine to let the
> > StandardScaler proceed, if it's explicitly asked to center, rather
> > than refuse to. It's not really much more rope to let a user hang
> > herself with, and, blocks legitimate usages (we ran into this last
> > week and couldn't use StandardScaler as a result).
> >
> > I'm personally supportive of the change and don't see a JIRA. I think
> > you could at least make one.
> >
> > On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede 
> > wrote:
> > > Thanks Sean, I agree with 100% that the math is math and dense vs
> > > sparse is
> > > just a matter of representation. I was trying to convince a co-worker
> > > of this to no avail.

Re: Standardization with Sparse Vectors

2016-08-11 Thread Sean Owen
No, that doesn't describe the change being discussed, since you've
copied the discussion about adding an 'offset'. That's orthogonal.
You're also suggesting making withMean=True the default, which we
don't want. The point is that if this is *explicitly* requested, the
scaler shouldn't refuse to subtract the mean from a sparse vector, and
fail.

On Wed, Aug 10, 2016 at 8:47 PM, Tobi Bosede  wrote:
> Sean,
>
> I have created a jira; I hope you don't mind that I borrowed your
> explanation of "offset". https://issues.apache.org/jira/browse/SPARK-17001
>
> So what did you do to standardize your data, if you didn't use
> standardScaler? Did you write a udf to subtract mean and divide by standard
> deviation?
>
> Although I know this is not the best approach for something I plan to put in
> production, I have been trying to write a udf to turn the sparse vector into
> a dense one and apply the udf in withcolumn(). withColumn() complains that
> the data is a tuple. I think the issue might be the datatype parameter. The
> function returns a vector of doubles but there is no type that would be
> adequate for this.
>
> sparseToDense=udf(lambda data: float(DenseVector([data.toArray()])),
> DoubleType())
> denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
> sparseToDense("features"))
>
> However the function works outside the udf, but I am unable to add an
> arbitrary column to the data frame I started out working with. Thoughts?
>
> denseFeatures=TrainingRdf.select("features").map(lambda data:
> DenseVector([data.features.toArray()]))
> denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
> denseFeatures)
>
> Thanks,
> Tobi
>
>
> On Wed, Aug 10, 2016 at 1:01 PM, Nick Pentreath 
> wrote:
>>
>> Ah right, got it. As you say for storage it helps significantly, but for
>> operations I suspect it puts one back in a "dense-like" position. Still, for
>> online / mini-batch algorithms it may still be feasible I guess.
>> On Wed, 10 Aug 2016 at 19:50, Sean Owen  wrote:
>>>
>>> All elements, I think. Imagine a sparse vector 1:3 3:7 which conceptually
>>> represents 0 3 0 7. Imagine it also has an offset stored which applies to
>>> all elements. If it is -2 then it now represents -2 1 -2 5, but this
>>> requires just one extra value to store. It only helps with storage of a
>>> shifted sparse vector; iterating still typically requires iterating all
>>> elements.
>>>
>>> Probably, where this would help, the caller can track this offset and
>>> even more efficiently apply this knowledge. I remember digging into this in
>>> how sparse covariance matrices are computed. It almost but not quite enabled
>>> an optimization.
>>>
>>>
>>> On Wed, Aug 10, 2016, 18:10 Nick Pentreath 
>>> wrote:

 Sean by 'offset' do you mean basically subtracting the mean but only
 from the non-zero elements in each row?
 On Wed, 10 Aug 2016 at 19:02, Sean Owen  wrote:
>
> Yeah I had thought the same, that perhaps it's fine to let the
> StandardScaler proceed, if it's explicitly asked to center, rather
> than refuse to. It's not really much more rope to let a user hang
> herself with, and, blocks legitimate usages (we ran into this last
> week and couldn't use StandardScaler as a result).
>
> I'm personally supportive of the change and don't see a JIRA. I think
> you could at least make one.
>
> On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede 
> wrote:
> > Thanks Sean, I agree with 100% that the math is math and dense vs
> > sparse is
> > just a matter of representation. I was trying to convince a co-worker
> > of
> > this to no avail. Sending this email was mainly a sanity check.
> >
> > I think having an offset would be a great idea, although I am not
> > sure how
> > to implement this. However, if anything should be done to rectify
> > this
> > issue, it should be done in the standardScaler, not vectorAssembler.
> > There
> > should not be any forcing of vectorAssembler to produce only dense
> > vectors
> > so as to avoid performance problems with data that does not fit in
> > memory.
> > Furthermore, not every machine learning algo requires
> > standardization.
> > Instead, standardScaler should have withmean=True as default and
> > should
> > apply an offset if the vector is sparse, whereas there would be
> > normal
> > subtraction if the vector is dense. This way the default behavior of
> > standardScaler will always be what is generally understood to be
> > standardization, as opposed to people thinking they are standardizing
> > when
> > they actually are not.
> >
> > Can anyone confirm whether there is a jira already?
> >
> > On Wed, Aug 10, 2016 at 10:58 AM, Sean Owen 
> > wrote:
> >>
> >> Dense vs sparse is just a question of representation, so doesn't
> >> make
> >> an operation on a vector more or less important as a result.

Re: Standardization with Sparse Vectors

2016-08-10 Thread Tobi Bosede
Sean,

I have created a JIRA; I hope you don't mind that I borrowed your
explanation of "offset": https://issues.apache.org/jira/browse/SPARK-17001

So what did you do to standardize your data if you didn't use
StandardScaler? Did you write a UDF to subtract the mean and divide by the
standard deviation?

Although I know this is not the best approach for something I plan to put
in production, I have been trying to write a UDF to turn the sparse vector
into a dense one and apply the UDF in withColumn(). withColumn() complains
that the data is a tuple. I think the issue might be the data type
parameter: the function returns a vector of doubles, but there is no type
that would be adequate for this.


sparseToDense=udf(lambda data: float(DenseVector([data.toArray()])), DoubleType())
denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures", sparseToDense("features"))

The function works outside the UDF, but I am unable to add an arbitrary
column to the data frame I started out working with. Thoughts?

denseFeatures=TrainingRdf.select("features").map(lambda data: DenseVector([data.features.toArray()]))
denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures", denseFeatures)

Thanks,
Tobi

On Wed, Aug 10, 2016 at 1:01 PM, Nick Pentreath 
wrote:

> Ah right, got it. As you say for storage it helps significantly, but for
> operations I suspect it puts one back in a "dense-like" position. Still,
> for online / mini-batch algorithms it may still be feasible I guess.
> On Wed, 10 Aug 2016 at 19:50, Sean Owen  wrote:
>
>> All elements, I think. Imagine a sparse vector 1:3 3:7 which conceptually
>> represents 0 3 0 7. Imagine it also has an offset stored which applies to
>> all elements. If it is -2 then it now represents -2 1 -2 5, but this
>> requires just one extra value to store. It only helps with storage of a
>> shifted sparse vector; iterating still typically requires iterating all
>> elements.
>>
>> Probably, where this would help, the caller can track this offset and
>> even more efficiently apply this knowledge. I remember digging into this in
>> how sparse covariance matrices are computed. It almost but not quite
>> enabled an optimization.
>>
>>
>> On Wed, Aug 10, 2016, 18:10 Nick Pentreath 
>> wrote:
>>
>>> Sean by 'offset' do you mean basically subtracting the mean but only
>>> from the non-zero elements in each row?
>>> On Wed, 10 Aug 2016 at 19:02, Sean Owen  wrote:
>>>
 Yeah I had thought the same, that perhaps it's fine to let the
 StandardScaler proceed, if it's explicitly asked to center, rather
 than refuse to. It's not really much more rope to let a user hang
 herself with, and, blocks legitimate usages (we ran into this last
 week and couldn't use StandardScaler as a result).

 I'm personally supportive of the change and don't see a JIRA. I think
 you could at least make one.

 On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede 
 wrote:
 > Thanks Sean, I agree with 100% that the math is math and dense vs
 sparse is
 > just a matter of representation. I was trying to convince a co-worker
 of
 > this to no avail. Sending this email was mainly a sanity check.
 >
 > I think having an offset would be a great idea, although I am not
 sure how
 > to implement this. However, if anything should be done to rectify this
 > issue, it should be done in the standardScaler, not vectorAssembler.
 There
 > should not be any forcing of vectorAssembler to produce only dense
 vectors
 > so as to avoid performance problems with data that does not fit in
 memory.
 > Furthermore, not every machine learning algo requires standardization.
 > Instead, standardScaler should have withmean=True as default and
 should
 > apply an offset if the vector is sparse, whereas there would be normal
 > subtraction if the vector is dense. This way the default behavior of
 > standardScaler will always be what is generally understood to be
 > standardization, as opposed to people thinking they are standardizing
 when
 > they actually are not.
 >
 > Can anyone confirm whether there is a jira already?
 >
 > On Wed, Aug 10, 2016 at 10:58 AM, Sean Owen 
 wrote:
 >>
 >> Dense vs sparse is just a question of representation, so doesn't make
 >> an operation on a vector more or less important as a result. You've
 >> identified the reason that subtracting the mean can be undesirable: a
 >> notionally billion-element sparse vector becomes too big to fit in
 >> memory at once.
 >>
 >> I know this came up as a problem recently (I think there's a JIRA?)
 >> because VectorAssembler will *sometimes* output a small dense vector
 >> and sometimes output a small sparse vector based on how many zeroes
 >> there are. But that's bad because then the StandardScaler can't
 >> process the output at all. You can work on this if you're interested;
>

Re: Standardization with Sparse Vectors

2016-08-10 Thread Nick Pentreath
Ah right, got it. As you say for storage it helps significantly, but for
operations I suspect it puts one back in a "dense-like" position. Still,
for online / mini-batch algorithms it may still be feasible I guess.
On Wed, 10 Aug 2016 at 19:50, Sean Owen  wrote:

> All elements, I think. Imagine a sparse vector 1:3 3:7 which conceptually
> represents 0 3 0 7. Imagine it also has an offset stored which applies to
> all elements. If it is -2 then it now represents -2 1 -2 5, but this
> requires just one extra value to store. It only helps with storage of a
> shifted sparse vector; iterating still typically requires iterating all
> elements.
>
> Probably, where this would help, the caller can track this offset and even
> more efficiently apply this knowledge. I remember digging into this in how
> sparse covariance matrices are computed. It almost but not quite enabled an
> optimization.
>
>
> On Wed, Aug 10, 2016, 18:10 Nick Pentreath 
> wrote:
>
>> Sean by 'offset' do you mean basically subtracting the mean but only from
>> the non-zero elements in each row?
>> On Wed, 10 Aug 2016 at 19:02, Sean Owen  wrote:
>>
>>> Yeah I had thought the same, that perhaps it's fine to let the
>>> StandardScaler proceed, if it's explicitly asked to center, rather
>>> than refuse to. It's not really much more rope to let a user hang
>>> herself with, and, blocks legitimate usages (we ran into this last
>>> week and couldn't use StandardScaler as a result).
>>>
>>> I'm personally supportive of the change and don't see a JIRA. I think
>>> you could at least make one.
>>>
>>> On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede 
>>> wrote:
>>> > Thanks Sean, I agree with 100% that the math is math and dense vs
>>> sparse is
>>> > just a matter of representation. I was trying to convince a co-worker
>>> of
>>> > this to no avail. Sending this email was mainly a sanity check.
>>> >
>>> > I think having an offset would be a great idea, although I am not sure
>>> how
>>> > to implement this. However, if anything should be done to rectify this
>>> > issue, it should be done in the standardScaler, not vectorAssembler.
>>> There
>>> > should not be any forcing of vectorAssembler to produce only dense
>>> vectors
>>> > so as to avoid performance problems with data that does not fit in
>>> memory.
>>> > Furthermore, not every machine learning algo requires standardization.
>>> > Instead, standardScaler should have withmean=True as default and should
>>> > apply an offset if the vector is sparse, whereas there would be normal
>>> > subtraction if the vector is dense. This way the default behavior of
>>> > standardScaler will always be what is generally understood to be
>>> > standardization, as opposed to people thinking they are standardizing
>>> when
>>> > they actually are not.
>>> >
>>> > Can anyone confirm whether there is a jira already?
>>> >
>>> > On Wed, Aug 10, 2016 at 10:58 AM, Sean Owen 
>>> wrote:
>>> >>
>>> >> Dense vs sparse is just a question of representation, so doesn't make
>>> >> an operation on a vector more or less important as a result. You've
>>> >> identified the reason that subtracting the mean can be undesirable: a
>>> >> notionally billion-element sparse vector becomes too big to fit in
>>> >> memory at once.
>>> >>
>>> >> I know this came up as a problem recently (I think there's a JIRA?)
>>> >> because VectorAssembler will *sometimes* output a small dense vector
>>> >> and sometimes output a small sparse vector based on how many zeroes
>>> >> there are. But that's bad because then the StandardScaler can't
>>> >> process the output at all. You can work on this if you're interested;
>>> >> I think the proposal was to be able to force a dense representation
>>> >> only in VectorAssembler. I don't know if that's the nature of the
>>> >> problem you're hitting.
>>> >>
>>> >> It can be meaningful to only scale the dimension without centering it,
>>> >> but it's not the same thing, no. The math is the math.
>>> >>
>>> >> This has come up a few times -- it's necessary to center a sparse
>>> >> vector but prohibitive to do so. One idea I'd toyed with in the past
>>> >> was to let a sparse vector have an 'offset' value applied to all
>>> >> elements. That would let you shift all values while preserving a
>>> >> sparse representation. I'm not sure if it's worth implementing but
>>> >> would help this case.
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede 
>>> wrote:
>>> >> > Hi everyone,
>>> >> >
>>> >> > I am doing some standardization using standardScaler on data from
>>> >> > VectorAssembler which is represented as sparse vectors. I plan to
>>> fit a
>>> >> > regularized model.  However, standardScaler does not allow the mean
>>> to
>>> >> > be
>>> >> > subtracted from sparse vectors. It will only divide by the standard
>>> >> > deviation, which I understand is to keep the vector sparse. Thus I
>>> am
>>> >> > trying
>>> >> > to convert my sparse vectors into dense vectors, but this may not be

Re: Standardization with Sparse Vectors

2016-08-10 Thread Sean Owen
All elements, I think. Imagine a sparse vector 1:3 3:7 which conceptually
represents 0 3 0 7. Imagine it also has an offset stored which applies to
all elements. If it is -2 then it now represents -2 1 -2 5, but this
requires just one extra value to store. It only helps with storage of a
shifted sparse vector; iterating still typically requires iterating all
elements.

Probably, where this would help, the caller can track this offset and even
more efficiently apply this knowledge. I remember digging into this in how
sparse covariance matrices are computed. It almost but not quite enabled an
optimization.
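
To make that concrete, a toy (non-Spark) sketch of the idea using the 1:3 3:7
example above; only one extra value is stored, but materializing or iterating
still touches every position:

from pyspark.ml.linalg import SparseVector  # only used here for the sparse storage

class OffsetSparseVector(object):
    """Sparse values plus a single offset that notionally applies to every element."""
    def __init__(self, sparse, offset=0.0):
        self.sparse = sparse
        self.offset = offset

    def toArray(self):
        # Materializing still produces (and walks) a fully dense array.
        return self.sparse.toArray() + self.offset

sv = SparseVector(4, {1: 3.0, 3: 7.0})   # conceptually [0, 3, 0, 7]
shifted = OffsetSparseVector(sv, -2.0)
print(shifted.toArray())                 # [-2.  1. -2.  5.], stored as 2 values + 1 offset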

On Wed, Aug 10, 2016, 18:10 Nick Pentreath  wrote:

> Sean by 'offset' do you mean basically subtracting the mean but only from
> the non-zero elements in each row?
> On Wed, 10 Aug 2016 at 19:02, Sean Owen  wrote:
>
>> Yeah I had thought the same, that perhaps it's fine to let the
>> StandardScaler proceed, if it's explicitly asked to center, rather
>> than refuse to. It's not really much more rope to let a user hang
>> herself with, and, blocks legitimate usages (we ran into this last
>> week and couldn't use StandardScaler as a result).
>>
>> I'm personally supportive of the change and don't see a JIRA. I think
>> you could at least make one.
>>
>> On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede  wrote:
>> > Thanks Sean, I agree with 100% that the math is math and dense vs
>> sparse is
>> > just a matter of representation. I was trying to convince a co-worker of
>> > this to no avail. Sending this email was mainly a sanity check.
>> >
>> > I think having an offset would be a great idea, although I am not sure
>> how
>> > to implement this. However, if anything should be done to rectify this
>> > issue, it should be done in the standardScaler, not vectorAssembler.
>> There
>> > should not be any forcing of vectorAssembler to produce only dense
>> vectors
>> > so as to avoid performance problems with data that does not fit in
>> memory.
>> > Furthermore, not every machine learning algo requires standardization.
>> > Instead, standardScaler should have withmean=True as default and should
>> > apply an offset if the vector is sparse, whereas there would be normal
>> > subtraction if the vector is dense. This way the default behavior of
>> > standardScaler will always be what is generally understood to be
>> > standardization, as opposed to people thinking they are standardizing
>> when
>> > they actually are not.
>> >
>> > Can anyone confirm whether there is a jira already?
>> >
>> > On Wed, Aug 10, 2016 at 10:58 AM, Sean Owen  wrote:
>> >>
>> >> Dense vs sparse is just a question of representation, so doesn't make
>> >> an operation on a vector more or less important as a result. You've
>> >> identified the reason that subtracting the mean can be undesirable: a
>> >> notionally billion-element sparse vector becomes too big to fit in
>> >> memory at once.
>> >>
>> >> I know this came up as a problem recently (I think there's a JIRA?)
>> >> because VectorAssembler will *sometimes* output a small dense vector
>> >> and sometimes output a small sparse vector based on how many zeroes
>> >> there are. But that's bad because then the StandardScaler can't
>> >> process the output at all. You can work on this if you're interested;
>> >> I think the proposal was to be able to force a dense representation
>> >> only in VectorAssembler. I don't know if that's the nature of the
>> >> problem you're hitting.
>> >>
>> >> It can be meaningful to only scale the dimension without centering it,
>> >> but it's not the same thing, no. The math is the math.
>> >>
>> >> This has come up a few times -- it's necessary to center a sparse
>> >> vector but prohibitive to do so. One idea I'd toyed with in the past
>> >> was to let a sparse vector have an 'offset' value applied to all
>> >> elements. That would let you shift all values while preserving a
>> >> sparse representation. I'm not sure if it's worth implementing but
>> >> would help this case.
>> >>
>> >>
>> >>
>> >>
>> >> On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede 
>> wrote:
>> >> > Hi everyone,
>> >> >
>> >> > I am doing some standardization using standardScaler on data from
>> >> > VectorAssembler which is represented as sparse vectors. I plan to
>> fit a
>> >> > regularized model.  However, standardScaler does not allow the mean
>> to
>> >> > be
>> >> > subtracted from sparse vectors. It will only divide by the standard
>> >> > deviation, which I understand is to keep the vector sparse. Thus I am
>> >> > trying
>> >> > to convert my sparse vectors into dense vectors, but this may not be
>> >> > worthwhile.
>> >> >
>> >> > So my questions are:
>> >> > Is subtracting the mean during standardization only important when
>> >> > working
>> >> > with dense vectors? Does it not matter for sparse vectors? Is just
>> >> > dividing
>> >> > by the standard deviation with sparse vectors equivalent to also
>> >> > dividing by
>> >> > standard deviation and subtracting the mean with dense vectors?

Re: Standardization with Sparse Vectors

2016-08-10 Thread Nick Pentreath
Sean by 'offset' do you mean basically subtracting the mean but only from
the non-zero elements in each row?
On Wed, 10 Aug 2016 at 19:02, Sean Owen  wrote:

> Yeah I had thought the same, that perhaps it's fine to let the
> StandardScaler proceed, if it's explicitly asked to center, rather
> than refuse to. It's not really much more rope to let a user hang
> herself with, and, blocks legitimate usages (we ran into this last
> week and couldn't use StandardScaler as a result).
>
> I'm personally supportive of the change and don't see a JIRA. I think
> you could at least make one.
>
> On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede  wrote:
> > Thanks Sean, I agree with 100% that the math is math and dense vs sparse
> is
> > just a matter of representation. I was trying to convince a co-worker of
> > this to no avail. Sending this email was mainly a sanity check.
> >
> > I think having an offset would be a great idea, although I am not sure
> how
> > to implement this. However, if anything should be done to rectify this
> > issue, it should be done in the standardScaler, not vectorAssembler.
> There
> > should not be any forcing of vectorAssembler to produce only dense
> vectors
> > so as to avoid performance problems with data that does not fit in
> memory.
> > Furthermore, not every machine learning algo requires standardization.
> > Instead, standardScaler should have withmean=True as default and should
> > apply an offset if the vector is sparse, whereas there would be normal
> > subtraction if the vector is dense. This way the default behavior of
> > standardScaler will always be what is generally understood to be
> > standardization, as opposed to people thinking they are standardizing
> when
> > they actually are not.
> >
> > Can anyone confirm whether there is a jira already?
> >
> > On Wed, Aug 10, 2016 at 10:58 AM, Sean Owen  wrote:
> >>
> >> Dense vs sparse is just a question of representation, so doesn't make
> >> an operation on a vector more or less important as a result. You've
> >> identified the reason that subtracting the mean can be undesirable: a
> >> notionally billion-element sparse vector becomes too big to fit in
> >> memory at once.
> >>
> >> I know this came up as a problem recently (I think there's a JIRA?)
> >> because VectorAssembler will *sometimes* output a small dense vector
> >> and sometimes output a small sparse vector based on how many zeroes
> >> there are. But that's bad because then the StandardScaler can't
> >> process the output at all. You can work on this if you're interested;
> >> I think the proposal was to be able to force a dense representation
> >> only in VectorAssembler. I don't know if that's the nature of the
> >> problem you're hitting.
> >>
> >> It can be meaningful to only scale the dimension without centering it,
> >> but it's not the same thing, no. The math is the math.
> >>
> >> This has come up a few times -- it's necessary to center a sparse
> >> vector but prohibitive to do so. One idea I'd toyed with in the past
> >> was to let a sparse vector have an 'offset' value applied to all
> >> elements. That would let you shift all values while preserving a
> >> sparse representation. I'm not sure if it's worth implementing but
> >> would help this case.
> >>
> >>
> >>
> >>
> >> On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede 
> wrote:
> >> > Hi everyone,
> >> >
> >> > I am doing some standardization using standardScaler on data from
> >> > VectorAssembler which is represented as sparse vectors. I plan to fit
> a
> >> > regularized model.  However, standardScaler does not allow the mean to
> >> > be
> >> > subtracted from sparse vectors. It will only divide by the standard
> >> > deviation, which I understand is to keep the vector sparse. Thus I am
> >> > trying
> >> > to convert my sparse vectors into dense vectors, but this may not be
> >> > worthwhile.
> >> >
> >> > So my questions are:
> >> > Is subtracting the mean during standardization only important when
> >> > working
> >> > with dense vectors? Does it not matter for sparse vectors? Is just
> >> > dividing
> >> > by the standard deviation with sparse vectors equivalent to also
> >> > dividing by
> >> > standard deviation w and subtracting mean with dense vectors?
> >> >
> >> > Thank you,
> >> > Tobi
> >
> >
>
>
>


Re: Standardization with Sparse Vectors

2016-08-10 Thread Sean Owen
Yeah I had thought the same, that perhaps it's fine to let the
StandardScaler proceed, if it's explicitly asked to center, rather
than refuse to. It's not really much more rope to let a user hang
herself with, and it blocks legitimate usages (we ran into this last
week and couldn't use StandardScaler as a result).

I'm personally supportive of the change and don't see a JIRA. I think
you could at least make one.

On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede  wrote:
> Thanks Sean, I agree with 100% that the math is math and dense vs sparse is
> just a matter of representation. I was trying to convince a co-worker of
> this to no avail. Sending this email was mainly a sanity check.
>
> I think having an offset would be a great idea, although I am not sure how
> to implement this. However, if anything should be done to rectify this
> issue, it should be done in the standardScaler, not vectorAssembler. There
> should not be any forcing of vectorAssembler to produce only dense vectors
> so as to avoid performance problems with data that does not fit in memory.
> Furthermore, not every machine learning algo requires standardization.
> Instead, standardScaler should have withmean=True as default and should
> apply an offset if the vector is sparse, whereas there would be normal
> subtraction if the vector is dense. This way the default behavior of
> standardScaler will always be what is generally understood to be
> standardization, as opposed to people thinking they are standardizing when
> they actually are not.
>
> Can anyone confirm whether there is a jira already?
>
> On Wed, Aug 10, 2016 at 10:58 AM, Sean Owen  wrote:
>>
>> Dense vs sparse is just a question of representation, so doesn't make
>> an operation on a vector more or less important as a result. You've
>> identified the reason that subtracting the mean can be undesirable: a
>> notionally billion-element sparse vector becomes too big to fit in
>> memory at once.
>>
>> I know this came up as a problem recently (I think there's a JIRA?)
>> because VectorAssembler will *sometimes* output a small dense vector
>> and sometimes output a small sparse vector based on how many zeroes
>> there are. But that's bad because then the StandardScaler can't
>> process the output at all. You can work on this if you're interested;
>> I think the proposal was to be able to force a dense representation
>> only in VectorAssembler. I don't know if that's the nature of the
>> problem you're hitting.
>>
>> It can be meaningful to only scale the dimension without centering it,
>> but it's not the same thing, no. The math is the math.
>>
>> This has come up a few times -- it's necessary to center a sparse
>> vector but prohibitive to do so. One idea I'd toyed with in the past
>> was to let a sparse vector have an 'offset' value applied to all
>> elements. That would let you shift all values while preserving a
>> sparse representation. I'm not sure if it's worth implementing but
>> would help this case.
>>
>>
>>
>>
>> On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede  wrote:
>> > Hi everyone,
>> >
>> > I am doing some standardization using standardScaler on data from
>> > VectorAssembler which is represented as sparse vectors. I plan to fit a
>> > regularized model.  However, standardScaler does not allow the mean to
>> > be
>> > subtracted from sparse vectors. It will only divide by the standard
>> > deviation, which I understand is to keep the vector sparse. Thus I am
>> > trying
>> > to convert my sparse vectors into dense vectors, but this may not be
>> > worthwhile.
>> >
>> > So my questions are:
>> > Is subtracting the mean during standardization only important when
>> > working
>> > with dense vectors? Does it not matter for sparse vectors? Is just
>> > dividing
>> > by the standard deviation with sparse vectors equivalent to also
>> > dividing by
>> > standard deviation w and subtracting mean with dense vectors?
>> >
>> > Thank you,
>> > Tobi
>
>




Re: Standardization with Sparse Vectors

2016-08-10 Thread Tobi Bosede
Thanks Sean, I agree 100% that the math is the math and dense vs sparse is
just a matter of representation. I was trying to convince a co-worker of
this to no avail. Sending this email was mainly a sanity check.

I think having an offset would be a great idea, although I am not sure how
to implement it. However, if anything should be done to rectify this
issue, it should be done in StandardScaler, not VectorAssembler. There
should not be any forcing of VectorAssembler to produce only dense vectors
so as to avoid performance problems with data that does not fit in memory.
Furthermore, not every machine learning algorithm requires standardization.
Instead, StandardScaler should have withMean=True as the default and should
apply an offset if the vector is sparse, whereas there would be normal
subtraction if the vector is dense. This way the default behavior of
StandardScaler will always be what is generally understood to be
standardization, as opposed to people thinking they are standardizing when
they actually are not.

Can anyone confirm whether there is a JIRA already?

On Wed, Aug 10, 2016 at 10:58 AM, Sean Owen  wrote:

> Dense vs sparse is just a question of representation, so doesn't make
> an operation on a vector more or less important as a result. You've
> identified the reason that subtracting the mean can be undesirable: a
> notionally billion-element sparse vector becomes too big to fit in
> memory at once.
>
> I know this came up as a problem recently (I think there's a JIRA?)
> because VectorAssembler will *sometimes* output a small dense vector
> and sometimes output a small sparse vector based on how many zeroes
> there are. But that's bad because then the StandardScaler can't
> process the output at all. You can work on this if you're interested;
> I think the proposal was to be able to force a dense representation
> only in VectorAssembler. I don't know if that's the nature of the
> problem you're hitting.
>
> It can be meaningful to only scale the dimension without centering it,
> but it's not the same thing, no. The math is the math.
>
> This has come up a few times -- it's necessary to center a sparse
> vector but prohibitive to do so. One idea I'd toyed with in the past
> was to let a sparse vector have an 'offset' value applied to all
> elements. That would let you shift all values while preserving a
> sparse representation. I'm not sure if it's worth implementing but
> would help this case.
>
>
>
>
> On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede  wrote:
> > Hi everyone,
> >
> > I am doing some standardization using standardScaler on data from
> > VectorAssembler which is represented as sparse vectors. I plan to fit a
> > regularized model.  However, standardScaler does not allow the mean to be
> > subtracted from sparse vectors. It will only divide by the standard
> > deviation, which I understand is to keep the vector sparse. Thus I am
> trying
> > to convert my sparse vectors into dense vectors, but this may not be
> > worthwhile.
> >
> > So my questions are:
> > Is subtracting the mean during standardization only important when
> working
> > with dense vectors? Does it not matter for sparse vectors? Is just
> dividing
> > by the standard deviation with sparse vectors equivalent to also
> dividing by
> > standard deviation w and subtracting mean with dense vectors?
> >
> > Thank you,
> > Tobi
>


Re: Standardization with Sparse Vectors

2016-08-10 Thread Sean Owen
Dense vs sparse is just a question of representation, so doesn't make
an operation on a vector more or less important as a result. You've
identified the reason that subtracting the mean can be undesirable: a
notionally billion-element sparse vector becomes too big to fit in
memory at once.

I know this came up as a problem recently (I think there's a JIRA?)
because VectorAssembler will *sometimes* output a small dense vector
and sometimes output a small sparse vector based on how many zeroes
there are. But that's bad because then the StandardScaler can't
process the output at all. You can work on this if you're interested;
I think the proposal was to be able to force a dense representation
only in VectorAssembler. I don't know if that's the nature of the
problem you're hitting.

It can be meaningful to only scale the dimension without centering it,
but it's not the same thing, no. The math is the math.

This has come up a few times -- it's necessary to center a sparse
vector but prohibitive to do so. One idea I'd toyed with in the past
was to let a sparse vector have an 'offset' value applied to all
elements. That would let you shift all values while preserving a
sparse representation. I'm not sure if it's worth implementing but
would help this case.
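
A small sketch of that mixed behavior (column names and values here are made
up, and a Spark 2.x SparkSession named spark is assumed): assembling rows
with different numbers of zeroes can yield different vector types in the
same column.

from pyspark.ml.feature import VectorAssembler

df = spark.createDataFrame(
    [(0.0, 0.0, 0.0, 5.0),   # mostly zeroes: likely stored as a SparseVector
     (1.0, 2.0, 3.0, 4.0)],  # no zeroes: likely stored as a DenseVector
    ["a", "b", "c", "d"])

assembler = VectorAssembler(inputCols=["a", "b", "c", "d"], outputCol="features")
for row in assembler.transform(df).select("features").collect():
    print(type(row["features"]))   # may print a mix of sparse and dense vector types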




On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede  wrote:
> Hi everyone,
>
> I am doing some standardization using standardScaler on data from
> VectorAssembler which is represented as sparse vectors. I plan to fit a
> regularized model.  However, standardScaler does not allow the mean to be
> subtracted from sparse vectors. It will only divide by the standard
> deviation, which I understand is to keep the vector sparse. Thus I am trying
> to convert my sparse vectors into dense vectors, but this may not be
> worthwhile.
>
> So my questions are:
> Is subtracting the mean during standardization only important when working
> with dense vectors? Does it not matter for sparse vectors? Is just dividing
> by the standard deviation with sparse vectors equivalent to also dividing by
> standard deviation w and subtracting mean with dense vectors?
>
> Thank you,
> Tobi




Standardization with Sparse Vectors

2016-08-10 Thread Tobi Bosede
Hi everyone,

I am doing some standardization using StandardScaler on data from
VectorAssembler, which is represented as sparse vectors. I plan to fit a
regularized model. However, StandardScaler does not allow the mean to be
subtracted from sparse vectors; it will only divide by the standard
deviation, which I understand is to keep the vectors sparse. Thus I am
trying to convert my sparse vectors into dense vectors, but this may not be
worthwhile.

So my questions are:
Is subtracting the mean during standardization only important when working
with dense vectors? Does it not matter for sparse vectors? Is just dividing
by the standard deviation with sparse vectors equivalent to dividing by the
standard deviation and subtracting the mean with dense vectors?
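
For what it's worth, a quick toy check of that last question (made-up
numbers, NumPy only): scaling alone and full standardization differ by a
constant shift of mean/std in each column, so they are not equivalent in
general.

import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])
mu, sigma = X.mean(axis=0), X.std(axis=0)

scaledOnly   = X / sigma           # divide by the standard deviation only
standardized = (X - mu) / sigma    # subtract the mean as well
print(standardized - scaledOnly)   # every row is the constant -mu/sigma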

Thank you,
Tobi