Re: Changing how we compute release hashes

2018-03-23 Thread Nicholas Chammas
To close the loop here: SPARK-23716


On Fri, Mar 16, 2018 at 5:00 PM Nicholas Chammas wrote:

> OK, will do.
>
> On Fri, Mar 16, 2018 at 4:41 PM Sean Owen  wrote:
>
>> I think you can file a JIRA and open a PR. All of the bits that use "gpg
>> ... SHA512 file ..." can use shasum instead.
>> I would not change any existing release artifacts though.
>>
>> On Fri, Mar 16, 2018 at 1:14 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> I have sha512sum on my Mac via Homebrew, but yeah, as long as the format
>>> is the same I suppose it doesn’t matter whether we use shasum -a 512 or
>>> sha512sum.
>>>
>>> So shall I file a JIRA + PR for this? Or should I leave the PR to a
>>> maintainer? And are we OK with updating all the existing release hashes to
>>> use the new format, or do we only want to do this for new releases?
>>>
>>> On Fri, Mar 16, 2018 at 1:50 PM Felix Cheung wrote:
>>>
 +1 there

 --
 *From:* Sean Owen 
 *Sent:* Friday, March 16, 2018 9:51:49 AM
 *To:* Felix Cheung
 *Cc:* rb...@netflix.com; Nicholas Chammas; Spark dev list

 *Subject:* Re: Changing how we compute release hashes
 I think the issue with that is that OS X doesn't have "sha512sum". Both
 it and Linux have "shasum -a 512" though.

 On Fri, Mar 16, 2018 at 11:05 AM Felix Cheung <
 felixcheun...@hotmail.com> wrote:

> Instead of using gpg to create the sha512 hash file, we could just
> change to using sha512sum? That would output the right format, which is
> in turn verifiable.
>
>
> --
> *From:* Ryan Blue 
> *Sent:* Friday, March 16, 2018 8:31:45 AM
> *To:* Nicholas Chammas
> *Cc:* Spark dev list
> *Subject:* Re: Changing how we compute release hashes
>
> +1 It's possible to produce the same file with gpg, but the sha*sum
> utilities are a bit easier to remember the syntax for.
>
> On Thu, Mar 15, 2018 at 9:01 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> To verify that I’ve downloaded a Hadoop release correctly, I can just
>> do this:
>>
>> $ shasum --check hadoop-2.7.5.tar.gz.sha256
>> hadoop-2.7.5.tar.gz: OK
>>
>> However, since we generate Spark release hashes with GPG,
>> the resulting hash is in a format that doesn’t play well with any tools:
>>
>> $ shasum --check spark-2.3.0-bin-hadoop2.7.tgz.sha512
>> shasum: spark-2.3.0-bin-hadoop2.7.tgz.sha512: no properly formatted SHA1 
>> checksum lines found
>>
>> GPG doesn’t seem to offer a way to verify a file from a hash.
>>
>> I know I can always manipulate the SHA512 hash into a different
>> format or just manually inspect it, but as a “quality of life” 
>> improvement
>> can we change how we generate the SHA512 hash so that it plays nicely 
>> with
>> shasum? If it’s too disruptive to change the format of the SHA512
>> hash, can we add a SHA256 hash to our releases in this format?
>>
>> I suppose if it’s not easy to update or add hashes to our existing
>> releases, it may be too difficult to change anything here. But I’m not
>> sure, so I thought I’d ask.
>>
>> Nick
>>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
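
For reference, here is a minimal sketch of what the shasum-friendly format
buys us: each line of the .sha512 file is "<hex digest>  <filename>", which
"shasum --check" consumes directly and which can be verified with nothing
but the Python standard library. This is an illustration under that
assumption, not the actual release tooling; generation itself would just be
"shasum -a 512 <file> > <file>.sha512" per Sean's suggestion.

    # Sketch: verify checksum lines of the form "<hex digest>  <filename>",
    # mirroring what `shasum --check` does. Assumes the checksum file sits
    # next to the artifacts it describes.
    import hashlib

    def verify_sha512(checksum_file, chunk_size=1 << 20):
        with open(checksum_file) as f:
            for line in f:
                parts = line.strip().split(None, 1)
                if len(parts) != 2:
                    continue  # skip blank or malformed lines
                expected, name = parts
                digest = hashlib.sha512()
                # Binary-mode shasum lines prefix the name with "*".
                with open(name.lstrip("*"), "rb") as data:
                    for chunk in iter(lambda: data.read(chunk_size), b""):
                        digest.update(chunk)
                status = "OK" if digest.hexdigest() == expected.lower() else "FAILED"
                print(name + ": " + status)

    # e.g.: verify_sha512("spark-2.3.0-bin-hadoop2.7.tgz.sha512")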



RE: MatrixUDT and VectorUDT in Spark ML

2018-03-23 Thread Himanshu Mohan
I agree



Thanks
Himanshu

From: Li Jin [mailto:ice.xell...@gmail.com]
Sent: Friday, March 23, 2018 8:24 PM
To: dev 
Subject: MatrixUDT and VectorUDT in Spark ML

Hi All,

I came across these two types MatrixUDT and VectorUDT in Spark ML when doing
feature extraction and preprocessing with PySpark. However, when trying to do
some basic operations, such as vector multiplication and matrix multiplication,
I had to go down to a Python UDF.

It seems to me it would be very useful to have built-in operators on these
types, just like first-class Spark SQL types, e.g.,

df.withColumn('v', df.matrix_column * df.vector_column)

I wonder what other people's thoughts are on this?

Li




MatrixUDT and VectorUDT in Spark ML

2018-03-23 Thread Li Jin
Hi All,

I came across these two types MatrixUDT and VectorUDT in Spark ML when
doing feature extraction and preprocessing with PySpark. However, when
trying to do some basic operations, such as vector multiplication and
matrix multiplication, I had to go down to a Python UDF.

It seems to me it would be very useful to have built-in operators on these
types, just like first-class Spark SQL types, e.g.,

df.withColumn('v', df.matrix_column * df.vector_column)

I wonder what other people's thoughts are on this?

Li
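
For concreteness, a minimal PySpark sketch of the status quo described
above: with no built-in operators on VectorUDT columns, even an
element-wise product means a round trip through a Python UDF. The
DataFrame and column names here are made up for illustration.

    # Sketch of the current workaround: an element-wise vector product
    # via a Python UDF, since MatrixUDT/VectorUDT columns have no
    # built-in arithmetic operators.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.ml.linalg import Vectors, VectorUDT

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(Vectors.dense([1.0, 2.0]), Vectors.dense([3.0, 4.0]))],
        ["u", "v"],
    )

    # Each row is deserialized into Python, multiplied, and serialized
    # back -- exactly the overhead a first-class operator would avoid.
    elementwise = udf(
        lambda u, v: Vectors.dense((u.toArray() * v.toArray()).tolist()),
        VectorUDT(),
    )

    df.withColumn("w", elementwise("u", "v")).show(truncate=False)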