Re: Use /usr/bin/env python3 in scripts?

2020-07-17 Thread Jungtaek Lim
For me the merge script worked with Python 2.7, but I ran into an encoding
issue (probably from a contributor's name), so now I run the merge script
in a virtualenv with Python 3.7.7.

"python3" would be OK for me as well as it doesn't break virtualenv with
python 3.
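
The kind of breakage above is easy to reproduce in miniature. A minimal
sketch (the contributor name and message below are hypothetical, not taken
from the actual merge script):

```python
# Python 3 str is Unicode throughout, so a contributor name containing
# non-ASCII characters survives string formatting untouched. Under
# Python 2, mixing a unicode name into a byte-string template could raise
# UnicodeDecodeError unless everything was decoded explicitly.
name = "José Müller"  # hypothetical contributor name with accents
commit_msg = "Merging patch authored by {}".format(name)
print(commit_msg)  # → Merging patch authored by José Müller
```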

On Sat, Jul 18, 2020 at 6:13 AM Driesprong, Fokko 
wrote:

> +1 I'm in favor of using python3
>
> Cheers, Fokko
>
> Op vr 17 jul. 2020 om 19:49 schreef Sean Owen :
>
>> Yeah I figured it's a best practice, so I'll raise a PR unless
>> somebody tells me not to. This is about build scripts, not Pyspark
>> itself, and half the scripts already specify python3.
>>
>> On Fri, Jul 17, 2020 at 12:36 PM Oli McCormack  wrote:
>> >
>> > [Warning: not spark+python specific information]
>> >
>> > It's recommended to call out python3 explicitly in a case like this
>> (see PEP-0394, and SO). Your environment is typical: python is often a
>> pointer to python2 for tooling-compatibility reasons (other tools or
>> scripts expect to get python2 when they call python), so you should use
>> python3 to get the new version. What python points to will change over
>> time, so if you explicitly depend on Python 2, call python2 instead.
>> >
>> > More generally: It's common/recommended to use a virtual environment +
>> explicitly stated versions of Python and dependencies, rather than system
>> Python, so that python means exactly what you intend it to. I know very
>> little about the Spark python dev stack and how challenging it may be to do
>> this, so please take this with a dose of naiveté.
>> >
>> > - Oli
>> >
>> >
>> > On Fri, Jul 17, 2020 at 9:58 AM Sean Owen  wrote:
>> >>
>> >> So, we are on Python 3 entirely now right?
>> >> It might be just my local Mac env, but "/usr/bin/env python" uses
>> >> Python 2 on my mac.
>> >> Some scripts write "/usr/bin/env python3" now. Should that be the case
>> >> in all scripts?
>> >> Right now the merge script doesn't work for me b/c it was just updated
>> >> to be Python 3 only.
>> >>
>> >> -
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Use /usr/bin/env python3 in scripts?

2020-07-17 Thread Driesprong, Fokko
+1 I'm in favor of using python3

Cheers, Fokko

Op vr 17 jul. 2020 om 19:49 schreef Sean Owen :

> Yeah I figured it's a best practice, so I'll raise a PR unless
> somebody tells me not to. This is about build scripts, not Pyspark
> itself, and half the scripts already specify python3.
>
> On Fri, Jul 17, 2020 at 12:36 PM Oli McCormack  wrote:
> >
> > [Warning: not spark+python specific information]
> >
> > It's recommended to call out python3 explicitly in a case like this
> (see PEP-0394, and SO). Your environment is typical: python is often a
> pointer to python2 for tooling-compatibility reasons (other tools or
> scripts expect to get python2 when they call python), so you should use
> python3 to get the new version. What python points to will change over
> time, so if you explicitly depend on Python 2, call python2 instead.
> >
> > More generally: It's common/recommended to use a virtual environment +
> explicitly stated versions of Python and dependencies, rather than system
> Python, so that python means exactly what you intend it to. I know very
> little about the Spark python dev stack and how challenging it may be to do
> this, so please take this with a dose of naiveté.
> >
> > - Oli
> >
> >
> > On Fri, Jul 17, 2020 at 9:58 AM Sean Owen  wrote:
> >>
> >> So, we are on Python 3 entirely now right?
> >> It might be just my local Mac env, but "/usr/bin/env python" uses
> >> Python 2 on my mac.
> >> Some scripts write "/usr/bin/env python3" now. Should that be the case
> >> in all scripts?
> >> Right now the merge script doesn't work for me b/c it was just updated
> >> to be Python 3 only.
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: R installation broken on ubuntu workers, impacts K8s PRB builds

2020-07-17 Thread shane knapp ☠
this is done, except for amp-jenkins-staging-worker-02 which is refusing to
allow me to reinstall R...  i marked that worker offline and will beat on
it later today.

On Fri, Jul 17, 2020 at 11:36 AM shane knapp ☠  wrote:

> starting now...  pausing jenkins so no new builds are launched.
>
> On Thu, Jul 16, 2020 at 3:09 PM Holden Karau  wrote:
>
>> Sounds good, thanks. No rush :)
>>
>> On Thu, Jul 16, 2020 at 3:03 PM shane knapp ☠ 
>> wrote:
>>
>>> i'll get to this tomorrow afternoon, and there will be a short
>>> downtime.  more details to come.
>>>
>>> On Wed, Jul 15, 2020 at 12:17 PM Holden Karau 
>>> wrote:
>>>
 Oh cool, I filed a JIRA for this already and assigned it to you
 (noticed in one of my PRs)-
 https://issues.apache.org/jira/browse/SPARK-32326

 On Wed, Jul 15, 2020 at 12:09 PM shane knapp ☠ 
 wrote:

> i'm not entirely sure when the dep for R got bumped to 3.5+, but it's
> breaking the k8s builds.
>
> i'll need to purge these workers of all previous versions of R +
> packages, then reinstall from scratch.  this isn't a horrible task as i
> have most of it automated but it will still require a ~few hours of
> downtime.
>
> i'll file a JIRA, and figure out when i will be able to get to
> this...  possibly this afternoon.
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>
>>>
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: R installation broken on ubuntu workers, impacts K8s PRB builds

2020-07-17 Thread shane knapp ☠
starting now...  pausing jenkins so no new builds are launched.

On Thu, Jul 16, 2020 at 3:09 PM Holden Karau  wrote:

> Sounds good, thanks. No rush :)
>
> On Thu, Jul 16, 2020 at 3:03 PM shane knapp ☠  wrote:
>
>> i'll get to this tomorrow afternoon, and there will be a short downtime.
>> more details to come.
>>
>> On Wed, Jul 15, 2020 at 12:17 PM Holden Karau 
>> wrote:
>>
>>> Oh cool, I filed a JIRA for this already and assigned it to you (noticed
>>> in one of my PRs)- https://issues.apache.org/jira/browse/SPARK-32326
>>>
>>> On Wed, Jul 15, 2020 at 12:09 PM shane knapp ☠ 
>>> wrote:
>>>
 i'm not entirely sure when the dep for R got bumped to 3.5+, but it's
 breaking the k8s builds.

 i'll need to purge these workers of all previous versions of R +
 packages, then reinstall from scratch.  this isn't a horrible task as i
 have most of it automated but it will still require a ~few hours of
 downtime.

 i'll file a JIRA, and figure out when i will be able to get to this...
 possibly this afternoon.
 --
 Shane Knapp
 Computer Guy / Voice of Reason
 UC Berkeley EECS Research / RISELab Staff Technical Lead
 https://rise.cs.berkeley.edu

>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Use /usr/bin/env python3 in scripts?

2020-07-17 Thread Sean Owen
Yeah I figured it's a best practice, so I'll raise a PR unless
somebody tells me not to. This is about build scripts, not Pyspark
itself, and half the scripts already specify python3.

On Fri, Jul 17, 2020 at 12:36 PM Oli McCormack  wrote:
>
> [Warning: not spark+python specific information]
>
> It's recommended to call out python3 explicitly in a case like this (see
> PEP-0394, and SO). Your environment is typical: python is often a pointer
> to python2 for tooling-compatibility reasons (other tools or scripts
> expect to get python2 when they call python), so you should use python3
> to get the new version. What python points to will change over time, so
> if you explicitly depend on Python 2, call python2 instead.
>
> More generally: It's common/recommended to use a virtual environment + 
> explicitly stated versions of Python and dependencies, rather than system 
> Python, so that python means exactly what you intend it to. I know very 
> little about the Spark python dev stack and how challenging it may be to do 
> this, so please take this with a dose of naiveté.
>
> - Oli
>
>
> On Fri, Jul 17, 2020 at 9:58 AM Sean Owen  wrote:
>>
>> So, we are on Python 3 entirely now right?
>> It might be just my local Mac env, but "/usr/bin/env python" uses
>> Python 2 on my mac.
>> Some scripts write "/usr/bin/env python3" now. Should that be the case
>> in all scripts?
>> Right now the merge script doesn't work for me b/c it was just updated
>> to be Python 3 only.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>




Re: Use /usr/bin/env python3 in scripts?

2020-07-17 Thread Oli McCormack
[Warning: not spark+python specific information]

It's recommended to call out python3 explicitly in a case like this (see
PEP-0394, and SO).
Your environment is typical: *python* is often a pointer to python2 for
tooling-compatibility reasons (other tools or scripts expect to get
python2 when they call *python*), so you should use python3 to get the new
version. What *python* points to will change over time, so if you
explicitly depend on Python 2, call *python2* instead.

More generally: It's common/recommended to use a virtual environment +
explicitly stated versions of Python and dependencies, rather than system
Python, so that *python* means exactly what you intend it to. I know very
little about the Spark python dev stack and how challenging it may be to do
this, so please take this with a dose of naiveté.
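
That isolation can even be set up from the standard library alone; a
minimal sketch using the stdlib venv module (the directory name is
illustrative):

```python
# Create an isolated environment with the stdlib "venv" module, so that
# "python" inside it means exactly the interpreter it was built from.
import os
import subprocess
import tempfile
import venv

target = os.path.join(tempfile.mkdtemp(), "merge-env")
venv.create(target, with_pip=False)  # with_pip=True would bootstrap pip too

# The environment's own interpreter reports the version it inherits.
bindir = "Scripts" if os.name == "nt" else "bin"
exe = os.path.join(target, bindir, "python")
out = subprocess.run([exe, "-c", "import sys; print(sys.version_info[0])"],
                     capture_output=True, text=True)
print(out.stdout.strip())  # "3" when created from a Python 3 interpreter
```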

- Oli


On Fri, Jul 17, 2020 at 9:58 AM Sean Owen  wrote:

> So, we are on Python 3 entirely now right?
> It might be just my local Mac env, but "/usr/bin/env python" uses
> Python 2 on my mac.
> Some scripts write "/usr/bin/env python3" now. Should that be the case
> in all scripts?
> Right now the merge script doesn't work for me b/c it was just updated
> to be Python 3 only.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Use /usr/bin/env python3 in scripts?

2020-07-17 Thread Sean Owen
So, we are on Python 3 entirely now, right?
It might be just my local Mac env, but "/usr/bin/env python" uses
Python 2 on my Mac.
Some scripts write "/usr/bin/env python3" now. Should that be the case
in all scripts?
Right now the merge script doesn't work for me b/c it was just updated
to be Python 3 only.




Re: Catalog API for Partition

2020-07-17 Thread JackyLee
Hi, wenchen. Thanks for your attention and reply.

First, these Partition Catalog APIs are not specific to Hive; they can be
used with Lakehouse, MySQL, or any other source that supports partitions.
Second, these Partition Catalog APIs are designed only for better data
management, not for speeding up data scans. The APIs used to speed up Hive
data scans are different from these.
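
To illustrate that split, here is a purely hypothetical sketch (this is
NOT the API proposed in SPARK-31694 / PR 28617, which lives in the
Java/Scala catalog plugin interfaces): one interface for metadata
management, a separate one for scan-time pruning.

```python
# Hypothetical sketch separating the two roles of partitions:
# metadata management (create/drop/list) vs. scan-time pruning
# (listing only the partitions matching a predicate).
from abc import ABC, abstractmethod

class SupportsPartitionManagement(ABC):
    """Management role: add/drop/list partition metadata."""
    @abstractmethod
    def create_partition(self, ident, properties): ...
    @abstractmethod
    def drop_partition(self, ident): ...
    @abstractmethod
    def list_partitions(self): ...

class SupportsPartitionPruning(ABC):
    """Index role: let the scan planner skip non-matching partitions."""
    @abstractmethod
    def partitions_matching(self, predicate): ...

class InMemoryPartitionedTable(SupportsPartitionManagement,
                               SupportsPartitionPruning):
    def __init__(self):
        self._parts = {}
    def create_partition(self, ident, properties):
        self._parts[ident] = properties
    def drop_partition(self, ident):
        self._parts.pop(ident, None)
    def list_partitions(self):
        return sorted(self._parts)
    def partitions_matching(self, predicate):
        return [p for p in self._parts if predicate(p)]

table = InMemoryPartitionedTable()
table.create_partition(("dt", "2020-07-17"), {})
table.create_partition(("dt", "2020-07-16"), {})
print(table.partitions_matching(lambda p: p[1] >= "2020-07-17"))
# → [('dt', '2020-07-17')]
```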

Currently, we use the Hive catalog APIs to speed up Hive data scans and to
write data into Hive. However, we are trying to redefine HiveTable, which
implements FileTable, and use PartitioningPruning to speed up Hive scans.
Personally, I think this is a better way to support Hive in DataSource V2.

Thanks again.
Jacky Lee



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/




Re: Catalog API for Partition

2020-07-17 Thread Wenchen Fan
In Hive, partitioning does two things:
1. Acts as an index to speed up data scans
2. Acts as a way to manage the data; people can add/drop partitions.

How do you unify these 2 things in your API design?

On Fri, Jul 17, 2020 at 12:03 AM JackyLee  wrote:

> Hi devs,
>
> In order to support partition commands for DataSource V2 and Lakehouse,
> I'm trying to add Partition APIs for multiple catalogs.
>
> These APIs are widely used in MySQL, Hive, and other data sources; we can
> use them to manage partition metadata in a Lakehouse.
>
> JIRA: https://issues.apache.org/jira/browse/SPARK-31694
> PR: https://github.com/apache/spark/pull/28617
>
> We have already used these APIs to support Lakehouse on Delta Lake and
> Hive on DataSource V2, and they do solve partition support for
> DataSource V2. Could anyone review it?
>
> Thanks,
> Jacky Lee
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Welcoming some new Apache Spark committers

2020-07-17 Thread Hyukjin Kwon
You earned it Dilip. Congrats again!

On Fri, 17 Jul 2020, 14:13 Dilip Biswal,  wrote:

> Thank you all for your kind words. A special "thank you" to *Xiao Li *for
> his help and mentorship over the years that helped me immensely. I would
> also like to mention *Wenchen Fan*, *Takeshi Yamamuro*, *Sean Owen*,
> *Dongjoon Hyun*, *Hyukjin Kwon*, and *Liang-Chi Hsieh*, who all helped
> review the majority of my PRs, allowing me to grow technically.
>
> Thanks again and looking forward to working with you all.
>
> Regards,
> Dilip
>
> On Thu, Jul 16, 2020 at 12:53 AM Gengliang Wang <
> gengliang.w...@databricks.com> wrote:
>
>> Congratulations!
>>
>> On Thu, Jul 16, 2020 at 3:17 PM Dr. Kent Yao  wrote:
>>
>>> Congrats and welcome!!!
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>