Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-20 Thread Sean Owen
PS: pull request at https://github.com/apache/spark/pull/23098
Not going to merge it until there's clear agreement.

On Tue, Nov 20, 2018 at 10:16 AM Ryan Blue  wrote:
>
> +1 to removing 2.11 support for 3.0 and a PR.
>
> It sounds like having multiple Scala builds is just not feasible and I don't 
> think this will be too disruptive for users since it is already a breaking 
> change.
>
> On Tue, Nov 20, 2018 at 7:05 AM Sean Owen  wrote:
>>
>> One more data point -- from looking at the SBT build yesterday, it
>> seems like most plugin updates require SBT 1.x. And both they and SBT
>> 1.x seem to need Scala 2.12. And the new zinc also does.
>> Now, the current SBT and zinc and plugins all appear to work OK with
>> 2.12 now, but updating will pretty much have to wait until 2.11
>> support goes. (I don't think it's feasible to have two SBT builds.)
>>
>> I actually haven't heard an argument for keeping 2.11, compared to the
>> overhead of maintaining it. Any substantive objections? Would it be
>> too forward to put out a WIP PR that removes it?
>>
>> On Sat, Nov 17, 2018 at 7:28 PM Sean Owen  wrote:
>> >
>> > I support dropping 2.11 support. My general logic is:
>> >
>> > - 2.11 is EOL, and is all the more EOL in the middle of next year when
>> > Spark 3 arrives
>> > - I haven't heard of a critical dependency that has no 2.12 counterpart
>> > - 2.11 users can stay on 2.4.x, which will be notionally supported
>> > through, say, end of 2019
>> > - Maintaining 2.11 vs 2.12 support is modestly difficult; in my
>> > experience, resolving the differences across these two versions is a
>> > hassle, as you need two git clones with different Scala versions in
>> > the project tags
>> > - The project is already short on resources to support things as it is
>> > - Dropping things is generally necessary to add new things, to keep
>> > complexity reasonable -- like Scala 2.13 support
>> >
>> > Maintaining a separate PR builder for 2.11 isn't so bad
>> >
>> > On Fri, Nov 16, 2018 at 4:09 PM Marcelo Vanzin
>> >  wrote:
>> > >
>> > > Now that the switch to 2.12 by default has been made, it might be good
>> > > to have a serious discussion about dropping 2.11 altogether. Many of
>> > > the main arguments have already been talked about. But I don't
>> > > remember anyone mentioning how easy it would be to break the 2.11
>> > > build now.
>> > >
>> > > For example, the following works fine in 2.12 but breaks in 2.11:
>> > >
>> > > java.util.Arrays.asList("hi").stream().forEach(println)
>> > >
>> > > We had a similar issue when we supported java 1.6 but the builds were
>> > > all on 1.7 by default. Every once in a while something would silently
>> > > break, because PR builds only check the default. And the jenkins
>> > > builds, which are less monitored, would stay broken for a while.
>> > >
>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix




Re: Maven

2018-11-20 Thread Sean Owen
Sure, if you published Spark artifacts in a local repo (even your
local file system) as com.foo:spark-core_2.12, etc, just depend on
those artifacts, not the org.apache ones.
On Tue, Nov 20, 2018 at 3:21 PM Jack Kolokasis  wrote:
>
> Hello,
>
> is there any way to use my local custom Spark build as a dependency while
> I am using Maven to compile my applications?
>
> Thanks for your reply,
> --Iacovos
>
>




Maven

2018-11-20 Thread Jack Kolokasis

Hello,

   is there any way to use my local custom Spark build as a dependency while 
I am using Maven to compile my applications?


Thanks for your reply,
--Iacovos




Re: Can I update the "2.12" PR builder to 2.11?

2018-11-20 Thread Sean Owen
That's the ticket! yes I'll figure out the build error.
On Tue, Nov 20, 2018 at 11:16 AM shane knapp  wrote:
>
> how about this?
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/
>
> (log in w/your jenkins creds and look at the build config)
>
> basically, it runs 'dev/change-scala-version.sh 2.11' and builds w/mvn and 
> '-Pscala-2.11'
>
> i'll also disable the spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12 
> build.
>
> On Tue, Nov 20, 2018 at 8:52 AM Sean Owen  wrote:
>>
>> Ah right yes not a PR builder. So would you have to update that? If possible 
>> to get that in soon it would help detect 2.11 failures.
>>
>> On Tue, Nov 20, 2018, 10:23 AM shane knapp wrote:
>>> oh, the master builds are in the jenkins job builder configs in that 
>>> databricks repo (that's near the top of my TODO list to move in to the main 
>>> spark repo).
>>>
>>> and btw, the spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12 is *not* 
>>> a PR builder...  ;)
>>>
>>> On Tue, Nov 20, 2018 at 8:20 AM Sean Owen  wrote:

 The one you set up to test 2.12 separately,
 spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12
 Now master is on 2.12 by default. OK will try to change it.
 On Tue, Nov 20, 2018 at 10:15 AM shane knapp  wrote:
 >
 > which build are you referring to as "the 2.12 PR builder"?
 >
 > but yes, it should just be a simple dev/change-scala-version.sh call in 
 > the build step.
 >
 > shane
 >
 > On Tue, Nov 20, 2018 at 7:06 AM Sean Owen  wrote:
 >>
 >> Shane, on your long list of TODOs, we still need to update the 2.12 PR
 >> builder to instead test 2.11. Is that just a matter of editing Jenkins
 >> configuration that I can see and change? if so I'll just do it.
 >>
 >> Sean
 >
 >
 >
 > --
 > Shane Knapp
 > UC Berkeley EECS Research / RISELab Staff Technical Lead
 > https://rise.cs.berkeley.edu
>>>
>>>
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu




Numpy memory not being released in executor map-partition function (memory leak)

2018-11-20 Thread joshlk_
I believe I have uncovered a strange interaction between PySpark, Numpy and
Python which produces a memory leak. I wonder if anyone has any ideas of
what the issue could be?

I have the following minimal working example (gist of code):
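
(A sketch of the kind of job described here, reconstructed from the notes
below -- the surrounding wrapper, names, sizes and iteration count are
illustrative; only the lines called out in the notes are taken from the
original gist:)

    import numpy as np
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    def process(rdd):
        # materialise the partition; removing this line stops the leak
        data = list(rdd)
        # allocate a large Numpy array (1e7 doubles, roughly 80 MB)
        rand_data = np.random.random(int(1e7))
        # converting every element to a Python int is needed to reproduce the
        # leak; inserting rand_data = list(rand_data.tolist()) here prevents it
        for e in rand_data:
            int(e)
        return [len(data)]

    for i in range(20):
        # the executor's Python memory grows a little more on every iteration
        sc.parallelize(range(1000), 1).mapPartitions(process).collect()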



When the above code is run, the memory of the executor's Python process
steadily increases after each iteration, suggesting that the memory from the
previous iteration isn't being released. This can lead to a job failure if
the memory exceeds the executor's memory limit.

Any of the following prevents the memory leak:
* Remove the line `data = list(rdd)`
* Insert the line `rand_data = list(rand_data.tolist())` after `rand_data =
np.random.random(int(1e7))`
* Remove the line `int(e)`

Some things to take notice of:
* While the rdd data is not used in the function, the line is required to
reproduce the leak: both reading in the RDD data and converting the large
number of array elements to ints are needed
* The memory leak is likely due to the large Numpy array rand_data not being
released
* You have to do the int operation on each element of rand_data to reproduce
the leak

I have experimented with gc and malloc_trim to try to ease memory usage, to
no avail.

Versions used: EMR 5.12.1, Spark 2.2.1, Python 2.7.13, Numpy 1.14.0

Some more details can be found in a related StackOverflow post.

I would be very grateful for any ideas on what the issue could be.

Many thanks,
Josh



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/




Re: Can I update the "2.12" PR builder to 2.11?

2018-11-20 Thread shane knapp
how about this?

https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/

(log in w/your jenkins creds and look at the build config)

basically, it runs 'dev/change-scala-version.sh 2.11' and builds w/mvn and
'-Pscala-2.11'

i'll also disable the spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12
build.

On Tue, Nov 20, 2018 at 8:52 AM Sean Owen  wrote:

> Ah right yes not a PR builder. So would you have to update that? If
> possible to get that in soon it would help detect 2.11 failures.
>
> On Tue, Nov 20, 2018, 10:23 AM shane knapp wrote:
>> oh, the master builds are in the jenkins job builder configs in that
>> databricks repo (that's near the top of my TODO list to move in to the main
>> spark repo).
>>
>> and btw, the spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12 is
>> *not* a PR builder...  ;)
>>
>> On Tue, Nov 20, 2018 at 8:20 AM Sean Owen  wrote:
>>
>>> The one you set up to test 2.12 separately,
>>> spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12
>>> Now master is on 2.12 by default. OK will try to change it.
>>> On Tue, Nov 20, 2018 at 10:15 AM shane knapp 
>>> wrote:
>>> >
>>> > which build are you referring to as "the 2.12 PR builder"?
>>> >
>>> > but yes, it should just be a simple dev/change-scala-version.sh call
>>> in the build step.
>>> >
>>> > shane
>>> >
>>> > On Tue, Nov 20, 2018 at 7:06 AM Sean Owen  wrote:
>>> >>
>>> >> Shane, on your long list of TODOs, we still need to update the 2.12 PR
>>> >> builder to instead test 2.11. Is that just a matter of editing Jenkins
>>> >> configuration that I can see and change? if so I'll just do it.
>>> >>
>>> >> Sean
>>> >
>>> >
>>> >
>>> > --
>>> > Shane Knapp
>>> > UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> > https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Can I update the "2.12" PR builder to 2.11?

2018-11-20 Thread Sean Owen
Ah right yes not a PR builder. So would you have to update that? If
possible to get that in soon it would help detect 2.11 failures.

On Tue, Nov 20, 2018, 10:23 AM shane knapp wrote:
> oh, the master builds are in the jenkins job builder configs in that
> databricks repo (that's near the top of my TODO list to move in to the main
> spark repo).
>
> and btw, the spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12 is *not*
> a PR builder...  ;)
>
> On Tue, Nov 20, 2018 at 8:20 AM Sean Owen  wrote:
>
>> The one you set up to test 2.12 separately,
>> spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12
>> Now master is on 2.12 by default. OK will try to change it.
>> On Tue, Nov 20, 2018 at 10:15 AM shane knapp  wrote:
>> >
>> > which build are you referring to as "the 2.12 PR builder"?
>> >
>> > but yes, it should just be a simple dev/change-scala-version.sh call in
>> the build step.
>> >
>> > shane
>> >
>> > On Tue, Nov 20, 2018 at 7:06 AM Sean Owen  wrote:
>> >>
>> >> Shane, on your long list of TODOs, we still need to update the 2.12 PR
>> >> builder to instead test 2.11. Is that just a matter of editing Jenkins
>> >> configuration that I can see and change? if so I'll just do it.
>> >>
>> >> Sean
>> >
>> >
>> >
>> > --
>> > Shane Knapp
>> > UC Berkeley EECS Research / RISELab Staff Technical Lead
>> > https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: Can I update the "2.12" PR builder to 2.11?

2018-11-20 Thread shane knapp
oh, the master builds are in the jenkins job builder configs in that
databricks repo (that's near the top of my TODO list to move in to the main
spark repo).

and btw, the spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12 is *not*
a PR builder...  ;)

On Tue, Nov 20, 2018 at 8:20 AM Sean Owen  wrote:

> The one you set up to test 2.12 separately,
> spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12
> Now master is on 2.12 by default. OK will try to change it.
> On Tue, Nov 20, 2018 at 10:15 AM shane knapp  wrote:
> >
> > which build are you referring to as "the 2.12 PR builder"?
> >
> > but yes, it should just be a simple dev/change-scala-version.sh call in
> the build step.
> >
> > shane
> >
> > On Tue, Nov 20, 2018 at 7:06 AM Sean Owen  wrote:
> >>
> >> Shane, on your long list of TODOs, we still need to update the 2.12 PR
> >> builder to instead test 2.11. Is that just a matter of editing Jenkins
> >> configuration that I can see and change? if so I'll just do it.
> >>
> >> Sean
> >
> >
> >
> > --
> > Shane Knapp
> > UC Berkeley EECS Research / RISELab Staff Technical Lead
> > https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Can I update the "2.12" PR builder to 2.11?

2018-11-20 Thread Sean Owen
The one you set up to test 2.12 separately,
spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12
Now master is on 2.12 by default. OK will try to change it.
On Tue, Nov 20, 2018 at 10:15 AM shane knapp  wrote:
>
> which build are you referring to as "the 2.12 PR builder"?
>
> but yes, it should just be a simple dev/change-scala-version.sh call in the 
> build step.
>
> shane
>
> On Tue, Nov 20, 2018 at 7:06 AM Sean Owen  wrote:
>>
>> Shane, on your long list of TODOs, we still need to update the 2.12 PR
>> builder to instead test 2.11. Is that just a matter of editing Jenkins
>> configuration that I can see and change? if so I'll just do it.
>>
>> Sean
>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu




Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-20 Thread Ryan Blue
+1 to removing 2.11 support for 3.0 and a PR.

It sounds like having multiple Scala builds is just not feasible and I
don't think this will be too disruptive for users since it is already a
breaking change.

On Tue, Nov 20, 2018 at 7:05 AM Sean Owen  wrote:

> One more data point -- from looking at the SBT build yesterday, it
> seems like most plugin updates require SBT 1.x. And both they and SBT
> 1.x seem to need Scala 2.12. And the new zinc also does.
> Now, the current SBT and zinc and plugins all appear to work OK with
> 2.12 now, but updating will pretty much have to wait until 2.11
> support goes. (I don't think it's feasible to have two SBT builds.)
>
> I actually haven't heard an argument for keeping 2.11, compared to the
> overhead of maintaining it. Any substantive objections? Would it be
> too forward to put out a WIP PR that removes it?
>
> On Sat, Nov 17, 2018 at 7:28 PM Sean Owen  wrote:
> >
> > I support dropping 2.11 support. My general logic is:
> >
> > - 2.11 is EOL, and is all the more EOL in the middle of next year when
> > Spark 3 arrives
> > - I haven't heard of a critical dependency that has no 2.12 counterpart
> > - 2.11 users can stay on 2.4.x, which will be notionally supported
> > through, say, end of 2019
> > - Maintaining 2.11 vs 2.12 support is modestly difficult; in my
> > experience, resolving the differences across these two versions is a
> > hassle, as you need two git clones with different Scala versions in
> > the project tags
> > - The project is already short on resources to support things as it is
> > - Dropping things is generally necessary to add new things, to keep
> > complexity reasonable -- like Scala 2.13 support
> >
> > Maintaining a separate PR builder for 2.11 isn't so bad
> >
> > On Fri, Nov 16, 2018 at 4:09 PM Marcelo Vanzin
> >  wrote:
> > >
> > > Now that the switch to 2.12 by default has been made, it might be good
> > > to have a serious discussion about dropping 2.11 altogether. Many of
> > > the main arguments have already been talked about. But I don't
> > > remember anyone mentioning how easy it would be to break the 2.11
> > > build now.
> > >
> > > For example, the following works fine in 2.12 but breaks in 2.11:
> > >
> > > java.util.Arrays.asList("hi").stream().forEach(println)
> > >
> > > We had a similar issue when we supported java 1.6 but the builds were
> > > all on 1.7 by default. Every once in a while something would silently
> > > break, because PR builds only check the default. And the jenkins
> > > builds, which are less monitored, would stay broken for a while.
> > >
>
>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: Can I update the "2.12" PR builder to 2.11?

2018-11-20 Thread shane knapp
which build are you referring to as "the 2.12 PR builder"?

but yes, it should just be a simple dev/change-scala-version.sh call in the
build step.

shane

On Tue, Nov 20, 2018 at 7:06 AM Sean Owen  wrote:

> Shane, on your long list of TODOs, we still need to update the 2.12 PR
> builder to instead test 2.11. Is that just a matter of editing Jenkins
> configuration that I can see and change? if so I'll just do it.
>
> Sean
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Can I update the "2.12" PR builder to 2.11?

2018-11-20 Thread Sean Owen
Shane, on your long list of TODOs, we still need to update the 2.12 PR
builder to instead test 2.11. Is that just a matter of editing Jenkins
configuration that I can see and change? if so I'll just do it.

Sean




Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-20 Thread Sean Owen
One more data point -- from looking at the SBT build yesterday, it
seems like most plugin updates require SBT 1.x. And both they and SBT
1.x seem to need Scala 2.12. And the new zinc also does.
Now, the current SBT and zinc and plugins all appear to work OK with
2.12 now, but updating will pretty much have to wait until 2.11
support goes. (I don't think it's feasible to have two SBT builds.)

I actually haven't heard an argument for keeping 2.11, compared to the
overhead of maintaining it. Any substantive objections? Would it be
too forward to put out a WIP PR that removes it?

On Sat, Nov 17, 2018 at 7:28 PM Sean Owen  wrote:
>
> I support dropping 2.11 support. My general logic is:
>
> - 2.11 is EOL, and is all the more EOL in the middle of next year when
> Spark 3 arrives
> - I haven't heard of a critical dependency that has no 2.12 counterpart
> - 2.11 users can stay on 2.4.x, which will be notionally supported
> through, say, end of 2019
> - Maintaining 2.11 vs 2.12 support is modestly difficult; in my
> experience, resolving the differences across these two versions is a
> hassle, as you need two git clones with different Scala versions in
> the project tags
> - The project is already short on resources to support things as it is
> - Dropping things is generally necessary to add new things, to keep
> complexity reasonable -- like Scala 2.13 support
>
> Maintaining a separate PR builder for 2.11 isn't so bad
>
> On Fri, Nov 16, 2018 at 4:09 PM Marcelo Vanzin
>  wrote:
> >
> > Now that the switch to 2.12 by default has been made, it might be good
> > to have a serious discussion about dropping 2.11 altogether. Many of
> > the main arguments have already been talked about. But I don't
> > remember anyone mentioning how easy it would be to break the 2.11
> > build now.
> >
> > For example, the following works fine in 2.12 but breaks in 2.11:
> >
> > java.util.Arrays.asList("hi").stream().forEach(println)
> >
> > We had a similar issue when we supported java 1.6 but the builds were
> > all on 1.7 by default. Every once in a while something would silently
> > break, because PR builds only check the default. And the jenkins
> > builds, which are less monitored, would stay broken for a while.
> >




Re: Array indexing functions

2018-11-20 Thread Alessandro Solimando
Hi Petar,
I have implemented similar functions a few times through ad-hoc UDFs in the
past, so +1 from me.

Can you elaborate a bit more on how you practically implement those
functions? Are they UDFs or "native" functions like those in the
sql.functions package?

I am asking because I wonder if/how Catalyst can take those functions into
account when producing more optimized plans; maybe you or someone else on
the list can clarify this.
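
For example, what I have in mind is the difference between something like the
following two plans (a rough PySpark sketch; array_position is only a stand-in
for the proposed functions, and first_pos is an illustrative name):

    from pyspark.sql import SparkSession, functions as F, types as T

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([([1, 2, 2, 3],)], ["arr"])

    # a "native" function is a Catalyst expression, so the optimizer sees it
    df.select(F.array_position("arr", 2)).explain()

    # a Python UDF is a black box to Catalyst, and rows are shipped to a
    # Python worker to evaluate it
    first_pos = F.udf(lambda arr, v: arr.index(v) if v in arr else -1,
                      T.IntegerType())
    df.select(first_pos("arr", F.lit(2))).explain()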

Best regards,
Alessandro

On Tue, 20 Nov 2018 at 11:11, Petar Zečević  wrote:

>
> Hi,
> I implemented two array functions that are useful to us and I wonder if
> you think it would be useful to add them to the distribution. The functions
> are used for filtering arrays based on indexes:
>
> array_allpositions (named after array_position) - takes an array column and
> a value and returns an array of the indexes of the elements equal to the
> provided value
>
> array_select - takes an array column and an array of indexes and returns a
> subset of the array based on the provided indexes.
>
> If you agree with this addition I can create a JIRA ticket and a pull
> request.
>
> --
> Petar Zečević
>
>
>


Array indexing functions

2018-11-20 Thread Petar Zečević


Hi,
I implemented two array functions that are useful to us and I wonder if you 
think it would be useful to add them to the distribution. The functions are 
used for filtering arrays based on indexes:

array_allpositions (named after array_position) - takes an array column and a 
value and returns an array of the indexes of the elements equal to the 
provided value

array_select - takes an array column and an array of indexes and returns a 
subset of the array based on the provided indexes.
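
To illustrate the intended semantics, here is a rough PySpark UDF sketch (the
actual implementation need not be a UDF, and the 0-based indexing here is just
for illustration):

    from pyspark.sql import SparkSession, functions as F, types as T

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([([10, 20, 10, 30],)], ["arr"])

    # array_allpositions: indexes of all elements equal to the given value
    array_allpositions = F.udf(
        lambda arr, value: [i for i, e in enumerate(arr) if e == value],
        T.ArrayType(T.IntegerType()))

    # array_select: the subset of the array at the given indexes
    array_select = F.udf(
        lambda arr, indexes: [arr[i] for i in indexes],
        T.ArrayType(T.LongType()))

    # all occurrences of 10, then the elements at those positions -> [10, 10]
    df.select(array_select("arr", array_allpositions("arr", F.lit(10)))).show()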

If you agree with this addition I can create a JIRA ticket and a pull request.

-- 
Petar Zečević
