Help required in validating an architecture using Structured Streaming

2016-09-27 Thread Aravindh
Hi, we are building an internal analytics application, kind of an event
store. We have all the basic analytics use cases like filtering,
aggregation, segmentation etc. So far our architecture has used ElasticSearch
extensively, but that is not scaling anymore. One unique requirement we have
is that an event should be available for querying within 5 seconds of it
occurring. We were thinking of a lambda architecture where streaming data
still goes to ElasticSearch (only 1 day's data) and the batch pipeline goes to
S3. Every day, a Spark job would transform that data and store it again in S3.
One problem we were not able to solve was how, when a query comes in, to
aggregate results from 2 data sources (ES for current data & S3 for old data).
We felt this approach won't scale.

Spark Structured Streaming seems to solve this. Correct me if I am wrong.
With Structured Streaming, will the following architecture work?
Read data from Kafka using Spark. For every batch of data, do the
transformations and store the result in S3. When a query comes in, query both
S3 and the in-memory batch at the same time. Will this approach work? One more
condition is that queries should respond immediately, with a max latency of 1s
for simple queries and 5s for complex queries. If the above method is not the
right way, please suggest an alternative to solve this.

Thanks
Aravindh.S
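
A minimal sketch of the Kafka-to-S3 leg described above could look like the
following, assuming the Structured Streaming Kafka source and the Parquet file
sink are available on the classpath; the broker, topic and bucket names below
are placeholders, not anything from the thread:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.ProcessingTime

object EventIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("event-ingest").getOrCreate()

    // Read raw events from Kafka (broker and topic names are placeholders).
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .load()

    // Keep the payload and event time; the real transformations would go here.
    val parsed = events.selectExpr("CAST(value AS STRING) AS json", "timestamp")

    // Continuously append small Parquet files to S3. A periodic compaction
    // job would still be needed to keep file counts manageable.
    val query = parsed.writeStream
      .format("parquet")
      .option("path", "s3a://my-bucket/events/")
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/events/")
      .trigger(ProcessingTime("5 seconds"))
      .start()

    query.awaitTermination()
  }
}

Whether the in-flight micro-batch can also be queried with sub-second latency
is a separate question: the memory sink (format("memory") with a queryName)
exposes the running result as a temporary table, but it is meant for small
result sets rather than a full event store.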






Re: [discuss] Spark 2.x release cadence

2016-09-27 Thread Felix Cheung
+1 on a longer, scheduled release cycle and more maintenance releases.


_________________________________
From: Mark Hamstra
Sent: Tuesday, September 27, 2016 2:01 PM
Subject: Re: [discuss] Spark 2.x release cadence
To: Reynold Xin
Cc:


+1

And I'll dare say that for those running Spark in production, it is more
important that maintenance releases come out in a timely fashion than that
new features are released one month sooner or later.

On Tue, Sep 27, 2016 at 12:06 PM, Reynold Xin wrote:
We are 2 months past releasing Spark 2.0.0, an important milestone for the
project. Spark 2.0.0 deviated from the regular release cadence we had for the
1.x line (it took 6 months), and we never explicitly discussed what the
release cadence should look like for 2.x. Thus this email.

During Spark 1.x, roughly every three months we made a new 1.x feature release
(e.g. 1.5.0 came out three months after 1.4.0). Development happened primarily
in the first two months, then a release branch was cut at the end of month 2,
and the last month was reserved for QA and release preparation.

During 2.0.0 development, I really enjoyed the longer release cycle because
there were a lot of major changes happening and the longer time was critical
for thinking through architectural changes as well as API design. While I
don't expect the same degree of drastic changes in a 2.x feature release, I do
think it'd make sense to increase the length of the release cycle so we can
make better designs.

My strawman proposal is to maintain a regular release cadence, as we did in 
Spark 1.x, and increase the cycle from 3 months to 4 months. This effectively 
gives us ~50% more time to develop (in reality it'd be slightly less than 50% 
since longer dev time also means longer QA time). As for maintenance releases, 
I think those should still be cut on-demand, similar to Spark 1.x, but more 
aggressively.

To put this into perspective, 4-month cycle means we will release Spark 2.1.0 
at the end of Nov or early Dec (and branch cut / code freeze at the end of Oct).

I am curious what others think.







Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-27 Thread Reynold Xin
So technically the vote has passed, but IMHO it does not make sense to
release this and then immediately release 2.0.2. I will work on a new RC
once SPARK-17666 and SPARK-17673 are fixed.

Please shout if you disagree.


On Tue, Sep 27, 2016 at 2:05 PM, Mark Hamstra 
wrote:

> If we're going to cut another RC, then it would be good to get this in as
> well (assuming that it is merged shortly): https://github.com/apache/spark/pull/15213
>
> It's not a regression, and it shouldn't happen too often, but when failed
> stages don't get resubmitted it is a fairly significant issue.
>
> On Tue, Sep 27, 2016 at 1:31 PM, Reynold Xin  wrote:
>
>> Actually I'm going to have to -1 the release myself. Sorry for crashing
>> the party, but I saw two super critical issues discovered in the last 2
>> days:
>>
>> https://issues.apache.org/jira/browse/SPARK-17666  -- this would
>> eventually hang Spark when running against S3 (and many other storage
>> systems)
>>
>> https://issues.apache.org/jira/browse/SPARK-17673  -- this is a
>> correctness issue across all non-file data sources.
>>
>> If we go ahead and release 2.0.1 based on this RC, we would need to cut
>> 2.0.2 immediately.
>>
>>
>>
>>
>>
>> On Tue, Sep 27, 2016 at 10:18 AM, Mark Hamstra 
>> wrote:
>>
>>> I've got a couple of build niggles that should really be investigated at
>>> some point (what look to be OOM issues in spark-repl when building and
>>> testing with mvn in a single pass instead of in two passes with -DskipTests
>>> first; the killing of ssh sessions by YarnClusterSuite), but these
>>> aren't anything that should hold up the release.
>>>
>>> +1
>>>
>>> On Sat, Sep 24, 2016 at 3:08 PM, Reynold Xin 
>>> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 2.0.1. The vote is open until Tue, Sep 27, 2016 at 15:30 PDT and
 passes if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 2.0.1
 [ ] -1 Do not release this package because ...


 The tag to be voted on is v2.0.1-rc3 (9d28cc10357a8afcfb2fa2e6eecb5c2cc2730d17)

 This release candidate resolves 290 issues:
 https://s.apache.org/spark-2.0.1-jira

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1201/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-docs/


 Q: How can I help test this release?
 A: If you are a Spark user, you can help us test this release by taking
 an existing Spark workload and running it on this release candidate, then
 reporting any regressions from 2.0.0.

 Q: What justifies a -1 vote for this release?
 A: This is a maintenance release in the 2.0.x series.  Bugs already
 present in 2.0.0, missing features, or bugs related to new features will
 not necessarily block this release.

 Q: What fix version should I use for patches merging into branch-2.0
 from now on?
 A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new RC
 (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.1.



>>>
>>
>


Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-27 Thread Mark Hamstra
If we're going to cut another RC, then it would be good to get this in as
well (assuming that it is merged shortly):
https://github.com/apache/spark/pull/15213

It's not a regression, and it shouldn't happen too often, but when failed
stages don't get resubmitted it is a fairly significant issue.

On Tue, Sep 27, 2016 at 1:31 PM, Reynold Xin  wrote:

> Actually I'm going to have to -1 the release myself. Sorry for crashing
> the party, but I saw two super critical issues discovered in the last 2
> days:
>
> https://issues.apache.org/jira/browse/SPARK-17666  -- this would
> eventually hang Spark when running against S3 (and many other storage
> systems)
>
> https://issues.apache.org/jira/browse/SPARK-17673  -- this is a
> correctness issue across all non-file data sources.
>
> If we go ahead and release 2.0.1 based on this RC, we would need to cut
> 2.0.2 immediately.
>
>
>
>
>
> On Tue, Sep 27, 2016 at 10:18 AM, Mark Hamstra 
> wrote:
>
>> I've got a couple of build niggles that should really be investigated at
>> some point (what look to be OOM issues in spark-repl when building and
>> testing with mvn in a single pass instead of in two passes with -DskipTests
>> first; the killing of ssh sessions by YarnClusterSuite), but these
>> aren't anything that should hold up the release.
>>
>> +1
>>
>> On Sat, Sep 24, 2016 at 3:08 PM, Reynold Xin  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.0.1. The vote is open until Tue, Sep 27, 2016 at 15:30 PDT and passes if
>>> a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.0.1
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> The tag to be voted on is v2.0.1-rc3 (9d28cc10357a8afcfb2fa2e6eecb5c2cc2730d17)
>>>
>>> This release candidate resolves 290 issues:
>>> https://s.apache.org/spark-2.0.1-jira
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1201/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-docs/
>>>
>>>
>>> Q: How can I help test this release?
>>> A: If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running it on this release candidate, then
>>> reporting any regressions from 2.0.0.
>>>
>>> Q: What justifies a -1 vote for this release?
>>> A: This is a maintenance release in the 2.0.x series.  Bugs already
>>> present in 2.0.0, missing features, or bugs related to new features will
>>> not necessarily block this release.
>>>
>>> Q: What fix version should I use for patches merging into branch-2.0
>>> from now on?
>>> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new RC
>>> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.1.
>>>
>>>
>>>
>>
>


Re: [discuss] Spark 2.x release cadence

2016-09-27 Thread Mark Hamstra
+1

And I'll dare say that for those running Spark in production, it is more
important that maintenance releases come out in a timely fashion than that
new features are released one month sooner or later.

On Tue, Sep 27, 2016 at 12:06 PM, Reynold Xin  wrote:

> We are 2 months past releasing Spark 2.0.0, an important milestone for the
> project. Spark 2.0.0 deviated from the regular release cadence we had for
> the 1.x line (it took 6 months), and we never explicitly discussed what the
> release cadence should look like for 2.x. Thus this email.
>
> During Spark 1.x, roughly every three months we made a new 1.x feature
> release (e.g. 1.5.0 came out three months after 1.4.0). Development
> happened primarily in the first two months, then a release branch was cut
> at the end of month 2, and the last month was reserved for QA and release
> preparation.
>
> During 2.0.0 development, I really enjoyed the longer release cycle
> because there were a lot of major changes happening and the longer time was
> critical for thinking through architectural changes as well as API design.
> While I don't expect the same degree of drastic changes in a 2.x feature
> release, I do think it'd make sense to increase the length of the release
> cycle so we can make better designs.
>
> My strawman proposal is to maintain a regular release cadence, as we did
> in Spark 1.x, and increase the cycle from 3 months to 4 months. This
> effectively gives us ~50% more time to develop (in reality it'd be slightly
> less than 50% since longer dev time also means longer QA time). As for
> maintenance releases, I think those should still be cut on-demand, similar
> to Spark 1.x, but more aggressively.
>
> To put this into perspective, 4-month cycle means we will release Spark
> 2.1.0 at the end of Nov or early Dec (and branch cut / code freeze at the
> end of Oct).
>
> I am curious what others think.
>
>
>


Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-27 Thread Reynold Xin
Actually I'm going to have to -1 the release myself. Sorry for crashing the
party, but I saw two super critical issues discovered in the last 2 days:

https://issues.apache.org/jira/browse/SPARK-17666  -- this would eventually
hang Spark when running against S3 (and many other storage systems)

https://issues.apache.org/jira/browse/SPARK-17673  -- this is a correctness
issue across all non-file data sources.

If we go ahead and release 2.0.1 based on this RC, we would need to cut
2.0.2 immediately.





On Tue, Sep 27, 2016 at 10:18 AM, Mark Hamstra 
wrote:

> I've got a couple of build niggles that should really be investigated at
> some point (what look to be OOM issues in spark-repl when building and
> testing with mvn in a single pass instead of in two passes with -DskipTests
> first; the killing of ssh sessions by YarnClusterSuite), but these aren't
> anything that should hold up the release.
>
> +1
>
> On Sat, Sep 24, 2016 at 3:08 PM, Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.0.1. The vote is open until Tue, Sep 27, 2016 at 15:30 PDT and passes if
>> a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.0.1
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v2.0.1-rc3 (9d28cc10357a8afcfb2fa2e6eecb5c2cc2730d17)
>>
>> This release candidate resolves 290 issues:
>> https://s.apache.org/spark-2.0.1-jira
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1201/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-docs/
>>
>>
>> Q: How can I help test this release?
>> A: If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running it on this release candidate, then
>> reporting any regressions from 2.0.0.
>>
>> Q: What justifies a -1 vote for this release?
>> A: This is a maintenance release in the 2.0.x series.  Bugs already
>> present in 2.0.0, missing features, or bugs related to new features will
>> not necessarily block this release.
>>
>> Q: What fix version should I use for patches merging into branch-2.0 from
>> now on?
>> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new RC
>> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.1.
>>
>>
>>
>


Re: [discuss] Spark 2.x release cadence

2016-09-27 Thread Sean Owen
+1 -- I think the minor releases were taking more like 4 months than 3
months anyway, and it was good for the reasons you give. This reflects
reality and is a good thing. All the better if we can then more comfortably
actually follow the timeline.

On Tue, Sep 27, 2016 at 3:06 PM, Reynold Xin  wrote:
> We are 2 months past releasing Spark 2.0.0, an important milestone for the
> project. Spark 2.0.0 deviated from the regular release cadence we had for
> the 1.x line (it took 6 months), and we never explicitly discussed what the
> release cadence should look like for 2.x. Thus this email.
>
> During Spark 1.x, roughly every three months we made a new 1.x feature
> release (e.g. 1.5.0 came out three months after 1.4.0). Development
> happened primarily in the first two months, then a release branch was cut
> at the end of month 2, and the last month was reserved for QA and release
> preparation.
>
> During 2.0.0 development, I really enjoyed the longer release cycle because
> there were a lot of major changes happening and the longer time was critical
> for thinking through architectural changes as well as API design. While I
> don't expect the same degree of drastic changes in a 2.x feature release, I
> do think it'd make sense to increase the length of the release cycle so we
> can make better designs.
>
> My strawman proposal is to maintain a regular release cadence, as we did in
> Spark 1.x, and increase the cycle from 3 months to 4 months. This
> effectively gives us ~50% more time to develop (in reality it'd be slightly
> less than 50% since longer dev time also means longer QA time). As for
> maintenance releases, I think those should still be cut on-demand, similar
> to Spark 1.x, but more aggressively.
>
> To put this into perspective, 4-month cycle means we will release Spark
> 2.1.0 at the end of Nov or early Dec (and branch cut / code freeze at the
> end of Oct).
>
> I am curious what others think.
>
>




Re: [discuss] Spark 2.x release cadence

2016-09-27 Thread Shivaram Venkataraman
+1 I think having a 4-month window instead of a 3-month window sounds good.

However, I think figuring out a timeline for maintenance releases would
also be good. This is a common concern that comes up in many user
threads, and it would be better to have some structure around this. It
doesn't need to be strict, but something like: the first maintenance
release for the latest 2.x.0 within 2 months, and then a second
maintenance release within 6 months, or something like that.

Thanks
Shivaram

On Tue, Sep 27, 2016 at 12:06 PM, Reynold Xin  wrote:
> We are 2 months past releasing Spark 2.0.0, an important milestone for the
> project. Spark 2.0.0 deviated from the regular release cadence we had for
> the 1.x line (it took 6 months), and we never explicitly discussed what the
> release cadence should look like for 2.x. Thus this email.
>
> During Spark 1.x, roughly every three months we made a new 1.x feature
> release (e.g. 1.5.0 came out three months after 1.4.0). Development
> happened primarily in the first two months, then a release branch was cut
> at the end of month 2, and the last month was reserved for QA and release
> preparation.
>
> During 2.0.0 development, I really enjoyed the longer release cycle because
> there were a lot of major changes happening and the longer time was critical
> for thinking through architectural changes as well as API design. While I
> don't expect the same degree of drastic changes in a 2.x feature release, I
> do think it'd make sense to increase the length of the release cycle so we
> can make better designs.
>
> My strawman proposal is to maintain a regular release cadence, as we did in
> Spark 1.x, and increase the cycle from 3 months to 4 months. This
> effectively gives us ~50% more time to develop (in reality it'd be slightly
> less than 50% since longer dev time also means longer QA time). As for
> maintenance releases, I think those should still be cut on-demand, similar
> to Spark 1.x, but more aggressively.
>
> To put this into perspective, 4-month cycle means we will release Spark
> 2.1.0 at the end of Nov or early Dec (and branch cut / code freeze at the
> end of Oct).
>
> I am curious what others think.
>
>




[discuss] Spark 2.x release cadence

2016-09-27 Thread Reynold Xin
We are 2 months past releasing Spark 2.0.0, an important milestone for the
project. Spark 2.0.0 deviated from the regular release cadence we had for the
1.x line (it took 6 months), and we never explicitly discussed what the
release cadence should look like for 2.x. Thus this email.

During Spark 1.x, roughly every three months we made a new 1.x feature
release (e.g. 1.5.0 came out three months after 1.4.0). Development happened
primarily in the first two months, then a release branch was cut at the end
of month 2, and the last month was reserved for QA and release preparation.

During 2.0.0 development, I really enjoyed the longer release cycle because
there were a lot of major changes happening and the longer time was critical
for thinking through architectural changes as well as API design. While I
don't expect the same degree of drastic changes in a 2.x feature release, I
do think it'd make sense to increase the length of the release cycle so we
can make better designs.

My strawman proposal is to maintain a regular release cadence, as we did in
Spark 1.x, and increase the cycle from 3 months to 4 months. This
effectively gives us ~50% more time to develop (in reality it'd be slightly
less than 50% since longer dev time also means longer QA time). As for
maintenance releases, I think those should still be cut on-demand, similar
to Spark 1.x, but more aggressively.

To put this into perspective, 4-month cycle means we will release Spark
2.1.0 at the end of Nov or early Dec (and branch cut / code freeze at the
end of Oct).

I am curious what others think.


Re: https://issues.apache.org/jira/browse/SPARK-17691

2016-09-27 Thread Herman van Hövell tot Westerflier
Hi Asaf,

The current collect_list/collect_set implementations have room for
improvement. We did not implement partial aggregation for these, because
the idea of a partial aggregation is that we can reduce network traffic (by
shipping fewer partially aggregated buffers); this does not really apply to
a collect_list where the typical use case is to change the shape of the
data.

I think you have two simple options here:

   1. In the latest branch we added a TypedImperativeAggregate. This allows
   you to use any object as an aggregation buffer. You will need to do some
   serialization though. The ApproximatePercentile aggregate function uses
   this technique.
   2. Exploit the fact that you want to collect a limited number of elements.
   You can use a row as the buffer. This is much easier to work with. See
   HyperLogLogPlusPlus for an example of this.
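
As a rough illustration of the bounded-collect idea, here is a sketch using
the public Aggregator API at the Dataset level (not the Catalyst-internal
TypedImperativeAggregate route above); CollectTopN and the limit passed to it
are made up for the example:

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Keeps at most `n` of the largest values seen. The buffer never grows past
// `n`, so each partition ships only a bounded partial result to the merge.
class CollectTopN(n: Int) extends Aggregator[Int, Seq[Int], Seq[Int]] {
  def zero: Seq[Int] = Seq.empty

  def reduce(buf: Seq[Int], value: Int): Seq[Int] =
    (value +: buf).sorted(Ordering[Int].reverse).take(n)

  def merge(b1: Seq[Int], b2: Seq[Int]): Seq[Int] =
    (b1 ++ b2).sorted(Ordering[Int].reverse).take(n)

  def finish(buf: Seq[Int]): Seq[Int] = buf

  def bufferEncoder: Encoder[Seq[Int]] = Encoders.kryo[Seq[Int]]
  def outputEncoder: Encoder[Seq[Int]] = Encoders.kryo[Seq[Int]]
}

// Usage on a Dataset[Int]: ds.select(new CollectTopN(10).toColumn)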


HTH
-Herman

On Tue, Sep 27, 2016 at 7:02 AM, assaf.mendelson 
wrote:

> Hi,
>
>
>
> I wanted to try to implement https://issues.apache.org/jira/browse/SPARK-17691.
>
> So I started by looking at the implementation of collect_list. My idea
> was to do the same as it does, but when adding a new element, if there are
> already more than the threshold elements, remove one instead.
>
> The problem with this is that since collect_list has no partial
> aggregation we would end up shuffling all the data anyway. So while it
> would mean the actual resulting column might be smaller, the whole process
> would be as expensive as collect_list.
>
> So I thought of adding partial aggregation. The problem is that the merge
> function receives a buffer which is in a row format. Collect_list doesn’t
> use the buffer and uses its own data structure for collecting the data.
>
> I can change the implementation to use a spark ArrayType instead, however,
> since ArrayType is immutable it would mean that I would need to copy it
> whenever I do anything.
>
> Consider the simplest implementation of the update function:
>
> If there are few elements => add an element to the array (if I use regular
> Array this would mean copy as I grow it which is fine for this stage)
>
> If there are enough elements => we do not grow the array. Instead we need
> to decide what to replace. If we want to have the top 10 for example and
> there are 10 elements, we need to drop the lowest and put the new one.
>
> This means that if we simply loop across the array we would create a new
> copy and pay for the copy plus the loop. If we keep it sorted, then adding,
> sorting and removing the lowest one means 3 copies.
>
> If I were able to use Scala's Array, I would basically copy whenever I
> grow it, and once it has grown to the max, all I would need to do is
> REPLACE the relevant element, which is much cheaper.
>
>
>
> The only other solution I see is to simply provide a "take first N" agg
> function and have the user sort beforehand, but this seems like a bad
> solution to me, both because sorting is expensive and because if we do
> multiple aggregations we can't sort in two different ways.
>
>
>
>
>
> I can't find a way to convert the internal data structure that collect_list
> uses into an internal buffer before the merge.
>
> I also can't find any way to use an array in the internal buffer as a
> mutable array. If I look at the GenericMutableRow implementation, updating
> an array means creating a new one. I thought maybe of adding a function
> update_array_element which would change the relevant element (and similarly
> get_array_element to get an array element), which would make it easy to
> make the array mutable, but the documentation states this is not allowed.
>
>
>
> Can anyone give me a tip on where to try to go from here?
>


https://issues.apache.org/jira/browse/SPARK-17691

2016-09-27 Thread assaf.mendelson
Hi,

I wanted to try to implement https://issues.apache.org/jira/browse/SPARK-17691.
So I started by looking at the implementation of collect_list. My idea was to do
the same as it does, but when adding a new element, if there are already more
than the threshold elements, remove one instead.
The problem with this is that since collect_list has no partial aggregation we 
would end up shuffling all the data anyway. So while it would mean the actual 
resulting column might be smaller, the whole process would be as expensive as 
collect_list.
So I thought of adding partial aggregation. The problem is that the merge 
function receives a buffer which is in a row format. Collect_list doesn't use 
the buffer and uses its own data structure for collecting the data.
I can change the implementation to use a spark ArrayType instead, however, 
since ArrayType is immutable it would mean that I would need to copy it 
whenever I do anything.
Consider the simplest implementation of the update function:
If there are few elements => add an element to the array (if I use regular 
Array this would mean copy as I grow it which is fine for this stage)
If there are enough elements => we do not grow the array. Instead we need to 
decide what to replace. If we want to have the top 10 for example and there are 
10 elements, we need to drop the lowest and put the new one.
This means that if we simply loop across the array we would create a new copy
and pay for the copy plus the loop. If we keep it sorted, then adding, sorting
and removing the lowest one means 3 copies.
If I were able to use Scala's Array, I would basically copy whenever I grow it,
and once it has grown to the max, all I would need to do is REPLACE the relevant
element, which is much cheaper.

The only other solution I see is to simply provide a "take first N" agg function
and have the user sort beforehand, but this seems like a bad solution to me, both
because sorting is expensive and because if we do multiple aggregations we can't
sort in two different ways.


I can't find a way to convert the internal data structure that collect_list uses
into an internal buffer before the merge.
I also can't find any way to use an array in the internal buffer as a mutable
array. If I look at the GenericMutableRow implementation, updating an array
means creating a new one. I thought maybe of adding a function
update_array_element which would change the relevant element (and similarly
get_array_element to get an array element), which would make it easy to make the
array mutable, but the documentation states this is not allowed.

Can anyone give me a tip on where to try to go from here?
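
As a side note, the replace-the-lowest-when-full update step described above
can be sketched with a bounded min-heap in plain Scala. This is only the
data-structure idea, not wired into Catalyst's row-based buffers, and
TopNBuffer is an invented name:

import scala.collection.mutable

// Bounded "top N" buffer: it grows until `capacity`, and afterwards each new
// element either replaces the current minimum or is discarded, so the buffer
// is never copied wholesale.
class TopNBuffer(capacity: Int) {
  // Min-heap: with the reversed ordering, `head` is the smallest kept element.
  private val heap = mutable.PriorityQueue.empty[Int](Ordering[Int].reverse)

  def add(value: Int): Unit = {
    if (heap.size < capacity) {
      heap.enqueue(value)
    } else if (value > heap.head) {
      heap.dequeue()       // drop the current minimum
      heap.enqueue(value)  // keep the larger newcomer
    }
  }

  // Merging two partial buffers also stays bounded by `capacity`.
  def merge(other: TopNBuffer): Unit = other.heap.foreach(add)

  def result: Seq[Int] = heap.toSeq.sorted(Ordering[Int].reverse)
}

A structure like this could serve as the aggregation buffer once the
serialization question (e.g. via TypedImperativeAggregate) is sorted out.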





Re: Should LeafExpression have children final override (like Nondeterministic)?

2016-09-27 Thread Reynold Xin
Yes - same thing with children in UnaryExpression, BinaryExpression.
Although I have to say the utility isn't that big here.
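
A minimal standalone sketch of the pattern being discussed, using simplified
stand-ins rather than the actual Catalyst classes:

// Simplified stand-ins for the Catalyst hierarchy, only to show the shape of
// a `final override` of `children`.
abstract class Expression {
  def children: Seq[Expression]
}

abstract class LeafExpression extends Expression {
  // A leaf has no children by definition; `final` keeps subclasses from
  // overriding it accidentally.
  final override def children: Seq[Expression] = Nil
}

abstract class UnaryExpression extends Expression {
  def child: Expression
  final override def children: Seq[Expression] = child :: Nil
}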


On Tue, Sep 27, 2016 at 12:53 AM, Jacek Laskowski  wrote:

> Hi,
>
> Perhaps nitpicking...you've been warned.
>
> While reviewing expressions in Catalyst I've noticed some inconsistency,
> i.e. the Nondeterministic trait marks two methods, deterministic and
> foldable, as final overrides, while LeafExpression does not make children
> final (at the very least).
>
> My thinking is that LeafExpression is there to mark leaf expressions, so
> children is assumed to be Nil.
>
> Should children be final in LeafExpression? Why not? #curious
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>


Should LeafExpression have children final override (like Nondeterministic)?

2016-09-27 Thread Jacek Laskowski
Hi,

Perhaps nitpicking...you've been warned.

While reviewing expressions in Catalyst I've noticed some inconsistency,
i.e. the Nondeterministic trait marks two methods, deterministic and
foldable, as final overrides, while LeafExpression does not make children
final (at the very least).

My thinking is that LeafExpression is there to mark leaf expressions, so
children is assumed to be Nil.

Should children be final in LeafExpression? Why not? #curious

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski




Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-27 Thread Suresh Thalamati

+1 (non-binding)

-suresh


> On Sep 26, 2016, at 11:11 PM, Jagadeesan As  wrote:
> 
> +1 (non binding)
>  
> Cheers,
> Jagadeesan A S
> 
> 
> 
> 
> From: Jean-Baptiste Onofré
> To: dev@spark.apache.org
> Date: 27-09-16 11:27 AM
> Subject: Re: [VOTE] Release Apache Spark 2.0.1 (RC3)
> 
> 
> 
> +1 (non binding)
> 
> Regards
> JB
> 
> On 09/27/2016 07:51 AM, Hyukjin Kwon wrote:
> > +1 (non-binding)
> >
> > 2016-09-27 13:22 GMT+09:00 Denny Lee:
> >
> > +1 on testing with Python2.
> >
> > On Mon, Sep 26, 2016 at 3:13 PM Krishna Sankar wrote:
> >
> > I do run both Python and Scala. But via iPython/Python2 with my
> > own test code. Not running the tests from the distribution.
> > Cheers
> >
> > On Mon, Sep 26, 2016 at 11:59 AM, Holden Karau wrote:
> >
> > I'm seeing some test failures with Python 3 that could
> > definitely be environmental (going to rebuild my virtual env
> > and double check), I'm just wondering if other people are
> > also running the Python tests on this release or if everyone
> > is focused on the Scala tests?
> >
> > On Mon, Sep 26, 2016 at 11:48 AM, Maciej Bryński wrote:
> >
> > +1
> > At last :)
> >
> > 2016-09-26 19:56 GMT+02:00 Sameer Agarwal:
> >
> > +1 (non-binding)
> >
> > On Mon, Sep 26, 2016 at 9:54 AM, Davies Liu wrote:
> >
> > +1 (non-binding)
> >
> > On Mon, Sep 26, 2016 at 9:36 AM, Joseph Bradley wrote:
> >
> > +1
> >
> > On Mon, Sep 26, 2016 at 7:47 AM, Denny Lee wrote:
> >
> > +1 (non-binding)
> >
> > On Sun, Sep 25, 2016 at 23:20 Jeff Zhang wrote:
> >
> > +1
> >
> > On Mon, Sep 26, 2016 at 2:03 PM, Shixiong(Ryan) Zhu wrote:
> >
> > +1
> >
> > On Sun, Sep 25, 2016 at 10:43 PM, Pete Lee wrote:
> >
> > +1
> >
> > On Sun, Sep 25, 2016 at 3:26 PM, Herman van Hövell tot Westerflier wrote:
> >
> > +1 (non-binding)
> >
> > On Sun, Sep 25, 2016 at 2:05 PM, Ricardo Almeida wrote:
> >
> > +1 (non-binding)
> >
> > Built and tested on
> > - Ubuntu 16.04 / OpenJDK 1.8.0_91
> > - CentOS / Oracle Java 1.7.0_55
> > (-Phadoop-2.7 -Dhadoop.version=2.7.3