Re: [SQL] [Suggestion] Add top() to Dataset

2018-01-30 Thread Wenchen Fan
You can use `Dataset.limit`, which returns a new `Dataset` instead of an
Array. You can then transform it and still get the top-k optimization from
Spark.
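
As a hedged illustration of that pattern (the SparkSession setup, data, and
column names below are invented for the example, not taken from this thread):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("top-k-via-limit")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val scores = Seq(("a", 3L), ("b", 9L), ("c", 5L), ("d", 7L)).toDF("key", "score")

    // orderBy + limit returns a DataFrame, so further transformations stay in
    // the query plan instead of being collected to the driver.
    val top2 = scores.orderBy($"score".desc).limit(2)
    val enriched = top2.withColumn("bonus", $"score" * 10)
    enriched.show()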

On Wed, Jan 31, 2018 at 3:39 PM, Yacine Mazari  wrote:

> Thanks for the quick reply and explanation @rxin.
>
> So if one does not want to collect()/take() but wants the top k as a Dataset
> for further transformations, there is no optimized API; that's why I am
> suggesting adding "top()" as a public method.
>
> If that sounds like a good idea, I will open a ticket and implement it.


Re: [SQL] [Suggestion] Add top() to Dataset

2018-01-30 Thread Yacine Mazari
Thanks for the quick reply and explanation @rxin.

So if one does not want to collect()/take() but wants the top k as a Dataset
for further transformations, there is no optimized API; that's why I am
suggesting adding "top()" as a public method.

If that sounds like a good idea, I will open a ticket and implement it.






Re: [SQL] [Suggestion] Add top() to Dataset

2018-01-30 Thread Reynold Xin
For the DataFrame/Dataset API, the optimizer actually rewrites orderBy followed
by a take into a priority-queue-based top implementation.
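
For anyone who wants to verify this locally, here is a hedged sketch (the data
and column names are invented; the exact operator name can vary across Spark
versions, but in the builds I have checked the physical plan shows
TakeOrderedAndProject for this pattern):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("top-k-plan").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 3L), ("b", 9L), ("c", 5L)).toDF("key", "score")

    // explain() prints the physical plan; for orderBy + limit/take it is
    // typically a TakeOrderedAndProject node (a priority-queue top-k),
    // not a full sort followed by a limit.
    df.orderBy($"score".desc).limit(2).explain()

    // take() runs the same optimized plan but collects the rows to the driver.
    val topRows = df.orderBy($"score".desc).take(2)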


On Tue, Jan 30, 2018 at 11:10 PM, Yacine Mazari  wrote:

> Hi All,
>
> Would it make sense to add a "top()" method to the Dataset API?
> This method would return a Dataset containing the top k elements; the caller
> may then do further processing on the Dataset or call collect(). This is in
> contrast with RDD's top(), which returns a collected array.
>
> In terms of implementation, this would use a bounded priority queue, which
> avoids sorting all elements and runs in O(n log k).
>
> I know something similar can be achieved with "orderBy().take()", but I am not
> sure whether that is optimized.
> If it is not, and it sorts all elements (therefore running in O(n log n)), it
> might be handy to add this method.
>
> What do you think?
>
> Regards,
> Yacine.
>
>
>
>


[SQL] [Suggestion] Add top() to Dataset

2018-01-30 Thread Yacine Mazari
Hi All,

Would it make sense to add a "top()" method to the Dataset API?
This method would return a Dataset containing the top k elements; the caller
may then do further processing on the Dataset or call collect(). This is in
contrast with RDD's top(), which returns a collected array.

In terms of implementation, this would use a bounded priority queue, which
avoids sorting all elements and runs in O(n log k).
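
For readers unfamiliar with the technique, here is a minimal, self-contained
sketch of the bounded-priority-queue idea in plain Scala; it only illustrates
the O(n log k) argument and is not Spark's actual implementation:

    import scala.collection.mutable

    def topK[T](items: Iterator[T], k: Int)(implicit ord: Ordering[T]): Seq[T] = {
      // With the reversed ordering, the queue's head is the smallest of the
      // (at most k) elements currently kept, so each update costs O(log k).
      val heap = mutable.PriorityQueue.empty[T](ord.reverse)
      items.foreach { x =>
        if (heap.size < k) heap.enqueue(x)
        else if (ord.gt(x, heap.head)) { heap.dequeue(); heap.enqueue(x) }
      }
      heap.toSeq.sorted(ord.reverse) // largest first
    }

    // topK(Iterator(5, 1, 9, 3, 7), 3) returns Seq(9, 7, 5)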

I know something similar can be achieved with "orderBy().take()", but I am not
sure whether that is optimized.
If it is not, and it sorts all elements (therefore running in O(n log n)), it
might be handy to add this method.

What do you think?

Regards,
Yacine.







Re: PSA: Release and commit quality

2018-01-30 Thread Xiao Li
Hi, Ryan,

Thanks for your input. These comments are very helpful! Please continue
to help us improve Spark and the Spark community.

Thanks again,

Xiao



2018-01-30 12:58 GMT-08:00 Ryan Blue :

> Hi everyone,
>
> I’ve noticed some questionable practices around commits going into master
> lately (and historically, to be honest) and I want to remind everyone about
> some best practices for commit and release quality.
>
>-
>
>*Please don’t mix partial, unrelated changes into a commit.* This
>makes it very difficult to manage branches and to correctly address bugs
>when they are found because of conflicts when picking or reverting commits.
>
>If you need to revert a commit, but it made unrelated changes then you
>have to go resolve the conflict by hand. When backporting to a release
>branch, unrelated changes require you to pull in more commits to get
>patches to apply, or again resolve the conflicts by hand. Every conflict
>makes maintenance take longer and increases the risk of mistakes.
>
>When submitting and reviewing patches, please think about whether
>cherry-picking or reverting the commit would cause unintended changes, and
>remove them.
>-
>
>*Please don’t commit unfinished changes.* If you want to pull features
>from master to a release branch, it is much harder if you also need to
>track down all of the additional commits needed to actually make it work.
>I’m not talking about splitting work into reasonable chunks; I’m talking
>about getting commits in before they’re reasonably finished. This makes it
>hard to know where the remaining changes were merged, or if they were
>merged at all.
>
>Often, I see this combined with the first problem. For example, #19925
>was merged with a comment: “Merging to master to unblock #19926. If there
>are more comments, we can address them in #19926.” This change was “Add
>Structured Streaming APIs to DataSourceV2”, something that should clearly
>be finished before committing. Instead, parts of it might be lurking in
>“Split StreamExecution into MicroBatchExecution and StreamExecution.”
>-
>
>*Please don’t commit before reviews are finished.* This can be a tough
>call at times, but it is usually pretty clear. Are there people who should
>review a PR? Is there still ongoing discussion in the PR? Is there a +1
>from a Spark committer? (I recently noticed both ignoring an ongoing
>discussion and not getting a committer +1 on the same issue!)
>
>This is also a community problem. Whether or not it is intentional,
>committing prematurely sends a negative message to the community that we
>need to avoid. A good skill for a committer is to know when you should get
>agreement or consensus, even if you have a +1.
>-
>
>*Please don’t rush commits.* Rushing is cutting corners, and that
>almost always causes the problems above or others.
>
>#20427 is a good example,
>where admittedly rushing a commit led to trying to mix unrelated changes
>into the PR. #19925
>was committed unfinished to unblock another commit, despite adding a public
>API that should require a careful review.
>
> We all mess this up, myself included, and my intent is not to point
> fingers, but to show examples of where we can improve. It is just far
> easier to get a branch committed as-is than to adhere to these guidelines,
> but these are important for our releases and downstream users.
>
> Thanks for reading,
>
> rb
> ​
> --
> Ryan Blue
> Software Engineer
> Netflix
>


PSA: Release and commit quality

2018-01-30 Thread Ryan Blue
Hi everyone,

I’ve noticed some questionable practices around commits going into master
lately (and historically, to be honest) and I want to remind everyone about
some best practices for commit and release quality.

   -

   *Please don’t mix partial, unrelated changes into a commit.* This makes
   it very difficult to manage branches and to correctly address bugs when
   they are found because of conflicts when picking or reverting commits.

   If you need to revert a commit, but it made unrelated changes then you
   have to go resolve the conflict by hand. When backporting to a release
   branch, unrelated changes require you to pull in more commits to get
   patches to apply, or again resolve the conflicts by hand. Every conflict
   makes maintenance take longer and increases the risk of mistakes.

   When submitting and reviewing patches, please think about whether
   cherry-picking or reverting the commit would cause unintended changes, and
   remove them.
   -

   *Please don’t commit unfinished changes.* If you want to pull features
   from master to a release branch, it is much harder if you also need to
   track down all of the additional commits needed to actually make it work.
   I’m not talking about splitting work into reasonable chunks; I’m talking
   about getting commits in before they’re reasonably finished. This makes it
   hard to know where the remaining changes were merged, or if they were
   merged at all.

   Often, I see this combined with the first problem. For example, #19925 was
   merged with a comment: “Merging to master to unblock #19926. If there are
   more comments, we can address them in #19926.” This change was “Add
   Structured Streaming APIs to DataSourceV2”, something that should clearly
   be finished before committing. Instead, parts of it might be lurking in
   “Split StreamExecution into MicroBatchExecution and StreamExecution.”
   -

   *Please don’t commit before reviews are finished.* This can be a tough
   call at times, but it is usually pretty clear. Are there people who should
   review a PR? Is there still ongoing discussion in the PR? Is there a +1
   from a Spark committer? (I recently noticed both ignoring an ongoing
   discussion and not getting a committer +1 on the same issue!)

   This is also a community problem. Whether or not it is intentional,
   committing prematurely sends a negative message to the community that we
   need to avoid. A good skill for a committer is to know when you should get
   agreement or consensus, even if you have a +1.
   -

   *Please don’t rush commits.* Rushing is cutting corners, and that almost
   always causes the problems above or others.

   #20427 is a good example,
   where admittedly rushing a commit led to trying to mix unrelated changes
   into the PR. #19925 was
   committed unfinished to unblock another commit, despite adding a public API
   that should require a careful review.

We all mess this up, myself included, and my intent is not to point
fingers, but to show examples of where we can improve. It is just far
easier to get a branch committed as-is than to adhere to these guidelines,
but these are important for our releases and downstream users.

Thanks for reading,

rb
​
-- 
Ryan Blue
Software Engineer
Netflix


Re: [VOTE] Spark 2.3.0 (RC2)

2018-01-30 Thread Andrew Ash
I'd like to nominate SPARK-23274 as a potential blocker for the 2.3.0 release
as well, since it is a regression from 2.2.0. The ticket includes a simple
repro showing a query that works in prior releases but now fails with an
exception in the Catalyst optimizer.

On Fri, Jan 26, 2018 at 10:41 AM, Sameer Agarwal 
wrote:

> This vote has failed due to a number of the aforementioned blockers. I'll
> follow up with RC3 as soon as the 2 remaining (non-QA) blockers are
> resolved: https://s.apache.org/oXKi
>
>
> On 25 January 2018 at 12:59, Sameer Agarwal  wrote:
>
>>
>> Most tests pass on RC2, except I'm still seeing the timeout caused by
>>> https://issues.apache.org/jira/browse/SPARK-23055 ; the tests never
>>> finish. I followed the thread a bit further and wasn't clear whether it was
>>> subsequently re-fixed for 2.3.0 or not. It says it's resolved along with
>>> https://issues.apache.org/jira/browse/SPARK-22908 for 2.3.0 though I am
>>> still seeing these tests fail or hang:
>>>
>>> - subscribing topic by name from earliest offsets (failOnDataLoss: false)
>>> - subscribing topic by name from earliest offsets (failOnDataLoss: true)
>>>
>>
>> Sean, while some of these tests were timing out on RC1, we're not aware
>> of any known issues in RC2. Both maven
>> (https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.6/146/testReport/org.apache.spark.sql.kafka010/history/)
>> and sbt
>> (https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/123/testReport/org.apache.spark.sql.kafka010/history/)
>> historical builds on jenkins for org.apache.spark.sql.kafka010 look fairly
>> healthy. If you're still
>> seeing timeouts in RC2, can you create a JIRA with any applicable build/env
>> info?
>>
>>
>>
>>> On Tue, Jan 23, 2018 at 9:01 AM Sean Owen  wrote:
>>>
 I'm not seeing that same problem on OS X and /usr/bin/tar. I tried
 unpacking it with 'xvzf' and also unzipping it first, and it untarred
 without warnings in either case.

 I am encountering errors while running the tests, different ones each
 time, so am still figuring out whether there is a real problem or just
 flaky tests.

 These issues look like blockers, as they inherently need to be completed
 before the 2.3 release. They are mostly not done. I suppose I'd -1 on
 behalf of those who say this needs to be done first; we can keep testing,
 though.

 SPARK-23105 Spark MLlib, GraphX 2.3 QA umbrella
 SPARK-23114 Spark R 2.3 QA umbrella

 Here are the remaining items targeted for 2.3:

 SPARK-15689 Data source API v2
 SPARK-20928 SPIP: Continuous Processing Mode for Structured Streaming
 SPARK-21646 Add new type coercion rules to compatible with Hive
 SPARK-22386 Data Source V2 improvements
 SPARK-22731 Add a test for ROWID type to OracleIntegrationSuite
 SPARK-22735 Add VectorSizeHint to ML features documentation
 SPARK-22739 Additional Expression Support for Objects
 SPARK-22809 pyspark is sensitive to imports with dots
 SPARK-22820 Spark 2.3 SQL API audit


 On Mon, Jan 22, 2018 at 7:09 PM Marcelo Vanzin 
 wrote:

> +0
>
> Signatures check out. Code compiles, although I see the errors in [1]
> when untarring the source archive; perhaps we should add "use GNU tar"
> to the RM checklist?
>
> Also ran our internal tests and they seem happy.
>
> My concern is the list of open bugs targeted at 2.3.0 (ignoring the
> documentation ones). It is not long, but it seems some of those need
> to be looked at. It would be nice for the committers who are involved
> in those bugs to take a look.
>
> [1] https://superuser.com/questions/318809/linux-os-x-tar-incompatibility-tarballs-created-on-os-x-give-errors-when-unt
>
>
> On Mon, Jan 22, 2018 at 1:36 PM, Sameer Agarwal 
> wrote:
> > Please vote on releasing the following candidate as Apache Spark
> version
> > 2.3.0. The vote is open until Friday January 26, 2018 at 8:00:00 am
> UTC and
> > passes if a majority of at least 3 PMC +1 votes are cast.
> >
> >
> > [ ] +1 Release this package as Apache Spark 2.3.0
> >
> > [ ] -1 Do not release this package because ...
> >
> >
> > To learn more about Apache Spark, please see
> https://spark.apache.org/
> >
> > The tag to be voted on is v2.3.0-rc2:
> > https://github.com/apache/spark/tree/v2.3.0-rc2
> > (489ecb0ef23e5d9b705e5e5bae4fa3d871bdac91)
> >
> > List of JIRA tickets resolved in this release can be found here:
> > https://issues.apache.org/jira/projects/SPARK/versions/12339551
> >
> > The 

Re: ClassNotFoundException while running unit test with local cluster mode in Intellij IDEA

2018-01-30 Thread wuyi
Hi, cloud0fan,

I tried it and it works really well. Thanks again!






Re: ClassNotFoundException while running unit test with local cluster mode in Intellij IDEA

2018-01-30 Thread wuyi
Hi, cloud0fan.
Yeah, the tests run well in SBT.
Maybe I should try your way. Thanks!






Re: ClassNotFoundException while running unit test with local cluster mode in Intellij IDEA

2018-01-30 Thread Wenchen Fan
You can run the test in SBT and attach IntelliJ IDEA to it for debugging, which
works for me.
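
For background on why running under SBT (or fixing the classpath) helps: in
local-cluster mode the executors run as separate JVMs, so classes compiled only
inside the IDE may not be visible to them, which matches the
ClassNotFoundException for the suite's anonymous closure classes below. A
hedged sketch of the kind of setup involved (worker count and memory are
placeholder values, not the actual suite configuration):

    import org.apache.spark.{SparkConf, SparkContext}

    // "local-cluster[2,1,1024]" asks for 2 workers with 1 core and 1024 MB each.
    // Unlike local[*], the executors are separate JVM processes, so they can
    // only load classes that are on their own classpath.
    val conf = new SparkConf()
      .setMaster("local-cluster[2,1,1024]")
      .setAppName("local-cluster-classpath-demo")
    val sc = new SparkContext(conf)

    // Each closure below is compiled into an anonymous class (for a test suite,
    // something like BroadcastSuite$$anonfun$...); the executor JVMs must be
    // able to load that class to run the task.
    val doubled = sc.parallelize(1 to 10, 2).map(_ * 2).collect()
    sc.stop()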

On Tue, Jan 30, 2018 at 7:44 PM, wuyi  wrote:

> Dear devs,
> I've been stuck on this issue for several days, and I need help now.
> At first, I ran into an old issue, which is the same as
> http://apache-spark-developers-list.1001551.n3.nabble.com/test-cases-stuck-on-quot-local-cluster-mode-quot-of-ReplSuite-td3086.html
>
> So I checked my assembly jar, added it to the dependencies of the core
> project (I run the unit test within this sub-project), and set SPARK_HOME
> (even though I did not have a wrong SPARK_HOME before).
>
> After that, the unit test in local-cluster mode no longer blocks all the
> time, but throws a *ClassNotFoundException*, such as:
>
> Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most
> recent failure:
> Lost task 1.3 in stage 0.0 (TID 5, localhost, executor 3):
> java.lang.ClassNotFoundException:
> org.apache.spark.broadcast.BroadcastSuite$$anonfun$15$$anonfun$16
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
> ...
> Driver stacktrace:
> org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in
> stage 0.0 (TID 5, localhost, executor 3): java.lang.ClassNotFoundException:
> org.apache.spark.broadcast.BroadcastSuite$$anonfun$15$$anonfun$16
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
> ...
>
> I then tried rebuilding the whole Spark project and the core project,
> building/testing the core project, adding the
> 'spark.driver.extraClassPath'/'spark.executor.extraClassPath' params, and so
> on, but all of these failed.
>
> Maybe I am missing something when trying to run a unit test in *local
> cluster* mode in *IntelliJ IDEA*.
> I'd appreciate it a lot if anyone could give me a hint. Thanks.
>
> Best wishes.
> wuyi
>
>
>


[SQL] Tests for ExtractFiltersAndInnerJoins.flattenJoin

2018-01-30 Thread Jacek Laskowski
Hi,

While exploring the ReorderJoin optimization, I wrote a few unit-test-like
examples that demonstrate how ExtractFiltersAndInnerJoins.flattenJoin [1] works.

I've been wondering whether the examples could become unit tests instead. There
are 6 different join-filter plan combinations, using the Catalyst DSL to create
the plans.
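
For reference, a hedged sketch of what one such example might look like; this
uses internal Catalyst APIs, so the exact signatures may differ between Spark
versions, and the relations and conditions are invented for illustration:

    import org.apache.spark.sql.catalyst.dsl.expressions._
    import org.apache.spark.sql.catalyst.dsl.plans._
    import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
    import org.apache.spark.sql.catalyst.plans.Inner
    import org.apache.spark.sql.catalyst.plans.logical.LocalRelation

    // Three tiny relations built with the Catalyst DSL.
    val t1 = LocalRelation('a.int)
    val t2 = LocalRelation('b.int)
    val t3 = LocalRelation('c.int)

    // A filter on top of a chain of inner joins.
    val plan = t1
      .join(t2, Inner, Some('a === 'b))
      .join(t3, Inner, Some('a === 'c))
      .where('a > 1)
      .analyze

    // The extractor flattens the join tree into the base plans plus the
    // collected join and filter conditions, which is what flattenJoin produces.
    val ExtractFiltersAndInnerJoins(plans, conditions) = plan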

Should I file a task in JIRA for this, or just send a pull request? I'd
appreciate some guidance.

[1]
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L167

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski


ClassNotFoundException while running unit test with local cluster mode in Intellij IDEA

2018-01-30 Thread wuyi
Dear devs,
I've been stuck on this issue for several days, and I need help now.
At first, I ran into an old issue, which is the same as
http://apache-spark-developers-list.1001551.n3.nabble.com/test-cases-stuck-on-quot-local-cluster-mode-quot-of-ReplSuite-td3086.html

So I checked my assembly jar, added it to the dependencies of the core
project (I run the unit test within this sub-project), and set SPARK_HOME
(even though I did not have a wrong SPARK_HOME before).

After that, the unit test in local-cluster mode no longer blocks all the
time, but throws a *ClassNotFoundException*, such as:

Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most
recent failure: 
Lost task 1.3 in stage 0.0 (TID 5, localhost, executor 3):
java.lang.ClassNotFoundException:
org.apache.spark.broadcast.BroadcastSuite$$anonfun$15$$anonfun$16
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
...
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in
stage 0.0 (TID 5, localhost, executor 3): java.lang.ClassNotFoundException:
org.apache.spark.broadcast.BroadcastSuite$$anonfun$15$$anonfun$16
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
...

I then tried rebuilding the whole Spark project and the core project,
building/testing the core project, adding the
'spark.driver.extraClassPath'/'spark.executor.extraClassPath' params, and so
on, but all of these failed.

Maybe I am missing something when trying to run a unit test in *local
cluster* mode in *IntelliJ IDEA*.
I'd appreciate it a lot if anyone could give me a hint. Thanks.

Best wishes.
wuyi


