Re: no logging in pyspark code?

2018-09-05 Thread Hyukjin Kwon
FYI, we do have basic logging via the warnings module. On Tue, Aug 28, 2018 at 2:05 AM, Imran Rashid wrote: > Ah, great, thanks! Sorry I missed that, I'll watch that JIRA. > > On Mon, Aug 27, 2018 at 12:41 PM Ilan Filonenko wrote: > >> A JIRA has been opened up on this exact topic: SPARK-25236 >>
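For reference, a minimal sketch of the warnings-based approach mentioned above; the deprecated_param helper and its message are illustrative, not an actual PySpark API:

    import warnings

    def deprecated_param(name):
        # User-facing notice via the standard warnings module, which is how
        # PySpark currently surfaces this kind of message rather than through
        # a full logging framework.
        warnings.warn(
            "Parameter '%s' is deprecated and will be removed in a future "
            "release." % name, DeprecationWarning)

    deprecated_param("numPartitions")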

Re: code freeze and branch cut for Apache Spark 2.4

2018-09-05 Thread Hyukjin Kwon
Oops, one more - https://github.com/apache/spark/pull/6. I just read this thread. On Thu, Sep 6, 2018 at 12:12 PM, Sean Owen wrote: > (I slipped https://github.com/apache/spark/pull/22340 in for Scala 2.12. > Maybe it really is the last one. In any event, yes go ahead with a 2.4 RC) > > On Wed, Sep

Re: python test infrastructure

2018-09-05 Thread Hyukjin Kwon
> 1. all of the output in target/test-reports & python/unit-tests.log should be included in the jenkins archived artifacts. Hmmm, I thought they were already archived ( https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95734/artifact/target/unit-tests.log ). FWIW, unit-tests.log

Re: code freeze and branch cut for Apache Spark 2.4

2018-09-05 Thread Sean Owen
(I slipped https://github.com/apache/spark/pull/22340 in for Scala 2.12. Maybe it really is the last one. In any event, yes go ahead with a 2.4 RC) On Wed, Sep 5, 2018 at 8:14 PM Wenchen Fan wrote: > The repartition correctness bug fix is merged. The Scala 2.12 PRs > mentioned in this thread

Re: code freeze and branch cut for Apache Spark 2.4

2018-09-05 Thread Wenchen Fan
The repartition correctness bug fix is merged. The Scala 2.12 PRs mentioned in this thread are all merged. The Kryo upgrade is done. I'm going to cut the 2.4 branch since all the major blockers are now resolved. Thanks, Wenchen On Sun, Sep 2, 2018 at 12:07 AM sadhen wrote: >

Re: [DISCUSS] PySpark Window UDF

2018-09-05 Thread Li Jin
Hello again! I recently built a proof-of-concept implementation of the proposal above. I think the results are pretty exciting, so I want to share my findings with the community. I have implemented two variants of the pandas window UDF - one that takes pandas.Series as input and one that takes
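For context, a minimal sketch of the existing grouped-aggregate pandas UDF applied over a window, which is the closest currently-available path to the variants described above; the data, column names, and mean aggregation are made up for illustration:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

    @pandas_udf("double", PandasUDFType.GROUPED_AGG)
    def mean_udf(v):
        # v arrives as a pandas.Series covering the window frame.
        return v.mean()

    w = Window.partitionBy("id").rowsBetween(
        Window.unboundedPreceding, Window.unboundedFollowing)
    df.withColumn("mean_v", mean_udf(df["v"]).over(w)).show()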

Re: python test infrastructure

2018-09-05 Thread Imran Rashid
One more: it seems like python/run-tests should at least have an option to not bail at the first failure: https://github.com/apache/spark/blob/master/python/run-tests.py#L113-L132 This is particularly annoying with flaky tests -- since the rest of the tests aren't run, you don't know whether you
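Not Spark's actual run-tests.py, but a minimal sketch of the behavior being asked for -- collect failures and report them all at the end instead of exiting on the first one; the module list is hypothetical:

    import subprocess
    import sys

    # Hypothetical test modules; the real script discovers these itself.
    modules = ["pyspark.tests", "pyspark.sql.tests", "pyspark.streaming.tests"]

    failures = []
    for mod in modules:
        ret = subprocess.call([sys.executable, "-m", mod])
        if ret != 0:
            failures.append(mod)  # keep going rather than exiting immediately

    if failures:
        print("Failed modules: %s" % ", ".join(failures))
        sys.exit(1)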

Kubernetes Big-Data-SIG notes, September 5

2018-09-05 Thread Erik Erlandson
Meta: At the weekly K8s Big Data SIG meeting today, we agreed to experiment with publishing a brief summary of noteworthy Spark-related topics from the weekly meeting to dev@spark, as a reference for interested members of the Apache Spark community. The format is a brief summary, including a link

python test infrastructure

2018-09-05 Thread Imran Rashid
Hi all, More PySpark noob questions from me. I find it really hard to figure out which versions of Python I should be testing and what is tested upstream. While I'd like to just know the answers to those questions, more importantly I'd like to make sure that info is visible somewhere so all devs
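For what it's worth, python/run-tests does let you point it at specific interpreters, which is one way to check a particular version locally; the executable names and module below are just examples of what might be installed:

    ./python/run-tests --python-executables=python2.7,python3.5 --modules=pyspark-core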

Re: python tests: any reason for a huge tests.py?

2018-09-05 Thread Imran Rashid
I filed https://issues.apache.org/jira/browse/SPARK-25344. On Fri, Aug 24, 2018 at 11:57 AM Reynold Xin wrote: > We should break it. > > On Fri, Aug 24, 2018 at 9:53 AM Imran Rashid > wrote: > >> Hi, >> >> Another question from looking more at Python recently. Is there any >> reason we've got

Re: Select top (100) percent equivalent in spark

2018-09-05 Thread Liang-Chi Hsieh
Thanks for pinging me. It seems to me we should not make assumptions about the value of the spark.sql.execution.topKSortFallbackThreshold config. Once it is changed, the global sort + limit can currently produce wrong results. I will make a PR for this. cloud0fan wrote > + Liang-Chi and Herman, > > I
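To make the concern concrete, a sketch of the pattern in question; the threshold value and the range data are made up. A global sort followed by a limit is planned as an in-memory top-K sort when the limit is under the threshold, and as a full sort plus limit otherwise:

    spark.conf.set("spark.sql.execution.topKSortFallbackThreshold", 1000)

    # Global sort + limit; compare the physical plans with the limit above
    # and below the configured threshold.
    df = spark.range(0, 100000).orderBy("id", ascending=False).limit(100)
    df.explain()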

Re: Select top (100) percent equivalent in spark

2018-09-05 Thread Chetan Khatri
Sean, thank you. Do you think tempDF.orderBy($"invoice_id".desc).limit(100) would give the same result? I think so. Thanks On Wed, Sep 5, 2018 at 12:58 AM Sean Owen wrote: > Sort and take head(n)? > > On Tue, Sep 4, 2018 at 12:07 PM Chetan Khatri > wrote: > >> Dear Spark dev, anything
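The same pattern in PySpark, for comparison; tempDF and invoice_id are taken from the snippet above:

    from pyspark.sql.functions import col

    # Sort descending by invoice_id, then keep the first 100 rows.
    top100 = tempDF.orderBy(col("invoice_id").desc()).limit(100)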