Re: About introduce function sum0 to Spark

2018-10-23 Thread 陶 加涛
The name is from Apache Calcite, and it doesn't matter; we can introduce our own. --- Regards! Aron Tao From: Mark Hamstra Date: Tuesday, October 23, 2018, 12:28 To: "taojia...@gmail.com" Cc: dev Subject: Re: About introduce function sum0 to Spark > That's a horrible name. This is just a fold.

Re: About introduce function sum0 to Spark

2018-10-23 Thread Wenchen Fan
This is logically `sum( if(isnull(col), 0, col) )`, right? On Tue, Oct 23, 2018 at 2:58 PM 陶 加涛 wrote:
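The semantics Wenchen describes can be sketched in plain Python (an illustration only, not Spark code; nulls are represented here as None):

```python
def sum0(values):
    """Sum that treats nulls (None) as 0 and never returns null:
    equivalent to sum( if(isnull(col), 0, col) ). For an empty or
    all-null input it returns 0, where plain SQL SUM returns NULL."""
    return sum(v for v in values if v is not None)

print(sum0([1, None, 2]))   # 3
print(sum0([None, None]))   # 0 -- SQL SUM would give NULL here
print(sum0([]))             # 0
```

The last two cases are where sum0 and ordinary SUM actually differ.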

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Hyukjin Kwon
I am sorry for raising this late. Out of curiosity, does anyone know why we don't treat SPARK-24935 (https://github.com/apache/spark/pull/22144) as a blocker? It looks like it broke API compatibility, and an actual use case of an external library (https://github.com/DataSketches/sketches-hive). Also,

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Sean Owen
No, because the docs are built into the release and published to the site from the released artifact. As a practical matter, I think these docs are not critical for the release and can follow in a maintenance release. I'd retarget to 2.4.1 or untarget. I do know at times a release's docs have

Re: Hadoop 3 support

2018-10-23 Thread Steve Loughran
> On 16 Oct 2018, at 22:06, t4 wrote:
> has anyone got spark jars working with hadoop3.1 that they can share? i am looking to be able to use the latest hadoop-aws fixes from v3.1

We do, but we do it with * a patched Hive JAR * building Spark with

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Stavros Kontopoulos
Sean, I will try it against 2.12 shortly.

> You're saying someone would have to first build a k8s distro from source too?

Ok, I missed the error one line above; before the distro error there is another one: fatal: not a git repository (or any of the parent directories): .git So that seems to

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Hyukjin Kwon
I am searching and checking some PRs and JIRAs that claim a regression. Let me leave a link - it might be good to double-check https://github.com/apache/spark/pull/22514 as well. On Tue, Oct 23, 2018 at 11:58 PM, Stavros Kontopoulos <stavros.kontopou...@lightbend.com> wrote:

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Xiao Li
https://github.com/apache/spark/pull/22144 is also not a blocker for the Spark 2.4 release, as discussed in the PR. Thanks, Xiao. Xiao Li wrote on Tue, Oct 23, 2018 at 9:20 AM:

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Sean Owen
(I should add, I only observed this with the Scala 2.12 build. It all seemed to work with 2.11, so I'm not too worried about it. I don't think it's a Scala version issue, but perhaps something looking for a Spark 2.11 tarball and not finding it. See

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Xiao Li
Thanks for reporting this. https://github.com/apache/spark/pull/22514 is not a blocker. We can fix it in the next minor release if we are unable to make it into this release. Thanks, Xiao. Sean Owen wrote on Tue, Oct 23, 2018 at 9:14 AM:

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Stavros Kontopoulos
Sean, OK, makes sense; I'm using a cloned repo. I built with the Scala 2.12 profile using the related tag v2.4.0-rc4:

./dev/change-scala-version.sh 2.12
./dev/make-distribution.sh --name test --r --tgz -Pscala-2.12 -Psparkr -Phadoop-2.7 -Pkubernetes -Phive

Pushed images to Docker Hub (previous email)

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Ilan Filonenko
+1 (non-binding) in reference to all k8s tests for 2.11 (including SparkR tests with R version 3.4.1):

[INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @ spark-kubernetes-integration-tests_2.11 ---
Discovery starting.
Discovery completed in 202 milliseconds.
Run starting.

Re: About introduce function sum0 to Spark

2018-10-23 Thread Mark Hamstra
Yes, as long as you are only talking about summing numeric values. Part of my point, though, is that this is just a special case of folding or aggregating with an initial or 'zero' value. It doesn't need to be limited to just numeric sums with zero = 0. On Tue, Oct 23, 2018 at 12:23 AM Wenchen
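Mark's point — that sum0 is just a fold with an explicit initial value — can be sketched in plain Python (an illustration of the concept, not a Spark API):

```python
from functools import reduce

def fold(values, zero, op):
    """General aggregation: combine values with op, starting from
    an explicit initial ('zero') value."""
    return reduce(op, values, zero)

# sum0 falls out as a special case: numeric sum, zero = 0, nulls skipped.
nums = [1, None, 2]
total = fold((v for v in nums if v is not None), 0, lambda a, b: a + b)
print(total)  # 3

# The same shape works for any associative op with an identity,
# e.g. string concatenation with zero = "".
print(fold(["a", "b"], "", lambda a, b: a + b))  # ab
```

Nothing here is specific to numbers or to zero; that is the generalization being argued for.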

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Dongjoon Hyun
BTW, for that integration suite, I saw the related artifacts in the RC4 staging directory. Does Spark 2.4.0 need to start releasing these `spark-kubernetes-integration-tests` artifacts?

Re: Documentation of boolean column operators missing?

2018-10-23 Thread Sean Owen
Those should all be Column functions, really, and I see them at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column On Tue, Oct 23, 2018, 12:27 PM Nicholas Chammas wrote:

Re: Documentation of boolean column operators missing?

2018-10-23 Thread Nicholas Chammas
So it appears then that the equivalent operators for PySpark are completely missing from the docs, right? That’s surprising. And if there are Column function equivalents for |, &, and ~, then I can’t find those either for PySpark. Indeed, I don’t think such a thing is possible in PySpark.

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Stavros Kontopoulos
+1 (non-binding). Run k8s tests with Scala 2.12. Also included the RTestsSuite (mentioned by Ilan) although not part of the 2.4 rc tag: [INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @ spark-kubernetes-integration-tests_2.12 --- Discovery starting. Discovery completed in 239

Re: Documentation of boolean column operators missing?

2018-10-23 Thread Nicholas Chammas
Also, to clarify something for folks who don't work with PySpark: the boolean column operators in PySpark are completely different from those in Scala, and non-obvious to boot (since they overload Python's _bitwise_ operators). So their apparent absence from the docs is surprising.

Re: Documentation of boolean column operators missing?

2018-10-23 Thread Sean Owen
(& and | are both logical and bitwise operators in Java and Scala, FWIW.) I don't see them in the Python docs; they are defined in column.py, but they don't turn up in the docs. Then again, they're not documented:

__and__ = _bin_op('and')
__or__ = _bin_op('or')
__invert__ = _func_op('not')
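The mechanism Sean points at can be illustrated with a toy class (a sketch only, not PySpark's actual Column implementation): Python's bitwise dunder methods are what `&`, `|`, and `~` dispatch to, so a class can make them build expressions instead of computing bits:

```python
class Expr:
    """Toy stand-in for a PySpark Column: records an expression
    string instead of evaluating anything (illustration only)."""
    def __init__(self, text):
        self.text = text
    def __and__(self, other):   # invoked by `a & b`
        return Expr(f"({self.text} AND {other.text})")
    def __or__(self, other):    # invoked by `a | b`
        return Expr(f"({self.text} OR {other.text})")
    def __invert__(self):       # invoked by `~a`
        return Expr(f"(NOT {self.text})")

e = ~Expr("is_exiled") & Expr("age > 60")
print(e.text)  # ((NOT is_exiled) AND age > 60)
```

This is why the operators live on dunder methods in column.py: Python gives no other hook for `&`, `|`, and `~`.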

Re: Documentation of boolean column operators missing?

2018-10-23 Thread Maciej Szymkiewicz
Even if these were documented, Sphinx doesn't include dunder methods by default (with the exception of __init__). There is a :special-members: option which could be passed to, for example, autoclass. On Tue, 23 Oct 2018 at 21:32, Sean Owen wrote:
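For reference, a sketch of how the option Maciej mentions could be passed to autoclass in an .rst file (the module path and member list here are illustrative assumptions, not the project's actual docs config):

```
.. autoclass:: pyspark.sql.Column
   :members:
   :special-members: __and__, __or__, __invert__
```

With an explicit member list, only the named dunder methods are pulled into the generated docs.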

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Sean Owen
To be clear, I'm currently +1 on this release, with much commentary. OK, the explanation for the Kubernetes tests makes sense. Yes, I think we need to propagate the scala-2.12 build profile to make it work. Go for it, if you have a lead on what the change is. This doesn't block the release as it's an

Re: Documentation of boolean column operators missing?

2018-10-23 Thread Nicholas Chammas
On Tue, 23 Oct 2018 at 21:32, Sean Owen wrote:

> The comments say that it is not possible to overload 'and' and 'or', which would have been more natural.

Yes, unfortunately, Python does not allow you to override and, or, or not. They are not implemented as “dunder” methods.
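The reason is that `and` and `or` are short-circuiting control flow, not operators dispatched to a method: Python first calls `__bool__` on the left operand to decide which branch to return. A small plain-Python demo of the difference (mirroring, as an illustration, PySpark's behavior of raising when a Column is used in a boolean context):

```python
class Expr:
    """Toy symbolic expression (illustration only)."""
    def __bool__(self):
        # A symbolic expression has no single truth value; PySpark's
        # Column raises a similar error in this situation.
        raise ValueError("cannot convert Expr to bool; use & instead of 'and'")
    def __and__(self, other):
        return "combined"   # `&` dispatches here and works fine

a, b = Expr(), Expr()
print(a & b)    # combined

try:
    a and b     # `and` forces bool(a), which raises
except ValueError as err:
    print(err)
```

So `&`/`|`/`~` are not merely a stylistic choice; they are the only operators Python lets a library intercept here.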

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Sean Owen
Yeah, that's maybe the issue here. This is a source release, not a git checkout, and it still needs to work in this context. I just added -Pkubernetes to my build and didn't do anything else. I think the ideal is for a "mvn -P... -P... install" to work from a source release; that's a good

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Wenchen Fan
I read through the contributing guide; it only mentions that data correctness and data loss issues should be marked as blockers. AFAIK we also mark regressions of the current release as blockers, but not regressions of previous releases. SPARK-24935 is

Re: Documentation of boolean column operators missing?

2018-10-23 Thread Xiao Li
They are documented at the link below: https://spark.apache.org/docs/2.3.0/api/sql/index.html On Tue, Oct 23, 2018 at 10:27 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Hyukjin Kwon
https://github.com/apache/spark/pull/22514 sounds like a regression that affects Hive CTAS in the write path (by not replacing it with Spark internal data sources, hence a performance regression), but yeah, I doubt we should block the release on this. https://github.com/apache/spark/pull/22144

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Sean Owen
Hm, so you're trying to build a source release from a binary release? I don't think that needs to work, nor do I expect it to, for reasons like this. They just contain fairly different things. On Tue, Oct 23, 2018 at 7:04 PM Dongjoon Hyun wrote:

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Ryan Blue
+1 (non-binding) The Iceberg implementation of DataSourceV2 is passing all tests after updating to the 2.4 API, although I've had to disable ORC support because BufferHolder is no longer public. One oddity is that the DSv2 API for batch sources now includes an epoch ID, which I think will be

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Dongjoon Hyun
Ur, Wenchen. The source distribution seems to fail by default. https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz

$ dev/make-distribution.sh -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
...
+ cp /spark-2.4.0/LICENSE-binary /spark-2.4.0/dist/LICENSE
cp:

Documentation of boolean column operators missing?

2018-10-23 Thread Nicholas Chammas
I can’t seem to find any documentation of the &, |, and ~ operators for PySpark DataFrame columns. I assume that should be in our docs somewhere. Was it always missing? Am I just missing something obvious? Nick

Re: Documentation of boolean column operators missing?

2018-10-23 Thread Nicholas Chammas
Nope, that’s different. I’m talking about the operators on DataFrame columns in PySpark, not SQL functions. For example:

(df
 .where(~col('is_exiled') & (col('age') > 60))
 .show())

On Tue, Oct 23, 2018 at 1:48 PM Xiao Li wrote:
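One consequence worth noting in the example above: because `&` and `~` carry Python's bitwise-operator precedence, `&` binds tighter than comparisons like `>`, so the parentheses around `col('age') > 60` are required. Plain integers show the same parse (an illustration of the precedence rule, independent of Spark):

```python
# `&` binds tighter than `>`, so an unparenthesized comparison is
# grouped unexpectedly:
assert (1 & 2 > 1) == ((1 & 2) > 1)   # parses as (1 & 2) > 1, i.e. 0 > 1
assert (1 & 2 > 1) is False
assert (1 & (2 > 1)) == 1             # the intended grouping gives 1
```

The same pitfall is why PySpark filter expressions conventionally wrap every comparison in parentheses.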