Re: Coalesce behaviour

2018-10-10 Thread Sergey Zhemzhitsky
Well, it seems that I can still extend the CoalesceRDD to make it preserve the total number of partitions from the parent RDD, reduce some partitions in the same way as the original coalesce does for map-only jobs, and fill the gaps (partitions which should reside on the positions of the coalesced
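For context, a minimal spark-shell sketch of the stock behaviour under discussion (partition counts invented for illustration):

    val parent   = sc.parallelize(1 to 1000, 100)  // 100 partitions
    // plain coalesce narrows the whole map-only job to 10 tasks
    val narrowed = parent.map(_ * 2).coalesce(10)
    // shuffle = true keeps the map at 100 tasks, at the cost of a shuffle
    val shuffled = parent.map(_ * 2).coalesce(10, shuffle = true)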

Docker image to build Spark/Spark doc

2018-10-10 Thread assaf.mendelson
Hi all, I was wondering if there was a docker image to build Spark and/or the Spark documentation. The idea would be that I would start the docker image, supplying the directory with my code and a target directory, and it would simply build everything (maybe with some options). Any chance there is
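A hypothetical invocation of the kind of image being asked about (the image name and mount points are invented; no such official image is known):

    # mount the source checkout and a target directory, build inside the container
    docker run --rm \
      -v "$PWD":/opt/spark-src \
      -v "$PWD/dist":/opt/spark-out \
      spark-build:latest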

Re: Docker image to build Spark/Spark doc

2018-10-10 Thread Robert Kruszewski
My colleagues and I built one for running Spark builds on CircleCI. The images are at https://hub.docker.com/r/palantirtechnologies/circle-spark-python/ (circle-spark-r if you want to build SparkR). Dockerfiles for those images can be found at
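Assuming the linked image is public on Docker Hub, usage would look roughly like this (the mount point inside the image is a guess):

    docker pull palantirtechnologies/circle-spark-python
    docker run --rm -it -v "$PWD":/home/circleci/spark \
      palantirtechnologies/circle-spark-python \
      bash -c "cd /home/circleci/spark && ./build/mvn -DskipTests package"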

Re: Docker image to build Spark/Spark doc

2018-10-10 Thread Sean Owen
You can just build it with Maven or SBT as in the docs. I don't know of a docker image but there isn't much to package. On Wed, Oct 10, 2018, 1:10 AM assaf.mendelson wrote: > Hi all, > I was wondering if there was a docker image to build spark and/or spark > documentation > > The idea would be
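The documented build commands, for reference (the docs build needs the Ruby/Jekyll setup described in docs/README.md):

    ./build/mvn -DskipTests clean package   # Maven
    ./build/sbt package                     # or SBT
    cd docs && SKIP_API=1 jekyll build      # docs only, skipping the API docs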

Re: Coalesce behaviour

2018-10-10 Thread Wenchen Fan
Note that RDD partitions and Spark tasks are not always a 1-1 mapping. Assuming `rdd1` has 100 partitions, and `rdd2 = rdd1.coalesce(10)`. Then `rdd2` has 10 partitions, and there is no shuffle between `rdd1` and `rdd2`. During scheduling, `rdd1` and `rdd2` are in the same stage, and this stage
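A spark-shell sketch of that example (sizes invented):

    val rdd1 = sc.parallelize(1 to 1000, 100)
    val rdd2 = rdd1.coalesce(10)   // no shuffle
    rdd2.getNumPartitions          // 10
    rdd2.count()                   // one stage with 10 tasks; each task
                                   // computes ~10 of rdd1's partitions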

Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Wenchen Fan
I'm adding my own +1, since there are no known blocker issues. The correctness issue has been fixed, the streaming Java API problem has been resolved, and we have upgraded to Scala 2.12.7. On Thu, Oct 11, 2018 at 12:46 AM Wenchen Fan wrote: > Please vote on releasing the following candidate as

RE: [VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Garlapati, Suryanarayana (Nokia - IN/Bangalore)
You might need to change the date (Oct 1 has already passed). >> The vote is open until October 1 PST and passes if a majority +1 PMC votes >> are cast, with >> a minimum of 3 +1 votes. Regards Surya From: Wenchen Fan Sent: Wednesday, October 10, 2018 10:20 PM To: Spark dev list Subject:

Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Imran Rashid
Sorry, I had messed up my testing earlier, so I only just discovered https://issues.apache.org/jira/browse/SPARK-25704. I don't think this is a release blocker, because it's not a regression and there is a workaround; just FYI. On Wed, Oct 10, 2018 at 11:47 AM Wenchen Fan wrote: > Please vote on

Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Jean Georges Perrin
Hi, Sorry if it's a stupid question, but where can I find the release notes for 2.4.0? jg > On Oct 10, 2018, at 2:00 PM, Imran Rashid > wrote: > > Sorry I had messed up my testing earlier, so I only just discovered >

Re: moving the spark jenkins job builder repo from dbricks --> spark

2018-10-10 Thread Marcelo Vanzin
Thanks for doing this. The more things we have accessible to project members in general, the better! (Now there's that hive fork repo somewhere, but let's not talk about that.) On Wed, Oct 10, 2018 at 9:30 AM shane knapp wrote: >> > * the JJB templates are able to be run by anyone w/jenkins

Re: [DISCUSS] Cascades style CBO for Spark SQL

2018-10-10 Thread 吴晓菊
Hi All, Takeshi Yamamuro gave some comments on this topic on Twitter. After more research, here are corrections and updates to my understanding of bottom-up and top-down. Bottom-up and top-down are just two strategies to enumerate join orders and generate the search space. Both of them
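A toy Scala sketch (not Spark code; the relation sizes and cost model are invented) of the two enumeration strategies exploring the same plan space:

    object JoinEnumSketch {
      val sizes = Map("A" -> 100L, "B" -> 10L, "C" -> 1000L)
      val rels  = sizes.keys.toVector

      // toy cost of materializing a join over a set of relations
      def card(s: Set[String]): Long = s.iterator.map(sizes).product

      // bottom-up (Selinger-style DP): fill the memo table size by size
      def bottomUp(): Long = {
        val best = scala.collection.mutable.Map[Set[String], Long]()
        rels.foreach(r => best(Set(r)) = 0L)
        for (k <- 2 to rels.size; s <- rels.combinations(k).map(_.toSet)) {
          val splits = s.subsets().filter(l => l.nonEmpty && l != s)
          best(s) = splits.map(l => best(l) + best(s -- l) + card(s)).min
        }
        best(rels.toSet)
      }

      // top-down: recurse from the full set, memoizing only what is reached
      private val memo = scala.collection.mutable.Map[Set[String], Long]()
      def topDown(s: Set[String]): Long = memo.getOrElse(s, {
        val c =
          if (s.size == 1) 0L
          else s.subsets().filter(l => l.nonEmpty && l != s)
                .map(l => topDown(l) + topDown(s -- l) + card(s)).min
        memo(s) = c
        c
      })

      def main(args: Array[String]): Unit = {
        println(bottomUp())          // same optimal cost...
        println(topDown(rels.toSet)) // ...from either direction
      }
    }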

Re: moving the spark jenkins job builder repo from dbricks --> spark

2018-10-10 Thread shane knapp
hey everyone! just for visibility, after some lengthy conversations w/some PMC members (mostly sean and josh) about the jenkins job builder templates being located in a private databricks repo, we've decided to move them into the main apache spark repo.

[VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version 2.4.0. The vote is open until October 1 PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.0 [ ] -1 Do not release this package because ... To

Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Jean Georges Perrin
Awesome - thanks Dongjoon! > On Oct 10, 2018, at 2:36 PM, Dongjoon Hyun wrote: > > For now, you can see the generated release notes. The official one will be posted on > the website when the official 2.4.0 is out. > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420&version=12342385

Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Nicholas Chammas
FYI I believe we have an open correctness issue here: https://issues.apache.org/jira/browse/SPARK-25150 However, it needs review by another person to confirm whether it is indeed a correctness issue (and whether it still impacts this latest RC). Nick On Wed, Oct 10, 2018 at 3:14 PM, Jean Georges

Re: Remove Flume support in 3.0.0?

2018-10-10 Thread Jörn Franke
I think it makes sense to remove it. If it is not too much effort, and the architecture of the Flume source is not considered too strange, one could extract it as a separate project and put it on GitHub in a dedicated, unsupported repository. This would enable distributors and other companies

Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Dongjoon Hyun
For now, you can see the generated release notes. The official one will be posted on the website when the official 2.4.0 is out. https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420&version=12342385 Bests, Dongjoon. On Wed, Oct 10, 2018 at 11:29 AM Jean Georges Perrin wrote: > Hi, > >

Re: moving the spark jenkins job builder repo from dbricks --> spark

2018-10-10 Thread shane knapp
> > Not sure if that's what you meant; but it should be ok for the jenkins > servers to manually sync with master after you (or someone else) have > verified the changes. That should prevent inadvertent breakages since > I don't expect it to be easy to test those scripts without access to > some

Sql custom streamer design questions and feedback

2018-10-10 Thread Vadim Chekan
Hi all, I am trying to write a custom SQL streaming source and I have quite a lot of questions about how it is envisioned to be done. My first attempt was to extend org.apache.spark.sql.execution.streaming.Source. At first it looks simple: the data source tells Spark what last offset it has, and Spark would ask
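A minimal skeleton of that v1 Source trait, with invented placeholder logic (a counter). Note it is all internal API; the package declaration is only so the sketch can reach the private[sql] internalCreateDataFrame that v1 sources use to return a streaming DataFrame, and a real connector also needs a provider/registration, omitted here:

    package org.apache.spark.sql.sketch

    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.execution.streaming.{LongOffset, Offset, Source}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    class CounterSource(sqlContext: SQLContext) extends Source {
      private val schemaDef = StructType(StructField("value", LongType) :: Nil)
      @volatile private var latest = 0L          // pretend data keeps arriving

      override def schema: StructType = schemaDef

      // Spark polls this to learn the newest offset the source can serve
      override def getOffset: Option[Offset] =
        if (latest == 0L) None else Some(LongOffset(latest))

      // Spark then asks for the slice (start, end] as a streaming DataFrame
      override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
        val from = start.collect { case LongOffset(v) => v }.getOrElse(0L)
        val to   = end match { case LongOffset(v) => v; case _ => from }
        val rows = sqlContext.sparkContext.parallelize(from until to)
          .map(v => InternalRow(v))
        sqlContext.internalCreateDataFrame(rows, schemaDef, isStreaming = true)
      }

      override def stop(): Unit = ()
    }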

Remove Flume support in 3.0.0?

2018-10-10 Thread Sean Owen
Marcelo makes an argument that Flume support should be removed in 3.0.0 at https://issues.apache.org/jira/browse/SPARK-25598. I tend to agree. Is there an argument that it needs to be supported, and can this move to Bahir if so?

Re: Remove Flume support in 3.0.0?

2018-10-10 Thread Marcelo Vanzin
BTW, although I did not file a bug for that, I think we should also consider getting rid of the kafka-0.8 connector. That would leave only kafka-0.10 as the single remaining dstream connector in Spark, though. (If you ignore Kinesis, which we can't ship in binary form, or something like that?) On

Re: Remove Flume support in 3.0.0?

2018-10-10 Thread Sean Owen
Yup was thinking the same. It is legacy too at this point. On Wed, Oct 10, 2018, 3:19 PM Marcelo Vanzin wrote: > BTW, although I did not file a bug for that, I think we should also > consider getting rid of the kafka-0.8 connector. > > That would leave only kafka-0.10 as the single remaining

Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Sean Owen
+1. I tested the source build against Scala 2.12 and common build profiles. License and sigs look OK. No blockers; one critical: SPARK-25378, "ArrayData.toArray(StringType) assume UTF8String in 2.4". I think this one is "won't fix" though? Not trying to restore the behavior? Other items open for
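For reference, the 2.4-era way to do that Scala 2.12 source build (script and profile names per the 2.4 building docs; verify against the RC checkout):

    ./dev/change-scala-version.sh 2.12
    ./build/mvn -DskipTests -Pscala-2.12 clean package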