Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-16 Thread Justin Miller
I’d add that if folks rely on Twitter libraries in their stack, they might be
stuck on older versions of those libraries for a while, which could require
staying on 2.11 longer than they might otherwise like.

On Friday, November 16, 2018, Marcelo Vanzin wrote:

> Now that the switch to 2.12 by default has been made, it might be good
> to have a serious discussion about dropping 2.11 altogether. Many of
> the main arguments have already been talked about. But I don't
> remember anyone mentioning how easy it would be to break the 2.11
> build now.
>
> For example, the following works fine in 2.12 but breaks in 2.11:
>
> java.util.Arrays.asList("hi").stream().forEach(println)
>
> We had a similar issue when we supported java 1.6 but the builds were
> all on 1.7 by default. Every once in a while something would silently
> break, because PR builds only check the default. And the jenkins
> builds, which are less monitored, would stay broken for a while.
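
[Editor's note: a sketch of a 2.11-compatible form of the snippet above. In
Scala 2.11, function literals are not automatically converted to Java SAM
interfaces such as java.util.function.Consumer, which is why the lambda-style
call compiles on 2.12 but breaks on 2.11; an explicit Consumer instance works
on both.]

```scala
import java.util.function.Consumer

object SamCompat {
  def main(args: Array[String]): Unit = {
    // Works on both Scala 2.11 and 2.12: an explicit SAM instance
    // instead of relying on 2.12's automatic SAM conversion.
    val printer = new Consumer[String] {
      override def accept(s: String): Unit = println(s)
    }
    java.util.Arrays.asList("hi").stream().forEach(printer)
  }
}
```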
>
> On Tue, Nov 6, 2018 at 11:13 AM DB Tsai wrote:
> >
> > We made Scala 2.11 the default Scala version in Spark 2.0. Now the next
> Spark version will be 3.0, so it's a great time to discuss whether we should
> make Scala 2.12 the default Scala version in Spark 3.0.
> >
> > Scala 2.11 is EOL, and it came out 4.5 years ago; as a result, Scala 2.11
> is unlikely to support JDK 11 unless we're willing to sponsor the needed
> work, per discussion in the Scala community:
> https://github.com/scala/scala-dev/issues/559#issuecomment-436160166
> >
> > We have initial support for Scala 2.12 in Spark 2.4. If we decide to make
> Scala 2.12 the default for Spark 3.0 now, we will have ample time to work on
> bugs and issues that we may run into.
> >
> > What do you think?
> >
> > Thanks,
> >
> > DB Tsai  |  Siri Open Source Technologies [not a contribution]  |  
> Apple, Inc
> >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 

Justin Miller
Senior Data Engineer
*GoSpotCheck*
Direct: 720-517-3979 <+17205173979>
Email: jus...@gospotcheck.com



Re: [VOTE] Spark 2.3.1 (RC1)

2018-05-15 Thread Justin Miller
Did SPARK-24067 not make it in? I don’t see it in https://s.apache.org/Q3Uo.

Thanks,
Justin

> On May 15, 2018, at 3:00 PM, Marcelo Vanzin wrote:
> 
> Please vote on releasing the following candidate as Apache Spark version 
> 2.3.1.
> 
> The vote is open until Friday, May 18, at 21:00 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Spark 2.3.1
> [ ] -1 Do not release this package because ...
> 
> To learn more about Apache Spark, please see http://spark.apache.org/
> 
> The tag to be voted on is v2.3.1-rc1 (commit cc93bc95):
> https://github.com/apache/spark/tree/v2.3.1-rc1
> 
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-bin/
> 
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1269/
> 
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-docs/
> 
> The list of bug fixes going into 2.3.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342432
> 
> FAQ
> 
> =
> How can I help test this release?
> =
> 
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running it on this release candidate, then
> reporting any regressions.
> 
> If you're working in PySpark, you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala, you
> can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
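
[Editor's note: for the Java/Scala path, one way this might look in an sbt
build, as a sketch only; the staging URL is the one given in this email, and
the spark-sql module is just an example of a Spark artifact to test against.]

```scala
// build.sbt sketch: resolve the 2.3.1 RC artifacts from the staging repository
resolvers += "Apache Spark RC staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1269/"

// Build and run your tests against the RC version. Remember to clear the
// local artifact cache (e.g. ~/.ivy2/cache) afterwards so later builds
// don't silently pick up this pre-release artifact.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"
```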
> 
> ===
> What should happen to JIRA tickets still targeting 2.3.1?
> ===
> 
> The current list of open tickets targeted at 2.3.1 can be found at:
> https://s.apache.org/Q3Uo
> 
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
> 
> ==
> But my bug isn't fixed?
> ==
> 
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
> 
> 
> -- 
> Marcelo
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 



Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-21 Thread Justin Miller
Ah, gotcha. Thanks for letting me know. We’ve been using the patch in production 
for a couple of weeks now and it’s been working great. If anyone else runs into 
the issue (non-compacted topics have “gaps” in offsets), feel free to have them 
e-mail me and I can try to help them patch their own systems 
until 2.3.1 is available.

Thanks,
Justin

> On Feb 21, 2018, at 10:43 AM, Xiao Li wrote:
> 
> Hi, Justin, 
> 
> Based on my understanding, SPARK-17147 is also not a regression, so Spark 
> 2.3.0 can't include it. We have to wait for the committers who are 
> familiar with Spark Streaming to decide whether we can fix the issue 
> in Spark 2.3.1.
> 
> Since this is open source, feel free to add the patch in your local build.
> 
> Thanks for using Spark!
> 
> Xiao
> 
> 
> 2018-02-21 9:36 GMT-08:00 Ryan Blue:
> No problem if we can't add them, this is experimental anyway so this release 
> should be more about validating the API and the start of our implementation. 
> I just don't think we can recommend that anyone actually use DataSourceV2 
> without these patches.
> 
> On Wed, Feb 21, 2018 at 9:21 AM, Wenchen Fan wrote:
> SPARK-23323 adds a new API; I'm not sure we can still do it at this stage of 
> the release... Besides, users can work around it by calling the Spark output 
> commit coordinator themselves in their data source.
> 
> SPARK-23203 is non-trivial and didn't fix any known bugs, so it's hard to 
> convince other people that it's safe to add it to the release during the RC 
> phase.
> 
> SPARK-23418 depends on the above one.
> 
> Generally they are good to have in Spark 2.3, if they had been merged before 
> the RC. I think this is a lesson we should learn from: we should work on the 
> stuff we want in a release before the RC, instead of after.
> 
> On Thu, Feb 22, 2018 at 1:01 AM, Ryan Blue wrote:
> What does everyone think about getting some of the newer DataSourceV2 
> improvements in? It should be low risk because it is a new code path, and v2 
> isn't very usable without things like support for using the output commit 
> coordinator to deconflict writes.
> 
> The ones I'd like to get in are:
> * Use the output commit coordinator: 
> https://issues.apache.org/jira/browse/SPARK-23323
> * Use immutable trees and the same push-down logic as other read paths: 
> https://issues.apache.org/jira/browse/SPARK-23203
> * Don't allow users to supply schemas when they aren't supported: 
> https://issues.apache.org/jira/browse/SPARK-23418
> 
> I think it would make the 2.3.0 release more usable for anyone interested in 
> the v2 read and write paths.
> 
> Thanks!
> 
> On Tue, Feb 20, 2018 at 7:07 PM, Weichen Xu wrote:
> +1
> 
> On Wed, Feb 21, 2018 at 10:07 AM, Marcelo Vanzin wrote:
> Done, thanks!
> 
> On Tue, Feb 20, 2018 at 6:05 PM, Sameer Agarwal wrote:
> > Sure, please feel free to backport.
> >
> > On 20 February 2018 at 18:02, Marcelo Vanzin wrote:
> >>
> >> Hey Sameer,
> >>
> >> Mind including https://github.com/apache/spark/pull/20643
> >> (SPARK-23468) in the new RC? It's a minor bug since I've only hit it
> >> with older shuffle services, but it's pretty safe.
> >>
> >> On Tue, Feb 20, 2018 at 5:58 PM, Sameer Agarwal
> >> wrote:
> >> > This RC has failed due to
> >> > https://issues.apache.org/jira/browse/SPARK-23470.
> >> > Now that the fix has been merged in 2.3 (thanks Marcelo!), I'll follow
> >> > up
> >> > with an RC5 soon.
> >> >
> >> > On 20 February 2018 at 16:49, Ryan Blue wrote:
> >> >>
> >> >> +1
> >> >>
> >> >> Build & tests look fine, checked signature and checksums for src
> >> >> tarball.
> >> >>
> >> >> On Tue, Feb 20, 2018 at 12:54 PM, Shixiong(Ryan) Zhu
> >> >> wrote:
> >> >>>
> >> >>> I'm -1 because of the UI regression
> >> >>> https://issues.apache.org/jira/browse/SPARK-23470: the All Jobs page
> >> >>> may be too slow and cause "read timeout" when there are lots of jobs
> >> >>> and stages. This is one of the most important pages because when it's
> >> >>> broken, it's pretty hard to use the Spark Web UI.

Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-21 Thread Justin Miller
Greetings,

I would also like to ask if the following ticket could make it into 2.3.0. I’m 
currently testing the code in production, as we were (very occasionally) running 
into non-consecutive offsets on non-compacted topics. I imagine other people 
will encounter similar issues if they’re doing 15+ billion records a day. 

https://github.com/apache/spark/pull/20572 (SPARK-17147)

Thanks,
Justin

> On Feb 21, 2018, at 10:21 AM, kant kodali wrote:
> 
> Hi All,
> 
> +1 for the tickets proposed by Ryan Blue
> 
> Any possible chance of this one,
> https://issues.apache.org/jira/browse/SPARK-23406, getting into 2.3.0? It's 
> a very important feature for us, so if it doesn't make the cut I would have to 
> cherry-pick this commit and compile from source for our production 
> release.
> 
> Thanks!
> 
> >> >>> On Tue, Feb 20, 2018 at 4:37 AM, Marco Gaido
> >> >>> wrote:
> >> 
> >>  +1
> >> 
> >>  2018-02-20 12:30 GMT+01:00 Hyukjin Kwon:
> >> >
> >> > +1 too
> >> >
> >> > 2018-02-20 14:41 GMT+09:00 Takuya UESHIN:
> >> >>
> >> >> +1
> >> >>
> >> >>
> >> >> On Tue, Feb 20, 2018 at 2:14 PM, Xingbo Jiang
> >> >> wrote:
> >> >>>
> >> >>> +1
> >> >>>
> >> >>>
> >> >>> Wenchen Fan wrote on Tuesday, Feb 20, 2018 at 1:09 PM:
> >> 
> >>  +1
> >> 
> >>  On Tue, Feb 20, 2018 at 12:53 PM, Reynold Xin
> >>  wrote:
> 

Re: Spark 3

2018-01-19 Thread Justin Miller
Would that mean supporting both 2.12 and 2.11? Could be a while before some of 
our libraries are off of 2.11.

Thanks,
Justin

> On Jan 19, 2018, at 10:53 AM, Koert Kuipers wrote:
> 
> i was expecting to be able to move to scala 2.12 sometime this year
> 
> if this cannot be done in spark 2.x then that could be a compelling reason to 
> move spark 3 up to 2018 i think
> 
> hadoop 3 sounds great but personally i have no use case for it yet
> 
> On Fri, Jan 19, 2018 at 12:31 PM, Sean Owen wrote:
> Forking this thread to muse about Spark 3. Like Spark 2, I assume it would be 
> more about making all those accumulated breaking changes and updating lots of 
> dependencies. Hadoop 3 looms large in that list as well as Scala 2.12.
> 
> Spark 1 was released in May 2014, and Spark 2 in July 2016. If Spark 2.3 is 
> out in Feb 2018 and it takes the now-usual 6 months until a next release, 
> Spark 3 could reasonably be next.
> 
> However, the release cycles are naturally slowing down, and it could also be 
> said that 2019 would be more on schedule for Spark 3.
> 
> Nothing particularly urgent about deciding, but I'm curious if anyone had an 
> opinion on whether to move on to Spark 3 next or just continue with 2.4 later 
> this year.
> 
> On Fri, Jan 19, 2018 at 11:13 AM Sean Owen wrote:
> Yeah, if users are using Kryo directly, they should be insulated from a 
> Spark-side change because of shading.
> However this also entails updating (unshaded) Chill from 0.8.x to 0.9.x. I am 
> not sure if that causes problems for apps.
> 
> Normally I'd avoid any major-version change in a minor release. This one 
> looked potentially entirely internal.
> I think if there are any doubts, we can leave it for Spark 3. There was a bug 
> report that needed a fix from Kryo 4, but it might be minor after all.
> 
> 



Re: Timeline for Spark 2.3

2017-11-09 Thread Justin Miller
That sounds fine to me. I’m hoping that this ticket can make it into Spark 2.3: 
https://issues.apache.org/jira/browse/SPARK-18016 


It’s causing some pretty considerable problems when we alter the columns to be 
nullable, but we are OK for now without that.

Best,
Justin

> On Nov 9, 2017, at 4:54 PM, Michael Armbrust wrote:
> 
> According to the timeline posted on the website, we are nearing branch cut 
> for Spark 2.3.  I'd like to propose pushing this out towards mid to late 
> December for a couple of reasons and would like to hear what people think.
> 
> 1. I've done release management during the Thanksgiving / Christmas time 
> before and in my experience, we don't actually get a lot of testing during 
> this time due to vacations and other commitments. I think beginning the RC 
> process in early January would give us the best coverage in the shortest 
> amount of time.
> 2. There are several large initiatives in progress that given a little more 
> time would leave us with a much more exciting 2.3 release. Specifically, the 
> work on the history server, Kubernetes and continuous processing.
> 3. Given the actual release date of Spark 2.2, I think we'll still get Spark 
> 2.3 out roughly 6 months after.
> 
> Thoughts?
> 
> Michael