Re: [vote] Apache Spark 3.0 RC3

2020-06-08 Thread Michael Armbrust
+1 (binding) On Mon, Jun 8, 2020 at 1:22 PM DB Tsai wrote: > +1 (binding) > > Sincerely, > > DB Tsai > -- > Web: https://www.dbtsai.com > PGP Key ID: 42E5B25A8F7A82C1 > > On Mon, Jun 8, 2020 at 1:03 PM Dongjoon Hyun > wrote: > > > > +1 >

Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Michael Armbrust
> > What I'd oppose is to just ban char for the native data sources, and do > not have a plan to address this problem systematically. > +1 > Just forget about padding, like what Snowflake and MySQL have done. > Document that char(x) is just an alias for string. And then move on. Almost > no

Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-11 Thread Michael Armbrust
Thank you for the discussion everyone! This vote passes. I'll work to get this posted on the website. +1 Michael Armbrust Sean Owen Jules Damji 大啊 Ismaël Mejía Wenchen Fan Matei Zaharia Gengliang Wang Takeshi Yamamuro Denny Lee Xiao Li Xingbo Jiang Takuya UESHIN Michael Heuer John Zhuge Reynold Xin

Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-06 Thread Michael Armbrust
I'll start off the vote with a strong +1 (binding). On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust wrote: > I propose to add the following text to Spark's Semantic Versioning policy > <https://spark.apache.org/versioning-policy.html> and adopt it as the > rubric that shou

[VOTE] Amend Spark's Semantic Versioning Policy

2020-03-06 Thread Michael Armbrust
I propose to add the following text to Spark's Semantic Versioning policy and adopt it as the rubric that should be used when deciding to break APIs (even at major versions such as 3.0). I'll leave the vote open until Tuesday, March 10th at 2pm.

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

2020-02-27 Thread Michael Armbrust
Thanks for the discussion! A few responses: The decision needs to happen at api/config change time, otherwise the > deprecated warning has no purpose if we are never going to remove them. > Even if we never remove an API, I think deprecation warnings (when done right) can still serve a purpose.

Re: Clarification on the commit protocol

2020-02-27 Thread Michael Armbrust
No, it is not. Although the commit protocol has mostly been superseded by Delta Lake, which is available as a separate open source project that works natively with Apache Spark. In contrast to the commit protocol, Delta can guarantee full ACID (rather than just partition level

[Proposal] Modification to Spark's Semantic Versioning Policy

2020-02-24 Thread Michael Armbrust
Hello Everyone, As more users have started upgrading to Spark 3.0 preview (including myself), there have been many discussions around APIs that have been broken compared with Spark 2.x. In many of these discussions, one of the rationales for breaking an API seems to be "Spark follows semantic

Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

2020-02-21 Thread Michael Armbrust
This plan for evolving the TRIM function to be more standards compliant sounds much better to me than the original change to just switch the order. It pushes users in the right direction and cleans up our tech debt without silently breaking existing workloads. It means that programs won't return

Re: [VOTE] Release Apache Spark 2.4.2

2019-04-19 Thread Michael Armbrust
+1 (binding), we've tested this and it LGTM. On Thu, Apr 18, 2019 at 7:51 PM Wenchen Fan wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.4.2. > > The vote is open until April 23 PST and passes if a majority +1 PMC votes > are cast, with > a minimum of 3 +1

Re: Spark 2.4.2

2019-04-16 Thread Michael Armbrust
and looks safe enough to me. I was just a > little surprised since I was expecting a correctness issue if this is > prompting a release. I'm definitely on the side of case-by-case judgments > on what to allow in patch releases and this looks fine. > > On Tue, Apr 16, 2019 at 4:27

Re: Spark 2.4.2

2019-04-16 Thread Michael Armbrust
by this behavior. Do you have a different proposal about how this should be handled? On Tue, Apr 16, 2019 at 4:23 PM Ryan Blue wrote: > Is this a bug fix? It looks like a new feature to me. > > On Tue, Apr 16, 2019 at 4:13 PM Michael Armbrust > wrote: > >> Hello All, >> >&

Spark 2.4.2

2019-04-16 Thread Michael Armbrust
Hello All, I know we just released Spark 2.4.1, but in light of fixing SPARK-27453 I was wondering if it might make sense to follow up quickly with 2.4.2. Without this fix it's very hard to build a datasource that correctly handles partitioning

Re: Plan on Structured Streaming in next major/minor release?

2018-10-30 Thread Michael Armbrust
> > Agree. Just curious, could you explain what do you mean by "negation"? > Does it mean applying retraction on aggregated? > Yeah exactly. Our current streaming aggregation assumes that the input is in append-mode and multiple aggregations break this.

Re: Plan on Structured Streaming in next major/minor release?

2018-10-30 Thread Michael Armbrust
Thanks for bringing up some possible future directions for streaming. Here are some thoughts: - I personally view all of the activity on Spark SQL also as activity on Structured Streaming. The great thing about building streaming on catalyst / tungsten is that continued improvement to these

Re: Sorting on a streaming dataframe

2018-04-30 Thread Michael Armbrust
t; would be efficient in terms of performance as compared to implementing this > functionality inside the applications. > > Hemant > > On Thu, Apr 26, 2018 at 11:59 PM, Michael Armbrust <mich...@databricks.com > > wrote: > >> The basic tenet of structured streaming is that

Re: Sorting on a streaming dataframe

2018-04-26 Thread Michael Armbrust
The basic tenet of structured streaming is that a query should return the same answer in streaming or batch mode. We support sorting in complete mode because we have all the data and can sort it correctly and return the full answer. In update or append mode, sorting would only return a correct
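The distinction drawn here can be sketched concretely. The following is an illustrative stdlib-only model (not Spark code): in complete mode every trigger re-emits the full result, so a sort over everything seen so far is globally correct, while in append mode each batch is emitted exactly once, so per-batch sorting cannot produce a global order.

```python
def complete_mode_sort(batches):
    # Complete mode: every trigger re-emits the full result, so we
    # can sort the union of all rows seen so far and get the same
    # answer a batch query would produce.
    all_rows = [row for batch in batches for row in batch]
    return sorted(all_rows)

def append_mode_sort(batches):
    # Append mode: each batch is emitted once; sorting within each
    # batch does NOT yield a globally sorted output.
    out = []
    for batch in batches:
        out.extend(sorted(batch))
    return out

batches = [[5, 1], [4, 2], [3, 0]]
print(complete_mode_sort(batches))  # [0, 1, 2, 3, 4, 5]
print(append_mode_sort(batches))    # [1, 5, 2, 4, 0, 3]
```

The second result violates the "same answer in streaming or batch" tenet, which is why sorting is only supported in complete mode.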

Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-26 Thread Michael Armbrust
+1 all our pipelines have been running the RC for several days now. On Mon, Feb 26, 2018 at 10:33 AM, Dongjoon Hyun wrote: > +1 (non-binding). > > Bests, > Dongjoon. > > > > On Mon, Feb 26, 2018 at 9:14 AM, Ryan Blue > wrote: > >> +1

Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-21 Thread Michael Armbrust
I'm -1 on any changes that aren't fixing major regressions from 2.2 at this point. Also, in any cases where it's possible, we should be flipping new features off if they are still regressing, rather than continuing to attempt to fix them. Since it's experimental, I would support backporting the

Re: DataSourceV2: support for named tables

2018-02-02 Thread Michael Armbrust
I am definitely in favor of first-class / consistent support for tables and data sources. One thing that is not clear to me from this proposal is exactly what the interfaces are between: - Spark - A (The?) metastore - A data source If we pass in the table identifier is the data source then

Re: SQL logical plans and DataSourceV2 (was: data source v2 online meetup)

2018-02-02 Thread Michael Armbrust
> > So here are my recommendations for moving forward, with DataSourceV2 as a > starting point: > >1. Use well-defined logical plan nodes for all high-level operations: >insert, create, CTAS, overwrite table, etc. >2. Use rules that match on these high-level plan nodes, so that it >

Re: Max number of streams supported ?

2018-01-31 Thread Michael Armbrust
-dev +user > Similarly for structured streaming, would there be any limit on the number of > streaming sources I can have? > There is no fundamental limit, but each stream will have a thread on the driver that is doing coordination of execution. We comfortably run 20+ streams on a single
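The per-stream driver thread mentioned above can be modeled with a simplified sketch (an assumption-laden illustration, not Spark's actual scheduler): each active query gets its own coordination thread that loops over micro-batches, so the practical limit is driver-side threads and coordination overhead rather than a hard cap.

```python
import threading

def run_stream(name, results, n_batches=3):
    # Stands in for the per-query execution loop on the driver:
    # each "batch" here represents one triggered micro-batch.
    for batch_id in range(n_batches):
        results.append((name, batch_id))

results = []
# 20 concurrent "streams", mirroring the 20+ queries mentioned above.
threads = [threading.Thread(target=run_stream, args=(f"stream-{i}", results))
           for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # 20 streams x 3 batches = 60
```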

Re: Spark error while trying to spark.read.json()

2017-12-19 Thread Michael Armbrust
- dev java.lang.AbstractMethodError almost always means that you have different libraries on the classpath than at compilation time. In this case I would check to make sure you have the correct version of Scala (and only have one version of scala) on the classpath. On Tue, Dec 19, 2017 at 5:42

Re: Timeline for Spark 2.3

2017-12-19 Thread Michael Armbrust
Do people really need to be around for the branch cut (modulo the person cutting the branch)? 1st or 2nd doesn't really matter to me, but I am +1 kicking this off as soon as we enter the new year :) Michael On Tue, Dec 19, 2017 at 4:39 PM, Holden Karau wrote: > Sounds

Re: queryable state & streaming

2017-12-08 Thread Michael Armbrust
https://issues.apache.org/jira/browse/SPARK-16738 I don't believe anyone is working on it yet. I think the most useful thing is to start enumerating requirements and use cases and then we can talk about how to build it. On Fri, Dec 8, 2017 at 10:47 AM, Stavros Kontopoulos <

Timeline for Spark 2.3

2017-11-09 Thread Michael Armbrust
According to the timeline posted on the website, we are nearing branch cut for Spark 2.3. I'd like to propose pushing this out towards mid to late December for a couple of reasons and would like to hear what people think. 1. I've done release management during the Thanksgiving / Christmas time

Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-06 Thread Michael Armbrust
+1 On Sat, Nov 4, 2017 at 11:02 AM, Xiao Li wrote: > +1 > > 2017-11-04 11:00 GMT-07:00 Burak Yavuz : > >> +1 >> >> On Fri, Nov 3, 2017 at 10:02 PM, vaquar khan >> wrote: >> >>> +1 >>> >>> On Fri, Nov 3, 2017 at 8:14 PM, Weichen Xu

Re: Structured Stream equivalent of reduceByKey

2017-10-26 Thread Michael Armbrust
- dev I think you should be able to write an Aggregator. You probably want to run in update mode if you are looking for it to output any group that has changed in the batch. On Wed, Oct 25,
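The Aggregator contract suggested here (initialize, reduce, merge, finish) plus update-mode output can be sketched in plain Python. The class and function names below are illustrative, not the Spark API:

```python
class SumAggregator:
    # Mirrors the shape of an Aggregator: a zero value, a per-row
    # reduce, a buffer merge, and a finishing step.
    def zero(self):
        return 0
    def reduce(self, buf, value):
        return buf + value
    def merge(self, a, b):
        return a + b
    def finish(self, buf):
        return buf

def update_mode_batch(state, batch, agg):
    # Update mode: output only the groups whose value changed in
    # this batch, keeping running buffers in `state`.
    changed = {}
    for key, value in batch:
        buf = state.get(key, agg.zero())
        state[key] = agg.reduce(buf, value)
        changed[key] = agg.finish(state[key])
    return changed

state = {}
print(update_mode_batch(state, [("a", 1), ("b", 2), ("a", 3)], SumAggregator()))
# -> {'a': 4, 'b': 2}
```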

Re: Easy way to get offset metatada with Spark Streaming API

2017-09-14 Thread Michael Armbrust
orage (in the same transaction with the > data) and initialize the custom sink with right batch id when application > re-starts. After this just ignore batch if current batchId <= > latestBatchId. > > Dmitry > > > 2017-09-13 22:12 GMT+03:00 Michael Armbrust <mich...@dat
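The restart protocol described in this exchange — store the batch id transactionally with the data, then on restart ignore any batch whose id is at or below the latest committed one — can be sketched as follows (illustrative only, not the actual Sink API):

```python
class IdempotentSink:
    def __init__(self):
        self.latest_batch_id = -1  # recovered from storage on restart
        self.rows = []

    def add_batch(self, batch_id, data):
        # Replayed batch after a restart: already committed, ignore.
        if batch_id <= self.latest_batch_id:
            return False
        # In a real sink, data and batch_id would be written in the
        # same transaction so they can never diverge.
        self.rows.extend(data)
        self.latest_batch_id = batch_id
        return True

sink = IdempotentSink()
sink.add_batch(0, ["x"])
sink.add_batch(1, ["y"])
sink.add_batch(1, ["y"])  # replay after restart: skipped
print(sink.rows)  # ['x', 'y']
```

This is what makes re-delivery after a failure safe: the sink becomes idempotent with respect to batch ids.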

Re: Easy way to get offset metatada with Spark Streaming API

2017-09-13 Thread Michael Armbrust
ing with my own re-try logic (which is > basically, just ignore intermediate data, re-read from Kafka and re-try > processing and load)? > > Dmitry > > > 2017-09-12 22:43 GMT+03:00 Michael Armbrust <mich...@databricks.com>: > >> In the checkpoint directory t

Re: Easy way to get offset metatada with Spark Streaming API

2017-09-12 Thread Michael Armbrust
ore detail, please? Is there some kind of > offset manager API that works as get-offset by batch id lookup table? > > Dmitry > > 2017-09-12 20:29 GMT+03:00 Michael Armbrust <mich...@databricks.com>: > >> I think that we are going to have to change the Sink API as par

Re: Easy way to get offset metatada with Spark Streaming API

2017-09-12 Thread Michael Armbrust
I think that we are going to have to change the Sink API as part of SPARK-20928, which is why I linked these tickets together. I'm still targeting an initial version for Spark 2.3 which should happen sometime towards the end of the year.

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Michael Armbrust
+1 On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue wrote: > +1 (non-binding) > > Thanks for making the updates reflected in the current PR. It would be > great to see the doc updated before it is finally published though. > > Right now it feels like this SPIP is focused

Re: Increase Timeout or optimize Spark UT?

2017-08-23 Thread Michael Armbrust
I think we already set the number of partitions to 5 in tests? On Tue, Aug 22, 2017 at 3:25 PM, Maciej Szymkiewicz

Re: [SS] watermark, eventTime and "StreamExecution: Streaming query made progress"

2017-08-11 Thread Michael Armbrust
The point here is to tell you what watermark value was used when executing this batch. You don't know the new watermark until the batch is over and we don't want to do two passes over the data. In general the semantics of the watermark are designed to be conservative (i.e. just because data is
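The one-batch lag described above can be modeled with a small sketch (a simplified illustration of the semantics, not Spark's implementation): the watermark applied to a batch is derived from event times seen in *previous* batches (max event time minus the allowed delay), so it only advances once a batch completes.

```python
def run_batches(batches, delay):
    watermark = 0   # watermark applied while executing the current batch
    used = []
    for batch in batches:
        used.append(watermark)            # value in effect for this batch
        max_event_time = max(batch)
        # Conservative update: never move the watermark backwards.
        watermark = max(watermark, max_event_time - delay)
    return used

# Each batch runs under a watermark computed from the batches before it.
print(run_batches([[10, 12], [11, 20], [25]], delay=5))  # [0, 7, 15]
```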

Re: Thoughts on release cadence?

2017-07-31 Thread Michael Armbrust
+1, should we update https://spark.apache.org/versioning-policy.html ? On Sun, Jul 30, 2017 at 3:34 PM, Reynold Xin wrote: > This is reasonable ... +1 > > > On Sun, Jul 30, 2017 at 2:19 AM, Sean Owen wrote: > >> The project had traditionally posted some

[ANNOUNCE] Announcing Apache Spark 2.2.0

2017-07-11 Thread Michael Armbrust
Hi all, Apache Spark 2.2.0 is the third release of the Spark 2.x line. This release removes the experimental tag from Structured Streaming. In addition, this release focuses on usability, stability, and polish, resolving over 1100 tickets. We'd like to thank our contributors and users for their

Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-07 Thread Michael Armbrust
This vote passes! I'll followup with the release on Monday. +1: Michael Armbrust (binding) Kazuaki Ishizaki Sean Owen (binding) Joseph Bradley (binding) Ricardo Almeida Herman van Hövell tot Westerflier (binding) Yanbo Liang Nick Pentreath (binding) Wenchen Fan (binding) Sameer Agarwal Denny Lee

Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-06-30 Thread Michael Armbrust
I'll kick off the vote with a +1. On Fri, Jun 30, 2017 at 6:44 PM, Michael Armbrust <mich...@databricks.com> wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.2.0. The vote is open until Friday, July 7th, 2017 at 18:00 PST and > passe

[VOTE] Apache Spark 2.2.0 (RC6)

2017-06-30 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version 2.2.0. The vote is open until Friday, July 7th, 2017 at 18:00 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.2.0 [ ] -1 Do not release this package because ...

Re: [VOTE] Apache Spark 2.2.0 (RC5)

2017-06-26 Thread Michael Armbrust
> >> +1 (binding) >> >> On Wed, 21 Jun 2017 at 01:49 Michael Armbrust <mich...@databricks.com> >> wrote: >> >>> I will kick off the voting with a +1. >>> >>> On Tue, Jun 20, 2017 at 4:49 PM, Michael Armbrust < >>> mic

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-21 Thread Michael Armbrust
particularly notice not having to spend a solid 2-3 weeks of time QAing >>>>>>>> (unlike in earlier Spark releases). One other point not mentioned >>>>>>>> above: I >>>>>>>> think they serve as a very helpful reminder/training for the community >>>

Re: [VOTE] Apache Spark 2.2.0 (RC5)

2017-06-20 Thread Michael Armbrust
I will kick off the voting with a +1. On Tue, Jun 20, 2017 at 4:49 PM, Michael Armbrust <mich...@databricks.com> wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.2.0. The vote is open until Friday, June 23rd, 2017 at 18:00 PST and > passe

[VOTE] Apache Spark 2.2.0 (RC5)

2017-06-20 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version 2.2.0. The vote is open until Friday, June 23rd, 2017 at 18:00 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.2.0 [ ] -1 Do not release this package because

Re: cannot call explain or show on dataframe in structured streaming addBatch dataframe

2017-06-19 Thread Michael Armbrust
There is a little bit of weirdness to how we override the default query planner to replace it with an incrementalizing planner. As such, calling any operation that changes the query plan (such as a LIMIT) would cause it to revert to the batch planner and return the wrong answer. We should fix

Re: the scheme in stream reader

2017-06-19 Thread Michael Armbrust
The socket source can't know how to parse your data. I think the right thing would be for it to throw an exception saying that you can't set the schema here. Would you mind opening a JIRA ticket? If you are trying to parse data from something like JSON then you should use `from_json` on the
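The point is that parsing is a separate projection over the raw strings the socket source delivers. A minimal stdlib sketch of what a `from_json`-style step does with a declared schema (illustrative only; in Spark this is the `from_json` SQL function applied to a string column):

```python
import json

# Hypothetical schema for illustration: field name -> expected type.
schema = {"name": str, "age": int}

def parse_json_column(raw):
    # Malformed rows become None (null) rather than failing the query.
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not all(isinstance(obj.get(k), t) for k, t in schema.items()):
        return None
    return {k: obj[k] for k in schema}

lines = ['{"name": "a", "age": 3}', 'not json']
print([parse_json_column(line) for line in lines])
# -> [{'name': 'a', 'age': 3}, None]
```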

Re: Nested "struct" fonction call creates a compilation error in Spark SQL

2017-06-15 Thread Michael Armbrust
> you think ? > > Regards, > > Olivier. > > > 2017-06-15 21:08 GMT+02:00 Michael Armbrust <mich...@databricks.com>: > >> Which version of Spark? If its recent I'd open a JIRA. >> >> On Thu, Jun 15, 2017 at 6:04 AM, Olivier Girardot < >>

Re: Nested "struct" fonction call creates a compilation error in Spark SQL

2017-06-15 Thread Michael Armbrust
Which version of Spark? If its recent I'd open a JIRA. On Thu, Jun 15, 2017 at 6:04 AM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > Hi everyone, > when we create recursive calls to "struct" (up to 5 levels) for extending > a complex datastructure we end up with the following

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-14 Thread Michael Armbrust
for >>>>>>> rigor in development. Since we instituted QA JIRAs, contributors have >>>>>>> been >>>>>>> a lot better about adding in docs early, rather than waiting until the >>>>>>> end >>>>>>

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-05 Thread Michael Armbrust
s, in general', > then I think they're superfluous at best. These aren't used consistently, > and their intent isn't actionable (i.e. it sounds like no particular > testing resolves the JIRA). They signal something that doesn't seem to > match the intent. > > Can we close the QA

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-05 Thread Michael Armbrust
2.2 needs to > block the release; Joseph what's the status on those? > > On Mon, Jun 5, 2017 at 8:15 PM Michael Armbrust <mich...@databricks.com> > wrote: > >> Please vote on releasing the following candidate as Apache Spark version >> 2.2.0. The vote is open unti

[VOTE] Apache Spark 2.2.0 (RC4)

2017-06-05 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version 2.2.0. The vote is open until Thurs, June 8th, 2017 at 12:00 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.2.0 [ ] -1 Do not release this package because ...

Re: [VOTE] Apache Spark 2.2.0 (RC3)

2017-06-02 Thread Michael Armbrust
re, and should NEVER backport > non-bug-fix commits to an RC branch. Sorry again for the trouble! > > On Fri, Jun 2, 2017 at 2:40 PM, Michael Armbrust <mich...@databricks.com> > wrote: > >> Please vote on releasing the following candidate as Apache Spark version

[VOTE] Apache Spark 2.2.0 (RC3)

2017-06-02 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version 2.2.0. The vote is open until Tues, June 6th, 2017 at 12:00 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.2.0 [ ] -1 Do not release this package because ...

Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-06-02 Thread Michael Armbrust
ussion on TIMESTAMP semantics going on the thread "SQL >> TIMESTAMP semantics vs. SPARK-18350" which might impact Spark 2.2. Should >> we make a decision there before voting on the next RC for Spark 2.2? >> >> Thanks, >> Kostas >> >> On Tue, May

Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-30 Thread Michael Armbrust
> > Michael, > > If you haven't started cutting the new RC, I'm working on a documentation > PR right now I'm hoping we can get into Spark 2.2 as a migration note, even > if it's just a mention: https://issues.apache.org/jira/browse/SPARK-20888. > > Michael > >

Re: Running into the same problem as JIRA SPARK-19268

2017-05-24 Thread Michael Armbrust
-dev Have you tried clearing out the checkpoint directory? Can you also give the full stack trace? On Wed, May 24, 2017 at 3:45 PM, kant kodali wrote: > Even if I do simple count aggregation like below I get the same error as >

Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-22 Thread Michael Armbrust
t;>> We still have open ML/Graph/SparkR JIRAs targeted at 2.2, but they are >>> essentially all for documentation. >>> >>> Joseph >>> >>> On Thu, May 11, 2017 at 3:08 PM, Marcelo Vanzin <van...@cloudera.com> >>> wrote: >>> >

[VOTE] Apache Spark 2.2.0 (RC2)

2017-05-04 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version 2.2.0. The vote is open until Tues, May 9th, 2017 at 12:00 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.2.0 [ ] -1 Do not release this package because ...

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-03 Thread Michael Armbrust
h Sean. Spark only pulls in parquet-avro for tests. For >>>>>>> execution, it implements the record materialization APIs in Parquet to >>>>>>> go >>>>>>> directly to Spark SQL rows. This doesn't actually leak an Avro 1.8

Re: [ANNOUNCE] Apache Spark 2.1.1

2017-05-03 Thread Michael Armbrust
fir.ma...@equalum.io > > On Wed, May 3, 2017 at 1:18 AM, Michael Armbrust <mich...@databricks.com> > wrote: > >> We are happy to announce the availability of Spark 2.1.1! >> >> Apache Spark 2.1.1 is a maintenance release, based on the branch-2.1 >> main

[ANNOUNCE] Apache Spark 2.1.1

2017-05-02 Thread Michael Armbrust
We are happy to announce the availability of Spark 2.1.1! Apache Spark 2.1.1 is a maintenance release, based on the branch-2.1 maintenance branch of Spark. We strongly recommend that all 2.1.x users upgrade to this stable release. To download Apache Spark 2.1.1 visit

Re: Spark 2.2.0 or Spark 2.3.0?

2017-05-02 Thread Michael Armbrust
An RC for 2.2.0 was released last week. Please test. Note that update mode has been supported since 2.0. On Mon, May 1, 2017 at 10:43 PM, kant kodali wrote: > Hi All, > > If I understand the Spark

Re: [KafkaSourceProvider] Why topic option and column without reverting to path as the least priority?

2017-05-01 Thread Michael Armbrust
He's just suggesting that since the DataStreamWriter start() method can fill in an option named "path", we should make that a synonym for "topic". Then you could do something like. df.writeStream.format("kafka").start("topic") Seems reasonable if people don't think that is confusing. On Mon,
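The option resolution being proposed can be sketched with a hypothetical helper (not actual Spark code): an explicit "topic" option wins, and "path" — which `start(path)` fills in — serves as the lowest-priority fallback.

```python
def resolve_topic(options):
    # "topic" takes precedence; "path" (set by start(path)) is the
    # last-resort synonym suggested in the thread.
    topic = options.get("topic") or options.get("path")
    if topic is None:
        raise ValueError("no topic specified")
    return topic

print(resolve_topic({"path": "events"}))            # 'events'
print(resolve_topic({"topic": "t1", "path": "p"}))  # 't1'
```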

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-27 Thread Michael Armbrust
t we normally cut an RC after those things are ready? > > On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust <mich...@databricks.com> > wrote: > >> Please vote on releasing the following candidate as Apache Spark version >> 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00

Re: [VOTE] Apache Spark 2.1.1 (RC4)

2017-04-27 Thread Michael Armbrust
I'll also +1 On Thu, Apr 27, 2017 at 4:20 AM, Sean Owen <so...@cloudera.com> wrote: > +1 , same result as with the last RC. All checks out for me. > > On Thu, Apr 27, 2017 at 1:29 AM Michael Armbrust <mich...@databricks.com> > wrote: > >> Please vote on

[VOTE] Apache Spark 2.2.0 (RC1)

2017-04-27 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.2.0 [ ] -1 Do not release this package because ...

[VOTE] Apache Spark 2.1.1 (RC4)

2017-04-26 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version 2.1.1. The vote is open until Sat, April 29th, 2017 at 18:00 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.1.1 [ ] -1 Do not release this package because ...

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-24 Thread Michael Armbrust
>>>>>> IIRC, the new "spark.sql.hive.caseSensitiveInferenceMode" stuff will >>>>>> only scan all table files only once, and write back the inferred schema >>>>>> to >>>>>> metastore so that we don't need to do the

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-21 Thread Michael Armbrust
e actively investigating to find the > root cause of this problem, and specifically whether this is a problem in > the Spark codebase or not. I will report back when I have an answer to that > question. > > Michael > > > On Apr 18, 2017, at 11:59 AM, Michael Armbrust <mich...@databrick

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-18 Thread Michael Armbrust
In case it wasn't obvious by the appearance of RC3, this vote failed. On Thu, Mar 30, 2017 at 4:09 PM, Michael Armbrust <mich...@databricks.com> wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.1.1. The vote is open until Sun, April 2nd, 2017

[VOTE] Apache Spark 2.1.1 (RC3)

2017-04-18 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version 2.1.1. The vote is open until Fri, April 21st, 2017 at 13:00 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.1.1 [ ] -1 Do not release this package because ...

branch-2.2 has been cut

2017-04-18 Thread Michael Armbrust
I just cut the release branch for Spark 2.2. If you are merging important bug fixes, please backport as appropriate. If you have doubts if something should be backported, please ping me. I'll follow with an RC later this week.

Re: 2.2 branch

2017-04-17 Thread Michael Armbrust
I'm going to cut branch-2.2 tomorrow morning. On Thu, Apr 13, 2017 at 11:02 AM, Michael Armbrust <mich...@databricks.com> wrote: > Yeah, I was delaying until 2.1.1 was out and some of the hive questions > were resolved. I'll make progress on that by the end of the week. Lets &

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-14 Thread Michael Armbrust
the Jenkins cluster is a bit on the older side). >> >>>> >> >>>> On Tue, Apr 4, 2017 at 3:06 PM, Holden Karau <hol...@pigscanfly.ca> >> >>>> wrote: >> >>>>> >> >>>>> So the fix is installing pandoc on

Re: 2.2 branch

2017-04-13 Thread Michael Armbrust
Yeah, I was delaying until 2.1.1 was out and some of the hive questions were resolved. I'll make progress on that by the end of the week. Lets aim for 2.2 branch cut next week. On Thu, Apr 13, 2017 at 8:56 AM, Koert Kuipers wrote: > i see there is no 2.2 branch yet for

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-04 Thread Michael Armbrust
l...@pigscanfly.ca> > *Sent:* Friday, March 31, 2017 6:25:20 PM > *To:* Xiao Li > *Cc:* Michael Armbrust; dev@spark.apache.org > *Subject:* Re: [VOTE] Apache Spark 2.1.1 (RC2) > > -1 (non-binding) > > Python packaging doesn't seem to have quite worked out (looking > at

[VOTE] Apache Spark 2.1.1 (RC2)

2017-03-30 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version 2.1.1. The vote is open until Sun, April 2nd, 2017 at 16:30 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.1.1 [ ] -1 Do not release this package because ...

Re: Outstanding Spark 2.1.1 issues

2017-03-28 Thread Michael Armbrust
Thanks, > Asher Krim > Senior Software Engineer > > On Wed, Mar 22, 2017 at 7:44 PM, Michael Armbrust <mich...@databricks.com> > wrote: > >> An update: I cut the tag for RC1 last night. Currently fighting with the >> release process. Will post RC1 once I ge

Re: Outstanding Spark 2.1.1 issues

2017-03-22 Thread Michael Armbrust
seem like regression from 2.1 so we should be good to start the RC >> process. >> >> On Tue, Mar 21, 2017 at 1:41 PM, Michael Armbrust <mich...@databricks.com >> > wrote: >> >> Please speak up if I'm wrong, but none of these seem like critical >> regressi

Re: Outstanding Spark 2.1.1 issues

2017-03-21 Thread Michael Armbrust
Please speak up if I'm wrong, but none of these seem like critical regressions from 2.1. As such I'll start the RC process later today. On Mon, Mar 20, 2017 at 9:52 PM, Holden Karau wrote: > I'm not super sure it should be a blocker for 2.1.1 -- is it a regression? >

Spark 2.2 Code-freeze - 3/20

2017-03-15 Thread Michael Armbrust
Hey Everyone, Just a quick announcement that I'm planning to cut the branch for Spark 2.2 this coming Monday (3/20). Please try and get things merged before then and also please begin retargeting of any issues that you don't think will make the release. Michael

Re: Should we consider a Spark 2.1.1 release?

2017-03-15 Thread Michael Armbrust
Hey Holden, Thanks for bringing this up! I think we usually cut patch releases when there are enough fixes to justify it. Sometimes just a few weeks after the release. I guess if we are at 3 months Spark 2.1.0 was a pretty good release :) That said, it is probably time. I was about to start

Re: Structured Streaming Spark Summit Demo - Databricks people

2017-02-16 Thread Michael Armbrust
Thanks for your interest in Apache Spark Structured Streaming! There is nothing secret in that demo, though I did make some configuration changes in order to get the timing right (gotta have some dramatic effect :) ). Also I think the visualizations based on metrics output by the

Re: benefits of code gen

2017-02-10 Thread Michael Armbrust
Function1 is specialized, but nullSafeEval is Any => Any, so that's still going to box in the non-codegened execution path. On Fri, Feb 10, 2017 at 1:32 PM, Koert Kuipers wrote: > based on that i take it that math functions would be primary beneficiaries > since they work on

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
leReferenceSink",tableName2) > .option("checkpointLocation","checkpoint") > .start() > > > On Tue, Feb 7, 2017 at 7:24 PM, Michael Armbrust <mich...@databricks.com> > wrote: > >> Read the JSON log of files that is in `/your/path/_spark_me

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
not the case then how would I go about ensuring no duplicates? > > > Thanks again for the awesome support! > > Regards > Sam > On Tue, 7 Feb 2017 at 18:05, Michael Armbrust <mich...@databricks.com> > wrote: > >> Sorry, I think I was a little unclear. There a

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
d not be any jobs left because I can see in the log > that its now polling for new changes, the latest offset is the right one > > After I kill it and relaunch it picks up that same file? > > > Sorry if I misunderstood you > > On Tue, Feb 7, 2017 at 5:20 PM, Michael Armbrust <m

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
2017 at 4:58 PM, Sam Elamin <hussam.ela...@gmail.com> > wrote: > >> Thanks Micheal! >> >> >> >> On Tue, Feb 7, 2017 at 4:49 PM, Michael Armbrust <mich...@databricks.com> >> wrote: >> >>> Here a JIRA: https://issues.apache.org/jira

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
Here is a JIRA: https://issues.apache.org/jira/browse/SPARK-19497 We should add this soon. On Tue, Feb 7, 2017 at 8:35 AM, Sam Elamin wrote: > Hi All > > When trying to read a stream off S3 and I try and drop duplicates I get > the following error: > > Exception in thread

Re: specifing schema on dataframe

2017-02-05 Thread Michael Armbrust
-dev You can use withColumn to change the type after the data has been loaded. On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin

Re: [SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

2017-01-18 Thread Michael Armbrust
+1, we should just fix the error to explain why months aren't allowed and suggest that you manually specify some number of days. On Wed, Jan 18, 2017 at 9:52 AM, Maciej Szymkiewicz wrote: > Thanks for the response Burak, > > As any sane person I try to steer away from

Re: StateStoreSaveExec / StateStoreRestoreExec

2017-01-03 Thread Michael Armbrust
You might also be interested in this: https://issues.apache.org/jira/browse/SPARK-19031 On Tue, Jan 3, 2017 at 3:36 PM, Michael Armbrust <mich...@databricks.com> wrote: > I think we should add something similar to mapWithState in 2.2. It would > be great if you could add the descrip

Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

2016-12-27 Thread Michael Armbrust
An encoder uses reflection to generate expressions that can extract data out of an object (by calling methods on the object) and encode its contents directly into the
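A toy model of that idea (an assumption-laden sketch, not Catalyst): use reflection over an object's fields to call its accessors and pull values directly into a flat row, instead of serializing the object opaquely.

```python
from dataclasses import dataclass, fields

@dataclass
class Person:
    name: str
    age: int

def encode(obj):
    # "Reflection": discover the fields and invoke the accessors,
    # encoding the contents directly into a row (here, a tuple).
    return tuple(getattr(obj, f.name) for f in fields(obj))

print(encode(Person("ann", 41)))  # ('ann', 41)
```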

Re: ability to provide custom serializers

2016-12-05 Thread Michael Armbrust
Lets start with a new ticket, link them and we can merge if the solution ends up working out for both cases. On Sun, Dec 4, 2016 at 5:39 PM, Erik LaBianca <erik.labia...@gmail.com> wrote: > Thanks Michael! > > On Dec 2, 2016, at 7:29 PM, Michael Armbrust <mich...@databricks.c

Re: ability to provide custom serializers

2016-12-02 Thread Michael Armbrust
I would love to see something like this. The closest related ticket is probably https://issues.apache.org/jira/browse/SPARK-7768 (though maybe there are enough people using UDTs in their current form that we should just make a new ticket) A few thoughts: - even if you can do implicit search, we

Re: Flink event session window in Spark

2016-12-02 Thread Michael Armbrust
Here is the JIRA for adding this feature: https://issues.apache.org/jira/browse/SPARK-10816 On Fri, Dec 2, 2016 at 11:20 AM, Fritz Budiyanto wrote: > Hi All, > > I need help on how to implement Flink event session window in Spark. Is > this possible? > > For instance, I

Re: Analyzing and reusing cached Datasets

2016-11-19 Thread Michael Armbrust
You are hitting a weird optimization in withColumn. Specifically, to avoid building up huge trees with chained calls to this method, we collapse projections eagerly (instead of waiting for the optimizer). Typically we look for cached data in between analysis and optimization, so that

Re: Multiple streaming aggregations in structured streaming

2016-11-18 Thread Michael Armbrust
Doing this generally is pretty hard. We will likely support algebraic aggregate eventually, but this is not currently slotted for 2.2. Instead I think we will add something like mapWithState that lets users compute arbitrary stateful things. What is your use case? On Wed, Nov 16, 2016 at 6:58

Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-16 Thread Michael Armbrust
On Wed, Nov 16, 2016 at 2:49 AM, Hyukjin Kwon wrote: > Maybe it sounds like you are looking for from_json/to_json functions after > en/decoding properly. > Which are new built-in functions that will be released with Spark 2.1.
