Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-22 Thread Michael Armbrust
Thanks for the feedback. As you stated, UDTs are explicitly not a public API, as we knew we were going to be making breaking changes to them. We hope to stabilize / open them up in future releases. Regarding the Hive issue, have you tried using TestHive instead? This is what we use for testing

Re: DataFrame#rdd doesn't respect DataFrame#cache, slowing down CrossValidator

2015-07-28 Thread Michael Armbrust
...@databricks.com wrote: Thanks for bringing this up! I talked with Michael Armbrust, and it sounds like this is from a bug in DataFrame caching: https://issues.apache.org/jira/browse/SPARK-9141 It's marked as a blocker for 1.5. Joseph On Tue, Jul 28, 2015 at 2:36 AM, Justin Uang justin.u

Re: unit test failure for hive query

2015-07-29 Thread Michael Armbrust
I'd suggest using org.apache.spark.sql.hive.test.TestHive as the context in unit tests. It takes care of creating separate directories for each invocation automatically. On Wed, Jul 29, 2015 at 7:02 PM, JaeSung Jun jaes...@gmail.com wrote: Hi, I'm working on custom sql processing on top of

Re: [ANNOUNCE] Spark branch-1.5

2015-08-03 Thread Michael Armbrust
Would it be reasonable to start un-targeting non-bug non-blocker issues? like, would anyone yell if I started doing that? that would leave ~100 JIRAs, which still seems like more than can actually go into the release. And anyone can re-target as desired. I think the maintainers of the

Re: Data source aliasing

2015-07-30 Thread Michael Armbrust
+1 On Thu, Jul 30, 2015 at 11:18 AM, Patrick Wendell pwend...@gmail.com wrote: Yeah this could make sense - allowing data sources to register a short name. What mechanism did you have in mind? To use the jar service loader? The only issue is that there could be conflicts since many of these

Re: Hive Table with large number of partitions

2015-07-17 Thread Michael Armbrust
https://github.com/apache/spark/pull/7421 On Fri, Jul 17, 2015 at 3:26 AM, Xiaoyu Ma hzmaxia...@corp.netease.com wrote: Hi guys, I saw when Hive Table object created it tries to load all existing partitions. @transient val hiveQlPartitions: Seq[Partition] = table.getAllPartitions.map { p

Re: Support for views/ virtual tables in SparkSQL

2015-11-10 Thread Michael Armbrust
We do support hive style views, though all tables have to be visible to Hive. You can also turn on the experimental native view support (but it does not canonicalize the query). set spark.sql.nativeView = true On Mon, Nov 9, 2015 at 10:24 PM, Zhan Zhang wrote: > I

Spark 1.6 Release Schedule

2015-10-31 Thread Michael Armbrust
Hey All, Just a friendly reminder that today (October 31st) is the scheduled code freeze for Spark 1.6. Since a lot of developers were busy with the Spark Summit last week I'm going to delay cutting the branch until Monday, November 2nd. After that point, we'll package a release for testing and

Re: Spark 1.6 Release Schedule

2015-11-05 Thread Michael Armbrust
before an RC), non-Blocker non-bugs untargeted, or in a few cases pushed to 1.6.1 or beyond. 4. After next week, non-Blocker and non-Critical bugs are pushed, as the RC is then late. 5. No release candidate until no Blockers are open. 6. (Repeat 1 and 2 more reg

Re: [BUILD SYSTEM] quick jenkins downtime, november 5th 7am

2015-11-06 Thread Michael Armbrust
I'm noticing several problems with Jenkins since the upgrade. PR comments say: "Build started sha1 is merged." instead of actually printing the hash Also: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45246/console GitHub pull request #9527 of commit

Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-06 Thread Michael Armbrust
+1 On Fri, Nov 6, 2015 at 9:27 AM, Chester Chen wrote: > +1 > Test against CDH5.4.2 with hadoop 2.6.0 version using yesterday's code, > build locally. > > Regression running in Yarn Cluster mode against few internal ML ( logistic > regression, linear regression, random

Re: SparkSQL: First query execution is always slower than subsequent queries

2015-10-07 Thread Michael Armbrust
-dev +user 1). Is that the reason why it's always slow in the first run? Or are there > any other reasons? Apparently it loads data to memory every time so it > shouldn't be something to do with disk read should it? > You are probably seeing the effect of the JVMs JIT. The first run is

Re: Spark SQL: what does an exclamation mark mean in the plan?

2015-10-19 Thread Michael Armbrust
It means that there is an invalid attribute reference (i.e. a #n where the attribute is missing from the child operator). On Sun, Oct 18, 2015 at 11:38 PM, Xiao Li wrote: > Hi, all, > > After turning on the trace, I saw a strange exclamation mark in > the intermediate
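
The marker can be illustrated with a toy model. This is a hypothetical sketch, not Catalyst's actual classes: attributes carry an id (the #n seen in plan output), and an operator prints with a leading "!" when it references an id its child does not produce.

```scala
// Hypothetical toy model, not Catalyst's actual classes: attributes carry
// an id (the #n in plan output), and an operator is printed with a leading
// "!" when it references an id its child does not produce.
case class Attribute(name: String, id: Int)

case class Project(projectList: Seq[Attribute], childOutput: Seq[Attribute]) {
  // References that cannot be resolved against the child's output.
  def missingReferences: Seq[Attribute] =
    projectList.filterNot(a => childOutput.exists(_.id == a.id))
  override def toString: String =
    (if (missingReferences.isEmpty) "" else "!") +
      "Project [" + projectList.map(a => s"${a.name}#${a.id}").mkString(", ") + "]"
}

val child = Seq(Attribute("a", 1), Attribute("b", 2))
val ok  = Project(Seq(Attribute("a", 1)), child)
val bad = Project(Seq(Attribute("c", 3)), child) // c#3 missing from the child
```

Printing `ok` gives "Project [a#1]" while `bad` gives "!Project [c#3]", mirroring the exclamation mark described above.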

Re: Should enforce the uniqueness of field name in DataFrame ?

2015-10-15 Thread Michael Armbrust
> > In hive, the ambiguous name can be resolved by using the table name as > prefix, but seems DataFrame don't support it ( I mean DataFrame API rather > than SparkSQL) You can do the same using pure DataFrames. Seq((1,2)).toDF("a", "b").registerTempTable("y") Seq((1,4)).toDF("a",

Re: spark hive branch location

2015-10-05 Thread Michael Armbrust
I think this is the most up to date branch (used in Spark 1.5): https://github.com/pwendell/hive/tree/release-1.2.1-spark On Mon, Oct 5, 2015 at 1:03 PM, weoccc wrote: > Hi, > > I would like to know where is the spark hive github location where spark > build depend on ? I was

Re: Dataframes: PrunedFilteredScan without Spark Side Filtering

2015-10-07 Thread Michael Armbrust
That sounds fine to me, we already do the filtering so populating that field would be pretty simple. On Sun, Sep 27, 2015 at 2:08 PM Michael Armbrust <mich...@databricks.com> wrote: We have to try and maintain binary compatibility

Re: Scala 2.11 builds broken/ Can the PR build run also 2.11?

2015-10-09 Thread Michael Armbrust
> > How about just fixing the warning? I get it; it doesn't stop this from > happening again, but still seems less drastic than tossing out the > whole mechanism. > +1 It also does not seem that expensive to test only compilation for Scala 2.11 on PR builds.

Re: What steps to take to work on [Spark-8899] issue?

2015-07-08 Thread Michael Armbrust
There is a lot of info here: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark In this particular case I'd start by looking at the JIRA (which already has a pull request posted to it). On Wed, Jul 8, 2015 at 11:40 AM, Chandrashekhar Kotekar shekhar.kote...@gmail.com wrote:

Re: [VOTE] Release Apache Spark 1.4.1 (RC4)

2015-07-09 Thread Michael Armbrust
+1 On Thu, Jul 9, 2015 at 10:07 AM, Mark Hamstra m...@clearstorydata.com wrote: +1 On Wed, Jul 8, 2015 at 10:55 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.4.1! This release fixes a handful of known issues in

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-03 Thread Michael Armbrust
+1 Ran TPC-DS and ported several jobs over to 1.5 On Thu, Sep 3, 2015 at 9:57 AM, Burak Yavuz wrote: > +1. Tested complex R package support (Scala + R code), BLAS and DataFrame > fixes good. > > Burak > > On Thu, Sep 3, 2015 at 8:56 AM, mkhaitman >

Re: Fast Iteration while developing

2015-09-08 Thread Michael Armbrust
+1 to Reynold's suggestion. This is probably the fastest way to iterate. Another option for more ad-hoc debugging is `sbt/sbt sparkShell` which is similar to bin/spark-shell but doesn't require you to rebuild the assembly jar. On Mon, Sep 7, 2015 at 9:03 PM, Reynold Xin

Re: Paring down / tagging tests (or some other way to avoid timeouts)?

2015-08-25 Thread Michael Armbrust
I'd be okay skipping the HiveCompatibilitySuite for core-only changes. They do often catch bugs in changes to catalyst or sql though. Same for HashJoinCompatibilitySuite/VersionsSuite. HiveSparkSubmitSuite/CliSuite should probably stay, as they do test things like addJar that have been broken by

Re: Spark builds: allow user override of project version at buildtime

2015-08-25 Thread Michael Armbrust
This isn't really answering the question, but for what it is worth, I manage several different branches of Spark and publish custom named versions regularly to an internal repository, and this is *much* easier with SBT than with maven. You can actually link the Spark SBT build into an external

Re: [SparkSQL]Could not alter table in Spark 1.5 use HiveContext

2015-09-10 Thread Michael Armbrust
Can you open a JIRA? On Wed, Sep 9, 2015 at 11:11 PM, StanZhai wrote: > After upgrade spark from 1.4.1 to 1.5.0, I encountered the following > exception when use alter table statement in HiveContext: > > The sql is: ALTER TABLE a RENAME TO b > > The exception is: > > FAILED:

Re: DF.intersection issue in 1.5

2015-09-10 Thread Michael Armbrust
Thanks for pointing this out. https://issues.apache.org/jira/browse/SPARK-10539 We will fix this for Spark 1.5.1. On Thu, Sep 10, 2015 at 6:16 AM, Nitay Joffe wrote: > The following fails for me in Spark 1.5: > https://gist.github.com/nitay/d08cb294ccf00b80c49a >

Re: Dataframes: PrunedFilteredScan without Spark Side Filtering

2015-09-27 Thread Michael Armbrust
We have to try and maintain binary compatibility here, so probably the easiest thing to do here would be to add a method to the class. Perhaps something like: def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters By default, this could return all filters so behavior would remain
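
A minimal sketch of the proposal — the `unhandledFilters` name matches the message, but the `Filter` hierarchy and relations below are simplified stand-ins for illustration, not Spark's actual data source classes:

```scala
// Simplified stand-ins, not Spark's actual classes.
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter
case class GreaterThan(attribute: String, value: Any) extends Filter

trait BaseRelation {
  // Default: report every filter as unhandled, so Spark keeps filtering on
  // its side and existing sources retain their current behavior.
  def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
}

// An existing source that never overrides the new method.
object DefaultRelation extends BaseRelation

// A source that fully handles equality predicates during its scan.
object EqualityOnlyRelation extends BaseRelation {
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(_.isInstanceOf[EqualTo])
}

val filters: Array[Filter] = Array(EqualTo("a", 1), GreaterThan("b", 0))
// Spark would re-apply only the leftover filters on its side.
val leftover = EqualityOnlyRelation.unhandledFilters(filters)
```

Because the default returns all filters, sources that don't know about the method keep the old double-filtering behavior, which is how binary compatibility is preserved.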

Re: column identifiers in Spark SQL

2015-09-22 Thread Michael Armbrust
Are you using a SQLContext or a HiveContext? The programming guide suggests the latter, as the former is really only there because some applications may have conflicts with Hive dependencies. SQLContext is case sensitive by default where as the HiveContext is not. The parser in HiveContext is
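
The difference can be illustrated with a tiny resolver. This is a hypothetical helper, not Spark's internal code — the only difference between the two modes is how column names are compared:

```scala
// Hypothetical helper, not Spark's internal resolver.
def resolve(columns: Seq[String], name: String, caseSensitive: Boolean): Option[String] =
  if (caseSensitive) columns.find(_ == name)        // exact match only
  else columns.find(_.equalsIgnoreCase(name))       // Hive-style, case-insensitive

val cols = Seq("Id", "Name")
val strict = resolve(cols, "id", caseSensitive = true)  // no match, like SQLContext's default
val loose  = resolve(cols, "id", caseSensitive = false) // matches "Id", like HiveContext
```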

Re: column identifiers in Spark SQL

2015-09-22 Thread Michael Armbrust
identifiers are swallowed up: // this now returns rows consisting of the string literal "cd" sqlContext.sql("""select "c""d" from test_data""").show Thanks, -Rick Michael Armbrust <mich...@databric

Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

2015-12-08 Thread Michael Armbrust
earlierOffsetRangesAsSets.contains(scala.Tuple2.apply[org.apache.spark.streaming.Time, scala.collection.immutable.Set[org.apache.spark.streaming.kafka.OffsetRange]](or._1, scala.this.Predef.refArrayOps[org.apache.spark.streaming.kafka.Of

[VOTE] Release Apache Spark 1.6.0 (RC1)

2015-12-02 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version 1.6.0! The vote is open until Saturday, December 5, 2015 at 21:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.6.0 [ ] -1 Do not release this package

Re: When to cut RCs

2015-12-02 Thread Michael Armbrust
> > Sorry for a second email so soon. I meant to also ask, what keeps the cost > of making an RC high? Can we bring it down with better tooling? > There is a lot of tooling: https://amplab.cs.berkeley.edu/jenkins/view/Spark-Packaging/ Still you have to check JIRA, sync with people who have been

Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

2015-12-02 Thread Michael Armbrust
I'm going to kick the voting off with a +1 (binding). We ran TPC-DS and most queries are faster than 1.5. We've also ported several production pipelines to 1.6.

Re: SQL language vs DataFrame API

2015-12-09 Thread Michael Armbrust
wrote: Hi, Michael, Does that mean SqlContext will be built on HiveQL in the near future? Thanks, Xiao Li 2015-12-09 10:36 GMT-08:00 Michael Armbrust <mich...@databricks.com>: I think that it is generally good to have p

Re: When to cut RCs

2015-12-02 Thread Michael Armbrust
Thanks for bringing this up Sean. I think we are all happy to adopt concrete suggestions to make the release process more transparent, including pinging the list before kicking off the release build. Technically there's still a Blocker bug: > https://issues.apache.org/jira/browse/SPARK-12000

Re: SQL language vs DataFrame API

2015-12-09 Thread Michael Armbrust
at 19:41, Xiao Li <gatorsm...@gmail.com> wrote: That sounds great! When it is decided, please let us know and we can add more features and make it ANSI SQL compliant. Thank you! Xiao Li 2015-12-09 11:31

Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

2015-12-10 Thread Michael Armbrust
We are getting close to merging patches for SPARK-12155 <https://issues.apache.org/jira/browse/SPARK-12155> and SPARK-12253 <https://issues.apache.org/jira/browse/SPARK-12253>. I'll be cutting RC2 shortly after that. Michael On Tue, Dec 8, 2015 at 10:31 AM, Michael Ar

[VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-16 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version 1.6.0! The vote is open until Saturday, December 19, 2015 at 18:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.6.0 [ ] -1 Do not release this package

Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-16 Thread Michael Armbrust
+1 On Wed, Dec 16, 2015 at 4:37 PM, Andrew Or wrote: > +1 > > Mesos cluster mode regression in RC2 is now fixed (SPARK-12345 > / PR10332 > ). > > Also tested on standalone

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-12 Thread Michael Armbrust
I'll kick off the voting with a +1. On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust <mich...@databricks.com> wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.6.0! > > The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and pass

[VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-12 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version 1.6.0! The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.6.0 [ ] -1 Do not release this package because

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-12 Thread Michael Armbrust
<https://github.com/apache/spark/pull/10193>: Element[W|w]iseProductExample.scala is not the same in the docs and the actual file name. On Sat, Dec 12, 2015 at 6:39 PM, Michael Armbrust <mich...@databricks.com> wrote: I'll kick off the vot

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-15 Thread Michael Armbrust
SPARK_VERSION: 1.6.0-SNAPSHOT On Mon, Dec 14, 2015 at 6:51 PM, Krishna Sankar <ksanka...@gmail.com> wrote: Guys, The sc.version gives 1.6.0-SNAPSHOT. Need to change to 1.6.0. Can you pl verify? Cheers

Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

2015-12-10 Thread Michael Armbrust
Cutting RC2 now. On Thu, Dec 10, 2015 at 12:59 PM, Michael Armbrust <mich...@databricks.com> wrote: > We are getting close to merging patches for SPARK-12155 > <https://issues.apache.org/jira/browse/SPARK-12155> and SPARK-12253 > <https://issues.apache.org/jira/br

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-14 Thread Michael Armbrust
wrote: +1 tested SparkSQL and Streaming on some production sized workloads On Sat, Dec 12, 2015 at 4:16 PM, Mark Hamstra <m...@clearstorydata.com> wrote: +1 On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust <mich...@databricks.com

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-12 Thread Michael Armbrust
> > I'm surprised you're suggesting there's not a coupling between a release's > code and the docs for that release. If a release happens and some time > later docs come out, that has some effect on people's usage. > I'm only suggesting that we shouldn't delay testing of the actual bits, or wait

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-12 Thread Michael Armbrust
can be released a little bit after the code artifacts with last minute fixes. But, the whole release can just happen later too. Why wouldn't this be a valid reason to block the release? On Sat, Dec 12, 2015 at 6:31 PM, Michael Armbrust <mich...@databricks.com> w

[ANNOUNCE] Announcing Spark 1.6.0

2016-01-04 Thread Michael Armbrust
Hi All, Spark 1.6.0 is the seventh release on the 1.x line. This release includes patches from 248+ contributors! To download Spark 1.6.0 visit the downloads page. (It may take a while for all mirrors to update.) A huge thanks go to all of the individuals and organizations involved in

Re: problem with reading source code-pull out nondeterministic expresssions

2015-12-30 Thread Michael Armbrust
The goal here is to ensure that the non-deterministic value is evaluated only once, so the result won't change for a given row (i.e. when sorting). On Tue, Dec 29, 2015 at 10:57 PM, 汪洋 wrote: > Hi fellas, > I am new to spark and I have a newbie question. I am currently
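
The motivation can be mimicked with plain Scala collections. This mirrors the idea only, not Spark's actual rewrite rule: a non-deterministic "expression" is materialized once per row before it is used as a sort key, so a row cannot produce a different key each time the sort compares it.

```scala
import scala.util.Random

// Hypothetical stand-in for a non-deterministic expression like rand().
val rng = new Random(42)
val rows = Seq("a", "b", "c")

// Project the non-deterministic value exactly once per row...
val projected = rows.map(r => (r, rng.nextInt(100)))

// ...then sort by the cached result. Re-sorting gives the same order, which
// would not be guaranteed if the key were recomputed on every comparison.
val sorted = projected.sortBy(_._2).map(_._1)
```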

Re: Dataset throws: Task not serializable

2016-01-07 Thread Michael Armbrust
Were you running in the REPL? On Thu, Jan 7, 2016 at 10:34 AM, Michael Armbrust <mich...@databricks.com> wrote: > Thanks for providing a great description. I've opened > https://issues.apache.org/jira/browse/SPARK-12696 > > I'm actually getting a different error (running i

Re: Dataset throws: Task not serializable

2016-01-07 Thread Michael Armbrust
Thanks for providing a great description. I've opened https://issues.apache.org/jira/browse/SPARK-12696 I'm actually getting a different error (running in notebooks though). Something seems wrong either way. > > *P.S* mapping by name with case classes doesn't work if the order of the > fields

Re: Expression/LogicalPlan dichotomy in Spark SQL Catalyst

2015-12-21 Thread Michael Armbrust
> > Why was the choice made in Catalyst to make LogicalPlan/QueryPlan and > Expression separate subclasses of TreeNode, instead of e.g. also make > QueryPlan inherit from Expression? > I think this is a pretty common way to model things (glancing at postgres it looks similar). Expression and

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

2015-12-22 Thread Michael Armbrust
I'll kick the voting off with a +1. On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust <mich...@databricks.com> wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.6.0! > > The vote is open until Friday, December 25, 2015 at 18:00 UTC and pass

[VOTE] Release Apache Spark 1.6.0 (RC4)

2015-12-22 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version 1.6.0! The vote is open until Friday, December 25, 2015 at 18:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.6.0 [ ] -1 Do not release this package because

Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-21 Thread Michael Armbrust
couple Stream Apps, all seem ok. On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust <mich...@databricks.com> wrote: Please vote on releasing the following candidate as Apache Spark version 1.6.0! Th

Re: new datasource

2015-11-19 Thread Michael Armbrust
Yeah, CatalystScan should give you everything we can possibly push down in raw form. Note that this is not compatible across different spark versions. On Thu, Nov 19, 2015 at 8:55 AM, james.gre...@baesystems.com < james.gre...@baesystems.com> wrote: > Thanks Hao > > > > I have written a new

[ANNOUNCE] Spark 1.6.0 Release Preview

2015-11-22 Thread Michael Armbrust
In order to facilitate community testing of Spark 1.6.0, I'm excited to announce the availability of an early preview of the release. This is not a release candidate, so there is no voting involved. However, it'd be awesome if community members can start testing with this preview package and

Re: Databricks SparkPerf with Spark 2.0

2016-06-14 Thread Michael Armbrust
NoSuchMethodError always means that you are compiling against a different classpath than is available at runtime, so it sounds like you are on the right track. The project is not abandoned, we're just busy with the release. It would be great if you could open a pull request. On Tue, Jun 14,

Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-01 Thread Michael Armbrust
Yeah, we don't usually publish RCs to central, right? On Wed, Jun 1, 2016 at 1:06 PM, Reynold Xin wrote: > They are here ain't they? > > https://repository.apache.org/content/repositories/orgapachespark-1182/ > > Did you mean publishing them to maven central? My

Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-01 Thread Michael Armbrust
> > I'd think we want less effort, not more, to let people test it? for > example, right now I can't easily try my product build against > 2.0.0-preview. I don't feel super strongly one way or the other, so if we need to publish it permanently we can. However, either way you can still test

Re: [VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-22 Thread Michael Armbrust
+1 On Wed, Jun 22, 2016 at 11:33 AM, Jonathan Kelly wrote: > +1 > > On Wed, Jun 22, 2016 at 10:41 AM Tim Hunter > wrote: > >> +1 This release passes all tests on the graphframes and tensorframes >> packages. >> >> On Wed, Jun 22, 2016 at 7:19

Re: cutting 1.6.2 rc and 2.0.0 rc this week?

2016-06-15 Thread Michael Armbrust
+1 to both of these! On Wed, Jun 15, 2016 at 12:21 PM, Sean Owen wrote: > 1.6.2 RC seems fine to me; I don't know of outstanding issues. Clearly > we need to keep the 1.x line going for a bit, so a bug fix release > sounds good, > > Although we've got some work to do before

Re: Hello

2016-06-17 Thread Michael Armbrust
Another good signal is the "target version" (which by convention is only set by committers). When I set this for the upcoming version it means I think its important enough that I will prioritize reviewing a patch for it. On Fri, Jun 17, 2016 at 3:22 PM, Pedro Rodriguez

Re: Question about equality of o.a.s.sql.Row

2016-06-20 Thread Michael Armbrust
> > This is because two objects are compared by "o1 != o2" instead of > "o1.equals(o2)" at > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala#L408 Even equals(...) does not do what you want on the JVM: scala> Array(1,2).equals(Array(1,2))
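
The JVM behavior pointed out here is easy to reproduce in plain Scala:

```scala
// Arrays on the JVM use reference equality, so equals does not compare
// contents; use sameElements (or java.util.Arrays.equals) instead.
val a = Array(1, 2)
val b = Array(1, 2)

val byEquals   = a.equals(b)                    // false: reference equality
val byElements = a.sameElements(b)              // true: element-wise
val byUtil     = java.util.Arrays.equals(a, b)  // true: element-wise
```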

Re: Encoder Guide / Option[T] Encoder

2016-06-16 Thread Michael Armbrust
There is no public API for writing encoders at the moment, though we are hoping to open this up in Spark 2.1. What is not working about encoders for options? Which version of Spark are you running? This is working as I would expect?

Spark 1.6.1

2016-01-13 Thread Michael Armbrust
Hey All, While I'm not aware of any critical issues with 1.6.0, there are several corner cases that users are hitting with the Dataset API that are fixed in branch-1.6. As such I'm considering a 1.6.1 release. At the moment there are only two critical issues targeted for 1.6.1: - SPARK-12624 -

Re: Preserving partitioning with dataframe select

2016-02-09 Thread Michael Armbrust
RDD level partitioning information is not used to decide when to shuffle for queries planned using Catalyst (since we have better information about distribution from the query plan itself). Instead you should be looking at the logic in EnsureRequirements

Re: Error aliasing an array column.

2016-02-09 Thread Michael Armbrust
That looks like a bug in toString for columns. Can you open a JIRA? On Tue, Feb 9, 2016 at 1:38 PM, Rakesh Chalasani wrote: > Sorry, didn't realize the mail didn't show the code. Using Spark release > 1.6.0 > > Below is an example to reproduce it. > > import

Re: Spark 1.6.1

2016-02-01 Thread Michael Armbrust
Are there other blockers for Spark 1.6.1? Thanks On Wed, Jan 13, 2016 at 5:39 PM, Michael Armbrust <mich...@databricks.com> wrote: Hey All, While I'm not aware of any critical issues with 1.6.0, there

Re: Spark 1.6.1

2016-02-02 Thread Michael Armbrust
coming along. Thanks! Mingyu From: Romi Kuntsman <r...@totango.com> Date: Tuesday, February 2, 2016 at 3:16 AM To: Michael Armbrust <mich...@databricks.com> Cc: Hamel Kothari <hamelkoth...@gmail.com>, Ted Yu <yuzhih...@gmail.com>, "dev@spa

Re: Spark 1.6.1

2016-02-02 Thread Michael Armbrust
> > What about the memory leak bug? > https://issues.apache.org/jira/browse/SPARK-11293 > Even after the memory rewrite in 1.6.0, it still happens in some cases. > Will it be fixed for 1.6.1? > I think we have enough issues queued up that I would not hold the release for that, but if there is a

Re: Spark 2.0.0 release plan

2016-01-27 Thread Michael Armbrust
We do maintenance releases on demand when there is enough to justify doing one. I'm hoping to cut 1.6.1 soon, but have not had time yet. On Wed, Jan 27, 2016 at 8:12 AM, Daniel Siegmann < daniel.siegm...@teamaol.com> wrote: > Will there continue to be monthly releases on the 1.6.x branch during

Re: Spark 2.0.0 release plan

2016-01-29 Thread Michael Armbrust
Spark builds to Scala 2.11 with Spark 2.0? Regards Deenar On 27 January 2016 at 19:55, Michael Armbrust <mich...@databricks.com> wrote: We do maintenance releases on demand when there is enough to justify

Re: Spark 1.6.1

2016-01-29 Thread Michael Armbrust
I think this is fixed in branch-1.6 already. If you can reproduce it there can you please open a JIRA and ping me? On Fri, Jan 29, 2016 at 12:16 PM, deenar < deenar.toras...@thinkreactive.co.uk> wrote: > Hi Michael > > The Dataset aggregators do not appear to support complex Spark-SQL types. I

Re: Spark 1.6.1

2016-02-22 Thread Michael Armbrust
An update: people.apache.org has been shut down so the release scripts are broken. Will try again after we fix them. On Mon, Feb 22, 2016 at 6:28 PM, Michael Armbrust <mich...@databricks.com> wrote: > I've kicked off the build. Please be extra careful about merging into > branch-1.6

Re: Spark 1.6.1

2016-02-22 Thread Michael Armbrust
I've kicked off the build. Please be extra careful about merging into branch-1.6 until after the release. On Mon, Feb 22, 2016 at 10:24 AM, Michael Armbrust <mich...@databricks.com> wrote: > I will cut the RC today. Sorry for the delay! > > On Mon, Feb 22, 2016 at 5:19 AM

Re: Spark 1.6.1

2016-02-24 Thread Michael Armbrust
restored. FYI On Mon, Feb 22, 2016 at 10:07 PM, Luciano Resende <luckbr1...@gmail.com> wrote: On Mon, Feb 22, 2016 at 9:08 PM, Michael Armbrust <mich...@databricks.com> wrote: An update: people.apach

Re: Spark 1.6.1

2016-02-22 Thread Michael Armbrust
I will cut the RC today. Sorry for the delay! On Mon, Feb 22, 2016 at 5:19 AM, Patrick Woody <patrick.woo...@gmail.com> wrote: Hey Michael, Any update on a first cut of the RC? Thanks! -Pat On Mon, Feb 15, 2016 at 6:50 PM, Michael Armbrust

Re: Spark 1.6.1

2016-02-15 Thread Michael Armbrust
two last unresolved issues targeting 1.6.1 are fixed <https://github.com/apache/spark/pull/11131> now <https://github.com/apache/spark/pull/10539>. On 3 February 2016 at 08:16, Daniel Darabos <daniel.dara...@lynxanalytics.com> wrote:

[ANNOUNCE] Announcing Spark 1.6.1

2016-03-10 Thread Michael Armbrust
Spark 1.6.1 is a maintenance release containing stability fixes. This release is based on the branch-1.6 maintenance branch of Spark. We *strongly recommend* all 1.6.0 users to upgrade to this release. Notable fixes include: - Workaround for OOM when writing large partitioned tables SPARK-12546

Re: question about catalyst and TreeNode

2016-03-15 Thread Michael Armbrust
Trees are immutable, and TreeNode takes care of copying unchanged parts of the tree when you are doing transformations. As a result, even if you do construct a DAG with the Dataset API, the first transformation will turn it back into a tree. The only exception to this rule is when we share the
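
Copy-on-transform for an immutable tree can be sketched in a few lines. These are hypothetical simplified types — Catalyst's TreeNode generalizes this to arbitrary children — but the sharing behavior is the same:

```scala
// Hypothetical simplified tree: transform rebuilds only the path to a
// change and returns the very same instance when nothing underneath changed.
sealed trait Node {
  def transform(rule: PartialFunction[Node, Node]): Node = {
    val afterChildren = this match {
      case Branch(l, r) =>
        val nl = l.transform(rule)
        val nr = r.transform(rule)
        // Copy only if a child actually changed; otherwise reuse this node.
        if ((nl eq l) && (nr eq r)) this else Branch(nl, nr)
      case leaf => leaf
    }
    rule.applyOrElse(afterChildren, identity[Node])
  }
}
case class Leaf(value: Int) extends Node
case class Branch(left: Node, right: Node) extends Node

val tree = Branch(Leaf(1), Branch(Leaf(2), Leaf(3)))

// Rewriting Leaf(2) copies the spine above it, but the untouched left
// subtree is shared; a rule that matches nothing returns `tree` itself.
val rewritten = tree.transform { case Leaf(2) => Leaf(20) }
val untouched = tree.transform { case Leaf(99) => Leaf(0) }
```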

Re: [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-09 Thread Michael Armbrust
at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253) at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-18 Thread Michael Armbrust
Patrick reuploaded the artifacts, so it should be fixed now. On Mar 16, 2016 5:48 PM, "Nicholas Chammas" wrote: > Looks like the other packages may also be corrupt. I’m getting the same > error for the Spark 1.6.1 / Hadoop 2.4 package. > > >

[RESULT] [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-09 Thread Michael Armbrust
This vote passes with nine +1s (five binding) and one binding +0! Thanks to everyone who tested/voted. I'll start work on publishing the release today. +1: Mark Hamstra* Moshe Eshel Egor Pahomov Reynold Xin* Yin Huai* Andrew Or* Burak Yavuz Kousuke Saruta Michael Armbrust* 0: Sean Owen* -1

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Michael Armbrust
+1 to Matei's reasoning. On Wed, Mar 30, 2016 at 9:21 AM, Matei Zaharia wrote: > I agree that putting it in 2.0 doesn't mean keeping Scala 2.10 for the > entire 2.x line. My vote is to keep Scala 2.10 in Spark 2.0, because it's > the default version we built with in

Re: Spark SQL UDF Returning Rows

2016-03-30 Thread Michael Armbrust
Some answers and more questions inline - UDFs can pretty much only take in Primitives, Seqs, Maps and Row objects > as parameters. I cannot take in a case class object in place of the > corresponding Row object, even if the schema matches because the Row object > will always be passed in at

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-24 Thread Michael Armbrust
$ wget https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz --2016-03-18 07:55:30--

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Michael Armbrust
On Thu, Mar 24, 2016 at 4:54 PM, Mark Hamstra wrote: > It's a pain in the ass. Especially if some of your transitive > dependencies never upgraded from 2.10 to 2.11. > Yeah, I'm going to have to agree here. It is not as bad as it was in the 2.9 days, but it's still

Re: Spark structured streaming

2016-03-08 Thread Michael Armbrust
This is in active development, so there is not much that can be done from an end user perspective. In particular the only sink that is available in apache/master is a testing sink that just stores the data in memory. We are working on a parquet based file sink and will eventually support all the

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Michael Armbrust
-dev +user StructType(StructField(data,ArrayType(StructType(StructField(stuff,ArrayType(StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true), StructField(name,StringType,true)),true),true), StructField(othertype,

Re: Nulls getting converted to 0 with spark 2.0 SNAPSHOT

2016-03-07 Thread Michael Armbrust
That looks like a bug to me. Open a JIRA? On Mon, Mar 7, 2016 at 11:30 AM, Franklyn D'souza < franklyn.dso...@shopify.com> wrote: > Just wanted to confirm that this is the expected behaviour. > > Basically I'm putting nulls into a non-nullable LongType column and doing > a transformation

Re: What influences the space complexity of Spark operations?

2016-04-01 Thread Michael Armbrust
Blocking operators like Sort, Join or Aggregate will put all of the data for a whole partition into a hash table or array. However, if you are running Spark 1.5+ we should be spilling to disk. In Spark 1.6 if you are seeing OOMs for SQL operations you should report it as a bug. On Thu, Mar 31,

Re: [SQL] Dataset.map gives error: missing parameter type for expanded function?

2016-04-04 Thread Michael Armbrust
It is called groupByKey now. Similar to joinWith, the schema produced by relational joins and aggregations is different than what you would expect when working with objects. So, when combining DataFrame+Dataset we renamed these functions to make this distinction clearer. On Sun, Apr 3, 2016 at

Re: Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-25 Thread Michael Armbrust
Spark SQL's query planner has always delayed building the RDD, so has never needed to eagerly calculate the range boundaries (since Spark 1.0). On Mon, Apr 25, 2016 at 2:04 AM, Praveen Devarao wrote: > Thanks Reynold for the reason as to why sortBykey invokes a Job > >

Re: Possible Hive problem with Spark 2.0.0 preview.

2016-05-19 Thread Michael Armbrust
> > 1. “val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)” > doesn’t work because “HiveContext not a member of > org.apache.spark.sql.hive” I checked the documentation, and it looks like > it should still work for spark-2.0.0-preview-bin-hadoop2.7.tgz > HiveContext has been

Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Michael Armbrust
We did turn on Travis a few years ago, but ended up turning it off because it was failing (I believe because of insufficient resources), which was confusing for developers. I wouldn't be opposed to turning it on if it provides more/faster signal, but it's not obvious to me that it would. In

Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-24 Thread Michael Armbrust
> > i can't give you permissions -- that has to be (most likely) through > someone @ databricks, like michael. > Another clarification: not databricks, but the Apache Spark PMC grants access to the JIRA / wiki. That said... I'm not actually sure how its done.

Re: CompileException for spark-sql generated code in 2.0.0-SNAPSHOT

2016-05-17 Thread Michael Armbrust
Yeah, can you open a JIRA with that reproduction please? You can ping me on it. On Tue, May 17, 2016 at 4:55 PM, Reynold Xin wrote: > It seems like the problem here is that we are not using unique names > for mapelements_isNull? > > > > On Tue, May 17, 2016 at 3:29 PM,

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-18 Thread Michael Armbrust
+1, excited for 2.0! On Wed, May 18, 2016 at 10:06 AM, Krishna Sankar wrote: > +1. Looks Good. > The mllib results are in line with 1.6.1. Deprecation messages. I will > convert to ml and test later in the day. > Also will try GraphX exercises for our Strata London Tutorial

Re: [discuss] separate API annotation into two components: InterfaceAudience & InterfaceStability

2016-05-13 Thread Michael Armbrust
+1 to the general structure of Reynold's proposal. I've found what we do currently a little confusing. In particular, it doesn't make much sense that @DeveloperApi things are always labeled as possibly changing. For example the Data Source API should arguably be one of the most stable

Re: Skipping Type Conversion and using InternalRows for UDF

2016-04-15 Thread Michael Armbrust
This would also probably improve performance: https://github.com/apache/spark/pull/9565 On Fri, Apr 15, 2016 at 8:44 AM, Hamel Kothari wrote: > Hi all, > > So we have these UDFs which take <1ms to operate and we're seeing pretty > poor performance around them in
