Re: SPARK-8813 - combining small files in spark sql

2016-07-07 Thread Reynold Xin
When using native data sources (e.g. Parquet, ORC, JSON, ...), partitions are automatically merged so they would add up to a specific size, configurable by spark.sql.files.maxPartitionBytes. spark.sql.files.openCostInBytes is used to specify the cost of each "file". That is, an empty file will be
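
As a rough illustration of how the two settings interact, here is a simplified greedy packing sketch in plain Python. It is not Spark's implementation: `pack_files` and its parameter names are hypothetical, and it assumes each file is charged its size plus a fixed open cost when it is added to a partition.

```python
from typing import List


def pack_files(file_sizes: List[int],
               max_partition_bytes: int,
               open_cost_bytes: int) -> List[List[int]]:
    """Greedily group file sizes into partitions (simplified illustration)."""
    partitions: List[List[int]] = []
    current: List[int] = []
    current_cost = 0
    # Visit the largest files first so small files fill the leftover space.
    for size in sorted(file_sizes, reverse=True):
        # Close the current partition if this file would push it past the target.
        if current and current_cost + size > max_partition_bytes:
            partitions.append(current)
            current, current_cost = [], 0
        current.append(size)
        # Every file, even an empty one, is charged the fixed open cost.
        current_cost += size + open_cost_bytes
    if current:
        partitions.append(current)
    return partitions
```

With a non-zero open cost, a large number of tiny files still gets spread over several partitions instead of all landing in one, because the open cost alone eventually exceeds the target size.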

Re: Latest spark release in the 1.4 branch

2016-07-07 Thread Reynold Xin
I think last time I tried I had some trouble releasing it because the release scripts no longer work with branch-1.4. You can build from the branch yourself, but it might be better to upgrade to the later versions. On Wed, Jul 6, 2016 at 11:02 PM, Niranda Perera wrote:

Re: Latest spark release in the 1.4 branch

2016-07-07 Thread Reynold Xin
Yes definitely. On Wed, Jul 6, 2016 at 11:08 PM, Niranda Perera <niranda.per...@gmail.com> wrote: > Thanks Reynold for the prompt response. Do you think we could use a > 1.4-branch latest build in a production environment? > > > > On Thu, Jul 7, 2016 at

Re: Bad JIRA components

2016-07-07 Thread Reynold Xin
I deleted those. On Thu, Jul 7, 2016 at 1:27 PM, Nicholas Chammas wrote: > > https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:components-panel > > There are several bad components in there, like docs, MLilb, and sq;.

Re: Understanding pyspark data flow on worker nodes

2016-07-08 Thread Reynold Xin
You can look into its source code: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala On Thu, Jul 7, 2016 at 11:01 PM, Amit Rana wrote: > Hi all, > > Did anyone get a chance to look into it?? > Any sort of

Re: Why's ds.foreachPartition(println) not possible?

2016-07-05 Thread Reynold Xin
This seems like a Scala compiler bug. On Tuesday, July 5, 2016, Jacek Laskowski wrote: > Well, there is foreach for Java and another foreach for Scala. That's > what I can understand. But while supporting two language-specific APIs > -- Scala and Java -- Dataset API lost

Re: branch-2.0 is now 2.0.1-SNAPSHOT?

2016-07-11 Thread Reynold Xin
I just bumped master branch version to 2.1.0-SNAPSHOT https://github.com/apache/spark/commit/ffcb6e055a28f36208ed058a42df09c154555332 We used to have a problem with binary compatibility check not having the 2.0.0 base version in Maven (because 2.0.0 hasn't been released yet) but I figured out a

Re: We don't use ASF Jenkins builds, right?

2016-08-04 Thread Reynold Xin
We don't. On Friday, August 5, 2016, Sean Owen wrote: > There was a recent message about deprecating many Maven, ant and JDK > combos for ASF Jenkins machines, and I was just triple-checking we're > only making use of the Amplab ones. > >

Re: PySpark: Make persist() return a context manager

2016-08-05 Thread Reynold Xin
Sounds like a great idea! On Friday, August 5, 2016, Nicholas Chammas wrote: > Context managers > are > a natural way to capture closely related setup and teardown code in Python. > > For example,
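
A minimal sketch of the setup/teardown pattern under discussion, using only the standard library. `FakeFrame` is a hypothetical stand-in with `persist()` and `unpersist()` methods, not the PySpark DataFrame class, and `persisted` is an illustrative helper name rather than anything proposed in the thread.

```python
from contextlib import contextmanager


class FakeFrame:
    """Hypothetical stand-in for a DataFrame; not the PySpark class."""

    def __init__(self):
        self.cached = False

    def persist(self):
        self.cached = True
        return self

    def unpersist(self):
        self.cached = False
        return self


@contextmanager
def persisted(frame):
    """Persist on entry and unpersist on exit, even if the body raises."""
    frame.persist()
    try:
        yield frame
    finally:
        frame.unpersist()
```

Usage would look like `with persisted(df) as cached_df: ...`, so the teardown cannot be forgotten.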

Re: [SPARK-2.0][SQL] UDF containing non-serializable object does not work as expected

2016-08-08 Thread Reynold Xin
> > The workaround I can imagine is just to cache and materialize `df` by > `df.cache.count()`, and then call `df.filter(...).show()`. > It should work, just a little bit tedious. > > > > On Mon, Aug 8, 2016 at 10:00 PM, Reynold Xin <r...@databricks.com> wrote: >

Re: SASL Support

2016-08-08 Thread Reynold Xin
Please send a pull request to update the doc. Thanks. On Tue, Aug 9, 2016 at 6:48 AM, Michael Gummelt wrote: > I was checking if RPC calls can be encrypted and I saw here that the docs > here (*http://spark.apache.org/docs/latest/configuration.html >

Re: [SPARK-2.0][SQL] UDF containing non-serializable object does not work as expected

2016-08-08 Thread Reynold Xin
That is unfortunately how the Scala compiler captures (and defines) closures. Nothing is really final in the JVM. You can always use reflection or unsafe to modify the value of fields. On Mon, Aug 8, 2016 at 8:16 PM, Simon Scott wrote: > But does the “notSer”

Re: Debugging Spark itself in standalone cluster mode

2016-06-30 Thread Reynold Xin
Yes, scheduling is centralized in the driver. For debugging, I think you'd want to set the executor JVM, not the worker JVM flags. On Thu, Jun 30, 2016 at 11:36 AM, cbruegg wrote: > Hello everyone, > > I'm a student assistant in research at the University of Paderborn,

Re: Logical Plan

2016-06-30 Thread Reynold Xin
Which version are you using here? If the underlying files change, technically we should go through optimization again. Perhaps the real "fix" is to figure out why logical plan creation is so slow for 700 columns. On Thu, Jun 30, 2016 at 1:58 PM, Darshan Singh wrote: >

Re: Logical Plan

2016-06-30 Thread Reynold Xin
or subtracting these and it again takes > lots of time. > > Not sure what could be done here. > > Thanks > > On Thu, Jun 30, 2016 at 10:10 PM, Reynold Xin <r...@databricks.com> wrote: > >> Which version are you using here? If the underlying files change, >> technic

Re: Code Style Formatting

2016-07-01 Thread Reynold Xin
There isn't one pre-made, but the default works out OK. The main thing you'd need to update are spacing changes for function argument indentation and import ordering. On Fri, Jul 1, 2016 at 4:11 AM, Anton Okolnychyi wrote: > Hi, all. > > I've read the Spark code

Re: [jira] [Resolved] (SPARK-16345) Extract graphx programming guide example snippets from source files instead of hard code them

2016-07-02 Thread Reynold Xin
Because in that case you cannot merge anything meant for 2.1 until 2.0 is released. On Saturday, July 2, 2016, Jacek Laskowski wrote: > Hi, > > Always release from master. What could be the gotchas? > > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/

Re: Dataset and Aggregator API pain points

2016-07-02 Thread Reynold Xin
Thanks, Koert, for the great email. They are all great points. We should probably create an umbrella JIRA for easier tracking. On Saturday, July 2, 2016, Koert Kuipers wrote: > after working with the Dataset and Aggregator apis for a few weeks porting > some fairly complex

[VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-19 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.6.2. The vote is open until Wednesday, June 22, 2016 at 22:00 PDT and passes if a majority of at least 3+1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.6.2 [ ] -1 Do not release this package because

Re: Jenkins networking / port contention

2016-07-01 Thread Reynold Xin
Multiple instances of test runs are usually running in parallel, so they would need to bind to different ports. On Friday, July 1, 2016, Cody Koeninger wrote: > Thanks for the response. I'm talking about test code that starts up > embedded network services for integration
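
One common way for parallel test runs to avoid port collisions is to bind to port 0 and let the operating system pick a free port. The sketch below shows that idea with Python's standard `socket` module; `bind_ephemeral` is a hypothetical helper name and is not taken from Spark's test code.

```python
import socket


def bind_ephemeral(host: str = "127.0.0.1") -> socket.socket:
    """Return a listening TCP socket on an OS-assigned free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 asks the OS to choose any free port
    sock.listen(1)
    return sock
```

The port actually chosen can be read back with `sock.getsockname()[1]` and passed to whatever client the test starts.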

Re: Structured Streaming with Kafka sources/sinks

2016-08-16 Thread Reynold Xin
We (the team at Databricks) are working on one currently. On Mon, Aug 15, 2016 at 7:26 PM, Cody Koeninger wrote: > https://issues.apache.org/jira/browse/SPARK-15406 > > I'm not working on it (yet?), never got an answer to the question of > who was planning to work on it. >

Re: PSA: Java 8 unidoc build

2017-02-07 Thread Reynold Xin
I don't know if this would help but I think we can also officially stop supporting Java 7 ... On Tue, Feb 7, 2017 at 1:06 PM, Sean Owen wrote: > I believe that if we ran the Jenkins builds with Java 8 we would catch > these? this doesn't require dropping Java 7 support or

Re: drop java 7 support for spark 2.1.x or spark 2.2.x

2017-02-07 Thread Reynold Xin
BTW I created a JIRA ticket for tracking: https://issues.apache.org/jira/browse/SPARK-19493 We of course shouldn't do anything until we achieve consensus. On Tue, Feb 7, 2017 at 3:47 PM, Reynold Xin <r...@databricks.com> wrote: > Bumping this. > > Given we see the occassion

Re: drop java 7 support for spark 2.1.x or spark 2.2.x

2017-02-07 Thread Reynold Xin
Bumping this. Given we see occasional build breaks with Java 8, we should reconsider this and do it for 2.2 or 2.3. By the time 2.2 is released, it will be almost a year since this thread started. On Sun, Jul 24, 2016 at 12:59 AM, Mark Hamstra wrote: > Sure,

Re: Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-01-22 Thread Reynold Xin
Are you using G1 GC? G1 sometimes uses a lot more memory than the size allocated. On Sun, Jan 22, 2017 at 12:58 AM StanZhai wrote: > Hi all, > > > > We just upgraded our Spark from 1.6.2 to 2.1.0. > > > > Our Spark application is started by spark-submit with config of > >

Re: A question about creating persistent table when in-memory catalog is used

2017-01-22 Thread Reynold Xin
the regular data source tables and insert the > data into the tables. The major difference is whether the metadata is > persistently stored or not. > > Thanks, > > Xiao Li > > 2017-01-22 11:14 GMT-08:00 Reynold Xin <r...@databricks.com>: > > I think this is somethi

welcoming Burak and Holden as committers

2017-01-24 Thread Reynold Xin
Hi all, Burak and Holden have recently been elected as Apache Spark committers. Burak has been very active in a large number of areas in Spark, including linear algebra, stats/maths functions in DataFrames, Python/R APIs for DataFrames, dstream, and most recently Structured Streaming. Holden

Re: [PYTHON][DISCUSS] Moving to cloudpickle and or Py4J as a dependencies?

2017-02-13 Thread Reynold Xin
With any dependency update (or refactoring of existing code), I always ask this question: what's the benefit? In this case it looks like the benefit is to reduce efforts in backports. Do you know how often we needed to do those? On Tue, Feb 14, 2017 at 12:01 AM, Holden Karau

Re: Java 9

2017-02-09 Thread Reynold Xin
tl;dr: The critical internal APIs proposed to remain accessible in JDK 9 are: sun.misc.{Signal,SignalHandler} sun.misc.Unsafe (The functionality of many of the methods in this class is now available via variable handles (JEP 193).) sun.reflect.Reflection::getCallerClass(int) (The functionality

Re: benefits of code gen

2017-02-10 Thread Reynold Xin
With complex types it doesn't work as well, but for primitive types the biggest benefit of whole stage codegen is that we don't even need to put the intermediate data into rows or columns anymore. They are just variables (stored in CPU registers). On Fri, Feb 10, 2017 at 8:22 PM, Koert Kuipers
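
The shape of the difference can be sketched in plain Python. This is only an analogy, since Python does not keep values in CPU registers: the first function chains operators that pass row objects to each other, while the second is the kind of single fused loop that generated code would resemble, with intermediate values held in local variables. All function names here are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int]


def filter_op(rows: Iterable[Row]) -> Iterator[Row]:
    """Operator 1: keep rows whose first column is even."""
    for row in rows:
        if row[0] % 2 == 0:
            yield row


def project_op(rows: Iterable[Row]) -> Iterator[Row]:
    """Operator 2: multiply the first column by 10, producing a new row."""
    for row in rows:
        yield (row[0] * 10,)


def sum_operator_at_a_time(values: List[int]) -> int:
    """Each value is wrapped in a row and handed from operator to operator."""
    rows = ((v,) for v in values)
    return sum(row[0] for row in project_op(filter_op(rows)))


def sum_fused(values: List[int]) -> int:
    """Same query as one loop; intermediates are plain local variables."""
    total = 0
    for v in values:
        if v % 2 == 0:
            total += v * 10
    return total
```

Both functions compute the same result; the fused version simply never builds an intermediate row.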

Re: [Newbie] spark conf

2017-02-10 Thread Reynold Xin
You can put them in spark's own conf/spark-defaults.conf file On Fri, Feb 10, 2017 at 10:35 PM, Sam Elamin wrote: > Hi All, > > > really newbie question here folks, i have properties like my aws access > and secret keys in the core-site.xml in hadoop among other

Re: Update Public Documentation - SparkSession instead of SparkContext

2017-02-15 Thread Reynold Xin
There is an existing pull request to update it: https://github.com/apache/spark/pull/16856 But it is a little bit tricky. On Wed, Feb 15, 2017 at 7:44 AM, Chetan Khatri wrote: > Hello Spark Dev Team, > > I was working with my team having most of the confusion

Re: Spark Improvement Proposals

2017-02-13 Thread Reynold Xin
. >> >> But it's been almost half a year, and nothing visible has been done. >> >> Reynold, are you going to do this? >> >> If so, when? >> >> If not, why? >> >> You already did the right thing by including long-deserved committers.

welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Reynold Xin
Hi all, Takuya-san has recently been elected an Apache Spark committer. He's been active in the SQL area and writes very small, surgical patches that are high quality. Please join me in congratulating Takuya-san!

Re: Spark Improvement Proposals

2017-02-16 Thread Reynold Xin
en. > There's a mention of a month for finding a shepherd, but that's different. > > Other than that, LGTM. > > On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <r...@databricks.com> wrote: > >> Here's a new draft that incorporated most of the feedback: >> https://docs.go

Re: File JIRAs for all flaky test failures

2017-02-16 Thread Reynold Xin
What exactly is the issue? I've been working on Spark dev for a long time and very rarely do I actually run into an issue that only manifests on Jenkins but not locally. I don't have some magic local setup either. We should definitely cut down test flakiness. On Thu, Feb 16, 2017 at 5:26 PM,

Re: File JIRAs for all flaky test failures

2017-02-16 Thread Reynold Xin
Josh's tool should give enough signal there already. I don't think we need some manual process to document them. If you want to work on those that'd be great. I bet you will get a lot of love because all developers hate flaky tests. On Thu, Feb 16, 2017 at 6:19 PM, Saikat Kanjilal

Re: [Spark SQL] Making InferSchema and JacksonParser public

2017-01-18 Thread Reynold Xin
That is internal, but the amount of code is not a lot. Can you just copy the relevant classes over to your project? On Wed, Jan 18, 2017 at 5:52 AM Brian Hong wrote: > I work for a mobile game company. I'm solving a simple question: "Can we > efficiently/cheaply

critical bugs to be fixed in Spark 2.0.1?

2016-08-22 Thread Reynold Xin
We should work on a 2.0.1 release soon, since we have found a couple of critical bugs in 2.0.0. Are there any critical bugs outstanding that we should address in 2.0.1?

Re: Tree for SQL Query

2016-08-25 Thread Reynold Xin
00, splits=400) > > I'd like to have whole tree with expressions. > > So when I have "select x + y" there should by Add expresion etc. > > M. > > 2016-08-24 22:39 GMT+02:00 Reynold Xin <r...@databricks.com>: > > It's basically the output of the exp

Re: Structured Streaming with Kafka sources/sinks

2016-08-30 Thread Reynold Xin
>> be ported to Structured Streaming without Kafka support. >> >> >> >> Is there a design document somewhere? Or can someone from the >> DataBricks >> >> team break down the existing monolithic JIRA issue into smaller steps >> that >> >>

Re: 3Ps for Datasets not available?! (=Parquet Predicate Pushdown)

2016-08-30 Thread Reynold Xin
The UDF is a black box so Spark can't know what it is dealing with. There are simple cases in which we can analyze the UDFs byte code and infer what it is doing, but it is pretty difficult to do in general. On Tuesday, August 30, 2016, Jacek Laskowski wrote: > Hi, > > I've been

Reynold on vacation next two weeks

2016-08-30 Thread Reynold Xin
A lot of people have been pinging me on github and email directly and expect instant reply. Just FYI I'm on vacation for two weeks with limited internet access.

Re: Saving less data to improve Pregel performance in GraphX?

2016-09-14 Thread Reynold Xin
This is definitely useful, but in reality it might be very difficult to do. On Mon, Aug 29, 2016 at 6:46 PM, Fang Zhang wrote: > Dear developers, > > I am running some tests using Pregel API. > > It seems to me that more than 90% of the volume of a graph object is >

Re: @scala.annotation.varargs or @_root_.scala.annotation.varargs?

2016-09-08 Thread Reynold Xin
Yea but the earlier email was asking why they were introduced in the first place. On Friday, September 9, 2016, Marcelo Vanzin <van...@cloudera.com> wrote: > Not after SPARK-14642, right? > > On Thu, Sep 8, 2016 at 5:07 PM, Reynold Xin <r...@databricks.com >

Re: @scala.annotation.varargs or @_root_.scala.annotation.varargs?

2016-09-08 Thread Reynold Xin
There is a package called scala. On Friday, September 9, 2016, Hyukjin Kwon wrote: > I was also actually wondering why it is being written like this. > > I actually took a look for this before and wanted to fix them but I found >

Re: UDF and native functions performance

2016-09-12 Thread Reynold Xin
Not sure if this is why but perhaps the constraint framework? On Tuesday, September 13, 2016, Mendelson, Assaf wrote: > I did, they look the same: > > > > scala> my_func.explain(true) > > == Parsed Logical Plan == > > Filter smaller#3L < 10 > > +- Project [id#0L AS

Re: Compatibility of 1.6 spark.eventLog with a 2.0 History Server

2016-09-15 Thread Reynold Xin
They should be compatible. On Thu, Sep 15, 2016 at 10:21 AM, Mario Ds Briggs wrote: > Hi, > > I would like to use a Spark 2.0 History Server instance on spark1.6 > generated eventlogs. (This is because clicking the refresh button in > browser, updates the UI with

Re: Why we get 0 when the key is null?

2016-09-15 Thread Reynold Xin
What else do you expect to get? A non-zero hash value? It can technically be any constant. On Thu, Sep 15, 2016 at 6:15 PM, WangJianfei < wangjianfe...@otcaix.iscas.ac.cn> wrote: > this func is in Partitioner > def getPartition(key: Any): Int = key match { > case null => 0 > //case
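
A Python analogue of the quoted function may make the point concrete: a null key has no hash, so the partitioner has to pick some fixed partition for it, and 0 is as good a constant as any. This is a sketch only; `get_partition` is a hypothetical name, and Python's built-in `hash` stands in for the JVM `hashCode`.

```python
def get_partition(key, num_partitions: int) -> int:
    """Map a key to a partition index in [0, num_partitions)."""
    if key is None:
        # No hash is available for a missing key; any constant would do.
        return 0
    # In Python, % with a positive modulus is never negative,
    # so negative hashes still map to a valid index.
    return hash(key) % num_partitions
```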

Re: Why Expression.deterministic method and Nondeterministic trait?

2016-09-23 Thread Reynold Xin
deterministic method describes whether this instance of the expression tree is deterministic, whereas Nondeterministic trait is about a class. On Fri, Sep 23, 2016 at 10:46 AM, Jacek Laskowski wrote: > Hi Herman, > > That helps to know that someone can explain why we've got
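
The distinction can be sketched with a toy expression tree in Python. This is an illustration and not Catalyst code: the class names mirror the ones in the thread, but the implementation is an assumption. The property is computed per tree instance from its children, while the marker class fixes the answer for every instance of that class.

```python
class Expression:
    """Toy expression node; deterministic is derived from the children."""

    def __init__(self, *children):
        self.children = list(children)

    @property
    def deterministic(self) -> bool:
        return all(child.deterministic for child in self.children)


class Nondeterministic(Expression):
    """Marker base class: every instance is non-deterministic."""

    @property
    def deterministic(self) -> bool:
        return False


class Literal(Expression):
    def __init__(self, value):
        super().__init__()
        self.value = value


class Add(Expression):
    pass


class Rand(Nondeterministic):
    pass
```

Here `Add` is a deterministic class, yet a particular `Add` instance with a `Rand` child is a non-deterministic tree.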

Re: ArrayType support in Spark SQL

2016-09-26 Thread Reynold Xin
Seems fair & easy to support. Can somebody open a JIRA ticket and patch? On Mon, Sep 26, 2016 at 9:05 AM, Takeshi Yamamuro wrote: > Hi, > > Since `Literal#default` can handle array types, it seems there is no > strong reason > for unsupporting the type in

[VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-24 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 2.0.1. The vote is open until Tue, Sep 27, 2016 at 15:30 PDT and passes if a majority of at least 3+1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.0.1 [ ] -1 Do not release this package because ... The

Re: Documentation for package ‘SparkR’ version mismatch

2016-09-24 Thread Reynold Xin
Thanks for reporting. I sent out an email for rc3 fixing the issue. We have also automated the version number update for documentation pages so this won't happen again in the future. On Fri, Sep 23, 2016 at 12:25 AM, Jagadeesan As wrote: > Hi Reyonld, > > While checking the

Re: [VOTE] Release Apache Spark 2.0.1 (RC2)

2016-09-24 Thread Reynold Xin
Hi all, The R API documentation version error was reported in a separate thread. I've built a release candidate (RC3) and will send out a new vote email in a bit. On Thu, Sep 22, 2016 at 11:01 PM, Reynold Xin <r...@databricks.com> wrote: > Please vote on releasing the following

Re: [question] Why Spark SQL grammar allows : ?

2016-09-29 Thread Reynold Xin
Is there any harm in supporting it? Mostly curious whether we really need to "fix" this. On Thu, Sep 29, 2016 at 7:22 PM, Herman van Hövell tot Westerflier < hvanhov...@databricks.com> wrote: > Tejas, > > This is because we use the same rule to parse top level and nested data > fields. For

Re: [discuss] separate API annotation into two components: InterfaceAudience & InterfaceStability

2016-10-05 Thread Reynold Xin
I think this is fairly important to do so I went ahead and created a PR for the first mini step: https://github.com/apache/spark/pull/15374 On Wed, Aug 24, 2016 at 9:48 AM, Reynold Xin <r...@databricks.com> wrote: > Looks like in general people like it. Next step is for somebod

[ANNOUNCE] Announcing Spark 2.0.1

2016-10-04 Thread Reynold Xin
We are happy to announce the availability of Spark 2.0.1! Apache Spark 2.0.1 is a maintenance release containing 300 stability and bug fixes. This release is based on the branch-2.0 maintenance branch of Spark. We strongly recommend all 2.0.0 users to upgrade to this stable release. To download

Re: [ANNOUNCE] Announcing Spark 2.0.1

2016-10-04 Thread Reynold Xin
They were published yesterday, but can take a while to propagate. On Tue, Oct 4, 2016 at 12:58 PM, Prajwal Tuladhar <p...@infynyxx.com> wrote: > Hi, > > It seems like, 2.0.1 artifact hasn't been published to Maven Central. Can > anyone confirm? > > On Tue, Oct 4,

Re: Spark Improvement Proposals

2016-10-07 Thread Reynold Xin
bject to >>> change, then why even mark them as such? >>> >>> Ideally a finished SIP should give me a checklist of things that an >>> implementation must do, and things that it doesn't need to do. >>> Contributors/committers should be seriously disco

Re: Spark Improvement Proposals

2016-10-07 Thread Reynold Xin
I called Cody last night and talked about some of the topics in his email. It became clear to me Cody genuinely cares about the project. Some of the frustrations come from the success of the project itself becoming very "hot", and it is difficult to get clarity from people who don't dedicate all

Re: PSA: JIRA resolutions and meanings

2016-10-08 Thread Reynold Xin
On Sat, Oct 8, 2016 at 2:09 AM, Sean Owen wrote: > > - Resolve as Fixed if there's a change you can point to that resolved the > issue > - If the issue is a proper subset of another issue, mark it a Duplicate of > that issue (rather than the other way around) > - If it's

Re: Spark Improvement Proposals

2016-10-07 Thread Reynold Xin
dev@. One very lightweight idea is to have a new type of > JIRA called a SIP and have a link to a filter that shows all such JIRAs > from http://spark.apache.org. I also like the idea of SIP and design doc > templates (in fact many projects have them). > > Matei > > On Oct

Re: Reading back hdfs files saved as case class

2016-10-07 Thread Reynold Xin
You can use the Dataset API -- it should solve this issue for case classes that are not very complex. On Fri, Oct 7, 2016 at 12:20 PM, Deepak Sharma wrote: > Hi > I am saving RDD[Example] in hdfs from spark program , where Example is > case class. > Now when i am trying

Re: Kafaka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Reynold Xin
Does Kafka 0.10 work on a Kafka 0.8/0.9 cluster? On Fri, Oct 7, 2016 at 1:14 PM, Jeremy Smith wrote: > +1 > > We're on CDH, and it will probably be a while before they support Kafka > 0.10. At the same time, we don't use their Spark and we're looking forward > to

Re: PSA: JIRA resolutions and meanings

2016-10-08 Thread Reynold Xin
I think so (at least I think it is socially acceptable). Of course, use good judgement here :) On Sat, Oct 8, 2016 at 12:06 PM, Cody Koeninger <c...@koeninger.org> wrote: > So to be clear, can I go clean up the Kafka cruft? > > On Sat, Oct 8, 2016 at 1:41 PM, Reynold Xin <r

Re: Monitoring system extensibility

2016-10-07 Thread Reynold Xin
They have always been private, haven't they? https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/metrics/source/Source.scala On Thu, Oct 6, 2016 at 7:38 AM, Alexander Oleynikov wrote: > Hi. > > As of v2.0.1, the traits

Re: Monitoring system extensibility

2016-10-07 Thread Reynold Xin
rought this up last year and there was a Jira raised: > https://issues.apache.org/jira/browse/SPARK-14151 > > For now I just have my SInk and Source in an o.a.s package name which is > not ideal but the only way round this. > > On Fri, 7 Oct 2016 at 08:30 Reynold Xin <r...@data

Re: Looking for a Spark-Python expert

2016-10-07 Thread Reynold Xin
Boris, Thanks for the email, but this is not a list for soliciting job applications. Please do not post any recruiting messages -- otherwise we will ban your account. On Fri, Oct 7, 2016 at 12:44 AM, Boris Lenzinger wrote: > > Hi all, > > I don't know where to post

Re: [ANNOUNCE] Announcing Spark 2.0.1

2016-10-05 Thread Reynold Xin
There is now. Thanks for the email. On Wed, Oct 5, 2016 at 12:06 PM, Michael Gummelt <mgumm...@mesosphere.io> wrote: > There seems to be no 2.0.1 tag? > > https://github.com/apache/spark/tags > > On Tue, Oct 4, 2016 at 1:23 PM, Reynold Xin <r...@databricks.com>

Re: Memory usage for spark types

2016-09-18 Thread Reynold Xin
Take a look at UnsafeArrayData and UnsafeMapData. On Sun, Sep 18, 2016 at 9:06 AM, assaf.mendelson wrote: > Hi, > > I am trying to understand how spark types are kept in memory and accessed. > > I tried to look at the code at the definition of MapType and ArrayType for

Re: [SPARK-15717][GraphX] status

2016-09-22 Thread Reynold Xin
Did you try the proposed fix? Would be good to know whether it fixes the issue. On Thu, Sep 22, 2016 at 2:49 PM, Asher Krim wrote: > Does anyone know what the status of SPARK-15717 is? It's a simple enough > looking PR, but there has been no activity on it since June 16th. >

[VOTE] Release Apache Spark 2.0.1 (RC2)

2016-09-23 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 2.0.1. The vote is open until Sunday, Sep 25, 2016 at 23:59 PDT and passes if a majority of at least 3+1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.0.1 [ ] -1 Do not release this package because ...

R docs no longer building for branch-2.0

2016-09-22 Thread Reynold Xin
I'm working on packaging 2.0.1 rc but encountered a problem: R doc fails to build. Can somebody take a look at the issue ASAP? ** knitting documentation of write.parquet ** knitting documentation of write.text ** knitting documentation of year ~/workspace/spark-release-docs/spark/R

Re: Spark 2.0.1 release?

2016-09-16 Thread Reynold Xin
2.0.1 is definitely coming soon. Was going to tag a rc yesterday but ran into some issue. I will try to do it early next week for rc. On Fri, Sep 16, 2016 at 11:16 AM, Ewan Leith wrote: > Hi all, > > Apologies if I've missed anything, but is there likely to see a

Re: Found a typo in Catalyst's exception and want to write a test -- help needed

2016-08-18 Thread Reynold Xin
I'd use the new SQLQueryTestSuite. Test cases defined in sql files. On Wed, Aug 17, 2016 at 11:46 PM, Jacek Laskowski wrote: > Hi devs, > > While reviewing the code in Catalyst for doing query parsing I found > that UnresolvedStar has this typo in the exception [1]. > > I do

Re: Mesos is now a maven module

2016-08-26 Thread Reynold Xin
This is great! On Fri, Aug 26, 2016 at 1:20 PM, Michael Gummelt wrote: > Hello devs, > > Much like YARN, Mesos has been refactored into a Maven module. So when > building, you must add "-Pmesos" to enable Mesos support. > > The pre-built distributions from Apache will

Re: Anyone else having trouble with replicated off heap RDD persistence?

2016-08-23 Thread Reynold Xin
Does this problem still exist on today's master/branch-2.0? SPARK-16550 was merged. It might be fixed already. On Tue, Aug 23, 2016 at 9:37 AM, Michael Allman wrote: > FYI, I posted this to user@ and have followed up with a bug report: >

Re: [discuss] separate API annotation into two components: InterfaceAudience & InterfaceStability

2016-08-24 Thread Reynold Xin
Looks like in general people like it. Next step is for somebody to take the lead and implement it. Tom do you have cycles to do this? On Wednesday, August 24, 2016, Tom Graves wrote: > ping, did this discussion conclude or did we decide what we are doing? > > Tom > > >

Re: Tree for SQL Query

2016-08-24 Thread Reynold Xin
It's basically the output of the explain command. On Wed, Aug 24, 2016 at 12:31 PM, Maciej Bryński wrote: > Hi, > I read this article: > https://databricks.com/blog/2015/04/13/deep-dive-into- > spark-sqls-catalyst-optimizer.html > > And I have a question. Is it possible to

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Reynold Xin
I will kick it off with my own +1. On Wed, Sep 28, 2016 at 7:14 PM, Reynold Xin <r...@databricks.com> wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and passes if a > majority of a

[VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-28 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and passes if a majority of at least 3+1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.0.1 [ ] -1 Do not release this package because ... The

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-27 Thread Reynold Xin
submitted it is a fairly significant issue. > > On Tue, Sep 27, 2016 at 1:31 PM, Reynold Xin <r...@databricks.com> wrote: > >> Actually I'm going to have to -1 the release myself. Sorry for crashing >> the party, but I saw two super critical issues discovered in the last 2

[discuss] Spark 2.x release cadence

2016-09-27 Thread Reynold Xin
We are 2 months past releasing Spark 2.0.0, an important milestone for the project. Spark 2.0.0 deviated (took 6 months) from the regular release cadence we had for the 1.x line, and we never explicitly discussed what the release cadence should look like for 2.x. Thus this email. During Spark 1.x,

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-10-02 Thread Reynold Xin
Thanks for voting. The vote has passed with the following +1 votes and no -1 votes. I will work on packaging the release. +1 Reynold Xin* Ricardo Almeida Jagadeesan As Weiqing Yang Herman van Hövell tot Westerflier Matei Zaharia* Mridul Muralidharan* Michael Armbrust* Sean Owen* Sameer Agarwal

Re: renaming "minor release" to "feature release"

2016-09-26 Thread Reynold Xin
t to move to possibly-API-breaking major releases super often, but we do >>> have lots of large features that come out all the time, and our current >>> name doesn't convey that. >>> >>> Matei >>> >>> On Jul 28, 2016, at 4:15 PM, Reyn

Re: Sliding Window Memory use

2016-09-26 Thread Reynold Xin
I ran it on Databricks community edition which was a local[8] cluster with 6GB of RAM. It ran fine. That said, looking at the plan, we can definitely simplify this quite a bit. We had a new Window physical execution node for each window expression, when we could have collapsed all of them into a

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-27 Thread Reynold Xin
, but these aren't > anything that should hold up the release. > > +1 > > On Sat, Sep 24, 2016 at 3:08 PM, Reynold Xin <r...@databricks.com> wrote: > >> Please vote on releasing the following candidate as Apache Spark version >> 2.0.1. The vote is open until Tue,

Re: Should LeafExpression have children final override (like Nondeterministic)?

2016-09-27 Thread Reynold Xin
Yes - same thing with children in UnaryExpression, BinaryExpression. Although I have to say the utility isn't that big here. On Tue, Sep 27, 2016 at 12:53 AM, Jacek Laskowski wrote: > Hi, > > Perhaps nitpicking...you've been warned. > > While reviewing expressions in Catalyst

welcoming Xiao Li as a committer

2016-10-03 Thread Reynold Xin
Hi all, Xiao Li, aka gatorsmile, has recently been elected as an Apache Spark committer. Xiao has been a super active contributor to Spark SQL. Congrats and welcome, Xiao! - Reynold

Re: Suggestion in README.md for guiding pull requests/JIRAs (probably about linking CONTRIBUTING.md or wiki)

2016-10-09 Thread Reynold Xin
Github already links to CONTRIBUTING.md. -- of course, a lot of people ignore that. One thing we can do is to add an explicit link to the wiki contributing page in the template (but note that even that introduces some overhead for every pull request). Aside from that, I am not sure if the other

Re: Suggestion in README.md for guiding pull requests/JIRAs (probably about linking CONTRIBUTING.md or wiki)

2016-10-09 Thread Reynold Xin
Actually let's move the discussion to the JIRA ticket, given there is a ticket. On Sun, Oct 9, 2016 at 5:36 PM, Reynold Xin <r...@databricks.com> wrote: > Github already links to CONTRIBUTING.md. -- of course, a lot of people > ignore that. One thing we can do is to add an e

SPARK-17845 - window function frame boundary API

2016-10-09 Thread Reynold Xin
Hi all, I tried to use the window function DataFrame API this weekend and found it awkward to use, especially with respect to specifying frame boundaries. I wrote down some options here and am curious your thoughts. If you have suggestions on the API beyond what's already listed in the JIRA

Re: This Exception has been really hard to trace

2016-10-09 Thread Reynold Xin
You should probably check with DataStax who build the Cassandra connector for Spark. On Sun, Oct 9, 2016 at 8:13 PM, kant kodali wrote: > > I tried SpanBy but look like there is a strange error that happening no > matter which way I try. Like the one here described for Java

[VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread Reynold Xin
Greetings from Spark Summit Europe at Brussels. Please vote on releasing the following candidate as Apache Spark version 2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and passes if a majority of at least 3+1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.0.2 [ ]

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-26 Thread Reynold Xin
We can do the following concrete proposal: 1. Plan to remove support for Java 7 / Scala 2.10 in Spark 2.2.0 (Mar/Apr 2017). 2. In Spark 2.1.0 release, aggressively and explicitly announce the deprecation of Java 7 / Scala 2.10 support. (a) It should appear in release notes, documentations that

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-27 Thread Reynold Xin
<ko...@tresata.com> wrote: > >> that sounds good to me >> >> On Wed, Oct 26, 2016 at 2:26 PM, Reynold Xin <r...@databricks.com> wrote: >> >> We can do the following concrete proposal: >> >> 1. Plan to remove support for Java 7 / Sca

Re: Spark Improvement Proposals

2016-11-07 Thread Reynold Xin
oard in general has less of an issue > with that, sure. As long as it is clearly announced, lasts at least > 72 hours, and has a clear outcome. > > The other points are hard to comment on without being able to see the > text in question. > > > On Mon, Nov 7, 2016 at 3:11 AM

Re: Diffing execution plans to understand an optimizer bug

2016-11-08 Thread Reynold Xin
If you want to peek into the internals and do crazy things, it is much easier to do it in Scala with df.queryExecution. For explain string output, you can work around the comparison simply by doing replaceAll("#\\d+", "#x") similar to the patch here:
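
The same normalization can be done outside the JVM once the explain strings have been collected. Below is a Python equivalent of the `replaceAll` call above, using the standard `re` module; `normalize_plan` is a hypothetical helper name.

```python
import re


def normalize_plan(plan: str) -> str:
    """Replace expression IDs such as #3 or #15 with #x so plans can be diffed."""
    return re.sub(r"#\d+", "#x", plan)
```

Two plans that differ only in their generated expression IDs then compare equal as strings.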

Re: Handling questions in the mailing lists

2016-11-06 Thread Reynold Xin
visible than it is. On Wed, Nov 2, 2016 at 10:21 AM, Reynold Xin <r...@databricks.com> wrote: > Actually after talking with more ASF members, I believe the only policy is > that development decisions have to be made and announced on ASF properties > (dev list or jira), but user

[ANNOUNCE] Announcing Apache Spark 1.6.3

2016-11-07 Thread Reynold Xin
We are happy to announce the availability of Spark 1.6.3! This maintenance release includes fixes across several areas of Spark, and we encourage users on the 1.6.x line to upgrade to 1.6.3. Head to the project's download page to download the new version: http://spark.apache.org/downloads.html
